Use-Case: Recover original metadata from dataset
I wanted to use the dataloader to conveniently access the metadata and thus associate videos with signCLIP embeddings. However, as noted in the dataloader, Sem-Lex's metadata does not have unique video IDs. It turns out that roughly 3% of the metadata rows repeat a video ID that already appears elsewhere, and those duplicates are also included in the dataloader.
2974 duplicates in the metadata
There are 91149-88175 = 2974 repeated IDs.
# use cut to take only column 2 (video_id)
cut -d "," -f2 semlex_metadata.csv | head -n 3
video_id
uhdBQ9cLSPTCAkvOj6ko
vw74HcbvAlKFkp8et5fH
# count lines (this includes the header line)
cut -d "," -f2 semlex_metadata.csv | wc -l
91149
# count unique values (sort, then find unique lines, then count lines)
cut -d "," -f2 semlex_metadata.csv | sort | uniq | wc -l
88175
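In fact, many of these are entire repeated rows, identical in every field apart from the leading index column. Of the 91149 lines, 89498 are distinct, so 91149-89498 = 1651 lines are exact repeats; the remaining repeated IDs differ in at least one field.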
# look at columns 2-50, the first 3 lines only
cut -d "," -f2-50 semlex_metadata.csv | head -n 3
video_id,signer_id,duration,split,label_type,label,Handshape,Selected Fingers,Flexion,Flexion Change,Spread,Spread Change,Thumb Position,Thumb Contact,Sign Type,Path Movement,Repeated Movement,Major Location,Minor Location,Second Minor Location,Contact,Nondominant Handshape,Wrist Twist,SignBank Reference ID
uhdBQ9cLSPTCAkvOj6ko,42,1105.0,train,freetext,ran,,,,,,,,,,,,,,,,,,
vw74HcbvAlKFkp8et5fH,62,962.0,train,asllex,analyze,v,im,Fully Open,1.0,1.0,0.0,Closed,0.0,Symmetrical Or Alternating,Curved,1.0,Neutral,Neutral,Neutral,1.0,v,0.0,812.0
# take columns 2 through 50, sort and take unique values and count
cut -d "," -f2-50 semlex_metadata.csv | sort | uniq | wc -l
89498
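The same counts can be cross-checked with a real CSV parser, which matters if any field contains a quoted comma that `cut` would split on. Below is a minimal sketch using only the Python standard library. The function name `count_duplicates` is hypothetical, and it assumes that `video_id` is a named header column and that the first column is a row index to ignore when comparing whole rows.

```python
import csv
import io
from collections import Counter


def count_duplicates(csv_file, id_column="video_id", skip_columns=1):
    """Return (rows, repeated_ids, repeated_rows) for an open CSV file.

    The first `skip_columns` columns (assumed to be a row index) are
    ignored when comparing whole rows.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    id_pos = header.index(id_column)
    ids = Counter()
    rows = Counter()
    for row in reader:
        ids[row[id_pos]] += 1
        rows[tuple(row[skip_columns:])] += 1
    total = sum(ids.values())
    return total, total - len(ids), total - len(rows)


# Tiny made-up sample, not real Sem-Lex data.
sample = io.StringIO(
    ",video_id,signer_id,label\n"
    "0,aaa,1,ran\n"
    "1,bbb,2,analyze\n"
    "0,aaa,1,ran\n"       # same ID, identical fields
    "2,bbb,2,\"a, b\"\n"  # same ID, different label with a quoted comma
)
print(count_duplicates(sample))  # (4, 2, 1)
```

To run it on the real file, something like `with open("semlex_metadata.csv", newline="") as f: print(count_duplicates(f))` should work. Unlike the `wc -l` counts above, the header line is not counted here.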
2973 Duplicates in the dataloader?
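If I call tfds.load with with_info=True, I can get the number of examples in each split. The snippet below prints 91148, which equals the 91149 lines of the CSV minus the header line, so every duplicated row is also present in the dataset.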
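import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # importing this registers the datasets with tfds
from sign_language_datasets.datasets.config import SignDatasetConfig
# load annotations only (no pose, no video) and sum the example counts over all splits
config = SignDatasetConfig(name="only_annotations", include_pose=None, include_video=False)
dataset, info = tfds.load("SemLex", builder_kwargs=dict(config=config), with_info=True)
total_count = sum(info.splits[split].num_examples for split in info.splits)
print(total_count)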
The metadata appears to be a Pandas dataframe, with nonunique IDs
We can see that the original CSV has a column without any name, containing just numbers. That is characteristic of pandas DataFrame.to_csv(), which by default writes the index as an unnamed first column.
head -n 2 semlex_metadata.csv
,video_id,signer_id,duration,split,label_type,label,Handshape,Selected Fingers,Flexion,Flexion Change,Spread,Spread Change,Thumb Position,Thumb Contact,Sign Type,Path Movement,Repeated Movement,Major Location,Minor Location,Second Minor Location,Contact,Nondominant Handshape,Wrist Twist,SignBank Reference ID
85133,uhdBQ9cLSPTCAkvOj6ko,42,1105.0,train,freetext,ran,,,,,,,,,,,,,,,,,,
However, these index values are not unique either:
# the first few look like this
cut -d "," -f1 semlex_metadata.csv | head -n 3
85133
52435
# sort and take unique values and count
cut -d "," -f1 semlex_metadata.csv | sort | uniq | wc -l
78295
The combination of pandas id and video_id IS unique
# take only the pandas id and the video id, then count unique values
cut -d "," -f1,2 semlex_metadata.csv | sort | uniq | wc -l
91149
Suggestions:
Combine the pandas ID and the video ID so that the exact items in the CSV can be recovered from the dataset
Deduplicate?
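Both suggestions can be sketched together: keep the pandas index alongside the video ID as a composite key, and drop rows whose remaining fields exactly repeat an earlier row. This is only an illustration using the standard library. The function `dedupe_rows` and the `index:video_id` key format are hypothetical and are not part of the existing dataloader.

```python
import csv
import io


def dedupe_rows(csv_file, id_column="video_id"):
    """Yield (composite_key, record) for the first occurrence of each distinct row.

    Assumes the first column is the unnamed pandas index. Rows are compared
    on every field except that index.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    id_pos = header.index(id_column)
    seen = set()
    for row in reader:
        fields = tuple(row[1:])
        if fields in seen:
            continue  # exact repeat of an earlier row
        seen.add(fields)
        key = f"{row[0]}:{row[id_pos]}"  # hypothetical key format
        yield key, dict(zip(header[1:], fields))


# Tiny made-up sample, not real Sem-Lex data.
sample = io.StringIO(
    ",video_id,label\n"
    "5,aaa,ran\n"
    "9,aaa,ran\n"
    "9,bbb,analyze\n"
)
for key, record in dedupe_rows(sample):
    print(key, record["label"])
# 5:aaa ran
# 9:bbb analyze
```

Rows that share a video ID but differ in some other field are deliberately kept here, since it is not clear from the CSV alone which of them is correct.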
One thing to confirm is whether the .npy files themselves are redundant.
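One way to check this is to hash every file and group the files whose contents are byte-for-byte identical. The sketch below assumes the .npy files sit together in a single directory, which may not match the actual layout. `group_identical_files` is a hypothetical helper, and the demo uses throwaway files rather than real Sem-Lex data.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def group_identical_files(directory, pattern="*.npy"):
    """Map SHA-256 digest -> file names, keeping only digests shared by 2+ files."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    return {d: names for d, names in groups.items() if len(names) > 1}


# Demo on temporary placeholder files (arbitrary bytes, not valid arrays).
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in [("a.npy", b"same"), ("b.npy", b"same"), ("c.npy", b"other")]:
        (Path(tmp) / name).write_bytes(payload)
    print(list(group_identical_files(tmp).values()))  # [['a.npy', 'b.npy']]
```

This only detects byte-identical files. Two files holding the same array values but serialized differently would not be grouped, so an empty result does not rule out redundancy.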
cleong110 changed the title from "Sem-Lex has duplicate metadata: deduplicate or include additional column" to "Sem-Lex has duplicate metadata: deduplicate or include additional column?" on Dec 5, 2024