Sem-Lex has duplicate metadata: deduplicate or include additional column? #83

cleong110 · 2024-12-05T21:14:54Z

Use-Case: Recover original metadata from dataset

I wanted to use the dataloader to conveniently access the metadata and thus associate videos with SignCLIP embeddings. However, as noted in the dataloader, Sem-Lex's metadata does not have unique video IDs. It turns out that roughly 3% of the metadata rows are duplicates, and those duplicates are also included in the dataloader.

2974 duplicates in the metadata

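There are 91149 - 88175 = 2974 rows whose video_id already appears on an earlier row. The header line is counted in both totals, so it cancels out.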

# use cut to take only column 2 (video_id)
cut -d "," semlex_metadata.csv -f2|head -n 3
video_id
uhdBQ9cLSPTCAkvOj6ko
vw74HcbvAlKFkp8et5fH

# count values
cut -d "," semlex_metadata.csv -f2|wc -l
91149

# count unique values (sort, then find unique lines, then count lines)
cut -d "," semlex_metadata.csv -f2|sort|uniq|wc -l
88175

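In fact, entire rows are repeated, not just IDs. Comparing every field except the unnamed first column gives 89498 unique lines, so 91149 - 89498 = 1651 rows are exact copies of an earlier row; the remaining 2974 - 1651 = 1323 repeated IDs differ from their earlier occurrence in at least one other field.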

# look at columns 2-50, the first 3 lines only
cut -d "," semlex_metadata.csv -f2-50|head -n 3
video_id,signer_id,duration,split,label_type,label,Handshape,Selected Fingers,Flexion,Flexion Change,Spread,Spread Change,Thumb Position,Thumb Contact,Sign Type,Path Movement,Repeated Movement,Major Location,Minor Location,Second Minor Location,Contact,Nondominant Handshape,Wrist Twist,SignBank Reference ID
uhdBQ9cLSPTCAkvOj6ko,42,1105.0,train,freetext,ran,,,,,,,,,,,,,,,,,,
vw74HcbvAlKFkp8et5fH,62,962.0,train,asllex,analyze,v,im,Fully Open,1.0,1.0,0.0,Closed,0.0,Symmetrical Or Alternating,Curved,1.0,Neutral,Neutral,Neutral,1.0,v,0.0,812.0

# take columns 2 through 50, sort and take unique values and count
cut -d "," semlex_metadata.csv -f2-50|sort|uniq|wc -l
89498
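The same two counts can be cross-checked in pandas, which parses quoted fields instead of splitting on every comma. This is only a minimal sketch: it runs on a tiny inline CSV with made-up rows standing in for semlex_metadata.csv, and the helper name duplicate_counts is hypothetical. On the real file one would pass the path to the function instead of the StringIO object.

```python
import io

import pandas as pd

# Tiny stand-in for semlex_metadata.csv (made-up rows, same layout:
# unnamed pandas index column first, then video_id and other fields).
SAMPLE_CSV = """,video_id,signer_id,label
0,aaa,42,ran
1,bbb,62,analyze
0,aaa,42,ran
2,bbb,62,analyse
"""


def duplicate_counts(csv_source):
    """Return (rows repeating an earlier video_id, rows repeating an earlier full row).

    The unnamed first column is read as the index, so it is ignored
    when whole rows are compared.
    """
    df = pd.read_csv(csv_source, index_col=0)
    repeated_ids = int(df["video_id"].duplicated().sum())
    repeated_rows = int(df.duplicated().sum())
    return repeated_ids, repeated_rows


repeated_ids, repeated_rows = duplicate_counts(io.StringIO(SAMPLE_CSV))
print(repeated_ids, repeated_rows)
```

In the sample, two rows reuse an earlier video_id but only one of them is an exact copy, which mirrors the gap between the two shell counts above.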

2974 duplicates in the dataloader?

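Calling tfds.load with with_info=True gives the number of examples per split. The snippet below prints 91148, one example per data row of the CSV (91149 lines minus the header), so the dataloader appears to include the duplicates.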

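import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers the datasets with tfds
from sign_language_datasets.datasets.config import SignDatasetConfig
config = SignDatasetConfig(name="only_annotations", include_pose=None, include_video=False)
dataset, info = tfds.load("SemLex", builder_kwargs=dict(config=config), with_info=True)
# sum the example counts over all splits
total_count = sum(info.splits[split].num_examples for split in info.splits)
print(total_count)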
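The total above only shows that nothing was dropped. To count repeats in the loaded examples directly, one could tally the IDs while iterating. The sketch below is generic and makes assumptions: it takes any iterable of dict-like examples, and the feature name "id" is a guess that has not been checked against the SemLex builder. With tfds, the iterable could be something like tfds.as_numpy(dataset["train"]).

```python
from collections import Counter


def count_repeated_ids(examples, id_key="id"):
    """Count examples whose ID was already seen earlier in the iterable.

    `id_key` is an assumed feature name, not confirmed for SemLex.
    """
    counts = Counter(example[id_key] for example in examples)
    return sum(n - 1 for n in counts.values() if n > 1)


# Made-up examples; tfds.as_numpy yields string features as bytes.
fake_examples = [{"id": b"aaa"}, {"id": b"bbb"}, {"id": b"aaa"}]
print(count_repeated_ids(fake_examples))
```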

The metadata appears to be a pandas DataFrame export, with a nonunique index

We can see that the original has a column without any name, containing just numbers. That is characteristic of pandas DataFrame.to_csv(), which by default writes the index as an unnamed first column.

head -n 2 semlex_metadata.csv 
,video_id,signer_id,duration,split,label_type,label,Handshape,Selected Fingers,Flexion,Flexion Change,Spread,Spread Change,Thumb Position,Thumb Contact,Sign Type,Path Movement,Repeated Movement,Major Location,Minor Location,Second Minor Location,Contact,Nondominant Handshape,Wrist Twist,SignBank Reference ID
85133,uhdBQ9cLSPTCAkvOj6ko,42,1105.0,train,freetext,ran,,,,,,,,,,,,,,,,,,

However, these index values are not unique either:

# the first few look like this
cut -d "," semlex_metadata.csv -f1|head -n 3

85133
52435

# sort and take unique values and count
cut -d "," semlex_metadata.csv -f1|sort|uniq|wc -l
78295
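One plausible explanation for both the unnamed column and its repeated values is that several DataFrames were concatenated without resetting the index before export. This is an assumption about how the file was produced, not something confirmed by the dataset authors. The sketch below reproduces the effect with made-up frames.

```python
import pandas as pd

# Two made-up frames, each with its own default 0..n-1 index.
part_a = pd.DataFrame({"video_id": ["aaa", "bbb"]})
part_b = pd.DataFrame({"video_id": ["ccc", "ddd"]})

# Concatenating without ignore_index=True keeps both original indexes,
# so the index labels 0 and 1 each appear twice.
combined = pd.concat([part_a, part_b])

# to_csv writes the index as a first column with an empty header by default.
csv_text = combined.to_csv()
print(csv_text)
```

The printed header starts with a bare comma, like the header of semlex_metadata.csv, and the first column repeats its values.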

The combination of pandas id and video_id IS unique

# take only the pandas id and the video id, then count unique values
cut -d "," semlex_metadata.csv -f1,2|sort|uniq|wc -l
91149
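The same check can be written in pandas. A minimal sketch on made-up rows: read_csv names an empty header "Unnamed: 0", and the sketch renames it to pandas_id, a hypothetical name chosen here for readability. The helper key_is_unique is likewise illustrative.

```python
import io

import pandas as pd

# Made-up rows: neither column is unique alone, but each pair is.
SAMPLE_CSV = """,video_id,label
0,aaa,ran
1,bbb,analyze
0,bbb,analyze
1,aaa,ran
"""


def key_is_unique(df, key_columns):
    """True if no two rows share the same values in all of key_columns."""
    return not df.duplicated(subset=key_columns).any()


df = pd.read_csv(io.StringIO(SAMPLE_CSV)).rename(columns={"Unnamed: 0": "pandas_id"})

print(key_is_unique(df, ["pandas_id"]))
print(key_is_unique(df, ["video_id"]))
print(key_is_unique(df, ["pandas_id", "video_id"]))
```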

Suggestions:

Combine the pandas ID and the video ID so that the exact items in the CSV can be recovered from the dataset
Deduplicate?
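Both suggestions can be sketched in pandas. This is illustrative only and not a proposal for the exact dataloader change: the column names pandas_id and combined_id, the "_" separator, and the helper names are all hypothetical, and the rows are made up.

```python
import io

import pandas as pd

# Made-up rows in the layout of semlex_metadata.csv.
SAMPLE_CSV = """,video_id,signer_id,label
85133,aaa,42,ran
52435,bbb,62,analyze
7,aaa,42,ran
"""


def add_combined_id(df):
    """Option 1: add an ID built from the pandas index column and video_id."""
    out = df.rename(columns={"Unnamed: 0": "pandas_id"})
    out["combined_id"] = out["pandas_id"].astype(str) + "_" + out["video_id"]
    return out


def deduplicate(df):
    """Option 2: keep the first of any rows that match in every content column."""
    content_columns = [c for c in df.columns if c not in ("pandas_id", "combined_id")]
    return df.drop_duplicates(subset=content_columns, keep="first")


with_ids = add_combined_id(pd.read_csv(io.StringIO(SAMPLE_CSV)))
print(with_ids["combined_id"].tolist())
print(len(deduplicate(with_ids)))
```

The first option keeps every CSV row recoverable from the dataset. The second drops only exact copies, so rows that share a video_id but differ in some other field would still remain.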

One thing to confirm is whether the .npy files themselves are redundant.
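One way to check this is to hash each file and group files with identical contents. The sketch below assumes the .npy files sit in a single directory. The function name and the demo file names are hypothetical, and the demo writes placeholder bytes rather than real pose arrays.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def group_identical_files(directory, pattern="*.npy"):
    """Map content hash -> file names, keeping only hashes shared by 2+ files."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    return {digest: names for digest, names in groups.items() if len(names) > 1}


# Demo on a throwaway directory with made-up file names and contents.
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in [("aaa.npy", b"pose-1"), ("bbb.npy", b"pose-2"), ("ccc.npy", b"pose-1")]:
        (Path(tmp) / name).write_bytes(payload)
    demo_groups = group_identical_files(tmp)

print(sorted(demo_groups.values()))
```

Byte-identical files are certainly redundant. Files that hold equal arrays but differ in their bytes would need a comparison of the loaded arrays instead.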
