Sem-Lex has duplicate metadata: deduplicate or include additional column? #83

Open
cleong110 opened this issue Dec 5, 2024 · 0 comments

cleong110 commented Dec 5, 2024

Use-Case: Recover original metadata from dataset

I wanted to use the dataloader to conveniently access the metadata and thus associate videos with signCLIP embeddings. However, as noted in the dataloader, Sem-Lex's metadata does not have unique video IDs. It turns out that part of the metadata is actually duplicated, and those duplicates are also included in the dataloader.

2974 duplicates in the metadata

There are 91149-88175 = 2974 repeated IDs.

# use cut to take only column 2 (video_id)
cut -d "," semlex_metadata.csv -f2|head -n 3
video_id
uhdBQ9cLSPTCAkvOj6ko
vw74HcbvAlKFkp8et5fH

# count values
cut -d "," semlex_metadata.csv -f2|wc -l
91149

# count unique values (sort, then find unique lines, then count lines)
cut -d "," semlex_metadata.csv -f2|sort|uniq|wc -l
88175

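In fact, entire rows are repeated, even when every field except the unnamed first column is compared: the count below gives 89498 unique lines out of 91149, so 91149 - 89498 = 1651 lines are exact repeats.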

# look at columns 2-50, the first 3 lines only
cut -d "," semlex_metadata.csv -f2-50|head -n 3
video_id,signer_id,duration,split,label_type,label,Handshape,Selected Fingers,Flexion,Flexion Change,Spread,Spread Change,Thumb Position,Thumb Contact,Sign Type,Path Movement,Repeated Movement,Major Location,Minor Location,Second Minor Location,Contact,Nondominant Handshape,Wrist Twist,SignBank Reference ID
uhdBQ9cLSPTCAkvOj6ko,42,1105.0,train,freetext,ran,,,,,,,,,,,,,,,,,,
vw74HcbvAlKFkp8et5fH,62,962.0,train,asllex,analyze,v,im,Fully Open,1.0,1.0,0.0,Closed,0.0,Symmetrical Or Alternating,Curved,1.0,Neutral,Neutral,Neutral,1.0,v,0.0,812.0

# take columns 2 through 50, sort and take unique values and count
cut -d "," semlex_metadata.csv -f2-50|sort|uniq|wc -l
89498
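
For anyone who prefers to check this from Python, here is a minimal sketch of the same counts using only the standard library. The function name `duplicate_summary` and the sample rows are made up for illustration; only the column layout (unnamed first column, then `video_id`) comes from the CSV header shown above. Unlike `wc -l`, it skips the header line, so on the real file the row count should presumably come out one lower than 91149.

```python
import csv
import io
from collections import Counter


def duplicate_summary(csv_file, id_column="video_id"):
    """Count data rows, unique IDs, and unique rows (ignoring the first column)."""
    reader = csv.reader(csv_file)
    header = next(reader)  # header line is not counted as a data row
    id_pos = header.index(id_column)
    id_counts = Counter()
    row_counts = Counter()
    for row in reader:
        id_counts[row[id_pos]] += 1
        row_counts[tuple(row[1:])] += 1  # drop the unnamed pandas index column
    total = sum(id_counts.values())
    return {
        "rows": total,
        "unique_ids": len(id_counts),
        "repeated_ids": total - len(id_counts),
        "unique_rows": len(row_counts),
        "repeated_rows": total - len(row_counts),
    }


# Tiny made-up sample (not real Sem-Lex rows) with the same leading columns.
SAMPLE = """,video_id,signer_id,label
0,aaa,1,ran
1,bbb,2,analyze
0,aaa,1,ran
2,bbb,3,analyze
"""

summary = duplicate_summary(io.StringIO(SAMPLE))
print(summary)
```

To run it on the real file, something like `with open("semlex_metadata.csv", newline="") as f: print(duplicate_summary(f))` should work, assuming the file is in the current directory.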

2973 Duplicates in the dataloader?

If I call tfds.load with with_info=True, I can get the item counts. The snippet below prints 91148.

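import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers the datasets with tfds
from sign_language_datasets.datasets.config import SignDatasetConfig

config = SignDatasetConfig(name="only_annotations", include_pose=None, include_video=False)
dataset, info = tfds.load("SemLex", builder_kwargs=dict(config=config), with_info=True)
total_count = sum(info.splits[split].num_examples for split in info.splits)
print(total_count)  # 91148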

The metadata appears to be a Pandas dataframe, with non-unique IDs

We can see that the original has a column without any name, just numbers. That is characteristic of pandas' DataFrame.to_csv(), which writes the index as an unnamed first column by default.

head -n 2 semlex_metadata.csv 
,video_id,signer_id,duration,split,label_type,label,Handshape,Selected Fingers,Flexion,Flexion Change,Spread,Spread Change,Thumb Position,Thumb Contact,Sign Type,Path Movement,Repeated Movement,Major Location,Minor Location,Second Minor Location,Contact,Nondominant Handshape,Wrist Twist,SignBank Reference ID
85133,uhdBQ9cLSPTCAkvOj6ko,42,1105.0,train,freetext,ran,,,,,,,,,,,,,,,,,,

However, these index values are not unique either:

# the first few look like this
cut -d "," semlex_metadata.csv -f1|head -n 3

85133
52435

# sort and take unique values and count
cut -d "," semlex_metadata.csv -f1|sort|uniq|wc -l
78295

The combination of pandas id and video_id IS unique

# take only the pandas id and the video id, then count unique values
cut -d "," semlex_metadata.csv -f1,2|sort|uniq|wc -l
91149
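
The same uniqueness checks can be sketched in Python. This is only an illustration: `load_rows`, `is_unique_key`, and the column name `pandas_index` (a stand-in for the unnamed first column) are hypothetical, and the sample rows are invented to mimic the pattern seen above, where neither column is unique alone but the pair is.

```python
import csv
import io


def load_rows(csv_file, index_name="pandas_index"):
    """Read a CSV into a list of dicts, naming the unnamed first column."""
    reader = csv.reader(csv_file)
    header = next(reader)
    header[0] = header[0] or index_name  # hypothetical name for the index column
    return [dict(zip(header, row)) for row in reader]


def is_unique_key(rows, key_columns):
    """Return True if no two rows share the same values in key_columns."""
    seen = set()
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key in seen:
            return False
        seen.add(key)
    return True


# Made-up sample: index repeats, video_id repeats, but the pair never does.
SAMPLE = """,video_id,label
0,aaa,ran
1,bbb,analyze
0,bbb,analyze
1,aaa,ran
"""

rows = load_rows(io.StringIO(SAMPLE))
index_unique = is_unique_key(rows, ["pandas_index"])
video_id_unique = is_unique_key(rows, ["video_id"])
pair_unique = is_unique_key(rows, ["pandas_index", "video_id"])
print(index_unique, video_id_unique, pair_unique)
```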

Suggestions:

  • Combine the pandas ID and the video ID so that the exact items in the CSV can be recovered from the dataset
  • Deduplicate?
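
To make the two options concrete, here is a rough sketch of each on plain dicts. Nothing here is the dataloader's actual code: the field names `pandas_index` and `composite_id`, the `_` separator, and the keep-first policy in `deduplicate` are all assumptions chosen for illustration.

```python
def add_composite_id(rows, index_column="pandas_index", id_column="video_id"):
    """Option 1: attach a key that maps each example back to one CSV line."""
    return [
        {**row, "composite_id": f"{row[index_column]}_{row[id_column]}"}
        for row in rows
    ]


def deduplicate(rows, ignore=("pandas_index",)):
    """Option 2: keep the first of any rows that match on all non-ignored fields."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple((k, v) for k, v in row.items() if k not in ignore)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Made-up rows, not real Sem-Lex metadata.
rows = [
    {"pandas_index": "5", "video_id": "aaa", "label": "ran"},
    {"pandas_index": "9", "video_id": "aaa", "label": "ran"},
    {"pandas_index": "5", "video_id": "bbb", "label": "analyze"},
]

with_ids = add_composite_id(rows)
deduped = deduplicate(rows)
print([row["composite_id"] for row in with_ids])
print(len(deduped))
```

The first option keeps every CSV line recoverable, while the second loses the information about which index values pointed at the same content, so the choice depends on whether the repeats carry any meaning.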

One thing to confirm is whether the .npy files themselves are redundant.
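
One possible way to check that, sketched with the standard library only: group files by a hash of their bytes and report any group with more than one member. The function name `group_identical_files` and the directory layout are assumptions, and the demo writes placeholder bytes rather than real pose arrays. Note that this only detects byte-identical files; two files holding equal arrays but saved differently would not be caught.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def group_identical_files(directory, pattern="*.npy"):
    """Return lists of file names under `directory` whose bytes are identical."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).rglob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    return [names for names in groups.values() if len(names) > 1]


# Demo on a temporary directory with placeholder contents (not real .npy data).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.npy").write_bytes(b"same-bytes")
    (Path(tmp) / "b.npy").write_bytes(b"same-bytes")
    (Path(tmp) / "c.npy").write_bytes(b"other-bytes")
    duplicates = group_identical_files(tmp)

print(duplicates)
```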
