fix: Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames (#7517) #7521

Merged

giraffacarp
Contributor

@giraffacarp giraffacarp commented Apr 15, 2025

Task

Support bytes-like objects (bytes and bytearray) in Features classes

Description

The Features classes accept only bytes objects for binary data and reject bytearray. As a result, IterableDataset.from_spark() fails on Spark DataFrames, because their rows contain bytearray objects, even though both bytes and bytearray are valid bytes-like objects in Python.
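The failure mode can be illustrated with a minimal sketch. `strict_encode` and the returned dictionary shape are hypothetical stand-ins for a type check that accepts only `bytes`; they are not taken from the library's code.

```python
def strict_encode(value):
    """Hypothetical encoder that accepts only `bytes` (assumption, for illustration)."""
    if isinstance(value, bytes):
        return {"path": None, "bytes": value}
    raise TypeError(f"unsupported type: {type(value).__name__}")


png_header = b"\x89PNG\r\n\x1a\n"

# A bytes object passes the check.
ok = strict_encode(png_header)

# The same payload as a bytearray is rejected, because bytearray
# is not a subclass of bytes.
try:
    strict_encode(bytearray(png_header))
    rejected = False
except TypeError:
    rejected = True
```

Both inputs carry identical data; only the container type differs, which is why the strict check is the problem rather than the Spark rows.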

Changes

  • Updated Features classes to accept both bytes and bytearray types for binary data fields.
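As a rough sketch of this kind of change, the type check can be widened to a tuple of accepted types. `encode_binary` is a hypothetical helper, and converting the input to immutable `bytes` is a choice made for this sketch, not something the PR description states.

```python
def encode_binary(value):
    """Hypothetical encoder that accepts both bytes and bytearray."""
    if isinstance(value, (bytes, bytearray)):
        # Normalizing to bytes is an assumption of this sketch; it copies a
        # bytearray so later mutation of the input cannot change the result.
        return {"path": None, "bytes": bytes(value)}
    raise TypeError(f"unsupported type: {type(value).__name__}")


from_bytes = encode_binary(b"abc")
from_bytearray = encode_binary(bytearray(b"abc"))
```

With this check, both calls produce the same encoded value, while unrelated types such as `str` are still rejected.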

Reasoning

  • bytes and bytearray serve the same purpose for binary data, with the only difference being mutability.
  • Modifying the Spark iterator to convert bytearray to bytes would be a workaround, not a true fix. I think the correct solution is to accept all bytes-like objects as input.
  • This approach is also more robust and future-proof, since Python 3.12+ provides a standard way to check whether an object supports the buffer protocol (collections.abc.Buffer).
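The buffer-protocol check mentioned in the last point could look like the sketch below. `is_bytes_like` is a hypothetical helper name; on Python 3.12+ it uses `collections.abc.Buffer`, and on older versions it falls back to trying to build a `memoryview`, which is an assumption of this sketch rather than something the PR describes.

```python
import sys


def is_bytes_like(obj) -> bool:
    """Return True if `obj` supports the buffer protocol (hypothetical helper)."""
    if sys.version_info >= (3, 12):
        # collections.abc.Buffer was added in Python 3.12.
        from collections.abc import Buffer

        return isinstance(obj, Buffer)
    # Fallback for older interpreters: memoryview() raises TypeError
    # for objects that do not expose a buffer.
    try:
        memoryview(obj)
    except TypeError:
        return False
    return True


# bytes and bytearray hold the same data; only bytearray can be modified in place.
data = bytearray(b"abc")
data[0] = ord("x")
```

Note that a check like this is broader than `(bytes, bytearray)`: it also accepts `memoryview` and other buffer-exposing objects, so whether to use it depends on how permissive the input handling should be.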

Testing

  • Added tests to cover bytearray inputs for image features.
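A test of this kind might be structured as in the sketch below. It runs against a hypothetical stand-in encoder (`encode_binary`) rather than the library's image feature, so it only illustrates the shape of such a test, not the tests added in this PR.

```python
import unittest


def encode_binary(value):
    """Hypothetical stand-in for the code under test."""
    if isinstance(value, (bytes, bytearray)):
        return {"path": None, "bytes": bytes(value)}
    raise TypeError(f"unsupported type: {type(value).__name__}")


class BytesLikeInputTest(unittest.TestCase):
    def test_bytes_and_bytearray_encode_identically(self):
        payload = b"\x89PNG\r\n\x1a\n"
        # Run the same assertion once per accepted input type.
        for factory in (bytes, bytearray):
            with self.subTest(input_type=factory.__name__):
                encoded = encode_binary(factory(payload))
                self.assertEqual(encoded["bytes"], payload)

    def test_non_bytes_input_is_rejected(self):
        with self.assertRaises(TypeError):
            encode_binary("not bytes")


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BytesLikeInputTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Covering `bytes` and `bytearray` in the same test makes it explicit that both input types are expected to behave identically.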

Related Issues

  • #7517: Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames

@giraffacarp
Contributor Author

@lhoestq let me know if you would prefer to change the Spark iterator so that it outputs bytes instead.

Member

@lhoestq lhoestq left a comment

lgtm !

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq lhoestq merged commit 22f62f6 into huggingface:main May 7, 2025
8 of 14 checks passed
Development

Successfully merging this pull request may close these issues.

Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames
3 participants