
Commit f49a163

polinaeternal authored, with co-authors lhoestq, mariosasko, and albertvillanova
Support for multiple configs in packaged modules via metadata yaml info (#5331)
* load configs parameters from metadata
* add dynamical builder creation
* merge configs kwargs and dataset info in hub without script module
* fix loading from configs for packaged module
* update push to hub to include configs and update info in metadata yaml correctly
* add class to read and write configs from metadata
* add configs to Dataset.push_to_hub
* refactor get_module of local ds
* add test for push_to_hub with multiple configs
* get readme from data_files in local packaged ds (probably not needed)
* change cache dirs names to include dataset name
* set configs on instance instead of dynamically creating new builder class
* add test for loading with different configs from data_dir
* modify config names tests, modify tests for one config local factory
* fix pickling and update metadata methods to convert to/from builders configs
* update get config names to load configs from meta, refactor import of builder cls
* more tests for local and hub factories
* change builder name of parametrized builders, fix inspect_metric
* fix default configs names in inspect.py tests, change parametrized builder.name in init instead of setting additional attribute
* fix docstrings in push_to_hub
* get back parametrized builder name as an attr because it's used to set info.builder_name in parent's init
* add test for loading parametrized dataset with num_proc>1
* fix writing 'data' dir for default config in push_to_hub
* fix custom splits in push_to_hub (change data dir structure for custom configs)
* pass only existing params to builder configs
* fix test for push to hub with configs, test default config too
* fix dataset_json reading in get_module, add tests for local and packaged factories
* update dataset_infos.json for Dataset.push_to_hub()
* add dataset_name attr to builder class to differentiate between packaged builder names
* use builder.dataset_name everywhere in filenames and dirs, add it to tests
* use datasets.asdict when parsing configs from BuilderConfig objects instead of custom func
* resolve data_files for all metadata configs in order not to pass config_kwargs to builder in local modules (ugly); fix some outdated var names
* get data files for metadata configs
* pass 'default' config_name for packaged modules without config since it cannot be None
* move converting metadata to configs out of configuring function to fix pickling issues
* update hash of packaged builders with custom config
* simplify update_hash
* add test for custom split names in custom configs dataset with .push_to_hub
* rename METADATA_CONFIGS_FIELD from configs_kwargs to builder_config
* simplify metadata loading, some renames
* update tests to reflect change of metadata configs field name
* refactor data files resolving for metadata configs, make them methods of MetadataConfigs class
* add tests for resolving data files in metadata configs
* update hash for packaged modules with configs in load instead of builder
* revert moving finding patterns and resolving data files into a separate func
* don't raise error in packaged factory
* extend sanitize_patterns for data_files from yaml
* disallow pushing metadata with a dict data_files
* update Dataset.push_to_hub
* update DatasetDict.push_to_hub
* remove redundant code
* minor comment
* error for bad data_files, and warn for bad params
* add MetadataConfigs.get_default_config_name
* error in sanitize_patterns on bad data_files
* check default config name in get_dataset_builder_class
* test push_to_hub when no metadata configs
* fix ignored_params check
* remove configs support from PackagedDatasetModuleFactory
* remove it from tests
* fix missed parameter for reduce
* add integration tests for loading
* fix regex for parquet filenames
* fix metadata configs creation: put all splits in yaml, not only the last one
* fix tests for push_to_hub
* roll back push_to_hub_without_meta pattern string
* escape/replace some special characters in pattern in string_to_dict
* fix: escape '*' in string_to_dict again
* fix: pattern in tests for backward compatibility in push to hub
* join Quentin's tests and mine (lots of copy-paste but more checks)
* remove '-' from sharded parquet pattern
* separate DataFilesDict and MetadataConfigs
* set default config when there is only one dataset_info
* fix: pass config name to resolve_data_files_locally
* fix: tests for local module without script
* fix: default config=None when creating a builder
* cache backward compat
* fix legacy cache path creation for local datasets
* fix of fix of legacy cache path for local datasets
* fix dataset_name creation (make it not None only for packaged modules)
* remove custom reduce and add check if dynamical builder class is picklable
* test if builder is picklable with 'pickle', not 'dill'
* get back custom reduce to enable pickle serialization
* fix test for pickle: pickle instance, not class
* remove get_builder_config method from metadata class
* fix: pass config_kwargs as arguments to builder class
* get dataset_name in get_module()
* wrap yaml error message in metadata
* move glob->regex to a func, add ref to fsspec
* implement DataFilesList additions
* get back all data_files resolving logic to load.py
* move inferring modules for all splits to a func
* refactor data files resolving and creation of builder configs from metadata
* rename metadata configs field: builder_config -> configs
* make error message in sanitize_patterns yaml-agnostic
* improve error message in yaml validation
* fix yaml data files validation (raise error)
* move test datasets to datasets-maintainers repo
* fix yaml validation
* change yaml format to only allow lists and have a required config_name field
* rename yaml field: builder_configs -> configs since https://github.com/huggingface/moon-landing/pull/6490 is deployed
* update dataset ids in tests
* rename to dataset_card_data
* remove glob_pattern_to_regex from string_to_dict, use it only where we pass glob pattern
* group dataset module fields (related to configs construction)
* don't instantiate BASE_FEATURE
* update docs
* raise if data files resolving fails during metadata config resolving, because otherwise the error says that there are no data files in the repository, which is misleading

---------

Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Mario Šaško <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Albert Villanova del Moral <[email protected]>
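For illustration, a minimal sketch of the workflow this change enables, assuming the `config_name` argument of `push_to_hub` introduced here and a hypothetical repository `my_username/my_dataset` (pushing requires being logged in to the Hub):

```python
from datasets import Dataset, load_dataset

# Two hypothetical subsets that should live in the same repository
# as separate configurations.
ds_en = Dataset.from_dict({"text": ["hello", "world"]})
ds_fr = Dataset.from_dict({"text": ["bonjour", "monde"]})

# Pushing each subset under its own config name records a `configs`
# entry in the YAML header of the repository's README.md.
ds_en.push_to_hub("my_username/my_dataset", config_name="en")
ds_fr.push_to_hub("my_username/my_dataset", config_name="fr")

# The configs can then be loaded by name, without any loading script.
ds = load_dataset("my_username/my_dataset", "fr")
```

The same `configs` field can also be written by hand in the dataset card's YAML to map config names to `data_files` patterns, which is what the metadata-parsing code added by this commit reads.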
1 parent 67ac60b commit f49a163

31 files changed (+2074 −418 lines)

ADD_NEW_DATASET.md

+2 −2

@@ -4,5 +4,5 @@ Add datasets directly to the 🤗 Hugging Face Hub!

 You can share your dataset on https://huggingface.co/datasets directly using your account, see the documentation:

-* [Create a dataset and upload files](https://huggingface.co/docs/datasets/upload_dataset)
-* [Advanced guide using dataset scripts](https://huggingface.co/docs/datasets/share)
+* [Create a dataset and upload files on the website](https://huggingface.co/docs/datasets/upload_dataset)
+* [Advanced guide using the CLI](https://huggingface.co/docs/datasets/share)

CONTRIBUTING.md

+2 −2

@@ -95,8 +95,8 @@ Note that if any files were formatted by `pre-commit` hooks during committing, y

 You can share your dataset on https://huggingface.co/datasets directly using your account, see the documentation:

-* [Create a dataset and upload files](https://huggingface.co/docs/datasets/upload_dataset)
-* [Advanced guide using dataset scripts](https://huggingface.co/docs/datasets/share)
+* [Create a dataset and upload files on the website](https://huggingface.co/docs/datasets/upload_dataset)
+* [Advanced guide using the CLI](https://huggingface.co/docs/datasets/share)

 ## How to contribute to the dataset cards

docs/source/_toctree.yml

+2 −2

@@ -88,12 +88,12 @@
 - sections:
   - local: share
     title: Share
-  - local: dataset_script
-    title: Create a dataset loading script
   - local: dataset_card
     title: Create a dataset card
   - local: repository_structure
     title: Structure your repository
+  - local: dataset_script
+    title: Create a dataset loading script
   title: "Dataset repository"
  title: "How-to guides"
 - sections:

docs/source/about_dataset_load.mdx

+19 −13

@@ -9,22 +9,35 @@ Let's begin with a basic Explain Like I'm Five.
 A dataset is a directory that contains:

 - Some data files in generic formats (JSON, CSV, Parquet, text, etc.)
-- An optional dataset script if it requires some code to read the data files. This is used to load files of all formats and structures.
+- A dataset card named `README.md` that contains documentation about the dataset as well as a YAML header to define the dataset's tags and configurations
+- An optional dataset script if it requires some code to read the data files. This is sometimes used to load files of specific formats and structures.

 The [`load_dataset`] function fetches the requested dataset locally or from the Hugging Face Hub.
 The Hub is a central repository where all the Hugging Face datasets and models are stored.

 If the dataset only contains data files, then [`load_dataset`] automatically infers how to load the data files from their extensions (json, csv, parquet, txt, etc.).
+Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based on the data files format. There is one builder per data file format in 🤗 Datasets:
+
+* [`datasets.packaged_modules.text.Text`] for text
+* [`datasets.packaged_modules.csv.Csv`] for CSV and TSV
+* [`datasets.packaged_modules.json.Json`] for JSON and JSONL
+* [`datasets.packaged_modules.parquet.Parquet`] for Parquet
+* [`datasets.packaged_modules.arrow.Arrow`] for Arrow (streaming file format)
+* [`datasets.packaged_modules.sql.Sql`] for SQL databases
+* [`datasets.packaged_modules.imagefolder.ImageFolder`] for image folders
+* [`datasets.packaged_modules.audiofolder.AudioFolder`] for audio folders
+
 If the dataset has a dataset script, then it downloads and imports it from the Hugging Face Hub.
-Code in the dataset script defines the dataset information (description, features, URL to the original files, etc.), and tells 🤗 Datasets how to generate and display examples from it.
+Code in the dataset script defines a custom [`DatasetBuilder`] with the dataset information (description, features, URL to the original files, etc.), and tells 🤗 Datasets how to generate and display examples from it.

 <Tip>

-Read the [Share](./share) section to learn more about how to share a dataset. This section also provides a step-by-step guide on how to write your own dataset loading script!
+Read the [Share](./upload_dataset) section to learn more about how to share a dataset. This section also provides a step-by-step guide on how to write your own dataset loading script!

 </Tip>

-The dataset script downloads the dataset files from the original URL, generates the dataset and caches it in an Arrow table on your drive. If you've downloaded the dataset before, then 🤗 Datasets will reload it from the cache to save you the trouble of downloading it again.
+🤗 Datasets downloads the dataset files from the original URL, generates the dataset and caches it in an Arrow table on your drive.
+If you've downloaded the dataset before, then 🤗 Datasets will reload it from the cache to save you the trouble of downloading it again.

 Now that you have a high-level understanding about how datasets are built, let's take a closer look at the nuts and bolts of how all this works.
@@ -83,21 +96,14 @@ There are three main methods in [`DatasetBuilder`]:

 The dataset is generated with a Python generator, which doesn't load all the data in memory. As a result, the generator can handle large datasets. However, before the generated samples are flushed to the dataset file on disk, they are stored in an `ArrowWriter` buffer. This means the generated samples are written by batch. If your dataset samples consumes a lot of memory (images or videos), then make sure to specify a low value for the `DEFAULT_WRITER_BATCH_SIZE` attribute in [`DatasetBuilder`]. We recommend not exceeding a size of 200 MB.

-## Without loading scripts
-
-As a user, you want to be able to quickly use a dataset. Implementing a dataset loading script can sometimes get in the way, or it may be a barrier for some people without a developer background. 🤗 Datasets removes this barrier by making it possible to load any dataset from the Hub without a dataset loading script. All a user has to do is upload the data files (see [upload_dataset_repo](#upload_dataset_repo) for a list of supported file formats) to a dataset repository on the Hub, and they will be able to load that dataset without having to create a loading script. This doesn't mean we are moving away from loading scripts because they still offer the most flexibility in controlling how a dataset is generated.
-
-The loading script-free method uses the [huggingface_hub](https://github.com/huggingface/huggingface_hub) library to list the files in a dataset repository. You can also provide a path to a local directory instead of a repository name, in which case 🤗 Datasets will use [glob](https://docs.python.org/3/library/glob) instead. Depending on the format of the data files available, one of the data file builders will create your dataset for you. If you have a CSV file, the CSV builder will be used and if you have a Parquet file, the Parquet builder will be used. The drawback of this approach is it's not possible to simultaneously load a CSV and JSON file. You will need to load the two file types separately, and then concatenate them.
-
 ## Maintaining integrity

 To ensure a dataset is complete, [`load_dataset`] will perform a series of tests on the downloaded files to make sure everything is there. This way, you don't encounter any surprises when your requested dataset doesn't get generated as expected. [`load_dataset`] verifies:

-- The list of downloaded files.
-- The number of bytes of the downloaded files.
-- The SHA256 checksums of the downloaded files.
 - The number of splits in the generated `DatasetDict`.
 - The number of samples in each split of the generated `DatasetDict`.
+- The list of downloaded files.
+- The SHA256 checksums of the downloaded files (disabled by default).

 If the dataset doesn't pass the verifications, it is likely that the original host of the dataset made some changes in the data files.
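A rough sketch of the loading behaviour described in this page (the paths are hypothetical, and `verification_mode` is assumed to be the argument that controls the optional checks):

```python
from datasets import load_dataset

# A hypothetical local directory containing train.csv and test.csv:
# the format is inferred from the file extensions and the packaged Csv builder
# (datasets.packaged_modules.csv.Csv) is used under the hood.
ds = load_dataset("path/to/data_dir")

# The builder can also be named explicitly, and the checksum verification
# (disabled by default) can be requested.
ds = load_dataset(
    "csv",
    data_files={"train": "path/to/data_dir/train.csv"},
    verification_mode="all_checks",
)
```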

docs/source/dataset_card.mdx

+2

@@ -25,4 +25,6 @@ Creating a dataset card is easy and can be done in a just a few steps:

 4. Once you're done, commit the changes to the `README.md` file and you'll see the completed dataset card on your repository.

+YAML also allows you to customize the way your dataset is loaded by [defining splits and/or configurations](./repository_structure#define-your-splits-and-subsets-in-yaml) without the need to write any code.
+
 Feel free to take a look at the [SNLI](https://huggingface.co/datasets/snli), [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail), and [Allociné](https://huggingface.co/datasets/allocine) dataset cards as examples to help you get started.
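As a sketch of what the YAML-defined splits and configurations enable (the repository id is hypothetical; `get_dataset_config_names` and `get_dataset_split_names` are the existing inspection helpers in 🤗 Datasets):

```python
from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset

# Configurations declared in the dataset card's YAML header can be listed
# without any loading script in the repository.
configs = get_dataset_config_names("my_username/my_dataset")
splits = get_dataset_split_names("my_username/my_dataset", configs[0])

# Load one of the declared configurations by name.
ds = load_dataset("my_username/my_dataset", configs[0])
```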

docs/source/dataset_script.mdx

+6 −3

@@ -3,12 +3,15 @@

 <Tip>

-The dataset script is optional if your dataset is in one of the following formats: CSV, JSON, JSON lines, text or Parquet.
-With those formats, you should be able to load your dataset automatically with [`~datasets.load_dataset`].
+The dataset script is likely not needed if your dataset is in one of the following formats: CSV, JSON, JSON lines, text or Parquet.
+With those formats, you should be able to load your dataset automatically with [`~datasets.load_dataset`],
+as long as your dataset repository has a [required structure](./repository_structure).

 </Tip>

-Write a dataset script to load and share your own datasets. It is a Python file that defines the different configurations and splits of your dataset, as well as how to download and process the data.
+Write a dataset script to load and share datasets that consist of data files in unsupported formats or require more complex data preparation.
+This is a more advanced way to define a dataset than using [YAML metadata in the dataset card](./repository_structure#define-your-splits-in-yaml).
+A dataset script is a Python file that defines the different configurations and splits of your dataset, as well as how to download and process the data.

 The script can download data files from any website, or from the same dataset repository.
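For readers weighing the two approaches, here is a minimal, hypothetical sketch of such a script using the standard `GeneratorBasedBuilder` hooks (the class name, file names and features are made up):

```python
# my_dataset.py: a hypothetical minimal dataset loading script
import json

import datasets


class MyDataset(datasets.GeneratorBasedBuilder):
    """A toy builder reading one JSON-lines file per split."""

    BUILDER_CONFIGS = [datasets.BuilderConfig(name="default", version=datasets.Version("1.0.0"))]

    def _info(self):
        # Describe the dataset and its features.
        return datasets.DatasetInfo(
            description="A hypothetical dataset defined by a loading script.",
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        # Files could also be fetched with dl_manager.download_and_extract(url).
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": "train.jsonl"}),
            datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"filepath": "test.jsonl"}),
        ]

    def _generate_examples(self, filepath):
        # Yield (key, example) pairs, one per line of the JSON-lines file.
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, {"text": json.loads(line)["text"]}
```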

docs/source/package_reference/loading_methods.mdx

+16

@@ -53,30 +53,46 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")

 [[autodoc]] datasets.packaged_modules.text.TextConfig

+[[autodoc]] datasets.packaged_modules.text.Text
+
 ### CSV

 [[autodoc]] datasets.packaged_modules.csv.CsvConfig

+[[autodoc]] datasets.packaged_modules.csv.Csv
+
 ### JSON

 [[autodoc]] datasets.packaged_modules.json.JsonConfig

+[[autodoc]] datasets.packaged_modules.json.Json
+
 ### Parquet

 [[autodoc]] datasets.packaged_modules.parquet.ParquetConfig

+[[autodoc]] datasets.packaged_modules.parquet.Parquet
+
 ### Arrow

 [[autodoc]] datasets.packaged_modules.arrow.ArrowConfig

+[[autodoc]] datasets.packaged_modules.arrow.Arrow
+
 ### SQL

 [[autodoc]] datasets.packaged_modules.sql.SqlConfig

+[[autodoc]] datasets.packaged_modules.sql.Sql
+
 ### Images

 [[autodoc]] datasets.packaged_modules.imagefolder.ImageFolderConfig

+[[autodoc]] datasets.packaged_modules.imagefolder.ImageFolder
+
 ### Audio

 [[autodoc]] datasets.packaged_modules.audiofolder.AudioFolderConfig
+
+[[autodoc]] datasets.packaged_modules.audiofolder.AudioFolder
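A brief usage note for the folder-based builders documented above (the directory layout and paths are hypothetical; the image and audio extras need to be installed):

```python
from datasets import load_dataset

# ImageFolder and AudioFolder infer class labels from directory names,
# e.g. path/to/images/train/cat/0001.png -> label "cat".
images = load_dataset("imagefolder", data_dir="path/to/images")
audio = load_dataset("audiofolder", data_dir="path/to/audio")
```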
