Support for multiple configs in packaged modules via metadata yaml info (#5331)
* load config parameters from metadata
* add dynamic builder creation
* merge config kwargs and dataset info in the Hub module without a script
* fix loading from configs for packaged module
* update push to hub to include configs and update info in metadata yaml correctly
* add class to read and write configs from metadata
* add configs to Dataset.push_to_hub (see the usage sketch below the commit message)
* refactor get_module of local ds
* add test for push_to_hub with multiple configs
* get readme from data_files in local packaged ds (probably not needed)
* change cache dirs names to include dataset name
* set configs on instance instead of dynamically creating new builder class
* add test for loading with different configs from data_dir
* modify config names tests, modify tests for one-config local factory
* fix pickling and update metadata methods to convert to/from builders configs
* update get config names to load configs from meta, refactor import of builder cls
* more tests for local and hub factories
* change builder name of parametrized builders, fix inspect_metric
* fix default config names in inspect.py tests, change parametrized builder.name in init instead of setting additional attribute
* fix docstrings in push_to_hub
* get back parametrized builder name as an attr because it's used to set info.builder_name in parent's init
* add test for loading parametrized dataset with num_proc>1
* fix writing 'data' dir for default config in push_to_hub
* fix custom splits in push_to_hub (change data dir structure for custom configs)
* pass only existing params to builder configs
* fix test for push to hub with configs, test default config too
* fix dataset_json reading in get_module, add tests for local and packaged factories
* update dataset_infos.json for Dataset.push_to_hub()
* add dataset_name attr to builder class to differentiate between packaged builder names
* use builder.dataset_name everywhere in filenames and dirs, add it to tests
* use datasets.asdict when parsing configs from BuilderConfig objects instead of custom func
* resolve data_files for all metadata configs so that config_kwargs are not passed to the builder in local modules (ugly); fix some outdated var names
* get data files for metadata configs
* pass 'default' config_name for packaged modules without config since it cannot be None
* move converting metadata to configs out of configuring function to fix pickling issues
* update hash of packaged builders with custom config
* simplify update_hash
* add test for custom split names in custom configs dataset with .push_to_hub
* rename METADATA_CONFIGS_FIELD from configs_kwargs to builder_config
* simplify metadata loading, some renames
* update tests to reflect change of metadata configs field name
* refactor data files resolving for metadata configs: make them methods of the MetadataConfigs class
* add tests for resolving data files in metadata configs
* update hash for packaged modules with configs in load instead of builder
* revert moving finding patterns and resolving data files in a separate func
* don't raise error in packaged factory
* extend sanitize_patterns for data_files from yaml
* disallow pushing metadata with a dict data_files
* update Dataset.push_to_hub
* update DatasetDict.push_to_hub
* remove redundant code
* minor comment
* error for bad data_files, and warn for bad params
* add MetadataConfigs.get_default_config_name
* error in sanitize_patterns on bad data_files
* check default config name in get_dataset_builder_class
* test push_to_hub when no metadata configs
* fix ignored_params check
* remove configs support from PackagedDatasetModuleFactory
* remove it from tests
* fix missing parameter for reduce
* add integration tests for loading
* fix regex for parquet filenames
* fix metadata configs creation: put all splits in yaml, not only the last one
* fix tests for push_to_hub
* roll back push_to_hub_without_meta pattern string
* escape/replace some special characters in pattern in string_to_dict
* fix: escape '*' in string_to_dict again
* fix: pattern in tests for backward compatibility in push to hub
* join Quentin's tests and mine (lots of copy-paste but more checks)
* remove '-' from sharded parquet pattern
* separate DataFilesDict and MetadataConfigs
* set default config when there is only one dataset_info
* fix: pass config name to resolve_data_files_locally
* fix: tests for local module without script
* fix: default config=None when creating a builder
* cache backward compat
* fix legacy cache path creation for local datasets
* fix of fix of legacy cache path for local datasets
* fix dataset_name creation (make it not None only for packaged modules)
* remove custom reduce and add a check that the dynamically created builder class is picklable
* test if builder is picklable with 'pickle', not 'dill'
* get back custom reduce to enable pickle serialization
* fix test for pickle: pickle instance, not class
* remove get_builder_config method from metadata class
* fix: pass config_kwargs as arguments to builder class
* get dataset_name in get_module()
* wrap yaml error message in metadata
* move glob->regex to a func, add ref to fsspec
* implement DataFilesList additions
* get back all data_files resolving logic to load.py
* move inferring modules for all splits to a func
* refactor data files resolving and creation of builder configs from metadata
* rename metadata configs field: builder_config -> configs
* make error message in sanitize_patterns yaml-agnostic
* improve error message in yaml validation
* fix yaml data files validation (raise error)
* move test datasets to datasets-maintainters repo
* fix yaml validation
* change yaml format to only allow lists and have a required config_name field
* rename yaml field: builder_configs -> configs
since https://github.com/huggingface/moon-landing/pull/6490 is deployed
* update datasets ids in tests
* rename to dataset_card_data
* remove glob_pattern_to_regex from string_to_dict, use it only where we pass glob pattern
* group dataset module fields (related to configs construction)
* don't instantiate BASE_FEATURE
* update docs
* raise if data files resolving raises an error during metadata config resolving, because otherwise the error says there are no data files in the repository, which is misleading
---------
Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Mario Šaško <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Albert Villanova del Moral <[email protected]>
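
Taken together, these changes let `push_to_hub` register each configuration in the dataset card's YAML metadata and let `load_dataset` pick a configuration by name without any loading script. A minimal usage sketch, assuming a hypothetical repository `my-username/my-dataset` and hypothetical config names `en` and `de`:

```python
from datasets import Dataset, load_dataset

# Push two configurations to the same (hypothetical) dataset repository.
# Each call records its config name and data files in the README.md YAML header.
Dataset.from_dict({"text": ["hello", "good morning"]}).push_to_hub("my-username/my-dataset", config_name="en")
Dataset.from_dict({"text": ["hallo", "guten Morgen"]}).push_to_hub("my-username/my-dataset", config_name="de")

# Each configuration can then be loaded by name, with no dataset script involved.
ds_en = load_dataset("my-username/my-dataset", "en")
ds_de = load_dataset("my-username/my-dataset", "de")
```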
docs/source/about_dataset_load.mdx (+19 −13)

@@ -9,22 +9,35 @@ Let's begin with a basic Explain Like I'm Five.

 A dataset is a directory that contains:

 - Some data files in generic formats (JSON, CSV, Parquet, text, etc.)
-- An optional dataset script if it requires some code to read the data files. This is used to load files of all formats and structures.
+- A dataset card named `README.md` that contains documentation about the dataset as well as a YAML header to define the dataset's tags and configurations
+- An optional dataset script if it requires some code to read the data files. This is sometimes used to load files of specific formats and structures.

 The [`load_dataset`] function fetches the requested dataset locally or from the Hugging Face Hub.
 The Hub is a central repository where all the Hugging Face datasets and models are stored.

 If the dataset only contains data files, then [`load_dataset`] automatically infers how to load the data files from their extensions (json, csv, parquet, txt, etc.).
+Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based on the data files format. There exists one builder per data file format in 🤗 Datasets:
+
+* [`datasets.packaged_modules.text.Text`] for text
+* [`datasets.packaged_modules.csv.Csv`] for CSV and TSV
+* [`datasets.packaged_modules.json.Json`] for JSON and JSONL
+* [`datasets.packaged_modules.parquet.Parquet`] for Parquet
+* [`datasets.packaged_modules.arrow.Arrow`] for Arrow (streaming file format)
+* [`datasets.packaged_modules.sql.Sql`] for SQL databases
+* [`datasets.packaged_modules.imagefolder.ImageFolder`] for image folders
+* [`datasets.packaged_modules.audiofolder.AudioFolder`] for audio folders

 If the dataset has a dataset script, then it downloads and imports it from the Hugging Face Hub.
-Code in the dataset script defines the dataset information (description, features, URL to the original files, etc.), and tells 🤗 Datasets how to generate and display examples from it.
+Code in the dataset script defines a custom [`DatasetBuilder`] with the dataset information (description, features, URL to the original files, etc.), and tells 🤗 Datasets how to generate and display examples from it.

 <Tip>

-Read the [Share](./share) section to learn more about how to share a dataset. This section also provides a step-by-step guide on how to write your own dataset loading script!
+Read the [Share](./upload_dataset) section to learn more about how to share a dataset. This section also provides a step-by-step guide on how to write your own dataset loading script!

 </Tip>

-The dataset script downloads the dataset files from the original URL, generates the dataset and caches it in an Arrow table on your drive. If you've downloaded the dataset before, then 🤗 Datasets will reload it from the cache to save you the trouble of downloading it again.
+🤗 Datasets downloads the dataset files from the original URL, generates the dataset and caches it in an Arrow table on your drive.
+If you've downloaded the dataset before, then 🤗 Datasets will reload it from the cache to save you the trouble of downloading it again.

 Now that you have a high-level understanding about how datasets are built, let's take a closer look at the nuts and bolts of how all this works.
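
As a concrete illustration of the builder selection described above, the following sketch loads a CSV-only dataset; the repository id and file names are placeholders:

```python
from datasets import load_dataset

# A repository (or local directory) that only contains CSV files
# is routed to the Csv packaged builder automatically.
ds = load_dataset("my-username/my-csv-dataset")  # hypothetical repository id

# The same builder is used when pointing at local CSV files explicitly.
local_ds = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
```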
@@ -83,21 +96,14 @@ There are three main methods in [`DatasetBuilder`]:

 The dataset is generated with a Python generator, which doesn't load all the data in memory. As a result, the generator can handle large datasets. However, before the generated samples are flushed to the dataset file on disk, they are stored in an `ArrowWriter` buffer. This means the generated samples are written by batch. If your dataset samples consume a lot of memory (images or videos), then make sure to specify a low value for the `DEFAULT_WRITER_BATCH_SIZE` attribute in [`DatasetBuilder`]. We recommend not exceeding a size of 200 MB.

-## Without loading scripts
-
-As a user, you want to be able to quickly use a dataset. Implementing a dataset loading script can sometimes get in the way, or it may be a barrier for some people without a developer background. 🤗 Datasets removes this barrier by making it possible to load any dataset from the Hub without a dataset loading script. All a user has to do is upload the data files (see [upload_dataset_repo](#upload_dataset_repo) for a list of supported file formats) to a dataset repository on the Hub, and they will be able to load that dataset without having to create a loading script. This doesn't mean we are moving away from loading scripts because they still offer the most flexibility in controlling how a dataset is generated.
-
-The loading script-free method uses the [huggingface_hub](https://github.com/huggingface/huggingface_hub) library to list the files in a dataset repository. You can also provide a path to a local directory instead of a repository name, in which case 🤗 Datasets will use [glob](https://docs.python.org/3/library/glob) instead. Depending on the format of the data files available, one of the data file builders will create your dataset for you. If you have a CSV file, the CSV builder will be used and if you have a Parquet file, the Parquet builder will be used. The drawback of this approach is it's not possible to simultaneously load a CSV and JSON file. You will need to load the two file types separately, and then concatenate them.
-
 ## Maintaining integrity

 To ensure a dataset is complete, [`load_dataset`] will perform a series of tests on the downloaded files to make sure everything is there. This way, you don't encounter any surprises when your requested dataset doesn't get generated as expected. [`load_dataset`] verifies:

-- The list of downloaded files.
-- The number of bytes of the downloaded files.
-- The SHA256 checksums of the downloaded files.
 - The number of splits in the generated `DatasetDict`.
 - The number of samples in each split of the generated `DatasetDict`.
+- The list of downloaded files.
+- The SHA256 checksums of the downloaded files (disabled by default).

 If the dataset doesn't pass the verifications, it is likely that the original host of the dataset made some changes in the data files.
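
In recent versions of 🤗 Datasets, the split and sample counts are checked by default, while the file list and checksum verifications can be re-enabled through the `verification_mode` argument of [`load_dataset`]. A short, hedged sketch with a placeholder dataset id:

```python
from datasets import load_dataset

# "basic_checks" (the usual default) only verifies split names and sample counts;
# "all_checks" additionally verifies the list of downloaded files and their checksums,
# provided the expected values are recorded for the dataset.
ds = load_dataset("my-username/my-dataset", verification_mode="all_checks")  # placeholder id
```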
docs/source/dataset_card.mdx (+2)

@@ -25,4 +25,6 @@ Creating a dataset card is easy and can be done in just a few steps:

 4. Once you're done, commit the changes to the `README.md` file and you'll see the completed dataset card on your repository.

+YAML also allows you to customize the way your dataset is loaded by [defining splits and/or configurations](./repository_structure#define-your-splits-and-subsets-in-yaml) without the need to write any code.
+
 Feel free to take a look at the [SNLI](https://huggingface.co/datasets/snli), [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail), and [Allociné](https://huggingface.co/datasets/allocine) dataset cards as examples to help you get started.
docs/source/dataset_script.mdx (+6 −3)

@@ -3,12 +3,15 @@

 <Tip>

-The dataset script is optional if your dataset is in one of the following formats: CSV, JSON, JSON lines, text or Parquet.
-With those formats, you should be able to load your dataset automatically with [`~datasets.load_dataset`].
+The dataset script is likely not needed if your dataset is in one of the following formats: CSV, JSON, JSON lines, text or Parquet.
+With those formats, you should be able to load your dataset automatically with [`~datasets.load_dataset`],
+as long as your dataset repository has a [required structure](./repository_structure).

 </Tip>

-Write a dataset script to load and share your own datasets. It is a Python file that defines the different configurations and splits of your dataset, as well as how to download and process the data.
+Write a dataset script to load and share datasets that consist of data files in unsupported formats or require more complex data preparation.
+This is a more advanced way to define a dataset than using [YAML metadata in the dataset card](./repository_structure#define-your-splits-in-yaml).
+A dataset script is a Python file that defines the different configurations and splits of your dataset, as well as how to download and process the data.

 The script can download data files from any website, or from the same dataset repository.
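
For orientation, a dataset script is typically a subclass of `datasets.GeneratorBasedBuilder`. The following is a minimal, hypothetical sketch; the download URL, config name, and feature names are made up for illustration:

```python
import csv

import datasets

_URL = "https://example.com/data/train.csv"  # hypothetical download URL


class MyDataset(datasets.GeneratorBasedBuilder):
    """A minimal, hypothetical dataset script with a single configuration."""

    BUILDER_CONFIGS = [datasets.BuilderConfig(name="default", version=datasets.Version("1.0.0"))]
    DEFAULT_WRITER_BATCH_SIZE = 100  # keep the ArrowWriter buffer small for memory-heavy samples

    def _info(self):
        # Describes the dataset and its features.
        return datasets.DatasetInfo(
            description="A toy dataset used only to illustrate the script structure.",
            features=datasets.Features(
                {"text": datasets.Value("string"), "label": datasets.Value("int64")}
            ),
        )

    def _split_generators(self, dl_manager):
        # Downloads the data file and declares a single train split.
        path = dl_manager.download_and_extract(_URL)
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": path})]

    def _generate_examples(self, filepath):
        # Yields (key, example) pairs read from the CSV file.
        with open(filepath, encoding="utf-8") as f:
            for idx, row in enumerate(csv.DictReader(f)):
                yield idx, {"text": row["text"], "label": int(row["label"])}
```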