
Commit f49a163

polinaeternal authored, with co-authors lhoestq, mariosasko, and albertvillanova
Support for multiple configs in packaged modules via metadata yaml info (#5331)
* load configs parameters from metadata
* add dynamical builder creation
* merge configs kwargs and dataset info in hub without script module
* fix loading from configs for packaged module
* update push to hub to include configs and update info in metadata yaml correctly
* add class to read and write configs from metadata
* add configs to Dataset.push_to_hub
* refactor get_module of local ds
* add test for push_to_hub with multiple configs
* get readme from data_files in local packaged ds (probably not needed)
* change cache dirs names to include dataset name
* set configs on instance instead of dynamically creating new builder class
* add test for loading with different configs from data_dir
* modify config names tests, modify tests for one config local factory
* fix pickling and update metadata methods to convert to/from builders configs
* update get config names to load configs from meta, refactor import of builder cls
* more tests for local and hub factories
* change builder name of parametrized builders, fix inspect_metric
* fix default configs names in inspect.py tests, change parametrized builder.name in init instead of setting additional attribute
* fix docstrings in push_to_hub
* get back parametrized builder name as an attr because it's used to set info.builder_name in parent's init
* add test for loading parametrized dataset with num_proc>1
* fix writing 'data' dir for default config in push_to_hub
* fix custom splits in push_to_hub (change data dir structure for custom configs)
* pass only existing params to builder configs
* fix test for push to hub with configs, test default config too
* fix dataset_json reading in get_module, add tests for local and packaged factories
* update dataset_infos.json for Dataset.push_to_hub()
* add dataset_name attr to builder class to differentiate between packaged builder names
* use builder.dataset_name everywhere in filenames and dirs, add it to tests
* use datasets.asdict when parsing configs from BuilderConfig objects instead of custom func
* resolve data_files for all metadata configs in order not to pass config_kwargs to builder in local modules (ugly); fix some outdated var names
* get data files for metadata configs
* pass 'default' config_name for packaged modules without config since it cannot be None
* move converting metadata to configs out of configuring function to fix pickling issues
* update hash of packaged builders with custom config
* simplify update_hash
* add test for custom split names in custom configs dataset with .push_to_hub
* rename METADATA_CONFIGS_FIELD from configs_kwargs to builder_config
* simplify metadata loading, some renames
* update tests to reflect change of metadata configs field name
* refactor data files resolving for metadata configs, make them methods of MetadataConfigs class
* add tests for resolving data files in metadata configs
* update hash for packaged modules with configs in load instead of builder
* revert moving finding patterns and resolving data files into a separate func
* don't raise error in packaged factory
* extend sanitize_patterns for data_files from yaml
* disallow pushing metadata with a dict data_files
* update Dataset.push_to_hub
* update DatasetDict.push_to_hub
* remove redundant code
* minor comment
* error for bad data_files, and warn for bad params
* add MetadataConfigs.get_default_config_name
* error in sanitize_patterns on bad data_files
* check default config name in get_dataset_builder_class
* test push_to_hub when no metadata configs
* fix ignored_params check
* remove configs support from PackagedDatasetModuleFactory
* remove it from tests
* fix missed parameter for reduce
* add integration tests for loading
* fix regex for parquet filenames
* fix metadata configs creation: put all splits in yaml, not only the last one
* fix tests for push_to_hub
* roll back push_to_hub_without_meta pattern string
* escape/replace some special characters in pattern in string_to_dict
* fix: escape '*' in string_to_dict again
* fix: pattern in tests for backward compatibility in push to hub
* join Quentin's tests and mine (lots of copy-paste but more checks)
* remove '-' from sharded parquet pattern
* separate DataFilesDict and MetadataConfigs
* set default config when there is only one dataset_info
* fix: pass config name to resolve_data_files_locally
* fix: tests for local module without script
* fix: default config=None when creating a builder
* cache backward compat
* fix legacy cache path creation for local datasets
* fix of fix of legacy cache path for local datasets
* fix dataset_name creation (make it not None only for packaged modules)
* remove custom reduce and add check if dynamical builder class is picklable
* test if builder is picklable with 'pickle', not 'dill'
* get back custom reduce to enable pickle serialization
* fix test for pickle: pickle instance, not class
* remove get_builder_config method from metadata class
* fix: pass config_kwargs as arguments to builder class
* get dataset_name in get_module()
* wrap yaml error message in metadata
* move glob->regex to a func, add ref to fsspec
* implement DataFilesList additions
* get back all data_files resolving logic to load.py
* move inferring modules for all splits to a func
* refactor data files resolving and creation of builder configs from metadata
* rename metadata configs field: builder_config -> configs
* make error message in sanitize_patterns yaml-agnostic
* improve error message in yaml validation
* fix yaml data files validation (raise error)
* move test datasets to datasets-maintainers repo
* fix yaml validation
* change yaml format to only allow lists and have a required config_name field
* rename yaml field: builder_configs -> configs since https://github.com/huggingface/moon-landing/pull/6490 is deployed
* update dataset ids in tests
* rename to dataset_card_data
* remove glob_pattern_to_regex from string_to_dict, use it only where we pass glob pattern
* group dataset module fields (related to configs construction)
* don't instantiate BASE_FEATURE
* update docs
* raise if data files resolving fails during metadata config resolving, because otherwise the error says that there are no data files in the repository, which is misleading

---------

Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Mario Šaško <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Albert Villanova del Moral <[email protected]>
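For illustration, a minimal sketch of the workflow this change enables, assuming the `config_name` argument of `push_to_hub` introduced here and a hypothetical repository `my_username/my_dataset` (pushing requires being logged in to the Hub):

```python
from datasets import Dataset, load_dataset

# Two hypothetical subsets that should live in the same repository
# as separate configurations.
ds_en = Dataset.from_dict({"text": ["hello", "world"]})
ds_fr = Dataset.from_dict({"text": ["bonjour", "monde"]})

# Pushing each subset under its own config name records a `configs`
# entry in the YAML header of the repository's README.md.
ds_en.push_to_hub("my_username/my_dataset", config_name="en")
ds_fr.push_to_hub("my_username/my_dataset", config_name="fr")

# The configs can then be loaded by name, without any loading script.
ds = load_dataset("my_username/my_dataset", "fr")
```

The same `configs` field can also be written by hand in the dataset card's YAML to map config names to `data_files` patterns, which is what the metadata-parsing code added by this commit reads.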
1 parent 67ac60b commit f49a163

31 files changed (+2074 −418 lines)

ADD_NEW_DATASET.md

+2 −2

@@ -4,5 +4,5 @@ Add datasets directly to the 🤗 Hugging Face Hub!

 You can share your dataset on https://huggingface.co/datasets directly using your account, see the documentation:

-* [Create a dataset and upload files](https://huggingface.co/docs/datasets/upload_dataset)
-* [Advanced guide using dataset scripts](https://huggingface.co/docs/datasets/share)
+* [Create a dataset and upload files on the website](https://huggingface.co/docs/datasets/upload_dataset)
+* [Advanced guide using the CLI](https://huggingface.co/docs/datasets/share)

CONTRIBUTING.md

+2 −2

@@ -95,8 +95,8 @@ Note that if any files were formatted by `pre-commit` hooks during committing, y

 You can share your dataset on https://huggingface.co/datasets directly using your account, see the documentation:

-* [Create a dataset and upload files](https://huggingface.co/docs/datasets/upload_dataset)
-* [Advanced guide using dataset scripts](https://huggingface.co/docs/datasets/share)
+* [Create a dataset and upload files on the website](https://huggingface.co/docs/datasets/upload_dataset)
+* [Advanced guide using the CLI](https://huggingface.co/docs/datasets/share)

 ## How to contribute to the dataset cards

docs/source/_toctree.yml

+2 −2

@@ -88,12 +88,12 @@
 - sections:
   - local: share
     title: Share
-  - local: dataset_script
-    title: Create a dataset loading script
   - local: dataset_card
     title: Create a dataset card
   - local: repository_structure
     title: Structure your repository
+  - local: dataset_script
+    title: Create a dataset loading script
   title: "Dataset repository"
  title: "How-to guides"
 - sections:

docs/source/about_dataset_load.mdx

+19 −13

@@ -9,22 +9,35 @@ Let's begin with a basic Explain Like I'm Five.
 A dataset is a directory that contains:

 - Some data files in generic formats (JSON, CSV, Parquet, text, etc.)
-- An optional dataset script if it requires some code to read the data files. This is used to load files of all formats and structures.
+- A dataset card named `README.md` that contains documentation about the dataset as well as a YAML header to define the dataset's tags and configurations
+- An optional dataset script if it requires some code to read the data files. This is sometimes used to load files of specific formats and structures.

 The [`load_dataset`] function fetches the requested dataset locally or from the Hugging Face Hub.
 The Hub is a central repository where all the Hugging Face datasets and models are stored.

 If the dataset only contains data files, then [`load_dataset`] automatically infers how to load the data files from their extensions (json, csv, parquet, txt, etc.).
+Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based on the data files format. There is one builder per data file format in 🤗 Datasets:
+
+* [`datasets.packaged_modules.text.Text`] for text
+* [`datasets.packaged_modules.csv.Csv`] for CSV and TSV
+* [`datasets.packaged_modules.json.Json`] for JSON and JSONL
+* [`datasets.packaged_modules.parquet.Parquet`] for Parquet
+* [`datasets.packaged_modules.arrow.Arrow`] for Arrow (streaming file format)
+* [`datasets.packaged_modules.sql.Sql`] for SQL databases
+* [`datasets.packaged_modules.imagefolder.ImageFolder`] for image folders
+* [`datasets.packaged_modules.audiofolder.AudioFolder`] for audio folders
+
 If the dataset has a dataset script, then it downloads and imports it from the Hugging Face Hub.
-Code in the dataset script defines the dataset information (description, features, URL to the original files, etc.), and tells 🤗 Datasets how to generate and display examples from it.
+Code in the dataset script defines a custom [`DatasetBuilder`] with the dataset information (description, features, URL to the original files, etc.), and tells 🤗 Datasets how to generate and display examples from it.

 <Tip>

-Read the [Share](./share) section to learn more about how to share a dataset. This section also provides a step-by-step guide on how to write your own dataset loading script!
+Read the [Share](./upload_dataset) section to learn more about how to share a dataset. This section also provides a step-by-step guide on how to write your own dataset loading script!

 </Tip>

-The dataset script downloads the dataset files from the original URL, generates the dataset and caches it in an Arrow table on your drive. If you've downloaded the dataset before, then 🤗 Datasets will reload it from the cache to save you the trouble of downloading it again.
+🤗 Datasets downloads the dataset files from the original URL, generates the dataset and caches it in an Arrow table on your drive.
+If you've downloaded the dataset before, then 🤗 Datasets will reload it from the cache to save you the trouble of downloading it again.

 Now that you have a high-level understanding about how datasets are built, let's take a closer look at the nuts and bolts of how all this works.
@@ -83,21 +96,14 @@ There are three main methods in [`DatasetBuilder`]:

 The dataset is generated with a Python generator, which doesn't load all the data in memory. As a result, the generator can handle large datasets. However, before the generated samples are flushed to the dataset file on disk, they are stored in an `ArrowWriter` buffer. This means the generated samples are written by batch. If your dataset samples consumes a lot of memory (images or videos), then make sure to specify a low value for the `DEFAULT_WRITER_BATCH_SIZE` attribute in [`DatasetBuilder`]. We recommend not exceeding a size of 200 MB.

-## Without loading scripts
-
-As a user, you want to be able to quickly use a dataset. Implementing a dataset loading script can sometimes get in the way, or it may be a barrier for some people without a developer background. 🤗 Datasets removes this barrier by making it possible to load any dataset from the Hub without a dataset loading script. All a user has to do is upload the data files (see [upload_dataset_repo](#upload_dataset_repo) for a list of supported file formats) to a dataset repository on the Hub, and they will be able to load that dataset without having to create a loading script. This doesn't mean we are moving away from loading scripts because they still offer the most flexibility in controlling how a dataset is generated.
-
-The loading script-free method uses the [huggingface_hub](https://github.com/huggingface/huggingface_hub) library to list the files in a dataset repository. You can also provide a path to a local directory instead of a repository name, in which case 🤗 Datasets will use [glob](https://docs.python.org/3/library/glob) instead. Depending on the format of the data files available, one of the data file builders will create your dataset for you. If you have a CSV file, the CSV builder will be used and if you have a Parquet file, the Parquet builder will be used. The drawback of this approach is it's not possible to simultaneously load a CSV and JSON file. You will need to load the two file types separately, and then concatenate them.
-
 ## Maintaining integrity

 To ensure a dataset is complete, [`load_dataset`] will perform a series of tests on the downloaded files to make sure everything is there. This way, you don't encounter any surprises when your requested dataset doesn't get generated as expected. [`load_dataset`] verifies:

-- The list of downloaded files.
-- The number of bytes of the downloaded files.
-- The SHA256 checksums of the downloaded files.
 - The number of splits in the generated `DatasetDict`.
 - The number of samples in each split of the generated `DatasetDict`.
+- The list of downloaded files.
+- The SHA256 checksums of the downloaded files (disabled by default).

 If the dataset doesn't pass the verifications, it is likely that the original host of the dataset made some changes in the data files.
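A rough sketch of the loading behaviour described in this page (the paths are hypothetical, and `verification_mode` is assumed to be the argument that controls the optional checks):

```python
from datasets import load_dataset

# A hypothetical local directory containing train.csv and test.csv:
# the format is inferred from the file extensions and the packaged Csv builder
# (datasets.packaged_modules.csv.Csv) is used under the hood.
ds = load_dataset("path/to/data_dir")

# The builder can also be named explicitly, and the checksum verification
# (disabled by default) can be requested.
ds = load_dataset(
    "csv",
    data_files={"train": "path/to/data_dir/train.csv"},
    verification_mode="all_checks",
)
```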

docs/source/dataset_card.mdx

+2

@@ -25,4 +25,6 @@ Creating a dataset card is easy and can be done in a just a few steps:

 4. Once you're done, commit the changes to the `README.md` file and you'll see the completed dataset card on your repository.

+YAML also allows you to customize the way your dataset is loaded by [defining splits and/or configurations](./repository_structure#define-your-splits-and-subsets-in-yaml) without the need to write any code.
+
 Feel free to take a look at the [SNLI](https://huggingface.co/datasets/snli), [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail), and [Allociné](https://huggingface.co/datasets/allocine) dataset cards as examples to help you get started.
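As a sketch of what the YAML-defined splits and configurations enable (the repository id is hypothetical; `get_dataset_config_names` and `get_dataset_split_names` are the existing inspection helpers in 🤗 Datasets):

```python
from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset

# Configurations declared in the dataset card's YAML header can be listed
# without any loading script in the repository.
configs = get_dataset_config_names("my_username/my_dataset")
splits = get_dataset_split_names("my_username/my_dataset", configs[0])

# Load one of the declared configurations by name.
ds = load_dataset("my_username/my_dataset", configs[0])
```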

docs/source/dataset_script.mdx

+6 −3

@@ -3,12 +3,15 @@

 <Tip>

-The dataset script is optional if your dataset is in one of the following formats: CSV, JSON, JSON lines, text or Parquet.
-With those formats, you should be able to load your dataset automatically with [`~datasets.load_dataset`].
+The dataset script is likely not needed if your dataset is in one of the following formats: CSV, JSON, JSON lines, text or Parquet.
+With those formats, you should be able to load your dataset automatically with [`~datasets.load_dataset`],
+as long as your dataset repository has a [required structure](./repository_structure).

 </Tip>

-Write a dataset script to load and share your own datasets. It is a Python file that defines the different configurations and splits of your dataset, as well as how to download and process the data.
+Write a dataset script to load and share datasets that consist of data files in unsupported formats or require more complex data preparation.
+This is a more advanced way to define a dataset than using [YAML metadata in the dataset card](./repository_structure#define-your-splits-in-yaml).
+A dataset script is a Python file that defines the different configurations and splits of your dataset, as well as how to download and process the data.

 The script can download data files from any website, or from the same dataset repository.
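For readers weighing the two approaches, here is a minimal, hypothetical sketch of such a script using the standard `GeneratorBasedBuilder` hooks (the class name, file names and features are made up):

```python
# my_dataset.py: a hypothetical minimal dataset loading script
import json

import datasets


class MyDataset(datasets.GeneratorBasedBuilder):
    """A toy builder reading one JSON-lines file per split."""

    BUILDER_CONFIGS = [datasets.BuilderConfig(name="default", version=datasets.Version("1.0.0"))]

    def _info(self):
        # Describe the dataset and its features.
        return datasets.DatasetInfo(
            description="A hypothetical dataset defined by a loading script.",
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        # Files could also be fetched with dl_manager.download_and_extract(url).
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": "train.jsonl"}),
            datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"filepath": "test.jsonl"}),
        ]

    def _generate_examples(self, filepath):
        # Yield (key, example) pairs, one per line of the JSON-lines file.
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, {"text": json.loads(line)["text"]}
```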

docs/source/package_reference/loading_methods.mdx

+16

@@ -53,30 +53,46 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")

 [[autodoc]] datasets.packaged_modules.text.TextConfig

+[[autodoc]] datasets.packaged_modules.text.Text
+
 ### CSV

 [[autodoc]] datasets.packaged_modules.csv.CsvConfig

+[[autodoc]] datasets.packaged_modules.csv.Csv
+
 ### JSON

 [[autodoc]] datasets.packaged_modules.json.JsonConfig

+[[autodoc]] datasets.packaged_modules.json.Json
+
 ### Parquet

 [[autodoc]] datasets.packaged_modules.parquet.ParquetConfig

+[[autodoc]] datasets.packaged_modules.parquet.Parquet
+
 ### Arrow

 [[autodoc]] datasets.packaged_modules.arrow.ArrowConfig

+[[autodoc]] datasets.packaged_modules.arrow.Arrow
+
 ### SQL

 [[autodoc]] datasets.packaged_modules.sql.SqlConfig

+[[autodoc]] datasets.packaged_modules.sql.Sql
+
 ### Images

 [[autodoc]] datasets.packaged_modules.imagefolder.ImageFolderConfig

+[[autodoc]] datasets.packaged_modules.imagefolder.ImageFolder
+
 ### Audio

 [[autodoc]] datasets.packaged_modules.audiofolder.AudioFolderConfig
+
+[[autodoc]] datasets.packaged_modules.audiofolder.AudioFolder
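A brief usage note for the folder-based builders documented above (the directory layout and paths are hypothetical; the image and audio extras need to be installed):

```python
from datasets import load_dataset

# ImageFolder and AudioFolder infer class labels from directory names,
# e.g. path/to/images/train/cat/0001.png -> label "cat".
images = load_dataset("imagefolder", data_dir="path/to/images")
audio = load_dataset("audiofolder", data_dir="path/to/audio")
```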
