From dd18988e7f44393f6a5c42b9172c1b967d8c4a6f Mon Sep 17 00:00:00 2001
From: Harry Yang
Date: Mon, 21 Apr 2025 20:19:09 -0400
Subject: [PATCH 1/3] fix 7457

---
 docs/source/cache.mdx | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/docs/source/cache.mdx b/docs/source/cache.mdx
index bf344a09bb7..f0ea833d9de 100644
--- a/docs/source/cache.mdx
+++ b/docs/source/cache.mdx
@@ -21,6 +21,14 @@ The default 🤗 Datasets cache directory is `~/.cache/huggingface/datasets`. Ch
 ```
 $ export HF_HOME="/path/to/another/directory/datasets"
+
+```
+## HF_DATASETS_CACHE
+
+In addition to using `HF_HOME`, you can override the default 🤗 Datasets cache directory by setting the `HF_DATASETS_CACHE` environment variable. This variable allows you to specify a custom cache location for datasets converted into Arrow format. For instance:
+
+```
+$ export HF_DATASETS_CACHE="/path/to/your/custom/cache"
 ```

 When you load a dataset, you also have the option to change where the data is cached. Change the `cache_dir` parameter to the path you want:

From e705cd39c9ca22ae5cfe41c7b8dcfe80fe413e5a Mon Sep 17 00:00:00 2001
From: Harry Yang
Date: Tue, 6 May 2025 10:44:37 -0400
Subject: [PATCH 2/3] fix 7457 and 7480

---
 docs/source/cache.mdx | 50 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/docs/source/cache.mdx b/docs/source/cache.mdx
index f0ea833d9de..c6703bb8a5f 100644
--- a/docs/source/cache.mdx
+++ b/docs/source/cache.mdx
@@ -20,38 +20,64 @@ This guide focuses on the 🤗 Datasets cache and will show you how to:
 The default 🤗 Datasets cache directory is `~/.cache/huggingface/datasets`. Change the cache location by setting the shell environment variable, `HF_HOME` to another directory:

 ```
-$ export HF_HOME="/path/to/another/directory/datasets"
+
+\$ export HF\_HOME="/path/to/another/directory/datasets"

 ```
-## HF_DATASETS_CACHE

-In addition to using `HF_HOME`, you can override the default 🤗 Datasets cache directory by setting the `HF_DATASETS_CACHE` environment variable. This variable allows you to specify a custom cache location for datasets converted into Arrow format. For instance:
+Alternatively, you can set the `HF_DATASETS_CACHE` environment variable to control only the datasets-specific cache directory:

 ```
-$ export HF_DATASETS_CACHE="/path/to/your/custom/cache"
+
+export HF\_DATASETS\_CACHE="/path/to/datasets\_cache"
+
 ```
+
+⚠️ This only applies to files written by the `datasets` library (e.g., Arrow files and indices).
+It does **not** affect files downloaded from the Hugging Face Hub (such as models, tokenizers, or raw dataset sources), which are controlled separately via the `HF_HUB_CACHE` variable:
+
+```
+
+export HF\_HUB\_CACHE="/path/to/hub\_cache"
+
+```
+
+💡 If you'd like to relocate all Hugging Face caches—including datasets and hub downloads—use the `HF_HOME` variable instead:
+
+```
+
+export HF\_HOME="/path/to/cache\_root"
+
+````
+
+This results in:
+- datasets cache → `/path/to/cache_root/datasets`
+- hub cache → `/path/to/cache_root/hub`
+
+These distinctions are especially useful when working in shared environments or networked file systems (e.g., NFS).
+See [issue #7480](https://github.com/huggingface/datasets/issues/7480) for discussion on how users encountered unexpected cache locations when `HF_HUB_CACHE` was not set alongside `HF_DATASETS_CACHE`.

 When you load a dataset, you also have the option to change where the data is cached. Change the `cache_dir` parameter to the path you want:

 ```py
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('username/dataset', cache_dir="/path/to/another/directory/datasets")
-```
+````

 ## Download mode

-After you download a dataset, control how it is loaded by [`load_dataset`] with the `download_mode` parameter. By default, 🤗 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below:
+After you download a dataset, control how it is loaded by \[`load_dataset`] with the `download_mode` parameter. By default, 🤗 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below:

 ```py
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('rajpurkar/squad', download_mode='force_redownload')
 ```

-Refer to [`DownloadMode`] for a full list of download modes.
+Refer to \[`DownloadMode`] for a full list of download modes.

 ## Cache files

-Clean up the Arrow cache files in the directory with [`Dataset.cleanup_cache_files`]:
+Clean up the Arrow cache files in the directory with \[`Dataset.cleanup_cache_files`]:

 ```py
 # Returns the number of removed cache files
@@ -61,7 +87,7 @@ Clean up the Arrow cache files in the directory with \[`Dataset.cleanup_cache_fil
 ## Enable or disable caching

-If you're using a cached file locally, it will automatically reload the dataset with any previous transforms you applied to the dataset. Disable this behavior by setting the argument `load_from_cache_file=False` in [`Dataset.map`]:
+If you're using a cached file locally, it will automatically reload the dataset with any previous transforms you applied to the dataset. Disable this behavior by setting the argument `load_from_cache_file=False` in \[`Dataset.map`]:

 ```py
 >>> updated_dataset = small_dataset.map(add_prefix, load_from_cache_file=False)
 ```

 In the example above, 🤗 Datasets will execute the function `add_prefix` over the entire dataset again instead of loading the dataset from its previous state.

-Disable caching on a global scale with [`disable_caching`]:
+Disable caching on a global scale with \[`disable_caching`]:

 ```py
 >>> from datasets import disable_caching
 >>> disable_caching()
 ```

 When you disable caching, 🤗 Datasets will no longer reload cached files when applying transforms to datasets.

 <Tip>

-If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in [`load_dataset`] instead.
+If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in \[`load_dataset`] instead.

 </Tip>

@@ -90,6 +116,6 @@ If you want to reuse a dataset from scratch, try setting the `download_mode` par
 Disabling the cache and copying the dataset in-memory will speed up dataset operations. There are two options for copying the dataset in-memory:

-1. Set `datasets.config.IN_MEMORY_MAX_SIZE` to a nonzero value (in bytes) that fits in your RAM memory.
+1. Set `datasets.config.IN_MEMORY_MAX_SIZE` to a nonzero value (in bytes) that fits in your RAM memory.
 2. Set the environment variable `HF_DATASETS_IN_MEMORY_MAX_SIZE` to a nonzero value. Note that the first method takes higher precedence.
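
The cache layout documented in the patch above can be checked from Python. Below is a minimal sketch, not part of the patch itself: the root path is hypothetical, it assumes a recent `datasets` and `huggingface_hub`, and the variable must be set before either library is imported, since both resolve their cache directories at import time.

```py
import os

# Hypothetical cache root; export it before the imports below, because
# `datasets` and `huggingface_hub` read these variables at import time.
os.environ["HF_HOME"] = "/path/to/cache_root"

import datasets
from huggingface_hub import constants

# With only HF_HOME set, both caches resolve under the relocated root:
print(datasets.config.HF_DATASETS_CACHE)  # /path/to/cache_root/datasets
print(constants.HF_HUB_CACHE)             # /path/to/cache_root/hub
```

Setting `HF_DATASETS_CACHE` alone moves only the first of these paths, which is exactly the surprise described in issue #7480.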
From 5b430a7a828cfe1a6d8a4d5822695f33f0cd0d9d Mon Sep 17 00:00:00 2001
From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date: Tue, 6 May 2025 17:50:56 +0200
Subject: [PATCH 3/3] Update cache.mdx

---
 docs/source/cache.mdx | 36 ++++++++++++++----------------------
 1 file changed, 14 insertions(+), 22 deletions(-)

diff --git a/docs/source/cache.mdx b/docs/source/cache.mdx
index c6703bb8a5f..a18a3d957e9 100644
--- a/docs/source/cache.mdx
+++ b/docs/source/cache.mdx
@@ -20,35 +20,27 @@ This guide focuses on the 🤗 Datasets cache and will show you how to:
 The default 🤗 Datasets cache directory is `~/.cache/huggingface/datasets`. Change the cache location by setting the shell environment variable, `HF_HOME` to another directory:

 ```
-
-\$ export HF\_HOME="/path/to/another/directory/datasets"
-
+$ export HF_HOME="/path/to/another/directory/datasets"
 ```

 Alternatively, you can set the `HF_DATASETS_CACHE` environment variable to control only the datasets-specific cache directory:

 ```
-
-export HF\_DATASETS\_CACHE="/path/to/datasets\_cache"
-
+$ export HF_DATASETS_CACHE="/path/to/datasets_cache"
 ```

 ⚠️ This only applies to files written by the `datasets` library (e.g., Arrow files and indices).
-It does **not** affect files downloaded from the Hugging Face Hub (such as models, tokenizers, or raw dataset sources), which are controlled separately via the `HF_HUB_CACHE` variable:
+It does **not** affect files downloaded from the Hugging Face Hub (such as models, tokenizers, or raw dataset sources), which are located in `~/.cache/huggingface/hub` by default and controlled separately via the `HF_HUB_CACHE` variable:

 ```
-
-export HF\_HUB\_CACHE="/path/to/hub\_cache"
-
+$ export HF_HUB_CACHE="/path/to/hub_cache"
 ```

-💡 If you'd like to relocate all Hugging Face caches—including datasets and hub downloads—use the `HF_HOME` variable instead:
+💡 If you'd like to relocate all Hugging Face caches — including datasets and hub downloads — use the `HF_HOME` variable instead:

 ```
-
-export HF\_HOME="/path/to/cache\_root"
-
-````
+$ export HF_HOME="/path/to/cache_root"
+```

 This results in:
 - datasets cache → `/path/to/cache_root/datasets`
 - hub cache → `/path/to/cache_root/hub`

@@ -62,22 +54,22 @@ When you load a dataset, you also have the option to change where the data is ca
 ```py
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('username/dataset', cache_dir="/path/to/another/directory/datasets")
-````
+```

 ## Download mode

-After you download a dataset, control how it is loaded by \[`load_dataset`] with the `download_mode` parameter. By default, 🤗 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below:
+After you download a dataset, control how it is loaded by [`load_dataset`] with the `download_mode` parameter. By default, 🤗 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below:

 ```py
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('rajpurkar/squad', download_mode='force_redownload')
 ```

-Refer to \[`DownloadMode`] for a full list of download modes.
+Refer to [`DownloadMode`] for a full list of download modes.

 ## Cache files

-Clean up the Arrow cache files in the directory with \[`Dataset.cleanup_cache_files`]:
+Clean up the Arrow cache files in the directory with [`Dataset.cleanup_cache_files`]:

 ```py
 # Returns the number of removed cache files
@@ -87,7 +79,7 @@ Clean up the Arrow cache files in the directory with \[`Dataset.cleanup_cache_fil
 ## Enable or disable caching

-If you're using a cached file locally, it will automatically reload the dataset with any previous transforms you applied to the dataset. Disable this behavior by setting the argument `load_from_cache_file=False` in \[`Dataset.map`]:
+If you're using a cached file locally, it will automatically reload the dataset with any previous transforms you applied to the dataset. Disable this behavior by setting the argument `load_from_cache_file=False` in [`Dataset.map`]:

 ```py
 >>> updated_dataset = small_dataset.map(add_prefix, load_from_cache_file=False)
 ```

 In the example above, 🤗 Datasets will execute the function `add_prefix` over the entire dataset again instead of loading the dataset from its previous state.

-Disable caching on a global scale with \[`disable_caching`]:
+Disable caching on a global scale with [`disable_caching`]:

 ```py
 >>> from datasets import disable_caching
 >>> disable_caching()
 ```

 When you disable caching, 🤗 Datasets will no longer reload cached files when applying transforms to datasets.

 <Tip>

-If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in \[`load_dataset`] instead.
+If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in [`load_dataset`] instead.
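
The options covered by this patch series can also be exercised end to end. The sketch below reuses the page's own `rajpurkar/squad` and `add_prefix` examples; the transform body is a hypothetical stand-in, and `force_redownload` really does re-fetch the data.

```py
from datasets import load_dataset, disable_caching

# Re-download the source files instead of reusing a previously cached copy
dataset = load_dataset("rajpurkar/squad", download_mode="force_redownload")

# Globally stop reloading (and writing) cached transform results
disable_caching()

def add_prefix(example):
    # Hypothetical transform, in the spirit of the page's example
    example["question"] = "Prefix: " + example["question"]
    return example

# Recompute the transform for this call even if a cached result exists
updated_dataset = dataset.map(add_prefix, load_from_cache_file=False)

# Remove stale Arrow cache files for a split; returns the number removed
num_removed = dataset["train"].cleanup_cache_files()
```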