From dd18988e7f44393f6a5c42b9172c1b967d8c4a6f Mon Sep 17 00:00:00 2001
From: Harry Yang
Date: Mon, 21 Apr 2025 20:19:09 -0400
Subject: [PATCH 1/3] fix 7457

---
 docs/source/cache.mdx | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/docs/source/cache.mdx b/docs/source/cache.mdx
index bf344a09bb7..f0ea833d9de 100644
--- a/docs/source/cache.mdx
+++ b/docs/source/cache.mdx
@@ -21,6 +21,14 @@ The default 🤗 Datasets cache directory is `~/.cache/huggingface/datasets`. Ch
 ```
 $ export HF_HOME="/path/to/another/directory/datasets"
+
+```
+## HF_DATASETS_CACHE
+
+In addition to using `HF_HOME`, you can override the default 🤗 Datasets cache directory by setting the `HF_DATASETS_CACHE` environment variable. This variable allows you to specify a custom cache location for datasets converted into Arrow format. For instance:
+
+```
+$ export HF_DATASETS_CACHE="/path/to/your/custom/cache"
 ```

 When you load a dataset, you also have the option to change where the data is cached. Change the `cache_dir` parameter to the path you want:

From e705cd39c9ca22ae5cfe41c7b8dcfe80fe413e5a Mon Sep 17 00:00:00 2001
From: Harry Yang
Date: Tue, 6 May 2025 10:44:37 -0400
Subject: [PATCH 2/3] fix 7457 and 7480

---
 docs/source/cache.mdx | 50 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/docs/source/cache.mdx b/docs/source/cache.mdx
index f0ea833d9de..c6703bb8a5f 100644
--- a/docs/source/cache.mdx
+++ b/docs/source/cache.mdx
@@ -20,38 +20,64 @@ This guide focuses on the 🤗 Datasets cache and will show you how to:
 The default 🤗 Datasets cache directory is `~/.cache/huggingface/datasets`. Change the cache location by setting the shell environment variable, `HF_HOME` to another directory:

 ```
-$ export HF_HOME="/path/to/another/directory/datasets"
+
+\$ export HF\_HOME="/path/to/another/directory/datasets"

 ```
-## HF_DATASETS_CACHE

-In addition to using `HF_HOME`, you can override the default 🤗 Datasets cache directory by setting the `HF_DATASETS_CACHE` environment variable. This variable allows you to specify a custom cache location for datasets converted into Arrow format. For instance:
+Alternatively, you can set the `HF_DATASETS_CACHE` environment variable to control only the datasets-specific cache directory:

 ```
-$ export HF_DATASETS_CACHE="/path/to/your/custom/cache"
+
+export HF\_DATASETS\_CACHE="/path/to/datasets\_cache"
+
 ```
+
+⚠️ This only applies to files written by the `datasets` library (e.g., Arrow files and indices).
+It does **not** affect files downloaded from the Hugging Face Hub (such as models, tokenizers, or raw dataset sources), which are controlled separately via the `HF_HUB_CACHE` variable:
+
+```
+
+export HF\_HUB\_CACHE="/path/to/hub\_cache"
+
+```
+
+💡 If you'd like to relocate all Hugging Face caches—including datasets and hub downloads—use the `HF_HOME` variable instead:
+
+```
+
+export HF\_HOME="/path/to/cache\_root"
+
+````
+
+This results in:
+- datasets cache → `/path/to/cache_root/datasets`
+- hub cache → `/path/to/cache_root/hub`
+
+These distinctions are especially useful when working in shared environments or networked file systems (e.g., NFS).
+See [issue #7480](https://github.com/huggingface/datasets/issues/7480) for discussion on how users encountered unexpected cache locations when `HF_HUB_CACHE` was not set alongside `HF_DATASETS_CACHE`.

 When you load a dataset, you also have the option to change where the data is cached. Change the `cache_dir` parameter to the path you want:

 ```py
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('username/dataset', cache_dir="/path/to/another/directory/datasets")
-```
+````

 ## Download mode

-After you download a dataset, control how it is loaded by [`load_dataset`] with the `download_mode` parameter. By default, 🤗 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below:
+After you download a dataset, control how it is loaded by \[`load_dataset`] with the `download_mode` parameter. By default, 🤗 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below:

 ```py
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('rajpurkar/squad', download_mode='force_redownload')
 ```

-Refer to [`DownloadMode`] for a full list of download modes.
+Refer to \[`DownloadMode`] for a full list of download modes.

 ## Cache files

-Clean up the Arrow cache files in the directory with [`Dataset.cleanup_cache_files`]:
+Clean up the Arrow cache files in the directory with \[`Dataset.cleanup_cache_files`]:

 ```py
 # Returns the number of removed cache files
@@ -61,7 +87,7 @@ Clean up the Arrow cache files in the directory with \[`Dataset.cleanup_cache_fil
 ## Enable or disable caching

-If you're using a cached file locally, it will automatically reload the dataset with any previous transforms you applied to the dataset. Disable this behavior by setting the argument `load_from_cache_file=False` in [`Dataset.map`]:
+If you're using a cached file locally, it will automatically reload the dataset with any previous transforms you applied to the dataset. Disable this behavior by setting the argument `load_from_cache_file=False` in \[`Dataset.map`]:

 ```py
 >>> updated_dataset = small_dataset.map(add_prefix, load_from_cache_file=False)
 ```

 In the example above, 🤗 Datasets will execute the function `add_prefix` over the entire dataset again instead of loading the dataset from its previous state.

-Disable caching on a global scale with [`disable_caching`]:
+Disable caching on a global scale with \[`disable_caching`]:

 ```py
 >>> from datasets import disable_caching
 >>> disable_caching()
 ```

 When you disable caching, 🤗 Datasets will no longer reload cached files when applying transforms to datasets.

 <Tip>

-If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in [`load_dataset`] instead.
+If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in \[`load_dataset`] instead.

 </Tip>

@@ -90,6 +116,6 @@ If you want to reuse a dataset from scratch, try setting the `download_mode` par
 Disabling the cache and copying the dataset in-memory will speed up dataset operations. There are two options for copying the dataset in-memory:

-1. Set `datasets.config.IN_MEMORY_MAX_SIZE` to a nonzero value (in bytes) that fits in your RAM memory.
+1. Set `datasets.config.IN_MEMORY_MAX_SIZE` to a nonzero value (in bytes) that fits in your RAM memory.
 2. Set the environment variable `HF_DATASETS_IN_MEMORY_MAX_SIZE` to a nonzero value. Note that the first method takes higher precedence.
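
The cache layout documented in the patch above can be checked from Python. Below is a minimal sketch, not part of the patch itself: the root path is hypothetical, it assumes a recent `datasets` and `huggingface_hub`, and the variable must be set before either library is imported, since both resolve their cache directories at import time.

```py
import os

# Hypothetical cache root; export it before the imports below, because
# `datasets` and `huggingface_hub` read these variables at import time.
os.environ["HF_HOME"] = "/path/to/cache_root"

import datasets
from huggingface_hub import constants

# With only HF_HOME set, both caches resolve under the relocated root:
print(datasets.config.HF_DATASETS_CACHE)  # /path/to/cache_root/datasets
print(constants.HF_HUB_CACHE)             # /path/to/cache_root/hub
```

Setting `HF_DATASETS_CACHE` alone moves only the first of these paths, which is exactly the surprise described in issue #7480.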
From 5b430a7a828cfe1a6d8a4d5822695f33f0cd0d9d Mon Sep 17 00:00:00 2001
From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date: Tue, 6 May 2025 17:50:56 +0200
Subject: [PATCH 3/3] Update cache.mdx

---
 docs/source/cache.mdx | 36 ++++++++++++++----------------------
 1 file changed, 14 insertions(+), 22 deletions(-)

diff --git a/docs/source/cache.mdx b/docs/source/cache.mdx
index c6703bb8a5f..a18a3d957e9 100644
--- a/docs/source/cache.mdx
+++ b/docs/source/cache.mdx
@@ -20,35 +20,27 @@ This guide focuses on the 🤗 Datasets cache and will show you how to:
 The default 🤗 Datasets cache directory is `~/.cache/huggingface/datasets`. Change the cache location by setting the shell environment variable, `HF_HOME` to another directory:

 ```
-
-\$ export HF\_HOME="/path/to/another/directory/datasets"
-
+$ export HF_HOME="/path/to/another/directory/datasets"
 ```

 Alternatively, you can set the `HF_DATASETS_CACHE` environment variable to control only the datasets-specific cache directory:

 ```
-
-export HF\_DATASETS\_CACHE="/path/to/datasets\_cache"
-
+$ export HF_DATASETS_CACHE="/path/to/datasets_cache"
 ```

 ⚠️ This only applies to files written by the `datasets` library (e.g., Arrow files and indices).
-It does **not** affect files downloaded from the Hugging Face Hub (such as models, tokenizers, or raw dataset sources), which are controlled separately via the `HF_HUB_CACHE` variable:
+It does **not** affect files downloaded from the Hugging Face Hub (such as models, tokenizers, or raw dataset sources), which are located in `~/.cache/huggingface/hub` by default and controlled separately via the `HF_HUB_CACHE` variable:

 ```
-
-export HF\_HUB\_CACHE="/path/to/hub\_cache"
-
+$ export HF_HUB_CACHE="/path/to/hub_cache"
 ```

-💡 If you'd like to relocate all Hugging Face caches—including datasets and hub downloads—use the `HF_HOME` variable instead:
+💡 If you'd like to relocate all Hugging Face caches — including datasets and hub downloads — use the `HF_HOME` variable instead:

 ```
-
-export HF\_HOME="/path/to/cache\_root"
-
-````
+$ export HF_HOME="/path/to/cache_root"
+```

 This results in:
 - datasets cache → `/path/to/cache_root/datasets`
 - hub cache → `/path/to/cache_root/hub`

@@ -62,22 +54,22 @@ When you load a dataset, you also have the option to change where the data is ca
 ```py
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('username/dataset', cache_dir="/path/to/another/directory/datasets")
-````
+```

 ## Download mode

-After you download a dataset, control how it is loaded by \[`load_dataset`] with the `download_mode` parameter. By default, 🤗 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below:
+After you download a dataset, control how it is loaded by [`load_dataset`] with the `download_mode` parameter. By default, 🤗 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below:

 ```py
 >>> from datasets import load_dataset
 >>> dataset = load_dataset('rajpurkar/squad', download_mode='force_redownload')
 ```

-Refer to \[`DownloadMode`] for a full list of download modes.
+Refer to [`DownloadMode`] for a full list of download modes.

 ## Cache files

-Clean up the Arrow cache files in the directory with \[`Dataset.cleanup_cache_files`]:
+Clean up the Arrow cache files in the directory with [`Dataset.cleanup_cache_files`]:

 ```py
 # Returns the number of removed cache files
@@ -87,7 +79,7 @@ Clean up the Arrow cache files in the directory with \[`Dataset.cleanup_cache_fil
 ## Enable or disable caching

-If you're using a cached file locally, it will automatically reload the dataset with any previous transforms you applied to the dataset. Disable this behavior by setting the argument `load_from_cache_file=False` in \[`Dataset.map`]:
+If you're using a cached file locally, it will automatically reload the dataset with any previous transforms you applied to the dataset. Disable this behavior by setting the argument `load_from_cache_file=False` in [`Dataset.map`]:

 ```py
 >>> updated_dataset = small_dataset.map(add_prefix, load_from_cache_file=False)
 ```

 In the example above, 🤗 Datasets will execute the function `add_prefix` over the entire dataset again instead of loading the dataset from its previous state.

-Disable caching on a global scale with \[`disable_caching`]:
+Disable caching on a global scale with [`disable_caching`]:

 ```py
 >>> from datasets import disable_caching
 >>> disable_caching()
 ```

 When you disable caching, 🤗 Datasets will no longer reload cached files when applying transforms to datasets.

 <Tip>

-If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in \[`load_dataset`] instead.
+If you want to reuse a dataset from scratch, try setting the `download_mode` parameter in [`load_dataset`] instead.
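
The options covered by this patch series can also be exercised end to end. The sketch below reuses the page's own `rajpurkar/squad` and `add_prefix` examples; the transform body is a hypothetical stand-in, and `force_redownload` really does re-fetch the data.

```py
from datasets import load_dataset, disable_caching

# Re-download the source files instead of reusing a previously cached copy
dataset = load_dataset("rajpurkar/squad", download_mode="force_redownload")

# Globally stop reloading (and writing) cached transform results
disable_caching()

def add_prefix(example):
    # Hypothetical transform, in the spirit of the page's example
    example["question"] = "Prefix: " + example["question"]
    return example

# Recompute the transform for this call even if a cached result exists
updated_dataset = dataset.map(add_prefix, load_from_cache_file=False)

# Remove stale Arrow cache files for a split; returns the number removed
num_removed = dataset["train"].cleanup_cache_files()
```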