Added cache dirs to load and file_utils #7499

Open
wants to merge 1 commit into
base: main
from

Conversation

gmongaras

When adding "cache_dir" to datasets.load_dataset, the cache_dir gets lost in the function calls, changing the cache dir to the default path. This fixes a few of these instances.

@lhoestq
Member

lhoestq commented Apr 15, 2025

Hi! The hf_hub_download cache_dir is a different cache directory from the one used by datasets.

hf_hub_download uses the huggingface_hub cache, which is located by default in ~/.cache/huggingface/hub, while datasets uses a separate cache for Arrow files and map() results, located by default in ~/.cache/huggingface/datasets.
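
To see the two default locations side by side, here is a minimal sketch. It only expands the two paths quoted above and reports whether each directory exists on the local machine. It does not query either library, and the `default_caches` name is purely illustrative.

```python
from pathlib import Path

# Default parent directory, taken from the paths quoted in the comment above.
base = Path("~/.cache/huggingface").expanduser()

# Illustrative mapping (not a library API): cache name -> default location.
default_caches = {
    "hub": base / "hub",            # huggingface_hub cache (hf_hub_download)
    "datasets": base / "datasets",  # datasets cache (Arrow files, map() results)
}

for name, path in default_caches.items():
    status = "exists" if path.is_dir() else "not created yet"
    print(f"{name}: {path} ({status})")
```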

@gmongaras
Author

Is there a way to change the default cache directory for both of these when calling load_dataset? Currently, cache_dir makes it a bit confusing to control where files go, as the documentation doesn't mention that it only relocates .../datasets and not .../hub.

@lhoestq
Member

lhoestq commented Apr 15, 2025

You can set HF_HOME, which is the common parent directory for those two caches, or set HF_DATASETS_CACHE and HF_HUB_CACHE individually.
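
A minimal sketch of that suggestion, assuming the variables are set before `datasets` or `huggingface_hub` is imported (both libraries are expected to read them at import time). The helper name `configure_hf_caches` and the demo path are hypothetical, and the `load_dataset` call is left commented out with a placeholder dataset name.

```python
import os
from pathlib import Path


def configure_hf_caches(root):
    """Point both Hugging Face caches under `root` via environment variables.

    Hypothetical helper for illustration; not part of datasets or huggingface_hub.
    """
    root = Path(root).expanduser()
    # Common parent directory for both caches.
    os.environ["HF_HOME"] = str(root)
    # Optional explicit overrides; these could point at unrelated directories.
    os.environ["HF_HUB_CACHE"] = str(root / "hub")
    os.environ["HF_DATASETS_CACHE"] = str(root / "datasets")
    return root


# Example location only; replace with the directory you actually want.
cache_root = configure_hf_caches("/tmp/hf_cache_demo")

# Import the library only after the variables are set:
# from datasets import load_dataset
# ds = load_dataset("some/dataset")  # placeholder dataset name
```

Setting HF_HOME alone should be enough when both caches can share a parent; the two specific variables are useful when the hub files and the Arrow files need to live on different disks.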

@gmongaras
Author

Got it. Can this be added to the documentation for load_dataset and related functions to avoid confusion with cache_dir?
