
fix: loading of datasets from Disk (#7373) #7489

Open

sam-hey wants to merge 5 commits into main from fix/concatenate_datasets

Conversation

@sam-hey commented Mar 29, 2025

Fixes dataset loading from disk by ensuring that memory maps and streams are properly closed.

For more details, see #7373.
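
For context, here is a minimal sketch (not the exact datasets internals; the function name is illustrative) of the kind of pattern this PR aims for: read the Arrow IPC stream through a memory map, then close the map so the underlying file handle is released.

import pyarrow as pa

def load_table_from_arrow_file(filename: str) -> pa.Table:
    # Open the file as a memory map and read the IPC stream into a Table.
    # Closing the map afterwards releases the file handle; the Table's
    # buffers stay valid because they reference the mapped region rather
    # than in-memory copies (see the discussion below).
    with pa.memory_map(filename) as memory_mapped_stream:
        opened_stream = pa.ipc.open_stream(memory_mapped_stream)
        pa_table = opened_stream.read_all()
    return pa_table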

@sam-hey (Author) commented Mar 31, 2025

@nepfaff Could you confirm whether this fixes the issue for you? I checked with Memray, and everything looked good on my end.

Install: pip install git+https://github.com/sam-hey/datasets.git@fix/concatenate_datasets

@nepfaff commented Apr 2, 2025

Will aim to get to this soon. I don't have a rapid testing pipeline set up, so I need to wait for some AWS nodes to become free.

@nepfaff commented Apr 3, 2025

I now set up a small experiment:

import logging
import os

import psutil
from datasets import Dataset, concatenate_datasets

# dataset_path and N are set elsewhere in the experiment.

# Log initial RAM usage
process = psutil.Process(os.getpid())
initial_ram = process.memory_info().rss / (1024 * 1024)  # Convert to MB
logging.info(f"Initial RAM usage: {initial_ram:.2f} MB")

chunk_datasets = [
    Dataset.load_from_disk(dataset_path, keep_in_memory=False) for _ in range(N)
]
combined_dataset = concatenate_datasets(chunk_datasets)

# Log final RAM usage
final_ram = process.memory_info().rss / (1024 * 1024)  # Convert to MB
ram_diff = final_ram - initial_ram
logging.info(f"Final RAM usage: {final_ram:.2f} MB")
logging.info(f"RAM usage increase: {ram_diff:.2f} MB")

The RAM usage is linearly correlated with N on datasets master!

For my test dataset:

  • N=5 => RAM usage increase: 26302.91 MB
  • N=10 => RAM usage increase: 52315.18 MB
  • N=20 => RAM usage increase: 104510.65 MB
  • N=40 => RAM usage increase: 209166.30 MB

Unfortunately, your patch doesn't seem to change this:

pip install git+https://github.com/sam-hey/datasets.git@fix/concatenate_datasets
pip list | grep datasets
datasets                 3.5.1.dev0

Gives exactly the same RAM statistics.

Edit: The results are a bit flawed, as the memory increase seems to come entirely from Dataset.load_from_disk(dataset_path, keep_in_memory=False) here (which I don't think should happen either?) and not from concatenate_datasets. This differs from my large-scale setup, which runs out of memory during concatenate_datasets, but I haven't been able to replicate that here...
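
For what it's worth, a small sketch (reusing psutil as above; dataset_path and N are placeholders, as in the snippet above) that logs the RSS increase of each step separately:

import os

import psutil
from datasets import Dataset, concatenate_datasets

def rss_mb() -> float:
    # Resident set size of the current process, in MB.
    return psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)

before = rss_mb()
chunk_datasets = [
    Dataset.load_from_disk(dataset_path, keep_in_memory=False) for _ in range(N)
]
after_load = rss_mb()

combined_dataset = concatenate_datasets(chunk_datasets)
after_concat = rss_mb()

print(f"load_from_disk increase:       {after_load - before:.2f} MB")
print(f"concatenate_datasets increase: {after_concat - after_load:.2f} MB")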

sam-hey marked this pull request as draft April 3, 2025 10:18
sam-hey marked this pull request as ready for review April 3, 2025 13:43
@sam-hey (Author) commented Apr 3, 2025

Thanks a lot, @nepfaff, for taking a look at this! It seems that concatenate_datasets() is fixed with this PR. I can also confirm that loading a large number of files requires significant memory. However, as I understand it, that is either expected or an upstream bug, since the memory consumption stems from pa.memory_map(), which returns a memory-mapped file.

This behavior might be related to this bug: apache/arrow#34423
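
A quick way to check whether the reported RAM is actually Arrow-allocated memory or just memory-mapped pages that the OS counts towards RSS (a sketch; dataset_path is a placeholder as above):

import os

import psutil
import pyarrow as pa
from datasets import Dataset

ds = Dataset.load_from_disk(dataset_path, keep_in_memory=False)

rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
arrow_mb = pa.total_allocated_bytes() / (1024 * 1024)

# If the data is only memory-mapped, Arrow's allocator stays close to zero
# even when the OS reports the touched pages as part of RSS.
print(f"RSS: {rss_mb:.2f} MB, Arrow-allocated: {arrow_mb:.2f} MB")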

[Screenshot attached: 2025-04-03 16:01]

Comment on lines +92 to 93
memory_mapped_stream.close()
return pa_table
Member

Is it really OK to close the memory map, given that the memory-mapped table is still in use?

Author

Yes, pa_table includes all the information and is a copy; see the read_all() example: https://arrow.apache.org/docs/python/ipc.html

Member

I think read_all() doesn't load the data into memory here; it loads buffers that are memory-mapped from disk.

@lasuomela commented Apr 23, 2025

I can confirm that this works:

import pyarrow as pa

BATCH_SIZE = 10000
NUM_BATCHES = 1000

schema = pa.schema([pa.field('nums', pa.int32())])

# Write in stream format
with pa.OSFile('bigfile.arrow', 'wb') as sink:
    with pa.ipc.new_stream(sink, schema) as writer:
        for _ in range(NUM_BATCHES):
            batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
            writer.write(batch)

# Read the stream back
with pa.memory_map('bigfile.arrow', 'rb') as source:
    with pa.ipc.open_stream(source) as reader:
        table = reader.read_all()

print("LEN:", table.num_rows)
print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))

# Read the first batch
print("")
print("First batch:")
print(table[0][0:BATCH_SIZE])

Out:

LEN: 10000000
RSS: 0MB

First batch:
[
  [
    0,
    1,
    2,
    3,
    4,
    ...
    9995,
    9996,
    9997,
    9998,
    9999
  ]
]

Author

@lasuomela Thanks a lot for checking! I’m currently on vacation and without a laptop to verify it myself.
@lhoestq Would this be sufficient proof for you?

@lhoestq (Member) commented Apr 24, 2025

Great! Have you tested that it also fixes the memory issue in your case, @iamollas?

Happy to know that it works for you, @sam-hey! Looking forward to merging this.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
