Feature request: download all features but only load part of DGS Corpus at a time? #68


Open
cleong110 opened this issue Mar 26, 2024 · 29 comments

Comments

cleong110 commented Mar 26, 2024

When attempting to load the DGS Corpus's default configuration, on either my own workstation or in Colab, the process runs out of memory and crashes.

Here are some screenshots:
[screenshot]
[screenshot]

For example, https://colab.research.google.com/drive/1_vWFvWo0ZMg5_6AFU6Ln2LPHwm9TW_Rz?usp=sharing will crash given enough time. Is there a way to download all of the features (video, pose, and gloss) but load only some portion into memory at a time?

@cleong110

Possibly something like

dgs_corpus = tfds.load('dgs_corpus', split=["train:2%"])

would work
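(As an aside: if I'm reading the TFDS docs right, the sub-split API uses bracket slicing rather than a colon, so a 2% slice would look something like the line below. The split string here is my reading of the intended syntax, not something tested on this dataset.)

dgs_corpus = tfds.load('dgs_corpus', split='train[:2%]')  # first 2% of the train split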

@cleong110

If that works, I wonder if it would be good to:
a. somehow warn the user of the projected memory usage when they run the .load command?
b. change the default load so it doesn't load the entire dataset into memory?

@cleong110

you can do it, DGS Corpus! I believe in you!
[screenshot]

@cleong110

[screenshot]

@cleong110

[screenshot]

@cleong110

sigh

@cleong110

Giving it a try on my personal workstation

@cleong110

...nope, still "Killed"

@cleong110

All of these crash on my workstation, using up all 33 GB:

# dgs_corpus = tfds.load('dgs_corpus') # Killed
# dgs_corpus = tfds.load('dgs_corpus', split=["train:2%"]) # Killed
# dgs_corpus = tfds.load('dgs_corpus', split=["train:100"]) # Killed
# dgs_corpus = tfds.load('dgs_corpus', split=["train:10"]) # still Killed

@cleong110

Maybe one of these tricks can work?

https://www.tensorflow.org/guide/data_performance#reducing_memory_footprint
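(A sketch of the sort of trick that guide describes, not something I've verified on this dataset: shrink large decoded elements before anything buffers them, and keep the buffers small. The names below are hypothetical stand-ins, just to show the shape of the pipeline.)

import tensorflow as tf

# Hypothetical stand-ins:
video_paths = ["a.mp4", "b.mp4"]                               # assumed list of file paths
decode_video = lambda p: tf.zeros([100, 256, 256, 3])          # pretend "decode" producing large frames
to_small_features = lambda v: tf.reduce_mean(v, axis=[1, 2])   # shrink each element early

ds = (
    tf.data.Dataset.from_tensor_slices(video_paths)
    .map(decode_video, num_parallel_calls=1)  # heavy step: avoid parallel buffers of huge elements
    .map(to_small_features)                   # reduce element size before anything buffers it
    .prefetch(1)                              # hold at most one element ahead of the consumer
)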


@cleong110

Maybe something from here? tensorflow/tfjs#7801

@cleong110

Oh hey, this looks relevant, and I see a familiar name: huggingface/datasets#741. It's the Hugging Face datasets library, though.

@cleong110

Reading https://www.tensorflow.org/datasets/api_docs/python/tfds/load, maybe we can do as_dataset separately?
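(A rough sketch of what I mean, assuming that importing sign_language_datasets.datasets is what registers the dataset with tfds:)

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # assumed: importing this registers 'dgs_corpus' with tfds

builder = tfds.builder('dgs_corpus')
builder.download_and_prepare()           # step 1: download and write the prepared dataset to disk
ds = builder.as_dataset(split='train')   # step 2: open the prepared data as a tf.data.Dataset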


@cleong110

OK, so tf.data.Dataset does support streaming (https://stackoverflow.com/questions/63140320/how-to-use-sequence-generator-on-tf-data-dataset-object-to-fit-partial-data-into), so is it the split generation where the problem comes from?
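(For reference, the streaming pattern from that answer boils down to something like the sketch below; the generator and element spec are hypothetical, just to show that tf.data can pull examples lazily instead of materializing everything in memory.)

import tensorflow as tf

def example_generator():
    # Hypothetical: yield one example at a time instead of building a giant in-memory list.
    for i in range(10):
        yield tf.zeros([5], dtype=tf.float32)  # stand-in for a real example

ds = tf.data.Dataset.from_generator(
    example_generator,
    output_signature=tf.TensorSpec(shape=(5,), dtype=tf.float32),
)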

cleong110 commented Mar 26, 2024

Using the "manually add print statements to the site-packages in my conda env" method, I followed it all the way down. It gets killed in here:

https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/dgs_corpus/dgs_corpus.py#L330,

which makes it to here,
https://github.com/tensorflow/datasets/blob/v4.9.3/tensorflow_datasets/core/dataset_builder.py#L1584

which makes it to here,

https://github.com/tensorflow/datasets/blob/v4.9.3/tensorflow_datasets/core/split_builder.py#L415

and gets killed somewhere around there.

@cleong110

https://github.com/gruns/icecream might be helpful, note to self

@cleong110

Or, you know, I could look at one of these: https://stackify.com/top-5-python-memory-profilers/
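(Note to self on memray, since that's what I end up using below: it can wrap a block of code directly, something like this sketch. The output filename and the exact call being profiled are my choices here, and the report can then be rendered with `memray flamegraph memray_output_file.bin`.)

import memray
import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # assumed: registers 'dgs_corpus' with tfds

# Record allocations to a file while the block runs.
with memray.Tracker("memray_output_file.bin"):
    dgs_corpus = tfds.load('dgs_corpus')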


@cleong110

I've done a lot of searching, but as far as I can tell, tfds just doesn't have a way to stream part of a large dataset.

@cleong110

So I still can't figure out how to (1) download only a portion, or (2) assuming everything has been downloaded successfully, load only a portion into memory without the split generation using all available memory.

AmitMY commented Mar 28, 2024

Lots of comments... in the future, it would be helpful if you kept editing the same comment, or a small handful of them.

When you:

tfds.load('dgs_corpus', split=["train:2%"])

What happens is that the entire dataset is prepared first, and only then is 2% of it loaded, so you will need exactly the same amount of disk space.

Now, since there are two steps here:

  1. preparing the entire dataset
  2. loading a part of the dataset

can you tell where the memory consumption is too high? My suspicion is step 1, but I don't know.
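(One possible way to check, as a sketch: run the two steps separately and watch the process memory after each. psutil and the split into builder calls here are my assumptions; any process monitor would do.)

import psutil
import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # assumed: registers 'dgs_corpus' with tfds

def rss_gib():
    # Resident memory of the current process, in GiB.
    return psutil.Process().memory_info().rss / 2**30

builder = tfds.builder('dgs_corpus')

builder.download_and_prepare()                             # 1. preparing the entire dataset
print(f"after download_and_prepare: {rss_gib():.1f} GiB")

ds = builder.as_dataset(split='train[:2%]')                # 2. loading a part of the dataset
print(f"after as_dataset: {rss_gib():.1f} GiB")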

@cleong110

Right, sorry, I forget that I'm not the only one getting spammed by all these, apologies.

cleong110 commented Mar 28, 2024

I'm also suspecting step 1, based on the fact that I can sprinkle print statements all the way down to https://github.com/tensorflow/datasets/blob/v4.9.3/tensorflow_datasets/core/split_builder.py#L415.

Edit: my big issue is that testing this currently requires running it until it crashes, which, on Google Colab, means that any modifications I've made to the code are then gone. I've got a workstation I can test on locally, but access to it is less convenient.

Edit again:
The download_and_prepare step is certainly using a lot of RAM, though this particular Colab instance has not crashed yet:
[screenshot]

Edit 3:
...and it crashed after using all available RAM, so that step does seem to use a lot of memory... but the file system was not actually deleted. OK, I can work with that, perhaps.

Edit 4: OK, trying it in a "high-memory" notebook in Colab Pro, I get this:
[screenshot]

Edit 5: full stacktrace on the high-memory notebook:

ValueError                                Traceback (most recent call last)

<ipython-input-5-6bad8ee20d7b> in <cell line: 1>()
----> 1 dgs_corpus = tfds.load('dgs_corpus', )

10 frames

/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/logging/__init__.py in __call__(self, function, instance, args, kwargs)
    166     metadata = self._start_call()
    167     try:
--> 168       return function(*args, **kwargs)
    169     except Exception:
    170       metadata.mark_error()

/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/load.py in load(name, split, data_dir, batch_size, shuffle_files, download, as_supervised, decoders, read_config, with_info, builder_kwargs, download_and_prepare_kwargs, as_dataset_kwargs, try_gcs)
    647       try_gcs,
    648   )
--> 649   _download_and_prepare_builder(dbuilder, download, download_and_prepare_kwargs)
    650 
    651   if as_dataset_kwargs is None:

/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/load.py in _download_and_prepare_builder(dbuilder, download, download_and_prepare_kwargs)
    506   if download:
    507     download_and_prepare_kwargs = download_and_prepare_kwargs or {}
--> 508     dbuilder.download_and_prepare(**download_and_prepare_kwargs)
    509 
    510 

/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/logging/__init__.py in __call__(self, function, instance, args, kwargs)
    166     metadata = self._start_call()
    167     try:
--> 168       return function(*args, **kwargs)
    169     except Exception:
    170       metadata.mark_error()

/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/dataset_builder.py in download_and_prepare(self, download_dir, download_config, file_format)
    697           self.info.read_from_directory(self.data_dir)
    698         else:
--> 699           self._download_and_prepare(
    700               dl_manager=dl_manager,
    701               download_config=download_config,

/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/dataset_builder.py in _download_and_prepare(self, dl_manager, download_config)
   1666       return
   1667 
-> 1668     split_infos = self._generate_splits(dl_manager, download_config)
   1669 
   1670     # Update the info object with the splits.

/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/dataset_builder.py in _generate_splits(self, dl_manager, download_config)
   1641       ):
   1642         filename_template = self._get_filename_template(split_name=split_name)
-> 1643         future = split_builder.submit_split_generation(
   1644             split_name=split_name,
   1645             generator=generator,

/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/split_builder.py in submit_split_generation(self, split_name, generator, filename_template, disable_shuffling)
    329     # `_build_from_xyz` method.
    330     if isinstance(generator, collections.abc.Iterable):
--> 331       return self._build_from_generator(**build_kwargs)
    332     else:  # Otherwise, beam required
    333       unknown_generator_type = TypeError(

/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/split_builder.py in _build_from_generator(self, split_name, generator, filename_template, disable_shuffling)
    400       except Exception as e:  # pylint: disable=broad-except
    401         utils.reraise(e, prefix=f'Failed to encode example:\n{example}\n')
--> 402       writer.write(key, example)
    403     shard_lengths, total_size = writer.finalize()
    404 

/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/writer.py in write(self, key, example)
    225       example: the Example to write to the shard.
    226     """
--> 227     serialized_example = self._serializer.serialize_example(example=example)
    228     self._shuffler.add(key, serialized_example)
    229     self._num_examples += 1

/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/example_serializer.py in serialize_example(self, example)
     96       serialize_proto: `str`, the serialized `tf.train.Example` proto
     97     """
---> 98     return self.get_tf_example(example).SerializeToString()
     99 
    100

Edit: and here's how many resources were used:
[screenshot]

Update: OK, it seems that encoding just these three files is enough to use many gigabytes:
[screenshot]

cleong110 commented Mar 29, 2024

Well I am thoroughly stumped. I've narrowed it down to where in tfds the massive memory allocations are happening, but I still don't know why.

I just don't understand why it needs nearly 30 GiB to "encode" and then "serialize" the videos.

Here's the memray report.
memray_output_file.tar.gz

For some reason, reading in the frames results in over 50k allocations, many GiB worth.
This line right here, officer: https://github.com/google/etils/blob/main/etils/epath/abstract_path.py#L149
Called by this one: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/features/video_feature.py#L152

The serialize_example calls use a huge amount of memory as well.

[screenshot]
[screenshot]

I don't know what to do or how to fix it. I know the UCF101 dataset doesn't have this issue. If anyone has thoughts, let me know.

This is all before we even get to the protobuf max-size error.

Update: well, I inspected the tmp folder that gets created, and there are indeed nearly 13k frames extracted from just one of the videos:

[screenshot]

which ends up being nearly 5GB legitimately:
[screenshot]
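(A rough sanity check on why this blows up in memory once the frames are decoded, with an assumed frame size since I haven't checked the actual resolution:)

frames = 13_000                         # roughly what was extracted from one video
height, width, channels = 360, 640, 3   # assumed resolution, not checked
raw_bytes = frames * height * width * channels
print(raw_bytes / 2**30)                # ≈ 8.4 GiB of raw RGB pixels for a single video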

Note: perhaps something in here may be relevant?

Like maybe there's a setting in there to not load every file but just a list of paths?

AmitMY commented Mar 30, 2024

OK, so to me it seems like you are not using an appropriate config.

Like maybe there's a setting in there to not load every file but just a list of paths?

The "correct" config here would be:

config = DgsCorpusConfig(name="only-annotations", version="1.0.0", include_video=False, include_pose=None)
dgs_corpus = tfds.load('dgs_corpus', builder_kwargs=dict(config=config))

Which loads only the annotations.

You want to download the videos but load them as paths? Use include_video=True, process_video=False.

You want to load poses? Use include_pose="holistic" or include_pose="openpose".
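For example, combining those options (a sketch; the import path and the config name are assumptions, not tested here):

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # assumed: registers 'dgs_corpus' with tfds
from sign_language_datasets.datasets.dgs_corpus import DgsCorpusConfig  # assumed import path

config = DgsCorpusConfig(
    name="videos-as-paths-with-holistic",  # hypothetical config name
    version="1.0.0",
    include_video=True,      # download the videos...
    process_video=False,     # ...but expose them as paths instead of decoded frames
    include_pose="holistic",
)
dgs_corpus = tfds.load('dgs_corpus', builder_kwargs=dict(config=config))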
