* initial debugging and testing works
* pwais changes with RayBatchStream to alleviate training
* few bugs to iron out with multiprocessing, specifically pickled collate_fn
* working version of RayBatchStream
* additional docstrings
* cleanup
* much more documentation
* successfully trained AEA-script2_seq2 closed_loop without OOM
* porting over aria dataset-size feature
* added logic to handle eviction of a worker's cached_collated_batch
* antonio's implementation of stream batches
* training on a dataset with 4000 images works!
* some configuration speedups, loops aren't actually needed!
* quick fix adjustment to aria
* removed unnecessary looping
* much faster training when adding i variable to collate every 5 ray bundles
* cleanup unnecessary variables in Dataloader
* further cleanup
* adding caching of compressed images to RAM to reduce disk bottleneck
* added caching to RAM for masks
* found fast way to collate - many tricks applied
* quick update to aria to test on different datasets
* cleaned up the accelerated pil_to_numpy function
* cleaning up PR
* this commit was used to generate the time metrics and profiling metrics
* REAL commit used to run tests
* testing with nerfacto-big
* generated RayBundle collate and converting images from uint8s to float32 on GPU tests
* updating nerfacto to support uint8 easily, will need to figure out a way to contain this within the datamanager API
* datamanager updates, both splat and nerf
* must use writeable arrays because torch requires them
* cleaned up base_dataset, added pickle to utils, more code in full_image, and cleaner desc for base_datamanager
* lots of progress on a parallel FullImageDatamanager
* can train big splats with pre-assertion hack or ROI hack and 0 workers
* fixed all undistortion issues with ParallelImageDatamanager
* adding some downsampling and parallel tests with splatfacto!
* deleted commented code in dataloaders.py and added bugfix to shuffling
* testing splatfacto-big
* cleaned up base_pipeline.py
* cleaned up base_pipeline.py ACTUALLY THIS TIME, forgot to save last time
* cleaned up a lot of code
* process_project_aria back to main branch and some cleanup in full_image_datamanager
* clarifying docstrings
* further PR cleanup
* updating models
* further cleanup
* removed caching of images into bytestrings
* adding caching of compressed images to RAM, forgot that hardware matters
* removing oom methods, adding the ability to add a flag to dataloading
* removed CacheDataloader, moved RayBatchStream to dataloaders.py, new vanilla_datamanager rewritten
* fixing base_pipelines, deleting a weird datamanager_configs file that was accidentally created
* cleaning up next_train
* replaced parallel datamanager with new datamanager
* reverted the original base_datamanager.py, new datamanager replaced parallel_datamanager.py
* modified VanillaConfig, but VanillaDataManager is the same as before
* cleaning up, 2 datamanagers now - original and new parallel one
* able to train with new nerfstudio dataloader now
* side by side datamanagers, moved tons of logic into dataloaders.py and created new files for our parallel datamanagers
* added custom ray processing API to support implementations like LERF, cleaned up FullImageDatamanager to original because of new ParallelImageDatamanager
* adding functionality for ns-eval by adding FixedIndicesEvalDataloader to the setup_eval
* adding both ray API and image-view API to datamanagers for custom parallelization
* updating splatfacto config for 4k tests
* updating docstrings to be more descriptive
* new datamanager API breaks when setup_eval() has multiple workers, not sure why but single worker will have to do
* adding custom_view_processor to ImageBatchStream
* reverting full_images_datamanager to main branch
* removing nn.Module inheritance from Datamanager class
* don't need to move datamanager to device anymore since Datamanager is not a subclass of nn.Module
* finished integration test with nerfacto
* simplified config variables, integrated the parallelism/disk-data-loading all into one datamanager
* updated the splatfacto config to be simpler with the dataloading and now uses FullImageDatamanager (which has been changed)
* style checks and some cleanup
* new splatfacto test, cleaning up nerfacto integration test
* removing redundant parallel_full_images_datamanager, as the OG full_image_datamanager now has full parallelized support
* ruff linting and pyright fixing
* further pyright fixing
* another pyright fixing
* fixing pyright error, camera optimization no longer part of datamanager
* fixing one pyright
* fixing dataloading error when camera is not undistorted with dataloader
* fixing comments and updating style
* undoing a style change i made
* undoing another style change i made by accident
* fixing slow runtime
* fixing a more general camera undistortion bug
* move images to device properly
* minor improvements
* add print statement about >500 images, cleanup method configs
* make method configs consistent across nerfacto models
* adding description comments
* updating description
* resolving some pyright issues with export.py, explained in PR desc
* fixing pyright issues in base_pipeline.py
* ran pyright on exporter and base_pipeline.py without issues
* adding a git ignore to a clearly checked pyright issue
* typo
* fixing most ns-dev-test cases
* cleanup, passing final ns-dev-test
* oops, accidentally pushed the deletion of a docstring, undoing that
* another cleanup
* some fixes to eval pipeline
* lint
* add asserts for spawn
* lint
* cleaning up import statements in parallel_datamanager.py
* adding new developer documentation if users would like to migrate their custom datamanagers to support new features
* removing unnecessary to_device no-op
* further updates to documentation
* lint
* more docs
* docs
* remove comment
* add docs, fix depth dataset with parallel datamanager, fix mask sampling bug
* remove profiling
* more profile removal
* custom_view_processor->custom_image_processor
* doc clarification
* datamanager doc nit
* whitespace
* nits
* remove stuff from __post_init__, tune num workers more, add random offset in raybatchstream
* removing unnecessary assertion, updating docstring because DataManager is no longer an nn.Module
* clarifying configuration with num_images_to_sample_from and num_times_to_repeat_images, cleaning up functions
* adding logic so that nerfacto users can load_from_disk and customize image batch sizes and repeat parameters
* ruff formatting! whoops forgot to format
* fixing logic, now if users set load_from_disk to true, datamanager will use 50 and 10. If users set it and specify their own values, we support that as well
* adding separate datamanager config so that target can be removed in method_configs
---------
Co-authored-by: Justin Kerr <[email protected]>
Co-authored-by: Brent Yi <[email protected]>
We currently don't have other implementations because most papers follow the VanillaDataManager implementation. However, it should be straightforward to add a variant of VanillaDataManager that progressively adds cameras, for instance by relying on the training step and modifying the RayBundle and RayGT generation logic.
## Disk Caching for Large Datasets
As of January 2025, the FullImageDatamanager and ParallelImageDatamanager implementations now support parallelized dataloading and dataloading from disk to avoid Out-Of-Memory errors and support very large datasets. To train a NeRF-based method with a large dataset that's unable to fit in memory, please add the `load_from_disk` flag to your `ns-train` command. For example with nerfacto:
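A minimal sketch of the nerfacto command (the exact flag spelling `--pipeline.datamanager.load-from-disk` is an assumption inferred from the `load_from_disk` flag named above and the `--pipeline.datamanager.*` pattern used in the splatfacto example below):

```bash
ns-train nerfacto --data {PROCESSED_DATA_DIR} --pipeline.datamanager.load-from-disk True
```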
To train splatfacto with a large dataset that's unable to fit in memory, please set the device of `cache_images` to `"disk"`. For example with splatfacto:
```bash
ns-train splatfacto --data {PROCESSED_DATA_DIR} --pipeline.datamanager.cache-images disk
```
## Migrating Your DataManager to the new DataManager
Many methods subclass a DataManager and add extra data to it. If you would like your custom datamanager to also support new parallel features, you can migrate any custom dataloading logic to the new `custom_ray_processor()` API. This function takes in a full training batch (either image or ray bundle) and allows the user to modify or add to it. Let's take a look at an example for the LERF method, which was built on Nerfstudio's VanillaDataManager. This API provides an interface to attach new information to the RayBundle (for ray based methods), Cameras object (for splatting based methods), or ground truth dictionary. It runs in a background process if disk caching is enabled, otherwise it runs in the main process.
Naively transferring code to `custom_ray_processor` may still OOM on very large datasets if initialization code requires computing something over the whole dataset. To fully take advantage of parallelization, make sure your subclassed datamanager computes new information inside `custom_ray_processor`, or caches only a subset of the whole dataset. This can also still be slow if pre-computation requires GPU-heavy steps on the same GPU used for training.
**Note**: Because the parallel DataManager uses background processes, any member of the DataManager needs to be *picklable* to be used inside `custom_ray_processor`.
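To illustrate the picklability requirement, here is a small self-contained sketch (the class names are hypothetical, not nerfstudio classes): an instance attribute holding a lambda cannot be pickled and would break the background process, while a plain method or static function pickles fine.

```python
import pickle

class BadDataManager:
    def __init__(self):
        # Lambdas (and other local functions) cannot be pickled, so this
        # member would break a datamanager used in a background process.
        self.filter_fn = lambda x: x

class GoodDataManager:
    @staticmethod
    def filter_fn(x):
        # Functions resolvable by name pickle fine.
        return x

try:
    pickle.dumps(BadDataManager())
    bad_is_picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    bad_is_picklable = False

print(bad_is_picklable)  # → False
print(pickle.dumps(GoodDataManager()) is not None)  # → True
```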
```python
class LERFDataManager(VanillaDataManager):
    """Subclass VanillaDataManager to add extra data processing

    Args:
        config: the DataManagerConfig used to instantiate class
    """
    # ... (remainder of the original LERF datamanager elided in this excerpt) ...
```
To migrate this custom datamanager to the new datamanager, we'll subclass the new ParallelDataManager and shift the data customization process from `next_train()` to `custom_ray_processor()`.
The function `custom_ray_processor()` is called with a fully populated ray bundle and ground truth batch, just like the subclassed `next_train` in the above code. This code, however, is run in a background process.
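As a schematic illustration of this migration pattern, the sketch below uses minimal stand-in classes rather than the real nerfstudio API; `ParallelDataManagerSketch`, `LERFLikeDataManager`, and the `clip_embeddings` field are all hypothetical names chosen for the example.

```python
# Stand-ins illustrating where per-batch customization moves: out of a
# subclassed next_train() and into the custom_ray_processor() hook.

class ParallelDataManagerSketch:
    """Stand-in for nerfstudio's ParallelDataManager (assumed shape)."""

    def custom_ray_processor(self, ray_bundle, batch):
        # Default hook: pass the fully populated batch through unchanged.
        return ray_bundle, batch

    def next_train(self, step):
        # The base class assembles the ray bundle and ground-truth batch,
        # then hands them to the hook (in a background process when disk
        # caching is enabled).
        ray_bundle = {"origins": [0.0, 0.0, 0.0], "step": step}
        batch = {"image": [0.1, 0.2, 0.3]}
        return self.custom_ray_processor(ray_bundle, batch)


class LERFLikeDataManager(ParallelDataManagerSketch):
    """Logic that used to live in a subclassed next_train() moves here."""

    def custom_ray_processor(self, ray_bundle, batch):
        # Attach extra ground-truth data (hypothetical field name).
        batch["clip_embeddings"] = [0.5, 0.5, 0.5, 0.5]
        return ray_bundle, batch


ray_bundle, batch = LERFLikeDataManager().next_train(step=0)
print(sorted(batch))  # → ['clip_embeddings', 'image']
```

The subclass never touches batch assembly or worker management; it only augments the finished batch, which is what lets the base class run the hook off the main process.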