Description
channels_last:
- https://pytorch.org/blog/tensor-memory-format-matters/
- https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html
- torch.cuda.amp cannot speed up on A100 pytorch/pytorch#57806
- https://discuss.pytorch.org/t/why-does-pytorch-prefer-using-nchw/83637
- https://gist.github.com/mingfeima/595f63e5dd2ac6f87fdb47df4ffe4772
- memory format != dimension format
While PyTorch operators expect all tensors to be in Channels First (NCHW) dimension format, PyTorch operators support 3 output memory formats:
  - Contiguous: tensor memory is in the same order as the tensor's dimensions.
  - ChannelsLast: irrespective of the dimension order, the 2d (image) tensor is laid out as an HWC or NHWC (N: batch, H: height, W: width, C: channels) tensor in memory. The dimensions can be permuted in any order.
  - ChannelsLast3d: the 3d (video) analogue, laid out as NDHWC (D: depth) in memory.
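To make "memory format != dimension format" concrete, a minimal sketch (the tensor sizes and the small conv are just illustrative): converting to channels_last leaves the NCHW shape untouched, only the strides change.

```python
import torch
import torch.nn as nn

# A regular contiguous tensor: strides follow the NCHW dimension order.
x = torch.randn(8, 3, 224, 224)
print(x.shape)    # torch.Size([8, 3, 224, 224])
print(x.stride()) # (150528, 50176, 224, 1)

# Convert to channels_last: the *shape* is still NCHW, only the strides
# change so that C becomes the innermost (fastest-moving) dimension.
x_cl = x.to(memory_format=torch.channels_last)
print(x_cl.shape)    # torch.Size([8, 3, 224, 224])  -- unchanged
print(x_cl.stride()) # (150528, 1, 672, 3)
print(x_cl.is_contiguous(memory_format=torch.channels_last))  # True

# Modules are converted the same way; conv weights get the NHWC layout too.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1).to(memory_format=torch.channels_last)
y = conv(x_cl)
print(y.is_contiguous(memory_format=torch.channels_last))  # usually True: conv propagates channels_last
```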
amp:
- http://www.idris.fr/eng/ia/mixed-precision-eng.html
- https://pytorch.org/docs/stable/amp.html
- Had to remove the last activation from the segmentation head because of https://pytorch.org/docs/stable/amp.html#prefer-binary-cross-entropy-with-logits-over-binary-cross-entropy
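A minimal AMP training-step sketch under that constraint (the tiny conv "head", optimizer and random data below are placeholders, not the real model or dataset): the head outputs raw logits and the loss is BCEWithLogitsLoss, which is autocast-safe, while plain BCELoss is not.

```python
import torch
import torch.nn as nn

device = "cuda"
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.Conv2d(8, 1, 1)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()
# No final sigmoid in the model: the loss takes raw logits and fuses the sigmoid.
criterion = nn.BCEWithLogitsLoss()

for _ in range(3):
    images = torch.randn(4, 3, 64, 64, device=device)
    masks = torch.randint(0, 2, (4, 1, 64, 64), device=device).float()

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        logits = model(images)            # raw logits out of the head
        loss = criterion(logits, masks)   # sigmoid fused inside the loss
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                # unscales grads, then steps the optimizer
    scaler.update()                       # adjusts the scale factor for the next step
```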
extra:
- http://blog.ezyang.com/2019/05/pytorch-internals/
- https://discuss.huggingface.co/t/why-is-grad-norm-clipping-done-during-training-by-default/1866
- Cannot use Tensor Cores: NVIDIA/apex#221 (comment). Tensor Cores love the number 8 (fp16 GEMM dimensions should be multiples of 8).
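A small sketch of the multiples-of-8 rule (the layer sizes below are made up): round awkward feature dimensions up so the fp16 matmuls can map onto Tensor Cores.

```python
import torch.nn as nn

def round_up_to_multiple_of_8(n: int) -> int:
    # Tensor Cores want fp16 GEMM dimensions that are multiples of 8.
    return ((n + 7) // 8) * 8

# Example: an awkward hidden size of 250 is padded up to 256 so the
# fp16 matmuls inside nn.Linear can hit Tensor Cores.
hidden = round_up_to_multiple_of_8(250)   # -> 256
mlp = nn.Sequential(nn.Linear(512, hidden), nn.ReLU(), nn.Linear(hidden, 512))
```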
prefetch:
- https://www.jpatrickpark.com/post/prefetcher/
- https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main_amp.py best code I've seen so far
- https://tigress-web.princeton.edu/~jdh4/PyTorchPerformanceTuningGuide_GTC2021.pdf best slide deck I've seen so far
- https://stackoverflow.com/questions/67085517/how-to-load-fetch-the-next-data-batches-for-the-next-epoch-during-the-current-e
- https://androidkt.com/create-dataloader-with-collate_fn-for-variable-length-input-in-pytorch/
Going to train from scratch to see what's good, with a working log this time.
UPDATE 12/07/2022: Seems like the bottleneck is in dataloading, which takes an unholy amount of time even though I cached everything in RAM. Currently profiling CPU & GPU and trying out this dataloader which allegedly actually does prefetch.
UPDATE: It all makes sense now: PyTorch's DataLoader can only prefetch batches within the currently running epoch; for the next epoch there is apparently no prefetching whatsoever.
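For reference, a minimal GPU prefetcher sketch loosely modeled on apex's examples/imagenet/main_amp.py (the wrapped loader is a placeholder): it copies the next batch to the GPU on a side CUDA stream while the current batch is being processed. It still has to be re-created each epoch, so it does not fix the cross-epoch gap; DataLoader(persistent_workers=True) at least keeps the workers alive between epochs.

```python
import torch

class CUDAPrefetcher:
    """Wraps an iterable of (images, targets) batches and copies the next
    batch to the GPU on a side stream while the current one is being used."""

    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self._preload()

    def _preload(self):
        try:
            self.next_images, self.next_targets = next(self.loader)
        except StopIteration:
            self.next_images = self.next_targets = None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking=True only overlaps if the DataLoader uses pin_memory=True
            self.next_images = self.next_images.cuda(non_blocking=True)
            self.next_targets = self.next_targets.cuda(non_blocking=True)

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_images is None:
            raise StopIteration
        # Make the default stream wait for the async copies to finish.
        torch.cuda.current_stream().wait_stream(self.stream)
        images, targets = self.next_images, self.next_targets
        # Keep the memory alive until the default stream is done with it.
        images.record_stream(torch.cuda.current_stream())
        targets.record_stream(torch.cuda.current_stream())
        self._preload()
        return images, targets

# usage sketch: wrap the (pin_memory=True) DataLoader once per epoch
# for images, targets in CUDAPrefetcher(train_loader):
#     ...
```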