Description
channels_last:
- https://pytorch.org/blog/tensor-memory-format-matters/
- https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html
- torch.cuda.amp cannot speed up on A100 pytorch/pytorch#57806
- https://discuss.pytorch.org/t/why-does-pytorch-prefer-using-nchw/83637
- https://gist.github.com/mingfeima/595f63e5dd2ac6f87fdb47df4ffe4772
- memory format != dimension format
While PyTorch operators expect all tensors to be in Channels First (NCHW) dimension format, PyTorch operators support 3 output memory formats:
  - Contiguous: tensor memory is in the same order as the tensor's dimensions.
  - ChannelsLast: irrespective of the dimension order, the 2d (image) tensor is laid out as an HWC or NHWC (N: batch, H: height, W: width, C: channels) tensor in memory. The dimensions can be permuted in any order.
  - ChannelsLast3d: the 3d (video) analogue, laid out as NDHWC (D: depth) in memory.
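To make "memory format != dimension format" concrete, a minimal sketch (the tensor sizes and the small conv are just illustrative): converting to channels_last leaves the NCHW shape untouched, only the strides change.

```python
import torch
import torch.nn as nn

# A regular contiguous tensor: strides follow the NCHW dimension order.
x = torch.randn(8, 3, 224, 224)
print(x.shape)    # torch.Size([8, 3, 224, 224])
print(x.stride()) # (150528, 50176, 224, 1)

# Convert to channels_last: the *shape* is still NCHW, only the strides
# change so that C becomes the innermost (fastest-moving) dimension.
x_cl = x.to(memory_format=torch.channels_last)
print(x_cl.shape)    # torch.Size([8, 3, 224, 224])  -- unchanged
print(x_cl.stride()) # (150528, 1, 672, 3)
print(x_cl.is_contiguous(memory_format=torch.channels_last))  # True

# Modules are converted the same way; conv weights get the NHWC layout too.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1).to(memory_format=torch.channels_last)
y = conv(x_cl)
print(y.is_contiguous(memory_format=torch.channels_last))  # usually True: conv propagates channels_last
```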
amp:
- http://www.idris.fr/eng/ia/mixed-precision-eng.html
- https://pytorch.org/docs/stable/amp.html
- Had to remove the last activation from the segmentation head because of https://pytorch.org/docs/stable/amp.html#prefer-binary-cross-entropy-with-logits-over-binary-cross-entropy
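A minimal AMP training-step sketch under that constraint (the tiny conv "head", optimizer and random data below are placeholders, not the real model or dataset): the head outputs raw logits and the loss is BCEWithLogitsLoss, which is autocast-safe, while plain BCELoss is not.

```python
import torch
import torch.nn as nn

device = "cuda"
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.Conv2d(8, 1, 1)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()
# No final sigmoid in the model: the loss takes raw logits and fuses the sigmoid.
criterion = nn.BCEWithLogitsLoss()

for _ in range(3):
    images = torch.randn(4, 3, 64, 64, device=device)
    masks = torch.randint(0, 2, (4, 1, 64, 64), device=device).float()

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        logits = model(images)            # raw logits out of the head
        loss = criterion(logits, masks)   # sigmoid fused inside the loss
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                # unscales grads, then steps the optimizer
    scaler.update()                       # adjusts the scale factor for the next step
```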
extra:
- http://blog.ezyang.com/2019/05/pytorch-internals/
- https://discuss.huggingface.co/t/why-is-grad-norm-clipping-done-during-training-by-default/1866
- Cannot use Tensor Cores: NVIDIA/apex#221 (comment). Tensor Cores love the number 8 (fp16 GEMM dimensions should be multiples of 8).
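A small sketch of the multiples-of-8 rule (the layer sizes below are made up): round awkward feature dimensions up so the fp16 matmuls can map onto Tensor Cores.

```python
import torch.nn as nn

def round_up_to_multiple_of_8(n: int) -> int:
    # Tensor Cores want fp16 GEMM dimensions that are multiples of 8.
    return ((n + 7) // 8) * 8

# Example: an awkward hidden size of 250 is padded up to 256 so the
# fp16 matmuls inside nn.Linear can hit Tensor Cores.
hidden = round_up_to_multiple_of_8(250)   # -> 256
mlp = nn.Sequential(nn.Linear(512, hidden), nn.ReLU(), nn.Linear(hidden, 512))
```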
prefetch:
- https://www.jpatrickpark.com/post/prefetcher/
- https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main_amp.py best code I've seen so far
- https://tigress-web.princeton.edu/~jdh4/PyTorchPerformanceTuningGuide_GTC2021.pdf best slide deck I've seen so far
- https://stackoverflow.com/questions/67085517/how-to-load-fetch-the-next-data-batches-for-the-next-epoch-during-the-current-e
- https://androidkt.com/create-dataloader-with-collate_fn-for-variable-length-input-in-pytorch/
Going to train from scratch to see what's good, with a working log this time.
UPDATE 12/07/2022: Seems like the bottleneck is in dataloading, which takes an unholy amount of time even though I cached everything in RAM. Currently profiling CPU & GPU and trying out this dataloader which allegedly actually does prefetch.
UPDATE: It all makes sense now: PyTorch's DataLoader can only prefetch batches within the currently running epoch; for the next epoch there is apparently no prefetching whatsoever.
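For reference, a minimal GPU prefetcher sketch loosely modeled on apex's examples/imagenet/main_amp.py (the wrapped loader is a placeholder): it copies the next batch to the GPU on a side CUDA stream while the current batch is being processed. It still has to be re-created each epoch, so it does not fix the cross-epoch gap; DataLoader(persistent_workers=True) at least keeps the workers alive between epochs.

```python
import torch

class CUDAPrefetcher:
    """Wraps an iterable of (images, targets) batches and copies the next
    batch to the GPU on a side stream while the current one is being used."""

    def __init__(self, loader):
        self.loader = iter(loader)
        self.stream = torch.cuda.Stream()
        self._preload()

    def _preload(self):
        try:
            self.next_images, self.next_targets = next(self.loader)
        except StopIteration:
            self.next_images = self.next_targets = None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking=True only overlaps if the DataLoader uses pin_memory=True
            self.next_images = self.next_images.cuda(non_blocking=True)
            self.next_targets = self.next_targets.cuda(non_blocking=True)

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_images is None:
            raise StopIteration
        # Make the default stream wait for the async copies to finish.
        torch.cuda.current_stream().wait_stream(self.stream)
        images, targets = self.next_images, self.next_targets
        # Keep the memory alive until the default stream is done with it.
        images.record_stream(torch.cuda.current_stream())
        targets.record_stream(torch.cuda.current_stream())
        self._preload()
        return images, targets

# usage sketch: wrap the (pin_memory=True) DataLoader once per epoch
# for images, targets in CUDAPrefetcher(train_loader):
#     ...
```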