Description
Unfortunately there is no short godbolt reproducer as of now, only the entire source: https://godbolt.org/z/6TMYYsW3e
While working on the C++ data loader of
https://github.com/official-stockfish/nnue-pytorch/blob/master/training_data_loader.cpp
I noticed a 2x performance difference between the latest clang and gcc.
Running a perf profile on this showed __ieee754_logl
at the very top, which is nowhere to be seen with gcc, so I assume this function somehow didn't get optimized properly?
Taking a look at the flamegraph shows that it seems to come from any std::bernoulli_distribution (or other std::*_distribution) call.
https://godbolt.org/z/6TMYYsW3e
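For anyone who wants to poke at the distribution call in isolation, here is a rough sketch of the kind of loop the flamegraph points at. It is not taken from the data loader and has not been confirmed to reproduce the slowdown. The helper names (count_true, time_draws), the engine choice (std::mt19937_64), the seed and the probability are all assumptions made for illustration.

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>

// Hypothetical helper (not from training_data_loader.cpp): draws n samples
// from std::bernoulli_distribution(p) and returns how many were true.
// The result is returned so the compiler cannot drop the loop.
std::uint64_t count_true(std::uint64_t n, double p, std::uint64_t seed) {
    std::mt19937_64 rng(seed);            // assumed engine, the loader may use another
    std::bernoulli_distribution dist(p);
    std::uint64_t hits = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        hits += dist(rng) ? 1 : 0;
    }
    return hits;
}

// Hypothetical helper: times n draws, prints the result, returns seconds.
double time_draws(std::uint64_t n, double p = 0.5) {
    const auto start = std::chrono::steady_clock::now();
    const std::uint64_t hits = count_true(n, p, 42);  // arbitrary fixed seed
    const auto stop = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(stop - start).count();
    std::printf("%llu of %llu draws were true, %.3f s\n",
                static_cast<unsigned long long>(hits),
                static_cast<unsigned long long>(n), seconds);
    return seconds;
}
```

Calling time_draws(100'000'000) from main and building the same file with clang++ and g++ at -O3 -march=native would show whether the gap appears in a loop this small. If it does not, the cause probably depends on how the loader calls the distribution.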
I haven't been able to create a small standalone example that reproduces this yet, so if someone wants to compile the above example, get the file from godbolt and run:
clang++ -march=native test.cpp -O3 -o loader && ./loader test77-jan2022-2tb7p.high-simple-eval-1k.min-v2.binpack
The mentioned file can be downloaded from here: https://huggingface.co/datasets/official-stockfish/master-smallnet-binpacks/tree/main
If you compile directly with libc++ instead of libstdc++, the program will be another 1.5x slower:
clang++-21 libc++ 10.0457s
clang++-21 libstdc++ 5.43586s
g++-15 3.56669s