Depending on the instruction set, significant speedups are observed for the vectorized path:
log1p wall time is reduced 60-93% (2.5x - 15x speedup)
expm1 wall time is reduced 0-85% (1x - 7x speedup)
The scalar path is slower by 20-30% due to the extra branch needed to handle +infinity correctly.
Full benchmarks, measured on an Intel(R) Xeon(R) Gold 6154, are available here: https://bitbucket.org/snippets/rmlarsen/MXBkpM
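
To illustrate why +infinity needs its own branch in a scalar log1p, here is a minimal sketch. It assumes the compensated formula log1p(x) = x * log(u) / (u - 1) with u = 1 + x, which this message does not spell out, and `log1p_sketch` is a hypothetical name rather than a function from the library.

```cpp
#include <cmath>
#include <limits>

// Hypothetical scalar sketch (not the library's code): log(1 + x) computed
// with the compensated formula x * log(u) / (u - 1), where u = 1 + x.
double log1p_sketch(double x) {
  const double u = 1.0 + x;

  // x is too small to change 1.0 when added, so log1p(x) ~= x.
  if (u == 1.0) return x;

  // Extra branch: for x == +inf the formula below evaluates inf / inf,
  // which is NaN, so +inf has to be returned directly.
  if (x == std::numeric_limits<double>::infinity()) return x;

  // General case. Dividing by (u - 1) instead of x compensates for the
  // rounding error made when forming u.
  return x * (std::log(u) / (u - 1.0));
}
```

Without the infinity check, x = +inf gives u = +inf, so log(u) / (u - 1) becomes inf / inf = NaN and the function would return NaN instead of +inf.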

This commit fixes the AVX512 implementations of psqrt in the same way that
3ed67cb0bb4af65fbf243df598604a8c7630bf7d fixed the AVX2 version of this
function. The AVX512 versions of psqrt incorrectly return -0.0 for negative
values instead of NaN. Fixing the issue requires some additional
instructions, which slow the algorithms down. A test similar to the one used
in 3ed67cb0bb4af65fbf243df598604a8c7630bf7d shows that the corrected
Packet16f code runs at 73% of the speed of the existing code, while the
corrected Packet8d function runs at 68% of the original.
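
The expected behaviour for negative inputs can be expressed as a small scalar check. This is only a sketch in standard C++, not the packet code: `negative_maps_to_nan`, `sqrt_returning_neg_zero` and `sqrt_reference` are hypothetical names, and the second function merely mimics the symptom described above (-0.0 for negative inputs) without modelling how the AVX512 code produced it.

```cpp
#include <cmath>

// Returns true if `f` maps the strictly negative input `x` to NaN,
// which is what an IEEE 754 square root must do.
template <typename SqrtFn>
bool negative_maps_to_nan(SqrtFn f, float x) {
  return x < 0.0f && std::isnan(f(x));
}

// Hypothetical stand-in that reproduces only the reported symptom:
// negative inputs yield -0.0 instead of NaN.
float sqrt_returning_neg_zero(float x) {
  return x < 0.0f ? -0.0f : std::sqrt(x);
}

// Reference behaviour from the standard library.
float sqrt_reference(float x) { return std::sqrt(x); }
```

Calling `negative_maps_to_nan(sqrt_reference, -4.0f)` yields true, while the same call with `sqrt_returning_neg_zero` yields false. Note that comparing the result against 0.0f would not catch the problem, because -0.0f == 0.0f; the check has to use `std::isnan`.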