It seems *sometimes* with aggressive optimizations the combination
`psub(padd(a, b), b)` trick to force rounding is compiled away. Here
we replace with inline assembly to prevent this (I tried `volatile`,
but that leads to additional loads from memory).
Also fixed an edge case for large inputs `a` where adding `b` bumps
the value up a power of two and ends up rounding away more than
just the fractional part. If we are over `2^digits` then just return
the input. This edge case was missed in the test since the test was
comparing approximate equality, which was still satisfied. Adding
a strict equality option catches it.
The recent addition of vectorized pow (!330) relies on `pfrexp` and
`pldexp`. This was missing for `Eigen::half` and `Eigen::bfloat16`.
Adding tests for these packet ops also exposed an issue with handling
negative values in `pfrexp`, returning an incorrect exponent.
Added the missing implementations, corrected the exponent in `pfrexp1`,
and added `packetmath` tests.
MSVC incorrectly handles `inf` cases for `std::sqrt<std::complex<T>>`.
Here we replace it with a custom version (currently used on GPU).
Also fixed the `packetmath` test, which previously skipped several
corner cases since `CHECK_CWISE1` only tests the first `PacketSize`
elements.