This allows us to do faster native scalar operations. Also
updated half/quarter packets to use the native type if available.
Benchmark improvement:
```
Comparing ./2910_without_float16 to ./2910_with_float16
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------------------------
BM_CalcMat<float>/10000/768/500 -0.0041 -0.0040 58276392 58039442 58273420 58039582
BM_CalcMat<_Float16>/10000/768/500 +0.0073 +0.0073 642506339 647214446 642481384 647188303
BM_CalcMat<Eigen::half>/10000/768/500 -0.3170 -0.3170 92511115 63182101 92506771 63179258
BM_CalcVec<float>/10000/768/500 +0.0022 +0.0022 5198157 5209469 5197913 5209334
BM_CalcVec<_Float16>/10000/768/500 +0.0025 +0.0026 10133324 10159111 10132641 10158507
BM_CalcVec<Eigen::half>/10000/768/500 -0.7760 -0.7760 45337937 10156952 45336532 10156389
OVERALL_GEOMEAN -0.2677 -0.2677 0 0 0 0
```
Fixes#2910.
The recent addition of vectorized pow (!330) relies on `pfrexp` and
`pldexp`. This was missing for `Eigen::half` and `Eigen::bfloat16`.
Adding tests for these packet ops also exposed an issue with handling
negative values in `pfrexp`, returning an incorrect exponent.
Added the missing implementations, corrected the exponent in `pfrexp1`,
and added `packetmath` tests.
Minimal implementation of AVX `Eigen::half` ops to bring in line
with `bfloat16`. Allows `packetmath_13` to pass.
Also adjusted `bfloat16` packet traits to match the supported set
of ops (e.g. Bessel is not actually implemented).
2. Simplify handling of special cases by taking advantage of the fact that the
builtin vrsqrt approximation handles negative, zero and +inf arguments correctly.
This speeds up the SSE and AVX implementations by ~20%.
3. Make the Newton-Raphson formula used for rsqrt more numerically robust:
Before: y = y * (1.5 - x/2 * y^2)
After: y = y * (1.5 - y * (x/2) * y)
Forming y^2 can overflow for very large or very small (denormalized) values of x, while x*y ~= 1. For AVX512, this makes it possible to compute accurate results for denormal inputs down to ~1e-42 in single precision.
4. Add a faster double precision implementation for Knights Landing using the vrsqrt28 instruction and a single Newton-Raphson iteration.
Benchmark results: https://bitbucket.org/snippets/rmlarsen/5LBq9o
Depending on instruction set, significant speedups are observed for the vectorized path:
log1p wall time is reduced 60-93% (2.5x - 15x speedup)
expm1 wall time is reduced 0-85% (1x - 7x speedup)
The scalar path is slower by 20-30% due to the extra branch needed to handle +infinity correctly.
Full benchmarks measured on Intel(R) Xeon(R) Gold 6154 here: https://bitbucket.org/snippets/rmlarsen/MXBkpM
It is based on the SSE version which is much more accurate, though very slightly slower.
This changeset also includes the following required changes:
- add packet-float to packet-int type traits
- add packet float<->int reinterpret casts
- add faster pselect for AVX based on blendv
Benchmark speed in Giga-sqrts/s
Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
-----------------------------------------
SSE AVX
Fast=1 2.529G 4.380G
Fast=0 1.944G 1.898G
Fast=1 fixed 2.214G 3.739G
This table illustrates the worst case in terms speed impact: It was measured by repeatedly computing the sqrt of an n=4096 float vector that fits in L1 cache. For large vectors the operation becomes memory bound and the differences between the different versions almost negligible.