Guoqiang QI
38ae5353ab
1) Provide a better generic paddsub op implementation
...
2) Make paddsub op support Packet2cf/Packet4f/Packet2f in NEON
3) Make paddsub op support Packet2cf/Packet4f in SSE
2021-01-13 22:54:03 +00:00
Antonio Sanchez
070d303d56
Add CUDA complex sqrt.
...
This is to support scalar `sqrt` of complex numbers `std::complex<T>` on
device, requested by TensorFlow folks.
Technically `std::complex` is not supported by NVCC on device
(though it is by clang), so the default `sqrt(std::complex<T>)` function only
works on the host. Here we create an overload to add back the
functionality.
Also modified the CMake file to add `--relaxed-constexpr` (or
equivalent) flag for NVCC to allow calling constexpr functions from
device functions, and added support for specifying compute architecture for
NVCC (was already available for clang).
2020-12-22 23:25:23 -08:00
Antonio Sanchez
5dc2fbabee
Fix implicit cast to double.
...
Triggers `-Wimplicit-float-conversion`, causing a bunch of build errors
in Google due to `-Wall`.
2020-12-12 09:26:20 -08:00
Antonio Sanchez
c6efc4e0ba
Replace M_LOG2E and M_LN2 with custom macros.
...
For these to exist we would need to define `_USE_MATH_DEFINES` before
`cmath` or `math.h` is first included. However, we don't
control the include order for projects outside Eigen, so even defining
the macro in `Eigen/Core` does not fix the issue for projects that
end up including `<cmath>` before Eigen does (explicitly or transitively).
To fix this, we define `EIGEN_LOG2E` and `EIGEN_LN2` ourselves.
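A rough sketch of the idea, assuming hypothetical macro names `MY_LOG2E` and `MY_LN2` as stand-ins for `EIGEN_LOG2E` and `EIGEN_LN2` (the actual definitions in Eigen may differ). Because the constants are defined locally, nothing depends on `_USE_MATH_DEFINES` or on include order:

```cpp
#include <cmath>

// Hypothetical stand-ins for EIGEN_LOG2E / EIGEN_LN2; the names and the
// number of digits are illustrative, not copied from Eigen.
#define MY_LOG2E 1.442695040888963407359924681001892137  // log2(e)
#define MY_LN2   0.693147180559945309417232121458176568  // ln(2)

// Converts a natural logarithm to base 2 without relying on M_LOG2E.
inline double log2_from_ln(double x) { return std::log(x) * MY_LOG2E; }
```

This compiles the same way whether or not `<cmath>` was included earlier without `_USE_MATH_DEFINES`.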
2020-12-11 14:34:31 -08:00
Rasmus Munk Larsen
125cc9a5df
Implement vectorized complex square root.
...
Closes #1905
Measured speedup for sqrt of `complex<float>` on Skylake:
SSE:
```
name old time/op new time/op delta
BM_eigen_sqrt_ctype/1 49.4ns ± 0% 54.3ns ± 0% +10.01%
BM_eigen_sqrt_ctype/8 332ns ± 0% 50ns ± 1% -84.97%
BM_eigen_sqrt_ctype/64 2.81µs ± 1% 0.38µs ± 0% -86.49%
BM_eigen_sqrt_ctype/512 23.8µs ± 0% 3.0µs ± 0% -87.32%
BM_eigen_sqrt_ctype/4k 202µs ± 0% 24µs ± 2% -88.03%
BM_eigen_sqrt_ctype/32k 1.63ms ± 0% 0.19ms ± 0% -88.18%
BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 1.5ms ± 1% -88.20%
BM_eigen_sqrt_ctype/1M 52.1ms ± 0% 6.2ms ± 0% -88.18%
```
AVX2:
```
name old cpu/op new cpu/op delta
BM_eigen_sqrt_ctype/1 53.6ns ± 0% 55.6ns ± 0% +3.71%
BM_eigen_sqrt_ctype/8 334ns ± 0% 27ns ± 0% -91.86%
BM_eigen_sqrt_ctype/64 2.79µs ± 0% 0.22µs ± 2% -92.28%
BM_eigen_sqrt_ctype/512 23.8µs ± 1% 1.7µs ± 1% -92.81%
BM_eigen_sqrt_ctype/4k 201µs ± 0% 14µs ± 1% -93.24%
BM_eigen_sqrt_ctype/32k 1.62ms ± 0% 0.11ms ± 1% -93.29%
BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 0.9ms ± 1% -93.31%
BM_eigen_sqrt_ctype/1M 52.0ms ± 0% 3.5ms ± 1% -93.31%
```
AVX512:
```
name old cpu/op new cpu/op delta
BM_eigen_sqrt_ctype/1 53.7ns ± 0% 56.2ns ± 1% +4.75%
BM_eigen_sqrt_ctype/8 334ns ± 0% 18ns ± 2% -94.63%
BM_eigen_sqrt_ctype/64 2.79µs ± 0% 0.12µs ± 1% -95.54%
BM_eigen_sqrt_ctype/512 23.9µs ± 1% 1.0µs ± 1% -95.89%
BM_eigen_sqrt_ctype/4k 202µs ± 0% 8µs ± 1% -96.13%
BM_eigen_sqrt_ctype/32k 1.63ms ± 0% 0.06ms ± 1% -96.15%
BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 0.5ms ± 4% -96.11%
BM_eigen_sqrt_ctype/1M 52.1ms ± 0% 2.0ms ± 1% -96.13%
```
2020-12-08 18:13:35 -08:00
Rasmus Munk Larsen
f9fac1d5b0
Add log2() to Eigen.
2020-12-04 21:45:09 +00:00
Rasmus Munk Larsen
f23dc5b971
Revert "Add log2() operator to Eigen"
...
This reverts commit 4d91519a9be061da5d300079fca17dd0b9328050.
2020-12-03 14:32:45 -08:00
Rasmus Munk Larsen
4d91519a9b
Add log2() operator to Eigen
2020-12-03 22:31:44 +00:00
Rasmus Munk Larsen
25d8ae7465
Small cleanup of generic plog implementations:
...
Adding the term e*ln(2) is split into two steps for no obvious reason.
This dates back to the original Cephes code from which the algorithm is adapted.
It appears that this was done in Cephes to prevent the compiler from reordering
the addition of the 3 terms in the approximation
log(1+x) ~= x - 0.5*x^2 + x^3*P(x)/Q(x)
which must be added in reverse order since |x| < (sqrt(2)-1).
This allows rewriting the code to just 2 pmadd and 1 padd instructions,
which on a Skylake processor speeds up the code by 5-7%.
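The "2 pmadd and 1 padd" structure can be illustrated with a scalar sketch that uses `std::fma` in place of `pmadd`. The function name and the `px3` parameter (the value of x^3*P(x)/Q(x), assumed to be computed elsewhere) are hypothetical; this is not the Eigen code itself:

```cpp
#include <cmath>

// Scalar sketch of the final accumulation in a Cephes-style log.
//   x   : reduced argument, |x| < sqrt(2) - 1
//   e   : exponent obtained from the range reduction
//   px3 : value of x^3 * P(x) / Q(x) (hypothetical input, computed elsewhere)
inline double combine_log_terms(double x, double e, double px3) {
  const double ln2 = 0.693147180559945309417;
  const double y = std::fma(-0.5, x * x, px3);  // smallest terms first
  const double r = x + y;                       // then add x
  return std::fma(e, ln2, r);                   // add e*ln(2) in a single step
}
```

The terms are accumulated from smallest to largest magnitude, which matches the "reverse order" requirement described above.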
2020-12-03 19:40:40 +00:00
Rasmus Munk Larsen
79818216ed
Revert "Fix Half NaN definition and test."
...
This reverts commit c770746d709686ef2b8b652616d9232f9b028e78.
2020-11-24 12:57:28 -08:00
Rasmus Munk Larsen
c770746d70
Fix Half NaN definition and test.
...
The `half_float` test was failing with `-mcpu=cortex-a55` (native `__fp16`) due
to a bad NaN bit-pattern comparison (in the case of casting a float to `__fp16`,
the signaling `NaN` is quieted). There was also an inconsistency between
`numeric_limits<half>::quiet_NaN()` and `NumTraits::quiet_NaN()`. Here we
correct the inconsistency and compare NaNs according to the IEEE 754
definition.
Also modified the `bfloat16_float` test to match.
Tested with `cortex-a53` and `cortex-a55`.
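A minimal sketch of a NaN-aware comparison that ignores the NaN bit pattern, assuming a hypothetical helper name `equal_or_both_nan` (the actual test helpers in Eigen may look different):

```cpp
#include <cmath>

// Returns true if a and b should be treated as equal in a test:
// either both are NaN (regardless of payload or quiet/signaling bit),
// or they compare equal as numbers.
inline bool equal_or_both_nan(float a, float b) {
  const bool a_nan = (a != a);  // IEEE 754: NaN is the only value not equal to itself
  const bool b_nan = (b != b);
  if (a_nan || b_nan) return a_nan && b_nan;
  return a == b;
}
```

Comparing this way avoids depending on whether a conversion quiets a signaling NaN.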
2020-11-24 20:53:07 +00:00
David Tellenbach
09f015852b
Replace numext::as_uint with numext::bit_cast<numext::uint32_t>
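For readers unfamiliar with the idiom, a bit cast can be sketched with `std::memcpy`. The name `bit_cast_sketch` is hypothetical and this is not Eigen's `numext::bit_cast` implementation:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a bit_cast helper: reinterprets the bytes of
// `src` as a value of type To.
template <typename To, typename From>
inline To bit_cast_sketch(const From& src) {
  static_assert(sizeof(To) == sizeof(From), "sizes must match");
  To dst;
  std::memcpy(&dst, &src, sizeof(To));  // well-defined, unlike pointer punning
  return dst;
}
```

Assuming IEEE 754 single precision, `bit_cast_sketch<std::uint32_t>(1.0f)` yields `0x3f800000`.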
2020-10-29 07:28:28 +01:00
guoqiangqi
28aef8e816
Improve polynomial evaluation with instruction-level parallelism for pexp_float and pexp<Packet16f>
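The commit message does not show the exact scheme used. One common way to expose instruction-level parallelism is to split the polynomial into even and odd parts, sketched here in scalar form with hypothetical function names:

```cpp
#include <cmath>

// Degree-5 polynomial c[0] + c[1]*x + ... + c[5]*x^5.
// Horner: one dependency chain of 5 fused multiply-adds.
inline float horner5(const float c[6], float x) {
  float r = c[5];
  for (int i = 4; i >= 0; --i) r = std::fma(r, x, c[i]);
  return r;
}

// Even/odd split: the two partial sums do not depend on each other,
// so their multiply-adds can be issued in parallel.
inline float split5(const float c[6], float x) {
  const float x2 = x * x;
  float even = std::fma(c[4], x2, c[2]);
  even = std::fma(even, x2, c[0]);  // c0 + c2*x^2 + c4*x^4
  float odd = std::fma(c[5], x2, c[3]);
  odd = std::fma(odd, x2, c[1]);    // c1 + c3*x^2 + c5*x^4
  return std::fma(odd, x, even);
}
```

Both functions evaluate the same polynomial; results can differ in the last bits because the rounding order differs.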
2020-10-20 11:37:09 +08:00
guoqiangqi
4a77eda1fd
Remove unnecessary template specializations of pexp for scalar float/double
2020-10-19 00:51:42 +00:00
Rasmus Munk Larsen
3a0b23e473
Fix compilation of pset1frombits calls on iOS.
2020-09-28 22:30:36 +00:00
Guoqiang QI
3012e755e9
Add plog support for Packet2d on NEON
2020-09-15 17:10:35 +00:00
Guoqiang QI
85428a3440
Add Neon psqrt<Packet2d> and pexp<Packet2d>
2020-09-08 09:04:03 +00:00
Joel Holdsworth
232f904082
Add shift_left<N> and shift_right<N> coefficient-wise unary Array functions
2020-03-19 17:24:06 +00:00
Rasmus Munk Larsen
ea51a9eace
Add missing EIGEN_DEVICE_FUNC attribute to template specializations for pexp to fix GPU build.
2019-11-27 10:17:09 -08:00
Gael Guennebaud
e5778b87b9
Fix duplicate symbol linking error.
2019-11-20 17:23:19 +01:00
Rasmus Munk Larsen
fab4e3a753
Address comments on Chebyshev evaluation code:
...
1. Use pmadd when possible.
2. Add casts to avoid c++03 warnings.
2019-10-02 12:48:17 -07:00
Rasmus Munk Larsen
bd0fac456f
Prevent infinite loop in the nvcc compiler while unrolling the recurrent templates for Chebyshev polynomial evaluation.
2019-10-01 13:15:30 -07:00
Srinivas Vasudevan
facdec5aa7
Add packetized versions of i0e and i1e special functions.
...
- In particular refactor the i0e and i1e code so scalar and vectorized path share code.
- Move chebevl to GenericPacketMathFunctions.
A brief benchmark, building Eigen with FMA, AVX and AVX2 flags:
Before:
CPU: Intel Haswell with HyperThreading (6 cores)
Benchmark Time(ns) CPU(ns) Iterations
-----------------------------------------------------------------
BM_eigen_i0e_double/1 57.3 57.3 10000000
BM_eigen_i0e_double/8 398 398 1748554
BM_eigen_i0e_double/64 3184 3184 218961
BM_eigen_i0e_double/512 25579 25579 27330
BM_eigen_i0e_double/4k 205043 205042 3418
BM_eigen_i0e_double/32k 1646038 1646176 422
BM_eigen_i0e_double/256k 13180959 13182613 53
BM_eigen_i0e_double/1M 52684617 52706132 10
BM_eigen_i0e_float/1 28.4 28.4 24636711
BM_eigen_i0e_float/8 75.7 75.7 9207634
BM_eigen_i0e_float/64 512 512 1000000
BM_eigen_i0e_float/512 4194 4194 166359
BM_eigen_i0e_float/4k 32756 32761 21373
BM_eigen_i0e_float/32k 261133 261153 2678
BM_eigen_i0e_float/256k 2087938 2088231 333
BM_eigen_i0e_float/1M 8380409 8381234 84
BM_eigen_i1e_double/1 56.3 56.3 10000000
BM_eigen_i1e_double/8 397 397 1772376
BM_eigen_i1e_double/64 3114 3115 223881
BM_eigen_i1e_double/512 25358 25361 27761
BM_eigen_i1e_double/4k 203543 203593 3462
BM_eigen_i1e_double/32k 1613649 1613803 428
BM_eigen_i1e_double/256k 12910625 12910374 54
BM_eigen_i1e_double/1M 51723824 51723991 10
BM_eigen_i1e_float/1 28.3 28.3 24683049
BM_eigen_i1e_float/8 74.8 74.9 9366216
BM_eigen_i1e_float/64 505 505 1000000
BM_eigen_i1e_float/512 4068 4068 171690
BM_eigen_i1e_float/4k 31803 31806 21948
BM_eigen_i1e_float/32k 253637 253692 2763
BM_eigen_i1e_float/256k 2019711 2019918 346
BM_eigen_i1e_float/1M 8238681 8238713 86
After:
CPU: Intel Haswell with HyperThreading (6 cores)
Benchmark Time(ns) CPU(ns) Iterations
-----------------------------------------------------------------
BM_eigen_i0e_double/1 15.8 15.8 44097476
BM_eigen_i0e_double/8 99.3 99.3 7014884
BM_eigen_i0e_double/64 777 777 886612
BM_eigen_i0e_double/512 6180 6181 100000
BM_eigen_i0e_double/4k 48136 48140 14678
BM_eigen_i0e_double/32k 385936 385943 1801
BM_eigen_i0e_double/256k 3293324 3293551 228
BM_eigen_i0e_double/1M 12423600 12424458 57
BM_eigen_i0e_float/1 16.3 16.3 43038042
BM_eigen_i0e_float/8 30.1 30.1 23456931
BM_eigen_i0e_float/64 169 169 4132875
BM_eigen_i0e_float/512 1338 1339 516860
BM_eigen_i0e_float/4k 10191 10191 68513
BM_eigen_i0e_float/32k 81338 81337 8531
BM_eigen_i0e_float/256k 651807 651984 1000
BM_eigen_i0e_float/1M 2633821 2634187 268
BM_eigen_i1e_double/1 16.2 16.2 42352499
BM_eigen_i1e_double/8 110 110 6316524
BM_eigen_i1e_double/64 822 822 851065
BM_eigen_i1e_double/512 6480 6481 100000
BM_eigen_i1e_double/4k 51843 51843 10000
BM_eigen_i1e_double/32k 414854 414852 1680
BM_eigen_i1e_double/256k 3320001 3320568 212
BM_eigen_i1e_double/1M 13442795 13442391 53
BM_eigen_i1e_float/1 17.6 17.6 41025735
BM_eigen_i1e_float/8 35.5 35.5 19597891
BM_eigen_i1e_float/64 240 240 2924237
BM_eigen_i1e_float/512 1424 1424 485953
BM_eigen_i1e_float/4k 10722 10723 65162
BM_eigen_i1e_float/32k 86286 86297 8048
BM_eigen_i1e_float/256k 691821 691868 1000
BM_eigen_i1e_float/1M 2777336 2777747 256
This shows anywhere from a 50% to 75% improvement on these operations.
I've also benchmarked without any of these flags turned on, and got similar
performance to before (if not better).
Also tested packetmath.cpp + special_functions to ensure no regressions.
2019-09-11 18:34:02 -07:00
Deven Desai
cdb377d0cb
Fix for the HIP build+test errors introduced by the ndtri support.
...
The fixes needed are
* adding EIGEN_DEVICE_FUNC attribute to a couple of funcs (else HIPCC will error out when non-device funcs are called from global/device funcs)
* switching to using ::<math_func> instead std::<math_func> (only for HIPCC) in cases where the std::<math_func> is not recognized as a device func by HIPCC
* removing an errant "j" from a testcase (don't know how that made it in to begin with!)
2019-09-06 16:03:49 +00:00
Gael Guennebaud
17226100c5
Fix a circular dependency regarding pshift* functions and GenericPacketMathFunctions.
...
Another solution would have been to make pshift* fully generic template functions with
partial specialization which is always a mess in c++03.
2019-09-06 09:26:04 +02:00
Srinivas Vasudevan
e38dd48a27
PR 681: Add ndtri function, the inverse of the normal distribution function.
2019-08-12 19:26:29 -04:00
Rasmus Munk Larsen
1187bb65ad
Add more tests for corner cases of log1p and expm1. Add handling of infinite arguments to log1p such that log1p(inf) = inf.
2019-08-28 12:20:21 -07:00
Rasmus Munk Larsen
9aba527405
Revert changes to std_fallback::log1p that broke handling of arguments less than -1. Fix packet op accordingly.
2019-08-27 15:35:29 -07:00
Rasmus Munk Larsen
a3298b22ec
Implement vectorized versions of log1p and expm1 in Eigen using Kahan's formulas, and change the scalar implementations to properly handle infinite arguments.
...
Depending on instruction set, significant speedups are observed for the vectorized path:
log1p wall time is reduced 60-93% (2.5x - 15x speedup)
expm1 wall time is reduced 0-85% (1x - 7x speedup)
The scalar path is slower by 20-30% due to the extra branch needed to handle +infinity correctly.
Full benchmarks measured on Intel(R) Xeon(R) Gold 6154 here: https://bitbucket.org/snippets/rmlarsen/MXBkpM
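A scalar sketch of Kahan's formulas with explicit handling of +infinity. The function names are hypothetical, and the branch structure is an assumption for illustration rather than the code in this commit:

```cpp
#include <cmath>

// log(1 + x): the rounding error in u = 1 + x is compensated by dividing
// by (u - 1), which carries the same error.
inline double log1p_kahan(double x) {
  const double u = 1.0 + x;
  if (u == 1.0) return x;            // x too small to change 1.0
  const double logu = std::log(u);
  if (u == logu) return x;           // only true for u = +inf: log1p(inf) = inf
  return x * (logu / (u - 1.0));
}

// exp(x) - 1 using the same compensation idea.
inline double expm1_kahan(double x) {
  const double u = std::exp(x);
  if (u == 1.0) return x;            // x too small to change exp(x)
  const double um1 = u - 1.0;
  if (um1 == -1.0) return -1.0;      // exp(x) underflowed relative to 1
  const double logu = std::log(u);
  if (u == logu) return u;           // only true for u = +inf
  return um1 * (x / logu);
}
```

The extra comparison against infinity is the kind of branch that can slow down a scalar path, as noted above.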
2019-08-12 13:53:28 -07:00
Gael Guennebaud
f11364290e
ICC does not support -fno-unsafe-math-optimizations
2019-03-22 09:26:24 +01:00
Gael Guennebaud
1c09ee8541
bug #1674 : workaround clang fast-math aggressive optimizations
2019-02-22 15:48:53 +01:00
Gael Guennebaud
871e2e5339
bug #1674 : disable GCC's unsafe-math-optimizations in sin/cos vectorization (results are completely wrong otherwise)
2019-02-03 08:54:47 +01:00
Gael Guennebaud
4356a55a61
PR 571: Implements an accurate argument reduction algorithm for huge inputs of sin/cos and call it instead of falling back to std::sin/std::cos.
...
This makes both the small and huge argument cases faster because:
- for small inputs this removes the last pselect
- for large inputs only the reduction part follows a scalar path,
the rest use the same SIMD path as the small-argument case.
2019-01-14 13:54:01 +01:00
Gael Guennebaud
9005f0111f
Replace the compiler's alignas/alignof extensions with the respective C++11 keywords when available. This also fixes a compilation issue with gcc-4.7.
2019-01-11 17:10:54 +01:00
Gael Guennebaud
3f14e0d19e
fix warning
2019-01-09 15:45:21 +01:00
Gael Guennebaud
e6b217b8dd
bug #1652 : implements a much more accurate version of vectorized sin/cos. This new version achieves the same speed for SSE/AVX, and is slightly faster with FMA. Guarantees are as follows:
...
- no FMA: 1ULP up to 3pi, 2ULP up to sin(25966) and cos(18838), fallback to std::sin/cos for larger inputs
- FMA: 1ULP up to sin(117435.992) and cos(71476.0625), fallback to std::sin/cos for larger inputs
2019-01-09 15:25:17 +01:00
Gael Guennebaud
0f6f75bd8a
Implement a faster fix for sin/cos of large entries that also correctly handles INF input.
2018-12-23 17:26:21 +01:00
Gael Guennebaud
38d704def8
Make sure that psin/pcos return a number in [-1,1] for large inputs (though sin/cos on large entries is quite useless because it's inaccurate)
2018-12-23 16:13:24 +01:00
Gael Guennebaud
5713fb7feb
Fix plog(+INF): it returned ~87 instead of +INF
2018-12-23 15:40:52 +01:00
Gael Guennebaud
b477d60bc6
Extend the generic psin_float code to handle cosine and make SSE and AVX use it (-> this adds pcos for AVX)
2018-11-30 11:26:30 +01:00
Gael Guennebaud
b131a4db24
bug #1631 : fix compilation with ARM NEON and clang, and cleanup the weird pshiftright_and_cast and pcast_and_shiftleft functions.
2018-11-27 23:45:00 +01:00
Gael Guennebaud
a1a5fbbd21
Update pshiftleft to pass the shift as a true compile-time integer.
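A minimal scalar sketch of passing the shift as a template parameter, using a hypothetical name `shift_left_sketch` (the actual `pshiftleft` signature in Eigen is not shown in this log):

```cpp
#include <cstdint>

// The shift amount N is a template parameter, so it is a true compile-time
// constant and can be validated (or forwarded to an intrinsic that needs an
// immediate operand) at compile time.
template <int N>
inline std::uint32_t shift_left_sketch(std::uint32_t a) {
  static_assert(N >= 0 && N < 32, "shift out of range");
  return a << N;
}
```

A call then looks like `shift_left_sketch<4>(x)` rather than `shift_left(x, 4)`.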
2018-11-27 22:57:30 +01:00
Gael Guennebaud
fa7fd61eda
Unify SSE/AVX psin functions.
...
It is based on the SSE version which is much more accurate, though very slightly slower.
This changeset also includes the following required changes:
- add packet-float to packet-int type traits
- add packet float<->int reinterpret casts
- add faster pselect for AVX based on blendv
2018-11-27 22:41:51 +01:00
Gael Guennebaud
502f92fa10
Unify SSE and AVX pexp for double.
2018-11-26 23:12:44 +01:00
Gael Guennebaud
cf8b85d5c5
Unify SSE and AVX implementation of pexp
2018-11-26 16:36:19 +01:00
Gael Guennebaud
2c44c40114
First step toward a unification of packet log implementation, currently only SSE and AVX are unified.
...
To this end, I added the following functions: pzero, pcmp_*, pfrexp, and pset1frombits.
2018-11-26 14:21:24 +01:00