220 Commits

Author SHA1 Message Date
Charles Schlosser
1a2bfca8f0 Fix annoying warnings 2023-07-07 20:19:58 +00:00
Chip Kerchner
211c5dfc67 Add optional offset parameter to ploadu_partial and pstoreu_partial 2023-06-23 19:53:05 +00:00
Charles Schlosser
59b3ef5409 Partially Vectorize Cast 2023-06-09 16:54:31 +00:00
Rasmus Munk Larsen
a347dbbab2 Delete last few occurences of HasHalfPacket. 2023-04-19 10:36:59 -07:00
Rasmus Munk Larsen
0488b708b4 Vectorize tensor.isnan() by using typed predicates. 2023-03-16 04:04:22 +00:00
Rasmus Munk Larsen
ce62177b5b Vectorize atanh & add a missing definition and unit test for atan. 2023-02-21 03:14:05 +00:00
Charles Schlosser
94b19dc5f2 Add CArg 2023-02-15 21:33:06 +00:00
Rasmus Munk Larsen
37b2e97175 Tweak special case handling in atan2. 2023-01-31 17:48:00 -08:00
Charles Schlosser
0471e61b4c Optimize various mathematical packet ops 2023-01-28 01:34:26 +00:00
Charles Schlosser
6d9f662a70 Tweak atan2 2023-01-26 17:38:21 +00:00
Charles Schlosser
82b152dbe7 Add signbit function 2022-11-04 00:31:20 +00:00
Rasmus Munk Larsen
273e0c884e Revert "Add constexpr, test for C++14 constexpr." 2022-09-16 21:14:29 +00:00
Tobias Schlüter
133498c329 Add constexpr, test for C++14 constexpr. 2022-09-07 03:42:34 +00:00
Rasmus Munk Larsen
7064ed1345 Specialize psign<Packet8i> for AVX2, don't vectorize psign<bool>. 2022-08-26 17:02:37 +00:00
Rasmus Munk Larsen
97e0784dc6 Vectorize the sign operator in Eigen. 2022-08-09 19:54:57 +00:00
Chip Kerchner
84cf3ff18d Add pload_partial, pstore_partial (and unaligned versions), pgather_partial, pscatter_partial, loadPacketPartial and storePacketPartial. 2022-06-27 19:18:00 +00:00
b-shi
518fc321cb AVX512 Optimizations for Triangular Solve 2022-03-16 18:04:50 +00:00
Erik Schultheis
421cbf0866 Replace Eigen type metaprogramming with corresponding std types and make use of alias templates 2022-03-16 16:43:40 +00:00
Rasmus Munk Larsen
51311ec651 Remove inline assembly for FMA (AVX) and add remaining extensions as packet ops: pmsub, pnmadd, and pnmsub. 2022-01-26 04:25:41 +00:00
Rasmus Munk Larsen
ea2c02060c Add reciprocal packet op and fast specializations for float with SSE, AVX, and AVX512. 2022-01-21 23:49:18 +00:00
Antonio Sanchez
cb50730993 Default eigen_packet_wrapper constructor.
This makes it trivial, allowing use of `memcpy`.

Fixes #2326
2021-09-14 10:57:22 -07:00
Rasmus Munk Larsen
d7d0bf832d Issue an error in case of direct inclusion of internal headers. 2021-09-10 19:12:26 +00:00
Antonio Sanchez
3d98a6ef5c Modify scalar pzero, ptrue, pselect, and p<binary> operations to avoid memset.
The `memset` function and bitwise manipulation only apply to POD types
that do not require initialization, otherwise resulting in UB. We currently
violate this in `ptrue` and `pzero`, we assume bitmasks for `pselect`, and
bitwise operations are applied byte-by-byte in the generic implementations.

This is causing issues for scalar types that do require initialization
or that contain non-POD info such as pointers (#2201). We either break
them, or force specializations of these functions for custom scalars,
even if they are not vectorized.

Here we modify these functions for scalars only - instead using only
scalar operations:
- `pzero`: `Scalar(0)` for all scalars.
- `ptrue`: `Scalar(1)` for non-trivial scalars, bitset to one bits for trivial scalars.
- `pselect`: ternary select comparing mask to `Scalar(0)` for all scalars
- `pand`, `por`, `pxor`, `pnot`: use operators `&`, `|`, `^`, `~` for all integer or non-trivial scalars, otherwise apply bytewise.

For non-scalar types, the original implementations are used to maintain
compatibility and minimize the number of changes.

Fixes #2201.
2021-08-03 08:44:28 -07:00
Rasmus Munk Larsen
22edb46823 Align local arrays to Packet boundary. 2021-04-06 16:22:36 +00:00
Antonio Sanchez
29ebd84cb7 Fix NEON sqrt for 32-bit, add prsqrt.
With !406, we accidentally broke arm 32-bit NEON builds, since
`vsqrt_f32` is only available for 64-bit.

Here we add back the `rsqrt` implementation for 32-bit, relying
on a `prsqrt` implementation with better handling of edge cases.

Note that several of the 32-bit NEON packet tests are currently
failing - either due to denormal handling (NEON versions flush
to zero, but scalar paths don't) or due to accuracy (e.g. sin/cos).
2021-02-26 14:08:40 -08:00
Antonio Sanchez
4cb563a01e Fix ldexp implementations.
The previous implementations produced garbage values if the exponent did
not fit within the exponent bits.  See #2131 for a complete discussion,
and !375 for other possible implementations.

Here we implement the 4-factor version. See `pldexp_impl` in
`GenericPacketMathFunctions.h` for a full description.

The SSE `pcmp*` methods were moved down since `pcmp_le<Packet4i>`
requires `por`.

Left as a "TODO" is to delegate to a faster version if we know the
exponent does fit within the exponent bits.

Fixes #2131.
2021-02-10 22:45:41 +00:00
Antonio Sanchez
fb4548e27b Implement bit_* for device.
Unfortunately `std::bit_and` and the like are host-only functions prior
to c++14 (since they are not `constexpr`).  They also never exist in the
global namespace, so the current implementation  always fails to compile via
NVCC - since `EIGEN_USING_STD` tries to import the symbol from the global
namespace on device.

To overcome these limitations, we implement these functionals here.
2021-02-01 13:27:45 -08:00
Antonio Sanchez
b2126fd6b5 Fix pfrexp/pldexp for half.
The recent addition of vectorized pow (!330) relies on `pfrexp` and
`pldexp`.  This was missing for `Eigen::half` and `Eigen::bfloat16`.
Adding tests for these packet ops also exposed an issue with handling
negative values in `pfrexp`, returning an incorrect exponent.

Added the missing implementations, corrected the exponent in `pfrexp1`,
and added `packetmath` tests.
2021-01-21 19:32:28 +00:00
Antonio Sanchez
bde6741641 Improved std::complex sqrt and rsqrt.
Replaces `std::sqrt` with `complex_sqrt` for all platforms (previously
`complex_sqrt` was only used for CUDA and MSVC), and implements
custom `complex_rsqrt`.

Also introduces `numext::rsqrt` to simplify implementation, and modified
`numext::hypot` to adhere to IEEE IEC 6059 for special cases.

The `complex_sqrt` and `complex_rsqrt` implementations were found to be
significantly faster than `std::sqrt<std::complex<T>>` and
`1/numext::sqrt<std::complex<T>>`.

Benchmark file attached.
```
GCC 10, Intel Xeon, x86_64:
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
BM_Sqrt<std::complex<float>>           9.21 ns         9.21 ns     73225448
BM_StdSqrt<std::complex<float>>        17.1 ns         17.1 ns     40966545
BM_Sqrt<std::complex<double>>          8.53 ns         8.53 ns     81111062
BM_StdSqrt<std::complex<double>>       21.5 ns         21.5 ns     32757248
BM_Rsqrt<std::complex<float>>          10.3 ns         10.3 ns     68047474
BM_DivSqrt<std::complex<float>>        16.3 ns         16.3 ns     42770127
BM_Rsqrt<std::complex<double>>         11.3 ns         11.3 ns     61322028
BM_DivSqrt<std::complex<double>>       16.5 ns         16.5 ns     42200711

Clang 11, Intel Xeon, x86_64:
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
BM_Sqrt<std::complex<float>>           7.46 ns         7.45 ns     90742042
BM_StdSqrt<std::complex<float>>        16.6 ns         16.6 ns     42369878
BM_Sqrt<std::complex<double>>          8.49 ns         8.49 ns     81629030
BM_StdSqrt<std::complex<double>>       21.8 ns         21.7 ns     31809588
BM_Rsqrt<std::complex<float>>          8.39 ns         8.39 ns     82933666
BM_DivSqrt<std::complex<float>>        14.4 ns         14.4 ns     48638676
BM_Rsqrt<std::complex<double>>         9.83 ns         9.82 ns     70068956
BM_DivSqrt<std::complex<double>>       15.7 ns         15.7 ns     44487798

Clang 9, Pixel 2, aarch64:
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
BM_Sqrt<std::complex<float>>           24.2 ns         24.1 ns     28616031
BM_StdSqrt<std::complex<float>>         104 ns          103 ns      6826926
BM_Sqrt<std::complex<double>>          31.8 ns         31.8 ns     22157591
BM_StdSqrt<std::complex<double>>        128 ns          128 ns      5437375
BM_Rsqrt<std::complex<float>>          31.9 ns         31.8 ns     22384383
BM_DivSqrt<std::complex<float>>        99.2 ns         98.9 ns      7250438
BM_Rsqrt<std::complex<double>>         46.0 ns         45.8 ns     15338689
BM_DivSqrt<std::complex<double>>        119 ns          119 ns      5898944
```
2021-01-17 08:50:57 -08:00
Guoqiang QI
38ae5353ab 1)provide a better generic paddsub op implementation
2)make paddsub op support the Packet2cf/Packet4f/Packet2f in NEON
3)make paddsub op support the Packet2cf/Packet4f in SSE
2021-01-13 22:54:03 +00:00
Antonio Sanchez
c6efc4e0ba Replace M_LOG2E and M_LN2 with custom macros.
For these to exist we would need to define `_USE_MATH_DEFINES` before
`cmath` or `math.h` is first included.  However, we don't
control the include order for projects outside Eigen, so even defining
the macro in `Eigen/Core` does not fix the issue for projects that
end up including `<cmath>` before Eigen does (explicitly or transitively).

To fix this, we define `EIGEN_LOG2E` and `EIGEN_LN2` ourselves.
2020-12-11 14:34:31 -08:00
Rasmus Munk Larsen
125cc9a5df Implement vectorized complex square root.
Closes #1905

Measured speedup for sqrt of `complex<float>` on Skylake:

SSE:
```
name                      old time/op             new time/op  delta
BM_eigen_sqrt_ctype/1     49.4ns ± 0%             54.3ns ± 0%  +10.01%
BM_eigen_sqrt_ctype/8      332ns ± 0%               50ns ± 1%  -84.97%
BM_eigen_sqrt_ctype/64    2.81µs ± 1%             0.38µs ± 0%  -86.49%
BM_eigen_sqrt_ctype/512   23.8µs ± 0%              3.0µs ± 0%  -87.32%
BM_eigen_sqrt_ctype/4k     202µs ± 0%               24µs ± 2%  -88.03%
BM_eigen_sqrt_ctype/32k   1.63ms ± 0%             0.19ms ± 0%  -88.18%
BM_eigen_sqrt_ctype/256k  13.0ms ± 0%              1.5ms ± 1%  -88.20%
BM_eigen_sqrt_ctype/1M    52.1ms ± 0%              6.2ms ± 0%  -88.18%
```

AVX2:
```
name                      old cpu/op  new cpu/op  delta
BM_eigen_sqrt_ctype/1     53.6ns ± 0%  55.6ns ± 0%   +3.71%
BM_eigen_sqrt_ctype/8      334ns ± 0%    27ns ± 0%  -91.86%
BM_eigen_sqrt_ctype/64    2.79µs ± 0%  0.22µs ± 2%  -92.28%
BM_eigen_sqrt_ctype/512   23.8µs ± 1%   1.7µs ± 1%  -92.81%
BM_eigen_sqrt_ctype/4k     201µs ± 0%    14µs ± 1%  -93.24%
BM_eigen_sqrt_ctype/32k   1.62ms ± 0%  0.11ms ± 1%  -93.29%
BM_eigen_sqrt_ctype/256k  13.0ms ± 0%   0.9ms ± 1%  -93.31%
BM_eigen_sqrt_ctype/1M    52.0ms ± 0%   3.5ms ± 1%  -93.31%
```

AVX512:
```
name                      old cpu/op  new cpu/op  delta
BM_eigen_sqrt_ctype/1     53.7ns ± 0%  56.2ns ± 1%   +4.75%
BM_eigen_sqrt_ctype/8      334ns ± 0%    18ns ± 2%  -94.63%
BM_eigen_sqrt_ctype/64    2.79µs ± 0%  0.12µs ± 1%  -95.54%
BM_eigen_sqrt_ctype/512   23.9µs ± 1%   1.0µs ± 1%  -95.89%
BM_eigen_sqrt_ctype/4k     202µs ± 0%     8µs ± 1%  -96.13%
BM_eigen_sqrt_ctype/32k   1.63ms ± 0%  0.06ms ± 1%  -96.15%
BM_eigen_sqrt_ctype/256k  13.0ms ± 0%   0.5ms ± 4%  -96.11%
BM_eigen_sqrt_ctype/1M    52.1ms ± 0%   2.0ms ± 1%  -96.13%
```
2020-12-08 18:13:35 -08:00
Rasmus Munk Larsen
f9fac1d5b0 Add log2() to Eigen. 2020-12-04 21:45:09 +00:00
David Tellenbach
305b8bd277 Remove duplicate #if clause 2020-12-04 18:55:46 +01:00
shrek1402
a9a2f2bebf The function 'prefetch' did not work correctly on the win64 platform 2020-12-04 17:18:08 +00:00
Rasmus Munk Larsen
f23dc5b971 Revert "Add log2() operator to Eigen"
This reverts commit 4d91519a9be061da5d300079fca17dd0b9328050.
2020-12-03 14:32:45 -08:00
Rasmus Munk Larsen
4d91519a9b Add log2() operator to Eigen 2020-12-03 22:31:44 +00:00
Rasmus Munk Larsen
79818216ed Revert "Fix Half NaN definition and test."
This reverts commit c770746d709686ef2b8b652616d9232f9b028e78.
2020-11-24 12:57:28 -08:00
Rasmus Munk Larsen
c770746d70 Fix Half NaN definition and test.
The `half_float` test was failing with `-mcpu=cortex-a55` (native `__fp16`) due
to a bad NaN bit-pattern comparison (in the case of casting a float to `__fp16`,
the signaling `NaN` is quieted). There was also an inconsistency between
`numeric_limits<half>::quiet_NaN()` and `NumTraits::quiet_NaN()`.  Here we
correct the inconsistency and compare NaNs according to the IEEE 754
definition.

Also modified the `bfloat16_float` test to match.

Tested with `cortex-a53` and `cortex-a55`.
2020-11-24 20:53:07 +00:00
mehdi-goli
e24a1f57e3 [SYCL Function pointer Issue]: SYCL does not support function pointer inside the kernel, due to the portability issue of a function pointer and memory address space among host and accelerators. To fix the issue, function pointers have been replaced by function objects. 2020-11-12 01:50:28 +00:00
Deven Desai
011e0db31d Fix for ROCm/HIP breakage - 201013
The following commit seems to have introduced regressions in ROCm/HIP support.

183a208212

It causes some unit-tests to fail with the following error

```
...
Eigen/src/Core/GenericPacketMath.h:322:3: error: no member named 'bit_and' in the global namespace; did you mean 'std::bit_and'?
...
Eigen/src/Core/GenericPacketMath.h:329:3: error: no member named 'bit_or' in the global namespace; did you mean 'std::bit_or'?
...
Eigen/src/Core/GenericPacketMath.h:336:3: error: no member named 'bit_xor' in the global namespace; did you mean 'std::bit_xor'?
...
```

The error occurs because, when compiling the device code in HIP/CUDA, the compiler will pick up the some of the std functions (whose calls are prefixed by EIGEN_USING_STD) from the global namespace (i.e. use ::bit_xor instead of std::bit_xor). For this to work, those functions must be declared in the global namespace in the HIP/CUDA header files. The `bit_and`, `bit_or` and `bit_xor` routines are not declared in the HIP header file that contain the decls for the std math functions ( `math_functions.h` ), and this is the cause of the error above.

It seems that the newer HIP compilers do support the calling of `std::` math routines within device code, and the ideal fix here would have been to change all calls to std math functions in EIGEN to use the `std::` namespace (instead of the global namespace ), when compiling  with HIP compiler. However it seems there was a recent commit to remove the EIGEN_USING_STD_MATH macro and collapse it uses into the EIGEN_USING_STD macro ( 4091f6b25c ).

Replacing all std math calls will essentially require re-surrecting the EIGEN_USING_STD_MATH macro, so not choosing that option.

Also HIP compilers only have support std math calls within device code, and not all std functions (specifically not for malloc/free which are prefixed via EIGEN_USING_STD). So modyfing EIGEN_USE_STD implementation to use std:: namspace for HIP will not work either.

Hence going for the ugly solution of special casing the three calls that breaking the HIP compile, to explicitly use the std:: namespace
2020-10-15 12:17:35 +00:00
Rasmus Munk Larsen
208b3626d1 Revert generic implementation of predux, since it break compilation of predux_any with MSVC. 2020-10-14 21:41:28 +00:00
Rasmus Munk Larsen
c6953f799b Add packet generic ops predux_fmin, predux_fmin_nan, predux_fmax, and predux_fmax_nan that implement reductions with PropagateNaN, and PropagateNumbers semantics. Add (slow) generic implementations for most reductions. 2020-10-13 21:48:31 +00:00
Rasmus Munk Larsen
9a4d04c05f Make bitwise_helper a device function to unbreak GPU builds. 2020-10-10 01:45:20 +00:00
David Tellenbach
4091f6b25c Drop EIGEN_USING_STD_MATH in favour of EIGEN_USING_STD 2020-10-09 02:05:05 +02:00
Rasmus Munk Larsen
183a208212 Implement generic bitwise logical packet ops that work for all types. 2020-10-08 22:45:20 +00:00
Rasmus Munk Larsen
b431024404 Don't make assumptions about NaN-propagation for pmin/pmax - it various across platforms.
Change test to only test for NaN-propagation for pfmin/pfmax.
2020-10-07 19:05:18 +00:00
Rasmus Munk Larsen
3b445d9bf2 Add a generic packet ops corresponding to {std}::fmin and {std}::fmax. The non-sensical NaN-propagation rules for std::min std::max implemented by pmin and pmax in Eigen is a longstanding source og confusion and bug report. This change is a first step towards addressing it, as discussing in issue #564. 2020-10-01 16:54:31 +00:00
Pedro Caldeira
35d149e34c Add missing functions for Packet8bf in Altivec architecture.
Including new tests for bfloat16 Packets.
Fix prsqrt on GenericPacketMath.
2020-09-08 09:22:11 -05:00
Guoqiang QI
85428a3440 Add Neon psqrt<Packet2d> and pexp<Packet2d> 2020-09-08 09:04:03 +00:00