Activates vectorization of the Eigen::half versions of the tanh and
logistic functions when they run on Neon. Both functions convert their
inputs to float before computing the output, and as a result of this
commit, the conversions and the computation in float are vectorized.
The `Complex.h` file applies equally to HIP/CUDA, so placing under the
generic `GPU` folder.
The `TensorReductionCuda.h` has already been deprecated, now removing
for the next Eigen version.
The `Serializer<T>` class implements a binary serialization that
can write to (`serialize`) and read from (`deserialize`) a byte
buffer. Also added convenience routines for serializing
a list of arguments.
This will mainly be for testing, specifically to transfer data to
and from the GPU.
NVCC does not understand `__forceinline`, so we need to use `inline`
when compiling for GPU.
ICC specializes `std::complex` operators for `float` and `double`
by default, which cannot be used on device and conflict with Eigen's
workaround in CUDA/Complex.h. This can be prevented by defining
`_OVERRIDE_COMPLEX_SPECIALIZATION_` before including `<complex>`.
Added this define to the tests and to `Eigen/Core`, but this will
not work if the user includes `<complex>` before `<Eigen/Core>`.
ICC also seems to generate a duplicate `Map` symbol in
`PlainObjectBase`:
```
error: "Map" has already been declared in the current scope
static ConstMapType Map(const Scalar *data)
```
I tracked this down to `friend class Eigen::Map`. Putting the `friend`
statements at the bottom of the class seems to resolve this issue.
Fixes#2180
Clang does a poor job of optimizing the GEBP microkernel on 32-bit ARM,
leading to excessive 16-byte register spills, slowing down basic f32
matrix multiplication by approx 50%.
By specializing `gebp_traits`, we can eliminate the register spills.
Volatile inline ASM both acts as a barrier to prevent reordering and
enforces strict register use. In a simple f32 matrix multiply example,
this modification reduces 16-byte spills from 109 instances to zero,
leading to a 1.5x speed increase (search for `16-byte Spill` in the
assembly in https://godbolt.org/z/chsPbE).
This is a replacement of !379. See there for further discussion.
Also moved `gebp_traits` specializations for NEON to
`Eigen/src/Core/arch/NEON/GeneralBlockPanelKernel.h` to be alongside
other NEON-specific code.
Fixes#2138.
This patch adds support for Arm's new vector extension SVE (Scalable Vector Extension). In contrast to other vector extensions that are supported by Eigen, SVE types are inherently *sizeless*. For the use in Eigen we fix their size at compile-time (note that this is not necessary in general, SVE is *length agnostic*).
During compilation the flag `-msve-vector-bits=N` has to be set where `N` is a power of two in the range of `128`to `2048`, indicating the length of an SVE vector.
Since SVE is rather young, we decided to disable it by default even if it would be available. A user has to enable it explicitly by defining `EIGEN_ARM64_USE_SVE`.
This patch introduces the packet types `PacketXf` and `PacketXi` for packets of `float` and `int32_t` respectively. The size of these packets depends on the SVE vector length. E.g. if `-msve-vector-bits=512` is set, `PacketXf` will contain `512/32 = 16` elements.
This MR is joint work with Miguel Tairum <miguel.tairum@arm.com>.
This PR tries to fix an incorrect usage of `if defined(EIGEN_ARCH_PPC)`
in `Eigen/Core` header.
In `Eigen/src/Core/util/Macros.h`, EIGEN_ARCH_PPC was explicitly defined
as either 0 or 1. As a result `if defined(EIGEN_ARCH_PPC)` will always be true.
This causes issues when building on non PPC platform and `MatrixProduct.h` is not
available.
This fix changes `if defined(EIGEN_ARCH_PPC)` => `if EIGEN_ARCH_PPC`.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
This commit applies the following changes:
- Moving the `scamLauncher` specialization inside internal namespace to fix compiler crash on TensorScan for SYCL backend.
- Replacing `SYCL/sycl.hpp` to `CL/sycl.hpp` in order to follow SYCL 1.2.1 standard.
- minor fixes: commenting out an unused variable to avoid compiler warnings.
It is based on the SSE version which is much more accurate, though very slightly slower.
This changeset also includes the following required changes:
- add packet-float to packet-int type traits
- add packet float<->int reinterpret casts
- add faster pselect for AVX based on blendv
* Support compiling without IO streams
Add the preprocessor definition EIGEN_NO_IO which, if defined,
disables all use of the IO streams part of the standard library.
The major changes are
1. Moving CUDA/PacketMath.h to GPU/PacketMath.h
2. Moving CUDA/MathFunctions.h to GPU/MathFunction.h
3. Moving CUDA/CudaSpecialFunctions.h to GPU/GpuSpecialFunctions.h
The above three changes effectively enable the Eigen "Packet" layer for the HIP platform
4. Merging the "hip_basic" and "cuda_basic" unit tests into one ("gpu_basic")
5. Updating the "EIGEN_DEVICE_FUNC" marking in some places
The change has been tested on the HIP and CUDA platforms.