- The current implementation computes `size + total_threads`, which can
overflow and cause CUDA_ERROR_ILLEGAL_ADDRESS when size is close to
the maximum representable value.
- The num_blocks calculation can also overflow due to the implementation
of divup().
- This patch prevents these overflows and allows the kernel to work
correctly for the full representable range of tensor sizes.
- Also adds relevant tests.
The sum accuracy test currently uses the default test precision for
the given scalar type. However, scalars are generated via a normal
distribution, and given a large enough count and strong enough random
generator, the expected sum is zero. This causes the test to
periodically fail.
Here we estimate an upper-bound for the error as `sqrt(N) * prec` for
summing N values, with each having an approximate epsilon of `prec`.
Also fixed a few warnings generated by MSVC when compiling the
reduction test.
- The current implementation computes `size + total_threads`, which can
overflow and cause CUDA_ERROR_ILLEGAL_ADDRESS when size is close to
the maximum representable value.
- The num_blocks calculation can also overflow due to the implementation
of divup().
- This patch prevents these overflows and allows the kernel to work
correctly for the full representable range of tensor sizes.
- Also adds relevant tests.
For vectorized 1-dimensional inputs that do not take the special
blocking path (e.g. `std::complex<...>`), there was an
index-out-of-bounds error causing the broadcast size to be
computed incorrectly. Here we fix this, and make other minor
cleanup changes.
Fixes#2351.
The original test times out after 60 minutes on Windows, even when
setting flags to optimize for speed. Reducing the number of
contractions performed from 3600->27 for subtests 8,9 allow the
two to run in just over a minute each.
Some checks used incorrect values, partly from copy-paste errors,
partly from the change in behaviour introduced in !398.
Modified results to match scipy, simplified tests by updating
`VERIFY_IS_CWISE_APPROX` to work for scalars.
Without this flag, when compiling with nvcc, if the compute architecture of a card does
not exactly match any of those listed for `-gencode arch=compute_<arch>,code=sm_<arch>`,
then the kernel will fail to run with:
```
cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device.
```
This can happen, for example, when compiling with an older cuda version
that does not support a newer architecture (e.g. T4 is `sm_75`, but cuda
9.2 only supports up to `sm_70`).
With the `-arch=<arch>` flag, the code will compile and run at the
supplied architecture.
- Unify test/CMakeLists.txt and unsupported/test/CMakeLists.txt
- Added `EIGEN_CUDA_FLAGS` that are appended to the set of flags passed
to the cuda compiler (nvcc or clang).
The latter is to support passing custom flags (e.g. `-arch=` to nvcc,
or to disable cuda-specific warnings).
This is to make way for a new `Tuple` class that mimics `std::tuple`,
but can be reliably used on device and with aligned Eigen types.
The existing Tuple has very few references, and is actually an
analogue of `std::pair`.