- The current implementation computes `size + total_threads`, which can
overflow and cause CUDA_ERROR_ILLEGAL_ADDRESS when size is close to
the maximum representable value.
- The num_blocks calculation can also overflow due to the implementation
of divup().
- This patch prevents these overflows and allows the kernel to work
correctly for the full representable range of tensor sizes.
- Also adds relevant tests.
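
A minimal sketch of the overflow described above, assuming the usual round-up division written as `(x + y - 1) / y`. The helper names `divup_naive` and `divup_safe` are hypothetical and only illustrate the idea; they are not the code from this patch. Unsigned types are used so that the wrap-around is well defined on the host.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical naive round-up division: the intermediate sum x + y - 1
// wraps around when x is close to the maximum representable value.
template <typename T>
constexpr T divup_naive(T x, T y) {
  return static_cast<T>((x + y - 1) / y);
}

// Hypothetical overflow-safe variant for unsigned types: nothing is ever
// added to x, so no intermediate value can exceed x itself.
template <typename T>
constexpr T divup_safe(T x, T y) {
  return x == 0 ? T(0) : static_cast<T>((x - 1) / y + 1);
}
```

With `std::uint32_t`, a size of `max - 10`, and 1024 threads per block, the naive form wraps and yields 0 blocks, while the safe form yields 4194304. The same reasoning applies to a loop bound such as `size + total_threads`: comparing against the remaining distance (`size - i`) avoids forming the overflowing sum.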
Some checks used incorrect values, partly from copy-paste errors,
partly from the change in behaviour introduced in !398.
Modified results to match scipy, and simplified tests by updating
`VERIFY_IS_CWISE_APPROX` to work for scalars.
The fixes needed are:
* adding the EIGEN_DEVICE_FUNC attribute to a couple of funcs (else HIPCC will error out when non-device funcs are called from global/device funcs)
* switching to using ::<math_func> instead of std::<math_func> (only for HIPCC) in cases where the std::<math_func> is not recognized as a device func by HIPCC
* removing an errant "j" from a testcase (don't know how that made it in to begin with!)
Also, a few minor fixes for GPU tests running in HIP mode.
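
The first two fixes can be sketched roughly as follows. The names `SKETCH_DEVICE_FUNC`, `sketch_exp`, and `sketch_sigmoid` are hypothetical stand-ins, not Eigen's actual macros or functions; the sketch assumes that `__HIPCC__` is defined when compiling with HIPCC and falls back to a plain host build otherwise.

```cpp
#include <cassert>
#include <cmath>
#include <math.h>

#if defined(__HIPCC__)
#include <hip/hip_runtime.h>  // provides the __host__ / __device__ macros
#define SKETCH_DEVICE_FUNC __host__ __device__
#else
#define SKETCH_DEVICE_FUNC  // host-only build: no attribute needed
#endif

// Hypothetical math wrapper: use the global-namespace function under HIPCC,
// where the std:: overload may not be recognized as a device function.
template <typename T>
SKETCH_DEVICE_FUNC inline T sketch_exp(const T& x) {
#if defined(__HIPCC__)
  return ::exp(x);
#else
  return std::exp(x);
#endif
}

// A caller that may run on the device must carry the attribute as well,
// otherwise the compiler rejects the call into sketch_exp from device code.
template <typename T>
SKETCH_DEVICE_FUNC inline T sketch_sigmoid(const T& x) {
  return T(1) / (T(1) + sketch_exp(-x));
}
```

On a host-only compiler the macro expands to nothing, so the same source builds in both modes.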
1. Adding an include for hip/hip_runtime.h in the Macros.h file
For HIP, __host__ and __device__ are macros which are defined in hip headers.
Their definitions need to be included before their use in the file.
2. Fixing the compile failure in TensorContractionGpu introduced by the commit to
"Fuse computations into the Tensor contractions using output kernel"
3. Fixing a HIP/clang specific compile error by making the struct-member assignment explicit
This provides several advantages:
- more flexibility in designing unit tests
- unit tests can be glued to speed up compilation
- unit tests are compiled with the same predefined macros, which is a requirement for zapcc