Use reinterpret_cast on GPU for bit_cast.

This seems to be the recommended approach for doing type punning in
CUDA. See for example
- https://stackoverflow.com/questions/47037104/cuda-type-punning-memcpy-vs-ub-union
- https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/
(the latter puns a double to an `int2`; a sketch of that pattern is included below).
The issue is that for CUDA the `memcpy` is not elided, and ends up being
an expensive operation. We already have similar `reinterpret_cast`s across
the Eigen codebase for GPU (as does TensorFlow).
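To illustrate the pun from the NVIDIA post above: the post shuffles a double
across a warp by reinterpreting it as two 32-bit halves. A minimal sketch under
those assumptions, modernized to the `*_sync` shuffle intrinsics (the
`shfl_down_double` name is ours, not from the post):

    // Shuffle a double down the warp by punning it to an int2, as in the
    // "Faster Parallel Reductions on Kepler" post (the original used the
    // pre-Volta __shfl_down intrinsic).
    __device__ inline double shfl_down_double(double var, unsigned int delta) {
      int2 halves = *reinterpret_cast<int2*>(&var);  // pun: 64-bit double as two 32-bit ints
      halves.x = __shfl_down_sync(0xffffffffu, halves.x, delta);
      halves.y = __shfl_down_sync(0xffffffffu, halves.y, delta);
      return *reinterpret_cast<double*>(&halves);    // pun back to double
    }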
Author: Antonio Sanchez, 2021-10-08 11:30:09 -07:00 (committed by Rasmus Munk Larsen)
Parent: 24ebb37f38
Commit: 45e67a6fda

@@ -93,10 +93,18 @@ EIGEN_STRONG_INLINE EIGEN_DEVICE_FUNC Tgt bit_cast(const Src& src) {
 #endif
   EIGEN_STATIC_ASSERT(sizeof(Src) == sizeof(Tgt), THIS_TYPE_IS_NOT_SUPPORTED);
-  Tgt tgt;
-  EIGEN_USING_STD(memcpy)
-  memcpy(&tgt, &src, sizeof(Tgt));
-  return tgt;
+  // On GPU, the standard memcpy approach is not elided, actually producing an
+  // expensive memcpy. The standard (as used by the CUDA library, and suggested
+  // in multiple forums) seems to be to violate strict aliasing rules.
+#if defined(EIGEN_GPU_COMPILE_PHASE)
+  return *reinterpret_cast<const Tgt*>(&src);
+#else
+  Tgt tgt;
+  EIGEN_USING_STD(memcpy)
+  memcpy(&tgt, &src, sizeof(Tgt));
+  return tgt;
+#endif
 }
 } // namespace numext
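For reference, a small host-side usage sketch of the `bit_cast` being patched
(the `<Eigen/Core>` include path is an assumption; on host the `memcpy` branch
runs, while inside a GPU kernel the new `reinterpret_cast` branch is taken):

    #include <cstdint>
    #include <cstdio>
    #include <Eigen/Core>  // assumed include providing Eigen::numext::bit_cast

    int main() {
      const float f = 1.0f;
      // Same-size reinterpretation of the value's bits; on host this compiles
      // to the memcpy branch, which optimizers elide on CPU targets.
      const std::uint32_t bits = Eigen::numext::bit_cast<std::uint32_t>(f);
      std::printf("0x%08x\n", bits);  // IEEE-754 bits of 1.0f: 0x3f800000
      return 0;
    }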