mirror of
https://gitlab.com/libeigen/eigen.git
Use reinterpret_cast on GPU for bit_cast.
This seems to be the recommended approach for doing type punning in CUDA. See for example:

- https://stackoverflow.com/questions/47037104/cuda-type-punning-memcpy-vs-ub-union
- https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/ (the latter puns a double to an `int2`; see the sketch below)

The issue is that for CUDA, the `memcpy` is not elided, and ends up being an expensive operation. We already have similar `reinterpret_cast`s across the Eigen codebase for GPU (as does TensorFlow).
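For context, the `double`-to-`int2` pun from the second link looks roughly like the following. This is a hedged sketch, not code from this commit: the helper name `shfl_down_double` is hypothetical, and the original post used the pre-Volta `__shfl_down` rather than `__shfl_down_sync`.

```cuda
// Shuffle a double down the warp by moving its two 32-bit halves,
// punning through reinterpret_cast instead of memcpy.
__device__ double shfl_down_double(double value, unsigned delta) {
  // Pun the 64-bit double into two 32-bit halves.
  int2 halves = *reinterpret_cast<int2*>(&value);
  // Shuffle each half down the warp independently.
  halves.x = __shfl_down_sync(0xffffffff, halves.x, delta);
  halves.y = __shfl_down_sync(0xffffffff, halves.y, delta);
  // Pun the pair back into a double.
  return *reinterpret_cast<double*>(&halves);
}
```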
This commit is contained in:
parent 24ebb37f38
commit 45e67a6fda
@@ -93,10 +93,18 @@ EIGEN_STRONG_INLINE EIGEN_DEVICE_FUNC Tgt bit_cast(const Src& src) {

```cpp
#endif

  EIGEN_STATIC_ASSERT(sizeof(Src) == sizeof(Tgt), THIS_TYPE_IS_NOT_SUPPORTED);

  // On GPU, the standard memcpy approach is not elided, actually producing an
  // expensive memcpy. The standard (as used by the CUDA library, and suggested
  // in multiple forums) seems to be to violate strict aliasing rules.
#if defined(EIGEN_GPU_COMPILE_PHASE)
  return *reinterpret_cast<const Tgt*>(&src);
#else
  Tgt tgt;
  EIGEN_USING_STD(memcpy)
  memcpy(&tgt, &src, sizeof(Tgt));
  return tgt;
#endif
}
}  // namespace numext
```
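As a usage illustration (not part of the commit), calling the patched function from a CUDA translation unit might look like this; `float_bits` is a hypothetical helper:

```cuda
#include <cstdint>
#include <Eigen/Core>

// Hypothetical helper: extract the raw bit pattern of a float. After this
// commit, the device compile phase takes the reinterpret_cast branch instead
// of emitting a real memcpy; the host phase keeps the well-defined memcpy,
// which host compilers elide.
__host__ __device__ inline std::uint32_t float_bits(float x) {
  return Eigen::numext::bit_cast<std::uint32_t>(x);  // 1.0f -> 0x3f800000
}
```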