Make clipping outside [-18:18] consistent for vectorized and non-vectorized paths of scalar_logistic_op<float>.
Approved-by: Gael Guennebaud <g.gael@free.fr>
This helps avoids a conflict on certain Windows toolchains
(potentially due to some ADL name resolution bug) in the case
where aligned_free is defined in the global namespace. In any
case, tightening this up is harmless.
This is also important to make sure that A.conjugate() * B.conjugate() does not evaluate
its arguments into temporaries (e.g., if A and B are fixed and small, or * fall back to lazyProduct)
The row-major matrix-vector multiplication code uses a threshold to
check if processing 8 rows at a time would thrash the cache.
This change introduces two modifications to this logic.
1. A smaller threshold for ARM and ARM64 devices.
The value of this threshold was determined empirically using a Pixel2
phone, by benchmarking a large number of matrix-vector products in the
range [1..4096]x[1..4096] and measuring performance separately on
small and little cores with frequency pinning.
On big (out-of-order) cores, this change has little to no impact. But
on the small (in-order) cores, the matrix-vector products are up to
700% faster. Especially on large matrices.
The motivation for this change was some internal code at Google which
was using hand-written NEON for implementing similar functionality,
processing the matrix one row at a time, which exhibited substantially
better performance than Eigen.
With the current change, Eigen handily beats that code.
2. Make the logic for choosing number of simultaneous rows apply
unifiormly to 8, 4 and 2 rows instead of just 8 rows.
Since the default threshold for non-ARM devices is essentially
unchanged (32000 -> 32 * 1024), this change has no impact on non-ARM
performance. This was verified by running the same set of benchmarks
on a Xeon desktop.