5172 Commits

Author SHA1 Message Date
Gael Guennebaud
5c366fe1d7 Merged in rmlarsen/eigen (pull request PR-230)
Fix a bug in psqrt for SSE and AVX when EIGEN_FAST_MATH=1
2016-10-12 16:30:51 +00:00
Robert Lukierski
86711497c4 Adding EIGEN_DEVICE_FUNC in the Geometry module.
Additional CUDA necessary fixes in the Core (mostly usage of
EIGEN_USING_STD_MATH).
2016-10-12 16:35:17 +01:00
Rasmus Munk Larsen
47150af1c8 Fix copy-paste error: Must use _mm256_cmp_ps for AVX. 2016-10-12 08:34:39 -07:00
Gael Guennebaud
89e315152c bug #1325: fix compilation on NEON with clang 2016-10-12 16:55:47 +02:00
Benoit Steiner
5727e4d89c Reenabled the use of variadic templates on tegra x1 provides that the latest version (i.e. JetPack 2.3) is used. 2016-10-08 22:19:03 +00:00
Benoit Steiner
5c68051cd7 Merge the content of the ComputeCpp branch into the default branch 2016-10-07 11:04:16 -07:00
Gael Guennebaud
4860727ac2 Remove static qualifier of free-functions (inline is enough and this helps ICC to find the right overload) 2016-10-07 09:21:12 +02:00
Benoit Steiner
507b661106 Renamed predux_half into predux_downto4 2016-10-06 17:57:04 -07:00
Benoit Steiner
a498ff7df6 Fixed incorrect comment 2016-10-06 15:27:27 -07:00
Benoit Steiner
a7473d6d5a Fixed compilation error with gcc >= 5.3 2016-10-06 14:33:22 -07:00
Benoit Steiner
5e64cea896 Silenced a compilation warning 2016-10-06 14:24:17 -07:00
Benoit Steiner
d485d12c51 Added missing AVX intrinsics for fp16: in particular, implemented predux which is required by the matrix-vector code. 2016-10-06 10:41:03 -07:00
Rasmus Munk Larsen
48c635e223 Add a simple cost model to prevent Eigen's parallel GEMM from using too many threads when the inner dimension is small.
Timing for square matrices is unchanged, but both CPU and Wall time are significantly improved for skinny matrices. The benchmarks below are for multiplying NxK * KxN matrices with test names of the form BM_OuterishProd/N/K.

Improvements in Wall time:

Run on [redacted] (12 X 3501 MHz CPUs); 2016-10-05T17:40:02.462497196-07:00
CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_OuterishProd/64/1                    3088      1610    +47.9%
BM_OuterishProd/64/4                    3562      2414    +32.2%
BM_OuterishProd/64/32                   8861      7815    +11.8%
BM_OuterishProd/128/1                  11363      6504    +42.8%
BM_OuterishProd/128/4                  11128      9794    +12.0%
BM_OuterishProd/128/64                 27691     27396     +1.1%
BM_OuterishProd/256/1                  33214     28123    +15.3%
BM_OuterishProd/256/4                  34312     36818     -7.3%
BM_OuterishProd/256/128               174866    176398     -0.9%
BM_OuterishProd/512/1                7963684    104224    +98.7%
BM_OuterishProd/512/4                7987913    112867    +98.6%
BM_OuterishProd/512/256              8198378   1306500    +84.1%
BM_OuterishProd/1k/1                 7356256    324432    +95.6%
BM_OuterishProd/1k/4                 8129616    331621    +95.9%
BM_OuterishProd/1k/512              27265418   7517538    +72.4%

Improvements in CPU time:

Run on [redacted] (12 X 3501 MHz CPUs); 2016-10-05T17:40:02.462497196-07:00
CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_OuterishProd/64/1                    6169      1608    +73.9%
BM_OuterishProd/64/4                    7117      2412    +66.1%
BM_OuterishProd/64/32                  17702     15616    +11.8%
BM_OuterishProd/128/1                  45415      6498    +85.7%
BM_OuterishProd/128/4                  44459      9786    +78.0%
BM_OuterishProd/128/64                110657    109489     +1.1%
BM_OuterishProd/256/1                 265158     28101    +89.4%
BM_OuterishProd/256/4                 274234    183885    +32.9%
BM_OuterishProd/256/128              1397160   1408776     -0.8%
BM_OuterishProd/512/1               78947048    520703    +99.3%
BM_OuterishProd/512/4               86955578   1349742    +98.4%
BM_OuterishProd/512/256             74701613  15584661    +79.1%
BM_OuterishProd/1k/1                78352601   3877911    +95.1%
BM_OuterishProd/1k/4                78521643   3966221    +94.9%
BM_OuterishProd/1k/512              258104736  89480530    +65.3%
2016-10-06 10:33:10 -07:00
Benoit Steiner
9f3276981c Enabling AVX512 should also enable AVX2. 2016-10-06 10:29:48 -07:00
Gael Guennebaud
80b5133789 Fix compilation of qr.inverse() for column and full pivoting variants. 2016-10-06 09:55:50 +02:00
Benoit Steiner
4131074818 Deleted unecessary CMakeLists.txt file 2016-10-05 18:54:35 -07:00
Benoit Steiner
cb5cd69872 Silenced a compilation warning. 2016-10-05 18:50:53 -07:00
Benoit Steiner
78b569f685 Merged latest updates from trunk 2016-10-05 18:48:55 -07:00
Benoit Steiner
9c2b6c049b Silenced a few compilation warnings 2016-10-05 18:37:31 -07:00
Benoit Steiner
ae1385c7e4 Pull the latest updates from trunk 2016-10-05 14:54:36 -07:00
Benoit Steiner
698ff69450 Properly characterize the CUDA packet primitives for fp16 as device only 2016-10-04 16:53:30 -07:00
Rasmus Munk Larsen
7f67e6dfdb Update comment for fast sqrt. 2016-10-04 15:09:11 -07:00
Rasmus Munk Larsen
765615609d Update comment for fast sqrt. 2016-10-04 15:08:41 -07:00
Rasmus Munk Larsen
3ed67cb0bb Fix a bug in the implementation of Carmack's fast sqrt algorithm in Eigen (enabled by EIGEN_FAST_MATH), which causes the vectorized parts of the computation to return -0.0 instead of NaN for negative arguments.
Benchmark speed in Giga-sqrts/s
Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
-----------------------------------------
                    SSE        AVX
Fast=1              2.529G     4.380G
Fast=0              1.944G     1.898G
Fast=1 fixed        2.214G     3.739G

This table illustrates the worst case in terms speed impact: It was measured by repeatedly computing the sqrt of an n=4096 float vector that fits in L1 cache. For large vectors the operation becomes memory bound and the differences between the different versions almost negligible.
2016-10-04 14:22:56 -07:00
Benoit Steiner
881b90e984 Use explicit type casting to generate packets of zeros. 2016-10-04 08:23:38 -07:00
Benoit Steiner
409e887d78 Added support for constand std::complex numbers on GPU 2016-10-03 11:06:24 -07:00
Gael Guennebaud
9d6d0dff8f bug #1317: fix performance regression with some Block expressions and clang by helping it to remove dead code.
The trick is to get rid of the nested expression in the evaluator by copying only the required information (here, the strides).
2016-10-01 15:37:00 +02:00
Gael Guennebaud
8b84801f7f bug #1310: workaround a compilation regression from 3.2 regarding triangular * homogeneous 2016-09-30 22:49:59 +02:00
Gael Guennebaud
67b4f45836 Fix angle range 2016-09-30 12:46:33 +02:00
Gael Guennebaud
27f3970453 Remove std:: prefix 2016-09-30 12:40:41 +02:00
Gael Guennebaud
3860a0bc8f bug #1312: Quaternion to AxisAngle conversion now ensures the angle will be in the range [-pi,pi]. This also increases accuracy when q.w is negative. 2016-09-29 23:23:35 +02:00
Gael Guennebaud
33500050c3 bug #1308: fix compilation of some small products involving nullary-expressions. 2016-09-29 09:40:44 +02:00
Benoit Steiner
27d7628f16 Updated the list of warnings to reflect the new message ids introduced in cuda 8.0 2016-09-28 17:42:59 -07:00
Gael Guennebaud
f3a00dd2b5 Merged in sergiu/eigen (pull request PR-229)
Disabled MSVC level 4 warning C4714
2016-09-27 09:28:08 +02:00
Gael Guennebaud
892afb9416 Add debug info. 2016-09-26 23:53:57 +02:00
Gael Guennebaud
779774f98c bug #1311: fix alignment logic in some cases of (scalar*small).lazyProduct(small) 2016-09-26 23:53:40 +02:00
Gael Guennebaud
48dfe98abd bug #1308: fix compilation of vector * rowvector::nullary. 2016-09-25 14:54:35 +02:00
Sergiu Deitsch
fe29157d02 disabled MSVC level 4 warning C4714
The level 4 warning (/W4) warns about functions marked as __forceinline not
inlined, and generates a lot of noise.
2016-09-25 14:25:47 +02:00
Gael Guennebaud
86caba838d bug #1304: fix Projective * scaling and Projective *= scaling 2016-09-23 13:41:21 +02:00
Benoit Steiner
2a69290ddb Added a specialization of Eigen::numext::real and Eigen::numext::imag for std::complex<T> to be used when compiling a cuda kernel. This is unfortunately necessary to be able to process complex numbers from a CUDA kernel on MacOS. 2016-09-22 15:52:23 -07:00
Gael Guennebaud
77e27fbeee bump to 3.3-rc1 2016-09-22 22:37:39 +02:00
Gael Guennebaud
2ada122bc6 merge 2016-09-22 22:33:18 +02:00
Gael Guennebaud
8f2bdde373 merge 2016-09-22 22:32:55 +02:00
Gael Guennebaud
ba0f844d6b Backout changeset ce3557ca69742af477546d031d644a6dab1ff614 2016-09-22 22:28:51 +02:00
Benoit Steiner
50e3bbfc90 Calls x.imag() instead of imag(x) when x is a complex number since the former
is a constexpr while the later isn't. This fixes compilation errors triggered by nvcc on Mac.
2016-09-22 13:17:25 -07:00
Gael Guennebaud
ca3746c6f8 Bypass identity reflectors. 2016-09-22 22:07:13 +02:00
Felix Gruber
8bde7da086 fix documentation of LinSpaced
The index of the highest value in a LinSpace is size-1.
2016-09-22 14:50:07 +02:00
Gael Guennebaud
66cbabafed Add a note regarding gcc bug #72867 2016-09-22 11:18:52 +02:00
Gael Guennebaud
9fa2c8650e Fix alignement of statically allocated temporaries in symv, and trmv. 2016-09-21 17:34:24 +02:00
Gael Guennebaud
ac5377e161 Improve cost estimation of complex division 2016-09-21 17:26:04 +02:00