Rasmus Munk Larsen
|
48c635e223
|
Add a simple cost model to prevent Eigen's parallel GEMM from using too many threads when the inner dimension is small.
Timing for square matrices is unchanged, but both CPU and Wall time are significantly improved for skinny matrices. The benchmarks below are for multiplying NxK * KxN matrices with test names of the form BM_OuterishProd/N/K.
Improvements in Wall time:
Run on [redacted] (12 X 3501 MHz CPUs); 2016-10-05T17:40:02.462497196-07:00
CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB
Benchmark                 Base (ns)   New (ns)  Improvement
-----------------------------------------------------------
BM_OuterishProd/64/1           3088       1610       +47.9%
BM_OuterishProd/64/4           3562       2414       +32.2%
BM_OuterishProd/64/32          8861       7815       +11.8%
BM_OuterishProd/128/1         11363       6504       +42.8%
BM_OuterishProd/128/4         11128       9794       +12.0%
BM_OuterishProd/128/64        27691      27396        +1.1%
BM_OuterishProd/256/1         33214      28123       +15.3%
BM_OuterishProd/256/4         34312      36818        -7.3%
BM_OuterishProd/256/128      174866     176398        -0.9%
BM_OuterishProd/512/1       7963684     104224       +98.7%
BM_OuterishProd/512/4       7987913     112867       +98.6%
BM_OuterishProd/512/256     8198378    1306500       +84.1%
BM_OuterishProd/1k/1        7356256     324432       +95.6%
BM_OuterishProd/1k/4        8129616     331621       +95.9%
BM_OuterishProd/1k/512     27265418    7517538       +72.4%
Improvements in CPU time:
Run on [redacted] (12 X 3501 MHz CPUs); 2016-10-05T17:40:02.462497196-07:00
CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB
Benchmark                 Base (ns)   New (ns)  Improvement
-----------------------------------------------------------
BM_OuterishProd/64/1           6169       1608       +73.9%
BM_OuterishProd/64/4           7117       2412       +66.1%
BM_OuterishProd/64/32         17702      15616       +11.8%
BM_OuterishProd/128/1         45415       6498       +85.7%
BM_OuterishProd/128/4         44459       9786       +78.0%
BM_OuterishProd/128/64       110657     109489        +1.1%
BM_OuterishProd/256/1        265158      28101       +89.4%
BM_OuterishProd/256/4        274234     183885       +32.9%
BM_OuterishProd/256/128     1397160    1408776        -0.8%
BM_OuterishProd/512/1      78947048     520703       +99.3%
BM_OuterishProd/512/4      86955578    1349742       +98.4%
BM_OuterishProd/512/256    74701613   15584661       +79.1%
BM_OuterishProd/1k/1       78352601    3877911       +95.1%
BM_OuterishProd/1k/4       78521643    3966221       +94.9%
BM_OuterishProd/1k/512    258104736   89480530       +65.3%
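A minimal sketch of what such a cost model could look like, not the code from this commit: the helper name `pick_thread_count` and the `min_task_size` default of 50000 are assumptions made for illustration. The idea is to estimate the work of an NxK * KxN product as rows * cols * depth and to allow at most one thread per fixed chunk of that work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: caps the thread count for a
// (rows x depth) * (depth x cols) product by the work available.
// min_task_size is an assumed tuning constant, not a value from the commit.
inline int pick_thread_count(std::int64_t rows, std::int64_t cols,
                             std::int64_t depth, int available_threads,
                             double min_task_size = 50000.0) {
  assert(rows > 0 && cols > 0 && depth > 0);
  assert(available_threads >= 1 && min_task_size > 0.0);
  // Work estimate: one multiply-add per (row, col, depth) triple.
  const double work = static_cast<double>(rows) *
                      static_cast<double>(cols) *
                      static_cast<double>(depth);
  // Allow one thread per min_task_size units of work, but at least one.
  const double by_work = std::max(1.0, work / min_task_size);
  // Never exceed the number of threads the caller makes available.
  return static_cast<int>(std::min<double>(by_work, available_threads));
}
```

With these assumed constants, a 64x1 * 1x64 product stays on a single thread, while a 1024x512 * 512x1024 product may still use every available thread.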
|
2016-10-06 10:33:10 -07:00 |
|
Gael Guennebaud
|
80b5133789
|
Fix compilation of qr.inverse() for column and full pivoting variants.
|
2016-10-06 09:55:50 +02:00 |
|
Benoit Steiner
|
4387433acf
|
Increased the robustness of the reduction tests on fp16
|
2016-10-05 10:42:41 -07:00 |
|
Benoit Steiner
|
aad20d700d
|
Increase the tolerance to numerical noise.
|
2016-10-05 10:39:24 -07:00 |
|
Benoit Steiner
|
8b69d5d730
|
::rand() returns a signed integer on win32
|
2016-10-05 08:55:02 -07:00 |
|
Benoit Steiner
|
ed7a220b04
|
Fixed a typo that impacts windows builds
|
2016-10-05 08:51:31 -07:00 |
|
Benoit Steiner
|
ceee1c008b
|
Silenced compilation warning
|
2016-10-04 18:47:53 -07:00 |
|
Benoit Steiner
|
698ff69450
|
Properly characterize the CUDA packet primitives for fp16 as device only
|
2016-10-04 16:53:30 -07:00 |
|
Benoit Steiner
|
6af5ac7e27
|
Clean up the cuda executor code.
|
2016-10-04 08:52:13 -07:00 |
|
Benoit Steiner
|
2f6d1607c8
|
Cleaned up the random number generation code.
|
2016-10-04 08:38:23 -07:00 |
|
Benoit Steiner
|
881b90e984
|
Use explicit type casting to generate packets of zeros.
|
2016-10-04 08:23:38 -07:00 |
|
Benoit Steiner
|
616a7a1912
|
Improved support for compiling CUDA code with clang as the host compiler
|
2016-10-03 17:09:33 -07:00 |
|
Benoit Steiner
|
409e887d78
|
Added support for constant std::complex numbers on GPU
|
2016-10-03 11:06:24 -07:00 |
|
Gael Guennebaud
|
9d6d0dff8f
|
bug #1317: fix performance regression with some Block expressions and clang by helping it to remove dead code.
The trick is to get rid of the nested expression in the evaluator by copying only the required information (here, the strides).
|
2016-10-01 15:37:00 +02:00 |
|
Gael Guennebaud
|
8b84801f7f
|
bug #1310: work around a compilation regression from 3.2 regarding triangular * homogeneous
|
2016-09-30 22:49:59 +02:00 |
|
Gael Guennebaud
|
67b4f45836
|
Fix angle range
|
2016-09-30 12:46:33 +02:00 |
|
Gael Guennebaud
|
27f3970453
|
Remove std:: prefix
|
2016-09-30 12:40:41 +02:00 |
|
Gael Guennebaud
|
3860a0bc8f
|
bug #1312: Quaternion to AxisAngle conversion now ensures the angle will be in the range [-pi,pi]. This also increases accuracy when q.w is negative.
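One common way to obtain a bounded angle is sketched below in plain C++, without Eigen types. This is an illustration under assumptions, not the code from this commit: the struct `AxisAngleSketch` and the function `to_axis_angle` are hypothetical names, and the input is assumed to be a unit quaternion. Using `atan2` on the vector norm and `|w|`, then flipping the axis when `w` is negative, yields an angle in [0, pi], which lies inside the stated range.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical plain-C++ stand-in for an axis-angle pair.
struct AxisAngleSketch {
  double angle;    // radians
  double axis[3];  // unit vector
};

// Converts a unit quaternion (w, x, y, z) to axis-angle form.
inline AxisAngleSketch to_axis_angle(double w, double x, double y, double z) {
  AxisAngleSketch r{0.0, {1.0, 0.0, 0.0}};  // identity rotation fallback
  double n = std::sqrt(x * x + y * y + z * z);
  assert(std::fabs(w * w + n * n - 1.0) < 1e-9);  // unit quaternion assumed
  if (n > 0.0) {
    // atan2 with |w| keeps the angle in [0, pi].
    r.angle = 2.0 * std::atan2(n, std::fabs(w));
    // q and -q encode the same rotation: flip the axis when w is negative.
    if (w < 0.0) n = -n;
    r.axis[0] = x / n;
    r.axis[1] = y / n;
    r.axis[2] = z / n;
  }
  return r;
}
```

For example, the quaternion (w = -0.5, z = sqrt(3)/2) describes a 240 degree turn about +z; the sketch reports it as a 120 degree turn about -z.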
|
2016-09-29 23:23:35 +02:00 |
|
Gael Guennebaud
|
33500050c3
|
bug #1308: fix compilation of some small products involving nullary-expressions.
|
2016-09-29 09:40:44 +02:00 |
|
Benoit Steiner
|
27d7628f16
|
Updated the list of warnings to reflect the new message ids introduced in cuda 8.0
|
2016-09-28 17:42:59 -07:00 |
|
Benoit Steiner
|
2bda1b0d93
|
Updated the tensor sum and mean reducer to enable them to process complex numbers on cuda gpus.
|
2016-09-28 17:08:41 -07:00 |
|
Gael Guennebaud
|
f3a00dd2b5
|
Merged in sergiu/eigen (pull request PR-229)
Disabled MSVC level 4 warning C4714
|
2016-09-27 09:28:08 +02:00 |
|
Gael Guennebaud
|
892afb9416
|
Add debug info.
|
2016-09-26 23:53:57 +02:00 |
|
Gael Guennebaud
|
779774f98c
|
bug #1311: fix alignment logic in some cases of (scalar*small).lazyProduct(small)
|
2016-09-26 23:53:40 +02:00 |
|
Benoit Steiner
|
6565f8d60f
|
Made the initialization of a CUDA device thread safe.
|
2016-09-26 11:00:32 -07:00 |
|
Gael Guennebaud
|
48dfe98abd
|
bug #1308: fix compilation of vector * rowvector::nullary.
|
2016-09-25 14:54:35 +02:00 |
|
Sergiu Deitsch
|
fe29157d02
|
disabled MSVC level 4 warning C4714
The level 4 warning (/W4) warns about functions marked as __forceinline not
inlined, and generates a lot of noise.
|
2016-09-25 14:25:47 +02:00 |
|
Gael Guennebaud
|
86caba838d
|
bug #1304: fix Projective * scaling and Projective *= scaling
|
2016-09-23 13:41:21 +02:00 |
|
Gael Guennebaud
|
b9f7a17e47
|
Add missing file.
|
2016-09-23 10:26:08 +02:00 |
|
Benoit Steiner
|
1301d744f8
|
Made the gaussian generator usable on GPU
|
2016-09-22 19:04:44 -07:00 |
|
Benoit Steiner
|
2a69290ddb
|
Added a specialization of Eigen::numext::real and Eigen::numext::imag for std::complex<T> to be used when compiling a cuda kernel. This is unfortunately necessary to be able to process complex numbers from a CUDA kernel on MacOS.
|
2016-09-22 15:52:23 -07:00 |
|
Gael Guennebaud
|
3946768916
|
Added tag 3.3-rc1 for changeset 77e27fbeee7acb289d7df809fc09a8cc8ee94eb7
|
2016-09-22 22:38:36 +02:00 |
|
Gael Guennebaud
|
77e27fbeee
|
bump to 3.3-rc1
3.3-rc1
|
2016-09-22 22:37:39 +02:00 |
|
Gael Guennebaud
|
2ada122bc6
|
merge
|
2016-09-22 22:33:18 +02:00 |
|
Gael Guennebaud
|
8f2bdde373
|
merge
|
2016-09-22 22:32:55 +02:00 |
|
Gael Guennebaud
|
ba0f844d6b
|
Backout changeset ce3557ca69742af477546d031d644a6dab1ff614
|
2016-09-22 22:28:51 +02:00 |
|
Gael Guennebaud
|
9bcdc8b756
|
Add a nullary-functor example performing index-based sub-matrices.
|
2016-09-22 22:27:54 +02:00 |
|
Benoit Steiner
|
50e3bbfc90
|
Calls x.imag() instead of imag(x) when x is a complex number since the former
is a constexpr while the latter isn't. This fixes compilation errors triggered by nvcc on Mac.
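A small host-side illustration of the member accessor in a constant expression. It assumes a C++11 or newer compiler and says nothing about nvcc's behaviour; `kValue`, `kImag` and `imag_of_constant` are names made up for this sketch.

```cpp
#include <complex>

// The std::complex<float> constructor and its imag() member are constexpr
// since C++11, so both can appear in constant expressions.
constexpr std::complex<float> kValue(1.0f, 2.0f);
constexpr float kImag = kValue.imag();

static_assert(kImag == 2.0f, "imag() is evaluated at compile time");

// Hypothetical accessor so the constant can also be checked at run time.
inline float imag_of_constant() { return kImag; }
```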
|
2016-09-22 13:17:25 -07:00 |
|
Gael Guennebaud
|
ca3746c6f8
|
Bypass identity reflectors.
|
2016-09-22 22:07:13 +02:00 |
|
Felix Gruber
|
8bde7da086
|
fix documentation of LinSpaced
The index of the highest value in a LinSpaced is size-1.
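The indexing can be illustrated with a plain C++ stand-in. This is a sketch that mimics the documented semantics, not Eigen's implementation: `lin_spaced` is a hypothetical helper, and returning `high` for a one-element result is a choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: `size` evenly spaced values from low to high,
// with low at index 0 and high at index size-1.
inline std::vector<double> lin_spaced(std::size_t size, double low, double high) {
  assert(size >= 1);
  std::vector<double> v(size);
  if (size == 1) {
    v[0] = high;  // single-element case: sketch returns the upper bound
    return v;
  }
  const double step = (high - low) / static_cast<double>(size - 1);
  for (std::size_t i = 0; i < size; ++i) {
    v[i] = low + step * static_cast<double>(i);
  }
  v[size - 1] = high;  // pin the last entry so rounding cannot move it
  return v;
}
```

With `lin_spaced(5, 0.0, 1.0)`, the values 0, 0.25, 0.5, 0.75 and 1 sit at indices 0 through 4, so the highest value is at index size-1 = 4.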
|
2016-09-22 14:50:07 +02:00 |
|
Gael Guennebaud
|
66cbabafed
|
Add a note regarding gcc bug #72867
|
2016-09-22 11:18:52 +02:00 |
|
Christoph Hertzberg
|
4b377715d7
|
Do not manually add absolute path to boost-library.
Also set C++ standard for blaze to C++14
|
2016-09-22 00:10:47 +02:00 |
|
Gael Guennebaud
|
aecc51a3e8
|
fix typo
|
2016-09-21 21:53:00 +02:00 |
|
Gael Guennebaud
|
1fc3a21ed0
|
Disable a failure test if extended double precision is in use (x87)
|
2016-09-21 20:09:07 +02:00 |
|
Gael Guennebaud
|
9fa2c8650e
|
Fix alignment of statically allocated temporaries in symv and trmv.
|
2016-09-21 17:34:24 +02:00 |
|
Gael Guennebaud
|
ac5377e161
|
Improve cost estimation of complex division
|
2016-09-21 17:26:04 +02:00 |
|
Gael Guennebaud
|
5269d11935
|
Fix compilation with ICC.
|
2016-09-21 17:08:51 +02:00 |
|
Benoit Steiner
|
26f9907542
|
Added missing typedefs
|
2016-09-20 12:58:03 -07:00 |
|
RJ Ryan
|
608b1acd6d
|
Don't use c++11 features and fix include.
|
2016-09-20 07:49:05 -07:00 |
|
RJ Ryan
|
b2c6dc48d9
|
Add CUDA-specific std::complex<T> specializations for scalar_sum_op, scalar_difference_op, scalar_product_op, and scalar_quotient_op.
|
2016-09-20 07:18:20 -07:00 |
|