Rasmus Munk Larsen
d956204ab2
Remove "false &&" left over from test.
2018-09-26 17:03:30 -07:00
Rasmus Munk Larsen
3815aeed7a
Parallelize tensor contraction over the inner dimension in cases where one or both of the outer dimensions (m and n) are small but k is large. This speeds up individual matmul microbenchmarks by up to 85%.
...
Naming below is BM_Matmul_M_K_N_THREADS, measured on a 2-socket Intel Broadwell-based server.
Benchmark                 Base (ns)   New (ns)  Improvement
-----------------------------------------------------------
BM_Matmul_1_80_13522_1       387457     396013        -2.2%
BM_Matmul_1_80_13522_2       406487     230789       +43.2%
BM_Matmul_1_80_13522_4       395821     123211       +68.9%
BM_Matmul_1_80_13522_6       391625      97002       +75.2%
BM_Matmul_1_80_13522_8       408986     113828       +72.2%
BM_Matmul_1_80_13522_16      399988      67600       +83.1%
BM_Matmul_1_80_13522_22      411546      60044       +85.4%
BM_Matmul_1_80_13522_32      393528      57312       +85.4%
BM_Matmul_1_80_13522_44      390047      63525       +83.7%
BM_Matmul_1_80_13522_88      387876      63592       +83.6%
BM_Matmul_1_1500_500_1       245359     248119        -1.1%
BM_Matmul_1_1500_500_2       401833     143271       +64.3%
BM_Matmul_1_1500_500_4       210519     100231       +52.4%
BM_Matmul_1_1500_500_6       251582      86575       +65.6%
BM_Matmul_1_1500_500_8       211499      80444       +62.0%
BM_Matmul_3_250_512_1         70297      68551        +2.5%
BM_Matmul_3_250_512_2         70141      52450       +25.2%
BM_Matmul_3_250_512_4         67872      58204       +14.2%
BM_Matmul_3_250_512_6         71378      63340       +11.3%
BM_Matmul_3_250_512_8         69595      41652       +40.2%
BM_Matmul_3_250_512_16        72055      42549       +40.9%
BM_Matmul_3_250_512_22        70158      54023       +23.0%
BM_Matmul_3_250_512_32        71541      56042       +21.7%
BM_Matmul_3_250_512_44        71843      57019       +20.6%
BM_Matmul_3_250_512_88        69951      54045       +22.7%
BM_Matmul_3_1500_512_1       369328     374284        -1.4%
BM_Matmul_3_1500_512_2       428656     223603       +47.8%
BM_Matmul_3_1500_512_4       205599     139508       +32.1%
BM_Matmul_3_1500_512_6       214278     139071       +35.1%
BM_Matmul_3_1500_512_8       184149     142338       +22.7%
BM_Matmul_3_1500_512_16      156462     156983        -0.3%
BM_Matmul_3_1500_512_22      163905     158259        +3.4%
BM_Matmul_3_1500_512_32      155314     157662        -1.5%
BM_Matmul_3_1500_512_44      235434     158657       +32.6%
BM_Matmul_3_1500_512_88      156779     160275        -2.2%
BM_Matmul_1500_4_512_1       363358     349528        +3.8%
BM_Matmul_1500_4_512_2       303134     263319       +13.1%
BM_Matmul_1500_4_512_4       176208     130086       +26.2%
BM_Matmul_1500_4_512_6       148026     115449       +22.0%
BM_Matmul_1500_4_512_8       131656      98421       +25.2%
BM_Matmul_1500_4_512_16      134011      82861       +38.2%
BM_Matmul_1500_4_512_22      134950      85685       +36.5%
BM_Matmul_1500_4_512_32      133165      90081       +32.4%
BM_Matmul_1500_4_512_44      133203      90644       +32.0%
BM_Matmul_1500_4_512_88      134106     100566       +25.0%
BM_Matmul_4_1500_512_1       439243     435058        +1.0%
BM_Matmul_4_1500_512_2       451830     257032       +43.1%
BM_Matmul_4_1500_512_4       276434     164513       +40.5%
BM_Matmul_4_1500_512_6       182542     144827       +20.7%
BM_Matmul_4_1500_512_8       179411     166256        +7.3%
BM_Matmul_4_1500_512_16      158101     155560        +1.6%
BM_Matmul_4_1500_512_22      152435     155448        -1.9%
BM_Matmul_4_1500_512_32      155150     149538        +3.6%
BM_Matmul_4_1500_512_44      193842     149777       +22.7%
BM_Matmul_4_1500_512_88      149544     154468        -3.3%
2018-09-26 16:47:13 -07:00
Gael Guennebaud
9419f506d0
Fix regression introduced by the previous fix for AVX512.
...
It broke the complex-complex case on SSE.
2018-09-20 17:32:34 +02:00
Benoit Steiner
edf46bd7a2
Merged in yuefengz/eigen (pull request PR-370)
...
Use device's allocate function instead of internal::aligned_malloc.
2018-07-31 22:38:28 +00:00
Rasmus Munk Larsen
e478532625
Reduce the number of template specializations of classes related to tensor contraction to reduce binary size.
2018-07-27 12:36:34 -07:00
Eugene Zhulenev
6e654f3379
Reduce number of allocations in TensorContractionThreadPool.
2018-07-16 14:26:39 -07:00
Yuefeng Zhou
1eff6cf8a7
Use device's allocate function instead of internal::aligned_malloc. This would make it easier to track memory usage in device instances.
2018-02-20 16:50:05 -08:00
Eugene Zhulenev
e204ecdaaf
Remove SimpleThreadPool and always use {NonBlocking}ThreadPool
2018-07-16 15:06:57 -07:00
Eugene Zhulenev
01fd4096d3
Fuse computations into the Tensor contractions using output kernel
2018-07-10 13:16:38 -07:00
Gael Guennebaud
b3fd93207b
Fix typos found using codespell
2018-06-07 14:43:02 +02:00
Benoit Steiner
519d63d350
Added support for libxsmm kernel in multithreaded contractions
2016-12-21 15:06:06 -08:00
Benoit Steiner
14a112ee15
Use signed integers more consistently to encode the number of threads to use to evaluate a tensor expression.
2016-06-09 08:25:22 -07:00
Benoit Steiner
1ae2567861
Fixed some compilation warnings
2016-05-26 15:57:19 -07:00
Benoit Steiner
a09cbf9905
Merged in rmlarsen/eigen (pull request PR-188)
...
Minor cleanups: 1. Get rid of a few unused variables. 2. Get rid of last uses of EIGEN_USE_COST_MODEL.
2016-05-23 12:55:12 -07:00
Christoph Hertzberg
25a03c02d6
Fix some sign-compare warnings
2016-05-22 16:42:27 +02:00
Rasmus Munk Larsen
7df811cfe5
Minor cleanups: 1. Get rid of unused variables. 2. Get rid of last uses of EIGEN_USE_COST_MODEL.
2016-05-18 15:09:48 -07:00
Benoit Steiner
83dfb40f66
Turn on the new thread pool by default since it scales much better over multiple cores. It is still possible to revert to the old thread pool by compiling with the EIGEN_USE_SIMPLE_THREAD_POOL define.
2016-05-13 17:23:15 -07:00
Benoit Steiner
97605c7b27
New multithreaded contraction that doesn't rely on the thread pool to run closures in the order in which they are enqueued. This is needed in order to switch to the new non-blocking thread pool, since that pool can execute closures in any order.
2016-05-13 17:11:29 -07:00
Benoit Steiner
a071629fec
Replace implicit cast with an explicit one
2016-05-12 10:40:07 -07:00
Benoit Steiner
f363e533aa
Added tests for full contractions using thread pools and gpu devices.
...
Fixed a couple of issues in the corresponding code.
2016-05-05 09:05:45 -07:00
Benoit Steiner
46177c8d64
Replace std::vector with our own implementation, as using the STL when compiling with nvcc and AVX enabled leads to many issues.
2016-03-08 16:37:27 -08:00
Benoit Steiner
8768c063f5
Fixed the tensor chipping code.
2016-03-08 12:26:49 -08:00
Benoit Steiner
e09eb835db
Decoupled the packet type definition from the definition of the tensor ops. All the vectorization is now defined in the tensor evaluators. This will make it possible to reliably support devices with different packet types in the same compilation unit.
2016-03-08 12:07:33 -08:00
Benoit Steiner
3aeeca32af
Leverage the new blocking code in the tensor contraction code.
2016-01-22 16:36:30 -08:00
Benoit Steiner
2ab603316a
Use numext::mini/numext::maxi instead of std::min/std::max in the tensor code
2015-08-28 08:14:15 -07:00
Benoit Steiner
3625734bc8
Moved some utilities to TensorMeta.h to make it easier to reuse them across several tensor operations.
...
Created the TensorDimensionList class to encode the list of all the dimensions of a tensor of rank n. This could be done using TensorIndexList; however, TensorIndexList requires C++11, which isn't yet supported as widely as we'd like.
2015-06-29 10:49:55 -07:00
Benoit Steiner
6b800744ce
Moved away from std::async and std::future as the underlying mechanism for the thread pool device. On several platforms, the functions passed to std::async are not scheduled in the order in which they are given to std::async, which leads to massive performance issues in the contraction code.
...
Instead we now have a custom thread pool that ensures that the functions are picked up by the threads in the pool in the order in which they are enqueued in the pool.
2015-05-20 13:52:07 -07:00
Benoit Steiner
bb483313f6
Fixed another batch of compilation warnings
2015-02-28 02:32:46 -08:00
Benoit Steiner
f8fbb3f9a6
Fixed several compilation warnings reported by clang
2015-02-25 22:22:37 -08:00
Benoit Steiner
410895a7e4
Silenced several compilation warnings
2015-02-10 12:13:19 -08:00
Benoit Steiner
f697df7237
Improved support for RowMajor tensors
...
Misc fixes and API cleanups.
2015-01-14 15:38:48 -08:00
Benoit Steiner
71676eaddd
Added support for RowMajor inputs to the contraction code.
2015-01-14 12:36:57 -08:00
Benoit Steiner
b1789c112b
Improved handling of 1d tensors
2014-11-03 08:51:33 -08:00
Benoit Steiner
d853adffdb
Avoid calling get_future() more than once on a given promise.
2014-10-16 10:10:04 -07:00
Benoit Steiner
bfdd9f3ac9
Made the blocking computation aware of the L3 cache
...
Also optimized the blocking parameters to take into account the number of threads used for a computation
2014-10-15 15:32:59 -07:00
Benoit Steiner
dba55041ab
Added support for promises
...
Started to improve multithreaded contractions
2014-10-15 11:20:36 -07:00