Hugh Perkins
|
9341f258d4
|
Add labels to #ifdef, in TensorReductionCuda.h
|
2017-06-06 15:51:06 +01:00 |
|
Benoit Steiner
|
a71943b9a4
|
Made the Tensor code compile with clang 3.9
|
2017-03-02 10:47:29 -08:00 |
|
Igor Babuschkin
|
18c67df31c
|
Fix remaining CUDA >= 300 checks
|
2016-08-18 17:18:30 +01:00 |
|
Igor Babuschkin
|
1569a7d7ab
|
Add the necessary CUDA >= 300 checks back
|
2016-08-18 17:15:12 +01:00 |
|
Igor Babuschkin
|
841e075154
|
Remove CUDA >= 300 checks and enable outer reduction for doubles
|
2016-08-06 18:07:50 +01:00 |
|
Igor Babuschkin
|
9537e8b118
|
Make use of atomicExch for atomicExchCustom
|
2016-08-05 14:29:58 +01:00 |
|
Igor Babuschkin
|
eeb0d880ee
|
Enable efficient Tensor reduction for doubles
|
2016-07-01 19:08:26 +01:00 |
|
Benoit Steiner
|
cb2d8b8fa6
|
Made it possible to compile reductions for an old cuda architecture and run them on a recent gpu.
|
2016-06-29 15:42:01 -07:00 |
|
Benoit Steiner
|
b2a47641ce
|
Made the code compile when using CUDA architecture < 300
|
2016-06-29 15:32:47 -07:00 |
|
Benoit Steiner
|
37638dafd7
|
Simplified the code that dispatches vectorized reductions on GPU
|
2016-06-09 10:29:52 -07:00 |
|
Benoit Steiner
|
aa33446dac
|
Improved support for vectorization of 16-bit floats
|
2016-06-09 08:22:27 -07:00 |
|
Benoit Steiner
|
7ef9f47b58
|
Misc small improvements to the reduction code.
|
2016-06-06 14:09:46 -07:00 |
|
Benoit Steiner
|
c2a102345f
|
Improved the performance of full reductions.
AFTER:
BM_fullReduction/10 4541 4543 154017 21.0M items/s
BM_fullReduction/64 5191 5193 100000 752.5M items/s
BM_fullReduction/512 9588 9588 71361 25.5G items/s
BM_fullReduction/4k 244314 244281 2863 64.0G items/s
BM_fullReduction/5k 359382 359363 1946 64.8G items/s
BEFORE:
BM_fullReduction/10 9085 9087 74395 10.5M items/s
BM_fullReduction/64 9478 9478 72014 412.1M items/s
BM_fullReduction/512 14643 14646 46902 16.7G items/s
BM_fullReduction/4k 260338 260384 2678 60.0G items/s
BM_fullReduction/5k 385076 385178 1818 60.5G items/s
|
2016-06-03 17:27:08 -07:00 |
|
Benoit Steiner
|
873e6ac54b
|
Silenced compilation warning generated by nvcc.
|
2016-06-01 14:20:50 -07:00 |
|
Benoit Steiner
|
d27b0ad4c8
|
Added support for mean reductions on fp16
|
2016-06-01 11:12:07 -07:00 |
|
Benoit Steiner
|
5aeb3687c4
|
Only enable optimized reductions of fp16 if the reduction functor supports them
|
2016-05-31 10:33:40 -07:00 |
|
Benoit Steiner
|
36369ab63c
|
Resolved merge conflicts
|
2016-05-26 13:39:39 -07:00 |
|
Benoit Steiner
|
28fcb5ca2a
|
Merged latest reduction improvements
|
2016-05-26 12:19:33 -07:00 |
|
Benoit Steiner
|
c1c7f06c35
|
Improved the performance of inner reductions.
|
2016-05-26 11:53:59 -07:00 |
|
Benoit Steiner
|
0835667329
|
There is no need to make the fp16 full reduction kernel a static function.
|
2016-05-24 23:11:56 -07:00 |
|
Benoit Steiner
|
8d06c02ffd
|
Allow vectorized padding on GPU. This helps speed things up a little.
Before:
BM_padding/10 5000000 460 217.03 MFlops/s
BM_padding/80 5000000 460 13899.40 MFlops/s
BM_padding/640 5000000 461 888421.17 MFlops/s
BM_padding/4K 5000000 460 54316322.55 MFlops/s
After:
BM_padding/10 5000000 454 220.20 MFlops/s
BM_padding/80 5000000 455 14039.86 MFlops/s
BM_padding/640 5000000 452 904968.83 MFlops/s
BM_padding/4K 5000000 411 60750049.21 MFlops/s
|
2016-05-17 09:13:27 -07:00 |
|
Benoit Steiner
|
83dfb40f66
|
Turn on the new thread pool by default since it scales much better over multiple cores. It is still possible to revert to the old thread pool by compiling with the EIGEN_USE_SIMPLE_THREAD_POOL define.
|
2016-05-13 17:23:15 -07:00 |
|
Benoit Steiner
|
c4fc8b70ec
|
Removed unnecessary thread synchronization
|
2016-05-13 10:49:38 -07:00 |
|
Benoit Steiner
|
217d984abc
|
Fixed a typo in my previous commit
|
2016-05-11 10:22:15 -07:00 |
|
Benoit Steiner
|
08348b4e48
|
Fix potential race condition in the CUDA reduction code.
|
2016-05-11 10:08:51 -07:00 |
|
Benoit Steiner
|
4ede059de1
|
Properly gate the use of half2.
|
2016-05-10 17:04:01 -07:00 |
|
Benoit Steiner
|
0eb69b7552
|
Small improvement to the full reduction of fp16
|
2016-05-10 11:58:18 -07:00 |
|
Benoit Steiner
|
4013b8feca
|
Simplified the reduction code a little.
|
2016-05-10 09:40:42 -07:00 |
|
Benoit Steiner
|
4670d7d5ce
|
Improved the performance of full reductions on GPU:
Before:
BM_fullReduction/10 200000 11751 8.51 MFlops/s
BM_fullReduction/80 5000 523385 12.23 MFlops/s
BM_fullReduction/640 50 36179326 11.32 MFlops/s
BM_fullReduction/4K 1 2173517195 11.50 MFlops/s
After:
BM_fullReduction/10 500000 5987 16.70 MFlops/s
BM_fullReduction/80 200000 10636 601.73 MFlops/s
BM_fullReduction/640 50000 58428 7010.31 MFlops/s
BM_fullReduction/4K 1000 2006106 12461.95 MFlops/s
|
2016-05-09 17:09:54 -07:00 |
|
Benoit Steiner
|
2dde1b1028
|
Don't crash when attempting to reduce empty tensors.
|
2016-04-20 18:08:20 -07:00 |
|
Benoit Steiner
|
5b1106c56b
|
Fixed a compilation error with nvcc 7.
|
2016-04-19 14:57:57 -07:00 |
|
Benoit Steiner
|
7129d998db
|
Simplified the code that launches cuda kernels.
|
2016-04-19 14:55:21 -07:00 |
|
Benoit Steiner
|
884c075058
|
Use numext::ceil instead of std::ceil
|
2016-04-19 14:33:30 -07:00 |
|
Benoit Steiner
|
edc679f6c6
|
Fixed compilation warning
|
2016-03-18 07:12:34 -07:00 |
|
Benoit Steiner
|
68ac5c1738
|
Improved the performance of large outer reductions on cuda
|
2016-02-29 18:11:58 -08:00 |
|
Benoit Steiner
|
b2075cb7a2
|
Made the signature of the inner and outer reducers consistent
|
2016-02-29 10:53:38 -08:00 |
|
Benoit Steiner
|
3284842045
|
Optimized the performance of narrow reductions on CUDA devices
|
2016-02-29 10:48:16 -08:00 |
|
Benoit Steiner
|
34057cff23
|
Fixed a race condition that could affect some reductions on CUDA devices.
|
2016-01-15 15:11:56 -08:00 |
|
Benoit Steiner
|
aed4cb1269
|
Use warp shuffles instead of shared memory access to speed up the inner reduction kernel.
|
2016-01-14 21:45:14 -08:00 |
|
Benoit Steiner
|
8fe2532e70
|
Fixed a boundary condition bug in the outer reduction kernel
|
2016-01-14 09:29:48 -08:00 |
|
Benoit Steiner
|
c5e6900400
|
Silenced a few compilation warnings.
|
2016-01-11 17:06:39 -08:00 |
|
Benoit Steiner
|
01c55d37e6
|
Deleted unused variable.
|
2016-01-11 15:53:19 -08:00 |
|
Benoit Steiner
|
0504c56ea7
|
Silenced an nvcc compilation warning
|
2016-01-11 15:49:21 -08:00 |
|
Benoit Steiner
|
b523771a24
|
Silenced several compilation warnings triggered by nvcc.
|
2016-01-11 14:25:43 -08:00 |
|
Benoit Steiner
|
2c3b13eded
|
Merged in jeremy_barnes/eigen/shader-model-3.0 (pull request PR-152)
Alternative way of forcing instantiation of device kernels without causing warnings or requiring device to device kernel invocations.
|
2016-01-11 11:43:37 -08:00 |
|
Benoit Steiner
|
780623261e
|
Re-enabled the optimized reduction CUDA code.
|
2016-01-11 09:07:14 -08:00 |
|
Jeremy Barnes
|
403a7cb6c3
|
Alternative way of forcing instantiation of device kernels without
causing warnings or requiring device to device kernel invocations.
This allows TensorFlow to work on SM 3.0 (i.e., Amazon EC2) machines.
|
2016-01-10 22:39:13 -05:00 |
|
Benoit Steiner
|
53749ff415
|
Prevent nvcc from miscompiling the cuda metakernel. Unfortunately, this reintroduces some compilation warnings, but it's much better than having to deal with random assertion failures.
|
2016-01-08 13:53:40 -08:00 |
|
Benoit Steiner
|
cfff40b1d4
|
Improved the performance of reductions on CUDA devices
|
2016-01-04 17:25:00 -08:00 |
|
Benoit Steiner
|
a1e08fb2a5
|
Optimized the configuration of the outer reduction cuda kernel
|
2015-12-22 16:30:10 -08:00 |
|