eigen

mirror of https://gitlab.com/libeigen/eigen.git synced 2025-05-04 01:34:07 +08:00

Author	SHA1	Message	Date
Hugh Perkins	9341f258d4	Add labels to #ifdef, in TensorReductionCuda.h	2017-06-06 15:51:06 +01:00
Benoit Steiner	a71943b9a4	Made the Tensor code compile with clang 3.9	2017-03-02 10:47:29 -08:00
Igor Babuschkin	18c67df31c	Fix remaining CUDA >= 300 checks	2016-08-18 17:18:30 +01:00
Igor Babuschkin	1569a7d7ab	Add the necessary CUDA >= 300 checks back	2016-08-18 17:15:12 +01:00
Igor Babuschkin	841e075154	Remove CUDA >= 300 checks and enable outer reductin for doubles	2016-08-06 18:07:50 +01:00
Igor Babuschkin	9537e8b118	Make use of atomicExch for atomicExchCustom	2016-08-05 14:29:58 +01:00
Igor Babuschkin	eeb0d880ee	Enable efficient Tensor reduction for doubles	2016-07-01 19:08:26 +01:00
Benoit Steiner	cb2d8b8fa6	Made it possible to compile reductions for an old cuda architecture and run them on a recent gpu.	2016-06-29 15:42:01 -07:00
Benoit Steiner	b2a47641ce	Made the code compile when using CUDA architecture < 300	2016-06-29 15:32:47 -07:00
Benoit Steiner	37638dafd7	Simplified the code that dispatches vectorized reductions on GPU	2016-06-09 10:29:52 -07:00
Benoit Steiner	aa33446dac	Improved support for vectorization of 16-bit floats	2016-06-09 08:22:27 -07:00
Benoit Steiner	7ef9f47b58	Misc small improvements to the reduction code.	2016-06-06 14:09:46 -07:00
Benoit Steiner	c2a102345f	Improved the performance of full reductions. AFTER: BM_fullReduction/10 4541 4543 154017 21.0M items/s BM_fullReduction/64 5191 5193 100000 752.5M items/s BM_fullReduction/512 9588 9588 71361 25.5G items/s BM_fullReduction/4k 244314 244281 2863 64.0G items/s BM_fullReduction/5k 359382 359363 1946 64.8G items/s BEFORE: BM_fullReduction/10 9085 9087 74395 10.5M items/s BM_fullReduction/64 9478 9478 72014 412.1M items/s BM_fullReduction/512 14643 14646 46902 16.7G items/s BM_fullReduction/4k 260338 260384 2678 60.0G items/s BM_fullReduction/5k 385076 385178 1818 60.5G items/s	2016-06-03 17:27:08 -07:00
Benoit Steiner	873e6ac54b	Silenced compilation warning generated by nvcc.	2016-06-01 14:20:50 -07:00
Benoit Steiner	d27b0ad4c8	Added support for mean reductions on fp16	2016-06-01 11:12:07 -07:00
Benoit Steiner	5aeb3687c4	Only enable optimized reductions of fp16 if the reduction functor supports them	2016-05-31 10:33:40 -07:00
Benoit Steiner	36369ab63c	Resolved merge conflicts	2016-05-26 13:39:39 -07:00
Benoit Steiner	28fcb5ca2a	Merged latest reduction improvements	2016-05-26 12:19:33 -07:00
Benoit Steiner	c1c7f06c35	Improved the performance of inner reductions.	2016-05-26 11:53:59 -07:00
Benoit Steiner	0835667329	There is no need to make the fp16 full reduction kernel a static function.	2016-05-24 23:11:56 -07:00
Benoit Steiner	8d06c02ffd	Allow vectorized padding on GPU. This helps speed things up a little. Before: BM_padding/10 5000000 460 217.03 MFlops/s BM_padding/80 5000000 460 13899.40 MFlops/s BM_padding/640 5000000 461 888421.17 MFlops/s BM_padding/4K 5000000 460 54316322.55 MFlops/s After: BM_padding/10 5000000 454 220.20 MFlops/s BM_padding/80 5000000 455 14039.86 MFlops/s BM_padding/640 5000000 452 904968.83 MFlops/s BM_padding/4K 5000000 411 60750049.21 MFlops/s	2016-05-17 09:13:27 -07:00
Benoit Steiner	83dfb40f66	Turnon the new thread pool by default since it scales much better over multiple cores. It is still possible to revert to the old thread pool by compiling with the EIGEN_USE_SIMPLE_THREAD_POOL define.	2016-05-13 17:23:15 -07:00
Benoit Steiner	c4fc8b70ec	Removed unnecessary thread synchronization	2016-05-13 10:49:38 -07:00
Benoit Steiner	217d984abc	Fixed a typo in my previous commit	2016-05-11 10:22:15 -07:00
Benoit Steiner	08348b4e48	Fix potential race condition in the CUDA reduction code.	2016-05-11 10:08:51 -07:00
Benoit Steiner	4ede059de1	Properly gate the use of half2.	2016-05-10 17:04:01 -07:00
Benoit Steiner	0eb69b7552	Small improvement to the full reduction of fp16	2016-05-10 11:58:18 -07:00
Benoit Steiner	4013b8feca	Simplified the reduction code a little.	2016-05-10 09:40:42 -07:00
Benoit Steiner	4670d7d5ce	Improved the performance of full reductions on GPU: Before: BM_fullReduction/10 200000 11751 8.51 MFlops/s BM_fullReduction/80 5000 523385 12.23 MFlops/s BM_fullReduction/640 50 36179326 11.32 MFlops/s BM_fullReduction/4K 1 2173517195 11.50 MFlops/s After: BM_fullReduction/10 500000 5987 16.70 MFlops/s BM_fullReduction/80 200000 10636 601.73 MFlops/s BM_fullReduction/640 50000 58428 7010.31 MFlops/s BM_fullReduction/4K 1000 2006106 12461.95 MFlops/s	2016-05-09 17:09:54 -07:00
Benoit Steiner	2dde1b1028	Don't crash when attempting to reduce empty tensors.	2016-04-20 18:08:20 -07:00
Benoit Steiner	5b1106c56b	Fixed a compilation error with nvcc 7.	2016-04-19 14:57:57 -07:00
Benoit Steiner	7129d998db	Simplified the code that launches cuda kernels.	2016-04-19 14:55:21 -07:00
Benoit Steiner	884c075058	Use numext::ceil instead of std::ceil	2016-04-19 14:33:30 -07:00
Benoit Steiner	edc679f6c6	Fixed compilation warning	2016-03-18 07:12:34 -07:00
Benoit Steiner	68ac5c1738	Improved the performance of large outer reductions on cuda	2016-02-29 18:11:58 -08:00
Benoit Steiner	b2075cb7a2	Made the signature of the inner and outer reducers consistent	2016-02-29 10:53:38 -08:00
Benoit Steiner	3284842045	Optimized the performance of narrow reductions on CUDA devices	2016-02-29 10:48:16 -08:00
Benoit Steiner	34057cff23	Fixed a race condition that could affect some reductions on CUDA devices.	2016-01-15 15:11:56 -08:00
Benoit Steiner	aed4cb1269	Use warp shuffles instead of shared memory access to speedup the inner reduction kernel.	2016-01-14 21:45:14 -08:00
Benoit Steiner	8fe2532e70	Fixed a boundary condition bug in the outer reduction kernel	2016-01-14 09:29:48 -08:00
Benoit Steiner	c5e6900400	Silenced a few compilation warnings.	2016-01-11 17:06:39 -08:00
Benoit Steiner	01c55d37e6	Deleted unused variable.	2016-01-11 15:53:19 -08:00
Benoit Steiner	0504c56ea7	Silenced a nvcc compilation warning	2016-01-11 15:49:21 -08:00
Benoit Steiner	b523771a24	Silenced several compilation warnings triggered by nvcc.	2016-01-11 14:25:43 -08:00
Benoit Steiner	2c3b13eded	Merged in jeremy_barnes/eigen/shader-model-3.0 (pull request PR-152) Alternative way of forcing instantiation of device kernels without causing warnings or requiring device to device kernel invocations.	2016-01-11 11:43:37 -08:00
Benoit Steiner	780623261e	Re-enabled the optimized reduction CUDA code.	2016-01-11 09:07:14 -08:00
Jeremy Barnes	403a7cb6c3	Alternative way of forcing instantiation of device kernels without causing warnings or requiring device to device kernel invocations. This allows Tensorflow to work on SM 3.0 (ie, Amazon EC2) machines.	2016-01-10 22:39:13 -05:00
Benoit Steiner	53749ff415	Prevent nvcc from miscompiling the cuda metakernel. Unfortunately this reintroduces some compulation warnings but it's much better than having to deal with random assertion failures.	2016-01-08 13:53:40 -08:00
Benoit Steiner	cfff40b1d4	Improved the performance of reductions on CUDA devices	2016-01-04 17:25:00 -08:00
Benoit Steiner	a1e08fb2a5	Optimized the configuration of the outer reduction cuda kernel	2015-12-22 16:30:10 -08:00

1 2

55 Commits