17 Commits

Author SHA1 Message Date
Benoit Steiner
c2a102345f Improved the performance of full reductions.
AFTER:
BM_fullReduction/10        4541       4543     154017  21.0M items/s
BM_fullReduction/64        5191       5193     100000  752.5M items/s
BM_fullReduction/512       9588       9588      71361  25.5G items/s
BM_fullReduction/4k      244314     244281       2863  64.0G items/s
BM_fullReduction/5k      359382     359363       1946  64.8G items/s

BEFORE:
BM_fullReduction/10        9085       9087      74395  10.5M items/s
BM_fullReduction/64        9478       9478      72014  412.1M items/s
BM_fullReduction/512      14643      14646      46902  16.7G items/s
BM_fullReduction/4k      260338     260384       2678  60.0G items/s
BM_fullReduction/5k      385076     385178       1818  60.5G items/s
2016-06-03 17:27:08 -07:00
Benoit Steiner
b5d6b52a4d Fixed compilation warning 2016-05-24 23:10:57 -07:00
Benoit Steiner
c3859a2b58 Added the ability to use a scratch buffer in cuda kernels 2016-05-09 17:05:53 -07:00
Benoit Steiner
7129d998db Simplified the code that launches cuda kernels. 2016-04-19 14:55:21 -07:00
Benoit Steiner
b9ea40c30d Don't take the address of a kernel on CUDA devices that don't support this feature. 2016-04-19 14:35:11 -07:00
Benoit Steiner
609b3337a7 Print some information to stderr when a CUDA kernel fails 2016-02-27 20:42:57 +00:00
Benoit Steiner
46fc23f91c Print an error message to stderr when the initialization of the CUDA runtime fails. This helps debugging setup issues. 2016-02-19 13:44:22 -08:00
Benoit Steiner
f268db1c4b Added the ability to query the minor version of a cuda device 2016-02-19 16:31:04 +00:00
Benoit Steiner
6b5dff875e Made it possible to limit the number of blocks that will be used to evaluate a tensor expression on a CUDA device. This makesit possible to set aside streaming multiprocessors for other computations. 2016-02-01 12:46:32 -08:00
Benoit Steiner
c5e6900400 Silenced a few compilation warnings. 2016-01-11 17:06:39 -08:00
Benoit Steiner
b523771a24 Silenced several compilation warnings triggered by nvcc. 2016-01-11 14:25:43 -08:00
Jeremy Barnes
91678f489a Cleaned up double-defined macro from last commit 2016-01-10 22:44:45 -05:00
Jeremy Barnes
403a7cb6c3 Alternative way of forcing instantiation of device kernels without
causing warnings or requiring device to device kernel invocations.

This allows Tensorflow to work on SM 3.0 (ie, Amazon EC2) machines.
2016-01-10 22:39:13 -05:00
Benoit Steiner
53749ff415 Prevent nvcc from miscompiling the cuda metakernel. Unfortunately this reintroduces some compulation warnings but it's much better than having to deal with random assertion failures. 2016-01-08 13:53:40 -08:00
Benoit Steiner
4aac55f684 Silenced some compilation warnings triggered by nvcc 2015-12-17 13:39:01 -08:00
Benoit Steiner
df31ca3b9e Made it possible to refer t oa GPUDevice from code compile with a regular C++ compiler 2015-11-23 10:03:53 -08:00
Benoit Steiner
9fa65d3838 Split TensorDeviceType.h in 3 files to make it more manageable 2015-11-20 17:42:50 -08:00