Benoit Steiner | c2a102345f | Improved the performance of full reductions. | 2016-06-03 17:27:08 -07:00

    Benchmark              Time(ns)  CPU(ns)  Iterations  Throughput
    AFTER:
    BM_fullReduction/10        4541     4543      154017   21.0M items/s
    BM_fullReduction/64        5191     5193      100000  752.5M items/s
    BM_fullReduction/512       9588     9588       71361   25.5G items/s
    BM_fullReduction/4k      244314   244281        2863   64.0G items/s
    BM_fullReduction/5k      359382   359363        1946   64.8G items/s
    BEFORE:
    BM_fullReduction/10        9085     9087       74395   10.5M items/s
    BM_fullReduction/64        9478     9478       72014  412.1M items/s
    BM_fullReduction/512      14643    14646       46902   16.7G items/s
    BM_fullReduction/4k      260338   260384        2678   60.0G items/s
    BM_fullReduction/5k      385076   385178        1818   60.5G items/s
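
For context, BM_fullReduction/N times a complete sum reduction of an N x N float tensor on the GPU through the Tensor module. The sketch below shows the kind of operation being measured, assuming Eigen's unsupported CXX11 Tensor module built with CUDA support; the CudaStreamDevice/GpuDevice setup and the buffer names follow the CUDA tests of that era and should be treated as assumptions rather than the benchmark's exact code.

    #define EIGEN_USE_GPU
    #include <cuda_runtime.h>
    #include <unsupported/Eigen/CXX11/Tensor>

    // Sketch of the expression BM_fullReduction measures: the full sum of an
    // n x n float tensor evaluated on the GPU (names are illustrative).
    float full_reduction(float* d_in, float* d_out, int n) {
      Eigen::CudaStreamDevice stream;    // wraps the default CUDA stream
      Eigen::GpuDevice device(&stream);  // device handle used by Tensor expressions

      Eigen::TensorMap<Eigen::Tensor<float, 2> > in(d_in, n, n);
      Eigen::TensorMap<Eigen::Tensor<float, 0> > out(d_out);
      out.device(device) = in.sum();     // the full reduction runs on the GPU

      float result = 0.f;
      cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
      return result;
    }
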
Benoit Steiner | b5d6b52a4d | Fixed compilation warning | 2016-05-24 23:10:57 -07:00
Benoit Steiner | c3859a2b58 | Added the ability to use a scratch buffer in CUDA kernels | 2016-05-09 17:05:53 -07:00
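
A scratch buffer lets a kernel stage intermediate, per-block results in device memory rather than recomputing them or squeezing everything into shared memory. The sketch below is not Eigen's implementation, just a generic illustration of the pattern in plain CUDA with hypothetical names: pass one writes a partial sum per block into the scratch buffer, pass two folds the partials into the final result.

    #include <cuda_runtime.h>

    // Pass 1: each block writes one partial sum into `scratch`.
    __global__ void partial_sums(const float* in, int n, float* scratch) {
      __shared__ float cache[256];
      int tid = threadIdx.x;
      float acc = 0.f;
      for (int i = blockIdx.x * blockDim.x + tid; i < n;
           i += blockDim.x * gridDim.x) {
        acc += in[i];
      }
      cache[tid] = acc;
      __syncthreads();
      for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
      }
      if (tid == 0) scratch[blockIdx.x] = cache[0];
    }

    void reduce_with_scratch(const float* d_in, int n, float* d_out) {
      const int threads = 256, blocks = 128;
      float* d_scratch = 0;
      cudaMalloc(&d_scratch, blocks * sizeof(float));          // the scratch buffer
      partial_sums<<<blocks, threads>>>(d_in, n, d_scratch);   // pass 1
      partial_sums<<<1, threads>>>(d_scratch, blocks, d_out);  // pass 2: reduce partials
      cudaFree(d_scratch);
    }
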
Benoit Steiner | 7129d998db | Simplified the code that launches CUDA kernels. | 2016-04-19 14:55:21 -07:00
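
Kernel-launch call sites tend to repeat the same boilerplate: computing the grid size, launching, and checking for errors. A common simplification is to fold that into one helper, as in the hypothetical sketch below; this illustrates the idea and is not Eigen's actual launch wrapper.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical launch helper: centralizes 1-D grid sizing and error
    // checking so each call site shrinks to a single line.
    template <typename Kernel, typename... Args>
    void launch_1d(Kernel kernel, int num_items, int block_size, Args... args) {
      int num_blocks = (num_items + block_size - 1) / block_size;
      kernel<<<num_blocks, block_size>>>(args...);
      cudaError_t err = cudaGetLastError();
      if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
      }
    }

    // Usage (hypothetical): launch_1d(scale_kernel, n, 256, d_data, n, 2.f);
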
Benoit Steiner | b9ea40c30d | Don't take the address of a kernel on CUDA devices that don't support this feature. | 2016-04-19 14:35:11 -07:00
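
Querying a kernel's attributes (for example via cudaFuncGetAttributes) means taking the kernel's address from host code, which is not supported in every device/build configuration. The usual workaround is to guard the query and fall back to a conservative default. In the sketch below only cudaFuncGetAttributes is a real runtime API; the kernel and the SUPPORTS_KERNEL_ADDRESS build flag are hypothetical.

    #include <cuda_runtime.h>

    __global__ void my_kernel(float* data, int n) {  // hypothetical kernel
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] *= 2.f;
    }

    int max_threads_for_kernel() {
    #if defined(SUPPORTS_KERNEL_ADDRESS)   // hypothetical build flag
      // Taking the kernel's address is only done when the target supports it.
      cudaFuncAttributes attr;
      if (cudaFuncGetAttributes(&attr, my_kernel) == cudaSuccess) {
        return attr.maxThreadsPerBlock;
      }
    #endif
      return 256;  // conservative fallback when the query is unavailable
    }
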
Benoit Steiner | 609b3337a7 | Print some information to stderr when a CUDA kernel fails | 2016-02-27 20:42:57 +00:00
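
A typical way to surface kernel failures is to check cudaGetLastError() right after the launch and cudaDeviceSynchronize() for asynchronous errors, then write the details to stderr. Generic sketch, not the exact code added in this commit:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Launch errors show up in cudaGetLastError(); execution errors only
    // become visible after a synchronization.
    void check_kernel(const char* name) {
      cudaError_t err = cudaGetLastError();
      if (err == cudaSuccess) err = cudaDeviceSynchronize();
      if (err != cudaSuccess) {
        fprintf(stderr, "CUDA kernel %s failed: %s\n",
                name, cudaGetErrorString(err));
      }
    }
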
Benoit Steiner | 46fc23f91c | Print an error message to stderr when the initialization of the CUDA runtime fails. This helps debugging setup issues. | 2016-02-19 13:44:22 -08:00
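
The same idea applied at startup: probe the runtime once and report to stderr when it cannot be brought up, so driver/toolkit setup problems are visible immediately. Again a generic sketch rather than the commit's exact code:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Returns false (and explains why on stderr) if the CUDA runtime cannot
    // be initialized, e.g. missing driver or no CUDA-capable device.
    bool init_cuda_runtime() {
      int device_count = 0;
      cudaError_t err = cudaGetDeviceCount(&device_count);
      if (err != cudaSuccess) {
        fprintf(stderr, "Failed to initialize the CUDA runtime: %s\n",
                cudaGetErrorString(err));
        return false;
      }
      if (device_count == 0) {
        fprintf(stderr, "No CUDA-capable device found\n");
        return false;
      }
      return true;
    }
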
Benoit Steiner | f268db1c4b | Added the ability to query the minor version of a CUDA device | 2016-02-19 16:31:04 +00:00
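
At the runtime level the compute capability is reported as a major/minor pair, and the minor half is what distinguishes, say, sm_30 from sm_35. A generic sketch of the query (Eigen's own accessor name may differ):

    #include <cuda_runtime.h>

    // The compute capability of a device is reported as a major/minor pair.
    int minor_device_version(int device) {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, device);
      return prop.minor;   // e.g. 5 for an sm_35 (Kepler GK110) part
    }
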
Benoit Steiner | 6b5dff875e | Made it possible to limit the number of blocks that will be used to evaluate a tensor expression on a CUDA device. This makes it possible to set aside streaming multiprocessors for other computations. | 2016-02-01 12:46:32 -08:00
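
Capping the block count combines naturally with grid-stride loops: the kernel still visits every element, it just reuses a smaller grid, and the SMs it does not occupy remain free for kernels on other streams. Generic sketch with an illustrative max_blocks parameter:

    #include <algorithm>
    #include <cuda_runtime.h>

    // Grid-stride kernel: correctness does not depend on the grid size, so
    // the number of blocks can be capped without changing the result.
    __global__ void scale(float* data, int n, float factor) {
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += blockDim.x * gridDim.x) {
        data[i] *= factor;
      }
    }

    void launch_scale(float* d_data, int n, float factor, int max_blocks) {
      const int threads = 256;
      int blocks = std::min((n + threads - 1) / threads, max_blocks);
      // Blocks we do not use leave SMs available for other kernels/streams.
      scale<<<blocks, threads>>>(d_data, n, factor);
    }
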
Benoit Steiner | c5e6900400 | Silenced a few compilation warnings. | 2016-01-11 17:06:39 -08:00
Benoit Steiner | b523771a24 | Silenced several compilation warnings triggered by nvcc. | 2016-01-11 14:25:43 -08:00
Jeremy Barnes | 91678f489a | Cleaned up double-defined macro from last commit | 2016-01-10 22:44:45 -05:00
Jeremy Barnes | 403a7cb6c3 | Alternative way of forcing instantiation of device kernels without causing warnings or requiring device-to-device kernel invocations. This allows TensorFlow to work on SM 3.0 (i.e., Amazon EC2) machines. | 2016-01-10 22:39:13 -05:00
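
One way to force nvcc to emit device code for a kernel template, without a dummy device-to-device launch and without unused-function warnings, is an explicit template instantiation in a .cu file. The sketch below illustrates that general technique with a hypothetical kernel; it is not necessarily the exact mechanism this commit uses.

    #include <cuda_runtime.h>

    // Hypothetical templated kernel whose instantiations must end up in the
    // fatbinary even if nothing in this translation unit launches them.
    template <typename T>
    __global__ void fill_kernel(T* out, T value, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = value;
    }

    // Explicit instantiations: device code is generated for these types
    // without a dummy launch and without unused-function warnings.
    template __global__ void fill_kernel<float>(float*, float, int);
    template __global__ void fill_kernel<double>(double*, double, int);
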
Benoit Steiner | 53749ff415 | Prevent nvcc from miscompiling the CUDA metakernel. Unfortunately this reintroduces some compilation warnings, but it's much better than having to deal with random assertion failures. | 2016-01-08 13:53:40 -08:00
Benoit Steiner | 4aac55f684 | Silenced some compilation warnings triggered by nvcc | 2015-12-17 13:39:01 -08:00
Benoit Steiner | df31ca3b9e | Made it possible to refer to a GPUDevice from code compiled with a regular C++ compiler | 2015-11-23 10:03:53 -08:00
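
The usual way to make a CUDA-backed device type visible to host-only translation units is to keep the public header free of device constructs: a forward declaration plus functions that take the device by reference or pointer, with the definitions compiled by nvcc. A hypothetical sketch of that layering, not Eigen's actual headers:

    // device_api.h -- only a forward declaration, so code compiled with a
    // regular C++ compiler can hold and pass a GPUDevice without CUDA headers.
    struct GPUDevice;  // real definition lives in a CUDA-only header / .cu file

    // Host-callable entry point; its implementation is compiled by nvcc and
    // can use the full definition of GPUDevice internally.
    void scale_on_gpu(const GPUDevice& device, float* data, int n, float factor);
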
Benoit Steiner | 9fa65d3838 | Split TensorDeviceType.h into 3 files to make it more manageable | 2015-11-20 17:42:50 -08:00