Added benchmarks for contraction on CPU.

This commit is contained in:
Benoit Steiner 2016-05-13 14:32:17 -07:00
parent c4fc8b70ec
commit 069a0b04d7
2 changed files with 45 additions and 3 deletions

View File

@ -1,4 +1,6 @@
Each benchmark comes in 2 flavors: one that runs on CPU, and one that runs on GPU. The tensor benchmark suite is made of several parts.
The first part is a generic suite, in which each benchmark comes in 2 flavors: one that runs on CPU, and one that runs on GPU.
To compile the floating point CPU benchmarks, simply call: To compile the floating point CPU benchmarks, simply call:
g++ tensor_benchmarks_cpu.cc benchmark_main.cc -I ../../ -std=c++11 -O3 -DNDEBUG -pthread -mavx -o benchmarks_cpu g++ tensor_benchmarks_cpu.cc benchmark_main.cc -I ../../ -std=c++11 -O3 -DNDEBUG -pthread -mavx -o benchmarks_cpu
@ -6,7 +8,8 @@ g++ tensor_benchmarks_cpu.cc benchmark_main.cc -I ../../ -std=c++11 -O3 -DNDEBUG
To compile the floating point GPU benchmarks, simply call: To compile the floating point GPU benchmarks, simply call:
nvcc tensor_benchmarks_gpu.cu benchmark_main.cc -I ../../ -std=c++11 -O2 -DNDEBUG -arch compute_35 -o benchmarks_gpu nvcc tensor_benchmarks_gpu.cu benchmark_main.cc -I ../../ -std=c++11 -O2 -DNDEBUG -arch compute_35 -o benchmarks_gpu
We also provide a version of the generic GPU tensor benchmarks that uses half floats (aka fp16) instead of regular floats. To compile these benchmarks, simply call the command line below. You'll need a recent GPU that supports compute capability 5.3 or higher to run them and nvcc 7.5 or higher to compile the code.
To compile the half float GPU benchmarks, simply call the command line below. You'll need a recent GPU that supports compute capability 5.3 or higher to run them and nvcc 7.5 or higher to compile the code.
nvcc tensor_benchmarks_fp16_gpu.cu benchmark_main.cc -I ../../ -std=c++11 -O2 -DNDEBUG -arch compute_53 -o benchmarks_fp16_gpu nvcc tensor_benchmarks_fp16_gpu.cu benchmark_main.cc -I ../../ -std=c++11 -O2 -DNDEBUG -arch compute_53 -o benchmarks_fp16_gpu
last but not least, we also provide a suite of benchmarks to measure the scalability of the contraction code on CPU. To compile these benchmarks, call
g++ contraction_benchmarks_cpu.cc benchmark_main.cc -I ../../ -std=c++11 -O3 -DNDEBUG -pthread -mavx -o benchmarks_cpu

View File

@ -0,0 +1,39 @@
#define EIGEN_USE_THREADS
#include <string>
#include "tensor_benchmarks.h"
#define CREATE_THREAD_POOL(threads) \
Eigen::ThreadPool pool(threads); \
Eigen::ThreadPoolDevice device(&pool, threads);
// Contractions for number of threads ranging from 1 to 32
// Dimensions are Rows, Cols, Depth
#define BM_ContractionCPU(D1, D2, D3) \
static void BM_##Contraction##_##D1##x##D2##x##D3(int iters, int Threads) { \
StopBenchmarkTiming(); \
CREATE_THREAD_POOL(Threads); \
BenchmarkSuite<Eigen::ThreadPoolDevice, float> suite(device, D1, D2, D3); \
suite.contraction(iters); \
} \
BENCHMARK_RANGE(BM_##Contraction##_##D1##x##D2##x##D3, 1, 32);
// Vector Matrix and Matrix Vector products
BM_ContractionCPU(1, 2000, 500);
BM_ContractionCPU(2000, 1, 500);
// Various skinny matrices
BM_ContractionCPU(250, 3, 512);
BM_ContractionCPU(1500, 3, 512);
BM_ContractionCPU(512, 800, 4);
BM_ContractionCPU(512, 80, 800);
BM_ContractionCPU(512, 80, 13522);
BM_ContractionCPU(1, 80, 13522);
BM_ContractionCPU(3200, 512, 4);
BM_ContractionCPU(3200, 512, 80);
BM_ContractionCPU(3200, 80, 512);