22 Commits

Author SHA1 Message Date
Eugene Zhulenev
f2209d06e4 Add block evaluationto CwiseUnaryOp and add PreferBlockAccess enum to all evaluators 2018-08-10 16:53:36 -07:00
Gael Guennebaud
b3fd93207b Fix typos found using codespell 2018-06-07 14:43:02 +02:00
RJ Ryan
59985cfd26 Disable use of recurrence for computing twiddle factors. Fixes FFT precision issues for large FFTs. https://github.com/tensorflow/tensorflow/issues/10749#issuecomment-354557689 2017-12-31 10:44:56 -05:00
Benoit Steiner
53725c10b8 Merged in mehdi_goli/opencl/DataDependancy (pull request PR-10)
DataDependancy

* Wrapping data type to the pointer class for sycl in non-terminal nodes; not having that breaks Tensorflow Conv2d code.

* Applying Ronnan's Comments.

* Applying benoit's comments
2017-06-28 17:55:23 +00:00
Rasmus Munk Larsen
e6b1020221 Adds a fast memcpy function to Eigen. This takes advantage of the following:
1. For small fixed sizes, the compiler generates inline code for memcpy, which is much faster.

2. My colleague eriche at googl dot com discovered that for large sizes, memmove is significantly faster than memcpy (at least on Linux with GCC or Clang). See benchmark numbers measured on a Haswell (HP Z440) workstation here: https://docs.google.com/a/google.com/spreadsheets/d/1jLs5bKzXwhpTySw65MhG1pZpsIwkszZqQTjwrd_n0ic/pubhtml This is of course surprising since memcpy is a less constrained version of memmove. This stackoverflow thread contains some speculation as to the causes: http://stackoverflow.com/questions/22793669/poor-memcpy-performance-on-linux

Below are numbers for copying and slicing tensors using the multithreaded TensorDevice. The numbers show significant improvements for memcpy of very small blocks and for memcpy of large blocks single threaded (we were already able to saturate memory bandwidth for >1 threads before on large blocks). The "slicingSmallPieces" benchmark also shows small consistent improvements, since memcpy cost is a fair portion of that particular computation.

The benchmarks operate on NxN matrices, and the names are of the form BM_$OP_${NUMTHREADS}T/${N}.

Measured improvements in wall clock time:

Run on rmlarsen3.mtv (12 X 3501 MHz CPUs); 2017-01-20T11:26:31.493023454-08:00
CPU: Intel Haswell with HyperThreading (6 cores) dL1:32KB dL2:256KB dL3:15MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_memcpy_1T/2                          3.48      2.39    +31.3%
BM_memcpy_1T/8                          12.3      6.51    +47.0%
BM_memcpy_1T/64                          371       383     -3.2%
BM_memcpy_1T/512                       66922     66720     +0.3%
BM_memcpy_1T/4k                      9892867   6849682    +30.8%
BM_memcpy_1T/5k                     14951099  10332856    +30.9%
BM_memcpy_2T/2                          3.50      2.46    +29.7%
BM_memcpy_2T/8                          12.3      7.66    +37.7%
BM_memcpy_2T/64                          371       376     -1.3%
BM_memcpy_2T/512                       66652     66788     -0.2%
BM_memcpy_2T/4k                      6145012   6117776     +0.4%
BM_memcpy_2T/5k                      9181478   9010942     +1.9%
BM_memcpy_4T/2                          3.47      2.47    +31.0%
BM_memcpy_4T/8                          12.3      6.67    +45.8
BM_memcpy_4T/64                          374       376     -0.5%
BM_memcpy_4T/512                       67833     68019     -0.3%
BM_memcpy_4T/4k                      5057425   5188253     -2.6%
BM_memcpy_4T/5k                      7555638   7779468     -3.0%
BM_memcpy_6T/2                          3.51      2.50    +28.8%
BM_memcpy_6T/8                          12.3      7.61    +38.1%
BM_memcpy_6T/64                          373       378     -1.3%
BM_memcpy_6T/512                       66871     66774     +0.1%
BM_memcpy_6T/4k                      5112975   5233502     -2.4%
BM_memcpy_6T/5k                      7614180   7772246     -2.1%
BM_memcpy_8T/2                          3.47      2.41    +30.5%
BM_memcpy_8T/8                          12.4      10.5    +15.3%
BM_memcpy_8T/64                          372       388     -4.3%
BM_memcpy_8T/512                       67373     66588     +1.2%
BM_memcpy_8T/4k                      5148462   5254897     -2.1%
BM_memcpy_8T/5k                      7660989   7799058     -1.8%
BM_memcpy_12T/2                         3.50      2.40    +31.4%
BM_memcpy_12T/8                         12.4      7.55    +39.1
BM_memcpy_12T/64                         374       378     -1.1%
BM_memcpy_12T/512                      67132     66683     +0.7%
BM_memcpy_12T/4k                     5185125   5292920     -2.1%
BM_memcpy_12T/5k                     7717284   7942684     -2.9%
BM_slicingSmallPieces_1T/2              47.3      47.5     +0.4%
BM_slicingSmallPieces_1T/8              53.6      52.3     +2.4%
BM_slicingSmallPieces_1T/64              491       476     +3.1%
BM_slicingSmallPieces_1T/512           21734     18814    +13.4%
BM_slicingSmallPieces_1T/4k           394660    396760     -0.5%
BM_slicingSmallPieces_1T/5k           218722    209244     +4.3%
BM_slicingSmallPieces_2T/2              80.7      79.9     +1.0%
BM_slicingSmallPieces_2T/8              54.2      53.1     +2.0
BM_slicingSmallPieces_2T/64              497       477     +4.0%
BM_slicingSmallPieces_2T/512           21732     18822    +13.4%
BM_slicingSmallPieces_2T/4k           392885    390490     +0.6%
BM_slicingSmallPieces_2T/5k           221988    208678     +6.0%
BM_slicingSmallPieces_4T/2              80.8      80.1     +0.9%
BM_slicingSmallPieces_4T/8              54.1      53.2     +1.7%
BM_slicingSmallPieces_4T/64              493       476     +3.4%
BM_slicingSmallPieces_4T/512           21702     18758    +13.6%
BM_slicingSmallPieces_4T/4k           393962    404023     -2.6%
BM_slicingSmallPieces_4T/5k           249667    211732    +15.2%
BM_slicingSmallPieces_6T/2              80.5      80.1     +0.5%
BM_slicingSmallPieces_6T/8              54.4      53.4     +1.8%
BM_slicingSmallPieces_6T/64              488       478     +2.0%
BM_slicingSmallPieces_6T/512           21719     18841    +13.3%
BM_slicingSmallPieces_6T/4k           394950    397583     -0.7%
BM_slicingSmallPieces_6T/5k           223080    210148     +5.8%
BM_slicingSmallPieces_8T/2              81.2      80.4     +1.0%
BM_slicingSmallPieces_8T/8              58.1      53.5     +7.9%
BM_slicingSmallPieces_8T/64              489       480     +1.8%
BM_slicingSmallPieces_8T/512           21586     18798    +12.9%
BM_slicingSmallPieces_8T/4k           394592    400165     -1.4%
BM_slicingSmallPieces_8T/5k           219688    208301     +5.2%
BM_slicingSmallPieces_12T/2             80.2      79.8     +0.7%
BM_slicingSmallPieces_12T/8             54.4      53.4     +1.8
BM_slicingSmallPieces_12T/64             488       476     +2.5%
BM_slicingSmallPieces_12T/512          21931     18831    +14.1%
BM_slicingSmallPieces_12T/4k          393962    396541     -0.7%
BM_slicingSmallPieces_12T/5k          218803    207965     +5.0%
2017-01-24 13:55:18 -08:00
Benoit Steiner
fd220dd8b0 Use numext::conj instead of std::conj 2016-08-01 18:16:16 -07:00
Rasmus Munk Larsen
235e83aba6 Eigen cost model part 1. This implements a basic recursive framework to estimate the cost of evaluating tensor expressions. 2016-04-14 13:57:35 -07:00
Benoit Steiner
3da495e6b9 Relaxed the condition used to gate the fft code. 2016-03-31 18:11:51 -07:00
Benoit Steiner
0f5cc504fe Properly gate the fft code 2016-03-31 12:59:39 -07:00
Benoit Steiner
23aed8f2e4 Use EIGEN_PI instead of redefining our own constant PI 2016-03-05 08:04:45 -08:00
Benoit Steiner
ec35068edc Don't rely on the M_PI constant since not all compilers provide it. 2016-03-04 16:42:38 -08:00
Benoit Steiner
c561eeb7bf Don't use implicit type conversions in initializer lists since not all compilers support them. 2016-03-04 14:12:45 -08:00
Benoit Steiner
203490017f Prevent unecessary Index to int conversions 2016-02-21 08:49:36 -08:00
Rasmus Munk Larsen
8eb127022b Get rid of duplicate code. 2016-02-19 16:33:30 -08:00
Rasmus Munk Larsen
d5e2ec7447 Speed up tensor FFT by up ~25-50%.
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_tensor_fft_single_1D_cpu/8            132       134     -1.5%
BM_tensor_fft_single_1D_cpu/9           1162      1229     -5.8%
BM_tensor_fft_single_1D_cpu/16           199       195     +2.0%
BM_tensor_fft_single_1D_cpu/17          2587      2267    +12.4%
BM_tensor_fft_single_1D_cpu/32           373       341     +8.6%
BM_tensor_fft_single_1D_cpu/33          5922      4879    +17.6%
BM_tensor_fft_single_1D_cpu/64           797       675    +15.3%
BM_tensor_fft_single_1D_cpu/65         13580     10481    +22.8%
BM_tensor_fft_single_1D_cpu/128         1753      1375    +21.6%
BM_tensor_fft_single_1D_cpu/129        31426     22789    +27.5%
BM_tensor_fft_single_1D_cpu/256         4005      3008    +24.9%
BM_tensor_fft_single_1D_cpu/257        70910     49549    +30.1%
BM_tensor_fft_single_1D_cpu/512         8989      6524    +27.4%
BM_tensor_fft_single_1D_cpu/513       165402    107751    +34.9%
BM_tensor_fft_single_1D_cpu/999       198293    115909    +41.5%
BM_tensor_fft_single_1D_cpu/1ki        21289     14143    +33.6%
BM_tensor_fft_single_1D_cpu/1k        361980    233355    +35.5%
BM_tensor_fft_double_1D_cpu/8            138       131     +5.1%
BM_tensor_fft_double_1D_cpu/9           1253      1133     +9.6%
BM_tensor_fft_double_1D_cpu/16           218       200     +8.3%
BM_tensor_fft_double_1D_cpu/17          2770      2392    +13.6%
BM_tensor_fft_double_1D_cpu/32           406       368     +9.4%
BM_tensor_fft_double_1D_cpu/33          6418      5153    +19.7%
BM_tensor_fft_double_1D_cpu/64           856       728    +15.0%
BM_tensor_fft_double_1D_cpu/65         14666     11148    +24.0%
BM_tensor_fft_double_1D_cpu/128         1913      1502    +21.5%
BM_tensor_fft_double_1D_cpu/129        36414     24072    +33.9%
BM_tensor_fft_double_1D_cpu/256         4226      3216    +23.9%
BM_tensor_fft_double_1D_cpu/257        86638     52059    +39.9%
BM_tensor_fft_double_1D_cpu/512         9397      6939    +26.2%
BM_tensor_fft_double_1D_cpu/513       203208    114090    +43.9%
BM_tensor_fft_double_1D_cpu/999       237841    125583    +47.2%
BM_tensor_fft_double_1D_cpu/1ki        20921     15392    +26.4%
BM_tensor_fft_double_1D_cpu/1k        455183    250763    +44.9%
BM_tensor_fft_single_2D_cpu/8           1051      1005     +4.4%
BM_tensor_fft_single_2D_cpu/9          16784     14837    +11.6%
BM_tensor_fft_single_2D_cpu/16          4074      3772     +7.4%
BM_tensor_fft_single_2D_cpu/17         75802     63884    +15.7%
BM_tensor_fft_single_2D_cpu/32         20580     16931    +17.7%
BM_tensor_fft_single_2D_cpu/33        345798    278579    +19.4%
BM_tensor_fft_single_2D_cpu/64         97548     81237    +16.7%
BM_tensor_fft_single_2D_cpu/65       1592701   1227048    +23.0%
BM_tensor_fft_single_2D_cpu/128       472318    384303    +18.6%
BM_tensor_fft_single_2D_cpu/129      7038351   5445308    +22.6%
BM_tensor_fft_single_2D_cpu/256      2309474   1850969    +19.9%
BM_tensor_fft_single_2D_cpu/257     31849182  23797538    +25.3%
BM_tensor_fft_single_2D_cpu/512     10395194   8077499    +22.3%
BM_tensor_fft_single_2D_cpu/513     144053843  104242541    +27.6%
BM_tensor_fft_single_2D_cpu/999     279885833  208389718    +25.5%
BM_tensor_fft_single_2D_cpu/1ki     45967677  36070985    +21.5%
BM_tensor_fft_single_2D_cpu/1k      619727095  456489500    +26.3%
BM_tensor_fft_double_2D_cpu/8           1110      1016     +8.5%
BM_tensor_fft_double_2D_cpu/9          17957     15768    +12.2%
BM_tensor_fft_double_2D_cpu/16          4558      4000    +12.2%
BM_tensor_fft_double_2D_cpu/17         79237     66901    +15.6%
BM_tensor_fft_double_2D_cpu/32         21494     17699    +17.7%
BM_tensor_fft_double_2D_cpu/33        357962    290357    +18.9%
BM_tensor_fft_double_2D_cpu/64        105179     87435    +16.9%
BM_tensor_fft_double_2D_cpu/65       1617143   1288006    +20.4%
BM_tensor_fft_double_2D_cpu/128       512848    419397    +18.2%
BM_tensor_fft_double_2D_cpu/129      7271322   5636884    +22.5%
BM_tensor_fft_double_2D_cpu/256      2415529   1922032    +20.4%
BM_tensor_fft_double_2D_cpu/257     32517952  24462177    +24.8%
BM_tensor_fft_double_2D_cpu/512     10724898   8287617    +22.7%
BM_tensor_fft_double_2D_cpu/513     146007419  108603266    +25.6%
BM_tensor_fft_double_2D_cpu/999     296351330  221885776    +25.1%
BM_tensor_fft_double_2D_cpu/1ki     59334166  48357539    +18.5%
BM_tensor_fft_double_2D_cpu/1k      666660132  483840349    +27.4%
2016-02-19 16:29:23 -08:00
Benoit Steiner
4d4211c04e Avoid unecessary type conversions 2016-02-05 18:19:41 -08:00
Benoit Steiner
5b7713dd33 Record whether the underlying tensor storage can be accessed directly during the evaluation of an expression. 2016-01-19 17:05:10 -08:00
Benoit Steiner
c444a0a8c3 Consistently use the same index type in the fft codebase. 2015-10-29 16:39:47 -07:00
Benoit Steiner
09ea3a7acd Silenced a few more compilation warnings 2015-10-29 16:22:52 -07:00
Benoit Steiner
a3e144727c Fixed compilation warning 2015-10-26 10:48:11 -07:00
Benoit Steiner
c40c2ceb27 Reordered the code of fft constructor to prevent compilation warnings 2015-10-23 09:38:19 -07:00
Benoit Steiner
a147c62998 Added support for fourier transforms (code courtesy of thucjw@gmail.com) 2015-10-22 16:51:30 -07:00