4012 Commits

Author SHA1 Message Date
Eugene Zhulenev
68a2a8c445 Use packet ops instead of AVX2 intrinsics 2019-04-23 11:41:02 -07:00
Anuj Rawat
8c7a6feb8e Adding lowlevel APIs for optimized RHS packet load in TensorFlow
SpatialConvolution

Low-level APIs are added in order to optimize the packet load in gemm_pack_rhs
in TensorFlow SpatialConvolution. The optimization targets the scenario where a
packet is split across 2 adjacent columns. In that case we read it as two
'partial' packets and then merge them into one. Currently this only works for
Packet16f (AVX512) and Packet8f (AVX2). We plan to add this for other
packet types (such as Packet8d) as well.
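A minimal sketch of the partial-load-and-merge idea for the AVX2 case (illustrative helper; the function name and the maskload-based approach are assumptions, not Eigen's actual API):

#include <immintrin.h>

// Hypothetical helper illustrating the idea for Packet8f (AVX2): a packet
// straddles two adjacent columns, so we load n floats from the tail of
// column j and 8-n floats from the head of column j+1, then merge.
// Assumes head_j1 - n is a valid pointer value (true inside a packed block).
static inline __m256 load_split_packet(const float* tail_j,
                                       const float* head_j1, int n) {
  alignas(32) int lo_mask[8], hi_mask[8];
  for (int i = 0; i < 8; ++i) {
    lo_mask[i] = (i < n) ? -1 : 0;  // lanes [0, n) come from column j
    hi_mask[i] = (i < n) ? 0 : -1;  // lanes [n, 8) come from column j+1
  }
  // maskload zeroes inactive lanes and does not touch their memory.
  __m256 lo = _mm256_maskload_ps(
      tail_j, _mm256_load_si256(reinterpret_cast<const __m256i*>(lo_mask)));
  __m256 hi = _mm256_maskload_ps(
      head_j1 - n, _mm256_load_si256(reinterpret_cast<const __m256i*>(hi_mask)));
  return _mm256_or_ps(lo, hi);  // the two partial packets merged into one
}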

This optimization shows a significant speedup in SpatialConvolution with
certain parameters. Some examples are below.

Benchmark parameters are specified as:
Batch size, Input dim, Depth, Num of filters, Filter dim

Speedup numbers are given for 1, 2, 4, 8, and 16 threads.

AVX512:

Parameters                  | Speedup (Num of threads: 1, 2, 4, 8, 16)
----------------------------|------------------------------------------
128,   24x24,  3, 64,   5x5 |2.18X, 2.13X, 1.73X, 1.64X, 1.66X
128,   24x24,  1, 64,   8x8 |2.00X, 1.98X, 1.93X, 1.91X, 1.91X
 32,   24x24,  3, 64,   5x5 |2.26X, 2.14X, 2.17X, 2.22X, 2.33X
128,   24x24,  3, 64,   3x3 |1.51X, 1.45X, 1.45X, 1.67X, 1.57X
 32,   14x14, 24, 64,   5x5 |1.21X, 1.19X, 1.16X, 1.70X, 1.17X
128, 128x128,  3, 96, 11x11 |2.17X, 2.18X, 2.19X, 2.20X, 2.18X

AVX2:

Parameters                  | Speedup (Num of threads: 1, 2, 4, 8, 16)
----------------------------|------------------------------------------
128,   24x24,  3, 64,   5x5 | 1.66X, 1.65X, 1.61X, 1.56X, 1.49X
 32,   24x24,  3, 64,   5x5 | 1.71X, 1.63X, 1.77X, 1.58X, 1.68X
128,   24x24,  1, 64,   5x5 | 1.44X, 1.40X, 1.38X, 1.37X, 1.33X
128,   24x24,  3, 64,   3x3 | 1.68X, 1.63X, 1.58X, 1.56X, 1.62X
128, 128x128,  3, 96, 11x11 | 1.36X, 1.36X, 1.37X, 1.37X, 1.37X

In the higher-level cifar10 benchmark, we observe a runtime improvement
of around 6% for AVX512 on an Intel Skylake server (8 cores).

On the lower-level PackRhs micro-benchmarks defined in TensorFlow's
tensorflow/core/kernels/eigen_spatial_convolutions_test.cc, we observe
the following runtimes:

AVX512:

Parameters                                                     | Runtime without patch (ns) | Runtime with patch (ns) | Speedup
---------------------------------------------------------------|----------------------------|-------------------------|---------
BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56)  |  41350                     | 15073                   | 2.74X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56)  |   7277                     |  7341                   | 0.99X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56)  |   8675                     |  8681                   | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56)  |  24155                     | 16079                   | 1.50X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56)  |  25052                     | 17152                   | 1.46X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) |  18269                     | 18345                   | 1.00X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) |  19468                     | 19872                   | 0.98X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432)   | 156060                     | 42432                   | 3.68X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432)   | 132701                     | 36944                   | 3.59X

AVX2:

Parameters                                                     | Runtime without patch (ns) | Runtime with patch (ns) | Speedup
---------------------------------------------------------------|----------------------------|-------------------------|---------
BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56)  | 26233                      | 12393                   | 2.12X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56)  |  6091                      |  6062                   | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56)  |  7427                      |  7408                   | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56)  | 23453                      | 20826                   | 1.13X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56)  | 23167                      | 22091                   | 1.09X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) | 23422                      | 23682                   | 0.99X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) | 23165                      | 23663                   | 0.98X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432)   | 72689                      | 44969                   | 1.62X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432)   | 61732                      | 39779                   | 1.55X

All benchmarks were run on an Intel Skylake server with 8 cores.
2019-04-20 06:46:43 +00:00
William D. Irons
8de66719f9 Collapsed revision from PR-619
* Add support for pcmp_eq in AltiVec/Complex.h
* Fixed implementation of pcmp_eq for double

The new logic is based on the logic from NEON for double.
2019-03-26 18:14:49 +00:00
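For reference, a scalar model of what a packet-wide pcmp_eq means for complex numbers (illustrative code, not the AltiVec implementation): a lane compares equal only when both the real and imaginary parts match, producing a full bit mask per lane.

#include <complex>
#include <cstdint>

// Scalar model of pcmp_eq on a packet of complex doubles: the result lane
// is all-ones when real and imaginary parts both match, all-zeros otherwise.
void pcmp_eq_complex_ref(const std::complex<double>* a,
                         const std::complex<double>* b,
                         std::uint64_t* out, int n) {
  for (int i = 0; i < n; ++i) {
    const bool eq = a[i].real() == b[i].real() && a[i].imag() == b[i].imag();
    out[i] = eq ? ~std::uint64_t{0} : 0;  // lane-wise true/false bit masks
  }
}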
Gael Guennebaud
f11364290e ICC does not support -fno-unsafe-math-optimizations 2019-03-22 09:26:24 +01:00
Deven Desai
51e399fc15 updates requested in the PR feedback. Also dropping code within #ifdef EIGEN_HAS_OLD_HIP_FP16 2019-03-19 21:45:25 +00:00
Deven Desai
2dbea5510f Merged eigen/eigen into default 2019-03-19 16:52:38 -04:00
Rasmus Larsen
5c93b38c5f Merged in rmlarsen/eigen (pull request PR-618)
Make clipping outside [-18:18] consistent for vectorized and non-vectorized paths of scalar_logistic_op<float>.

Approved-by: Gael Guennebaud <g.gael@free.fr>
2019-03-18 15:51:55 +00:00
Gael Guennebaud
cf7e2e277f bug #1692: enable enum as sizes of Matrix and Array 2019-03-17 21:59:30 +01:00
Rasmus Munk Larsen
e42f9aa68a Make clipping outside [-18:18] consistent for vectorized and non-vectorized paths of scalar_logistic_op<float>. 2019-03-15 17:15:14 -07:00
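A sketch of the clipping idea (illustrative function, not Eigen's functor): in float precision the logistic saturates to exactly 0 or 1 outside roughly [-18, 18], so clamping the input first lets the scalar and vectorized paths return identical results.

#include <algorithm>
#include <cmath>

// Illustrative clamped logistic: outside [-18, 18] the float result is
// already saturated, so the clamp changes nothing numerically but makes
// every code path take the same, consistent route.
inline float logistic_clamped(float x) {
  x = std::min(std::max(x, -18.0f), 18.0f);
  return 1.0f / (1.0f + std::exp(-x));
}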
Rasmus Munk Larsen
8450a6d519 Clean up half packet traits and add a few more missing packet ops. 2019-03-14 15:18:06 -07:00
David Tellenbach
97f9a46cb9 PR 593: Add variadic ctor for DiagonalMatrix with unit tests 2019-03-14 10:18:24 +01:00
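A usage sketch of what a variadic coefficient constructor permits (the exact overload set is per the PR):

#include <Eigen/Core>

// With a variadic constructor, a DiagonalMatrix of any fixed size can be
// built directly from its diagonal coefficients.
int main() {
  Eigen::DiagonalMatrix<double, 4> d(1.0, 2.0, 3.0, 4.0);
  Eigen::Matrix4d m = d.toDenseMatrix();  // zeros everywhere off the diagonal
  return 0;
}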
Gael Guennebaud
d7d2f0680e bug #1684: partially work around clang 6/7 bug #40815 2019-03-13 10:40:01 +01:00
Rasmus Munk Larsen
77f7d4a894 Clean up PacketMathHalf.h and add a few missing logical packet ops. 2019-03-11 17:51:16 -07:00
Gael Guennebaud
656d9bc66b Apply SSE's pmin/pmax fix for GCC <= 5 to AVX's pmin/pmax 2019-03-10 21:19:18 +01:00
Gael Guennebaud
bfbf7da047 bug #1689 fix used-but-marked-unused warning 2019-03-05 23:46:24 +01:00
Gael Guennebaud
b0d406d91c Enable construction of Ref<VectorType> from a runtime vector. 2019-03-03 15:25:25 +01:00
Sam Hasinoff
9ba81cf0ff Fully qualify Eigen::internal::aligned_free
This helps avoid a conflict on certain Windows toolchains
(potentially due to an ADL name-resolution bug) in the case
where aligned_free is defined in the global namespace. In any
case, tightening this up is harmless.
2019-03-02 17:42:16 +00:00
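An illustration of the conflict being avoided (hypothetical surrounding code, not the Eigen sources):

#include <cstdlib>

// Hypothetical setup: some platform header declares a global aligned_free.
inline void aligned_free(void* p) { std::free(p); }

namespace Eigen { namespace internal {
inline void aligned_free(void* p) { std::free(p); }  // Eigen's deallocator

inline void release(void* p) {
  // Fully qualified call: immune to whatever (possibly buggy) name lookup
  // the toolchain would perform for an unqualified aligned_free(p).
  Eigen::internal::aligned_free(p);
}
}}  // namespace Eigen::internal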
Gael Guennebaud
0b25a5c431 fix alignment in ploadquad 2019-02-22 21:39:36 +01:00
Gael Guennebaud
cca6c207f4 AVX512: implement faster ploadquad<Packet16f> thus speeding up GEMM 2019-02-21 17:18:28 +01:00
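For context, ploadquad's semantics in scalar form (a reference model, not the AVX512 implementation): PacketSize/4 scalars are read and each is replicated across four adjacent lanes.

#include <cstddef>

// Reference model of ploadquad for a 16-lane packet (Packet16f): read 4
// floats and duplicate each across 4 adjacent lanes, i.e.
// [a,a,a,a, b,b,b,b, c,c,c,c, d,d,d,d].
void ploadquad16_ref(const float* from, float out[16]) {
  for (std::size_t i = 0; i < 16; ++i) out[i] = from[i / 4];
}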
Gael Guennebaud
1c09ee8541 bug #1674: workaround clang fast-math aggressive optimizations 2019-02-22 15:48:53 +01:00
Gael Guennebaud
7e3084bb6f Fix compilation on ARM. 2019-02-22 14:56:12 +01:00
Gael Guennebaud
42c23f14ac Speed up col/row-wise reverse for fixed size matrices by propagating compile-time sizes. 2019-02-21 22:44:40 +01:00
Rasmus Munk Larsen
4d7f317102 Add a few missing packet ops: cmp_eq for NEON. pfloor for GPU. 2019-02-21 13:32:13 -08:00
Gael Guennebaud
2a39659d79 Add fully generic Vector<Type,Size> and RowVector<Type,Size> type aliases. 2019-02-20 15:23:23 +01:00
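Usage these aliases enable (they require C++11 alias templates; see the guarded macro a few commits below):

#include <Eigen/Core>

// The new generic aliases accept any scalar type and any size (including
// Eigen::Dynamic), generalizing the fixed Vector4f-style typedefs.
void alias_examples() {
  Eigen::Vector<float, 4> v4;                    // same as Eigen::Vector4f
  Eigen::Vector<double, Eigen::Dynamic> vx(10);  // same as Eigen::VectorXd
  Eigen::RowVector<int, 3> r3;                   // same as Eigen::RowVector3i
  (void)v4; (void)vx; (void)r3;
}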
Gael Guennebaud
302377110a Update documentation of Matrix and Array type aliases. 2019-02-20 15:18:48 +01:00
Gael Guennebaud
44b54fa4a3 Protect c++11 type alias with Eigen's macro, and add respective unit test. 2019-02-20 14:43:05 +01:00
Gael Guennebaud
7195f008ce Merged in ra_bauke/eigen (pull request PR-180)
alias template for matrix and array classes, see also bug #864

Approved-by: Heiko Bauke <heiko.bauke@mail.de>
2019-02-20 13:22:39 +00:00
Gael Guennebaud
edd413c184 bug #1409: make EIGEN_MAKE_ALIGNED_OPERATOR_NEW* macros empty in c++17 mode:
- this helps clang 5 and 6 to support alignas in STL's containers.
- this makes the public API of our (and users') classes cleaner
2019-02-20 13:52:11 +01:00
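The gist, as a sketch with a stand-in macro name (Eigen's real macro also covers the array and placement forms): in C++17, operator new honors over-aligned types via align_val_t (P0035), so the macro body can be empty.

#include <new>
#include <cstddef>
#include <Eigen/Core>

// Stand-in (hypothetical name) for the idea behind
// EIGEN_MAKE_ALIGNED_OPERATOR_NEW: only pre-C++17 compilers need the
// class-specific aligned operator new.
#if defined(__cpp_aligned_new) && __cpp_aligned_new >= 201606L
#define ALIGNED_OPERATOR_NEW_SKETCH /* empty in C++17 mode */
#else
#define ALIGNED_OPERATOR_NEW_SKETCH                                     \
  void* operator new(std::size_t size) {                                \
    return Eigen::internal::aligned_malloc(size);                       \
  }                                                                     \
  void operator delete(void* ptr) noexcept {                            \
    Eigen::internal::aligned_free(ptr);                                 \
  }
#endif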
Christoph Hertzberg
a1646fc960 Commas at the end of enumerator lists are not allowed in C++03 2019-02-19 14:32:25 +01:00
Gael Guennebaud
ab78cabd39 Add C++17 detection macro, and make sure throw(xpr) is not used if the compiler is in c++17 mode. 2019-02-19 14:04:35 +01:00
Gael Guennebaud
7580112c31 Fix harmless Scalar vs RealScalar cast. 2019-02-18 22:12:28 +01:00
Gael Guennebaud
31b6e080a9 Fix regression: .conjugate() was popped out but not re-introduced. 2019-02-18 14:45:55 +01:00
Gael Guennebaud
c69d0d08d0 Set the cost of conjugate to 0 (in practice it boils down to a no-op).
This is also important to make sure that A.conjugate() * B.conjugate() does not evaluate
its arguments into temporaries (e.g., if A and B are fixed-size and small, or if * falls back to lazyProduct).
2019-02-18 14:43:07 +01:00
Gael Guennebaud
512b74aaa1 GEMM: catch all scalar-multiple variants when falling back to a coeff-based product.
Before, only s*A*B was caught, which was inconsistent with GEMM, sub-optimal,
and could even lead to compilation errors (https://stackoverflow.com/questions/54738495).
2019-02-18 11:47:54 +01:00
Christoph Hertzberg
ec032ac03b Guard C++11-style default constructor. Also, this is only needed for MSVC 2019-02-16 09:44:05 +01:00
Gael Guennebaud
83309068b4 bug #1680: improve MSVC inlining by declaring many trivial constructors and accessors as STRONG_INLINE. 2019-02-15 16:35:35 +01:00
Gael Guennebaud
559320745e bug #1678: Fix lack of __FMA__ macro on MSVC with AVX512 2019-02-15 10:30:28 +01:00
Gael Guennebaud
d85ae650bf bug #1678: workaround MSVC compilation issues with AVX512 2019-02-15 10:24:17 +01:00
Rasmus Munk Larsen
65e23ca7e9 Revert b55b5c7280
2019-02-14 13:46:13 -08:00
Gael Guennebaud
2edfc6807d Fix compilation of empty products of the form: Mx0 * 0xN 2019-02-11 18:24:07 +01:00
Gael Guennebaud
013cc3a6b3 Make GEMM fallback to GEMV for runtime vectors.
This is a more general and simpler version of changeset 4c0fa6ce0f81ce67dd6723528ddf72f66ae92ba2
2019-02-07 16:24:09 +01:00
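The case this covers, as a sketch (both operands have dynamic shapes, so vector-ness is only known at runtime):

#include <Eigen/Core>

// Both factors are MatrixXf, so at compile time this looks like GEMM; at
// runtime B has a single column, and the product can now dispatch to the
// much cheaper GEMV path.
void runtime_vector_product(int n) {
  Eigen::MatrixXf A = Eigen::MatrixXf::Random(n, n);
  Eigen::MatrixXf B = Eigen::MatrixXf::Random(n, 1);  // a vector, at runtime
  Eigen::MatrixXf C(n, 1);
  C.noalias() = A * B;  // falls back to GEMV instead of GEMM
}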
Gael Guennebaud
fa2fcb4895 Backed out changeset 4c0fa6ce0f81ce67dd6723528ddf72f66ae92ba2 2019-02-07 16:07:08 +01:00
Gael Guennebaud
b3c4344a68 bug #1676: workaround GCC's bug in c++17 mode. 2019-02-07 15:21:35 +01:00
Eugene Zhulenev
6d0f6265a9 Remove duplicated comment line 2019-02-04 10:30:25 -08:00
Eugene Zhulenev
690b2c45b1 Fix GeneralBlockPanelKernel Android compilation 2019-02-04 10:29:15 -08:00
Gael Guennebaud
871e2e5339 bug #1674: disable GCC's unsafe-math-optimizations in sin/cos vectorization (results are completely wrong otherwise) 2019-02-03 08:54:47 +01:00
Rasmus Larsen
e7b481ea74 Merged in rmlarsen/eigen (pull request PR-578)
Speed up Eigen matrix*vector and vector*matrix multiplication.

Approved-by: Eugene Zhulenev <ezhulenev@google.com>
2019-02-02 01:53:44 +00:00
Sameer Agarwal
b55b5c7280 Speed up row-major matrix-vector product on ARM
The row-major matrix-vector multiplication code uses a threshold to
check if processing 8 rows at a time would thrash the cache.

This change introduces two modifications to this logic.

1. A smaller threshold for ARM and ARM64 devices.

The value of this threshold was determined empirically using a Pixel2
phone, by benchmarking a large number of matrix-vector products in the
range [1..4096]x[1..4096] and measuring performance separately on
small and little cores with frequency pinning.

On big (out-of-order) cores, this change has little to no impact. But on
the small (in-order) cores, the matrix-vector products are up to 700%
faster, especially on large matrices.

The motivation for this change was some internal code at Google which
was using hand-written NEON for implementing similar functionality,
processing the matrix one row at a time, which exhibited substantially
better performance than Eigen.

With the current change, Eigen handily beats that code.

2. Make the logic for choosing the number of simultaneous rows apply
uniformly to 8, 4 and 2 rows instead of just 8 rows.

Since the default threshold for non-ARM devices is essentially
unchanged (32000 -> 32 * 1024), this change has no impact on non-ARM
performance. This was verified by running the same set of benchmarks
on a Xeon desktop.
2019-02-01 15:23:53 -08:00
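A sketch of the heuristic described above (all names and the ARM threshold value are illustrative; only the 32 * 1024 default is quoted from the message):

#include <cstddef>

// Illustrative version of the row-blocking choice: try to process 8 rows
// of the row-major LHS per pass, halving until the rows' working set no
// longer risks thrashing the cache. The ARM value here is a placeholder;
// the commit tuned it empirically on a Pixel 2.
inline int rows_at_once(std::ptrdiff_t cols, std::ptrdiff_t scalar_size) {
#if defined(__arm__) || defined(__aarch64__)
  const std::ptrdiff_t threshold = 8 * 1024;   // smaller on ARM (placeholder)
#else
  const std::ptrdiff_t threshold = 32 * 1024;  // default from the message
#endif
  int block = 8;
  while (block > 1 && block * cols * scalar_size > threshold)
    block /= 2;  // applied uniformly: 8 -> 4 -> 2 -> 1
  return block;
}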
Rasmus Munk Larsen
4c0fa6ce0f Speed up Eigen matrix*vector and vector*matrix multiplication.
This change speeds up Eigen matrix * vector and vector * matrix multiplication for dynamic matrices when it is known at runtime that one of the factors is a vector.

The benchmarks below test

c.noalias() = n_by_n_matrix * n_by_1_matrix;
c.noalias() = 1_by_n_matrix * n_by_n_matrix;
respectively.

Benchmark measurements:

SSE:
Run on *** (72 X 2992 MHz CPUs); 2019-01-28T17:51:44.452697457-08:00
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_MatVec/64                            1096       312    +71.5%
BM_MatVec/128                           4581      1464    +68.0%
BM_MatVec/256                          18534      5710    +69.2%
BM_MatVec/512                         118083     24162    +79.5%
BM_MatVec/1k                          704106    173346    +75.4%
BM_MatVec/2k                         3080828    742728    +75.9%
BM_MatVec/4k                        25421512   4530117    +82.2%
BM_VecMat/32                             352       130    +63.1%
BM_VecMat/64                            1213       425    +65.0%
BM_VecMat/128                           4640      1564    +66.3%
BM_VecMat/256                          17902      5884    +67.1%
BM_VecMat/512                          70466     24000    +65.9%
BM_VecMat/1k                          340150    161263    +52.6%
BM_VecMat/2k                         1420590    645576    +54.6%
BM_VecMat/4k                         8083859   4364327    +46.0%

AVX2:
Run on *** (72 X 2993 MHz CPUs); 2019-01-28T17:45:11.508545307-08:00
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_MatVec/64                             619       120    +80.6%
BM_MatVec/128                           9693       752    +92.2%
BM_MatVec/256                          38356      2773    +92.8%
BM_MatVec/512                          69006     12803    +81.4%
BM_MatVec/1k                          443810    160378    +63.9%
BM_MatVec/2k                         2633553    646594    +75.4%
BM_MatVec/4k                        16211095   4327148    +73.3%
BM_VecMat/64                             925       227    +75.5%
BM_VecMat/128                           3438       830    +75.9%
BM_VecMat/256                          13427      2936    +78.1%
BM_VecMat/512                          53944     12473    +76.9%
BM_VecMat/1k                          302264    157076    +48.0%
BM_VecMat/2k                         1396811    675778    +51.6%
BM_VecMat/4k                         8962246   4459010    +50.2%

AVX512:
Run on *** (72 X 2993 MHz CPUs); 2019-01-28T17:35:17.239329863-08:00
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_MatVec/64                             401       111    +72.3%
BM_MatVec/128                           1846       513    +72.2%
BM_MatVec/256                          36739      1927    +94.8%
BM_MatVec/512                          54490      9227    +83.1%
BM_MatVec/1k                          487374    161457    +66.9%
BM_MatVec/2k                         2016270    643824    +68.1%
BM_MatVec/4k                        13204300   4077412    +69.1%
BM_VecMat/32                             324       106    +67.3%
BM_VecMat/64                            1034       246    +76.2%
BM_VecMat/128                           3576       802    +77.6%
BM_VecMat/256                          13411      2561    +80.9%
BM_VecMat/512                          58686     10037    +82.9%
BM_VecMat/1k                          320862    163750    +49.0%
BM_VecMat/2k                         1406719    651397    +53.7%
BM_VecMat/4k                         7785179   4124677    +47.0%
2019-01-31 14:24:08 -08:00
Gael Guennebaud
7ef879f6bf GEBP: improves pipelining in the 1pX4 path with FMA.
Prior to this change, a product whose LHS has 8 rows was faster with AVX-only than with AVX+FMA.
With AVX+FMA I measured a speedup of about 1.25x in such cases.
2019-01-30 23:45:12 +01:00
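For context, the accumulation step the FMA path fuses (a sketch using Eigen's generic packet ops, not the GEBP kernel itself):

#include <Eigen/Core>

// One accumulation step of a 1xN micro-kernel, written both ways. With
// FMA, pmadd maps to a single fused instruction, shortening the
// dependency chain the 1pX4 path has to keep pipelined.
template <typename Packet>
Packet step_avx_only(const Packet& a, const Packet& b, const Packet& c) {
  return Eigen::internal::padd(c, Eigen::internal::pmul(a, b));  // mul, then dependent add
}

template <typename Packet>
Packet step_fma(const Packet& a, const Packet& b, const Packet& c) {
  return Eigen::internal::pmadd(a, b, c);  // a*b + c in one instruction
}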