eigen

mirror of https://gitlab.com/libeigen/eigen.git synced 2025-10-12 16:11:29 +08:00

Author	SHA1	Message	Date
Rasmus Munk Larsen	c475228b28	Vectorize atan() for double.	2022-10-01 01:49:30 +00:00
Rasmus Munk Larsen	7b2901e2aa	Add vectorized integer division for int32 with AVX512, AVX or SSE.	2022-09-21 00:27:23 +00:00
Rasmus Munk Larsen	bd393e15c3	Vectorize acos, asin, and atan for float.	2022-08-29 19:49:33 +00:00
Charles Schlosser	e5af9f87f2	Vectorize pow for integer base / exponent types	2022-08-29 19:23:54 +00:00
Matthew Sterrett	7a3b667c43	Add support for AVX512-FP16 for vectorizing half precision math	2022-08-17 18:15:21 +00:00
Matthew Sterrett	39fcc89798	Removed unnecessary checks for FP16C	2022-08-16 18:14:41 +00:00
aaraujom	d49ede4dc4	Add AVX512 s/dgemm optimizations for compute kernel (2nd try)	2022-05-28 02:00:21 +00:00
Antonio Sánchez	9b9496ad98	Revert "Add AVX512 optimizations for matrix multiply" This reverts commit 25db0b4a824ba9a092bbb514fbada51bf9d37a18	2022-05-13 18:50:33 +00:00
aaraujom	25db0b4a82	Add AVX512 optimizations for matrix multiply	2022-05-12 23:41:19 +00:00
Antonio Sánchez	07db964bde	Restrict new AVX512 trsm to AVX512VL, rename files for consistency.	2022-04-14 16:58:32 +00:00
Antonio Sánchez	9a14d91a99	Fix AVX512 builds with MSVC.	2022-03-18 16:04:53 +00:00
b-shi	518fc321cb	AVX512 Optimizations for Triangular Solve	2022-03-16 18:04:50 +00:00
Antonio Sánchez	e7f4a901ee	Define EIGEN_HAS_AVX512_MATH in PacketMath.	2022-02-04 22:25:52 +00:00
Antonio Sánchez	96da541cba	Fix AVX512 math function consistency, enable for ICC.	2022-02-04 19:35:18 +00:00
Rasmus Munk Larsen	51311ec651	Remove inline assembly for FMA (AVX) and add remaining extensions as packet ops: pmsub, pnmadd, and pnmsub.	2022-01-26 04:25:41 +00:00
Rasmus Munk Larsen	ea2c02060c	Add reciprocal packet op and fast specializations for float with SSE, AVX, and AVX512.	2022-01-21 23:49:18 +00:00
Ilya Tokar	a0fc640c18	Add support for packets of int64 on x86	2022-01-21 19:55:23 +00:00
Kolja Brix	afa616bc9e	Fix some typos found	2021-09-23 15:22:00 +00:00
Rasmus Munk Larsen	7b975acb1f	Remove unused variable.	2021-09-16 20:27:13 +00:00
Rasmus Munk Larsen	92849d814b	Remove unused variable.	2021-09-16 20:21:31 +00:00
Rasmus Munk Larsen	d7d0bf832d	Issue an error in case of direct inclusion of internal headers.	2021-09-10 19:12:26 +00:00
Antonio Sanchez	3d4ba855e0	Fix AVX integer packet issues. Most are instances of AVX2 functions not protected by `EIGEN_VECTORIZE_AVX2`. There was also a missing semi-colon for AVX512.	2021-09-01 14:14:43 -07:00
Jakub Lichman	dc5b1f7d75	AVX512 and AVX2 support for Packet16i and Packet8i added	2021-08-25 19:38:23 +00:00
Gauri Deshpande	e6a5a594a7	remove denormal flushing in fp32tobf16 for avx & avx512	2021-08-09 22:15:21 +00:00
Jakub Lichman	d87648a6be	Tests added and AVX512 bug fixed for pcmp_lt_or_nan	2021-04-25 20:58:56 +00:00
Jakub Lichman	2b1dfd1ba0	HasExp added for AVX512 Packet8d	2021-04-20 19:07:58 +00:00
Antonio Sanchez	1d79c68ba0	Fix ldexp for AVX512 (#2215 ) Wrong shuffle was used. Need to interleave low/high halves with a `permute` instruction. Fixes #2215.	2021-04-20 16:25:22 +00:00
Christoph Hertzberg	69a4f70956	Revert "Uses _mm512_abs_pd for Packet8d pabs" This reverts commit f019b97aca82071f35726b1aaebf1c598770f0f5	2021-03-23 18:52:19 +00:00
Steve Bronder	f019b97aca	Uses _mm512_abs_pd for Packet8d pabs	2021-03-18 15:47:52 +00:00
Antonio Sanchez	7ff0b7a980	Updated pfrexp implementation. The original implementation fails for 0, denormals, inf, and NaN. See #2150	2021-02-17 02:23:24 +00:00
Antonio Sanchez	4cb563a01e	Fix ldexp implementations. The previous implementations produced garbage values if the exponent did not fit within the exponent bits. See #2131 for a complete discussion, and !375 for other possible implementations. Here we implement the 4-factor version. See `pldexp_impl` in `GenericPacketMathFunctions.h` for a full description. The SSE `pcmp*` methods were moved down since `pcmp_le<Packet4i>` requires `por`. Left as a "TODO" is to delegate to a faster version if we know the exponent does fit within the exponent bits. Fixes #2131.	2021-02-10 22:45:41 +00:00
Rasmus Munk Larsen	cdd8fdc32e	Vectorize `pow(x, y)`. This closes https://gitlab.com/libeigen/eigen/-/issues/2085 , which also contains a description of the algorithm. I ran some testing (comparing to `std::pow(double(x), double(y)))` for `x` in the set of all (positive) floats in the interval `[std::sqrt(std::numeric_limits<float>::min()), std::sqrt(std::numeric_limits<float>::max())]`, and `y` in `{2, sqrt(2), -sqrt(2)}` I get the following error statistics: ``` max_rel_error = 8.34405e-07 rms_rel_error = 2.76654e-07 ``` If I widen the range to all normal float I see lower accuracy for arguments where the result is subnormal, e.g. for `y = sqrt(2)`: ``` max_rel_error = 0.666667 rms = 6.8727e-05 count = 1335165689 argmax = 2.56049e-32, 2.10195e-45 != 1.4013e-45 ``` which seems reasonable, since these results are subnormals with only couple of significant bits left.	2021-01-18 13:25:16 +00:00
Antonio Sanchez	839aa505c3	Fix typo in AVX512 packet math.	2020-12-11 21:35:44 -08:00
Antonio Sanchez	8c9976d7f0	Fix more SSE/AVX packet conversions for peven. MSVC doesn't like function-style casts and forces us to use intrinsics.	2020-12-11 15:46:42 -08:00
Rasmus Munk Larsen	125cc9a5df	Implement vectorized complex square root. Closes #1905 Measured speedup for sqrt of `complex<float>` on Skylake: SSE: ``` name old time/op new time/op delta BM_eigen_sqrt_ctype/1 49.4ns ± 0% 54.3ns ± 0% +10.01% BM_eigen_sqrt_ctype/8 332ns ± 0% 50ns ± 1% -84.97% BM_eigen_sqrt_ctype/64 2.81µs ± 1% 0.38µs ± 0% -86.49% BM_eigen_sqrt_ctype/512 23.8µs ± 0% 3.0µs ± 0% -87.32% BM_eigen_sqrt_ctype/4k 202µs ± 0% 24µs ± 2% -88.03% BM_eigen_sqrt_ctype/32k 1.63ms ± 0% 0.19ms ± 0% -88.18% BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 1.5ms ± 1% -88.20% BM_eigen_sqrt_ctype/1M 52.1ms ± 0% 6.2ms ± 0% -88.18% ``` AVX2: ``` name old cpu/op new cpu/op delta BM_eigen_sqrt_ctype/1 53.6ns ± 0% 55.6ns ± 0% +3.71% BM_eigen_sqrt_ctype/8 334ns ± 0% 27ns ± 0% -91.86% BM_eigen_sqrt_ctype/64 2.79µs ± 0% 0.22µs ± 2% -92.28% BM_eigen_sqrt_ctype/512 23.8µs ± 1% 1.7µs ± 1% -92.81% BM_eigen_sqrt_ctype/4k 201µs ± 0% 14µs ± 1% -93.24% BM_eigen_sqrt_ctype/32k 1.62ms ± 0% 0.11ms ± 1% -93.29% BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 0.9ms ± 1% -93.31% BM_eigen_sqrt_ctype/1M 52.0ms ± 0% 3.5ms ± 1% -93.31% ``` AVX512: ``` name old cpu/op new cpu/op delta BM_eigen_sqrt_ctype/1 53.7ns ± 0% 56.2ns ± 1% +4.75% BM_eigen_sqrt_ctype/8 334ns ± 0% 18ns ± 2% -94.63% BM_eigen_sqrt_ctype/64 2.79µs ± 0% 0.12µs ± 1% -95.54% BM_eigen_sqrt_ctype/512 23.9µs ± 1% 1.0µs ± 1% -95.89% BM_eigen_sqrt_ctype/4k 202µs ± 0% 8µs ± 1% -96.13% BM_eigen_sqrt_ctype/32k 1.63ms ± 0% 0.06ms ± 1% -96.15% BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 0.5ms ± 4% -96.11% BM_eigen_sqrt_ctype/1M 52.1ms ± 0% 2.0ms ± 1% -96.13% ```	2020-12-08 18:13:35 -08:00
Antonio Sanchez	e2f21465fe	Special function implementations for half/bfloat16 packets. Current implementations fail to consider half-float packets, only half-float scalars. Added specializations for packets on AVX, AVX512 and NEON. Added tests to `special_packetmath`. The current `special_functions` tests would fail for half and bfloat16 due to lack of precision. The NEON tests also fail with precision issues and due to different handling of `sqrt(inf)`, so special functions bessel, ndtri have been disabled. Tested with AVX, AVX512.	2020-12-04 10:16:29 -08:00
Rasmus Munk Larsen	e57281a741	Fix a few issues for AVX512. This change enables vectorized versions of log, exp, log1p, expm1 when AVX512DQ is not available.	2020-12-01 11:31:47 -08:00
Antonio Sanchez	89f90b585d	AVX512 missing ops. This allows the `packetmath` tests to pass for AVX512 on skylake. Made `half` and `bfloat16` consistent in terms of ops they support. Note the `log` tests are currently disabled for `bfloat16` since they fail due to poor precision (they were previously disabled for `Packet8bf` via test function specialization -- I just removed that specialization and disabled it in the generic test).	2020-11-30 16:28:57 +00:00
Guoqiang QI	4700713faf	Add AVX plog<Packet4d> and AVX512 plog<Packet8d> ops,also unified AVX512 plog<Packet16f> op with generic api	2020-10-15 00:54:45 +00:00
Rasmus Munk Larsen	af6f43d7ff	Add specializations for pmin/pmax with prescribed NaN propagation semantics for SSE/AVX/AVX512.	2020-10-14 23:11:24 +00:00
Rasmus Munk Larsen	4e4d3f32d1	Clean up packetmath tests and fix various bugs to make bfloat16 pass (almost) all packetmath tests with SSE, AVX, and AVX512.	2020-10-09 20:05:49 +00:00
Teng Lu	3ec4f0b641	Fix undefine BF16 union behavior in AVX512.	2020-07-29 02:20:21 +00:00
Sheng Yang	56b3e3f3f8	AVX path for BF16	2020-07-14 01:34:03 +00:00
Sheng Yang	116c5235ac	BF16 for scalar_cmp_with_cast_op	2020-07-01 18:33:42 +00:00
Teng Lu	386d809bde	Support BFloat16 in Eigen	2020-06-20 19:16:24 +00:00
Rasmus Munk Larsen	9b411757ab	Add missing packet ops for bool, and make it pass the same packet op unit tests as other arithmetic types. This change also contains a few minor cleanups: 1. Remove packet op pnot, which is not needed for anything other than pcmp_le_or_nan, which can be done in other ways. 2. Remove the "HasInsert" enum, which is no longer needed since we removed the corresponding packet ops. 3. Add faster pselect op for Packet4i when SSE4.1 is supported. Among other things, this makes the fast transposeInPlace() method available for Matrix<bool>. Run on ************** (72 X 2994 MHz CPUs); 2020-05-09T10:51:02.372347913-07:00 CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB Benchmark Time(ns) CPU(ns) Iterations ----------------------------------------------------------------------- BM_TransposeInPlace<float>/4 9.77 9.77 71670320 BM_TransposeInPlace<float>/8 21.9 21.9 31929525 BM_TransposeInPlace<float>/16 66.6 66.6 10000000 BM_TransposeInPlace<float>/32 243 243 2879561 BM_TransposeInPlace<float>/59 844 844 829767 BM_TransposeInPlace<float>/64 933 933 750567 BM_TransposeInPlace<float>/128 3944 3945 177405 BM_TransposeInPlace<float>/256 16853 16853 41457 BM_TransposeInPlace<float>/512 204952 204968 3448 BM_TransposeInPlace<float>/1k 1053889 1053861 664 BM_TransposeInPlace<bool>/4 14.4 14.4 48637301 BM_TransposeInPlace<bool>/8 36.0 36.0 19370222 BM_TransposeInPlace<bool>/16 31.5 31.5 22178902 BM_TransposeInPlace<bool>/32 111 111 6272048 BM_TransposeInPlace<bool>/59 626 626 1000000 BM_TransposeInPlace<bool>/64 428 428 1632689 BM_TransposeInPlace<bool>/128 1677 1677 417377 BM_TransposeInPlace<bool>/256 7126 7126 96264 BM_TransposeInPlace<bool>/512 29021 29024 24165 BM_TransposeInPlace<bool>/1k 116321 116330 6068	2020-05-14 22:39:13 +00:00
Rasmus Munk Larsen	c1d944dd91	Remove packet ops pinsertfirst and pinsertlast that are only used in a single place, and can be replaced by other ops when constructing the first/final packet in linspaced_op_impl::packetOp. I cannot measure any performance changes for SSE, AVX, or AVX512. name old time/op new time/op delta BM_LinSpace<float>/1 1.63ns ± 0% 1.63ns ± 0% ~ (p=0.762 n=5+5) BM_LinSpace<float>/8 4.92ns ± 3% 4.89ns ± 3% ~ (p=0.421 n=5+5) BM_LinSpace<float>/64 34.6ns ± 0% 34.6ns ± 0% ~ (p=0.841 n=5+5) BM_LinSpace<float>/512 217ns ± 0% 217ns ± 0% ~ (p=0.421 n=5+5) BM_LinSpace<float>/4k 1.68µs ± 0% 1.68µs ± 0% ~ (p=1.000 n=5+5) BM_LinSpace<float>/32k 13.3µs ± 0% 13.3µs ± 0% ~ (p=0.905 n=5+4) BM_LinSpace<float>/256k 107µs ± 0% 107µs ± 0% ~ (p=0.841 n=5+5) BM_LinSpace<float>/1M 427µs ± 0% 427µs ± 0% ~ (p=0.690 n=5+5)	2020-05-08 15:41:50 -07:00
Rasmus Munk Larsen	225ab040e0	Remove unused packet op "palign". Clean up a compiler warning in c++03 mode in AVX512/Complex.h.	2020-05-07 17:14:26 -07:00
Rasmus Munk Larsen	e80ec24357	Remove unused packet op "preduxp".	2020-04-23 18:17:14 +00:00
Rasmus Munk Larsen	5ab87d8aba	Move eigen_packet_wrapper to GenericPacketMath.h and use it for SSE/AVX/AVX512 as it is already used for NEON. This will allow us to define multiple packet types backed by the same vector type, e.g., __m128i. Use this machanism to define packets for half and clean up the packet op implementations.	2020-04-15 18:17:19 +00:00

1 2 3

122 Commits