Gael Guennebaud
bcf9bb5c1f
Avoid packing rhs multiple times when blocking on the lhs only.
2015-02-26 17:01:33 +01:00
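The loop order behind this commit can be shown with a minimal GotoBLAS-style sketch in plain C++. It is not Eigen's gebp code: the Mat type, blocked_gemm, and its kc/mc/nc parameters are hypothetical. Each kc x nc rhs block is packed once and then reused by every mc-row lhs block. So when the whole rhs width fits in one nc block (blocking on the lhs only), the rhs is packed once per depth block rather than once per lhs block.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Column-major dense matrix for this sketch only (hypothetical, not an Eigen type).
struct Mat {
  std::size_t rows, cols;
  std::vector<double> data;
  Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i + j * rows]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i + j * rows]; }
};

// C += A * B with block sizes kc (depth), mc (lhs rows), nc (rhs cols), all > 0.
// Returns how many times an rhs block was packed.
std::size_t blocked_gemm(const Mat& A, const Mat& B, Mat& C,
                         std::size_t kc, std::size_t mc, std::size_t nc) {
  std::size_t rhs_packs = 0;
  std::vector<double> packedB, packedA;
  for (std::size_t k0 = 0; k0 < A.cols; k0 += kc) {
    const std::size_t kb = std::min(kc, A.cols - k0);
    for (std::size_t j0 = 0; j0 < B.cols; j0 += nc) {
      const std::size_t nb = std::min(nc, B.cols - j0);
      // Pack the kb x nb rhs block once...
      packedB.assign(kb * nb, 0.0);
      for (std::size_t j = 0; j < nb; ++j)
        for (std::size_t k = 0; k < kb; ++k) packedB[k + j * kb] = B(k0 + k, j0 + j);
      ++rhs_packs;
      // ...and reuse it for every lhs block along m.
      for (std::size_t i0 = 0; i0 < A.rows; i0 += mc) {
        const std::size_t mb = std::min(mc, A.rows - i0);
        packedA.assign(mb * kb, 0.0);
        for (std::size_t k = 0; k < kb; ++k)
          for (std::size_t i = 0; i < mb; ++i) packedA[i + k * mb] = A(i0 + i, k0 + k);
        for (std::size_t j = 0; j < nb; ++j)
          for (std::size_t k = 0; k < kb; ++k) {
            const double b = packedB[k + j * kb];
            for (std::size_t i = 0; i < mb; ++i) C(i0 + i, j0 + j) += packedA[i + k * mb] * b;
          }
      }
    }
  }
  return rhs_packs;
}
```

With depth 4, kc = 4 and nc = 3 on a 5x4 by 4x3 product, the rhs is packed exactly once even though the lhs is split into three row blocks of mc = 2.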
Gael Guennebaud
4ec3f04b3a
Make sure that the block size computation is tested by our unit test.
2015-02-26 17:00:36 +01:00
Gael Guennebaud
a8ad8887bf
Implement a more generic blocking-size selection algorithm. See explanations inline.
It performs extremely well on Haswell. The main issue is to reliably and quickly find the
actual cache size to be used for our 2nd level of blocking, that is: max(l2,l3/nb_core_sharing_l3)
2015-02-26 16:04:35 +01:00
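The rule quoted in this commit for the second-level blocking budget, max(l2,l3/nb_core_sharing_l3), fits in a one-line helper. This is a sketch that assumes the L2 size, the L3 size, and the number of cores sharing the L3 are already known (the commit notes that finding them reliably and quickly is the hard part). The function name is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper (not Eigen's API): per-core cache budget for the second
// level of blocking, following max(l2, l3 / nb_core_sharing_l3).
std::ptrdiff_t second_level_cache_bytes(std::ptrdiff_t l2, std::ptrdiff_t l3,
                                        std::ptrdiff_t cores_sharing_l3) {
  if (cores_sharing_l3 < 1) cores_sharing_l3 = 1;  // guard against failed detection
  return std::max(l2, l3 / cores_sharing_l3);
}
```

For instance, with an assumed 256 KiB L2 and a 6 MiB L3 shared by 4 cores, the budget is max(256 KiB, 1.5 MiB) = 1.5 MiB.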
Gael Guennebaud
400becc591
Fix typos in block-size testing code, and set peeling on k to 8.
2015-02-26 15:57:06 +01:00
Benoit Jacob
692136350b
So I extensively measured the impact of the offset in this prefetch. I tried offset values from 0 to 128 (on this float* pointer, so implicitly times 4 bytes).
On x86, I tested a Sandy Bridge with AVX with 12M cache and a Haswell with AVX+FMA with 6M cache on MatrixXf sizes up to 2400.
I could not see any significant impact of this offset.
On Nexus 5, the offset has a slight effect: values around 32 (times sizeof float) are worst. Anything else is the same: the current 64 (8*pk), or... 0.
So let's just go with 0!
Note that we needed a fix anyway for not accounting for the value of RhsProgress. 0 nicely avoids the issue altogether!
2015-02-25 12:37:14 -05:00
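The offsets measured in this commit are counted in elements of the pointer's type, so on a float* an offset of 128 reaches 512 bytes ahead. The sketch below shows a prefetch with a configurable element offset, roughly the knob being swept. prefetch_ahead, prefetch_distance_bytes and dot_with_prefetch are hypothetical names; __builtin_prefetch is a GCC/Clang builtin, and other compilers fall back to a no-op here.

```cpp
#include <cstddef>

// Hypothetical sketch (not Eigen's gebp code): prefetch `offset` elements
// ahead of p. The offset is in elements of Scalar, not in bytes.
template <typename Scalar>
inline void prefetch_ahead(const Scalar* p, std::ptrdiff_t offset) {
#if defined(__GNUC__) || defined(__clang__)
  __builtin_prefetch(p + offset);  // read prefetch, default locality
#else
  (void)p; (void)offset;  // no portable prefetch: do nothing
#endif
}

// Distance in bytes covered by an element offset on a Scalar* pointer.
template <typename Scalar>
constexpr std::ptrdiff_t prefetch_distance_bytes(std::ptrdiff_t offset) {
  return offset * static_cast<std::ptrdiff_t>(sizeof(Scalar));
}

// Inner loop with a tunable offset; offset 0 prefetches the data about to be read.
inline float dot_with_prefetch(const float* a, const float* b, std::ptrdiff_t n,
                               std::ptrdiff_t offset) {
  float sum = 0.f;
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    if (i + offset < n) prefetch_ahead(b + i, offset);  // stay inside the array
    sum += a[i] * b[i];
  }
  return sum;
}
```

The prefetch only affects timing, never the result, which is why a sweep like the one described can compare offsets purely on speed.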
Christoph Hertzberg
531fa9de77
bug #970 : Add EIGEN_DEVICE_FUNC to RValue functions, in case CUDA supports rvalue references.
2015-02-24 21:03:28 +01:00
Benoit Jacob
26275b250a
Fix my recent prefetch changes:
- the first prefetch is actually harmful on Haswell with FMA,
but it is the most beneficial on ARM.
- the second prefetch... I was very stupid and multiplied by sizeof(scalar)
an offset on a scalar* pointer. The old offset was 64; pk = 8, so 64=pk*8.
So this effectively restores the older offset. Actually, there were
two prefetches here, one with offset 48 and one with offset 64. I could not
confirm any benefit from this strange 48 offset on either the Haswell or
my ARM device.
2015-02-23 16:55:17 -05:00
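The mistake described in this commit is easy to reproduce. Pointer arithmetic on a scalar* already scales by sizeof(scalar), so multiplying the offset by sizeof(scalar) again moves the target sizeof(scalar) times further away. The sketch assumes an intended offset of 8*pk floats with pk = 8. The helper names are hypothetical, and the "double scaled" variant only illustrates this class of error, not the exact line that was changed.

```cpp
#include <cstddef>

// Distance in bytes between two pointers into the same array.
template <typename Scalar>
std::ptrdiff_t byte_distance(const Scalar* from, const Scalar* to) {
  return reinterpret_cast<const char*>(to) - reinterpret_cast<const char*>(from);
}

// Intended (hypothetical): 8 packets of pk floats ahead, counted in elements.
inline const float* intended_target(const float* p, std::ptrdiff_t pk) {
  return p + 8 * pk;  // pk = 8 gives 64 floats = 64 * sizeof(float) bytes
}

// Buggy variant: on a float* the extra sizeof(float) scales the offset twice,
// landing sizeof(float) times further than intended.
inline const float* double_scaled_target(const float* p, std::ptrdiff_t pk) {
  return p + 8 * pk * static_cast<std::ptrdiff_t>(sizeof(float));
}
```

With 4-byte floats, the intended target is 256 bytes ahead while the double-scaled one is 1024 bytes ahead.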
Christoph Hertzberg
052b6b40f1
Fix two trivial warnings
2015-02-22 12:40:51 +01:00
Christoph Hertzberg
ecbf2a6656
log1p is defined only for real Scalars in C++11
2015-02-21 19:58:24 +01:00
Gael Guennebaud
3cf642baa3
Fix compilation of unit tests disabling assertion checking
2015-02-21 14:13:48 +01:00
Gael Guennebaud
2da1594750
Fix doc of Ref<>
2015-02-20 11:52:22 +01:00
Gael Guennebaud
b192e29eae
In C++11 destructors do not throw by default (fix CommaInitializer unit test)
2015-02-20 09:28:34 +01:00
Benoit Steiner
ab41652d81
Pulled latest changes from trunk
2015-02-19 21:23:37 -08:00
Benoit Steiner
7765039f1c
Marked the CUDA packet primitives as EIGEN_DEVICE_FUNC since they'll end up being executed on the GPU device.
2015-02-19 21:22:51 -08:00
Gael Guennebaud
a66f5fc2fd
Fix regression with C++11 support of lambda: now internal::result_of falls back to std::result_of in C++11.
2015-02-19 23:32:12 +01:00
Gael Guennebaud
1b7e12847d
Fix some calls to result_of on binary functors as unary ones.
2015-02-19 23:30:41 +01:00
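The result_of commits here revolve around deducing what a functor returns for a given argument list. As a standard-library-only sketch (not Eigen's internal::result_of), C++11 decltype plus std::declval computes this for plain functors and lambdas alike. It also makes the arity explicit: a binary functor has to be queried with two argument types. call_result, square_op, sum_op and lambda_result_is_float are hypothetical names.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical C++11 result-type deduction; Args must match the functor's arity.
template <typename F, typename... Args>
struct call_result {
  typedef decltype(std::declval<F>()(std::declval<Args>()...)) type;
};

struct square_op {  // unary functor
  double operator()(double x) const { return x * x; }
};
struct sum_op {  // binary functor
  int operator()(int a, int b) const { return a + b; }
};

static_assert(std::is_same<call_result<sum_op, int, int>::type, int>::value,
              "binary functor queried with two arguments");
static_assert(std::is_same<call_result<square_op, double>::type, double>::value,
              "unary functor queried with one argument");
// call_result<sum_op, int>::type would not compile: sum_op has no unary call.

// Lambdas work the same way through their closure type.
inline bool lambda_result_is_float() {
  auto twice = [](float x) { return 2.0f * x; };
  return std::is_same<call_result<decltype(twice), float>::type, float>::value;
}
```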
Gael Guennebaud
0f4dd15dfc
Declare as const some variables that are never modified
2015-02-19 23:28:57 +01:00
Gael Guennebaud
829dddd0fd
Add support for C++11 result_of/lambdas
2015-02-19 15:18:37 +01:00
Benoit Jacob
db05f2d01e
rotating kernel: avoid compiling anything outside of ARM
2015-02-18 15:43:52 -05:00
Benoit Jacob
0ed00d5438
remove a newly introduced redundant typedef - sorry.
2015-02-18 15:05:01 -05:00
Benoit Jacob
9bd8a4bab5
bug #955 - Implement a rotating kernel alternative in the 3px4 gebp path
This is substantially faster on ARM, where it's important to minimize the number of loads.
This is specific to the case where all packet types are of size 4. I made my best attempt to minimize how dirty this is... opinions welcome.
Eventually one could have a generic rotated kernel, but it would take some work to get there. Also, on Sandy Bridge, in my experience, it's not beneficial (even about 1% slower).
2015-02-18 15:03:35 -05:00
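To show what rotating buys, here is a scalar emulation of the idea, with std::array<float, 4> standing in for a 4-lane register. It is a hypothetical sketch, not the committed kernel, and it uses a single lhs packet (a 1px4 tile) instead of the three lhs packets we take the 3px4 path to handle. Each depth step loads one lhs packet and one rhs packet, then rotates the rhs packet three times instead of loading or broadcasting each of the four rhs coefficients. The rotation is undone once, after the depth loop.

```cpp
#include <array>

typedef std::array<float, 4> Packet4;  // stands in for a 4-lane SIMD register

// Rotate lanes left by one: (b0, b1, b2, b3) -> (b1, b2, b3, b0).
inline Packet4 rotate1(const Packet4& v) { return Packet4{{v[1], v[2], v[3], v[0]}}; }

// Hypothetical rotating 4x4 micro-kernel, emulated with scalars.
// lhs[k][i]: row i of the lhs at depth k; rhs[k][j]: column j of the rhs at depth k.
// Returns C = sum_k lhs[k] * rhs[k]^T as a row-major 4x4 array.
inline std::array<float, 16> rotating_kernel_4x4(const Packet4* lhs, const Packet4* rhs, int depth) {
  Packet4 acc[4] = {};  // acc[r][i] accumulates C(i, (i + r) % 4)
  for (int k = 0; k < depth; ++k) {
    const Packet4& a = lhs[k];  // one lhs load
    Packet4 b = rhs[k];         // one rhs load
    for (int r = 0; r < 4; ++r) {
      for (int i = 0; i < 4; ++i) acc[r][i] += a[i] * b[i];  // lane-wise multiply-add
      if (r < 3) b = rotate1(b);  // now b[i] == rhs[k][(i + r + 1) % 4]
    }
  }
  // Undo the rotation once, outside the depth loop.
  std::array<float, 16> C{};
  for (int r = 0; r < 4; ++r)
    for (int i = 0; i < 4; ++i) C[i * 4 + (i + r) % 4] = acc[r][i];
  return C;
}
```

On NEON the rotation could map to a lane-extract instruction such as vextq_f32, which is why trading loads for register shuffles can pay off there.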
Hauke Heibel
ee27d50633
Fixed template parameter.
2015-02-18 18:51:08 +01:00
Gael Guennebaud
73a24de424
merge
2015-02-18 15:51:00 +01:00
Gael Guennebaud
63eb0f6fe6
Clean a bit computeProductBlockingSizes (use Index type, remove CEIL macro)
2015-02-18 15:49:05 +01:00
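Replacing a CEIL macro usually means a typed ceiling-division function. The original macro is not shown in this log, so the following is only a sketch of such a helper; div_ceil is a hypothetical name. Computing a / b plus a remainder test avoids the overflow that (a + b - 1) / b can hit near the top of the Index range.

```cpp
#include <cassert>

// Hypothetical typed replacement for a ceiling-division macro.
// Precondition: a >= 0 and b > 0.
template <typename Index>
inline Index div_ceil(Index a, Index b) {
  assert(a >= 0 && b > 0);
  return a / b + (a % b != 0 ? 1 : 0);  // no a + b - 1 intermediate, so no overflow
}
```

A typical use is counting blocks, e.g. div_ceil(k, kc) blocks of depth kc are needed to cover a depth of k.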
Benoit Jacob
4a3e6c8be1
bug #958 - Allow testing specific blocking sizes
This is only a debugging/testing patch. It allows testing specific
product blocking sizes, typically to study the impact on performance.
Example usage:
int testk, testm, testn;
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZES
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_K testk
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_M testm
#define EIGEN_TEST_SPECIFIC_BLOCKING_SIZE_N testn
#include <Eigen/Core>
2015-02-18 09:43:55 -05:00
Gael Guennebaud
c7bb1e8ea8
Fix a regression when using OpenMP, and fix bug #714 : the number of threads might be lower than the number requested
2015-02-18 15:19:23 +01:00
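The bug #714 fix rests on one rule: an OpenMP runtime may start fewer threads than num_threads requests, so work must be split using omp_get_num_threads() queried inside the parallel region. A minimal sketch follows; distribute is a hypothetical helper, not Eigen's parallelizer, and without OpenMP it degrades to a single thread.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical sketch: split n items over the threads OpenMP actually started.
// requested_threads must be >= 1. Returns the id of the thread that took each item.
inline std::vector<int> distribute(int n, int requested_threads) {
  std::vector<int> owner(n, -1);
#ifdef _OPENMP
#pragma omp parallel num_threads(requested_threads)
  {
    const int actual = omp_get_num_threads();  // may be < requested_threads
    const int tid = omp_get_thread_num();
    const int chunk = (n + actual - 1) / actual;
    const int begin = tid * chunk;
    const int end = begin + chunk < n ? begin + chunk : n;
    for (int i = begin; i < end; ++i) owner[i] = tid;  // disjoint ranges, no race
  }
#else
  (void)requested_threads;  // compiled without OpenMP: one thread does everything
  for (int i = 0; i < n; ++i) owner[i] = 0;
#endif
  return owner;
}
```

Splitting by the requested count instead would leave the chunks of the threads that never started unprocessed.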
Jan Blechta
168ceb271e
Really use a zero initial guess in ConjugateGradient::solve as documented
and expected for consistency with other methods.
2015-02-18 14:26:10 +01:00
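For reference, a textbook conjugate gradient started from the zero vector looks like the sketch below. With x0 = 0, the first residual is simply b. The dense Vec/Dense types and the function names are assumptions for illustration, not Eigen's ConjugateGradient.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Vec;
typedef std::vector<Vec> Dense;  // row-major, hypothetical type for this sketch

inline Vec matvec(const Dense& A, const Vec& x) {
  Vec y(A.size(), 0.0);
  for (std::size_t i = 0; i < A.size(); ++i)
    for (std::size_t j = 0; j < x.size(); ++j) y[i] += A[i][j] * x[j];
  return y;
}
inline double dot(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Hypothetical plain CG for symmetric positive-definite A, started from x = 0.
inline Vec cg_solve_zero_guess(const Dense& A, const Vec& b, int max_iters, double tol) {
  Vec x(b.size(), 0.0);  // zero initial guess
  Vec r = b;             // r = b - A*x, which is just b when x = 0
  Vec p = r;
  double rr = dot(r, r);
  for (int it = 0; it < max_iters && std::sqrt(rr) > tol; ++it) {
    const Vec Ap = matvec(A, p);
    const double alpha = rr / dot(p, Ap);
    for (std::size_t i = 0; i < x.size(); ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
    const double rr_new = dot(r, r);
    for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + (rr_new / rr) * p[i];
    rr = rr_new;
  }
  return x;
}
```

For A = [[4, 1], [1, 3]] and b = (1, 2), two iterations reach x = (1/11, 7/11) up to rounding.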
Gael Guennebaud
8fdcaded5e
merge
2015-03-04 10:18:08 +01:00
Gael Guennebaud
c43154bbc5
Check for no-reallocation in SparseMatrix::insert (bug #974 )
2015-03-04 10:16:46 +01:00
Gael Guennebaud
1ce0178363
Improve efficiency of SparseMatrix::insert/coeffRef for sequential outer-index insertion strategies (bug #974 )
2015-03-04 09:39:26 +01:00
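A sequential outer-index order helps because of how compressed storage works. If entries arrive column by column (rows increasing within a column), every insertion is an append, and nothing has to be shifted or reallocated in the middle. The builder below is a hypothetical sketch, much simpler than Eigen's SparseMatrix, which also handles arbitrary insertion orders; this one simply rejects out-of-order input.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical compressed-column (CSC) builder: entries must come in
// non-decreasing column order, with increasing rows inside each column,
// so each insert is an O(1) append.
class SequentialCscBuilder {
 public:
  explicit SequentialCscBuilder(std::size_t cols) : outer_(cols + 1, 0), col_(0) {}

  void insert(std::size_t row, std::size_t col, double value) {
    if (col < col_ || col >= outer_.size() - 1)
      throw std::logic_error("non-sequential or out-of-range column");
    for (std::size_t c = col_ + 1; c <= col; ++c) outer_[c] = inner_.size();  // close skipped columns
    col_ = col;
    if (inner_.size() > outer_[col] && inner_.back() >= row)
      throw std::logic_error("rows must increase within a column");
    inner_.push_back(row);
    values_.push_back(value);
  }

  // Close the remaining columns; returns the column start offsets (size cols + 1).
  const std::vector<std::size_t>& finalize() {
    for (std::size_t c = col_ + 1; c < outer_.size(); ++c) outer_[c] = inner_.size();
    col_ = outer_.size() - 1;
    return outer_;
  }
  const std::vector<std::size_t>& inner() const { return inner_; }
  const std::vector<double>& values() const { return values_; }

 private:
  std::vector<std::size_t> outer_;  // outer_[c] = index of the first entry of column c
  std::vector<std::size_t> inner_;  // row index of each stored entry
  std::vector<double> values_;
  std::size_t col_;                 // column currently being filled
};
```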
Gael Guennebaud
05274219a7
Add a CG-based solver for rectangular least-square problems (bug #975 ).
2015-03-04 09:34:27 +01:00
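One common CG-based method for rectangular least-squares problems is CGLS, which runs conjugate gradient on the normal equations A^T A x = A^T b without ever forming A^T A. The sketch below is a textbook CGLS on small dense row-major matrices. It illustrates the general technique and is not a transcription of the committed solver; the types and function name are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Vec;
typedef std::vector<Vec> Dense;  // row-major m x n, hypothetical type for this sketch

// Hypothetical textbook CGLS, started from x = 0.
inline Vec cgls(const Dense& A, const Vec& b, int max_iters, double tol) {
  const std::size_t m = A.size(), n = A.empty() ? 0 : A[0].size();
  Vec x(n, 0.0), r = b, s(n, 0.0), p, q(m, 0.0);
  for (std::size_t j = 0; j < n; ++j)  // s = A^T r
    for (std::size_t i = 0; i < m; ++i) s[j] += A[i][j] * r[i];
  p = s;
  double gamma = 0.0;
  for (double v : s) gamma += v * v;
  for (int it = 0; it < max_iters && std::sqrt(gamma) > tol; ++it) {
    double qq = 0.0;
    for (std::size_t i = 0; i < m; ++i) {  // q = A p
      q[i] = 0.0;
      for (std::size_t j = 0; j < n; ++j) q[i] += A[i][j] * p[j];
      qq += q[i] * q[i];
    }
    const double alpha = gamma / qq;
    for (std::size_t j = 0; j < n; ++j) x[j] += alpha * p[j];
    for (std::size_t i = 0; i < m; ++i) r[i] -= alpha * q[i];
    for (std::size_t j = 0; j < n; ++j) {  // s = A^T r
      s[j] = 0.0;
      for (std::size_t i = 0; i < m; ++i) s[j] += A[i][j] * r[i];
    }
    double gamma_new = 0.0;
    for (double v : s) gamma_new += v * v;
    for (std::size_t j = 0; j < n; ++j) p[j] = s[j] + (gamma_new / gamma) * p[j];
    gamma = gamma_new;
  }
  return x;
}
```

For the overdetermined system A = [[1, 0], [0, 1], [1, 1]], b = (1, 2, 4), the least-squares solution is (4/3, 7/3).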
Benoit Jacob
2aa09e6b4e
Fix asm comments in 1px1 kernel
2015-03-03 13:44:00 -05:00
Benoit Jacob
eae8e27b7d
Add a benchmark-default-sizes action to benchmark-blocking-sizes.cpp
2015-03-03 11:41:21 -05:00
Marc Glisse
37a93c4263
New scoring functor to select the pivot.
This can be useful for non-floating-point scalars, where choosing the biggest element is generally not the best choice.
2015-03-03 17:08:28 +01:00
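A pivot search parameterized by a scoring functor might look like the following sketch. select_pivot, abs_score and small_nonzero_score are hypothetical names, not Eigen's API. The default scores by magnitude. The integer alternative, which prefers the smallest non-zero magnitude, is just one illustrative policy for exact arithmetic.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical: index of the entry with the highest score (column must be non-empty).
template <typename Scalar, typename Score>
std::size_t select_pivot(const std::vector<Scalar>& column, Score score) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < column.size(); ++i)
    if (score(column[i]) > score(column[best])) best = i;
  return best;
}

// Default for floating point: largest magnitude (partial pivoting).
struct abs_score {
  double operator()(double x) const { return std::fabs(x); }
};

// Illustrative integer policy: smallest non-zero magnitude scores highest,
// zero scores lowest (a unit pivot keeps exact elimination steps small).
struct small_nonzero_score {
  double operator()(long x) const {
    return x == 0 ? 0.0 : 1.0 / std::fabs(static_cast<double>(x));
  }
};
```

On the column (3, -7, 1, 0), abs_score selects -7 while small_nonzero_score selects 1.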
Benoit Jacob
ccc1277a42
must also disable complex<double> when disabling double vectorization
2015-03-03 10:17:05 -05:00
Benoit Jacob
f839099512
Work around an ICE in Clang 3.5 in the iOS toolchain with double NEON intrinsics.
2015-03-03 09:35:22 -05:00
Benoit Jacob
1ec0f4fadf
HalfPacket also needed to be disabled for double, on ARMv8.
2015-03-02 16:08:54 -05:00
Gael Guennebaud
3109f0e74e
Add SSE vectorization of Quaternion::conjugate. Significant speed-up when combined with products like q1*q2.conjugate()
2015-03-02 20:09:33 +01:00
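Conjugating a quaternion only flips the sign of its vector part, which SSE can do with a single XOR against a sign mask. The sketch below assumes coefficients stored as (x, y, z, w), as in Eigen's Quaternion::coeffs(), and falls back to scalar code when SSE is unavailable. The function is hypothetical, not Eigen's implementation.

```cpp
#include <array>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#endif

// Hypothetical sketch: conj(x, y, z, w) = (-x, -y, -z, w).
inline std::array<float, 4> conjugate(const std::array<float, 4>& q) {
  std::array<float, 4> out;
#if defined(__SSE__) || defined(_M_X64)
  // -0.0f has only the sign bit set; _mm_set_ps lists lanes 3..0, so lanes
  // 0..2 (x, y, z) get their sign flipped and lane 3 (w) is left unchanged.
  const __m128 mask = _mm_set_ps(0.0f, -0.0f, -0.0f, -0.0f);
  _mm_storeu_ps(out.data(), _mm_xor_ps(_mm_loadu_ps(q.data()), mask));
#else
  out = {{-q[0], -q[1], -q[2], q[3]}};
#endif
  return out;
}
```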
Gael Guennebaud
9aee1e300a
Increase unit-test L1 cache size to ensure we are doing at least 2 peeled loops within the product kernel.
2015-02-27 22:55:12 +01:00
Gael Guennebaud
b10cd3afd2
Re-enable detection of min/max parentheses protection, and re-enable mpreal_support unit test.
2015-02-27 22:38:00 +01:00
Gael Guennebaud
548b781380
Fix bug #945 : workaround MSVC warning
2015-02-18 12:53:49 +01:00
Gael Guennebaud
6f4adc9e94
Add missing install directives for arch/CUDA
2015-02-18 11:40:06 +01:00
Gael Guennebaud
63464754ef
Add an internal assertion in makeCompressed to catch a possible risk of null-pointer access.
2015-02-18 11:29:54 +01:00
Gael Guennebaud
eb563049f7
Remove some dead stores.
2015-02-18 11:26:48 +01:00
Gael Guennebaud
dc7e6acc05
Fix possible usage of a null pointer in CholmodSupport
2015-02-18 11:26:25 +01:00
Gael Guennebaud
d4eda01488
Bug #957, workaround MSVC/ICC compilation issue
2015-02-18 11:24:32 +01:00
Gael Guennebaud
20cac72b82
Packet must be passed by const reference and not by value to avoid alignment issues.
2015-02-17 22:58:32 +01:00
Christoph Hertzberg
97a36ecba4
Suppress some remaining Index conversion warnings
2015-02-17 18:52:39 +01:00
Gael Guennebaud
159fb181c2
Disable __m128* wrappers when compiling with AVX and -fabi-version=4
2015-02-17 16:27:20 +01:00
Gael Guennebaud
91ab2489dd
Fix compilation with GCC/AVX (workaround __m128 and __m256 being the same type with default ABI)
2015-02-17 16:08:07 +01:00