Antonio Sanchez
5f9cfb2529
Add missing adolc isinf/isnan.
...
Also modified cmake/FindAdolc.cmake to eliminate warnings, and added
search paths to match install layout.
Fixed : #2157
2021-02-19 22:26:56 +00:00
frgossen
33e0af0130
Return nan at poles of polygamma, digamma, and zeta if limit is not defined
2021-02-19 16:35:11 +00:00
David Tellenbach
36200b7855
Remove vim specific comments to recognoize correct file-type.
...
As discussed in #2143 we remove editor specific comments.
2021-02-09 09:13:09 +01:00
Ralf Hannemann-Tamas
984d010b7b
add specialization of check_sparse_solving() for SuperLU solver, in order to test adjoint and transpose solves
2021-02-08 22:00:31 +00:00
Antonio Sanchez
3f4684f87d
Include <cstdint>
in one place, remove custom typedefs
...
Originating from
[this SO issue](https://stackoverflow.com/questions/65901014/how-to-solve-this-all-error-2-in-this-case ),
some win32 compilers define `__int32` as a `long`, but MinGW defines
`std::int32_t` as an `int`, leading to a type conflict.
To avoid this, we remove the custom `typedef` definitions for win32. The
Tensor module requires C++11 anyways, so we are guaranteed to have
included `<cstdint>` already in `Eigen/Core`.
Also re-arranged the headers to only include `<cstdint>` in one place to
avoid this type of error again.
2021-01-26 14:23:05 -08:00
David Tellenbach
660c6b857c
Remove std::cerr in iterative solver since we don't have iostream.
...
This fixes #2123
2021-01-21 11:40:05 +01:00
Maozhou, Ge
21a8a2487c
fix paddings of TensorVolumePatchOp
2021-01-15 11:51:49 +08:00
Antonio Sanchez
070d303d56
Add CUDA complex sqrt.
...
This is to support scalar `sqrt` of complex numbers `std::complex<T>` on
device, requested by Tensorflow folks.
Technically `std::complex` is not supported by NVCC on device
(though it is by clang), so the default `sqrt(std::complex<T>)` function only
works on the host. Here we create an overload to add back the
functionality.
Also modified the CMake file to add `--relaxed-constexpr` (or
equivalent) flag for NVCC to allow calling constexpr functions from
device functions, and added support for specifying compute architecture for
NVCC (was already available for clang).
2020-12-22 23:25:23 -08:00
Turing Eret
19e6496ce0
Replace call to FixedDimensions() with a singleton instance of
...
FixedDimensions.
2020-12-16 07:34:44 -07:00
Turing Eret
bc7d1599fb
TensorStorage with FixedDimensions now has zero instance memory overhead.
...
Removed m_dimension as instance member of TensorStorage with
FixedDimensions and instead use the template parameter. This
means that the sizeof a pure fixed-size storage is exactly
equal to the data it is storing.
2020-12-14 07:19:34 -07:00
Antonio Sanchez
2dbac2f99f
Fix bad NEON fp16 check
2020-12-04 13:42:18 -08:00
Antonio Sanchez
e2f21465fe
Special function implementations for half/bfloat16 packets.
...
Current implementations fail to consider half-float packets, only
half-float scalars. Added specializations for packets on AVX, AVX512 and
NEON. Added tests to `special_packetmath`.
The current `special_functions` tests would fail for half and bfloat16 due to
lack of precision. The NEON tests also fail with precision issues and
due to different handling of `sqrt(inf)`, so special functions bessel, ndtri
have been disabled.
Tested with AVX, AVX512.
2020-12-04 10:16:29 -08:00
Rasmus Munk Larsen
71c85df4c1
Clean up the Tensor header and get rid of the EIGEN_SLEEP macro.
2020-12-02 11:04:04 -08:00
Antonio Sanchez
17268b155d
Add bit_cast for half/bfloat to/from uint16_t, fix TensorRandom
...
The existing `TensorRandom.h` implementation makes the assumption that
`half` (`bfloat16`) has a `uint16_t` member `x` (`value`), which is not
always true. This currently fails on arm64, where `x` has type `__fp16`.
Added `bit_cast` specializations to allow casting to/from `uint16_t`
for both `half` and `bfloat16`. Also added tests in
`half_float`, `bfloat16_float`, and `cxx11_tensor_random` to catch
these errors in the future.
2020-11-18 20:32:35 +00:00
Antonio Sanchez
3669498f5a
Fix rule-of-3 for the Tensor module.
...
Adds copy constructors to Tensor ops, inherits assignment operators from
`TensorBase`.
Addresses #1863
2020-11-18 18:14:53 +00:00
mehdi-goli
a725a3233c
[SYCL clean up the code] : removing exrta #pragma unroll in SYCL which was causing issues in embeded systems
2020-10-28 08:34:49 +00:00
Rasmus Munk Larsen
61fc78bbda
Get rid of nested template specialization in TensorReductionGpu.h, which was broken by c6953f799b01d36f4236b64f351cc1446e0abe17.
2020-10-13 23:53:11 +00:00
Rasmus Munk Larsen
c6953f799b
Add packet generic ops predux_fmin
, predux_fmin_nan
, predux_fmax
, and predux_fmax_nan
that implement reductions with PropagateNaN
, and PropagateNumbers
semantics. Add (slow) generic implementations for most reductions.
2020-10-13 21:48:31 +00:00
David Tellenbach
8f8d77b516
Add EIGEN prefix for HAS_LGAMMA_R
2020-10-08 18:32:19 +02:00
Eugene Zhulenev
2279f2c62f
Use lgamma_r if it is available (update check for glibc 2.19+)
2020-10-08 00:26:45 +00:00
Rasmus Munk Larsen
b431024404
Don't make assumptions about NaN-propagation for pmin/pmax - it various across platforms.
...
Change test to only test for NaN-propagation for pfmin/pfmax.
2020-10-07 19:05:18 +00:00
Zhuyie
e4b24e7fb2
Fix Eigen::ThreadPool::CurrentThreadId returning wrong thread id when EIGEN_AVOID_THREAD_LOCAL and NDEBUG are defined
2020-09-25 09:36:43 +00:00
Rasmus Munk Larsen
e55182ac09
Get rid of initialization logic for blueNorm by making the computed constants static const or constexpr.
...
Move macro definition EIGEN_CONSTEXPR to Core and make all methods in NumTraits constexpr when EIGEN_HASH_CONSTEXPR is 1.
2020-09-18 17:38:58 +00:00
Deven Desai
603e213d13
Fixing a CUDA / P100 regression introduced by PR 181
...
PR 181 ( https://gitlab.com/libeigen/eigen/-/merge_requests/181 ) adds `__launch_bounds__(1024)` attribute to GPU kernels, that did not have that attribute explicitly specified.
That PR seems to cause regressions on the CUDA platform. This PR/commit makes the changes in PR 181, to be applicable for HIP only
2020-08-20 00:29:57 +00:00
Deven Desai
46f8a18567
Adding an explicit launch_bounds(1024) attribute for GPU kernels.
...
Starting with ROCm 3.5, the HIP compiler will change from HCC to hip-clang.
This compiler change introduce a change in the default value of the `__launch_bounds__` attribute associated with a GPU kernel. (default value means the value assumed by the compiler as the `__launch_bounds attribute__` value, when it is not explicitly specified by the user)
Currently (i.e. for HIP with ROCm 3.3 and older), the default value is 1024. That changes to 256 with ROCm 3.5 (i.e. hip-clang compiler). As a consequence of this change, if a GPU kernel with a `__luanch_bounds__` attribute of 256 is launched at runtime with a threads_per_block value > 256, it leads to a runtime error. This is leading to a couple of Eigen unit test failures with ROCm 3.5.
This commit adds an explicit `__launch_bounds(1024)__` attribute to every GPU kernel that currently does not have it explicitly specified (and hence will end up getting the default value of 256 with the change to hip-clang)
2020-08-05 01:46:34 +00:00
Rasmus Munk Larsen
b92206676c
Inherit alignment trait from argument in TensorBroadcasting to avoid segfault when the argument is unaligned.
2020-07-28 19:19:37 +00:00
Antonio Sanchez
9cb8771e9c
Fix tensor casts for large packets and casts to/from std::complex
...
The original tensor casts were only defined for
`SrcCoeffRatio`:`TgtCoeffRatio` 1:1, 1:2, 2:1, 4:1. Here we add the
missing 1:N and 8:1.
We also add casting `Eigen::half` to/from `std::complex<T>`, which
was missing to make it consistent with `Eigen:bfloat16`, and
generalize the overload to work for any complex type.
Tests were added to `basicstuff`, `packetmath`, and
`cxx11_tensor_casts` to test all cast configurations.
2020-06-30 18:53:55 +00:00
Teng Lu
386d809bde
Support BFloat16 in Eigen
2020-06-20 19:16:24 +00:00
Ilya Tokar
231ce21535
Run two independent chains, when reducing tensors.
...
Running two chains exposes more instruction level parallelism,
by allowing to execute both chains at the same time.
Results are a bit noisy, but for medium length we almost hit
theoretical upper bound of 2x.
BM_fullReduction_16T/3 [using 16 threads] 17.3ns ±11% 17.4ns ± 9% ~ (p=0.178 n=18+19)
BM_fullReduction_16T/4 [using 16 threads] 17.6ns ±17% 17.0ns ±18% ~ (p=0.835 n=20+19)
BM_fullReduction_16T/7 [using 16 threads] 18.9ns ±12% 18.2ns ±10% ~ (p=0.756 n=20+18)
BM_fullReduction_16T/8 [using 16 threads] 19.8ns ±13% 19.4ns ±21% ~ (p=0.512 n=20+20)
BM_fullReduction_16T/10 [using 16 threads] 23.5ns ±15% 20.8ns ±24% -11.37% (p=0.000 n=20+19)
BM_fullReduction_16T/15 [using 16 threads] 35.8ns ±21% 26.9ns ±17% -24.76% (p=0.000 n=20+19)
BM_fullReduction_16T/16 [using 16 threads] 38.7ns ±22% 27.7ns ±18% -28.40% (p=0.000 n=20+19)
BM_fullReduction_16T/31 [using 16 threads] 146ns ±17% 74ns ±11% -49.05% (p=0.000 n=20+18)
BM_fullReduction_16T/32 [using 16 threads] 154ns ±19% 84ns ±30% -45.79% (p=0.000 n=20+19)
BM_fullReduction_16T/64 [using 16 threads] 603ns ± 8% 308ns ±12% -48.94% (p=0.000 n=17+17)
BM_fullReduction_16T/128 [using 16 threads] 2.44µs ±13% 1.22µs ± 1% -50.29% (p=0.000 n=17+17)
BM_fullReduction_16T/256 [using 16 threads] 9.84µs ±14% 5.13µs ±30% -47.82% (p=0.000 n=19+19)
BM_fullReduction_16T/512 [using 16 threads] 78.0µs ± 9% 56.1µs ±17% -28.02% (p=0.000 n=18+20)
BM_fullReduction_16T/1k [using 16 threads] 325µs ± 5% 263µs ± 4% -19.00% (p=0.000 n=20+16)
BM_fullReduction_16T/2k [using 16 threads] 1.09ms ± 3% 0.99ms ± 1% -9.04% (p=0.000 n=20+20)
BM_fullReduction_16T/4k [using 16 threads] 7.66ms ± 3% 7.57ms ± 3% -1.24% (p=0.017 n=20+20)
BM_fullReduction_16T/10k [using 16 threads] 65.3ms ± 4% 65.0ms ± 3% ~ (p=0.718 n=20+20)
2020-06-16 15:55:11 -04:00
mehdi-goli
d3e81db6c5
Eigen moved the scanLauncehr
function inside the internal namespace.
...
This commit applies the following changes:
- Moving the `scamLauncher` specialization inside internal namespace to fix compiler crash on TensorScan for SYCL backend.
- Replacing `SYCL/sycl.hpp` to `CL/sycl.hpp` in order to follow SYCL 1.2.1 standard.
- minor fixes: commenting out an unused variable to avoid compiler warnings.
2020-05-11 16:10:33 +01:00
Rasmus Munk Larsen
2fd8a5a08f
Add parallelization of TensorScanOp for types without packet ops.
...
Clean up the code a bit and do a few micro-optimizations to improve performance for small tensors.
Benchmark numbers for Tensor<uint32_t>:
name old time/op new time/op delta
BM_cumSumRowReduction_1T/8 [using 1 threads] 76.5ns ± 0% 61.3ns ± 4% -19.80% (p=0.008 n=5+5)
BM_cumSumRowReduction_1T/64 [using 1 threads] 2.47µs ± 1% 2.40µs ± 1% -2.77% (p=0.008 n=5+5)
BM_cumSumRowReduction_1T/256 [using 1 threads] 39.8µs ± 0% 39.6µs ± 0% -0.60% (p=0.008 n=5+5)
BM_cumSumRowReduction_1T/4k [using 1 threads] 13.9ms ± 0% 13.4ms ± 1% -4.19% (p=0.008 n=5+5)
BM_cumSumRowReduction_2T/8 [using 2 threads] 76.8ns ± 0% 59.1ns ± 0% -23.09% (p=0.016 n=5+4)
BM_cumSumRowReduction_2T/64 [using 2 threads] 2.47µs ± 1% 2.41µs ± 1% -2.53% (p=0.008 n=5+5)
BM_cumSumRowReduction_2T/256 [using 2 threads] 39.8µs ± 0% 34.7µs ± 6% -12.74% (p=0.008 n=5+5)
BM_cumSumRowReduction_2T/4k [using 2 threads] 13.8ms ± 1% 7.2ms ± 6% -47.74% (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/8 [using 8 threads] 76.4ns ± 0% 61.8ns ± 3% -19.02% (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/64 [using 8 threads] 2.47µs ± 1% 2.40µs ± 1% -2.84% (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/256 [using 8 threads] 39.8µs ± 0% 28.3µs ±11% -28.75% (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/4k [using 8 threads] 13.8ms ± 0% 2.7ms ± 5% -80.39% (p=0.008 n=5+5)
BM_cumSumColReduction_1T/8 [using 1 threads] 59.1ns ± 0% 80.3ns ± 0% +35.94% (p=0.029 n=4+4)
BM_cumSumColReduction_1T/64 [using 1 threads] 3.06µs ± 0% 3.08µs ± 1% ~ (p=0.114 n=4+4)
BM_cumSumColReduction_1T/256 [using 1 threads] 175µs ± 0% 176µs ± 0% ~ (p=0.190 n=4+5)
BM_cumSumColReduction_1T/4k [using 1 threads] 824ms ± 1% 844ms ± 1% +2.37% (p=0.008 n=5+5)
BM_cumSumColReduction_2T/8 [using 2 threads] 59.0ns ± 0% 90.7ns ± 0% +53.74% (p=0.029 n=4+4)
BM_cumSumColReduction_2T/64 [using 2 threads] 3.06µs ± 0% 3.10µs ± 0% +1.08% (p=0.016 n=4+5)
BM_cumSumColReduction_2T/256 [using 2 threads] 176µs ± 0% 189µs ±18% ~ (p=0.151 n=5+5)
BM_cumSumColReduction_2T/4k [using 2 threads] 836ms ± 2% 611ms ±14% -26.92% (p=0.008 n=5+5)
BM_cumSumColReduction_8T/8 [using 8 threads] 59.3ns ± 2% 90.6ns ± 0% +52.79% (p=0.008 n=5+5)
BM_cumSumColReduction_8T/64 [using 8 threads] 3.07µs ± 0% 3.10µs ± 0% +0.99% (p=0.016 n=5+4)
BM_cumSumColReduction_8T/256 [using 8 threads] 176µs ± 0% 80µs ±19% -54.51% (p=0.008 n=5+5)
BM_cumSumColReduction_8T/4k [using 8 threads] 827ms ± 2% 180ms ±14% -78.24% (p=0.008 n=5+5)
2020-05-06 14:48:37 -07:00
Rasmus Munk Larsen
0e59f786e1
Fix accidental copy of loop variable.
2020-05-05 21:35:38 +00:00
Rasmus Munk Larsen
7b76c85daf
Vectorize and parallelize TensorScanOp.
...
TensorScanOp is used in TensorFlow for a number of operations, such as cumulative logexp reduction and cumulative sum and product reductions.
The benchmarks numbers below are for cumulative row- and column reductions of NxN matrices.
name old time/op new time/op delta
BM_cumSumRowReduction_1T/4 [using 1 threads ] 25.1ns ± 1% 35.2ns ± 1% +40.45%
BM_cumSumRowReduction_1T/8 [using 1 threads ] 73.4ns ± 0% 82.7ns ± 3% +12.74%
BM_cumSumRowReduction_1T/32 [using 1 threads ] 988ns ± 0% 832ns ± 0% -15.77%
BM_cumSumRowReduction_1T/64 [using 1 threads ] 4.07µs ± 2% 3.47µs ± 0% -14.70%
BM_cumSumRowReduction_1T/128 [using 1 threads ] 18.0µs ± 0% 16.8µs ± 0% -6.58%
BM_cumSumRowReduction_1T/512 [using 1 threads ] 287µs ± 0% 281µs ± 0% -2.22%
BM_cumSumRowReduction_1T/2k [using 1 threads ] 4.78ms ± 1% 4.78ms ± 2% ~
BM_cumSumRowReduction_1T/10k [using 1 threads ] 117ms ± 1% 117ms ± 1% ~
BM_cumSumRowReduction_8T/4 [using 8 threads ] 25.0ns ± 0% 35.2ns ± 0% +40.82%
BM_cumSumRowReduction_8T/8 [using 8 threads ] 77.2ns ±16% 81.3ns ± 0% ~
BM_cumSumRowReduction_8T/32 [using 8 threads ] 988ns ± 0% 833ns ± 0% -15.67%
BM_cumSumRowReduction_8T/64 [using 8 threads ] 4.08µs ± 2% 3.47µs ± 0% -14.95%
BM_cumSumRowReduction_8T/128 [using 8 threads ] 18.0µs ± 0% 17.3µs ±10% ~
BM_cumSumRowReduction_8T/512 [using 8 threads ] 287µs ± 0% 58µs ± 6% -79.92%
BM_cumSumRowReduction_8T/2k [using 8 threads ] 4.79ms ± 1% 0.64ms ± 1% -86.58%
BM_cumSumRowReduction_8T/10k [using 8 threads ] 117ms ± 1% 18ms ± 6% -84.50%
BM_cumSumColReduction_1T/4 [using 1 threads ] 23.9ns ± 0% 33.4ns ± 1% +39.68%
BM_cumSumColReduction_1T/8 [using 1 threads ] 71.6ns ± 1% 49.1ns ± 3% -31.40%
BM_cumSumColReduction_1T/32 [using 1 threads ] 973ns ± 0% 165ns ± 2% -83.10%
BM_cumSumColReduction_1T/64 [using 1 threads ] 4.06µs ± 1% 0.57µs ± 1% -85.94%
BM_cumSumColReduction_1T/128 [using 1 threads ] 33.4µs ± 1% 4.1µs ± 1% -87.67%
BM_cumSumColReduction_1T/512 [using 1 threads ] 1.72ms ± 4% 0.21ms ± 5% -87.91%
BM_cumSumColReduction_1T/2k [using 1 threads ] 119ms ±53% 11ms ±35% -90.42%
BM_cumSumColReduction_1T/10k [using 1 threads ] 1.59s ±67% 0.35s ±49% -77.96%
BM_cumSumColReduction_8T/4 [using 8 threads ] 23.8ns ± 0% 33.3ns ± 0% +40.06%
BM_cumSumColReduction_8T/8 [using 8 threads ] 71.6ns ± 1% 49.2ns ± 5% -31.33%
BM_cumSumColReduction_8T/32 [using 8 threads ] 1.01µs ±12% 0.17µs ± 3% -82.93%
BM_cumSumColReduction_8T/64 [using 8 threads ] 4.15µs ± 4% 0.58µs ± 1% -86.09%
BM_cumSumColReduction_8T/128 [using 8 threads ] 33.5µs ± 0% 4.1µs ± 4% -87.65%
BM_cumSumColReduction_8T/512 [using 8 threads ] 1.71ms ± 3% 0.06ms ±16% -96.21%
BM_cumSumColReduction_8T/2k [using 8 threads ] 97.1ms ±14% 3.0ms ±23% -96.88%
BM_cumSumColReduction_8T/10k [using 8 threads ] 1.97s ± 8% 0.06s ± 2% -96.74%
2020-05-05 00:19:43 +00:00
Rasmus Munk Larsen
ab773c7e91
Extend support for Packet16b:
...
* Add ptranspose<*,4> to support matmul and add unit test for Matrix<bool> * Matrix<bool>
* work around a bug in slicing of Tensor<bool>.
* Add tensor tests
This speeds up matmul for boolean matrices by about 10x
name old time/op new time/op delta
BM_MatMul<bool>/8 267ns ± 0% 479ns ± 0% +79.25% (p=0.008 n=5+5)
BM_MatMul<bool>/32 6.42µs ± 0% 0.87µs ± 0% -86.50% (p=0.008 n=5+5)
BM_MatMul<bool>/64 43.3µs ± 0% 5.9µs ± 0% -86.42% (p=0.008 n=5+5)
BM_MatMul<bool>/128 315µs ± 0% 44µs ± 0% -85.98% (p=0.008 n=5+5)
BM_MatMul<bool>/256 2.41ms ± 0% 0.34ms ± 0% -85.68% (p=0.008 n=5+5)
BM_MatMul<bool>/512 18.8ms ± 0% 2.7ms ± 0% -85.53% (p=0.008 n=5+5)
BM_MatMul<bool>/1k 149ms ± 0% 22ms ± 0% -85.40% (p=0.008 n=5+5)
2020-04-28 16:12:47 +00:00
Eugene Zhulenev
3c02fefec5
Add async evaluation support to TensorSlicingOp.
...
Device::memcpy is not async-safe and might lead to deadlocks. Always evaluate slice expression in async mode.
2020-04-22 19:55:01 +00:00
Changming Sun
b1aa07a8d3
Fix a bug in TensorIndexList.h
2020-04-13 18:22:03 +00:00
jangsoopark
39142904cc
Resolve C4346 when building eigen on windows
2020-04-08 14:55:39 +09:00
Deven Desai
7158ed4e0e
Fixing HIP breakage caused by the recent commit that introduces Packet4h2 as the Eigen::Half packet type
2020-03-12 01:06:24 +00:00
Sami Kama
b733b8b680
remove duplicate pset1 for half and add some comments about why we need expose pmul/add/div/min/max on host
2020-03-10 20:28:43 +00:00
Cédric Hubert
98bfc5aaa8
Update MarketIO.h
2020-02-28 12:41:51 +00:00
Ilya Tokar
eb6cc29583
Avoid a division in NonBlockingThreadPool::Steal.
...
Looking at profiles we spend ~10-20% of Steal on simply computing
random % size. We can reduce random 32-bit int into [0, size) range with
a single multiplication and shift. This transformation is described in
https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
2020-02-14 16:02:57 -05:00
Eugene Zhulenev
f584bd9b30
Fail at compile time if default executor tries to use non-default device
2020-02-06 22:43:24 +00:00
Eugene Zhulenev
3fda850c46
Remove dead code from TensorReduction.h
2020-01-29 18:45:31 +00:00
Jeff Daily
b5df8cabd7
fix hip-clang compilation due to new HIP scalar accessor
2020-01-20 21:08:52 +00:00
Deven Desai
6d284bb1b7
Fix for HIP breakage - 200115. Adding a missing EIGEN_DEVICE_FUNC attr
2020-01-16 00:51:43 +00:00
Srinivas Vasudevan
f6c6de5d63
Ensure Igamma does not NaN or Inf for large values.
2020-01-14 21:32:48 +00:00
Eugene Zhulenev
b9362fb8f7
Convert StridedLinearBufferCopy::Kind to enum class
2020-01-13 11:43:24 -08:00
Matthew Powelson
2ea5a715cf
Properly initialize b vector in SplineFitting
...
InterpolateWithDerivative does not initialize the be vector correctly. This issue is discussed In stackoverflow question 48382939.
2020-01-09 21:29:04 +00:00
Ilya Tokar
19876ced76
Bug #1785 : Introduce numext::rint.
...
This provides a new op that matches std::rint and previous behavior of
pround. Also adds corresponding unsupported/../Tensor op.
Performance is the same as e. g. floor (tested SSE/AVX).
2020-01-07 21:22:44 +00:00
mehdi-goli
d0ae052da4
[SYCL Backend]
...
* Adding Missing operations for vector comparison in SYCL. This caused compiler error for vector comparison when compiling SYCL
* Fixing the compiler error for placement new in TensorForcedEval.h This caused compiler error when compiling SYCL backend
* Reducing the SYCL warning by removing the abort function inside the kernel
* Adding Strong inline to functions inside SYCL interop.
2020-01-07 15:13:37 +00:00