This allows us to do faster native scalar operations. Also
updated half/quarter packets to use the native type if available.
Benchmark improvement:
```
Comparing ./2910_without_float16 to ./2910_with_float16
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------------------------
BM_CalcMat<float>/10000/768/500 -0.0041 -0.0040 58276392 58039442 58273420 58039582
BM_CalcMat<_Float16>/10000/768/500 +0.0073 +0.0073 642506339 647214446 642481384 647188303
BM_CalcMat<Eigen::half>/10000/768/500 -0.3170 -0.3170 92511115 63182101 92506771 63179258
BM_CalcVec<float>/10000/768/500 +0.0022 +0.0022 5198157 5209469 5197913 5209334
BM_CalcVec<_Float16>/10000/768/500 +0.0025 +0.0026 10133324 10159111 10132641 10158507
BM_CalcVec<Eigen::half>/10000/768/500 -0.7760 -0.7760 45337937 10156952 45336532 10156389
OVERALL_GEOMEAN -0.2677 -0.2677 0 0 0 0
```
Fixes#2910.