(does not support complex and does not re-use the QR decomposition)
* Rewrite the cache friendly product to have only one instance per scalar type !
This significantly speeds up compilation time and reduces executable size.
The current drawback is that some trivial expressions might be
evaluated like conjugate or negate.
* Renamed "cache optimal" to "cache friendly"
* Added the ability to directly access matrix data of some expressions via:
- the stride()/_stride() methods
- DirectAccessBit flag (replace ReferencableBit)