Eigen Expression Template. Optimization Impact: ⭐️⭐⭐️️⚪⚪
Expression templates are one of those “powerful but mysterious” C++ features that libraries like Eigen, Blaze, and even TensorFlow (C++) use to get zero-overhead performance while still writing math like normal algebra.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
struct Vec{
Vec(int size): d_(std::vector<double>(size, 0.0)){}
double& operator[](int i){return d_.at(i);}
int size() const {return d_.size();}
private:
std::vector<double> d_;
};
template<typename VecType>
struct Scaled {
const VecType& v;
double scale;
// Not doing anything here
Scaled(const VecType& vec, double s) : v(vec), scale(s) {}
double operator[](int i) const {
return scale * v[i];
}
int size() const { return v.size(); }
};
// Expression Template for adding two vectors
template<typename LHS, typename RHS>
struct Add {
const LHS& lhs;
const RHS& rhs;
Add(const LHS& a, const RHS& b) : lhs(a), rhs(b) {}
// Only evaluate when provided with an i
double operator[](int i) const {
return lhs[i] + rhs[i];
}
int size() const { return lhs.size(); }
};
// Overload * for scaling
template<typename VecType>
Scaled<VecType> operator*(double scalar, const VecType& v) {
// So we are not evaluating here
return Scaled<VecType>(v, scalar);
}
// Overload + for addition
template<typename LHS, typename RHS>
Add<LHS, RHS> operator+(const LHS& lhs, const RHS& rhs) {
// Not evaluating here
return Add<LHS, RHS>(lhs, rhs);
}
template<typename Expr>
void assign(Vec& dest, const Expr& expr) {
for (int i = 0; i < dest.size(); ++i)
// Expr is either Add or Scaled
// By calling expr[i], the overloaded [i] will be called in a chain. This minimizes tmp object creation
dest[i] = expr[i]; // No temporaries, direct computation
}
int main() {
Vec A(1000000), B(1000000), C(1000000);
for (int i = 0; i < A.size(); ++i) {
A[i] = i;
B[i] = 2*i;
}
// Expression: C = A + 2 * B; — no temporaries created
assign(C, A + 2 * B);
std::cout << "C[100] = " << C[100] << std::endl;
}
As you can see:
- No evaluation is done when
+ or *
are called. The final evaluation happens whenexp[i]
is called inassign()
. Then, all adds and multiplications are carried out, minimizing the creation of temporary objects.
Expression Template’s Issue with STL Functions
Expression templates are transformed into:
1
Vec3d::Zero() -> Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<double>, Eigen::Matrix<double, 3, 1>>
So if in an STL container, like in std::accumulate
, it actually does
1
2
3
4
5
6
7
8
```cpp
return std::accumulate(point_cloud->points.begin(), point_cloud->points.end(), Vec3d::Zero(),
[](const Vec3d& sum, const PCLPointXYZI& point) {
return sum + Vec3d(point.x, point.y, point.z);
});
->
result = binary_op(result, *first);
Without .eval()
will return another expression template. Then, when assigning back to result
, it will cause an issue, because the previous expression template is not assignable.
So the rule of thumb here is, always append .eval()
inside STL functions!
Another bug that’s extremely subtle is that sum + Vec3d(point.x, point.y, point.z)
returns an expression, ` (a CwiseBinaryOp object).`. So we need to force evaluation again. This can be done by:
- Calling
eval()
explicitly - Forcing return type:
->Vec3d
So the fixed version is:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
std::accumulate(point_cloud->points.begin(), point_cloud->points.end(), Vec3d::Zero().eval(),
[](const Vec3d& sum, const PCLPointXYZI& point){
Vec3d new_sum = sum + Vec3d(point.x, point.y, point.z);
return new_sum;
});
// Or
std::accumulate(point_cloud->points.begin(), point_cloud->points.end(), Vec3d::Zero().eval(),
[](const Vec3d& sum, const PCLPointXYZI& point) -> Vec3d {
return sum + Vec3d(point.x, point.y, point.z);
});
// Or
std::accumulate(point_cloud->points.begin(), point_cloud->points.end(), Vec3d::Zero().eval(),
[](const Vec3d& sum, const PCLPointXYZI& point){
return sum + Vec3d(point.x, point.y, point.z).eval();
});
Why Expression Template? Fused Computation
Eigen makes extensive use of expression template to “delay” computation so different steps can be fused together to create a fused computation. For example, if we want to do a +_2*b
, it’s actually faster to do
1
2
3
for (int i = 0; i < N; ++i){
C[i] = A[i] + 2 * B[i];
}
Than:
1
2
tmp = 2.0 * B; // vector multiply
C = A + tmp; // vector add
- This creates a
tmp
, which needs extra memory read/write fortmp
. - The loop is also easier for the optimizer to optimize
For maximum optimization, one can do:
1
-O3 -march=native -funroll-loops -ftree-vectorize
This will actually expand to: -march=haswell -msse4.2 -mavx2 -mfma -mbmi2 ...
Don’t trust me? Check this code snippet and its assembly out, and run it yourself! 😃
Some notes are:
- SIMD registers like
xmm0
can be found. But no SIMD instructions likemovapd
are used