SIMT and SIMD
CUDA is often described as an embodiment of SIMD (Single Instruction, Multiple Data). SIMD is great for embarrassingly parallel problems: problems so “embarrassingly” simple that their subproblems have no dependencies on each other.
- One example is point cloud transformation. Every point can be transformed into a different frame independently, by a matrix multiply-accumulate (MMA) with a transform matrix.
Up until the Volta architecture (2017), a CUDA warp effectively executed in SIMD fashion: the exact same instruction is applied to different data in lockstep, like a marching band.
```
C[i] = A[i] * B[i]
```
However, when there is a condition in the kernel, the threads diverge. In the example below, the warp scheduler splits this logic into two passes: one pass for the threads that execute C[i] = A[i] * B[i], and another for the threads that execute C[i] = A[i] + B[i].
```
if A[i] > 0:
    C[i] = A[i] * B[i]
else:
    C[i] = A[i] + B[i]
```
Single-Instruction, Multiple-Thread (SIMT) execution can handle moderate levels of logical branching at the thread level: each thread carries its own program counter, so an individual thread can decide in one pass which instruction it should execute next. Meanwhile, each streaming multiprocessor has an L1 cache / shared-memory block (128 KB on recent architectures) that the threads in a block use to share data with one another.
```cpp
__global__ void conditional_op(float* A, float* B, float* C, int n) {
    // Compute the unique thread ID
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Ensure we don't go out of bounds
    if (idx < n) {
        if (A[idx] > 0) {
            C[idx] = A[idx] * B[idx];  // Branch 1
        } else {
            C[idx] = A[idx] + B[idx];  // Branch 2
        }
    }
}
```
For SIMT:
- The more divergent branches, the lower the performance. Deeply nested conditionals in a kernel are a bad idea.
- Contiguous (coalesced) memory accesses within a warp are much faster than scattered ones.