[CUDA - 3] SIMT in CUDA


Posted by Rico's Nerd Cluster on January 14, 2026

SIMT and SIMD

CUDA is a very good embodiment of SIMD (Single Instruction, Multiple Data). SIMD is great for embarrassingly parallel problems: problems so "embarrassingly" simple that the subproblems have no dependencies on each other.

  • One example is point cloud transformation. Every point can be transformed into a different frame independently, by multiplying it with the same transform matrix.
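
A minimal sketch of such a kernel (the name `transform_points` and the row-major 3x4 transform layout are my own choices for illustration):

```cuda
// Apply the same rigid-body transform T (3x4, row-major: rotation + translation)
// to every point in a cloud. Each thread handles one point and no thread
// depends on another's result -- exactly the "embarrassingly parallel" case.
__global__ void transform_points(const float* in,   // n points, xyz interleaved
                                 float* out,        // n transformed points
                                 const float* T,    // 3x4 transform, row-major
                                 int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        float x = in[3 * i], y = in[3 * i + 1], z = in[3 * i + 2];
        out[3 * i]     = T[0] * x + T[1] * y + T[2]  * z + T[3];
        out[3 * i + 1] = T[4] * x + T[5] * y + T[6]  * z + T[7];
        out[3 * i + 2] = T[8] * x + T[9] * y + T[10] * z + T[11];
    }
}
```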

Up until the Volta architecture (2017), a CUDA warp executed in pure SIMD fashion. In SIMD, the exact same instruction is applied to different data in lockstep, like a marching band:

C[i] = A[i] * B[i]
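
As a complete kernel, the lockstep multiply looks like this (a minimal sketch; the name `elementwise_mul` is made up):

```cuda
// Every active thread in a warp executes this same multiply at the same
// time, just on a different index i -- no branching, no divergence.
__global__ void elementwise_mul(const float* A, const float* B, float* C, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        C[i] = A[i] * B[i];
    }
}
```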

However, when there is a condition in the kernel, the threads diverge. In the example below, the hardware serializes the branch into two passes: one pass executes C[i] = A[i] * B[i] with the A[i] > 0 threads active, and a second pass executes C[i] = A[i] + B[i] with the remaining threads active.

if A[i] > 0:
    C[i] = A[i] * B[i]
else:
    C[i] = A[i] + B[i]
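
What the warp effectively executes can be sketched as straight-line code with a per-thread predicate (this illustrates the masking idea; it is not the actual machine code the compiler emits):

```cuda
__global__ void conditional_op_predicated(const float* A, const float* B,
                                          float* C, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        bool p = A[i] > 0;        // per-thread predicate (the "mask")
        float mul = A[i] * B[i];  // pass 1: result kept by threads with p true
        float add = A[i] + B[i];  // pass 2: result kept by threads with p false
        C[i] = p ? mul : add;     // every thread pays the cost of both passes
    }
}
```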

SIMT (Single-Instruction, Multiple-Threads) can handle moderate levels of logical branching at the thread level. That is, it takes one pass for an arbitrary thread to decide which instruction it should execute, because each thread has its own program counter. Meanwhile, each Streaming Multiprocessor has up to 128 KB of combined L1 cache / shared memory that a thread can use to share data with others in its block.

__global__ void conditional_op(float* A, float* B, float* C, int n) {
    // Compute the unique thread ID
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Ensure we don’t go out of bounds
    if (idx < n) {
        if (A[idx] > 0) {
            C[idx] = A[idx] * B[idx];  // Branch 1
        } else {
            C[idx] = A[idx] + B[idx];  // Branch 2
        }
    }
}
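
A host-side launch for this kernel might look like the following sketch (the array size and the 256-thread block are arbitrary choices):

```cuda
int n = 1 << 20;
size_t bytes = n * sizeof(float);
float *A, *B, *C;
// Unified memory keeps the sketch short; cudaMalloc + cudaMemcpy works too.
cudaMallocManaged(&A, bytes);
cudaMallocManaged(&B, bytes);
cudaMallocManaged(&C, bytes);
// ... fill A and B ...
int threads = 256;
int blocks = (n + threads - 1) / threads;  // round up to cover every element
conditional_op<<<blocks, threads>>>(A, B, C, n);
cudaDeviceSynchronize();  // wait for the kernel before reading C on the host
```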

For SIMT:

  • The more branches, the lower the performance. Deeply nested logic in a kernel is not a good idea.
  • Coalesced (contiguous) memory accesses within a warp are faster: when consecutive threads read adjacent addresses, the hardware can serve the whole warp with a few wide memory transactions.
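
A sketch contrasting coalesced and strided access (the stride parameter and kernel names are made up for illustration):

```cuda
// Coalesced: consecutive threads read consecutive floats, so a warp's
// 32 loads can be combined into a few wide memory transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` floats apart, so
// each load may touch a different cache line -- many more transactions
// per warp for the same amount of useful data.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        long j = (long)i * stride % n;  // scatter reads across the array
        out[i] = in[j];
    }
}
```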