My Journey Into CUDA

GPU Architecture, Tensor Cores, SIMD SIMT, Pinned Memory

Posted by Rico's Nerd Cluster on August 15, 2023

A great introduction video can be found here.

GPU (GA102) Architecture

A graphics card’s brain is its GPU. NVidia’s Ampere GPU architecture family includes the GA102 and GA104 chips. GA102 is shared across the RTX 3080 (Sep 2020), 3090 (Sep 2020), 3080 Ti (May 2021), and 3090 Ti (Jan 2022). Here’s the architecture hierarchy:

1 GPU -> 7 Graphics Processing Clusters (GPC), laid out as a 2D array -> 12 Streaming Multiprocessors (SM) per GPC -> 4 warp partitions, 1 Ray-Tracing (RT) core, and 4 Special Function Units per SM

In a warp:

1 warp -> 32 CUDA cores + 1 Tensor Core

GA102 Architecture

The full GA102 chip has 10752 CUDA cores, 336 Tensor cores, and 84 RT cores. These cores do all the computation of a GPU.

  • A CUDA core executes standard floating-point and integer operations. It is great for vector calculations or rasterization in graphics, and each core executes one thread at a time.
    • Non-mixed-precision matrix multiplication and accumulation (MMA) is done here. Within one clock cycle, a core performs one multiply and one add: A x B + C.
    • A different section of the core handles bit shifting, bit masking, queueing incoming operands and instructions, and accumulating outputs.
    • Division, square root, and trigonometry are done on the Special Function Units in the SM.
    • In ML, this is useful for:
      • scalar operations and summations
      • data augmentation
      • activations like ReLU and Softmax (see the sketch after this list)
  • Tensor Cores: see below “Tensor Cores” section
  • A Ray-Tracing core computes ray tracing, which simulates how light interacts with surfaces to create realistic shadows and reflections. It handles ray-object intersection and bounding-volume traversal for 3D rendering.
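
As a concrete illustration of the kind of element-wise work that runs on the regular CUDA cores, here is a minimal ReLU kernel. This is my own sketch (not from any reference): one thread per element, one simple FP32 operation each.

// One thread per element; each CUDA core executes one such thread at a time.
__global__ void relu(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = fmaxf(in[i], 0.0f);   // ReLU: max(x, 0)
    }
}

// Launch example: relu<<<(n + 255) / 256, 256>>>(d_in, d_out, n);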

The GPU series is manufactured in a very smart way. Defects are usually damaged cores that only affect their own SM circuitry. So depending on where a chip’s defects land, those SMs can be deactivated and the chip can still be binned as a 3080, 3080 Ti, etc.

  • Note that each GPU tier also has a different clock speed, which follows the same binning philosophy.
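
You can confirm several of these numbers (SM count, warp size, L2 size, memory bus width) on your own card by querying the device properties at runtime. A minimal sketch of my own; the printed values will of course vary by GPU.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                          // device 0
    printf("Name:       %s\n",     prop.name);
    printf("SM count:   %d\n",     prop.multiProcessorCount);   // 84 on a full GA102
    printf("Warp size:  %d\n",     prop.warpSize);              // 32 threads
    printf("L2 cache:   %d B\n",   prop.l2CacheSize);
    printf("Memory bus: %d-bit\n", prop.memoryBusWidth);
    return 0;
}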

Tensor Cores

This nice tutorial explains how tensor cores work. The main features include:

  • A tensor core performs mixed-precision matrix multiplication and accumulation (MMA) with FP16 and INT8 inputs. It’s great for ML training.
    • The Ampere architecture can also handle TF32 float (a 19-bit floating point value), or BF16
  • A tensor core breaks matrix multiplication and addition down into 4x4 submatrices, or tiles, to do D = A*B + C, where A and B are FP16 and C and D are FP16 or FP32.
    • Now, you might be wondering how to break down matrices of any size to this. Let’s walk through a simple example.
      • Imagine A: 10x8, B: 8x4, C: 10x4, so the result D is 10x4. Let’s allocate that memory.
      • Pad A with zero rows so it’s 16x8, i.e. a 4x2 grid of 4x4 tiles. Now A is
        
          A11, A12
          A21, A22
          A31, A32
          A41, A42
        
      • Similarly, B (8x4) splits into two vertically stacked 4x4 tiles B11 and B21, so A*B is:

          A11 * B11 + A12 * B21
          A21 * B11 + A22 * B21
          A31 * B11 + A32 * B21
          A41 * B11 + A42 * B21

      • The warp then figures out the synchronization points to add the partial results together. The point is that with padding, matrix multiplication of any size can be decomposed into tile multiplications followed by additions. (A warp-level sketch using CUDA’s WMMA API is shown after this list.)
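
In CUDA code you don’t drive the 4x4 hardware tiles directly; the warp-level WMMA API exposes tensor cores through larger 16x16x16 fragments, and the warp handles the tile bookkeeping described above. Below is a minimal sketch of my own (names are mine; it assumes M, N, K are already zero-padded to multiples of 16 and that everything is stored row-major), not a full GEMM implementation.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Each block is one warp (32 threads) and produces one 16x16 tile of D = A*B + C.
// A and B are FP16; C and D are FP32.
__global__ void wmma_gemm(const half* A, const half* B, const float* C, float* D,
                          int M, int N, int K) {
    int tile_row = blockIdx.y;   // which 16-row band of D this warp owns
    int tile_col = blockIdx.x;   // which 16-column band of D this warp owns

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    // Start the accumulator from the corresponding tile of C.
    wmma::load_matrix_sync(acc_frag, C + tile_row * 16 * N + tile_col * 16,
                           N, wmma::mem_row_major);

    // Walk along the shared K dimension, multiplying and accumulating tiles.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tile_row * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_col * 16, N);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }

    wmma::store_matrix_sync(D + tile_row * 16 * N + tile_col * 16, acc_frag,
                            N, wmma::mem_row_major);
}

// Launch example: dim3 grid(N / 16, M / 16); wmma_gemm<<<grid, 32>>>(A, B, C, D, M, N, K);

Note that WMMA requires a tensor-core-capable architecture at compile time (e.g. -arch=sm_70 or newer).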

Thread Architecture

  • A warp has 32 threads.
  • The same instruction is issued to all threads in a warp.
  • A grid of blocks or a thread block can be organized as 1D or 2D for indexing. Ultimately, each thread has a 1D index. In general, CUDA’s 2D structures follow the convention below (which is different from the pixel convention):

block_id  = blockIdx.x + blockIdx.y * gridDim.x
thread_id = block_id * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x

SIMT and SIMD

CUDA is a very good embodiment of SIMD (Single Instruction, Multiple Data). SIMD is great for embarrassingly parallel problems: problems so “embarrassingly” simple that the subproblems have no dependencies on each other.

  • One example is point cloud transformation: all points can be transformed into a different frame by multiplying them with a transform matrix, as in the sketch below.
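
A minimal sketch of that example, assuming N points and a 4x4 homogeneous transform stored row-major in constant memory (the names are mine, for illustration only):

#include <cuda_runtime.h>

__constant__ float T[16];   // 4x4 row-major homogeneous transform

// Each thread transforms one point into the new frame; no thread depends on another.
__global__ void transform_points(const float3* in, float3* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 p = in[i];
    out[i].x = T[0] * p.x + T[1] * p.y + T[2]  * p.z + T[3];
    out[i].y = T[4] * p.x + T[5] * p.y + T[6]  * p.z + T[7];
    out[i].z = T[8] * p.x + T[9] * p.y + T[10] * p.z + T[11];
}

// Host side: cudaMemcpyToSymbol(T, host_T, 16 * sizeof(float));
//            transform_points<<<(n + 255) / 256, 256>>>(d_in, d_out, n);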

Up until the Volta generation (around 2017), all threads in a warp shared a single program counter, so a CUDA core effectively did SIMD. In SIMD, the exact same instruction is applied to different data in lockstep, like a marching band:

C[i] = A[i] * B[i]

However, when there is a condition in the kernel, there is divergence. In the example below, the warp scheduler would split the logic into two passes over the threads, one for C[i] = A[i] * B[i] and one for C[i] = A[i] + B[i]:

if A[i] > 0:
    C[i] = A[i] * B[i]
else:
    C[i] = A[i] + B[i]

Single-Instruction, Multiple-Threads (SIMT) can handle moderate levels of branching at the thread level: each thread has its own program counter, so a thread can track which instruction it should execute next instead of requiring separate lockstep passes. Meanwhile, each SM has a 128 KB L1 cache / shared memory block that threads use to share data with each other.

__global__ void conditional_op(float* A, float* B, float* C, int n) {
    // Compute the unique thread ID
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Ensure we don’t go out of bounds
    if (idx < n) {
        if (A[idx] > 0) {
            C[idx] = A[idx] * B[idx];  // Branch 1
        } else {
            C[idx] = A[idx] + B[idx];  // Branch 2
        }
    }
}

For SIMT:

  • The more branches, the lower the performance; nested logic in a kernel is not a good idea.
  • Contiguous (coalesced) memory accesses are faster within a warp (see the sketch below).
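
A small sketch of my own to illustrate the coalescing point: in the first kernel the 32 threads of a warp touch 32 consecutive floats, which the hardware can serve with a few wide memory transactions; in the second, a stride spreads the same accesses over many cache lines, so the same useful data costs many more transactions.

// Coalesced: consecutive threads read consecutive addresses.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` elements apart (slower).
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}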

GPU Peripherals

Voltage Regulation:

  • A graphics card takes a 12 V input, which its voltage regulator steps down to about 1.1 V for the GA102 chip. This can produce a significant amount of heat, so we need a heat sink.
  • A graphics card also has up to 24 graphics memory chips (GDDR6X, GDDR7). In gaming, 3D models are loaded SSD -> graphics memory -> L2 cache.
    • A GA102 chip has two 3 MB L2 cache partitions (6 MB total), which is very limited.
    • The 24 graphics memory chips connect over a 384-bit bus to move data from graphics memory to the L2 cache. The bandwidth is roughly 1.15 TB/s.
    • Micron’s GDDR6X graphics memory chips encode 2 bits per symbol using four voltage levels (PAM4 encoding). The industry is now moving to PAM3, an encoding scheme with three voltage levels, for GDDR7.
      • This improves power efficiency, reduces encoder complexity, and increases SNR.
  • High-Bandwidth Memory chips such as Micron HBM3E:
    • This chip is a memory “cube”, with stacked layers of memory dies connected by Through-Silicon Vias (TSVs).
    • It is usually used in AI datacenters 😊

Data Transfer

pin_memory

A page of memory is the smallest fixed-length block of virtual memory managed by the OS. Please check here for more about paging. In CUDA, tensors are loaded from the CPU (host) to the GPU (device) following this process:

  1. CUDA allocates a temporary page-locked (“pinned”) buffer on the CPU as a staging area
  2. Copy tensors to the staging area
  3. Transfer data from CPU to GPU

So, in PyTorch, DataLoader(pin_memory=True) stores samples in page-locked memory, which speeds up the CPU-to-GPU transfer.

From the NVidia documentation
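
At the CUDA level, the same idea looks roughly like this (a sketch of my own, not taken from the NVidia docs): cudaMallocHost gives you page-locked host memory, which skips the extra staging copy and lets cudaMemcpyAsync overlap the transfer with other work on a stream.

#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float* h_pinned = nullptr;
    cudaMallocHost(&h_pinned, bytes);      // page-locked (pinned) host allocation
    float* d_data = nullptr;
    cudaMalloc(&d_data, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // The copy is only truly asynchronous when the host buffer is pinned.
    cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFree(d_data);
    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
    return 0;
}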

CUDA Operations

TODO