What is CUDA
CUDA is fundamentally the C++ language with extensions. Kernels, `__global__`, `__device__`, etc. are defined in the C++ space. If you want to expose a C function to CUDA code, just follow the standard:
```cpp
extern "C" void launch_kernel(...);
```
So this means:
- You can use host-side containers like `std::vector`, then pass their `.data()` pointer to `cudaMalloc`/`cudaMemcpy` or to kernels. But you cannot use a `std::vector` directly from device code.
- For device-side containers, use libraries designed for CUDA, e.g. `thrust::device_vector` or CUB, or manage raw device pointers yourself.
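A minimal end-to-end sketch of that host-container pattern (the `square` kernel is a made-up example): copy a `std::vector` to the device with the standard runtime calls, launch a kernel on the raw pointer, copy the results back.

```cuda
#include <vector>
#include <cuda_runtime.h>

__global__ void square(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= d[i];
}

int main() {
    std::vector<float> host(1024, 2.0f);       // host-side container
    size_t bytes = host.size() * sizeof(float);

    float* dev = nullptr;
    cudaMalloc(&dev, bytes);                   // raw device allocation
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    square<<<(host.size() + 255) / 256, 256>>>(dev, (int)host.size());

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```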
`nvcc` is a compiler driver, not a single compiler. It splits a `.cu` file into:
- host code (compiled by your host compiler, like `g++`)
- device code (compiled by NVIDIA's device toolchain into PTX, then SASS)
ATen is the C++ tensor library that PyTorch and libtorch use to manipulate tensors. Autograd is built on top of ATen.
- It provides `at::Tensor`, core tensor operations, CPU/CUDA device handling, and the backend dispatch mechanism (`#include <ATen/ATen.h>`).
- In PyTorch/ATen extensions you usually work with `at::Tensor` on the host and pass raw pointers (`tensor.data_ptr<T>()`) into CUDA kernels; ATen handles CPU/CUDA dispatch and memory details.
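As a sketch of that host/device split (`scale_kernel` and `scale_cuda` are hypothetical names, not ATen API), the `at::Tensor` stays on the host side and only the raw pointer crosses into the kernel:

```cuda
#include <ATen/ATen.h>

// Hypothetical example kernel: multiply every element in place.
__global__ void scale_kernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void scale_cuda(at::Tensor t, float factor) {
    // Make sure the layout is contiguous before taking the raw pointer.
    t = t.contiguous();
    int n = (int)t.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(t.data_ptr<float>(), factor, n);
}
```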
Programming Hierarchy
The hierarchy:
```
Grid → Blocks → Threads
```
- grid: a collection of many blocks; blocks are independent of each other.
- block: a group of threads (max 1024) that can cooperate (shared memory + synchronization)
  - a block is a software-visible grouping
  - a warp is a group of 32 threads, scheduled and executed together by the hardware. A warp is the smallest unit the GPU actually runs.
  - so if you launch 512 threads in a block, it actually splits into 512 / 32 = 16 warps: `threadIdx.x = 0–31` → warp 0, `threadIdx.x = 32–63` → warp 1
- thread: executes kernel code independently
  - threads inside the same block can share memory
  - can synchronize via `__syncthreads()`
Thread indexing within a block:
```cpp
thread_id = threadIdx.x + blockDim.x * threadIdx.y + blockDim.x * blockDim.y * threadIdx.z;
```
`__shared__`: shared memory (per block) on the GPU
Why Warps Matter
A warp executes in SIMT style: all 32 threads execute the same instruction at the same time. If you write
```cpp
if (threadIdx.x < 16) do_A();
else do_B();
```
the warp must execute both paths serially. This is called warp divergence, and it can slow down GPU performance.
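One common workaround, sketched here with hypothetical `do_A`/`do_B` device functions: branch on the warp index rather than the lane index, so all 32 threads of a warp agree on which path to take.

```cuda
__device__ void do_A();
__device__ void do_B();

__global__ void branchy() {
    // Diverges: lanes 0-15 and 16-31 of the same warp disagree,
    // so the warp runs do_A() and then do_B(), serially.
    if (threadIdx.x % 32 < 16) do_A(); else do_B();
}

__global__ void warp_aligned() {
    // No divergence: every thread in a given warp computes the same
    // warp index, so each warp takes exactly one of the two paths.
    int warp = threadIdx.x / 32;
    if (warp % 2 == 0) do_A(); else do_B();
}
```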
Memory Access Efficiency
If the 32 threads in a warp access
```cpp
xyz[i*3 + 0]
xyz[i*3 + 1]
xyz[i*3 + 2]
```
in contiguous memory, the GPU can coalesce it into one big memory transaction. If accesses are scattered, memory access will be slower.
CUDA threads have built-in IDs: `blockIdx`, `threadIdx`, `blockDim`, `gridDim`.
Kernel and Variables
A kernel is just a function that runs on the GPU. You don’t call it like a normal function, but instead you launch it over a grid of many lightweight threads. Each thread computes a small piece of the overall work.
A function must have a qualifier which indicates where it runs and from where it can be called:
```cpp
__global__ void NmDistanceKernel(...) { ... }
__device__ float sq(float a) { return a*a; }
__host__ void foo() { ... } // same as just: void foo() { ... }
__host__ __device__ inline float clamp(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}
```
- `__global__`: called from the CPU, runs on the GPU. The return type must be `void`. This is what makes it a kernel: it can be launched with `<<<grid, block>>>`.
- `__device__`: called from the GPU, runs on the GPU. It can't be launched with `<<< >>>` from the CPU; instead it is called from kernels or other device functions.
- `__host__`: called from the CPU, runs on the CPU. It's basically a normal C++ function.
- Combinations: `__host__ __device__` compiles two versions of the function, one for CPU calls and one for GPU calls. It's useful for small utilities needed in both places. In such a function, you can't freely use host-only code (most of the standard library, heap via `new`, OS calls) unless it's guarded.
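A sketch of such a guard using `__CUDA_ARCH__`, which nvcc defines only while compiling the device version of the function:

```cuda
#include <cstdio>

// One source function, two compiled versions. The host compiler sees the
// #else branch; the device compiler (where __CUDA_ARCH__ is defined) sees
// the #ifdef branch. printf happens to work in both places.
__host__ __device__ void where_am_i() {
#ifdef __CUDA_ARCH__
    printf("running on the GPU\n");
#else
    printf("running on the CPU\n");
#endif
}
```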
Variable specifiers:
```cpp
__constant__ float LUT[256];
__shared__ float tile[256];
```
- `__shared__`: shared memory (per block) on the GPU; fast, on-chip
- `__device__`: global device variable (lives in GPU global memory, slower than shared)
- `__constant__`: constant memory on the GPU (cached, read-only from kernels)
- `__managed__`: unified memory (accessible from CPU & GPU; migration is managed)
My First CUDA Program
A kernel launch looks like
```cpp
Kernel<<<gridDim, blockDim>>>(...);
```
Inside kernels, you see built-in indices:
```cpp
blockIdx.x, blockIdx.y, blockIdx.z
threadIdx.x, threadIdx.y, threadIdx.z
gridDim.*, blockDim.*   // sizes
```
Warps and Streaming Multiprocessors:
- threads are executed in groups of 32, and each group is a warp. Warps are scheduled onto SMs (streaming multiprocessors). Threads in the same warp execute the same instruction (SIMT). If they branch differently, performance can drop.
Memory types:
- global memory: large, slowish, accessible by all threads. Your tensor data lives here.
- shared memory: small, fast, per-block, used for cooperation (your `__shared__ float buf[...]`)
- registers: fastest, per-thread (your local `float x1, y1, z1`)

`__syncthreads()`: a block-level barrier; all threads in the block wait until everyone reaches it.
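A minimal sketch tying these pieces together (the `block_sum` kernel is a made-up example, assumed to be launched with 256 threads per block): each block loads a tile into shared memory, synchronizes, then tree-reduces it to one partial sum.

```cuda
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float tile[256];               // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                          // everyone has loaded its element

    // Tree reduction within the block; each step halves the active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();                      // wait before the next step
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];            // one partial sum per block
}
```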
Export
In chamfer_cuda.cpp, we use pybind11 to bind two functions for the forward pass and backward pass:
```cpp
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &chamfer_forward, "chamfer forward (CUDA)");
    m.def("backward", &chamfer_backward, "chamfer backward (CUDA)");
}
```
- forward pass: compute nearest neighbor squared distances + argmin indices
- backward pass: compute gradients w.r.t. point coordinates using those argmin indices
How to Compile
Compilation Method 1 - the classical setup.py
setup.py
```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='chamfer_3D',
    ext_modules=[
        CUDAExtension('chamfer_3D', [
            "/".join(__file__.split('/')[:-1] + ['chamfer_cuda.cpp']),
            "/".join(__file__.split('/')[:-1] + ['chamfer3D.cu']),
        ]),
    ],
    cmdclass={
        'build_ext': BuildExtension
    })
```
- `setuptools` registers an extension with sources `chamfer_cuda.cpp` and `chamfer3D.cu`
- `BuildExtension` invokes the host-side C++ compiler (for `.cpp` files) and `nvcc` (for `.cu` files) to produce object files, then links them together into a shared object `.so`
- the object files are linked against PyTorch/ATen/c10 and the CUDA runtime libraries (e.g., `libcudart`), so the `.so` is an importable Python extension
- everything is built with `-fPIC` (position-independent code) so it can be linked into a shared library
- pybind11 or torch macros are used to expose functions to Python
- kernels receive their data from C++ via `at::Tensor::data_ptr<T>()` or `tensor.contiguous().data_ptr()`
Then in a new console:
```shell
python3 setup.py build_ext --inplace
```
Compilation Method 2: Pytorch JIT Compilation
If a `.so` is not found, you can have PyTorch compile on the fly with `torch.utils.cpp_extension.load`; no `setup.py` or `build_ext` step is needed. You do need Ninja:
```shell
apt-get install -y ninja-build
```
```python
from torch.utils.cpp_extension import load

chamfer_3D = load(
    name="chamfer_3D",
    sources=["chamfer_cuda.cpp", "chamfer3D.cu"],
)
```
What it does:
- calls `nvcc` + the host C++ compiler the first time the module is imported
- caches the compiled `.so` in `~/.cache/torch_extensions/` (keyed by source hash + PyTorch version)
- returns the extension as a regular Python module: `chamfer_3D.forward(...)` just works
- on subsequent imports: cache hit → instant load, no recompile