There’s no such thing as “atomic memory” in CUDA—atomicity is a property of an operation, not a memory type. atomicAdd is an instruction you apply to a memory location (typically in global memory or shared memory) so that the read → modify → write sequence happens as one indivisible action.
Example: histogram update (race condition vs atomic)
Say every thread reads a value x (0–255) and increments a histogram bin hist[x]. A plain hist[x]++ is a race condition, because two threads can hit the same bin at the same time.
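A sketch of the racy version (the kernel name histo_bad is hypothetical): two threads that read the same bin both load the old count, add one, and store it back, so one increment is lost.

```cuda
__global__ void histo_bad(const unsigned char* data, int N, int* hist) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        int bin = data[i];
        hist[bin]++;  // read, add, write: another thread can interleave here
    }
}
```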
Instead, use atomicAdd to avoid the race condition:
```cuda
__global__ void histo_good(const unsigned char* data, int N, int* hist) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        int bin = data[i];
        atomicAdd(&hist[bin], 1);  // read-modify-write is indivisible
    }
}
```
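Atomics also work on shared memory. A common follow-up optimization (a sketch, not from the text above) is to privatize the histogram per block in shared memory and flush it to global memory once at the end, so most atomic traffic stays on-chip:

```cuda
__global__ void histo_shared(const unsigned char* data, int N, int* hist) {
    __shared__ int local[256];  // per-block private histogram

    // Cooperatively zero the shared histogram
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        atomicAdd(&local[data[i]], 1);  // shared-memory atomic: much cheaper

    __syncthreads();

    // One global atomic per bin per block, instead of one per element
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&hist[b], local[b]);
}
```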
pragma unroll
#pragma unroll tells the NVCC compiler to expand a for loop in place, eliminating the loop counter and branch, which gives the scheduler more freedom. Note that this is a hint, not a guarantee. Also note that it can increase code size and register pressure, which may slow you down, and it is usually not effective on CPU code.
Also, #pragma unroll applies only to the very next for loop:
```cuda
#pragma unroll
for (int i = 0; i < 4; ++i) { ... }

for (int j = 0; j < 4; ++j) { ... }  // not affected
```
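NVCC also accepts an optional unroll factor after the pragma, and #pragma unroll 1 explicitly prevents unrolling. A minimal sketch (the function dot8 and its arrays are made up for illustration):

```cuda
__device__ float dot8(const float* a, const float* b) {
    float sum = 0.0f;
#pragma unroll  // trip count is a compile-time constant, so this fully unrolls
    for (int i = 0; i < 8; ++i)
        sum += a[i] * b[i];
    return sum;
}

__device__ float dot8_partial(const float* a, const float* b) {
    float sum = 0.0f;
#pragma unroll 2  // unroll by a factor of 2 instead of fully
    for (int i = 0; i < 8; ++i)
        sum += a[i] * b[i];
    return sum;
}
```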
Custom Functions
DIVUP is divide-and-round-up. It is used to calculate how many blocks are needed to cover all N elements.
```cuda
#define DIVUP(x, y) (((x) + (y) - 1) / (y))

// For 1000 elements with 256 threads per block, how many blocks do I need?
dim3 blocks(DIVUP(1000, 256));  // = 4 blocks
```