Model Parameter vs VRAM

VRAM

Posted by Rico's Nerd Cluster on March 5, 2026

VRAM = Video RAM. It is the memory on your GPU, separate from your normal computer RAM. The GPU uses VRAM to store:

  • model weights
  • input images / batches
  • intermediate activations
  • gradients during training
  • optimizer states during training
  • temporary CUDA/TensorRT buffers

For inference/training, GPU memory has several components:

Total VRAM ≈ weights + activations + gradients + optimizer states + temporary buffers + input/output tensors

1. Model parameter memory is linear

Weights scale almost perfectly linearly with parameter count:

Precision      Memory per parameter   Example: 100M params
FP32           4 bytes                ~400 MB
FP16 / BF16    2 bytes                ~200 MB
INT8           1 byte                 ~100 MB
INT4           0.5 bytes              ~50 MB

So for inference weights only:

weight memory = parameter count × bytes per parameter
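The formula above can be sketched in a few lines of Python. This is only an illustration: the names `BYTES_PER_PARAM`, `weight_memory_bytes`, and `to_mb` are made up for this example, and it uses decimal megabytes (1 MB = 10^6 bytes) to match the table.

```python
# Bytes per parameter for common precisions (same values as the table above).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_bytes(num_params: int, precision: str) -> float:
    """Weight-only memory in bytes: parameter count x bytes per parameter."""
    return num_params * BYTES_PER_PARAM[precision]


def to_mb(num_bytes: float) -> float:
    """Convert bytes to decimal megabytes (1 MB = 1e6 bytes)."""
    return num_bytes / 1e6


if __name__ == "__main__":
    params = 100_000_000  # 100M parameters
    for precision in BYTES_PER_PARAM:
        mb = to_mb(weight_memory_bytes(params, precision))
        print(f"{precision}: ~{mb:.0f} MB")
```

Real quantized formats usually store extra scales and zero points, so actual INT8/INT4 files tend to be slightly larger than this estimate.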

Weights are not the whole story, though. For inference, especially with CNNs, ViTs, and detectors, activation memory can dominate. It grows roughly with:

batch size × image resolution × feature map sizes × number of layers

So two models with the same parameter count can use very different VRAM if one has larger intermediate feature maps.

Example: a detector with a large high-resolution neck/FPN can use more VRAM than another model with more parameters but smaller intermediate tensors.
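To make that concrete, here is a rough sketch that adds up the size of a few feature maps. The two "necks" below are hypothetical shapes chosen for illustration, not taken from any real detector, and the estimate assumes FP16 activations (2 bytes per value) and that all listed tensors are alive at the same time.

```python
def feature_map_bytes(batch, channels, height, width, bytes_per_value=2):
    """Memory for one activation tensor of shape (batch, channels, height, width)."""
    return batch * channels * height * width * bytes_per_value


def activation_bytes(batch, feature_maps, bytes_per_value=2):
    """Sum the memory of several feature maps given as (channels, height, width)."""
    return sum(
        feature_map_bytes(batch, c, h, w, bytes_per_value)
        for c, h, w in feature_maps
    )


# Hypothetical three-level necks with the same channel count but
# different spatial resolution.
high_res_neck = [(256, 160, 160), (256, 80, 80), (256, 40, 40)]
low_res_neck = [(256, 40, 40), (256, 20, 20), (256, 10, 10)]

if __name__ == "__main__":
    batch = 8
    for name, maps in [("high-res", high_res_neck), ("low-res", low_res_neck)]:
        mb = activation_bytes(batch, maps) / 1e6
        print(f"{name} neck: ~{mb:.1f} MB of activations")
```

The convolution weights feeding these maps could be identical in both cases, yet the high-resolution variant holds 16× as many activation values. Inference runtimes often free or reuse buffers early, so treat this as an upper bound for the listed tensors only.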

2. Training VRAM is much less linear

During training, VRAM has to hold:

  • weights
  • gradients
  • optimizer states
  • saved activations for backprop
  • augmentation / dataloader staging
  • loss buffers

For Adam/AdamW in mixed precision, a rough rule is:

training parameter memory ≈ 12–18 bytes per parameter

That figure is before activations. Where it falls in the range depends on which tensors the framework keeps in FP32.

So a 100M-param model might need 1.2–1.8 GB just for parameter-related training state, and activations can add much more on top.
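One way to see where a number in that range comes from is to add up the per-parameter tensors. The breakdown below is an assumption for illustration: it models one commonly described mixed-precision Adam layout (FP16 weights and gradients, plus an FP32 master copy and two FP32 Adam moments). Actual frameworks differ in what they keep and in which precision, which is why the rule of thumb is a range.

```python
# Assumed per-parameter layout for mixed-precision Adam/AdamW (illustrative).
TRAINING_STATE_BYTES = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_first_moment": 4,
    "adam_second_moment": 4,
}


def training_state_bytes(num_params, layout=TRAINING_STATE_BYTES):
    """Parameter-related training memory in bytes, excluding activations."""
    return num_params * sum(layout.values())


if __name__ == "__main__":
    params = 100_000_000  # 100M parameters
    per_param = sum(TRAINING_STATE_BYTES.values())
    gb = training_state_bytes(params) / 1e9
    print(f"{per_param} bytes/param -> ~{gb:.1f} GB before activations")
```

With this layout the total is 16 bytes per parameter, or about 1.6 GB for 100M parameters. Dropping or adding an entry in the dictionary (for example, FP32 gradients instead of FP16) moves the result within the 12–18 byte range.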

For YOLO-style detection training, VRAM is usually more sensitive to:

image size > batch size > model size

Parameter count matters, but going from 640 to 1280 image size doubles both height and width, so each feature map holds about 4× as many values. That can blow up activation memory much faster than moving from a small to a medium model.
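The scaling can be written down as a small helper. It is a sketch under a simplifying assumption: activation memory is proportional to batch size × height × width for square inputs, which is roughly true for fully convolutional models but ignores fixed overheads. The function name `activation_scale` is hypothetical.

```python
def activation_scale(old_size, new_size, old_batch=1, new_batch=1):
    """Relative activation memory after changing image size and/or batch size.

    Assumes memory is proportional to batch x height x width (square images).
    """
    return (new_batch / old_batch) * (new_size / old_size) ** 2


if __name__ == "__main__":
    # Doubling the image side length quadruples the estimate.
    print("640 -> 1280:", activation_scale(640, 1280))
    # Doubling the batch size only doubles it.
    print("batch 8 -> 16:", activation_scale(640, 640, old_batch=8, new_batch=16))
```

Under this assumption, image size grows the estimate quadratically while batch size grows it linearly, which matches the ordering above.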

For inference, model size is more predictive of VRAM than it is for training, but resolution and backend still matter. Overall, the relationship is:

Weights vs params: linear
Total VRAM vs params: only loosely correlated
Training VRAM vs params: partly linear, often dominated by activations