Deep Learning - PyTorch Model Training

Checkpointing, Op Determinism, 🤗 HuggingFace Trainer

Posted by Rico's Nerd Cluster on March 6, 2022

Checkpointing

Checkpointing is a technique that trades compute for memory during training. Instead of storing all intermediate activations (layer outputs) for backprop, which consumes a lot of memory, checkpointing discards some of them and recomputes them during the backward pass. This saves memory at the expense of additional computation.

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.inc = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # Example layer

    def forward(self, x):
        # Checkpointed layer: its activations are recomputed during the backward pass instead of stored
        x = checkpoint(self.inc, x, use_reentrant=False)  # use_reentrant=False is the recommended mode in newer PyTorch
        return x

Checkpointing can be used on plain functions as well, not just modules.
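As a minimal sketch of that (the two conv layers and the block function are made up for illustration), torch.utils.checkpoint.checkpoint accepts any callable, so a whole sub-network wrapped in a function can be checkpointed in one call:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)

def block(x):
    # Any function of tensors can be checkpointed, not just nn.Module instances
    return torch.relu(conv2(torch.relu(conv1(x))))

x = torch.randn(1, 3, 64, 64, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # activations inside block are recomputed during backward
y.sum().backward()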

Op Determinism

Here is a good reference on Op Determinism. Below is how the story goes:

  • Tensor operations are not necessarily deterministic:
    • tf.nn.softmax_cross_entropy_with_logits (From a quick search, it’s still not clear to me why this is non-deterministic. Mathematically, the quantity should be deterministic.)
  • Op Determinism makes sure you get the same outputs given the same code and the same hardware. But it disables asynchronicity, so it will slow these operations down.
    • Use the same software environment in every run (OS, checkpoints, version of CUDA and TensorFlow, environmental variables, etc). Note that determinism is not guaranteed across different versions of TensorFlow.
  • How to enable Op Determinism?
    • PyTorch: torch.use_deterministic_algorithms(True) (see the sketch after the seeding snippet below)
    • TensorFlow: tf.keras.utils.set_random_seed(1) followed by tf.config.experimental.enable_op_determinism()

      • This effectively seeds the pseudorandom number generators (PRNGs) in Python, NumPy, and TensorFlow.
      • Once op determinism is enabled, calling tf.random.normal without setting the seed would raise a RuntimeError, but the Python and NumPy RNGs won't.
  • To get a more consistent training loss, one can do:
import random

import numpy as np
import torch

# Set up random seeds for Python, NumPy, and PyTorch
torch.manual_seed(1)
np.random.seed(1)
random.seed(1)

# Optional: force cuDNN to pick deterministic algorithms and disable autotuning
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
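For the PyTorch side of op determinism, a minimal sketch on top of the seeding above (the CUBLAS_WORKSPACE_CONFIG value follows PyTorch's reproducibility notes for CUDA 10.2 and later):

import os
import torch

# Some CUBLAS operations need this environment variable to behave deterministically (CUDA >= 10.2)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Use deterministic implementations where available; ops without one raise an error
torch.use_deterministic_algorithms(True)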

HuggingFace Trainer

Hugging Face provides a Trainer class that supports distributed training and mixed precision training out of the box. It goes hand in hand with the TrainingArguments class.

Boilerplate for image segmentation:

import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(pred):
    # Logits are assumed to come out as (batch, num_labels, H, W); take argmax over the class axis
    preds = pred.predictions.argmax(1)
    labels = pred.label_ids
    # Flatten the arrays
    preds = preds.flatten()
    labels = labels.flatten()
    # Compute macro IoU: per-class intersection-over-union, averaged over the classes present
    ious = []
    for cls in np.unique(labels):
        intersection = np.logical_and(preds == cls, labels == cls).sum()
        union = np.logical_or(preds == cls, labels == cls).sum()
        ious.append(intersection / union if union > 0 else 0.0)
    return {"iou": float(np.mean(ious))}

training_args = TrainingArguments(
    output_dir="./segmentation_results",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_dir="./segmentation_logs",
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="iou",
    greater_is_better=True,
)

# model, train_dataset, and eval_dataset are assumed to be defined elsewhere
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")
trainer.save_model("./trained_segmentation_model")
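For the mixed-precision training mentioned above, a minimal sketch is to flip a flag in TrainingArguments (fp16=True assumes a GPU with fp16 support; bf16=True is the usual choice on Ampere or newer):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./segmentation_results",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    fp16=True,  # mixed-precision training; swap for bf16=True on hardware that supports it
)

Distributed training likewise needs no code changes: launching the same script with torchrun --nproc_per_node=<num_gpus> train.py (train.py being a placeholder for your training script) lets the Trainer use all visible GPUs.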