PyTorch Versioning Is Necessary Because We Have In-Place and Out-of-Place Tensor Ops
Takeaways:
- x.add_() / x.multiply_() perform in-place updates: they modify the tensor's data directly and bump its version counter (_version), which autograd uses to detect changes to tensors it saved for the backward pass.
- x + something is out-of-place: it creates a brand-new tensor.
- detach() detaches from the computational graph and returns a new tensor that shares the same data (and version counter) but does NOT require gradients. If you need to modify the data without touching gradient computation, this is one option.
- x.clone() copies the data into a new tensor; if x requires gradients, the clone stays connected to the graph so gradients flow back to x.
Test Code
import torch
# Initial setup
# see version = 0
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
print(f"Initial x: {x}")
print(f"Initial x._version: {x._version}")
# out-of-place operation (create a new tensor)
y = x * 2
print(f"Computed y = x * 2: {y}")
# see version = 0
print(f"x._version after y = x * 2: {x._version}")
# out-of-place operation (create a new tensor)
x = x + 1.0
print(f"Modified x after x + 1.0: {x}")
# see version = 0
print(f"x._version after non-in-place operation: {x._version}")
# In-place modification on a detached version of x
x_detached = x.detach()
x_detached.add_(1.0)
print(f"Modified x_detached after x_detached.add_(1.0): {x_detached}")
# see version = 1
print(f"x_detached._version after in-place operation: {x_detached._version}")
# x and x_detached share the same storage AND version counter, so this bumps the counter from 1 to 2
x.add_(1)
# See x._version after in-place operation: 2
print(f"x._version after in-place operation: {x._version}")
# clone() copies the data into a brand-new tensor with its own version counter
x_clone = x.clone()
print(f"x_clone._version: {x_clone._version}")  # see version = 0
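Why the version counter matters: a minimal sketch (separate from the test code above) in which autograd saves a tensor for the backward pass, an in-place op bumps its version, and backward() refuses to compute a silently wrong gradient:
import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
b = a.sigmoid()      # sigmoid saves its output b for the backward pass
print(b._version)    # 0
b.add_(1.0)          # in-place update bumps the version counter
print(b._version)    # 1
try:
    b.sum().backward()
except RuntimeError as e:
    # autograd compares the saved version (0) with the current one (1)
    print(f"Backward failed: {e}")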
PyTorch Uses a Caching Allocator; torch.cuda.empty_cache() Clears It
- When creating tensors on a GPU, PyTorch requests a chunk of memory (e.g., 20 MB) larger than the tensor itself (here ≈3.81 MB) and keeps it in its cache. The cache greatly reduces the number of raw allocation calls to the CUDA driver.
- When a GPU tensor goes out of scope, its memory stays in the PyTorch cache, but it is labelled as unused.
- To release that cache, call torch.cuda.empty_cache(). Note that torch.cuda.empty_cache() only frees UNUSED cached blocks; memory held by live tensors is untouched.
import torch
def print_memory_allocated_and_reserved():
    print(f"Allocated {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB")
    print(f"Reserved {torch.cuda.memory_reserved() / 1024 ** 2:.2f} MB")
# Check initial GPU memory usage
print("Initial GPU memory:")
print_memory_allocated_and_reserved()
# Allocate a large tensor
a = torch.randn(1000, 1000, device="cuda")
# Check GPU memory usage after allocation
print("After allocating tensor 'a':")
print_memory_allocated_and_reserved()
# Delete the tensor
del a
# Check GPU memory usage after deletion
print("After deleting tensor 'a':")
print_memory_allocated_and_reserved()
# Empty the cache
torch.cuda.empty_cache()
# Check GPU memory usage after emptying cache
print("After emptying cache:")
print_memory_allocated_and_reserved()
Sample output:
Initial GPU memory:
Allocated 0.00 MB
Reserved 0.00 MB
After allocating tensor 'a':
Allocated 3.81 MB
Reserved 20.00 MB
After deleting tensor 'a':
Allocated 0.00 MB
Reserved 20.00 MB
After emptying cache:
Allocated 0.00 MB
Reserved 0.00 MB
Allow Expandable Memory Segments In CUDA
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
- By default, CUDA memory is allocated as fixed-size contiguous blocks, so fragmented memory can go unused. Enabling expandable segments lets the allocator grow existing segments and make use of fragmented memory, at a slight and usually negligible speed cost (see the snippet below for one way to enable it).
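A minimal sketch of enabling it from Python; the assumption here is that the variable must be in the environment before PyTorch initializes its CUDA caching allocator (which happens lazily at the first CUDA allocation), so set it at the very top of the script or export it in the shell as above:
import os
# Must be set before the CUDA caching allocator is initialized
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch
# The allocator picks up the setting on the first CUDA allocation
x = torch.randn(1000, 1000, device="cuda")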
Gradient Checkpointing
Gradient Checkpointing trades extra computation for reduced memory consumption. When it is enabled, some intermediate activations are NOT stored during the forward pass; the checkpointed segments are recomputed during the backward pass when gradients are needed. One illustrative example of what it does:
y = f3(f2(f1(x)))
z = f1(x) = 2x
w = f2(z) = z + 3
y = f3(w) = 4w
With checkpointing, we store z = 2x (the checkpoint), compute but do not store w = z + 3, and eventually get the output y = 4w. During backpropagation, w = z + 3 is recomputed from the stored z before gradients can flow through f2 and f1.
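A minimal sketch of the same idea with torch.utils.checkpoint.checkpoint; the functions f1/f2/f3 mirror the toy example above (in a real model you would checkpoint whole blocks or layers rather than a single addition):
import torch
from torch.utils.checkpoint import checkpoint

def f1(x):
    return 2 * x      # z = 2x (stored)

def f2(z):
    return z + 3      # w = z + 3 (not stored; recomputed in backward)

def f3(w):
    return 4 * w      # y = 4w

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
z = f1(x)
# Checkpoint the middle segment: its intermediate outputs are dropped after
# the forward pass, and f2 is re-run during backward to rebuild them.
y = f3(checkpoint(f2, z, use_reentrant=False))
y.sum().backward()
print(x.grad)  # y = 8x + 12, so dy/dx = 8 per element: tensor([8., 8., 8.])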