SoFunction
Updated on 2024-11-17

How to solve the CPU memory overflow problem during pytorch training process

CPU memory overflow problem during pytorch training process

I was expecting results overnight, but when I got up in the morning I found that a CPU memory overflow had killed the program. A depressing day.

Upon inquiry, the common causes of memory overflow are:

  • Summing the loss without calling item()
  • num_workers set too high
  • Frequent conversions between Python lists and tensors
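A minimal sketch of the third cause (the variable names are my own illustration, not the author's code): converting between Python lists and tensors inside the training loop allocates a fresh copy on every iteration, while stacking the existing tensors avoids the churn.

```python
import torch

# Eight per-sample feature tensors, as might come out of a loop
feats = [torch.randn(4) for _ in range(8)]

# Churn: tensor -> list -> tensor creates new allocations each call
bad = torch.tensor([t.tolist() for t in feats])

# Better: stack the existing tensors directly (one allocation)
good = torch.stack(feats)

assert bad.shape == good.shape == (8, 4)
assert torch.allclose(bad, good)
```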

Investigation process

I searched for many possible causes, and the ones above seemed closest, but after changing things repeatedly, RAM usage still kept creeping up on me.

Later I used the memory_profiler package and found:

This part of my program added more than 70 MB per round, probably because of this loop (this also ruled out a lot of other code).
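memory_profiler reports per-line memory growth via its @profile decorator. As a dependency-free stand-in, here is a rough sketch using the standard library's tracemalloc to measure how much a single round allocates; the leaky_round function is a made-up illustration of a loop that keeps references alive, not the author's code.

```python
import tracemalloc

def memory_delta(fn, *args):
    """Measure bytes still allocated after one call -- a crude
    stand-in for memory_profiler's per-line report."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    result = fn(*args)
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, after - before

def leaky_round(store, n=100_000):
    # Keeps a reference every round, so memory grows round after
    # round -- much like the 70 MB-per-round growth observed above.
    store.append([0.0] * n)
    return len(store)

store = []
_, delta1 = memory_delta(leaky_round, store)
_, delta2 = memory_delta(leaky_round, store)
```

Because `store` holds on to each round's list, every round shows a positive allocation delta instead of returning memory to the allocator.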

Then I changed the loop to something like the following

Looking at the memory usage for each batch of data revealed something surprising:

The first batch of data didn't take up much memory, but memory started to skyrocket when the same data was used repeatedly, and I still don't understand why this happened.

Solution

Later I used:

from einops import rearrange

to adjust the array dimensions a bit and send the data into the network together as one batch.

That solved it.

pytorch memory overflow, Ubuntu process killed problem

One reason why pytorch GPU memory keeps increasing

optimizer.zero_grad()
loss.backward()
optimizer.step()
train_loss += loss

Referring to other people's code, I found that the loss line is usually written like this:

loss_sum += loss.data[0]  # in PyTorch >= 0.4, use loss.item()

This is because the data type of the output loss is Variable, and PyTorch's dynamic graph mechanism builds the graph out of Variables: every operation that produces a new Variable is recorded so it can be used during backpropagation.

If you add up the losses directly, the framework treats the sum as part of the computation graph, so the graph keeps growing and growing and consumes more and more GPU memory.
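A minimal sketch of the difference (variable names are illustrative): accumulating the raw loss tensor keeps every iteration's graph alive, while .item() extracts a plain Python float that carries no graph.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

loss_sum_bad = 0.0    # will become a graph-attached tensor
loss_sum_good = 0.0   # stays a plain Python float

for x in [1.0, 2.0, 3.0]:
    loss = (w * x) ** 2
    loss_sum_bad = loss_sum_bad + loss   # graph grows every iteration
    loss_sum_good += loss.item()         # just the number; graph can be freed

assert isinstance(loss_sum_bad, torch.Tensor) and loss_sum_bad.requires_grad
assert isinstance(loss_sum_good, float)
```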

Calculations with a Tensor should instead be written as:

train_loss += loss.item()
correct_total += torch.eq(predict, label_batch).sum().item()

When a value from the model needs to be extracted to take part in other calculations, use .item()

Summary

The above is my personal experience. I hope it can give you a useful reference, and I hope you will support me more.