
Debug Guide for PyTorch

I. Introduction to ipdb

Many people who are new to Python debug their programs with print statements or logging, but this is only convenient for small programs. A better way to debug is to inspect variables and call methods interactively while the program is running.

PyCharm's debug mode is also very powerful and can meet most everyday needs, but we will not go into it here. Instead, we introduce a flexible interactive debugging tool based on pdb that is better suited to PyTorch.

pdb is an interactive debugger in the Python standard library. It lets you set a breakpoint anywhere in your Python code, inspect any variable, step through the code line by line, and even change the value of a variable, all without restarting the program.

ipdb is an enhanced version of pdb. It provides tab completion in debug mode, better syntax highlighting and code tracing, and better introspection, and, most importantly, it is fully compatible with the pdb interface. It can be installed with pip install ipdb.

II. Use of ipdb

Let's start with an example. To use ipdb, you just insert ipdb.set_trace() at the place you want to debug; when execution reaches that line, the program automatically drops into interactive debugging mode.

import ipdb


def sum(x):
    r = 0
    for ii in x:
        r += ii
    return r


def mul(x):
    r = 1
    for ii in x:
        r *= ii
    return r


ipdb.set_trace()
x = [1, 2, 3, 4, 5]
r = sum(x)
r = mul(x)
> /Users/mac/Desktop/jupyter/(19)<module>()
     18 ipdb.set_trace()
---> 19 x = [1, 2, 3, 4, 5]
     20 r = sum(x)

ipdb> l 1,5  # l(ist) 1,5 abbreviation to see code for lines 1 through 5
      1 import ipdb
      2 
      3 
      4 def sum(x):
      5     r = 0

ipdb> n  # n(ext) abbreviation to perform the next step
> /Users/mac/Desktop/jupyter/(20)<module>()
     19 x = [1, 2, 3, 4, 5]
---> 20 r = sum(x)
     21 r = mul(x)

ipdb> s  # s(tep) abbreviation, step into the sum function
--Call--
> /Users/mac/Desktop/jupyter/(4)sum()
      3 
----> 4 def sum(x):
      5     r = 0

ipdb> n  # n(ext) Single-step execution
> /Users/mac/Desktop/jupyter/(5)sum()
      4 def sum(x):
----> 5     r = 0
      6     for ii in x:

ipdb> n
> /Users/mac/Desktop/jupyter/(6)sum()
      5     r = 0
----> 6     for ii in x:
      7         r += ii

ipdb> u  # u(p) abbreviation, go up to the calling frame
> /Users/mac/Desktop/jupyter/(20)<module>()
     19 x = [1, 2, 3, 4, 5]
---> 20 r = sum(x)
     21 r = mul(x)

ipdb> d  # d(own) abbreviation, go back down into the called frame
> /Users/mac/Desktop/jupyter/(6)sum()
      5     r = 0
----> 6     for ii in x:
      7         r += ii

ipdb> n
> /Users/mac/Desktop/jupyter/(7)sum()
      6     for ii in x:
----> 7         r += ii
      8     return r

ipdb> !r  # View the value of the variable r, whose name conflicts with the debugging command `r(eturn)`.
0
    
ipdb> return  # Continue to run until the function returns
--Return--
15
> /Users/mac/Desktop/jupyter/(8)sum()
      7         r += ii
----> 8     return r
      9 

ipdb> n
> /Users/mac/Desktop/jupyter/(21)<module>()
     19 x = [1, 2, 3, 4, 5]
     20 r = sum(x)
---> 21 r = mul(x)

ipdb> x  # View variable x
[1, 2, 3, 4, 5]
    
ipdb> x[0] = 10000  # Modify variable x
    
ipdb> x
[10000, 2, 3, 4, 5]
    
ipdb> b 12  # b(reak) abbreviation, set a breakpoint on line 12
Breakpoint 1 at /Users/mac/Desktop/jupyter/:12
    
ipdb> c  # c(ontinue) abbreviation, keep running until you hit a breakpoint
> /Users/mac/Desktop/jupyter/(12)mul()
     11 def mul(x):
1--> 12     r = 1
     13     for ii in x:

ipdb> return  # You can see that the product of the modified x is computed
--Return--
1200000
> /Users/mac/Desktop/jupyter/(15)mul()
     14         r *= ii
---> 15     return r
     16 

ipdb> q  # Abbreviation for q(uit), quit debugging.

The above covers only part of what ipdb can do. A few more tips on using ipdb:

  • Commands and names can be tab-completed; completion works much like in IPython
  • j(ump) can skip the execution of some lines of code
  • You can change the value of a variable directly in ipdb.
  • help shows the usage of debugging commands; for example, h h shows the usage of the help command, and h j shows the usage of the j(ump) command.
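
Beyond inserting ipdb.set_trace() unconditionally, a common pattern (not shown in the transcript above; the function name and the threshold below are only illustrative) is to guard the call with a condition, so the debugger opens only when something looks suspicious:

import math
import ipdb


def check_loss(loss_value):
    # Drop into the debugger only when the loss looks abnormal,
    # instead of pausing on every iteration.
    if math.isnan(loss_value) or loss_value > 10:
        ipdb.set_trace()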

III. Debugging in PyTorch

As a dynamic-graph framework, PyTorch works well together with ipdb, which makes debugging much easier. We explain this from three angles:

  • How to view the output of each layer of a neural network in PyTorch
  • How to analyze the gradient of each parameter in PyTorch
  • How to dynamically modify PyTorch's training flow

First, run the "cat and dog" program given in the previous post: python train --debug-file='debug/'

After the program has been running for a while, create the flag file in the debug directory; as soon as the program detects that this file exists, it automatically enters debug mode.
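
The check itself is simple. A minimal sketch of what the training loop might do, assuming the flag-file path is stored in a variable such as debug_file (the names here are illustrative, not the exact code of the original program):

import os
import ipdb

debug_file = 'debug/debug'  # hypothetical path of the flag file

# Inside the training loop: if the flag file appears,
# pause training and enter interactive debugging.
if os.path.exists(debug_file):
    ipdb.set_trace()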

99it [00:17,  6.07it/s]loss: 0.22854854568839075
119it [00:21,  5.79it/s]loss: 0.21267264398435753
139it [00:24,  5.99it/s]loss: 0.19839374726372108
> e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/(80)train()
     79         loss_meter.reset()
---> 80         confusion_matrix.reset()
     81         for ii, (data, label) in tqdm(enumerate(train_dataloader)):

ipdb> break 88    # Set a breakpoint at line 88 to enter debug mode when the program runs here.
Breakpoint 1 at e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/:88

ipdb> # Print the standard deviation of all parameters and their gradients
for (name,p) in model.named_parameters(): \
    print(name, p.data.std(), p.grad.data.std())
. tensor(0.2615, device='cuda:0') tensor(0.3769, device='cuda:0')
. tensor(0.4862, device='cuda:0') tensor(0.3368, device='cuda:0')
. tensor(0.2738, device='cuda:0') tensor(0.3023, device='cuda:0')
. tensor(0.5867, device='cuda:0') tensor(0.3753, device='cuda:0')
.3. tensor(0.2168, device='cuda:0') tensor(0.2883, device='cuda:0')
.3. tensor(0.2256, device='cuda:0') tensor(0.1147, device='cuda:0')
.3. tensor(0.0935, device='cuda:0') tensor(0.1605, device='cuda:0')
.3. tensor(0.1421, device='cuda:0') tensor(0.0583, device='cuda:0')
. tensor(0.1976, device='cuda:0') tensor(0.2137, device='cuda:0')
. tensor(0.4058, device='cuda:0') tensor(0.1798, device='cuda:0')
.4. tensor(0.2144, device='cuda:0') tensor(0.4214, device='cuda:0')
.4. tensor(0.4994, device='cuda:0') tensor(0.0958, device='cuda:0')
.4. tensor(0.1063, device='cuda:0') tensor(0.2963, device='cuda:0')
.4. tensor(0.0489, device='cuda:0') tensor(0.0719, device='cuda:0')
. tensor(0.1736, device='cuda:0') tensor(0.3544, device='cuda:0')
. tensor(0.2420, device='cuda:0') tensor(0.0896, device='cuda:0')
.6. tensor(0.1211, device='cuda:0') tensor(0.2428, device='cuda:0')
.6. tensor(0.0670, device='cuda:0') tensor(0.0162, device='cuda:0')
.6. tensor(0.0593, device='cuda:0') tensor(0.1917, device='cuda:0')
.6. tensor(0.0227, device='cuda:0') tensor(0.0160, device='cuda:0')
. tensor(0.1207, device='cuda:0') tensor(0.2179, device='cuda:0')
. tensor(0.1484, device='cuda:0') tensor(0.0381, device='cuda:0')
.7. tensor(0.1235, device='cuda:0') tensor(0.2279, device='cuda:0')
.7. tensor(0.0450, device='cuda:0') tensor(0.0100, device='cuda:0')
.7. tensor(0.0609, device='cuda:0') tensor(0.1628, device='cuda:0')
.7. tensor(0.0132, device='cuda:0') tensor(0.0079, device='cuda:0')
. tensor(0.1093, device='cuda:0') tensor(0.2459, device='cuda:0')
. tensor(0.0646, device='cuda:0') tensor(0.0135, device='cuda:0')
.9. tensor(0.0840, device='cuda:0') tensor(0.1860, device='cuda:0')
.9. tensor(0.0177, device='cuda:0') tensor(0.0033, device='cuda:0')
.9. tensor(0.0476, device='cuda:0') tensor(0.1393, device='cuda:0')
.9. tensor(0.0058, device='cuda:0') tensor(0.0030, device='cuda:0')
. tensor(0.0872, device='cuda:0') tensor(0.1676, device='cuda:0')
. tensor(0.0484, device='cuda:0') tensor(0.0088, device='cuda:0')
.10. tensor(0.0859, device='cuda:0') tensor(0.2145, device='cuda:0')
.10. tensor(0.0160, device='cuda:0') tensor(0.0025, device='cuda:0')
.10. tensor(0.0456, device='cuda:0') tensor(0.1429, device='cuda:0')
.10. tensor(0.0070, device='cuda:0') tensor(0.0021, device='cuda:0')
. tensor(0.0786, device='cuda:0') tensor(0.2003, device='cuda:0')
. tensor(0.0422, device='cuda:0') tensor(0.0069, device='cuda:0')
.11. tensor(0.0690, device='cuda:0') tensor(0.1400, device='cuda:0')
.11. tensor(0.0138, device='cuda:0') tensor(0.0022, device='cuda:0')
.11. tensor(0.0366, device='cuda:0') tensor(0.1517, device='cuda:0')
.11. tensor(0.0109, device='cuda:0') tensor(0.0023, device='cuda:0')
. tensor(0.0729, device='cuda:0') tensor(0.1736, device='cuda:0')
. tensor(0.0814, device='cuda:0') tensor(0.0084, device='cuda:0')
.12. tensor(0.0977, device='cuda:0') tensor(0.1385, device='cuda:0')
.12. tensor(0.0102, device='cuda:0') tensor(0.0032, device='cuda:0')
.12. tensor(0.0365, device='cuda:0') tensor(0.1312, device='cuda:0')
.12. tensor(0.0038, device='cuda:0') tensor(0.0026, device='cuda:0')
. tensor(0.0285, device='cuda:0') tensor(0.0865, device='cuda:0')
. tensor(0.0362, device='cuda:0') tensor(0.0192, device='cuda:0')

ipdb>     # View learning rates
0.001

ipdb>  = 0.002    # Modify the learning rate

ipdb> for p in optimizer.param_groups: \
    p['lr'] = 

ipdb> ()    # Save the model
'checkpoints/squeezenet_20191004212249.pth'

ipdb> c    # Continue to run until line 88 to pause
222it [16:38, 35.62s/it]> e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/(88)train()
     87             optimizer.zero_grad()
1--> 88             score = model(input)
     89             loss = criterion(score, target)

ipdb> s    # Go inside model(input), i.e. model.__call__(input)
--Call--
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\(537)__call__()
    536 
--> 537     def __call__(self, *input, **kwargs):
    538         for hook in self._forward_pre_hooks.values():

ipdb> n    # Next
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\(538)__call__()
    537     def __call__(self, *input, **kwargs):
--> 538         for hook in self._forward_pre_hooks.values():
    539             result = hook(self, input)

ipdb> n    # Next
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\(544)__call__()
    543                 input = result
--> 544         if torch._C._get_tracing_state():
    545             result = self._slow_forward(*input, **kwargs)

ipdb> n    # Next
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\(547)__call__()
    546         else:
--> 547             result = self.forward(*input, **kwargs)
    548         for hook in self._forward_hooks.values():

ipdb> s    # Enter the contents of the forward function
--Call--
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\(914)forward()
    913 
--> 914     def forward(self, input, target):
    915         return F.cross_entropy(input, target, weight=self.weight,

ipdb> input    # View the value of the input variable
tensor([[4.5005, 2.0725],
        [3.5933, 7.8643],
        [2.9086, 3.4209],
        [2.7740, 4.4332],
        [6.0164, 2.3033],
        [5.2261, 3.2189],
        [2.6529, 2.0749],
        [6.3259, 2.2383],
        [3.0629, 3.4832],
        [2.7008, 8.2818],
        [5.5684, 2.1567],
        [3.0689, 6.1022],
        [3.4848, 5.3831],
        [1.7920, 5.7709],
        [6.5032, 2.8080],
        [2.3071, 5.2417],
        [3.7474, 5.0263],
        [4.3682, 3.6707],
        [2.2196, 6.9298],
        [5.2201, 2.3034],
        [6.4315, 1.4970],
        [3.4684, 4.0371],
        [3.9620, 1.7629],
        [1.7069, 7.8898],
        [3.0462, 1.6505],
        [2.4081, 6.4456],
        [2.1932, 7.4614],
        [2.3405, 2.7603],
        [1.9478, 8.4156],
        [2.7935, 7.8331],
        [1.8898, 3.8836],
        [3.3008, 1.6832]], device='cuda:0', grad_fn=<AsStridedBackward>)

ipdb> input.data.mean()    # View the mean and standard deviation of the input
tensor(3.9630, device='cuda:0')
ipdb> input.data.std()
tensor(1.9513, device='cuda:0')

ipdb> u    # Jumping back up a level
> c:\programdata\anaconda3\lib\site-packages\torch\nn\modules\(547)__call__()
    546         else:
--> 547             result = self.forward(*input, **kwargs)
    548         for hook in self._forward_hooks.values():

ipdb> u    # Jumping back up a level
> e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/(88)train()
     87             optimizer.zero_grad()
1--> 88             score = model(input)
     89             loss = criterion(score, target)

ipdb> clear    # Clear all breakpoints
Clear all breaks? y
Deleted breakpoint 1 at e:/Users/mac/Desktop/jupyter/mdFile/deeplearning/:88

ipdb> c    # Keep running; remember to delete the debug flag file first, otherwise you will soon be back in debug mode!
59it [06:21,  5.75it/s]loss: 0.24856307208538073
76it [06:24,  5.91it/s]

When you want to enter debug mode to change some parameter values or to analyze the program, create the debug flag file; the program will then enter debug mode. After debugging is complete, delete the file and type c at the ipdb prompt to let the program continue running. If you want to exit the program, you can use the same approach: create the flag file, then type quit in the debugger, which quits the debugger and the program at the same time. This is a safer way to stop a program than Ctrl+C, because it ensures that the multi-process data loaders also exit correctly and release memory, GPU memory, and other resources.

The combination of PyTorch and ipdb makes many things possible that other frameworks cannot do, or can do only with difficulty. Based on the author's day-to-day use, the main ones are as follows:

  1. Suspend the program through debugging. When the program enters debug mode, no further CPU or GPU computation is performed, but memory, GPU memory, and the corresponding stack space are not released.
  2. Analyze the program in debug mode: view the output of each layer and inspect the parameters of the network. With commands such as u(p), d(own), and s(tep) you can enter any piece of code, and with n(ext) you can execute it step by step, so you can examine the result of each layer and easily analyze the distribution of the network's values and other information.
  3. As a dynamic-graph framework, PyTorch benefits from Python being a dynamic, interpreted language: with ipdb we can change the values or attributes of variables while the program is running, and the changes take effect immediately. For example, you can adjust the learning rate according to the loss shortly after training starts, without restarting the program (see the sketch after this list).
  4. If you run your program with the %run magic in IPython, you can use the %debug magic to enter debug mode directly after the program exits with an error, jump to the place where the error was raised with u(p) and d(own), inspect the corresponding variables, find the cause, and fix the code accordingly. Sometimes we have been training a model for several hours, and just before it is saved the program exits because of a small typo; correcting the error and re-running would waste hours. The better option is to enter debug mode with %debug and run model.save() there to save the model. In IPython, the %pdb magic makes the program enter debug mode automatically whenever an exception occurs, without having to type %debug manually.
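
For point 3, a self-contained sketch of such a live adjustment of the learning rate, written the way it could be typed at the ipdb> prompt against an existing optimizer (the optimizer and the new value 0.002 are only illustrative):

import torch

# Stand-in optimizer so the snippet runs on its own; in a debug session
# you would use the optimizer of the paused program instead.
params = [torch.nn.Parameter(torch.randn(3))]
optimizer = torch.optim.Adam(params, lr=0.001)

for param_group in optimizer.param_groups:
    param_group['lr'] = 0.002  # takes effect at the next optimizer.step()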

IV. Problems commonly encountered when implementing projects in PyTorch

When PyTorch calls cuDNN and an error such as CUDNN_STATUS_BAD_PARAM is raised, it is hard to extract useful information from the message. In that case, run the code on the CPU instead; you will generally get a much friendlier error message. For example, by moving the model and the input to the CPU inside ipdb and re-running the forward pass, the underlying TH library in PyTorch will report comparatively detailed information.
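
A minimal sketch of how that CPU re-run could look at the ipdb> prompt, assuming model and input are the variables of the paused session (these names are taken from the transcript above and may differ in your own program):

# At the ipdb> prompt: reproduce the failing call on the CPU
# to get a more informative error message than the cuDNN one.
model_cpu = model.cpu()
result = model_cpu(input.cpu())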

The most common mistakes are the following:

  • Type mismatches. For example, the target passed to CrossEntropyLoss should be a LongTensor, but many people pass in a FloatTensor (see the sketch after this list).
  • Some data has not been moved from the CPU to the GPU. For example, when the model lives on the GPU, the input must also be moved to the GPU before it can be fed into the model. Another case is storing several sub-modules in a plain Python list: when .cuda() is called on the parent module, the objects inside the list are not moved to CUDA; the correct approach is to use nn.ModuleList instead.
  • Tensor shape mismatches. These are usually caused by input data with the wrong shape or by a problem in the design of the network structure; they can usually be diagnosed by jumping to the relevant code with u(p) and checking the shapes of the input and of the model's parameters.
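
To illustrate the first two points, a small self-contained sketch (the layer sizes and batch shape are made up) of the dtype and device that CrossEntropyLoss and a GPU-resident model expect:

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(10, 2).to(device)       # model lives on the GPU when available
criterion = nn.CrossEntropyLoss()

input = torch.randn(4, 10).to(device)     # input must be on the same device as the model
target = torch.tensor([0, 1, 1, 0]).to(device)  # class indices as a LongTensor, not a FloatTensor

score = model(input)
loss = criterion(score, target)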

In addition, you may often run into the situation where the program runs without any error but the model fails to converge. For example, in a binary classification problem the cross-entropy loss keeps hovering around 0.69 (ln 2, the value you get when the model predicts both classes with probability 0.5), or some values overflow. In this case you can enter debug mode and, stepping through the code, check the mean and variance of each layer's output to see at which layer abnormal values first appear. Also check the mean and variance of each parameter's gradient to see whether the gradients are vanishing or exploding. In general, adding a BatchNorm layer before the activation function, initializing the parameters properly, using the Adam optimizer, and setting the learning rate to about 0.001 is usually enough to make the model converge to some extent.

V. Summary

This chapter has taken you through implementing a classic Kaggle competition from scratch, focusing on how to organize the program sensibly, and along the way introduced some debugging techniques for PyTorch. The next chapter formally begins the real programming journey; some details will no longer be explained in as much depth, so be prepared.

That is all for this PyTorch debugging guide. For more information about debugging PyTorch, please see my other related articles!