
Solving the pitfalls of the Batch Normalization layer in PyTorch

1. Note the definition of momentum

Momentum smoothing in PyTorch's BN layer is defined the opposite way to the common momentum convention: momentum weights the new batch statistic rather than the old running value, and the default is momentum = 0.1.
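To see this concretely, here is a minimal sketch (shapes chosen arbitrarily) checking that the running mean follows running = (1 - momentum) * running + momentum * batch_mean:

import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3, momentum=0.1)               # running_mean starts at zeros
x = torch.randn(8, 3, 4, 4)

bn.train()
bn(x)                                              # one forward pass in train mode

batch_mean = x.mean(dim=(0, 2, 3))                 # per-channel mean of this batch
expected = (1 - 0.1) * torch.zeros(3) + 0.1 * batch_mean
print(torch.allclose(bn.running_mean, expected))   # True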

The expression computed by the BN layer is:

y = γ · (x − E[x]) / √(Var[x] + ε) + β

where γ and β are learnable parameters. In PyTorch, the parameters of the BN layer class are:

CLASS torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)

See the documentation for the exact meaning of each parameter, and note that affine controls whether the BN parameters γ and β are learnable (when they are not learnable, they behave as the constants 1 and 0).
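For example, a quick sketch showing that with affine=False the layer simply has no weight/bias parameters, which is equivalent to γ = 1 and β = 0:

import torch.nn as nn

bn_affine = nn.BatchNorm2d(16, affine=True)
bn_fixed = nn.BatchNorm2d(16, affine=False)

print(bn_affine.weight.shape, bn_affine.bias.shape)   # learnable gamma and beta, one per channel
print(bn_fixed.weight, bn_fixed.bias)                 # None, None: behaves as gamma=1, beta=0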

2. Note that the BN layer keeps running statistics, i.e., mean and variance

track_running_stats – a boolean value that when set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics and always uses batch statistics in both training and eval modes. Default: True

During training (after calling model.train()), the BN statistics, mean and variance, are estimated from the current batch of data.

At test time (after calling model.eval()), if track_running_stats=True the model uses the running statistics, i.e., the values accumulated up to that point through the exponential-decay rule. Otherwise it still uses estimates computed from the current batch.
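A small sketch of the difference, using a toy batch whose statistics are far from (0, 1):

import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4, track_running_stats=True)
x = torch.randn(32, 4) * 5 + 10      # toy batch: mean ~10, std ~5 per feature

bn.train()
y_train = bn(x)                      # normalized with the current batch statistics
print(y_train.mean().item())         # ~0: batch statistics are used

bn.eval()
y_eval = bn(x)                       # normalized with the running statistics
print(y_eval.mean().item())          # far from 0: running stats are still close to (0, 1)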

3. Statistical update of the BN layer

The running statistics are updated automatically inside the forward() call during training; the update is not part of the gradient computation, back-propagation, or the optimizer step.
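A minimal sketch illustrating this: the running mean changes after a single forward pass, even with gradients disabled and no backward() or optimizer step:

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
print(bn.running_mean)           # tensor([0., 0., 0.]) right after construction

bn.train()
with torch.no_grad():            # no backward(), no optimizer step
    bn(torch.randn(8, 3, 4, 4))

print(bn.running_mean)           # already updated by the forward pass alone
print(bn.num_batches_tracked)    # tensor(1)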

4. Freezing of BN and its statistics

As the analysis above shows, the correct way to freeze BN is to single out the BN layers and set them back to eval mode while the rest of the model trains (i.e., override the training state after calling model.train()).

Solution:

Use model.apply() rather than walking the children yourself: named_children() does not recurse into submodules, so nested BN layers would be missed.

def set_bn_eval(m):
    classname = m.__class__.__name__
    if classname.find('BatchNorm') != -1:
        m.eval()

model.apply(set_bn_eval)

Alternatively, override the train() method in the module:

def train(self, mode=True):
    """
    Override the default train() to freeze the BN parameters
    """
    super(MyNet, self).train(mode)
    if self.freeze_bn:
        print("Freezing Mean/Var of BatchNorm2D.")
        if self.freeze_bn_affine:
            print("Freezing Weight/Bias of BatchNorm2D.")
    if self.freeze_bn:
        for m in self.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.eval()
                if self.freeze_bn_affine:
                    m.weight.requires_grad = False
                    m.bias.requires_grad = False
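A usage sketch of this override; the module body below is a hypothetical example, and only the freeze_bn / freeze_bn_affine flags and the train() logic mirror the snippet above:

import torch.nn as nn

class MyNet(nn.Module):
    def __init__(self, freeze_bn=False, freeze_bn_affine=False):
        super(MyNet, self).__init__()
        self.freeze_bn = freeze_bn                    # same flags as in the override above
        self.freeze_bn_affine = freeze_bn_affine
        self.conv = nn.Conv2d(3, 16, 3, padding=1)    # hypothetical layers, for illustration
        self.bn = nn.BatchNorm2d(16)

    def forward(self, x):
        return self.bn(self.conv(x))

    def train(self, mode=True):
        super(MyNet, self).train(mode)
        if self.freeze_bn:
            for m in self.modules():
                if isinstance(m, nn.BatchNorm2d):
                    m.eval()
                    if self.freeze_bn_affine:
                        m.weight.requires_grad = False
                        m.bias.requires_grad = False
        return self

net = MyNet(freeze_bn=True, freeze_bn_affine=True).train()
print(net.bn.training)                # False: BN stays in eval mode even after train()
print(net.bn.weight.requires_grad)    # False: gamma/beta are frozen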

5. Fixing/freezing Batch Norm during training may lead to RuntimeError: expected scalar type Half but found Float

Solution:

import torch
import torch.nn as nn
from torch.nn import init
from torchvision import models
from torch.autograd import Variable
from apex.fp16_utils import *

def fix_bn(m):
    classname = m.__class__.__name__
    if classname.find('BatchNorm') != -1:
        m.eval()

model = models.resnet50(pretrained=True)
model.cuda()
model = network_to_half(model)
model.train()
model.apply(fix_bn)  # fix batchnorm
input = Variable(torch.randn(8, 3, 224, 224).cuda().half())
output = model(input)
output_mean = torch.mean(output)
output_mean.backward()

Instead, do the following:

def fix_bn(m):
    classname = m.__class__.__name__
    if classname.find('BatchNorm') != -1:
        m.eval().half()

The reason is that for regular training it is better (performance-wise) to use the cuDNN batch norm, which requires its weights to be in fp32; that is why the batch norm modules are not converted to half in network_to_half. However, cuDNN does not support the batch norm backward pass in eval mode, which is exactly what the code above does, and to fall back to the PyTorch implementation the BN weights have to be of the same type as the inputs.

Supplementary: things to keep in mind when using dropout and Batch Normalization in PyTorch and in TensorFlow

Things to keep in mind when doing dropout and BN with pytorch

Doing dropout in PyTorch.

That is, dropout is applied during training but not during testing.

In PyTorch, net.eval() fixes the behaviour of the whole network: forward-time updates stop, dropout is disabled, and BN's running statistics are frozen. In theory, net.eval() should therefore be used for every pass over the validation set.

Note that net.eval() does not turn off gradient computation; wrap the forward pass in torch.no_grad() if gradients should not be tracked.
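A minimal sketch of the usual validation pattern (toy network and random data for illustration only): switch to eval() to change layer behaviour and wrap the forward pass in torch.no_grad() to skip gradient bookkeeping:

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 1))
x_val, y_val = torch.randn(64, 10), torch.randn(64, 1)   # stand-in validation data
criterion = nn.MSELoss()

net.eval()                       # disable dropout, freeze BN statistics
with torch.no_grad():            # additionally skip gradient bookkeeping
    val_loss = criterion(net(x_val), y_val)
print(val_loss.item())
net.train()                      # switch back before the next training step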

net_dropped = torch.nn.Sequential(
    torch.nn.Linear(1, N_HIDDEN),
    torch.nn.Dropout(0.5),  # drop 50% of the neurons
    torch.nn.ReLU(),
    torch.nn.Linear(N_HIDDEN, N_HIDDEN),
    torch.nn.Dropout(0.5),  # drop 50% of the neurons
    torch.nn.ReLU(),
    torch.nn.Linear(N_HIDDEN, 1),
)
for t in range(500):
    pred_drop = net_dropped(x)
    loss_drop = loss_func(pred_drop, y)
    optimizer_drop.zero_grad()
    loss_drop.backward()
    optimizer_drop.step()
    if t % 10 == 0:
        # change to eval mode in order to fix drop out effect
        net_dropped.eval()  # parameters for dropout differ from train mode
        test_pred_drop = net_dropped(test_x)
        # change back to train mode
        net_dropped.train()

Doing Batch Normalization in PyTorch.

net.eval() fixes the parameters of the whole network, including BN's running statistics moving_mean and moving_var; if this is not clear, look at the code and figure below.

            if self.do_bn:
                bn = nn.BatchNorm1d(10, momentum=0.5)
                setattr(self, 'bn%i' % i, bn)   # IMPORTANT set layer to the Module
                self.bns.append(bn)

    for epoch in range(EPOCH):
        print('Epoch: ', epoch)
        layer_inputs, pre_acts = [], []
        for net, l in zip(nets, losses):
            net.eval()              # set eval mode to fix moving_mean and moving_var
            pred, layer_input, pre_act = net(test_x)
            layer_inputs.append(layer_input)
            pre_acts.append(pre_act)
            net.train()             # free moving_mean and moving_var
        plot_histogram(*layer_inputs, *pre_acts)   # plot the histograms

[Figure: moving_mean and moving_var]

Things to keep in mind when doing dropout and BN with tensorflow

In TensorFlow, both dropout and BN take a training parameter that indicates whether the network is in training or test mode; in test mode, dropout is disabled and BN uses its fixed (moving) statistics.

tf_is_training = tf.placeholder(tf.bool, None)  # to control dropout when training and testing

# dropout net
d1 = tf.layers.dense(tf_x, N_HIDDEN, tf.nn.relu)
d1 = tf.layers.dropout(d1, rate=0.5, training=tf_is_training)   # drop out 50% of inputs
d2 = tf.layers.dense(d1, N_HIDDEN, tf.nn.relu)
d2 = tf.layers.dropout(d2, rate=0.5, training=tf_is_training)   # drop out 50% of inputs
d_out = tf.layers.dense(d2, 1)

for t in range(500):
    sess.run([o_train, d_train], {tf_x: x, tf_y: y, tf_is_training: True})  # train, set is_training=True
    if t % 10 == 0:
        # plotting
        plt.cla()
        o_loss_, d_loss_, o_out_, d_out_ = sess.run(
            [o_loss, d_loss, o_out, d_out], {tf_x: test_x, tf_y: test_y, tf_is_training: False}  # test, set is_training=False
        )
# tensorflow BN layer
    def add_layer(self, x, out_size, ac=None):
        x = tf.layers.dense(x, out_size, kernel_initializer=self.w_init, bias_initializer=B_INIT)
        self.pre_activation.append(x)
        # the momentum plays an important role: the default 0.99 is too high in this case!
        if self.is_bn: x = tf.layers.batch_normalization(x, momentum=0.4, training=tf_is_train)    # when have BN
        out = x if ac is None else ac(x)
        return out

When BN's training parameter is set to training mode, it only means that the BN parameters are trainable; it does not mean that BN will update moving_mean and moving_var by itself. Those updates are separate forward-pass ops, so we must make sure they are run together with the training op. The ops that update moving_mean and moving_var are collected in tf.GraphKeys.UPDATE_OPS.

        # !! IMPORTANT !! the moving_mean and moving_variance need to be updated,
        # pass the update_ops with control_dependencies to the train_op
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            self.train = tf.train.AdamOptimizer(LR).minimize(self.loss)

The above is based on my personal experience; I hope it gives you a useful reference, and I appreciate your support.