The flexible use of torch.optim in PyTorch
1. Basic usage:
To build an Optimizer, you must give it an iterable containing the parameters to optimize; you can then specify optimizer-specific options such as the learning rate, weight decay value, and so on.
Note: if you want to put the model on the GPU, you need to call model.cuda() before building the Optimizer, to make sure that the parameters inside the optimizer are also on the GPU.
Example:
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
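Tying the note above together, here is a minimal sketch of the ordering (assuming a CUDA-capable machine; the nn.Linear model is just a placeholder, not part of the original example):
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)   # placeholder model for illustration
model.cuda()               # move the model to the GPU *before* building the Optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)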
2. Flexibly setting a learning rate for each layer
Pass in the parameters of those layers of the model that need to be trained by backpropagation; these layers do not have to be contiguous.
This time, the argument to the Optimizer is no longer an iterable of Variables but an iterable of dictionaries. Each dictionary must contain the key 'params' (a look at the source code shows that the optimizer accesses the parameters via 'params'); the other keys can be any options the optimizer accepts, e.g. 'lr', 'weight_decay'. Putting these dictionaries into a list gives the required iterable of dictionaries.
Note: you can still pass options as keyword arguments to the optimizer; they are then treated as default values and are used whenever a group's dictionary does not contain the corresponding key.
This is useful when you only want to vary a single option, while keeping all others consistent between parameter groups.
Example:
optimizer = SGD([
    {'params': model.features12.parameters(), 'lr': 1e-2},
    {'params': model.features22.parameters()},   # features22 ... features52 are illustrative submodule names
    {'params': model.features32.parameters()},
    {'params': model.features42.parameters()},
    {'params': model.features52.parameters()},
], weight_decay1=5e-4, lr=1e-1, momentum=0.9)
The Optimizer created above uses a default learning rate of 1e-1 and a default momentum of 0.9; the parameters of features12 instead use a learning rate of 1e-2.
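As a quick check (a sketch, not from the original post, using the optimizer built above), printing each group's learning rate shows the first group keeping its own value while the others fall back to the default:
for i, group in enumerate(optimizer.param_groups):
    print(i, group['lr'])   # expected: 0.01 for group 0, 0.1 for the remaining groups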
3. Flexibly changing the learning rate of each layer
The initialization function of torch.optim.SGD is as follows:
__init__(self, params, lr=<object object>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
params (iterable): iterable of parameters to optimize or dicts defining parameter groups (params can be an iterable of parameters, or dictionaries defining parameter groups, as shown above, with the dictionary keys including: params, lr, momentum, dampening, weight_decay, nesterov)
To change the learning rate for each layer, access the param_groups property of the optimizer. type(optimizer.param_groups) -> list
optimizer.param_groups[0].keys()
Out[21]: ['dampening', 'nesterov', 'params', 'lr', 'weight_decay', 'momentum']
So, to change the learning rate of a given layer's parameters, you can access optimizer.param_groups and modify the 'lr' entry of the group at the corresponding index.
def adjust_learning_rate(optimizer, decay_rate=0.9):
    for para in optimizer.param_groups:
        para['lr'] = para['lr'] * decay_rate
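A minimal usage sketch for the function above (train_one_epoch and num_epochs are placeholders, not part of the original post): calling it once per epoch decays every parameter group's learning rate by the same factor.
for epoch in range(num_epochs):                      # num_epochs assumed to be defined elsewhere
    train_one_epoch(model, optimizer)                # hypothetical training routine
    adjust_learning_rate(optimizer, decay_rate=0.9)  # decay all groups' lr by 0.9
    print(epoch, [g['lr'] for g in optimizer.param_groups])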
4. Rewriting SGD, adding L1 regularization
Looking at the source code of torch.optim.SGD and the other Optimizers, there is no option for L1 regularization, even though L1 regularization is more likely to yield sparse solutions.
In that case, you can modify the /home/smiles/anaconda2/lib/python2.7/site-packages/torch/optim/sgd.py file, imitating the way the L2 regularization (weight_decay) is implemented there.
The derivative of the L1 regularization term is:
dw = weight_decay1 * sign(w)
where weight_decay1 is the L1 coefficient and sign(w) is the element-wise sign of the weights.
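As a sanity check (a sketch, not part of the original post), autograd on weight_decay1 * |w|.sum() produces exactly weight_decay1 * sign(w) wherever w is nonzero:
import torch

weight_decay1 = 5e-4
w = torch.tensor([1.5, -0.3, 2.0], requires_grad=True)
(weight_decay1 * w.abs().sum()).backward()
print(w.grad)                                  # gradient of the L1 term
print(weight_decay1 * torch.sign(w.detach()))  # matches the hand-derived dw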
The changes are as follows:
import torch
from torch.optim.optimizer import Optimizer, required


class SGD(Optimizer):

    def __init__(self, params, lr=required, momentum=0, dampening=0,
                 weight_decay1=0, weight_decay2=0, nesterov=False):
        defaults = dict(lr=lr, momentum=momentum, dampening=dampening,
                        weight_decay1=weight_decay1, weight_decay2=weight_decay2,
                        nesterov=nesterov)
        if nesterov and (momentum <= 0 or dampening != 0):
            raise ValueError("Nesterov momentum requires a momentum and zero dampening")
        super(SGD, self).__init__(params, defaults)

    def __setstate__(self, state):
        super(SGD, self).__setstate__(state)
        for group in self.param_groups:
            group.setdefault('nesterov', False)

    def step(self, closure=None):
        """Performs a single optimization step.

        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            weight_decay1 = group['weight_decay1']
            weight_decay2 = group['weight_decay2']
            momentum = group['momentum']
            dampening = group['dampening']
            nesterov = group['nesterov']

            for p in group['params']:
                if p.grad is None:
                    continue
                d_p = p.grad.data
                if weight_decay1 != 0:
                    # L1 regularization: dw += weight_decay1 * sign(w)
                    d_p.add_(weight_decay1, torch.sign(p.data))
                if weight_decay2 != 0:
                    # L2 regularization (the original weight_decay): dw += weight_decay2 * w
                    d_p.add_(weight_decay2, p.data)
                if momentum != 0:
                    param_state = self.state[p]
                    if 'momentum_buffer' not in param_state:
                        buf = param_state['momentum_buffer'] = torch.zeros_like(p.data)
                        buf.mul_(momentum).add_(d_p)
                    else:
                        buf = param_state['momentum_buffer']
                        buf.mul_(momentum).add_(1 - dampening, d_p)
                    if nesterov:
                        d_p = d_p.add(momentum, buf)
                    else:
                        d_p = buf

                p.data.add_(-group['lr'], d_p)

        return loss
An example of use:
optimizer = SGD([
    {'params': model.features12.parameters()},   # submodule names are illustrative
    {'params': model.features22.parameters()},
    {'params': model.features32.parameters()},
    {'params': model.features42.parameters()},
    {'params': model.features52.parameters()},
], weight_decay1=5e-4, lr=1e-1, momentum=0.9)
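The rewritten SGD is then used like any other optimizer; a minimal training-step sketch (criterion, inputs and targets are placeholders, not part of the original post):
optimizer.zero_grad()
loss = criterion(model(inputs), targets)   # criterion/inputs/targets are placeholders
loss.backward()
optimizer.step()                           # L1 (weight_decay1) and L2 (weight_decay2) are applied here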
This concludes this article on the flexible use of torch.optim (including rewriting SGD and adding L1 regularization). I hope it gives you a useful reference, and I hope you will continue to support me.