Why do we need to call zero_grad() during training?

|  zero_grad(self)
|      Sets gradients of all model parameters to zero.

Current answer

In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting to do backpropagation (i.e., updating the weights and biases), because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.

Because of this, when you start your training loop, you should ideally zero out the gradients so that the parameter update is done correctly. Otherwise, the gradient would be a combination of the old gradient (which you have already used to update the model parameters) and the newly computed gradient. It would therefore point in some direction other than the intended direction toward the minimum (or the maximum, in the case of a maximization objective).
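
To see the accumulation itself, here is a minimal sketch (the throwaway tensor x is made up for illustration):

import torch

x = torch.tensor([2.0], requires_grad=True)

y = (x ** 2).sum()
y.backward()
print(x.grad)    # tensor([4.]) since dy/dx = 2x

y = (x ** 2).sum()
y.backward()     # the new gradient is ADDED to the stored one
print(x.grad)    # tensor([8.]), not tensor([4.])

x.grad.zero_()   # this is what zero_grad() does for every parameter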

Here is a simple training example:

import torch
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

# since PyTorch 0.4, plain tensors with requires_grad=True
# replace the deprecated Variable wrapper
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all parameters
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    # reduce to a scalar, since backward() needs a scalar loss
    loss = ((output - target) ** 2).sum()
    loss.backward()
    optimizer.step()
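
The same accumulating behaviour can also be exploited on purpose: stepping only every few mini-batches simulates a larger batch. A rough sketch reusing the names above (accum_steps is made up for illustration):

accum_steps = 4
optimizer.zero_grad()
for i, (sample, target) in enumerate(zip(data, targets)):
    output = linear_model(sample, W, b)
    # divide so the accumulated gradient matches the mean over the window
    loss = ((output - target) ** 2).sum() / accum_steps
    loss.backward()                  # gradients sum across mini-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()        # reset for the next window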

Alternatively, if you are doing vanilla gradient descent:

W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

learning_rate = 0.01  # hypothetical step size

for sample, target in zip(data, targets):
    # clear out the gradients of W and b;
    # .grad is None until the first backward pass
    if W.grad is not None:
        W.grad.zero_()
        b.grad.zero_()

    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()

    # update in place without recording the update in the graph
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad

Note:

The accumulation (i.e., summation) of gradients happens when .backward() is called on the loss tensor. As of v1.7.0, PyTorch offers the option to reset the gradients to None, optimizer.zero_grad(set_to_none=True), instead of filling them with a tensor of zeroes. The docs claim that this setting reduces memory requirements and slightly improves performance, but it might be error-prone if not handled carefully.
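
A small sketch of the difference (the parameter p is made up for illustration):

import torch
import torch.optim as optim

p = torch.randn(3, requires_grad=True)
optimizer = optim.SGD([p], lr=0.1)

(p ** 2).sum().backward()
optimizer.zero_grad(set_to_none=True)
print(p.grad)            # None, not tensor([0., 0., 0.])

# this is why it can be error-prone: code that assumed
# a zero tensor must now guard against None
if p.grad is not None:
    p.grad.zero_()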

Other answers

If you are using a gradient method to reduce the error (or loss), zero_grad() restarts the gradient accumulation at each iteration without undoing the previous step's parameter update.

If you do not use zero_grad(), the loss can increase instead of decrease as required (a toy reproduction follows after the sample outputs below).

For example:

If you use zero_grad(), you will get output like the following:

model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2

If you do not use zero_grad(), you will get output like the following:

model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5
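
The exact numbers depend on the model, but the effect is easy to reproduce with a toy one-parameter fit (all names here are made up for the demonstration):

import torch

def train(zero_grad):
    w = torch.zeros(1, requires_grad=True)
    x, y = torch.tensor([1.0]), torch.tensor([2.0])
    for step in range(6):
        if zero_grad and w.grad is not None:
            w.grad.zero_()
        loss = ((w * x - y) ** 2).sum()
        loss.backward()
        with torch.no_grad():
            w -= 0.1 * w.grad
        print(f"model training loss is {loss.item():.2f}")

train(zero_grad=True)   # loss shrinks steadily toward 0
train(zero_grad=False)  # stale gradients pile up, the updates overshoot,
                        # and after a few steps the loss climbs again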

During forward propagation the weights are applied to the inputs, and after the first iteration the weights reflect what the model has learned from the samples (inputs). When we start backpropagation, we want to update the weights so as to minimize the loss of our cost function, so we clear out the previous gradients before each step to compute a clean update. We keep doing this during training, and we do not do it during testing, because at test time no gradients are computed; we simply use the weights learned during training, which best fit our data. Hope this clears things up!
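
On the training-versus-testing point, a sketch (the model, optimizer, loss_fn, x, y, and x_test names are hypothetical):

# training step: gradients are computed, so they must be zeroed each time
model.train()
optimizer.zero_grad()
loss_fn(model(x), y).backward()
optimizer.step()

# evaluation: no gradients are computed, so there is nothing to zero
model.eval()
with torch.no_grad():
    predictions = model(x_test)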

Put simply, we need zero_grad()

because when we start a training loop, we do not want past gradients or past results to interfere with our current results, given the way PyTorch collects/accumulates gradients during backpropagation; leftover gradients would mix in and give wrong results, so we set the gradients to zero every time we go through the loop. Here is an example:

# let us write a training loop
torch.manual_seed(42)

epochs = 200
for epoch in range(epochs):
  model_1.train()

  y_pred = model_1(X_train)

  loss = loss_fn(y_pred, y_train)

  optimizer.zero_grad()

  loss.backward()

  optimizer.step()

In this for loop, if we do not zero the optimizer's gradients, the past values get added up on every iteration and change the result. So we use zero_grad to avoid wrongly accumulated results.

You do not have to call zero_grad(); alternatively, you can decay the gradients, for example:

optimizer = some_pytorch_optimizer
# decay the grads :
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            ''' original code from git:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
                
            '''
            p.grad = p.grad / 2

This way it is easier for the learning to continue.
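
For example, applying this decay once per step makes each update a mix of the fresh gradient and a fading memory of old ones, loosely similar in spirit to momentum (a hypothetical loop for illustration):

import torch
import torch.optim as optim

w = torch.randn(2, requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

for step in range(5):
    loss = (w ** 2).sum()
    loss.backward()              # the new gradient is ADDED to the decayed one
    optimizer.step()
    # halve instead of zeroing, so old gradients fade out gradually
    for group in optimizer.param_groups:
        for p in group['params']:
            if p.grad is not None:
                p.grad = p.grad / 2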
