If I want to use the BatchNormalization function in Keras, do I just need to call it once at the beginning?

I read the documentation for it: http://keras.io/layers/normalization/

I don't see where I'm supposed to call it. Below is the code where I attempted to use it:

model = Sequential()
keras.layers.normalization.BatchNormalization(epsilon=1e-06, mode=0, momentum=0.9, weights=None)
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(64, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(2, init='uniform'))
model.add(Activation('softmax'))

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd)
model.fit(X_train, y_train, nb_epoch=20, batch_size=16, show_accuracy=True, validation_split=0.2, verbose = 2)

I ask because if I run the code with the second line (the one that creates the batch normalization layer) and if I run the code without it, I get similar outputs. So either I'm not calling the function in the right place, or I guess it doesn't make much of a difference.


Current answer

It is just another type of layer, so you should add it as a layer in the appropriate place of your model:

model.add(keras.layers.normalization.BatchNormalization())

See an example here: https://github.com/fchollet/keras/blob/master/examples/kaggle_otto_nn.py

Other answers


This thread is somewhat misleading. I tried to comment on Lucas Ramadan's answer, but I don't have the right privileges yet, so I'm putting this here instead.

Batch normalization works best after the activation function, and here is why: it was developed to prevent internal covariate shift. Internal covariate shift occurs when the distribution of the activations of a layer shifts significantly throughout training. Batch normalization is used so that the distribution of the inputs (and these inputs are literally the result of an activation function) to a specific layer doesn't change over time due to parameter updates from each batch (or at least, allows it to change in an advantageous way). It uses batch statistics to do the normalizing, and then uses the batch normalization parameters (gamma and beta in the original paper) "to make sure that the transformation inserted in the network can represent the identity transform" (quote from the original paper). But the point is that we're trying to normalize the inputs to a layer, so it should always go immediately before the next layer in the network. Whether or not that's after an activation function depends on the architecture in question.
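To make the transform this paragraph refers to concrete, here is a rough NumPy sketch of the batch-norm forward pass on a 2-D batch; the function name, and treating gamma and beta as plain arrays rather than learned variables, are assumptions made purely for illustration:

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-6):
    # x has shape (batch_size, n_features); normalize each feature
    # using the statistics of the current mini-batch
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma and beta are the learned scale/shift parameters; setting
    # gamma = sqrt(var + eps) and beta = mu would recover the identity
    # transform, which is the property the paper quote refers to
    return gamma * x_hat + beta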

To answer this question in a bit more detail, and as Pavel said, batch normalization is just another layer, so you can use it as such to create your desired network architecture.

The general use case is to use BN between the linear and non-linear layers in your network, because it normalizes the input to your activation function, so that you're centered in the linear section of the activation function (such as Sigmoid). There's a small discussion of it here.

In your case above, this might look like:


# import BatchNormalization
from keras.layers.normalization import BatchNormalization

# instantiate model
model = Sequential()

# we can think of this chunk as the input layer
model.add(Dense(64, input_dim=14, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))

# we can think of this chunk as the hidden layer    
model.add(Dense(64, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))

# we can think of this chunk as the output layer
model.add(Dense(2, init='uniform'))
model.add(BatchNormalization())
model.add(Activation('softmax'))

# setting up the optimization of our weights 
sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd)

# running the fitting
model.fit(X_train, y_train, nb_epoch=20, batch_size=16, show_accuracy=True, validation_split=0.2, verbose = 2)

Hope this clears things up a bit more.

It's almost become a trend now to have a Conv2D followed by a ReLU followed by a BatchNormalization layer, so I made up a small function to call all of them at once. It makes the model definition look much cleaner and easier to read.

def Conv2DReluBatchNorm(n_filter, w_filter, h_filter, inputs):
    # convolution -> ReLU -> batch normalization as one reusable block
    conv = Convolution2D(n_filter, w_filter, h_filter, border_mode='same')(inputs)
    return BatchNormalization()(Activation('relu')(conv))
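For completeness, a hypothetical usage might look like the following; the imports, input shape, and filter counts are assumptions for illustration (the helper above relies on the older Keras 1.x functional API, and the input shape assumes TensorFlow dimension ordering):

# assumed imports for the helper above and for this sketch (Keras 1.x API)
from keras.layers import Input, Activation
from keras.layers.convolutional import Convolution2D
from keras.layers.normalization import BatchNormalization

inputs = Input(shape=(32, 32, 3))           # illustrative input shape
x = Conv2DReluBatchNorm(32, 3, 3, inputs)   # 32 filters of size 3x3
x = Conv2DReluBatchNorm(64, 3, 3, x)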

This thread has some considerable debate about whether BN should be applied before the non-linearity of the current layer or to the activations of the previous layer.

Although there is no single correct answer, the authors of Batch Normalization say that it should be applied immediately before the non-linearity of the current layer. The reason (quoted from the original paper):

"We add the BN transform immediately before the nonlinearity, by normalizing x = Wu+b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is “more Gaussian” (Hyv¨arinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution."