Why use softmax as opposed to standard normalization?

Suppose we change the softmax function so the output activations are given by

$$a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}}$$

where c is a positive constant. Note that c=1 corresponds to the standard softmax function. But if we use a different value of c we get a different function, which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax. Suppose we allow c to become large, i.e., c→∞. What is the limiting value for the output activations a^L_j? After solving this problem it should be clear to you why we think of the c=1 function as a "softened" version of the maximum function. This is the origin of the term "softmax". You can follow the details from this source (equation 83).
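A short worked sketch of the exercise quoted above, assuming the modified function is $a^L_j = e^{c z^L_j} / \sum_k e^{c z^L_k}$ with $z^L_k$ the weighted inputs to the output layer. Every term is positive and the activations sum to one:

$$\sum_j a^L_j = \frac{\sum_j e^{c z^L_j}}{\sum_k e^{c z^L_k}} = 1, \qquad a^L_j > 0 .$$

For the limit, divide numerator and denominator by $e^{c z^L_j}$:

$$a^L_j = \frac{1}{\sum_k e^{c (z^L_k - z^L_j)}} .$$

If $z^L_j$ is the unique largest input, every exponent with $k \neq j$ is negative, so those terms vanish as $c \to \infty$ and $a^L_j \to 1$. If some other input is larger, at least one exponent is positive, the sum diverges and $a^L_j \to 0$. If $m$ inputs tie for the maximum, each of them tends to $1/m$. So for large c the function picks out the maximum, and c=1 is a softened version of that.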

2017-02-16 15:28:12

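Softmax has one nice property compared with standard normalization.

It reacts to low stimulation of your neural net (think of a blurry image) with a rather uniform distribution, and to high stimulation (i.e. large numbers, think of a crisp image) with probabilities close to 0 and 1.

Standard normalization, by contrast, does not care as long as the proportions are the same.

Have a look at what happens when softmax gets a 10 times larger input, i.e. your neural net got a crisp image and a lot of neurons were activated: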

>>> softmax([1,2])              # blurry image of a ferret
[0.26894142,      0.73105858]   #     it is a cat perhaps !?
>>> softmax([10,20])            # crisp image of a cat
[0.0000453978687, 0.999954602]  #     it is definitely a CAT !

Now compare that with standard normalization:

>>> std_norm([1,2])                      # blurry image of a ferret
[0.3333333333333333, 0.6666666666666666] #     it is a cat perhaps !?
>>> std_norm([10,20])                    # crisp image of a cat
[0.3333333333333333, 0.6666666666666666] #     it is a cat perhaps !?
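The transcripts above call `softmax` and `std_norm` without defining them. A minimal sketch of what they might look like, assuming plain Python lists as input (the names come from the transcripts, the bodies below are an assumption):

```python
import math

def softmax(xs):
    """Exponentiate each input, then divide by the sum of the exponentials."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def std_norm(xs):
    """Divide each input by the sum of the inputs (assumes positive inputs)."""
    total = sum(xs)
    return [x / total for x in xs]

# Scaling the inputs by 10 sharpens softmax but leaves std_norm unchanged.
print(softmax([1, 2]))
print(softmax([10, 20]))
print(std_norm([1, 2]))
print(std_norm([10, 20]))
```

Only the ratio between inputs matters to `std_norm`, whereas `softmax` depends on their differences, which grow when the inputs are scaled up.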

2017-07-19 09:14:50

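We are looking at a multi-class classification problem. That is, the predicted variable y can take one of k categories, where k > 2. In probability theory, this is usually modelled by a multinomial distribution. The multinomial distribution is a member of the exponential family of distributions. Using the properties of exponential family distributions we can reconstruct the probability P(k=?|x), and it coincides with the softmax formula.

If you believe the problem can be modelled by a distribution other than the multinomial, then you could reach a conclusion that is different from softmax.

For further information and a formal derivation, please refer to the CS229 lecture notes (9.3 Softmax Regression).

Additionally, a useful trick that is usually applied to softmax is softmax(x) = softmax(x+c), where c is a constant added to every component: softmax is invariant to constant offsets in the input.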
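The offset invariance is commonly used for numerical stability: subtracting the largest input before exponentiating gives the same probabilities but avoids overflow. A minimal sketch, assuming plain Python lists (`stable_softmax` is a hypothetical helper name, not from the answer):

```python
import math

def softmax(xs):
    """Naive softmax: overflows once an input is large enough."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(xs):
    """Shift by -max(xs) first; by offset invariance the result is unchanged."""
    m = max(xs)
    return softmax([x - m for x in xs])

a = softmax([1.0, 2.0, 3.0])
b = softmax([101.0, 102.0, 103.0])  # same inputs shifted by c = 100
print(all(math.isclose(p, q, rel_tol=1e-9) for p, q in zip(a, b)))

# math.exp(1000.0) raises OverflowError, so the naive version fails on these
# inputs; the shifted version only ever exponentiates values <= 0.
print(stable_softmax([1000.0, 1001.0, 1002.0]))
```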

2017-06-12 01:54:01

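The choice of the softmax function seems somehow arbitrary, as there are many other possible normalizing functions. It is thus unclear why the log-softmax loss would perform better than other loss alternatives.

From "An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family" https://arxiv.org/abs/1511.05042

The authors explored several other functions, among them a Taylor expansion of exp and the so-called spherical softmax, and found that these can sometimes perform better than the usual softmax.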
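To make the two alternatives concrete, here is a rough sketch. It assumes the Taylor variant replaces exp(x) with its second-order expansion 1 + x + x²/2 and the spherical variant replaces it with x² plus a small constant; the function names and the `eps` default are hypothetical, and the paper should be checked for the exact definitions.

```python
def taylor_softmax(xs):
    """Normalize 1 + x + x^2/2, the second-order Taylor expansion of exp(x)."""
    # 1 + x + x^2/2 has its minimum 0.5 at x = -1, so every term is positive.
    vals = [1.0 + x + 0.5 * x * x for x in xs]
    total = sum(vals)
    return [v / total for v in vals]

def spherical_softmax(xs, eps=1e-8):
    """Normalize x^2 + eps; eps (assumed value) guards against dividing 0 by 0."""
    vals = [x * x + eps for x in xs]
    total = sum(vals)
    return [v / total for v in vals]

print(taylor_softmax([1.0, 2.0, 3.0]))
# Squaring ignores the sign, so a large negative input also gets a large share.
print(spherical_softmax([-2.0, 1.0]))
```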

2017-11-07 08:49:07

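Adding to Piotr Czapla's answer: for inputs in the same proportion, the larger the input values, the larger the probability softmax assigns to the maximum input relative to the others.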
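This can be checked numerically. The sketch below keeps the inputs in a fixed 1:2 proportion and multiplies them by a growing factor; the base values and scale factors are arbitrary choices for illustration:

```python
import math

def softmax(xs):
    """Softmax with the usual max-shift so large inputs do not overflow."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

base = [1.0, 2.0]  # fixed 1:2 proportion
top_probs = []
for scale in (1, 2, 5, 10):
    probs = softmax([scale * x for x in base])
    top_probs.append(max(probs))
    print(scale, probs)
```

The probability of the larger input rises toward 1 as the scale grows, even though the ratio between the inputs never changes.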

2019-01-15 15:33:16
