From Udacity's deep learning class, the softmax of y_i is simply the exponential divided by the sum of the exponentials of the whole Y vector:

S(y_i) = e^(y_i) / Σ_j e^(y_j)

where S(y_i) is the softmax of y_i, e is the exponential, and j runs over the no. of columns in the input vector Y.
I have tried the following:
import numpy as np

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

scores = [3.0, 1.0, 0.2]
print(softmax(scores))
which returns:
[ 0.8360188 0.11314284 0.05083836]
But the suggested solution was:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)
which produces the same output as the first implementation, even though the first implementation explicitly takes the difference of each column and the max and only then divides by the sum.

Can someone show mathematically why? Is one correct and the other one wrong?

Are the implementations similar in terms of code and time complexity? Which is more efficient?
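For what it's worth, here is a quick numerical check I put together (the names softmax_naive and softmax_stable are just my labels); both give the same values on the example scores because the shift by the max cancels in the ratio, e^(y_i - m) / Σ_j e^(y_j - m) = e^(y_i) / Σ_j e^(y_j):

import numpy as np

def softmax_naive(x):
    # straight translation of the definition: e^x / sum(e^x)
    return np.exp(x) / np.sum(np.exp(x), axis=0)

def softmax_stable(x):
    # shift by the max first; the constant cancels in the ratio
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

scores = [3.0, 1.0, 0.2]
print(np.allclose(softmax_naive(scores), softmax_stable(scores)))  # True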
The purpose of the softmax function is to preserve the ratios of the vector components, as opposed to squashing the end-points with a sigmoid as the values saturate (i.e. tend to +/- 1 (tanh) or from 0 to 1 (logistic)). This is because it preserves more information about the rate of change at the end-points, and thus is more applicable to neural nets with 1-of-N output encoding: if we squashed the end-points it would be harder to differentiate the 1-of-N output class, because we could no longer tell which one is the "biggest" or "smallest" once they got squished. It also makes the total output sum to 1, so the clear winner will be closer to 1 while other values that are close to each other will each sit near 1/p, where p is the number of output neurons with similar values.
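A small illustration of the "sums to 1, clear winner near 1" behaviour (a sketch with values I picked, not taken from the question):

import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

out = softmax([8.0, 1.0, 1.1, 0.9])
print(out)        # the first entry dominates and is close to 1
print(out.sum())  # ~1.0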
The purpose of subtracting the max value from the vector is that when you compute the e^y exponentials you may get very high values that clip the float at its maximum, leading to a tie, though that is not the case in this example. The BIG problem appears if you subtract the max value to make a negative number: then you have a negative exponent that rapidly shrinks the values and alters the ratios, which is what occurred in the poster's question and yielded the incorrect answer.

The answer provided by Udacity is HORRIBLY inefficient. The first thing we need to do is compute e^y_j for all vector components, KEEP those values, then sum them up, and divide. Where Udacity messed up is that they compute e^y_j TWICE!! Here is the correct answer:
def softmax(y):
    e_to_the_y_j = np.exp(y)
    return e_to_the_y_j / np.sum(e_to_the_y_j, axis=0)
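If you want to verify the double-exponentiation point, a rough timing sketch with the standard-library timeit (the function names are mine, and the exact gap depends on the vector size):

import timeit
import numpy as np

y = np.random.rand(1_000_000)

def softmax_exp_twice(y):
    # exponentiates y twice: once in the numerator and once inside the sum
    return np.exp(y) / np.sum(np.exp(y), axis=0)

def softmax_exp_once(y):
    # exponentiates y once and reuses the result
    e_y = np.exp(y)
    return e_y / np.sum(e_y, axis=0)

print(timeit.timeit(lambda: softmax_exp_twice(y), number=100))
print(timeit.timeit(lambda: softmax_exp_once(y), number=100))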
I needed something compatible with the output of a dense layer from Tensorflow.

The solution from @desertnaut does not work in this case because I have batches of data. Therefore, I came up with another solution that should work in both cases:
def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x))  # same code
    return e_x / e_x.sum(axis=axis, keepdims=True)
Result:
logits = np.asarray([
    [-0.0052024, -0.00770216, 0.01360943, -0.008921],  # 1
    [-0.0052024, -0.00770216, 0.01360943, -0.008921]   # 2
])
print(softmax(logits))
# [[0.2492037  0.24858153 0.25393605 0.24827873]
#  [0.2492037  0.24858153 0.25393605 0.24827873]]
Reference: Tensorflow softmax
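Note that the function above subtracts the global max of the whole batch; that is mathematically fine because the constant cancels inside each row's sum, but if the rows have very different scales a per-row max is safer numerically. A minimal variant with that change (just my sketch, not the Tensorflow implementation):

import numpy as np

def softmax_rowwise(x, axis=-1):
    # subtract the max along the softmax axis so each row is shifted independently
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)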
I was curious to see the performance difference between these:
import numpy as np

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

def softmaxv2(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def softmaxv3(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / np.sum(e_x, axis=0)

def softmaxv4(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x - np.max(x)) / np.sum(np.exp(x - np.max(x)), axis=0)
x=[10,10,18,9,15,3,1,2,1,10,10,10,8,15]
Using
print("----- softmax")
%timeit a=softmax(x)
print("----- softmaxv2")
%timeit a=softmaxv2(x)
print("----- softmaxv3")
%timeit a=softmaxv2(x)
print("----- softmaxv4")
%timeit a=softmaxv2(x)
Increasing the values inside x (+100 +200 +500 ...), I get consistently better results with the original numpy version (here is just one test):
----- softmax
The slowest run took 8.07 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 17.8 µs per loop
----- softmaxv2
The slowest run took 4.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23 µs per loop
----- softmaxv3
The slowest run took 4.06 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23 µs per loop
----- softmaxv4
10000 loops, best of 3: 23 µs per loop
Until... the values inside x reach ~800, and then I get:
----- softmax
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: RuntimeWarning: overflow encountered in exp
after removing the cwd from sys.path.
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: RuntimeWarning: invalid value encountered in true_divide
after removing the cwd from sys.path.
The slowest run took 18.41 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.6 µs per loop
----- softmaxv2
The slowest run took 4.18 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 22.8 µs per loop
----- softmaxv3
The slowest run took 19.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.6 µs per loop
----- softmaxv4
The slowest run took 16.82 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 22.7 µs per loop
As some have said, your version is more numerically stable "for large numbers". For small numbers it can be the other way around.
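For reference, the overflow warnings above are easy to reproduce: around 800 the un-shifted version overflows in exp and the division returns NaNs, while the max-subtracted version still yields a valid distribution. A minimal, self-contained sketch (redefining the two versions from above):

import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)

def softmaxv2(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

big = [800, 790, 805]
print(softmax(big))    # RuntimeWarning: overflow in exp -> nan values
print(softmaxv2(big))  # finite probabilities that sum to 1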