从Udacity的深度学习课程中,y_i的softmax仅仅是指数除以整个Y向量的指数之和:

其中S(y_i)是y_i的软最大函数e是指数函数j是no。输入向量Y中的列。

我试过以下几种方法:

import numpy as np

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

scores = [3.0, 1.0, 0.2]
print(softmax(scores))

返回:

[ 0.8360188   0.11314284  0.05083836]

但建议的解决方案是:

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

它产生与第一个实现相同的输出,尽管第一个实现显式地取每列与Max的差值,然后除以和。

有人能用数学方法解释一下吗?一个是对的,另一个是错的?

实现在代码和时间复杂度方面是否相似?哪个更有效率?


当前回答

这将泛化并假设您正在规范化尾随维度。

def softmax(x: np.ndarray) -> np.ndarray:
    e_x = np.exp(x - np.max(x, axis=-1)[..., None])
    e_y = e_x.sum(axis=-1)[..., None]
    return e_x / e_y

其他回答

为了提供另一种解决方案,请考虑这样的情况:参数的值非常大,以至于exp(x)会溢出(在负的情况下)或溢出(在正的情况下)。这里你希望尽可能长时间地保持在对数空间中,只在你可以相信结果是良好的地方取幂。

import scipy.special as sc
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    return np.exp(x - sc.logsumexp(x))

在上面的回答中已经回答了很多细节。Max被减去以避免溢出。我在这里添加了python3中的另一个实现。

import numpy as np
def softmax(x):
    mx = np.amax(x,axis=1,keepdims = True)
    x_exp = np.exp(x - mx)
    x_sum = np.sum(x_exp, axis = 1, keepdims = True)
    res = x_exp / x_sum
    return res

x = np.array([[3,2,4],[4,5,6]])
print(softmax(x))

所以,这实际上是对desertnaut的回答的一个评论,但由于我的声誉,我还不能评论它。正如他所指出的,只有当输入包含单个样本时,你的版本才是正确的。如果您的输入包含多个样本,则是错误的。然而,沙漠探险者的解决方案也是错误的。问题是一旦他得到一个一维的输入然后他又得到一个二维的输入。让我给你们看看这个。

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# desertnaut solution (copied from his answer): 
def desertnaut_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

# my (correct) solution:
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis] # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis] # dito
    return e_x / div

让我们以沙漠探险者为例:

x1 = np.array([[1, 2, 3, 6]]) # notice that we put the data into 2 dimensions(!)

输出如下:

your_softmax(x1)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

desertnaut_softmax(x1)
array([[ 1.,  1.,  1.,  1.]])

softmax(x1)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

你可以看到沙漠版本在这种情况下会失败。(如果输入是一维的,就不会像np那样。数组([1,2,3,6])。

现在让我们使用3个样本,因为这就是为什么我们使用二维输入的原因。下面的x2和沙漠例子中的x2不一样。

x2 = np.array([[1, 2, 3, 6],  # sample 1
               [2, 4, 5, 6],  # sample 2
               [1, 2, 3, 6]]) # sample 1 again(!)

该输入由一个有3个样本的批次组成。但样本一和样本三本质上是一样的。我们现在期望3行softmax激活,其中第一行应该与第三行相同,也与x1的激活相同!

your_softmax(x2)
array([[ 0.00183535,  0.00498899,  0.01356148,  0.27238963],
       [ 0.00498899,  0.03686393,  0.10020655,  0.27238963],
       [ 0.00183535,  0.00498899,  0.01356148,  0.27238963]])


desertnaut_softmax(x2)
array([[ 0.21194156,  0.10650698,  0.10650698,  0.33333333],
       [ 0.57611688,  0.78698604,  0.78698604,  0.33333333],
       [ 0.21194156,  0.10650698,  0.10650698,  0.33333333]])

softmax(x2)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047],
       [ 0.01203764,  0.08894682,  0.24178252,  0.65723302],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

我希望你能明白,这只是我的解的情况。

softmax(x1) == softmax(x2)[0]
array([[ True,  True,  True,  True]], dtype=bool)

softmax(x1) == softmax(x2)[2]
array([[ True,  True,  True,  True]], dtype=bool)

另外,下面是TensorFlows softmax实现的结果:

import tensorflow as tf
import numpy as np
batch = np.asarray([[1,2,3,6],[2,4,5,6],[1,2,3,6]])
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.nn.softmax(x)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(y, feed_dict={x: batch})

结果是:

array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037045],
       [ 0.01203764,  0.08894681,  0.24178252,  0.657233  ],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037045]], dtype=float32)

softmax函数是一种激活函数,它将数字转换为和为1的概率。softmax函数输出一个向量,表示结果列表的概率分布。它也是深度学习分类任务中使用的核心元素。

当我们有多个类时,使用Softmax函数。

它对于找出有最大值的类很有用。概率。

Softmax函数理想地用于输出层,在那里我们实际上试图获得定义每个输入类的概率。

取值范围是0 ~ 1。

Softmax函数将对数[2.0,1.0,0.1]转换为概率[0.7,0.2,0.1],概率和为1。Logits是神经网络最后一层输出的原始分数。在激活发生之前。为了理解softmax函数,我们必须看看第(n-1)层的输出。

softmax函数实际上是一个arg max函数。这意味着它不会返回输入中的最大值,而是返回最大值的位置。

例如:

softmax之前

X = [13, 31, 5]

softmax后

array([1.52299795e-08, 9.99999985e-01, 5.10908895e-12]

代码:

import numpy as np

# your solution:

def your_softmax(x): 

"""Compute softmax values for each sets of scores in x.""" 

e_x = np.exp(x - np.max(x)) 

return e_x / e_x.sum() 

# correct solution: 

def softmax(x): 

"""Compute softmax values for each sets of scores in x.""" 

e_x = np.exp(x - np.max(x)) 

return e_x / e_x.sum(axis=0) 

# only difference

我很好奇它们之间的性能差异

import numpy as np

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

def softmaxv2(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def softmaxv3(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / np.sum(e_x, axis=0)

def softmaxv4(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x - np.max(x)) / np.sum(np.exp(x - np.max(x)), axis=0)



x=[10,10,18,9,15,3,1,2,1,10,10,10,8,15]

使用

print("----- softmax")
%timeit  a=softmax(x)
print("----- softmaxv2")
%timeit  a=softmaxv2(x)
print("----- softmaxv3")
%timeit  a=softmaxv2(x)
print("----- softmaxv4")
%timeit  a=softmaxv2(x)

增加x内部的值(+100 +200 +500…)我使用原始numpy版本得到的结果始终更好(这里只是一个测试)

----- softmax
The slowest run took 8.07 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 17.8 µs per loop
----- softmaxv2
The slowest run took 4.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23 µs per loop
----- softmaxv3
The slowest run took 4.06 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23 µs per loop
----- softmaxv4
10000 loops, best of 3: 23 µs per loop

直到……x内的值达到~800,则得到

----- softmax
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: RuntimeWarning: overflow encountered in exp
  after removing the cwd from sys.path.
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: RuntimeWarning: invalid value encountered in true_divide
  after removing the cwd from sys.path.
The slowest run took 18.41 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.6 µs per loop
----- softmaxv2
The slowest run took 4.18 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 22.8 µs per loop
----- softmaxv3
The slowest run took 19.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.6 µs per loop
----- softmaxv4
The slowest run took 16.82 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 22.7 µs per loop

就像一些人说的,你的版本在“大数字”上更稳定。对于小数字来说,情况可能正好相反。