什么是“熵和信息增益”?

我正在读这本书(NLTK)，它令人困惑。熵的定义为:

熵是每个标签的概率之和乘以相同标签的log概率

我如何在文本挖掘方面应用熵和最大熵?有人能给我举个简单的例子吗?

当前回答

我真的建议你读一下信息理论，贝叶斯方法和MaxEnt。我们可以从大卫·麦凯(David Mackay)的这本书(网上免费下载)开始:

http://www.inference.phy.cam.ac.uk/mackay/itila/

这些推理方法真的比文本挖掘更通用，如果不学习这本书或其他关于机器学习和MaxEnt贝叶斯方法的介绍性书籍中的一些基本知识，我真的无法设计如何将其应用到NLP中。

The connection between entropy and probability theory to information processing and storing is really, really deep. To give a taste of it, there's a theorem due to Shannon that states that the maximum amount of information you can pass without error through a noisy communication channel is equal to the entropy of the noise process. There's also a theorem that connects how much you can compress a piece of data to occupy the minimum possible memory in your computer to the entropy of the process that generated the data.

我不认为你真的有必要去学习所有这些关于通信理论的定理，但如果不学习什么是熵，它是如何计算的，它与信息和推理的关系等基本知识，就不可能学习这些……

2009-12-14 17:42:55

其他回答

我假设熵是在构建决策树时提到的。

为了说明这一点，想象一下学习把名字分成男性和女性两组的任务。给出一个名字列表，每个名字都标有m或f，我们想要学习一个符合数据的模型，可以用来预测一个新的未见过的名字的性别。

name       gender
-----------------        Now we want to predict 
Ashley        f              the gender of "Amro" (my name)
Brian         m
Caroline      f
David         m

第一步是确定哪些数据特征与我们想要预测的目标类相关。一些例子特征包括:第一个/最后一个字母，长度，元音的数量，是否以元音结尾，等等。所以在特征提取之后，我们的数据看起来是这样的:

# name    ends-vowel  num-vowels   length   gender
# ------------------------------------------------
Ashley        1         3           6        f
Brian         0         2           5        m
Caroline      1         4           8        f
David         0         2           5        m

目标是建立一个决策树。树的一个例子是:

length<7
|   num-vowels<3: male
|   num-vowels>=3
|   |   ends-vowel=1: female
|   |   ends-vowel=0: male
length>=7
|   length=5: male

基本上每个节点都表示对单个属性执行的测试，我们将根据测试结果向左或向右移动。我们继续遍历树，直到到达包含类预测(m或f)的叶节点

因此，如果我们沿着这棵树运行Amro这个名字，我们从测试“长度是否<7?”开始，答案是“是”，所以我们沿着这个分支向下。在分支之后，下一个测试“元音的数量是否<3?”再次评估为true。这导致了一个标记为m的叶节点，因此预测是男性(我碰巧是男性，所以树正确地预测了结果)。

决策树是以自顶向下的方式构建的，但问题是如何选择在每个节点上拆分哪个属性?答案是找到能最好地将目标类划分为最纯粹的子节点的特性(即:不包含男女混合的节点，而是只有一个类的纯节点)。

这种纯度的测量被称为信息。它表示给定到达节点的示例，指定一个新实例(名字)应该分类为男性还是女性所需的预期信息量。我们计算一下基于节点上的男性和女性类的数量。

另一方面，熵是杂质的度量(相反)。它被定义为值为a/b的二进制类:

Entropy = - p(a)*log(p(a)) - p(b)*log(p(b))

这个二进制熵函数如下图所示(随机变量可以取两个值之一)。当概率p=1/2时，它达到最大值，这意味着p(X=a)=0.5或p(X=b)=0.5有50%/50%的机会是a或b(不确定性处于最大值)。当概率p=1或p=0且完全确定(分别为p(X=a)=1或p(X=a)=0，后者表示p(X=b)=1)时，熵函数最小值为零。

当然，熵的定义可以推广到一个离散随机变量X有N个结果(而不仅仅是两个):

(公式中的对数通常取以2为底的对数)

回到我们的名称分类任务，让我们看一个例子。想象一下，在构建树的过程中，我们正在考虑以下分裂:

     ends-vowel
      [9m,5f]          <--- the [..,..] notation represents the class
    /          \            distribution of instances that reached a node
   =1          =0
 -------     -------
 [3m,4f]     [6m,1f]

如你所见，在拆分之前，我们有9名男性和5名女性，即P(m)=9/14, P(f)=5/14。根据熵的定义:

Entropy_before = - (5/14)*log2(5/14) - (9/14)*log2(9/14) = 0.9403

接下来，我们将它与通过查看两个子分支考虑分裂后计算的熵进行比较。在ends-vowel=1的左分支中，我们有:

Entropy_left = - (3/7)*log2(3/7) - (4/7)*log2(4/7) = 0.9852

而ends-vowel=0的右支，我们有:

Entropy_right = - (6/7)*log2(6/7) - (1/7)*log2(1/7) = 0.5917

我们结合左/右熵，使用每个分支的实例数作为权重因子(7个实例向左，7个实例向右)，并得到分裂后的最终熵:

Entropy_after = 7/14*Entropy_left + 7/14*Entropy_right = 0.7885

现在通过比较拆分前后的熵，我们得到了一个信息增益的度量，或者我们通过使用特定的特征进行拆分获得了多少信息:

Information_Gain = Entropy_before - Entropy_after = 0.1518

你可以这样解释上面的计算:通过使用结尾元音特征进行分割，我们能够将子树预测结果中的不确定性降低0.1518(以比特作为信息单位)。

在树的每个节点上，对每个特征都进行这种计算，并以贪婪的方式选择信息增益最大的特征进行分割(因此倾向于产生低不确定性/熵的纯分割的特征)。这个过程从根节点向下递归地应用，当叶子节点包含所有具有相同类的实例时(不需要进一步拆分它)，该过程就会停止。

注意，我跳过了一些超出这篇文章范围的细节，包括如何处理数字特征、缺失值、过拟合和修剪树等。

2009-12-07 13:16:07

当我在实现一个算法来计算图像的熵时，我发现了这些链接，看这里和这里。

这是我使用的伪代码，你需要调整它来处理文本而不是图像，但原则应该是相同的。

//Loop over image array elements and count occurrences of each possible
//pixel to pixel difference value. Store these values in prob_array
for j = 0, ysize-1 do $
    for i = 0, xsize-2 do begin
       diff = array(i+1,j) - array(i,j)
       if diff lt (array_size+1)/2 and diff gt -(array_size+1)/2 then begin
            prob_array(diff+(array_size-1)/2) = prob_array(diff+(array_size-1)/2) + 1
       endif
     endfor

//Convert values in prob_array to probabilities and compute entropy
n = total(prob_array)

entrop = 0
for i = 0, array_size-1 do begin
    prob_array(i) = prob_array(i)/n

    //Base 2 log of x is Ln(x)/Ln(2). Take Ln of array element
    //here and divide final sum by Ln(2)
    if prob_array(i) ne 0 then begin
        entrop = entrop - prob_array(i)*alog(prob_array(i))
    endif
endfor

entrop = entrop/alog(2)

我从某处得到了这个代码，但我找不到链接。

2009-12-07 13:35:46

我不能给你图表，但也许我可以给出一个明确的解释。

Suppose we have an information channel, such as a light that flashes once every day either red or green. How much information does it convey? The first guess might be one bit per day. But what if we add blue, so that the sender has three options? We would like to have a measure of information that can handle things other than powers of two, but still be additive (the way that multiplying the number of possible messages by two adds one bit). We could do this by taking log2(number of possible messages), but it turns out there's a more general way.

Suppose we're back to red/green, but the red bulb has burned out (this is common knowledge) so that the lamp must always flash green. The channel is now useless, we know what the next flash will be so the flashes convey no information, no news. Now we repair the bulb but impose a rule that the red bulb may not flash twice in a row. When the lamp flashes red, we know what the next flash will be. If you try to send a bit stream by this channel, you'll find that you must encode it with more flashes than you have bits (50% more, in fact). And if you want to describe a sequence of flashes, you can do so with fewer bits. The same applies if each flash is independent (context-free), but green flashes are more common than red: the more skewed the probability the fewer bits you need to describe the sequence, and the less information it contains, all the way to the all-green, bulb-burnt-out limit.

事实证明，有一种方法可以测量信号中的信息量，基于不同符号的概率。如果接收符号xi的概率是pi，那么考虑数量

-log p_i

越小，这个值越大。如果xi变得两倍的不可能，这个值增加一个固定的量(log(2))。这将提醒您在消息中添加一个比特。

如果我们不知道这个符号是什么(但我们知道概率)，那么我们可以计算这个值的平均值，即我们将得到多少，通过对不同可能性的总和:

I = -Σ p_i log(p_i)

这是一个闪光的信息内容。

Red bulb burnt out: p_red = 0, p_green=1, I = -(0 + 0)  = 0
Red and green equiprobable: p_red = 1/2, p_{green = 1/2}, I = -(2 * 1/2 * log(1/2)) = log(2)
Three colors, equiprobable: p_i=1/3, I = -(3 * 1/3 * log(1/3)) = log(3)
Green and red, green twice as likely: p_red=1/3, p_green=2/3, I = -(1/3 log(1/3) + 2/3 log(2/3)) = log(3) - 2/3 log(2)

这是消息的信息内容或熵。当不同的符号是等概率时，它是最大的。如果你是物理学家，你会用自然对数，如果你是计算机科学家，你会用log2得到比特。

2009-12-07 13:45:44

我真的建议你读一下信息理论，贝叶斯方法和MaxEnt。我们可以从大卫·麦凯(David Mackay)的这本书(网上免费下载)开始:

http://www.inference.phy.cam.ac.uk/mackay/itila/

2009-12-14 17:42:55

首先，最好了解信息的度量方法。

我们如何衡量这些信息?

当一件不太可能发生的事情发生时，我们说这是一个大新闻。此外，当我们说一些可预测的事情时，它真的不有趣。为了量化这个有趣的东西，函数应该满足

如果事件的概率为1(可预测)，则函数给出0 如果事件的概率接近于0，则函数应给出较高的数值如果概率为0.5的事件发生，则给出1比特的信息。

满足约束条件的一个自然度量是

I(X) = -log_2(p)

其中p是事件x的概率，单位是位，与计算机使用的位相同。0或者1。

示例1

公平抛硬币:

从一次抛硬币中我们能得到多少信息?

答案:-log(p) = -log(1/2) = 1(位)

示例2

如果明天有一颗流星撞击地球，p=2^{-22}，那么我们可以得到22位的信息。

如果太阳明天升起，p ~ 1，那么它是0位信息。

熵

所以如果我们对事件Y的有趣性取期望，那么它就是熵。也就是说，熵是一个事件有趣度的期望值。

H(Y) = E[ I(Y)]

更正式地说，熵是一个事件的预期比特数。

例子

Y = 1:事件X发生的概率为p

Y = 0:事件X发生的概率为1-p

H(Y) = E[I(Y)] = p I(Y==1) + (1-p) I(Y==0) 
     = - p log p - (1-p) log (1-p)

以2为底的对数。

2014-07-04 01:27:30

什么是“熵和信息增益”?

推荐文章

最新文章

标签