Extracting text from an HTML file using Python

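I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.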

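I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picks up unwanted text, such as JavaScript source. It also doesn't interpret HTML entities. For example, I would expect &#39; in the HTML source to be converted to an apostrophe in the text, just as if I'd pasted the browser content into Notepad.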

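Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it doesn't exactly produce plain text; it produces Markdown, which would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.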

Related questions:

Filter out HTML tags and resolve entities in Python
Convert XML/HTML Entities into Unicode String in Python

Current answer

html2text is a Python program that does a pretty good job at this.

2008-11-30 03:23:58

Other answers

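PyParsing does a great job. The PyParsing wiki was taken down, so here is another location with examples of PyParsing usage (example link). One reason to invest a little time in pyparsing is that its author has also written a very brief, well-organized O'Reilly Short Cut manual, and it is inexpensive.

Having said that, I use BeautifulSoup a lot, and the entity problem is not hard to deal with: you can convert the entities before you run BeautifulSoup.

Good luck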
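As a small illustration of converting entities up front, here is a sketch that uses the standard library's html.unescape (Python 3.4+). The answer does not name a specific function, so this choice is an assumption, and the helper name decode_entities is made up for the example.

```python
import html


def decode_entities(markup):
    """Replace named and numeric character references with the characters they stand for."""
    return html.unescape(markup)


sample = "It&#39;s a &quot;test&quot; &amp; more"
print(decode_entities(sample))
```

One thing to watch: decoding before parsing also turns &lt; into a literal <, so text that was deliberately escaped may then be read as markup. Decoding the extracted text afterwards avoids that.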

2008-11-30 15:46:19

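Beautiful Soup does convert HTML entities. It is probably your best bet, considering that HTML is often buggy and full of Unicode and HTML encoding issues. This is the code I use to convert HTML to raw text: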

import copy
import re

import BeautifulSoup  # BeautifulSoup 3 API (the old Python 2 package, not bs4)


def getsoup(data, to_unicode=False):
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
    massage_bad_comments = [
        # Turn a malformed "<!-x" comment opener into "<!--x".
        (re.compile(r'<!-([^-])'), lambda match: '<!--' + match.group(1)),
        # Wrap stray "<!WWWAnswer T...>" declarations in a comment.
        (re.compile(r'<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(massage_bad_comments)
    return BeautifulSoup.BeautifulSoup(
        data,
        markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES if to_unicode else None)


def remove_html(c):
    # Parse with entity conversion on, then join the text nodes with single spaces.
    return getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

2012-11-30 08:23:23

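I found myself facing the same problem today. I wrote a very simple HTML parser to strip the incoming content of all markup, returning the remaining text with only a minimum of formatting.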

from html.parser import HTMLParser  # Python 3; on Python 2 use "from HTMLParser import HTMLParser"
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:  # fall back to the raw input, but let KeyboardInterrupt through
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()
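
The parser above keeps every text node, so inline JavaScript or CSS would end up in the output. A minimal sketch of one way to leave those out, assuming Python 3's html.parser, is shown below. The names VisibleTextParser and visible_text are hypothetical, and this is not part of the original answer.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True means entities such as &#39; arrive already decoded.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def visible_text(markup):
    parser = VisibleTextParser()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of input
    return parser.text()


sample = "<p>It&#39;s <b>bold</b></p><script>var x = 1;</script><p>done</p>"
print(visible_text(sample))
```

Because every text node is joined with a single space, a word split by an inline tag (for example foo<b>bar</b>) comes out as two words, and paragraph breaks are not preserved. Treat it as a starting point rather than a browser-faithful renderer.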

2010-10-21 13:14:38

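Note: NLTK no longer supports the clean_html function.

The original answer is below, and there is an alternative in the comments section.

Use NLTK

I wasted 4-5 hours fixing problems with html2text. Luckily I came across NLTK. It works magically.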

import nltk
from urllib import urlopen  # Python 2; in Python 3 this is urllib.request.urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
raw = nltk.clean_html(html)  # only works in older NLTK releases
print(raw)

2011-11-20 12:34:09
