Decode HTML entities in Python string?

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?

Current answer

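Beautiful Soup 4 lets you set a formatter for your output.

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples: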

print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>
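
To make the "invalid HTML" point concrete without depending on Beautiful Soup, here is a minimal standard-library sketch. It assumes Python 3.4+ and reuses the sample sentence from the output above; the variable names are illustrative only, and this is not how Beautiful Soup implements its formatters.

```python
import html

# Decode the entities once, as a parser would when reading the document.
decoded = html.unescape("Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;")
print(decoded)

# Writing the decoded text straight back into markup leaves bare '<' and '>'
# characters, which is roughly what formatter=None does on output.
unescaped_output = "<p>" + decoded + "</p>"

# Escaping on output keeps the markup well formed.
escaped_output = "<p>" + html.escape(decoded, quote=False) + "</p>"

print(unescaped_output)
print(escaped_output)
```

The second form is the safer default whenever the result will be parsed as HTML again.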

2014-01-14 10:03:44

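Other answers

Beautiful Soup handles entity conversion. In Beautiful Soup 3, you need to specify the convertEntities argument to the BeautifulSoup constructor (see the "Entity Conversion" section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.

Beautiful Soup 3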

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>", 
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>

Beautiful Soup 4

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>

2010-01-18 16:19:14

Python 3.4+

Use html.unescape():

import html
print(html.unescape('&pound;682m'))

FYI, html.parser.HTMLParser.unescape is deprecated and was supposed to be removed in 3.5, although it was left in by mistake. It was eventually removed in Python 3.9.
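
As a quick illustration of what html.unescape() accepts, the sketch below runs it over a few hand-picked strings (the sample list is made up for this example). It decodes named, decimal and hexadecimal references alike, and it makes a single pass, so a double-escaped entity is only decoded one level.

```python
import html

# Hypothetical sample inputs covering the common entity forms.
samples = [
    "&pound;682m",      # named entity
    "&#163;682m",       # decimal character reference
    "&#xA3;682m",       # hexadecimal character reference
    "Fish &amp; chips", # ampersand
    "&amp;pound;",      # double-escaped: only one level is decoded
]

decoded = [html.unescape(s) for s in samples]

for raw, out in zip(samples, decoded):
    print(repr(raw), "->", repr(out))
```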

Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

For Python 2.6-2.7 it's in HTMLParser; for Python 3 it's in html.parser.

>>> try:
...     # Python 2.6-2.7 
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
... 
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
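
If the same code has to run on both old and new interpreters, the two approaches can be combined. The helper below is only a sketch: the name unescape_entities is made up for this example, and it simply prefers html.unescape() where it exists and falls back to HTMLParser.unescape() otherwise.

```python
def unescape_entities(text):
    """Decode HTML entities with whichever standard-library function exists."""
    try:
        from html import unescape  # Python 3.4+
    except ImportError:
        try:
            from html.parser import HTMLParser  # Python 3.0-3.3
        except ImportError:
            from HTMLParser import HTMLParser  # Python 2.6-2.7
        unescape = HTMLParser().unescape
    return unescape(text)


result = unescape_entities("&pound;682m")
print(result)
```

Trying html.unescape() first matters because HTMLParser.unescape() no longer exists on recent Python 3 releases.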

2010-01-18 16:17:50

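This probably isn't relevant here. But to eliminate these HTML entities from an entire document, you can do something like this (assume document = page, and please forgive the sloppy code; if you have ideas on how to make it better, I'm all ears, I'm new to this).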

import re
import HTMLParser

regexp = "&.+?;"
list_of_html = re.findall(regexp, page)  # finds all HTML entities in page
h = HTMLParser.HTMLParser()  # create the parser once, not on every iteration
for e in list_of_html:
    unescaped = h.unescape(e)  # finds the unescaped value of the HTML entity
    page = page.replace(e, unescaped)  # replaces HTML entity with unescaped value
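
For comparison, on Python 3.4+ the whole document can be passed to html.unescape() in one call. The sketch below assumes a made-up sample page and ports the find-then-replace loop to html.unescape() (as the hypothetical loop_unescape) to show one difference: the loop can decode an escaped entity twice, while the single call decodes each reference exactly once.

```python
import html
import re


def loop_unescape(page):
    # Find-then-replace approach from the answer above, using html.unescape.
    for entity in re.findall(r"&.+?;", page):
        page = page.replace(entity, html.unescape(entity))
    return page


# Hypothetical sample: "&amp;lt;" is text that should display as "&lt;".
page = "<p>&amp;lt; is how you write &lt; &pound;682m</p>"

single_pass = html.unescape(page)
looped = loop_unescape(page)

print(single_pass)
print(looped)
```

Either way, decoding a full HTML document also turns &lt; and &amp; into literal characters, so the result is best treated as text rather than as markup to be parsed again.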

2012-12-18 18:28:55

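I had a similar encoding problem. I used the normalize() method. I was getting a Unicode error from the pandas .to_html() method when exporting my data frame to an .html file in another directory. I ended up doing this, and it worked...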

    import unicodedata
    import pandas as pd

The dataframe object can be whatever you like; let's call it table...

    table = pd.DataFrame(data,columns=['Name','Team','OVR / POT'])
    table.index+= 1

Encode the table data so that we can export it to our .html file in the templates folder (this can be whatever location you wish :))

    # this is where the magic happens
    html_data = unicodedata.normalize('NFKD', table.to_html()).encode('ascii', 'ignore')

Export the normalized string to the HTML file

    # html_data is bytes after .encode(), so open the file in binary mode
    with open("templates/home.html", "wb") as f:
        f.write(html_data)

Reference: unicodedata documentation
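
It is worth seeing what that normalize-and-encode step does to the text. The sketch below uses a made-up sample string instead of a data frame: NFKD followed by encode('ascii', 'ignore') strips accents and silently drops characters such as £, whereas the 'xmlcharrefreplace' error handler (offered here as an alternative, not part of the answer above) keeps them as numeric character references that can be decoded again.

```python
import html
import unicodedata

text = "Sacr\u00e9 bleu! \u00a3682m"  # "Sacré bleu! £682m"

# Approach from the answer: decompose, then drop anything that is not ASCII.
stripped = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")

# Alternative: keep non-ASCII characters as numeric character references.
as_entities = text.encode("ascii", "xmlcharrefreplace")

print(stripped)
print(as_entities)

# The entity form round-trips back to the original text.
round_trip = html.unescape(as_entities.decode("ascii"))
print(round_trip == text)
```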

2020-04-02 21:03:50
