I am trying to scrape a website, but it gives me an error.

I am using the following code:

import urllib.request
from bs4 import BeautifulSoup

get = urllib.request.urlopen("https://www.website.com/")
html = get.read()

soup = BeautifulSoup(html)

print(soup)

I get the following error:

File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>

How can I fix this?


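Current answer

There are multiple aspects to this problem. The fundamental question is which character set you want to output into. You may also have to figure out the input character set.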

Printing (with either print or write) into a file with an explicit encoding="..." will translate Python's internal Unicode representation into that encoding. If the output contains characters which are not supported by that encoding, you will get a UnicodeEncodeError. For example, you can't write Russian or Chinese or Indic or Hebrew or Arabic or emoji or ... anything except a restricted set of some 200+ Western characters to a file whose encoding is "cp1252", because this limited 8-bit character set has no way to represent these characters.

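Basically the same problem will occur with any 8-bit character set, including nearly all the legacy Windows code pages (437, 850, 1250, 1251, etc.), though some of them support an additional script in addition to or instead of English (1251 supports Cyrillic, for example, so you can write Russian, Ukrainian, Serbian, Bulgarian, etc.). An 8-bit encoding has a maximum of 256 character codes and no way to represent a character which isn't among them.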
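As a minimal sketch of this limitation, the following checks whether a few sample strings can be represented in a couple of 8-bit code pages and in UTF-8. The helper name can_encode and the sample strings are made up for illustration; ascii() is used when printing so that the demo itself cannot trigger the error on a legacy console.

```python
def can_encode(text, encoding):
    """Return True if every character in text can be represented in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


samples = ["Héllö", "Привет", "你好"]  # illustrative strings, not from the question
for sample in samples:
    results = {enc: can_encode(sample, enc) for enc in ("cp1252", "cp1251", "utf-8")}
    # ascii() escapes non-ASCII characters, so this print is safe on any console
    print(ascii(sample), results)
```

Only UTF-8 should report True for all three samples; each 8-bit code page covers just its own script.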

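Perhaps now would be a good time to read Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).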

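On platforms where the terminal is not capable of printing Unicode (these days only Windows, really, though if you are into retrocomputing, this problem was also prevalent on other platforms in the previous millennium), attempting to print Unicode strings can also produce this error, or output mojibake. If you see something like HÃ©llÃ¶ instead of Héllö, this is your issue.

In short, you need to know: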

What is the character set of the page you scraped, or the data you received? Was it correctly scraped? Did the originator correctly identify its encoding, or are you able to otherwise obtain this information (or guess it)? Some web sites incorrectly declare a different character set than the page actually contains, and some sites have incorrectly configured the connection between the web server and a back-end database. See e.g. scrape with correct character encoding (python requests + beautifulsoup) for a more detailed example with some solutions.

What is the character set you want to write? If printing to the screen, is your terminal correctly configured, and is your Python interpreter configured identically? Perhaps see also How to display utf-8 in windows console
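For the first question, one place to look is the charset the server declares in its Content-Type header. The sketch below parses such a header value with the standard library; the function names declared_charset and decode_body, and the UTF-8 fallback, are assumptions made for this example. With urllib.request, the same information is available from the response as response.headers.get_content_charset(). Keep in mind that the declaration can be wrong, as noted above.

```python
from email.message import Message


def declared_charset(content_type, default=None):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the charset in lower case, or default if none is declared
    return msg.get_content_charset(default)


def decode_body(raw, content_type, fallback="utf-8"):
    """Decode raw bytes using the declared charset (fallback is an assumption)."""
    charset = declared_charset(content_type) or fallback
    # errors="replace" keeps going if the declaration turns out to be wrong
    return raw.decode(charset, errors="replace")


body = "Héllö".encode("latin-1")  # pretend this came from the network
text = decode_body(body, "text/html; charset=ISO-8859-1")
print(ascii(text))
```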

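If you are here, probably the answer to one of these questions is not "UTF-8". This is increasingly the prevalent encoding for web pages, too, though the former standard was ISO-8859-1 (aka Latin-1) and, more recently, Windows code page 1252.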

Going forward, you basically want all your textual data to be Unicode, outside of a few fringe use cases. Generally, that means UTF-8, though on Windows (or if you need Java compatibility), UTF-16 is also vaguely viable, albeit somewhat cumbersome. (There are several other Unicode serialization formats, which may be useful in specialized circumstances. UTF-32 is technically trivial, but takes up a lot more memory; UTF-7 is used in a few network protocols where 7-bit ASCII is required for transport.) Perhaps see also https://utf8everywhere.org/

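Of course, if you are printing to a file, you also need to examine that file with a tool which can display it correctly. A common beginner mistake is to open the file with a tool which only displays the currently selected system encoding, or with one which tries to guess the encoding but guesses wrong. Again, a common symptom when viewing UTF-8 text using Windows code page 1252 is that, for example, Héllö displays as HÃ©llÃ¶.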
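The symptom is easy to reproduce. This small sketch encodes a string as UTF-8 and then deliberately decodes the bytes with the wrong code page; the variable names are illustrative only.

```python
original = "Héllö"

# Correct bytes on disk: UTF-8 uses two bytes each for é and ö
utf8_bytes = original.encode("utf-8")

# A viewer that assumes cp1252 turns each of those bytes into its own character
misread = utf8_bytes.decode("cp1252")

# When the damage is exactly this round trip, reversing it recovers the text
repaired = misread.encode("cp1252").decode("utf-8")

print(ascii(misread))
print(ascii(repaired))
```

The file itself is fine in this situation; only the tool's choice of decoding is wrong.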

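If the encoding of character data is unknown, there is no simple way to automatically establish it. If you know what the text is supposed to represent, you can perhaps infer it, but this is typically a manual process involving some guesswork. (Automatic tools like chardet and ftfy can help, but they get it wrong some of the time, too.)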

To establish which encoding you are looking at, it can be helpful if you can identify the individual bytes in a character which isn't displayed correctly. For example, if you are looking at H\x8ell\x9a but expect it to represent Héllö, you can look up the bytes in a translation table. I have published one such table at https://tripleee.github.io/8bit where you can see that in this example, it's probably one of the legacy Mac 8-bit character sets; with more data points, perhaps you can narrow it down to just one of them (and if not, any one of them will do in practice, since all the code points you care about map to the same Unicode characters).
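The same lookup can be approximated in code by trying a handful of codecs and keeping those that produce the text you expect. This is only a sketch: the candidate list is an arbitrary selection of Python codec names, and decodes_to is a hypothetical helper.

```python
raw = b"H\x8ell\x9a"   # the bytes actually present in the data
expected = "Héllö"     # what the text is supposed to say

# Arbitrary illustrative selection; extend it with whatever seems plausible
candidates = ["cp1252", "latin-1", "cp437", "cp850", "mac_roman", "mac_iceland"]


def decodes_to(raw_bytes, encoding, wanted):
    """Return True if raw_bytes decoded with encoding gives exactly wanted."""
    try:
        return raw_bytes.decode(encoding) == wanted
    except UnicodeDecodeError:
        # Some code pages leave certain byte values undefined
        return False


matches = [enc for enc in candidates if decodes_to(raw, enc, expected)]
print(matches)
```

If several candidates match, more sample text is needed to tell them apart, exactly as with the manual table lookup.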

Python 3 on most platforms defaults to UTF-8 for all input and output, but on Windows, this is commonly not the case. It will then instead default to the system's default encoding (still misleadingly called "ANSI code page" in some Microsoft documentation), which depends on a number of factors. On Western systems, the default encoding out of the box is commonly Windows code page 1252. (Earlier Python versions had somewhat different expectations, and in Python 2, the internal string representation was not Unicode.)
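To see what your own interpreter defaults to, you can ask it directly. The function name report_encodings is made up for this sketch; the calls it makes are standard library ones (sys.flags.utf8_mode requires Python 3.7 or later).

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python will use when none is given explicitly."""
    return {
        # Encoding used by print(); may be None if stdout has been replaced
        "stdout": getattr(sys.stdout, "encoding", None),
        # Default for open() in text mode when no encoding= is passed
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names
        "filesystem": sys.getfilesystemencoding(),
        # True if the interpreter runs in UTF-8 mode (-X utf8 or PYTHONUTF8=1)
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


info = report_encodings()
for name, value in info.items():
    print(name, "=", value)
```

On a Western Windows installation, "preferred" will commonly report cp1252 unless UTF-8 mode is enabled.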

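If you are on Windows and write UTF-8 to a text file, you may want to specify encoding="utf-8-sig", which adds a BOM sequence at the beginning of the file. This is strictly speaking not necessary or correct, but some Windows tools need it to correctly identify the encoding.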
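A short sketch of what the utf-8-sig codec does, using a temporary file (the file name bom_demo.txt is arbitrary): it writes three extra bytes at the start of the file, and strips them again when reading with the same codec.

```python
import os
import tempfile

text = "Héllö"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bom_demo.txt")

    # Writing with utf-8-sig prepends the BOM bytes EF BB BF
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(text)

    with open(path, "rb") as f:
        raw = f.read()

    # Reading with utf-8-sig removes the BOM again
    with open(path, encoding="utf-8-sig") as f:
        round_trip = f.read()

    # Reading with plain utf-8 leaves the BOM in as the character U+FEFF
    with open(path, encoding="utf-8") as f:
        with_bom = f.read()

print(raw[:3])
print(ascii(with_bom))
```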

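Several of the earlier answers here suggest blindly applying some encoding, but hopefully this should help you understand why that is generally not the correct approach, and how to figure out (rather than guess) which encoding to use.

Other answers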

I fixed it by adding .encode("utf-8").

This means that print(soup) becomes print(soup.encode("utf-8")).

For those still getting this error, adding encode("utf-8") to soup will also fix it.

soup = BeautifulSoup(html_doc, 'html.parser').encode("utf-8")
print(soup)
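One caveat with this approach: encode() returns a bytes object, so print() shows its b'...' representation rather than readable text. The sketch below illustrates that, and offers a hypothetical helper, safe_print, which escapes only the characters the output stream cannot represent. The cp1252 console is simulated with an in-memory stream so the demo behaves the same on every platform.

```python
import io
import sys


def safe_print(text, stream=None):
    """Write text to stream, escaping characters the stream's encoding lacks."""
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # backslashreplace turns unsupported characters into \uXXXX escapes
    printable = text.encode(encoding, errors="backslashreplace").decode(encoding)
    stream.write(printable + "\n")


text = "Héllö 你好"  # illustrative string
encoded = text.encode("utf-8")
shown = repr(encoded)  # what print(encoded) displays: a bytes literal

# Simulate a cp1252 console with an in-memory stream
buffer = io.BytesIO()
console = io.TextIOWrapper(buffer, encoding="cp1252", newline="\n")
safe_print(text, console)
console.flush()
output = buffer.getvalue()

print(shown)
print(output)
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is another option, provided the terminal itself can display UTF-8.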

I got the same UnicodeEncodeError when saving scraped web content to a file. To fix it, I replaced this code:

with open(fname, "w") as f:
    f.write(html)

with this:

with open(fname, "w", encoding="utf-8") as f:
    f.write(html)

If you need to support Python 2, then use this:

import io
with io.open(fname, "w", encoding="utf-8") as f:
    f.write(html)

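If you want to use a different encoding than UTF-8, specify whatever your actual encoding is in encoding=.

On Python 3.7 running on Windows 10, this worked (I am not sure whether it will work on other platforms and/or other versions of Python).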

Replace this line:

with open('filename', 'w') as f:

with this:

with open('filename', 'w', encoding='utf-8') as f:

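The reason this works is that the file is now written as UTF-8, so every Unicode character in the text can be converted to bytes, instead of raising an error whenever a character is not supported by the platform's default encoding.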
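The difference can be reproduced on any platform by passing the legacy encoding explicitly, which mimics what open() does by default on a Western Windows system. This is a sketch under that assumption; try_write, the sample markup, and the file name are hypothetical.

```python
import os
import tempfile

html = "<p>Привет, 你好</p>"  # illustrative content with non-Western characters


def try_write(path, text, encoding):
    """Write text to path in the given encoding and report what happened."""
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
    except UnicodeEncodeError as exc:
        return "failed: " + exc.reason
    return "ok"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    legacy_result = try_write(path, html, "cp1252")  # mimics the Windows default
    utf8_result = try_write(path, html, "utf-8")

print("cp1252:", legacy_result)
print("utf-8:", utf8_result)
```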

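The same error was thrown on Python 3.7 on Windows 10 when saving the response of a GET request. The response received from the URL was encoded as UTF-8, so it is always advisable to check the encoding and pass the same one when writing the file. That avoids this kind of trivial issue, which can waste a lot of time in production.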

import requests
resp = requests.get('https://en.wikipedia.org/wiki/NIFTY_50')
print(resp.encoding)
with open ('NiftyList.txt', 'w') as f:
    f.write(resp.text)

When I added encoding="utf-8" to the open call, the file was saved with the correct response:

with open ('NiftyList.txt', 'w', encoding="utf-8") as f:
    f.write(resp.text)