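Writing Unicode text to a text file?

I'm pulling data out of a Google Doc, processing it, and writing it to a file (which I will eventually paste into a WordPress page).

It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source?

Currently I'm converting everything to Unicode on the way in, joining it all together in a Python string, then doing: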

import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))

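There is an encoding error on the last line:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 12286: ordinal not in range(128)

Partial solution:

This Python runs without an error: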

row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))

But then if I open the actual text file, I see lots of symbols like:

Qur‚Äôan

Maybe I need to write to something other than a text file?

Current answer

If writing in Python 3:

>>> a = u'bats\u00E0'
>>> print(a)
batsà
>>> f = open("/tmp/test", "w", encoding="utf-8")
>>> f.write(a)
5
>>> f.close()
>>> data = open("/tmp/test", encoding="utf-8").read()
>>> data
'batsà'

If writing in Python 2:

>>> a = u'bats\u00E0'
>>> f = open("/tmp/test", "w")
>>> f.write(a)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

To avoid this error, you have to encode it to bytes using the "utf-8" codec, like this:

>>> f.write(a.encode("utf-8"))
>>> f.close()

And decode the data while reading, using the "utf-8" codec:

>>> data = open("/tmp/test").read()
>>> data.decode("utf-8")
u'bats\xe0'

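Also, if you print this unicode string, Python 2 automatically encodes it using the terminal's encoding (sys.stdout.encoding, often UTF-8), like this: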

>>> print a
batsà
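
As a hedged sketch of the same round trip in Python 3, the snippet below passes the encoding explicitly instead of relying on the platform default, and reads the file back in binary mode to show the bytes that actually landed on disk. The scratch path and the variable names are assumptions for illustration, not part of the answer above.

```python
import os
import tempfile

text = u'bats\u00E0'

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Text mode with an explicit encoding: a str goes in, UTF-8 bytes land on disk.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Binary mode shows the raw bytes that were written.
with open(path, "rb") as f:
    raw = f.read()

# Reading with the same encoding gives back the original string.
with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

print(repr(raw))
print(round_tripped == text)
```

The à character occupies two bytes in UTF-8, which is why the file is one byte longer than the string has characters.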

2019-04-26 09:24:12

Other answers

How to print Unicode characters to a file:

Save this to a file, foo.py:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys 
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print(u'e with obfuscation: é')

Run it and redirect the output to a file:

python foo.py > tmp.txt

Open tmp.txt and look inside; you will see this:

el@apollo:~$ cat tmp.txt 
e with obfuscation: é

Thus you have saved the Unicode e with an obfuscation mark on it (é) to a file.
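
To make visible what the UTF8Writer wrapper in foo.py does, this sketch points the same codecs.getwriter('utf8') writer at an in-memory io.BytesIO buffer instead of sys.stdout. The buffer and the variable names are assumptions for illustration, and the snippet is written for Python 3.

```python
import codecs
import io

# Stand-in for a byte-oriented output stream such as a redirected stdout.
buffer = io.BytesIO()

# getwriter returns a StreamWriter class; instantiating it wraps the stream.
UTF8Writer = codecs.getwriter('utf8')
writer = UTF8Writer(buffer)

# The writer accepts text and pushes UTF-8 encoded bytes into the stream.
writer.write(u'e with obfuscation: \u00e9\n')

raw = buffer.getvalue()
print(repr(raw))
```

The wrapper needs a stream that accepts bytes, so in Python 3 it would have to wrap sys.stdout.buffer rather than sys.stdout itself.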

2013-12-27 18:36:02

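Deal exclusively with unicode objects as much as possible, by decoding things to unicode objects when you first get them and encoding them as necessary on the way out.

If your string is actually a unicode object, you'll need to convert it to an encoded byte string (a str object, here UTF-8) before writing it to a file: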

foo = u'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()

When you read that file again, you'll get an encoded byte string that you can decode to a unicode object:

f = open('test', 'r')
print f.read().decode('utf8')
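
The decode step only works if it uses the codec the file was written with. As a hedged illustration in Python 3, the sketch below decodes the same UTF-8 bytes twice: once correctly and once as Mac Roman. That a viewer assumed Mac Roman is a guess (the thread doesn't say), but it does reproduce the exact symbols shown in the question.

```python
# Curly apostrophe (U+2019), as commonly produced by word processors.
original = u'Qur\u2019an'

# What f.write(text.encode('utf-8')) puts on disk.
utf8_bytes = original.encode('utf-8')

# Decoding with the matching codec recovers the text.
right = utf8_bytes.decode('utf-8')

# Decoding with a different single-byte codec turns the three
# bytes of the apostrophe into three unrelated characters.
wrong = utf8_bytes.decode('mac_roman')

print(right == original)
# u'Qur\u201a\u00c4\u00f4an' is the string displayed as Qur‚Äôan
print(wrong == u'Qur\u201a\u00c4\u00f4an')
```

If that guess is right, the file in the question was already correct UTF-8 and only the program displaying it needed to be told the encoding.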

2011-05-18 16:49:01

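The file opened by codecs.open is a file that takes unicode data, encodes it in iso-8859-1 and writes it to the file. However, what you try to write isn't unicode; you take unicode and encode it in iso-8859-1 yourself. That's what the unicode.encode method does, and the result of encoding a unicode string is a byte string (a str type).

You should either use normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.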
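
The two options can be sketched side by side. This is a minimal illustration rather than the answerer's code: it swaps in io.open for codecs.open (io.open accepts an encoding argument in both Python 2.6+ and Python 3), uses UTF-8 instead of iso-8859-1, and the sample string and file names are made up.

```python
import io
import os
import tempfile

# Stand-in for the asker's all_html; the content is hypothetical.
all_html = u'caf\u00e9<br/>na\u00efve'
tmp_dir = tempfile.mkdtemp()

# Option 1: open a plain binary file and encode the text yourself.
path_manual = os.path.join(tmp_dir, 'manual.txt')
with open(path_manual, 'wb') as f:
    f.write(all_html.encode('utf-8'))

# Option 2: let the file object encode; pass text and never call .encode().
path_auto = os.path.join(tmp_dir, 'auto.txt')
with io.open(path_auto, 'w', encoding='utf-8') as f:
    f.write(all_html)

# Both routes should produce identical bytes on disk.
with open(path_manual, 'rb') as a, open(path_auto, 'rb') as b:
    same_bytes = a.read() == b.read()
print(same_bytes)
```

Mixing the two, encoding by hand and then writing to an encoding-aware file, is what triggers the error in the question.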

2011-05-18 16:44:35

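That error arises when you try to encode a non-unicode string: Python 2 first tries to decode it, assuming it's plain ASCII. There are two possibilities:

1. You're encoding it to a byte string, but because you've used codecs.open, the write method expects a unicode object. So you encode it, and it tries to decode it again. Try f.write(all_html) instead.
2. all_html is not, in fact, a unicode object. When you call .encode(...), it first tries to decode it.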
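
The hidden decode step can be made explicit. Python 3 no longer performs it, so this sketch spells out by hand roughly what Python 2 does when .encode() is called on a byte string; the helper name and the sample bytes are hypothetical.

```python
# Hypothetical byte string containing a non-ASCII byte (0xa0).
payload = b'hello\xa0world'


def py2_style_encode(byte_string, codec):
    """Roughly what Python 2 does for str.encode(codec): decode as ASCII first."""
    return byte_string.decode('ascii').encode(codec)


try:
    py2_style_encode(payload, 'iso-8859-1')
    message = None
except UnicodeDecodeError as exc:
    # The failure comes from the implicit decode, not from the encode.
    message = str(exc)

print(message)
```

This is why an encode call can raise a decode error: the ASCII decode fails before the requested codec is ever used.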

2011-05-18 16:45:01
