我从Mark Amery的回答中改编了代码,特别是为了摆脱isinstance的鸭子键入的优点。
编码是手动完成的,ensure_ascii是禁用的。json的Python文档。Dump说:
如果ensure_ascii为True(默认值),输出中的所有非ascii字符将使用\uXXXX序列转义
免责声明:在文档测试中,我使用了匈牙利语。一些著名的与匈牙利语相关的字符编码有:cp852,在DOS中使用的IBM/OEM编码(有时被称为ASCII)。我认为这是不正确的,因为它取决于代码页设置)。Windows-1250用于Windows(有时称为ANSI,取决于区域设置),ISO 8859-1有时用于HTTP服务器。
测试文本Tüskéshátú kígyóbűvölő来自Koltai László(本地个人姓名形式),来自维基百科。
# coding: utf-8
"""
This file should be encoded correctly with utf-8.
"""
import json
def encode_items(input, encoding='utf-8'):
u"""original from: https://stackoverflow.com/a/13101776/611007
adapted by SO/u/611007 (20150623)
>>>
>>> ## run this with `python -m doctest <this file>.py` from command line
>>>
>>> txt = u"Tüskéshátú kígyóbűvölő"
>>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
>>> txt3 = u"uúuutifu"
>>> txt4 = b'u\\xfauutifu'
>>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
>>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
>>> txt4u = txt4.decode('cp1250')
>>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
>>> txt5 = b"u\\xc3\\xbauutifu"
>>> txt5u = txt5.decode('utf-8')
>>> txt6 = u"u\\u251c\\u2551uutifu"
>>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
>>> assert txt == there_and_back_again(txt)
>>> assert txt == there_and_back_again(txt2)
>>> assert txt3 == there_and_back_again(txt3)
>>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
>>> assert txt3 == txt4u,(txt3,txt4u)
>>> assert txt3 == there_and_back_again(txt5)
>>> assert txt3 == there_and_back_again(txt5u)
>>> assert txt3 == there_and_back_again(txt4u)
>>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
>>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
>>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
>>> assert {'a':txt2.encode('utf-8')} == encode_items({'a':txt}, encoding='utf-8')
>>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
>>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
>>> assert [{'a':txt2.encode('utf-8')}] == encode_items([{'a':txt}], encoding='utf-8')
>>> assert {'b':{'a':txt2.encode('utf-8')}} == encode_items({'b':{'a':txt}}, encoding='utf-8')
"""
try:
input.iteritems
return {encode_items(k): encode_items(v) for (k,v) in input.iteritems()}
except AttributeError:
if isinstance(input, unicode):
return input.encode(encoding)
elif isinstance(input, str):
return input
try:
iter(input)
return [encode_items(e) for e in input]
except TypeError:
return input
def alt_dumps(obj, **kwargs):
"""
>>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
'{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
"""
if 'ensure_ascii' in kwargs:
del kwargs['ensure_ascii']
return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)
我还想强调Jarret Hardie引用JSON规范的答案,引用如下:
字符串是零个或多个Unicode字符的集合
在我的用例中,我有带有JSON内容的文件。它们是UTF-8编码的文件。ensure_ascii的结果是正确转义,但不是非常可读的JSON文件,这就是为什么我改编了Mark Amery的答案来满足我的需要。
doctest不是特别周到,但我分享了代码,希望它对某人有用。