我使用Python 2从ASCII编码的文本文件解析JSON。

当用json或simplejson加载这些文件时,我的所有字符串值都转换为Unicode对象而不是字符串对象。问题是,我必须将数据与一些只接受字符串对象的库一起使用。我不能更改库也不能更新它们。

是否有可能获得字符串对象而不是Unicode对象?

例子

>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b']  # I want these to be of type `str`, not `unicode`

(2017年一个简单而干净的解决方案是使用最新版本的Python——即Python 3和更高版本。)


当前回答

我也遇到了这个问题,不得不处理JSON,我想出了一个小循环,将Unicode键转换为字符串。(GAE上的simplejson不返回字符串键。)

obj是从JSON解码的对象:

if NAME_CLASS_MAP.has_key(cls):
    kwargs = {}
    for i in obj.keys():
        kwargs[str(i)] = obj[i]
    o = NAME_CLASS_MAP[cls](**kwargs)
    o.save()

kwargs是我传递给GAE应用程序的构造函数的内容(它不喜欢**kwargs中的Unicode键)。

它不如Wells的解决方案健壮,但要小得多。

其他回答

恐怕在simplejson库中没有任何方法可以自动实现这一点。

The scanner and decoder in simplejson are designed to produce Unicode text. To do this, the library uses a function called c_scanstring (if it's available, for speed), or py_scanstring if the C version is not available. The scanstring function is called several times by nearly every routine that simplejson has for decoding a structure that might contain text. You'd have to either monkey patch the scanstring value in simplejson.decoder, or subclass JSONDecoder and provide pretty much your own entire implementation of anything that might contain text.

然而,simplejson输出Unicode的原因是JSON规范特别提到“字符串是0个或多个Unicode字符的集合”……对Unicode的支持被假定为格式本身的一部分。simplejson的扫描字符串实现甚至扫描和解释Inicode转义(甚至错误检查格式不正确的多字节字符集表示),因此它能够可靠地将值返回给您的唯一方法是Unicode。

如果你有一个老旧的库,需要一个str,我建议你在解析后费力地搜索嵌套的数据结构(我承认这是你明确说过你想避免的…对不起),或者可能将库包装在某种外观中,在这种外观中您可以在更细粒度的级别上处理输入参数。如果数据结构确实嵌套很深,第二种方法可能比第一种方法更易于管理。

我从Mark Amery的回答中改编了代码,特别是为了摆脱isinstance的鸭子键入的优点。

编码是手动完成的,ensure_ascii是禁用的。json的Python文档。Dump说:

如果ensure_ascii为True(默认值),输出中的所有非ascii字符将使用\uXXXX序列转义

免责声明:在文档测试中,我使用了匈牙利语。一些著名的与匈牙利语相关的字符编码有:cp852,在DOS中使用的IBM/OEM编码(有时被称为ASCII)。我认为这是不正确的,因为它取决于代码页设置)。Windows-1250用于Windows(有时称为ANSI,取决于区域设置),ISO 8859-1有时用于HTTP服务器。

测试文本Tüskéshátú kígyóbűvölő来自Koltai László(本地个人姓名形式),来自维基百科。

# coding: utf-8
"""
This file should be encoded correctly with utf-8.
"""
import json

def encode_items(input, encoding='utf-8'):
    u"""original from: https://stackoverflow.com/a/13101776/611007
    adapted by SO/u/611007 (20150623)
    >>>
    >>> ## run this with `python -m doctest <this file>.py` from command line
    >>>
    >>> txt = u"Tüskéshátú kígyóbűvölő"
    >>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
    >>> txt3 = u"uúuutifu"
    >>> txt4 = b'u\\xfauutifu'
    >>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
    >>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
    >>> txt4u = txt4.decode('cp1250')
    >>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
    >>> txt5 = b"u\\xc3\\xbauutifu"
    >>> txt5u = txt5.decode('utf-8')
    >>> txt6 = u"u\\u251c\\u2551uutifu"
    >>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
    >>> assert txt == there_and_back_again(txt)
    >>> assert txt == there_and_back_again(txt2)
    >>> assert txt3 == there_and_back_again(txt3)
    >>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
    >>> assert txt3 == txt4u,(txt3,txt4u)
    >>> assert txt3 == there_and_back_again(txt5)
    >>> assert txt3 == there_and_back_again(txt5u)
    >>> assert txt3 == there_and_back_again(txt4u)
    >>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
    >>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
    >>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
    >>> assert {'a':txt2.encode('utf-8')} == encode_items({'a':txt}, encoding='utf-8')
    >>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
    >>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
    >>> assert [{'a':txt2.encode('utf-8')}] == encode_items([{'a':txt}], encoding='utf-8')
    >>> assert {'b':{'a':txt2.encode('utf-8')}} == encode_items({'b':{'a':txt}}, encoding='utf-8')
    """
    try:
        input.iteritems
        return {encode_items(k): encode_items(v) for (k,v) in input.iteritems()}
    except AttributeError:
        if isinstance(input, unicode):
            return input.encode(encoding)
        elif isinstance(input, str):
            return input
        try:
            iter(input)
            return [encode_items(e) for e in input]
        except TypeError:
            return input

def alt_dumps(obj, **kwargs):
    """
    >>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
    '{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
    """
    if 'ensure_ascii' in kwargs:
        del kwargs['ensure_ascii']
    return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)

我还想强调Jarret Hardie引用JSON规范的答案,引用如下:

字符串是零个或多个Unicode字符的集合

在我的用例中,我有带有JSON内容的文件。它们是UTF-8编码的文件。ensure_ascii的结果是正确转义,但不是非常可读的JSON文件,这就是为什么我改编了Mark Amery的答案来满足我的需要。

doctest不是特别周到,但我分享了代码,希望它对某人有用。

没有内置选项让json模块函数返回字节字符串而不是Unicode字符串。然而,这个简短而简单的递归函数将任何解码的JSON对象从使用Unicode字符串转换为utf -8编码的字节字符串:

def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

只需在从json中获得的输出上调用此函数。加载或json。负载的电话。

几点注意事项:

To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dictionary comprehensions weren't supported until Python 2.7. Since this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics that can be avoided with very careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf's answer is so far the only one that manages to pull this off correctly, although as a consequence, it's significantly more complicated than my approach.

使用钩子支持Python 2和3(来自Mirec Miskuf的回答):

import requests
import six
from six import iteritems

requests.packages.urllib3.disable_warnings()  # @UndefinedVariable
r = requests.get("http://echo.jsontest.com/key/value/one/two/three", verify=False)

def _byteify(data):
    # If this is a Unicode string, return its string representation
    if isinstance(data, six.string_types):
        return str(data.encode('utf-8').decode())

    # If this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item) for item in data ]

    # If this is a dictionary, return dictionary of byteified keys and values,
    # but only if we haven't already byteified it
    if isinstance(data, dict):
        return {
            _byteify(key): _byteify(value) for key, value in iteritems(data)
        }
    # If it's anything else, return it in its original form
    return data

w = r.json(object_hook=_byteify)
print(w)

返回:

 {'three': '', 'key': 'value', 'one': 'two'}

下面是一个用C语言编写的递归编码器: https://github.com/axiros/nested_encode

与json.loads()相比,“平均”结构的性能开销约为10%。

python speed.py
  json loads            [0.16sec]: {u'a': [{u'b': [[1, 2, [u'\xd6ster..
  json loads + encoding [0.18sec]: {'a': [{'b': [[1, 2, ['\xc3\x96ster.
  time overhead in percent: 9%

使用这个测试结构:

import json, nested_encode, time

s = """
{
  "firstName": "Jos\\u0301",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "\\u00d6sterreich",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null,
  "a": [{"b": [[1, 2, ["\\u00d6sterreich"]]]}]
}
"""


t1 = time.time()
for i in xrange(10000):
    u = json.loads(s)
dt_json = time.time() - t1

t1 = time.time()
for i in xrange(10000):
    b = nested_encode.encode_nested(json.loads(s))
dt_json_enc = time.time() - t1

print "json loads            [%.2fsec]: %s..." % (dt_json, str(u)[:20])
print "json loads + encoding [%.2fsec]: %s..." % (dt_json_enc, str(b)[:20])

print "time overhead in percent: %i%%"  % (100 * (dt_json_enc - dt_json)/dt_json)