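How to get string objects instead of Unicode from JSON

I'm using Python 2 to parse JSON from ASCII-encoded text files.

When loading these files with either json or simplejson, all my string values are converted to Unicode objects instead of string objects. The problem is that I have to use the data with some libraries that only accept string objects. I can't change the libraries, nor can I update them.

Is it possible to get string objects instead of Unicode ones?

Example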

>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b']  # I want these to be of type `str`, not `unicode`

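(One simple and clean solution as of 2017 is to use a recent version of Python, i.e. Python 3 or later.)

Current answer

There is no built-in option to make the json module functions return byte strings instead of Unicode strings. However, this short and simple recursive function will convert any decoded JSON object from using Unicode strings to UTF-8-encoded byte strings: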

def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

Just call this on the output you get from a json.load or json.loads call.

A couple of notes:

To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dictionary comprehensions weren't supported until Python 2.7.

Since this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics that can be avoided with very careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf's answer is so far the only one that manages to pull this off correctly, although as a consequence, it's significantly more complicated than my approach.
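For illustration, the object_hook idea from the last note could look roughly like the sketch below. It is not a copy of the answer referred to above: the names loads_byteified, _byteify, ignore_dicts and text_type are my own, and the try/except shim is an assumption added only so the snippet runs on both Python 2 and Python 3 (where the encoded values come out as bytes objects).

```python
import json

try:
    text_type = unicode  # Python 2: the text type is `unicode`
except NameError:
    text_type = str      # Python 3: `str` is already the text type


def _byteify(data, ignore_dicts=False):
    # Text strings are encoded to UTF-8 byte strings.
    if isinstance(data, text_type):
        return data.encode('utf-8')
    # Lists are walked, but dicts inside them were already handled by the hook.
    if isinstance(data, list):
        return [_byteify(item, ignore_dicts=True) for item in data]
    # Dicts are converted only when called as the object_hook itself.
    if isinstance(data, dict) and not ignore_dicts:
        return dict(
            (_byteify(key, ignore_dicts=True), _byteify(value, ignore_dicts=True))
            for key, value in data.items()
        )
    # Numbers, booleans, None and already-converted dicts pass through.
    return data


def loads_byteified(json_text):
    # The hook converts every JSON object as it is decoded; the outer call
    # covers a top-level list or string, which the hook never sees.
    return _byteify(json.loads(json_text, object_hook=_byteify), ignore_dicts=True)


result = loads_byteified('{"a": ["b", {"c": "d"}], "n": 1}')
```

Because the decoder calls the hook on inner objects first, each dict is converted exactly once, which is the point of the ignore_dicts flag.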

2012-10-28 00:27:17

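Other answers

That's because JSON makes no distinction between string objects and Unicode objects. They are all strings in JavaScript.

I think JSON is right to return Unicode objects. In fact, I wouldn't accept anything less, since JavaScript strings are in fact Unicode objects (i.e. JSON (JavaScript) strings can store any kind of Unicode character), so it makes sense to create Unicode objects when translating strings from JSON. Plain strings just wouldn't fit, since the library would have to guess the encoding you want.

It's better to use Unicode string objects everywhere. So your best option is to update your libraries so they can deal with Unicode objects.

But if you really want byte strings, just encode the results to the encoding of your choice: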

>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
>>> nl = [s.encode('utf-8') for s in nl]
>>> nl
['a', 'b']
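To see why the encoding has to be chosen explicitly, consider a non-ASCII value. The sketch below (the sample string is my own) encodes the same decoded text with two different codecs and gets two different byte sequences, which is the guess a library would otherwise have to make for you.

```python
import json

# The JSON escape \u00e9 decodes to the single character "e with acute accent".
decoded = json.loads('["caf\\u00e9"]')
text = decoded[0]

utf8_bytes = text.encode('utf-8')      # two bytes for the accented letter
latin1_bytes = text.encode('latin-1')  # one byte for the accented letter

print(repr(utf8_bytes))
print(repr(latin1_bytes))
```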

2009-06-05 16:44:45

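With Python 3.6, I sometimes still run into this problem. For example, when getting a response from a REST API and loading the response text as JSON, I still get Unicode strings. I found a simple solution using json.dumps().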

response_message = json.loads(json.dumps(response.text))
print(response_message)
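It may help to see what this round trip does. In the sketch below, response_text is a hypothetical stand-in for response.text. json.dumps wraps the text in a JSON string literal and json.loads unwraps it again, so the result is the same string that went in; parsing the payload itself is what json.loads(response_text) does. On Python 3 every str is already a Unicode string, so no separate byte-string conversion is involved.

```python
import json

response_text = '{"key": "value"}'  # hypothetical REST response body

# Dump-then-load of a string returns an equal string, not a parsed object.
round_tripped = json.loads(json.dumps(response_text))

# Loading the text directly parses it into Python objects.
parsed = json.loads(response_text)

print(type(round_tripped).__name__, type(parsed).__name__)
```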

2018-04-25 17:17:55

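I ran into this problem too, and since I had to deal with JSON anyway, I came up with a small loop that converts the Unicode keys to strings. (simplejson on GAE does not return string keys.)

obj is the object decoded from JSON: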

if cls in NAME_CLASS_MAP:  # `in` replaces the deprecated dict.has_key()
    kwargs = {}
    for i in obj.keys():
        kwargs[str(i)] = obj[i]
    o = NAME_CLASS_MAP[cls](**kwargs)
    o.save()

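kwargs is what I pass to the constructor of the GAE application (which does not like Unicode keys in **kwargs).

It is not as robust as the solution from Wells, but it is much smaller.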
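The same idea can be packaged as a small helper. This is only a sketch: stringify_keys and the Greeting class are hypothetical names of my own, standing in for the loop above and for whatever constructor receives the keyword arguments. Note that on Python 2, str(key) raises UnicodeEncodeError for non-ASCII keys, so this only suits ASCII key names; values are left untouched.

```python
import json


def stringify_keys(obj):
    """Return a copy of a decoded JSON object whose top-level keys are str."""
    return dict((str(key), value) for key, value in obj.items())


class Greeting(object):
    """Hypothetical stand-in for a class that takes keyword arguments."""

    def __init__(self, name=None, count=0):
        self.name = name
        self.count = count


decoded = json.loads('{"name": "hello", "count": 2}')
greeting = Greeting(**stringify_keys(decoded))
```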

2011-06-20 01:20:36

Supporting Python 2 and 3 using a hook (from Mirec Miskuf's answer):

import requests
import six
from six import iteritems

requests.packages.urllib3.disable_warnings()  # @UndefinedVariable
r = requests.get("http://echo.jsontest.com/key/value/one/two/three", verify=False)

def _byteify(data):
    # If this is a Unicode string, return its string representation
    if isinstance(data, six.string_types):
        return str(data.encode('utf-8').decode())

    # If this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item) for item in data ]

    # If this is a dictionary, return a dictionary of byteified keys and values
    if isinstance(data, dict):
        return {
            _byteify(key): _byteify(value) for key, value in iteritems(data)
        }
    # If it's anything else, return it in its original form
    return data

w = r.json(object_hook=_byteify)
print(w)

 {'three': '', 'key': 'value', 'one': 'two'}

2017-08-21 20:16:47

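Mark (Amery) correctly notes that using PyYAML's deserializer on a JSON dump works only if the data is ASCII-only, at least out of the box.

Two quick comments on the PyYAML approach:

Never use yaml.load() on data from the field. It is a feature(!) of YAML to execute arbitrary code hidden in the structure.

You can also make it work for non-ASCII data via this:

def to_utf8(loader, node):
    return loader.construct_scalar(node).encode('utf-8')
yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8)

But performance-wise it bears no comparison to Mark Amery's answer:

Throwing some deeply nested sample dicts at the two methods, I get this (with dt[j] = time delta of json.loads(json.dumps(m))):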

     dt[yaml.safe_load(json.dumps(m))] =~ 100 * dt[j]
     dt[byteify recursion(Mark Amery)] =~   5 * dt[j]

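So deserialization, including fully walking the tree and encoding, is well within an order of magnitude of JSON's C-based implementation. I find this remarkably fast, and it is also more robust than the YAML load on deeply nested structures. It is also less prone to security errors, looking at yaml.load.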
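A rough way to reproduce this kind of comparison with the standard library is sketched below. The sample data, nesting depth and repeat count are arbitrary assumptions, the byteify here is a Python 2/3-compatible variant of the function from the current answer, and the ratio you see will depend on your data and interpreter, so no expected numbers are given.

```python
import json
import timeit

try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str      # Python 3


def byteify(data):
    """Recursively encode every text string in a decoded JSON value."""
    if isinstance(data, dict):
        return dict((byteify(k), byteify(v)) for k, v in data.items())
    if isinstance(data, list):
        return [byteify(item) for item in data]
    if isinstance(data, text_type):
        return data.encode('utf-8')
    return data


def make_nested(depth):
    """Build a dict nested `depth` levels deep (made-up sample data)."""
    node = {'leaf': ['a', 'b', 'c']}
    for level in range(depth):
        node = {'level%d' % level: node, 'items': ['x', 'y']}
    return node


payload = json.dumps(make_nested(30))
runs = 200

# Time plain decoding, then decoding followed by the recursive conversion.
dt_json = timeit.timeit(lambda: json.loads(payload), number=runs)
dt_byteify = timeit.timeit(lambda: byteify(json.loads(payload)), number=runs)

print('json.loads only:      %.4f s' % dt_json)
print('json.loads + byteify: %.4f s' % dt_byteify)
```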

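While I would appreciate a pointer to a converter written purely in C, the byteify function should be the default answer.

This holds especially true if your JSON structure comes from the field and contains user input, because then you probably need to walk over your structure anyway, independent of your desired internal data structures ("unicode sandwich" or byte strings only).

Why?

Unicode normalisation. For the unaware: take a painkiller and read this article.

So by using the byteify recursion you kill two birds with one stone:

get your byte strings from nested JSON dumps
get user input values normalised, so that you can find the stuff in your storage

In my tests it turned out that replacing input.encode('utf-8') with unicodedata.normalize('NFC', input).encode('utf-8') was even faster than without NFC, but I guess that depends heavily on the sample data.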
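As a small illustration of why normalisation matters for lookups, the sketch below (the helper name to_nfc_utf8 and the sample strings are my own) shows two visually identical spellings of the same letter that encode to different bytes until both are normalised to NFC.

```python
import unicodedata

composed = u'\u00e9'     # one code point: e with acute accent
decomposed = u'e\u0301'  # two code points: 'e' plus a combining acute accent


def to_nfc_utf8(text):
    """Normalise text to NFC, then encode it as UTF-8 bytes."""
    return unicodedata.normalize('NFC', text).encode('utf-8')


# Without normalisation the byte strings differ, so a storage lookup
# keyed on one spelling would miss the other.
raw_match = composed.encode('utf-8') == decomposed.encode('utf-8')

# After NFC normalisation both spellings yield the same bytes.
nfc_match = to_nfc_utf8(composed) == to_nfc_utf8(decomposed)
```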

2015-04-14 17:36:25
