Parsing JSON with Unix tools

I'm trying to parse JSON returned from a curl request, like so:

curl 'http://twitter.com/users/username.json' |
    sed -e 's/[{}]/''/g' | 
    awk -v k="text" '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}'

The above splits the JSON into fields, for example:

% ...
"geo_enabled":false
"friends_count":245
"profile_text_color":"000000"
"status":"in_reply_to_screen_name":null
"source":"web"
"truncated":false
"text":"My status"
"favorited":false
% ...

How do I print a specific field (denoted by the -v k=text)?

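Current answer

This uses standard Unix tools available on most distributions. It also works well with backslashes (\) and quotes (").

WARNING: This doesn't come close to the power of jq and will only work with very simple JSON objects. It's an attempt to answer the original question for situations where you can't install additional tools.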

function parse_json()
{
    printf '%s\n' "$1" | \
    sed -e 's/[{}]/''/g' | \
    sed -e 's/", "/'\",\"'/g' | \
    sed -e 's/" ,"/'\",\"'/g' | \
    sed -e 's/" , "/'\",\"'/g' | \
    sed -e 's/","/'\"---SEPERATOR---\"'/g' | \
    awk -F':' -v RS='---SEPERATOR---' "\$1~/\"$2\"/ {print}" | \
    sed -e "s/\"$2\"://" | \
    tr -d "\n\t" | \
    sed -e 's/\\"/"/g' | \
    sed -e 's/\\\\/\\/g' | \
    sed -e 's/^[ \t]*//g' | \
    sed -e 's/^"//'  -e 's/"$//'
}


parse_json '{"username":"john, doe","email":"john@doe.com"}' username
parse_json '{"username":"john doe","email":"john@doe.com"}' email

--- outputs ---

john, doe
john@doe.com

2014-10-30 14:38:52

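Other answers

To quickly extract the values for a particular key, I personally like to use "grep -o", which returns only the regex's match. For example, to get the "text" field from tweets, something like: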

grep -Po '"text":.*?[^\\]",' tweets.json

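This regex is more robust than you might think; for example, it deals fine with strings that contain embedded commas and escaped quotes. I think with a little more work you could make one that is actually guaranteed to extract the value, if it's atomic. (If it has nesting, then a regex can't do it, of course.)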

To clean it up further (while keeping the string's original escaping), you can use something like: | perl -pe 's/"text"://; s/^"//; s/",$//'. (I did this for this analysis.)
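As a rough illustration of what the pattern and the cleanup step do, here is a minimal Python sketch. The sample line is made up (the thread does not show tweets.json), and Python's re module stands in for grep -P and perl, so treat it as an approximation rather than an exact equivalent.

```python
import re

# Hypothetical sample line; the real tweets.json is not shown in the thread.
line = r'{"id":1,"text":"He said \"hi\", then left","favorited":false}'

# Same pattern as the grep -Po example: a lazy match that stops at the first
# `",` whose closing quote is not preceded by a backslash.
pattern = re.compile(r'"text":.*?[^\\]",')
raw = pattern.search(line).group(0)

# Same cleanup as the perl one-liner: drop the key, the opening quote and the
# trailing `",`. Escapes inside the string are left untouched.
value = raw.replace('"text":', '', 1)
value = re.sub(r'^"', '', value)
value = re.sub(r'",$', '', value)

print(raw)
print(value)
```

The embedded comma and the escaped quotes survive because the match can only end at an unescaped quote followed by a comma.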

To those who insist that you should use a real JSON parser: yes, that is essential for correctness, but

1. To do a really quick analysis, like counting values to check on data cleaning bugs or get a general feel for the data, banging out something on the command line is faster. Opening an editor to write a script is distracting.

2. grep -o is orders of magnitude faster than the Python standard json library, at least when doing this for tweets (which are ~2 KB each). I'm not sure if this is just because json is slow (I should compare to yajl sometime); but in principle, a regex should be faster since it's finite state and much more optimizable, instead of a parser that has to support recursion, and in this case, spends lots of CPU building trees for structures you don't care about. (If someone wrote a finite state transducer that did proper (depth-limited) JSON parsing, that would be fantastic! In the meantime we have "grep -o".)

To write maintainable code, I always use a real parsing library. I haven't tried jsawk, but if it works well, that would address point #1.

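One last, wackier solution: I wrote a script that uses Python json and extracts the keys you want into tab-separated columns; then I pipe that through a wrapper around awk that allows named access to columns. They are the json2tsv and tsvawk scripts. For this example it would be: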

json2tsv id text < tweets.json | tsvawk '{print "tweet " $id " is: " $text}'

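This approach doesn't address #2, is less efficient than a single Python script, and is a little brittle: it forces normalization of newlines and tabs in string values, to play nicely with awk's field/record-delimited view of the world. But it does let you stay on the command line, with more correctness than grep -o.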
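The json2tsv script itself is not reproduced in this thread, so the following is only a guess at the general idea, not the author's code: parse each JSON record, pick the requested keys, and flatten tabs and newlines so that awk can treat every line as one record. The function name and sample records are hypothetical.

```python
import json

def json_to_tsv_row(record, keys):
    """Return one tab-separated row holding the values of `keys`."""
    cells = []
    for key in keys:
        value = record.get(key, "")
        text = value if isinstance(value, str) else json.dumps(value)
        # Flatten tabs and newlines so awk sees exactly one record per line.
        cells.append(text.replace("\t", " ").replace("\n", " "))
    return "\t".join(cells)

# Hypothetical input: one JSON object per line, as in a tweets dump.
sample = ['{"id": 1, "text": "first\\nline"}', '{"id": 2, "text": "tab\\there"}']
rows = [json_to_tsv_row(json.loads(line), ["id", "text"]) for line in sample]
print("\n".join(rows))
```

The whitespace flattening is the brittle part mentioned above: the original newlines and tabs inside string values are lost.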

2011-07-27 23:24:46

I've done this, "parsing" a JSON response for a particular value, as follows:

curl "$url" | grep "$var" | awk '{print $2}' | sed s/\"//g

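Clearly, $url here would be the Twitter URL, and $var would be "text" to get the response for that variable.

Really, I think the only thing I'm doing that the OP left out is the grep for the line with the specific variable he seeks. AWK grabs the second item on the line, and sed strips the quotes.

Someone smarter than I am could probably do the whole thing with AWK or grep.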

Now, you could do it all with just sed:

curl $url | sed '/text/!d' | sed s/\"text\"://g | sed s/\"//g | sed s/\ //g

Thus, no AWK, no grep... I don't know why I didn't think of that before. Hmm...

2012-12-10 04:13:07

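There are a number of tools specifically designed for manipulating JSON from the command line, and they will be a lot easier and more reliable than doing it with Awk, such as jq: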

curl -s 'https://api.github.com/users/lambda' | jq -r '.name'

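You can also do this with tools that are likely already installed on your system, like Python using the json module, and so avoid any extra dependencies while still having the benefit of a proper JSON parser. The following assumes you want to use UTF-8, which the original JSON should be encoded in and which most modern terminals use as well: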

Python 3:

curl -s 'https://api.github.com/users/lambda' | \
    python3 -c "import sys, json; print(json.load(sys.stdin)['name'])"

Python 2:

export PYTHONIOENCODING=utf8
curl -s 'https://api.github.com/users/lambda' | \
    python2 -c "import sys, json; print json.load(sys.stdin)['name']"
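The one-liners above read a top-level key. The question's sample output suggests that "text" actually sits inside a nested "status" object, so a small helper that walks a dotted path may be handy. This is only a sketch: the get_path name, the dotted-path convention, and the sample document are assumptions made for illustration.

```python
import json

def get_path(document, dotted_path):
    """Follow a dotted path such as 'status.text' through nested JSON."""
    value = json.loads(document)
    for part in dotted_path.split("."):
        # Inside a list, a path component is read as an integer index.
        value = value[int(part)] if isinstance(value, list) else value[part]
    return value

# Hypothetical document shaped like the fields shown in the question.
sample = '{"friends_count": 245, "status": {"source": "web", "text": "My status"}}'
print(get_path(sample, "status.text"))
print(get_path(sample, "friends_count"))
```

A missing key raises KeyError here, which is usually preferable to silently printing the wrong field.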

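Frequently Asked Questions

Why not a pure shell solution?

The standard POSIX/Single Unix Specification shell is a very limited language which doesn't contain facilities for representing sequences (lists or arrays) or associative arrays (also known as hash tables, maps, dicts, or objects in some other languages). This makes representing the result of parsing JSON somewhat tricky in portable shell scripts. There are somewhat hacky ways to do it, but many of them can break if keys or values contain certain special characters.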
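For contrast, a language with built-in sequences and maps can hold a parsed document directly. A minimal Python sketch, using a made-up document:

```python
import json

# Hypothetical document with a nested array and a nested object.
parsed = json.loads('{"name": "lambda", "langs": ["sh", "awk"], "repo": {"stars": 3}}')

# JSON objects become dicts and JSON arrays become lists, so nesting and
# special characters in keys or values need no escaping tricks.
print(type(parsed).__name__)
print(type(parsed["langs"]).__name__)
print(parsed["repo"]["stars"])
```

A portable shell script has no comparable container, which is why the parsed result has to be squeezed into flat strings or variable names.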

Bash 4 and later, zsh, and ksh have support for arrays and associative arrays, but these shells are not universally available (macOS stopped updating Bash at Bash 3, due to a change from GPLv2 to GPLv3, while many Linux systems don't have zsh installed out of the box). It's possible that you could write a script that would work in either Bash 4 or zsh, one of which is available on most macOS, Linux, and BSD systems these days, but it would be tough to write a shebang line that worked for such a polyglot script.

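Finally, writing a full-fledged JSON parser in shell would be a significant enough dependency that you might as well just use an existing dependency like jq or Python instead. A good implementation is not going to be a one-liner, or even a small five-line snippet.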

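Why not use awk, sed, or grep?

It is possible to use these tools to do some quick extraction from JSON with a known shape and formatted in a known way, such as one key per line. There are several examples of suggestions for this in other answers.

However, these tools are designed for line-based or record-based formats; they are not designed for recursive parsing of matched delimiters with possible escape characters.

So these quick and dirty solutions using awk/sed/grep are likely to be fragile, and to break if some aspect of the input format changes, such as collapsing whitespace, adding additional levels of nesting to the JSON objects, or an escaped quote within a string. A solution that is robust enough to handle all JSON input without breaking will also be fairly large and complex, and so not too much different from adding another dependency on jq or Python.

I have had to deal with large amounts of customer data being deleted due to poor input parsing in a shell script before, so I never recommend quick and dirty methods that may be fragile in this way. If you're doing some one-off processing, see the other answers for suggestions, but I still highly recommend just using an existing, tested JSON parser.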
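To make the fragility concrete, here is a small Python sketch that imitates a grep-then-sed style pipeline and runs it on the same object formatted two ways. The line_based helper and both sample strings are invented for this illustration; they are not taken from any of the answers.

```python
import json
import re

# Hypothetical inputs: the same object, pretty-printed and compacted.
pretty = '{\n  "name": "Ann",\n  "text": "hello"\n}'
compact = '{"name":"Ann","text":"hello"}'

def line_based(document, key):
    """Mimic grep + sed: keep lines naming the key, strip key and quotes."""
    hits = [line for line in document.splitlines() if '"%s"' % key in line]
    return [re.sub(r'["\s,]', '', line.replace('"%s":' % key, '')) for line in hits]

print(line_based(pretty, "text"))   # one key per line: extraction works
print(line_based(compact, "text"))  # whitespace collapsed: garbage comes back
print(json.loads(pretty)["text"], json.loads(compact)["text"])
```

The line-based version silently returns a mangled string for the compact input, while the real parser gives the same answer for both.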

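Historical notes

This answer originally recommended jsawk, which should still work, but it is a little more cumbersome to use than jq and depends on a standalone JavaScript interpreter being installed, which is less common than a Python interpreter, so the above answers are probably preferable: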

curl -s 'https://api.github.com/users/lambda' | jsawk -a 'return this.name'

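This answer also originally used the Twitter API from the question, but that API no longer works, making it hard to copy the examples to test them out, and the new Twitter API requires API keys, so I've switched to using the GitHub API, which can be used easily without API keys. The first answer for the original question would be: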

curl 'http://twitter.com/users/username.json' | jq -r '.text'

2009-12-23 21:59:30

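Using Node.js

If the system has Node.js installed, it's possible to use the -p (print) and -e (evaluate) script flags together with JSON.parse to pull out any value that is needed.

A simple example using the JSON string { "foo": "bar" } and pulling out the value of "foo":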

node -pe 'JSON.parse(process.argv[1]).foo' '{ "foo": "bar" }'

Output:

bar

Because we have access to cat and other utilities, we can use this for files:

node -pe 'JSON.parse(process.argv[1]).foo' "$(cat foobar.json)"

Output:

bar

Or any other source, such as a URL that returns JSON:

node -pe 'JSON.parse(process.argv[1]).name' "$(curl -s https://api.github.com/users/trevorsenior)"

Output:

Trevor Senior

2013-08-27 15:11:22

For more complex JSON parsing, I suggest using the Python jsonpath module (by Stefan Goessner) -

Install it -

sudo easy_install -U jsonpath

Use it -

Example file.json (from http://goessner.net/articles/JsonPath) -

{ "store":
  { "book": [
      { "category": "reference", "author": "Nigel Rees", "title": "Sayings of the Century", "price": 8.95 },
      { "category": "fiction", "author": "Evelyn Waugh", "title": "Sword of Honour", "price": 12.99 },
      { "category": "fiction", "author": "Herman Melville", "title": "Moby Dick", "isbn": "0-553-21311-3", "price": 8.99 },
      { "category": "fiction", "author": "J. R. R. Tolkien", "title": "The Lord of the Rings", "isbn": "0-395-19395-8", "price": 22.99 }
    ],
    "bicycle": { "color": "red", "price": 19.95 }
  }
}

Parse it (extract all book titles with price < 10) -

cat file.json | python -c "import sys, json, jsonpath; print '\n'.join(jsonpath.jsonpath(json.load(sys.stdin), 'store.book[?(@.price < 10)].title'))"

Will output -

Sayings of the Century
Moby Dick

Note: The above command line does not include error checking. For a full solution with error checking, you should create a small Python script, and wrap the code with try-except.
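The try-except wrapping mentioned in the note might look roughly like the sketch below. To stay self-contained it uses only the standard json module and a plain list comprehension in place of the jsonpath query, and the sample is a shortened version of the store document; the cheap_titles name is made up for the illustration.

```python
import json
import sys

def cheap_titles(text, limit=10):
    """Return titles of books priced below `limit`, or [] on bad input."""
    try:
        books = json.loads(text)["store"]["book"]
        return [book["title"] for book in books if book["price"] < limit]
    except (ValueError, KeyError, TypeError) as err:
        # ValueError covers malformed JSON; the others cover a wrong shape.
        print("could not extract titles: %s" % err, file=sys.stderr)
        return []

# Shortened version of the example document above.
sample = ('{"store": {"book": ['
          '{"title": "Sayings of the Century", "price": 8.95},'
          '{"title": "Sword of Honour", "price": 12.99},'
          '{"title": "Moby Dick", "price": 8.99}]}}')
print("\n".join(cheap_titles(sample)))
```

Malformed input or a missing "store" key produces a message on stderr and an empty result instead of a traceback.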

2014-04-01 08:57:52
