Parsing JSON with Unix tools

I'm trying to parse JSON returned from a curl request, like so:

curl 'http://twitter.com/users/username.json' |
    sed -e 's/[{}]/''/g' | 
    awk -v k="text" '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}'

The above splits the JSON into fields, for example:

% ...
"geo_enabled":false
"friends_count":245
"profile_text_color":"000000"
"status":"in_reply_to_screen_name":null
"source":"web"
"truncated":false
"text":"My status"
"favorited":false
% ...

How do I print a specific field (denoted by the -v k=text)?

Current answer

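To quickly extract the values for a particular key, I personally like to use "grep -o", which returns only the part of each line that matches the regex. For example, to get the "text" field from tweets, something like: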

grep -Po '"text":.*?[^\\]",' tweets.json

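This regex is more robust than you might think; for example, it deals fine with strings that contain embedded commas and escaped quotes. I think with a little more work you could make one that is actually guaranteed to extract the value, if it's atomic. (If the value has nesting, then of course a regex can't do it.)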

To clean the result up further (while keeping the string's original escaping), you can append something like: | perl -pe 's/"text"://; s/^"//; s/",$//'. (I did this for an analysis.)
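For anyone who wants to poke at the pattern outside the shell, here is a rough Python sketch of the same idea: the grep pattern above followed by the perl cleanup. The names TEXT_FIELD and grep_text and the sample line are made up for illustration. It inherits the regex's limits: it still expects a comma right after the closing quote, so it can miss or over-match a "text" field that comes last in its object.

```python
import re

# Same pattern as the grep -Po call: a lazy match that stops at the first
# quote-plus-comma whose quote is not preceded by a backslash.
TEXT_FIELD = re.compile(r'"text":.*?[^\\]",')


def grep_text(line):
    """Return the raw (still escaped) "text" values found in one line."""
    values = []
    for match in TEXT_FIELD.findall(line):
        # Mirror the perl cleanup: drop the key, the opening quote,
        # and the trailing quote-plus-comma.
        value = match[len('"text":'):]
        value = re.sub(r'^"', '', value)
        value = re.sub(r'",$', '', value)
        values.append(value)
    return values


# Hypothetical sample line with an embedded comma and escaped quotes.
sample = r'{"text":"He said \"hi, there\"","favorited":false}'
print(grep_text(sample))
```

The value comes back with its backslash escapes intact, which is the same trade-off as the shell version.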

To those who insist that you should use a real JSON parser: yes, that is essential for correctness, but

1. To do a really quick analysis, like counting values to check on data cleaning bugs or get a general feel for the data, banging out something on the command line is faster. Opening an editor to write a script is distracting.
2. grep -o is orders of magnitude faster than the Python standard json library, at least when doing this for tweets (which are ~2 KB each). I'm not sure if this is just because json is slow (I should compare to yajl sometime); but in principle, a regex should be faster since it's finite state and much more optimizable, instead of a parser that has to support recursion, and in this case, spends lots of CPU building trees for structures you don't care about. (If someone wrote a finite state transducer that did proper (depth-limited) JSON parsing, that would be fantastic! In the meantime we have "grep -o".)

To write maintainable code, I always use a real parsing library. I haven't tried jsawk, but if it works well, that would address point 1.
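For comparison, the "real parser" route can stay fairly short too. This is a minimal sketch with Python's standard json module. It assumes the input holds one JSON object per line, which the answer does not state explicitly, and field_values is a hypothetical helper name.

```python
import json


def field_values(lines, key):
    """Yield the value stored under `key` in each line holding a JSON object."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if isinstance(record, dict) and key in record:
            yield record[key]


# Hypothetical sample records; in practice `lines` could be an open file.
tweets = [
    '{"id": 1, "text": "He said \\"hi, there\\""}',
    '{"id": 2, "text": "line one\\nline two"}',
]
for text in field_values(tweets, "text"):
    print(text)
```

Unlike the regex, this decodes the escapes, so the values come back as the actual strings rather than their JSON-encoded form.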

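Finally, a more baroque solution: I wrote a script that uses Python json and extracts the keys you want into tab-separated columns; then I pipe the result through a wrapper around awk that allows named access to columns. Here: the json2tsv and tsvawk scripts. For this example it would be: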

json2tsv id text < tweets.json | tsvawk '{print "tweet " $id " is: " $text}'

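This approach doesn't address point 2, is less efficient than a single Python script, and is a little brittle: it forces normalization of newlines and tabs in string values so that they play nicely with awk's field/record-delimited view of the world. But it does let you stay on the command line, with more correctness than grep -o.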
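The json2tsv script itself is not reproduced here, so the following is only a guess at its core idea, not the author's code: pick the wanted keys out of each record and flatten tabs and newlines so that awk sees one record per line and one field per tab. to_tsv_row is a hypothetical name.

```python
import json
import re


def to_tsv_row(record, keys):
    """Render the chosen keys of one decoded JSON object as a TSV line."""
    cells = []
    for key in keys:
        value = record.get(key, "")
        if not isinstance(value, str):
            # Numbers, booleans, null, and nested values are re-encoded as JSON.
            value = json.dumps(value)
        # Replace tabs and newlines so they cannot break awk's field/record split.
        cells.append(re.sub(r"[\t\r\n]+", " ", value))
    return "\t".join(cells)


# Hypothetical sample record.
tweet = json.loads('{"id": 7, "text": "first\\tpart\\nsecond part"}')
print(to_tsv_row(tweet, ["id", "text"]))
```

This is where the brittleness mentioned above comes from: once a newline in a value has been turned into a space, the original string cannot be recovered downstream.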

2011-07-27 23:24:46

Other answers

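If someone just wants to extract values from simple JSON objects, without any need for nested structures, it is possible to use regular expressions without even leaving Bash.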

Here is a function I defined using bash regular expressions based on the JSON standard:

function json_extract() {
  local key=$1
  local json=$2

  local string_regex='"([^"\]|\\.)*"'
  local number_regex='-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?'
  local value_regex="${string_regex}|${number_regex}|true|false|null"
  local pair_regex="\"${key}\"[[:space:]]*:[[:space:]]*(${value_regex})"

  if [[ ${json} =~ ${pair_regex} ]]; then
    sed -E 's/^"|"$//g' <<< "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

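Caveats: objects and arrays are not supported as values, but all other value types defined in the standard are supported. Also, a pair will be matched no matter how deep it sits in the JSON document, as long as it has exactly the same key name.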

Using the OP's example:

$ json_extract text "$(curl 'http://twitter.com/users/username.json')"
My status

$ json_extract friends_count "$(curl 'http://twitter.com/users/username.json')"
245
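To make the caveats concrete, here is a loose Python port of the same regular expressions. It is only a sketch; the constant names and the sample document are invented for illustration. Note how the nested "text" wins simply because it appears first, and how an object value yields no match at all.

```python
import re

# The same building blocks as the bash function, in Python's regex syntax.
STRING = r'"(?:[^"\\]|\\.)*"'
NUMBER = r'-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?'
VALUE = rf'{STRING}|{NUMBER}|true|false|null'


def json_extract(key, text):
    """Return the first scalar value stored under `key`, or None if none matches."""
    match = re.search(rf'"{re.escape(key)}"\s*:\s*({VALUE})', text)
    if match is None:
        return None
    raw = match.group(1)
    # Strip the surrounding quotes from strings; other scalars stay as text.
    return raw[1:-1] if raw.startswith('"') else raw


# Hypothetical document with the same key at two depths.
doc = '{"user":{"text":"nested"},"friends_count":245,"text":"My status"}'
print(json_extract("text", doc))
print(json_extract("friends_count", doc))
print(json_extract("user", doc))
```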

2017-09-20 14:33:32

Here is a good reference. In this case:

curl 'http://twitter.com/users/username.json' | sed -e 's/[{}]/''/g' | awk -v k="text" '{n=split($0,a,","); for (i=1; i<=n; i++) { where = match(a[i], /\"text\"/); if(where) {print a[i]} }  }'

2015-05-25 03:45:50

If you have Node.js installed, this works for me:

node -pe "require('${HOME}/.config/dev-utils.json').doToken"

2022-02-17 15:28:17

If you have the PHP interpreter installed:

php -r 'var_export(json_decode(`curl http://twitter.com/users/username.json`, 1));'

For example:

We have a resource that provides JSON content with countries' ISO codes, http://country.io/iso3.json, and we can easily view it in a shell with curl:

curl http://country.io/iso3.json

But the raw output is neither convenient nor readable. It is better to parse the JSON content and display a readable structure:

php -r 'var_export(json_decode(`curl http://country.io/iso3.json`, 1));'

This code will print something like:

array (
  'BD' => 'BGD',
  'BE' => 'BEL',
  'BF' => 'BFA',
  'BG' => 'BGR',
  'BA' => 'BIH',
  'BB' => 'BRB',
  'WF' => 'WLF',
  'BL' => 'BLM',
  ...

If you have nested arrays, this output will look much better…
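The same "decode, then dump in a readable form" idea can be sketched with Python's standard json module, in case PHP is not at hand. The pretty helper and the nested sample below are made up for illustration; only the BD and BE codes come from the output above.

```python
import json


def pretty(text):
    """Decode a JSON document and re-encode it with indentation."""
    return json.dumps(json.loads(text), indent=2, sort_keys=True, ensure_ascii=False)


# Hypothetical sample with one nested array, to show how nesting is laid out.
sample = '{"codes": {"BD": "BGD", "BE": "BEL"}, "checked": ["BD", "BE"]}'
print(pretty(sample))
```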

2015-11-18 14:24:07

Here is a simple approach for an environment where Node.js is available:

curl -L https://github.com/trentm/json/raw/master/lib/json.js > json
chmod +x json
echo '{"hello":{"hi":"there"}}' | ./json "hello.hi"
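The dotted lookup used in the last command is easy to approximate. This is only a sketch of the idea for simple paths, not a description of how the json tool is implemented; lookup is a hypothetical name, and treating numeric path parts as list indexes is an assumption.

```python
import json


def lookup(document, path):
    """Follow a dotted path such as "hello.hi" through a decoded JSON document."""
    value = json.loads(document)
    for part in path.split("."):
        if isinstance(value, list):
            value = value[int(part)]  # assumption: numeric parts index into arrays
        else:
            value = value[part]
    return value


print(lookup('{"hello":{"hi":"there"}}', "hello.hi"))
```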

2020-07-10 13:21:12
