What this post covers

UNIX processing, JSON whitespace agnosticism, jq (touching on rq), JSON lines, and some tips on using jq in daily life. Hopefully it makes a good tutorial for getting folks interested in jq.

The newline oriented nature of *NIX processing

The character \n (aka ^J) signifies the end of a line in Unix-like systems. Countless Unix command line programs treat it as the end of a record or the end of input when reading, and many emit it in their output to signify the same thing. For example ls, when its output goes to a pipe or a file rather than a terminal, writes one file name per line on stdout.

The whitespace agnostic world of JSON

The JavaScript Object Notation format has taken over as the de facto standard for structured, non-binary data. It is far from perfect, but it allows for seamless interoperation between programming languages, with only occasional hiccups, err, well, a lot of hiccups (e.g. parsing JSON is a minefield). Dealing with the corner cases of JSON parsing may unwittingly make up a lot of a developer’s bug-fixing work these days. While other formats like MessagePack exist, the text-editor friendly JSON rules much of the programming playground for the moment. One property matters for the rest of this post: outside of strings, JSON does not care about whitespace, so spaces, tabs, and newlines between tokens change nothing about what a document means.
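To make the whitespace point concrete, here is a small Python sketch (Python is my choice for the side examples in this post, not something jq needs). The three strings are made-up spellings of the same object, and the standard json module parses all of them to the same dictionary.

```python
import json

# Three spellings of the same document; only the whitespace differs.
compact = '{"key2":"value2","key1":"value1"}'
spaced = '{ "key2" : "value2" , "key1" : "value1" }'
multiline = '{\n  "key2": "value2",\n\t"key1": "value1"\n}\n'

parsed = [json.loads(text) for text in (compact, spaced, multiline)]

# All three parse to the same dictionary.
print(parsed[0] == parsed[1] == parsed[2])  # True
```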

jq enters the scene

As a major side note, you may want to consider learning rq (Record Query) instead of jq! It is a redesigned, jq-style tool written in Rust, and it supports many more input and output formats (JSON, MessagePack, Protobuf, etc.).

But on to jq. It is in most package managers, is easy to install, and is not to be confused with the JavaScript library jQuery. It can help bridge the disconnect between the “whitespace anywhere (including newlines)” world of JSON and the “newlines are important” world of the command line shell (minus PowerShell, which is arguably smarter about these things).

Basic jq usage

To use jq to pretty print some json from stdin:

 echo '{"key2":"value2","key1":"value1"}' | jq '.' 

{
  "key2": "value2",
  "key1": "value1"
}

Now notice the single quoted arguments in the call to the *nix classic command echo and to jq. That is because our friendly shell (bash, zsh, ksh, tcsh) generally doesn’t interpret anything inside single quotes. That lets us drop double quotes ", pipes |, and curly brackets {} inside the single quotes, and jq or echo gets to handle them instead of the shell treating them like special characters. That’s awesome until you need a single quote inside a single quote, or you forget a single quote. So be warned: anything you’re passing to jq as a ‘command’ should be in single quotes unless you really know what you’re doing.

The command

Now about that command to jq. Above you are passing the command ‘.’, the identity filter, which amounts to: send all of this JSON, unchanged, to the jq pretty printer and then to stdout.
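If it helps to see what the pretty printer is doing, a rough Python counterpart is to parse the text and dump it again with an indent. This is only a sketch for comparison; the variable names are mine, and jq has its own formatting rules that merely happen to agree with json.dumps for this input.

```python
import json

raw = '{"key2":"value2","key1":"value1"}'

# Parse, then re-serialize with two-space indentation.
# Key order is preserved, just like in the jq output above.
pretty = json.dumps(json.loads(raw), indent=2)
print(pretty)
```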

A JSONL digression

Jsonlines, aka lines of json, could look like this:

{"key2":"value2","key1":"value8"}
{"key2":"value3","key1":"value6"}
{"key2":"value4","key1":"value10"}
{"key2":"value4","key1":"value8"}
{"key2":"value6","key1":"value1"}
{"key2":"value7","key1":"value8"}
{"key2":"value2","key1":"value2"}

As a whole chunk this is invalid JSON. To be a single valid JSON document it would have to look like:

[
{"key2": "value2", "key1": "value8"},
{"key2": "value3", "key1": "value6"},
{"key2": "value4", "key1": "value10"},
{"key2": "value4", "key1": "value8"},
{"key2": "value6", "key1": "value1"},
{"key2": "value7", "key1": "value8"},
{"key2": "value2", "key1": "value2"}
]

or

[{"key2": "value2", "key1": "value8"}, {"key2": "value3","key1": "value6"},
{"key2": "value4","key1": "value10"}, {"key2": "value4", "key1": "value8"}, 
{"key2": "value6","key1": "value1"}, {"key2": "value7","key1": "value8"},
{"key2": "value2", "key1": "value2"}]

or the ultimate offense

[{"key2": "value2", "key1": "value8"}, {"key2": "value3","key1": "value6"}, {"key2": "value4","key1": "value10"}, {"key2": "value4", "key1": "value8"}, {"key2": "value6","key1": "value1"}, {"key2": "value7","key1": "value8"}, {"key2": "value2", "key1": "value2"}]

OK, so that is 4 different ways of saying the same thing…

What’s the diff?

The difference is a world of frustration. That first beautiful example is JSON lines: one logical object per line, with the order implicit in the line numbering of the file. The other 3 are pure JSON. But that purity comes at a cost. All three are semantically identical, yet the classic UNIX tools like grep, wc, and diff couldn’t care less about JSON’s whitespace ideology. For example, wc -l would give you 3 different answers for the pure JSON versions (9, 4, and 1, assuming each file ends with a newline), none of them equal to the number of objects in the list (7). That’s fine if you’re going to be loading this file into your program as a configuration, but if you are loading data, for the love of your own sanity, use JSON lines. With JSON lines you get all sorts of free information, like wc -l giving you the list length and grep returning single, usable JSON objects.
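Here is the wc -l effect as a small Python sketch. It uses a shortened, three-record sample rather than the seven records above, and line_count is a helper I made up that counts newline characters, which is what wc -l reports.

```python
import json

records = [
    {"key2": "value2", "key1": "value8"},
    {"key2": "value3", "key1": "value6"},
    {"key2": "value4", "key1": "value10"},
]

def line_count(text):
    """Count newline characters, which is what `wc -l` reports."""
    return text.count("\n")

# The same data in three layouts, each ending with a newline.
jsonl = "".join(json.dumps(r) + "\n" for r in records)
one_line_array = json.dumps(records) + "\n"
indented_array = json.dumps(records, indent=2) + "\n"

print(line_count(jsonl))           # 3, matches the number of records
print(line_count(one_line_array))  # 1
print(line_count(indented_array))  # 14
```

Only the JSON lines layout gives a line count that means anything about the data.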

There are plenty of resources for loading newline delimited JSON. The easiest to remember is jsonlines.org, which also has some other nice information, including the origin of the term and the preferred extension (.jsonl). These days some programming languages have packages for loading JSON lines, even though it is usually easy to do with their standard JSON and line reading primitives.
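As an illustration of "standard JSON and line reading primitives", here is a minimal Python sketch. The read_jsonl helper is hypothetical (my name, not a library function), and io.StringIO stands in for a real file so the example is self-contained.

```python
import io
import json

def read_jsonl(lines):
    """Yield one parsed object per non-blank line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# io.StringIO stands in for an open file, e.g. open("jsonlines.jsonl").
sample = io.StringIO(
    '{"key2":"value2","key1":"value8"}\n'
    '{"key2":"value3","key1":"value6"}\n'
)
rows = list(read_jsonl(sample))
print(len(rows))  # 2
```

Because it is a generator, the same helper can walk a file far bigger than memory, one record at a time.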

JSON <-> JSONlines interoperability with jq

Here’s how to go back and forth between formats with jq (note the -c just makes the output compact, necessary for jsonlines but not for pretty printing):

cat jsonlines.jsonl | jq -cs '.' > array_of_json.json
cat array_of_json.json | jq -c '.[]' > jsonlines.jsonl
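For comparison, the same round trip can be sketched in Python with only the json module. The function names are made up for this example, and the compact separators are there to imitate what -c does to the output.

```python
import json

COMPACT = (",", ":")  # no spaces after separators, like jq -c

def jsonl_to_array(jsonl_text):
    """Roughly `jq -cs '.'`: many one-line objects in, one array out."""
    items = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return json.dumps(items, separators=COMPACT) + "\n"

def array_to_jsonl(array_text):
    """Roughly `jq -c '.[]'`: one array in, one object per line out."""
    return "".join(
        json.dumps(item, separators=COMPACT) + "\n"
        for item in json.loads(array_text)
    )

jsonl = '{"key2":"value2","key1":"value8"}\n{"key2":"value3","key1":"value6"}\n'
as_array = jsonl_to_array(jsonl)
print(as_array, end="")
print(array_to_jsonl(as_array), end="")
```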

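This is great! What about more terrible formats that don’t have names, for instance the classic “I printed JSON objects one after another with no newline, or just a space”, or the even crazier, yet common, “I printed Python dicts instead of JSON”? Well, if you have JSON values concatenated with or without whitespace, jq has you covered: it reads a stream of JSON values no matter how they are separated, so jq -c '.' alone would already put them one per line. The example below takes the scenic route through the --slurp option, seen before as -s, which collects every input value into one array that .[] then unpacks again: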

printf '{"hi":1}{"hi":2} {"hi": 3}\n\n{"hi":4}\n' | jq -sc '.|.[]'

{"hi":1}
{"hi":2}
{"hi":3}
{"hi":4}
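If you ever need to do the same thing without jq, Python’s json.JSONDecoder.raw_decode can walk through concatenated values one at a time. This is a sketch under the assumption that the whole input fits in a string; iter_concatenated is a name I made up.

```python
import json

def iter_concatenated(text):
    """Yield each JSON value from text that holds several values back to back."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        value, pos = decoder.raw_decode(text, pos)
        yield value

messy = '{"hi":1}{"hi":2} {"hi": 3}\n\n{"hi":4}'
for value in iter_concatenated(messy):
    print(json.dumps(value, separators=(",", ":")))
```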

Python dict printed as a single JSON blob into JSONlines (new)

If you find a Python dictionary dumped to a file as JSON, you end up with a possibly large single-line file like:

{"123":{"features":[1,2,3]},"333":{"features":[3,3,3]}}

To turn this into JSONlines, use the following incantation:

jq -c 'to_entries|.[]|{(.key):(.value)}' blob_of_json.json

Which will output:

{"123":{"features":[1,2,3]}}
{"333":{"features":[3,3,3]}}
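To see what to_entries plus the {(.key):(.value)} construction is doing, here is the same reshaping as a Python sketch: every top-level key becomes its own one-key object on its own line. The variable names are only for illustration.

```python
import json

blob = '{"123":{"features":[1,2,3]},"333":{"features":[3,3,3]}}'

# One single-key object per top-level entry, serialized compactly.
lines = [
    json.dumps({key: value}, separators=(",", ":"))
    for key, value in json.loads(blob).items()
]
print("\n".join(lines))
```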

Python dicts just blindly printed

What about that other pesky issue of printed Python dicts (aka not jsonified, with a bunch of u'thing's)? Well, if you printed a dict and now want to read it back, I feel bad for you, but it happens, and there’s a solution on stackoverflow for you here.
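One approach I am confident works for plain literals (it may or may not be what the linked answer suggests) is Python’s ast.literal_eval, which safely evaluates a printed dict back into an object that json.dumps can then serialize. The sample string below is invented for the example.

```python
import ast
import json

# What you get from print(some_dict) in old Python 2 code: not JSON at all.
printed = "{u'name': u'moog', 'grade': 100, 'tags': ['synth', None, True]}"

# literal_eval only accepts Python literals, so it will not run arbitrary code.
as_json = json.dumps(ast.literal_eval(printed), separators=(",", ":"))
print(as_json)  # {"name":"moog","grade":100,"tags":["synth",null,true]}
```

This only helps when the printed dict contains nothing but literals (strings, numbers, lists, dicts, None, booleans); a printed datetime or custom object will make literal_eval raise an error.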

Speed round of jq tips

Now for the fun stuff. The jq manual can be a bit dicey when it comes to explaining how to do simple stuff; it’s more of an index than a tutorial. While I don’t have the patience to write up a true tutorial, I’d love to share some of the easy tips that you can use to process and verify JSON at the command line.

remove keys and everything under them

del

printf '{"good":1,"bad":["things","here"],"evil":4}\n{"good":2,"evil":88}\n' | jq -c 'del(.bad,.evil)'

{"good":1}
{"good":2}
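For a feel of how much typing del saves, this is roughly the same operation in Python. The drop_keys helper is hypothetical, and like del it quietly ignores keys that are not present.

```python
import json

lines = [
    '{"good":1,"bad":["things","here"],"evil":4}',
    '{"good":2,"evil":88}',
]

def drop_keys(obj, keys):
    """Return a copy of obj without the given top-level keys."""
    return {k: v for k, v in obj.items() if k not in keys}

cleaned = [
    json.dumps(drop_keys(json.loads(line), {"bad", "evil"}), separators=(",", ":"))
    for line in lines
]
print("\n".join(cleaned))
```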

if statements to keep or drop objects

If you can figure out a way to express your condition as a boolean, the following tip turns it into a filter that keeps or removes a piece of JSON. There are cooler ways to do this in jq, but this is a relatively obvious one.

The idiom is: if (condition_to_pass_through) then . else empty end. In the example below we want only the grades greater than or equal to 97 to be printed. The keyword empty does what it says… it produces no output at all, so the object is dropped. Other less verbose ways to do this are left as an exercise:

printf '{"name":"univox", "grade":77}\n{"name":"moog", "grade":100}\n{"name":"arturia", "grade":97}\n' | jq -c 'if (.grade >= 97) then . else empty end'

{"name":"moog","grade":100}
{"name":"arturia","grade":97}
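The "pass it through or emit nothing" idea maps neatly onto a Python list comprehension, where records that fail the condition simply never make it into the result. A sketch with the same three records:

```python
import json

lines = [
    '{"name":"univox", "grade":77}',
    '{"name":"moog", "grade":100}',
    '{"name":"arturia", "grade":97}',
]

records = [json.loads(line) for line in lines]

# Keep a record when the condition holds; otherwise emit nothing for it.
kept = [r for r in records if r["grade"] >= 97]

for r in kept:
    print(json.dumps(r, separators=(",", ":")))
```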

piping things

Here’s where things get complicated but fun. You may have thought pipes were just for use in the shell, but nope, you can pipe things within jq.

For example:

echo '{"file_name":"supercool.wav"}{"file_name":"great.dll"}{"file_name":"uncool.exe"}{"nothing":"what"}{"file_name":"boolcool.net"}'| jq -c 'select(.file_name | contains("cool"))'

{"file_name":"supercool.wav"}
{"file_name":"uncool.exe"}
jq: error (at <stdin>:1): null (null) and string ("cool") cannot have their containment checked
{"file_name":"boolcool.net"}

Oh, but that error is so ugly, you say. Luckily, that error was printed on stderr, so you can make it disappear with the illusionist’s friend, redirection: 2> /dev/null

echo '{"file_name":"supercool.wav"}{"file_name":"great.dll"}{"file_name":"uncool.exe"}{"nothing":"what"}{"file_name":"boolcool.net"}'| jq -c 'select(.file_name | contains("cool"))' 2> /dev/null

{"file_name":"supercool.wav"}
{"file_name":"uncool.exe"}
{"file_name":"boolcool.net"}
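The error comes from the one record that has no file_name, so .file_name is null there. For comparison, here is how the same filter might look in Python, where a default value for the missing key sidesteps the problem instead of hiding the error. This is a sketch, not a claim about how jq works internally.

```python
import json

lines = [
    '{"file_name":"supercool.wav"}',
    '{"file_name":"great.dll"}',
    '{"file_name":"uncool.exe"}',
    '{"nothing":"what"}',
    '{"file_name":"boolcool.net"}',
]

records = [json.loads(line) for line in lines]

# .get() with a default of "" means records without the key just don't match.
matches = [r for r in records if "cool" in r.get("file_name", "")]

for r in matches:
    print(json.dumps(r, separators=(",", ":")))
```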

What next?

Well this barely scratches the surface of what jq can do. If you have never ever used jq I’d recommend learning how to do all of these things in rq which can handle input and output of more file formats. But like many beloved things, jq is currently more popular and perhaps a little easier to install on some systems so that’s why I wrote this.

I have found that jq’s usefulness trails off at about 120 characters of commands between the single quotes. At that point I tend to think about using a scripting language to handle the job. That said, jq can do in a single line many things that would otherwise require a multiline script. For things you’re only running once, this can be a nice advantage to have.

Also, with data stores like MongoDB and PostgreSQL (with JSONB and operators like ->, ->>, etc.), one can find that command line queries translate into DB queries without too much trouble.

Have fun!
