JSON, JSONlines, and jq as a better grep
What this post covers
UNIX processing, JSON whitespace agnosticism, jq (touching on rq), JSONlines, and some tips on using jq in daily life. Hopefully a good tutorial for getting folks interested in jq.
The newline oriented nature of *NIX processing
The character \n (aka ^J) signifies the end of a line in Unix-like systems.
Countless Unix command line programs rely on it in their input to mark the end
of a record, the end of input, etc. Many command line programs use it the same
way in their output. For example ls, when its output goes to a pipe or a file,
separates the files it is listing with newlines on stdout.
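A quick way to see that convention at work, using made-up input from printf rather than real files:

```shell
# Three newline-terminated records go in, wc -l counts three lines
printf 'alpha\nbeta\ngamma\n' | wc -l

# grep treats each line as one record and prints matching lines whole
printf 'alpha\nbeta\ngamma\n' | grep 'eta'
```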
The whitespace agnostic world of JSON
The JavaScript Object Notation format has taken over as the de facto standard for structured, non-binary data. It is far from perfect, but it allows for interoperation between programming languages with only occasional hiccups. Err, well, a lot of hiccups (e.g. parsing JSON is a minefield). Dealing with the corner cases of JSON parsing may unwittingly make up a lot of a developer’s bug-fixing work these days. While other formats like MessagePack exist, the text-editor-friendly JSON rules much of the programming playground for the moment.
jq enters the scene
In a major side note, you may want to consider learning rq instead of jq! rq (Record Query) is a redesigned jq-style tool written in Rust, and it supports many more formats for input and output (JSON, MessagePack, Protobuf, etc.).
But on to jq. jq is in most package managers, is easily installable, and is not to be confused with the JavaScript library jQuery. It can help bridge the disconnect between the whitespace-everywhere (including newlines) world of JSON and the “newlines are important” world of the command line shell (minus PowerShell, which is arguably smarter about these things).
Basic jq usage
To use jq to pretty print some json from stdin:
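Something along these lines, with an invented object standing in for real data:

```shell
# echo hands the JSON text to jq on stdin; '.' passes it through unchanged
# and jq pretty-prints the result with two-space indentation
echo '{"name":"ada","grades":[99,97]}' | jq '.'
```

The one-line input comes back spread over seven lines, with every key and every array element on a line of its own.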
Now notice the single-quoted arguments in the call to the *nix classic command
echo and to jq. That is because our friendly shell (bash, zsh, ksh, tcsh)
generally doesn’t do any parsing of things in single quotes. That lets us
drop double quotes ", pipes |, and curly brackets {} inside the single quotes,
and jq or echo gets to handle them instead of the shell treating them like
special characters. That’s awesome until you need a single quote inside
a single quote, or you forget a single quote. So be warned: anything you’re
passing to jq as a ‘command’ must be in single quotes unless you really know
what you’re doing.
The command
Now about that command to jq: above you are passing the command ‘.’, the identity filter, which amounts to “send all of this JSON to the jq pretty printer and then to stdout”.
A JSONL digression
Jsonlines, aka lines of json, could look like this:
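A made-up example (the names and grades are invented for illustration), fed straight to jq to show that it reads each line as a separate value:

```shell
# One complete JSON object per line; jq -c echoes each one back compactly
jq -c '.' <<'EOF'
{"name": "ada", "grade": 99}
{"name": "bob", "grade": 85}
{"name": "cy", "grade": 97}
EOF
```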
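As a whole chunk this is invalid JSON. To be valid, the objects would have to be wrapped in a list, and that list could sit on a single line, or be pretty-printed over several lines, or, the ultimate offense, be broken across lines at arbitrary points.
OK, so four different ways of saying the same thing…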
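To make the pure-JSON layouts concrete, here is a sketch with invented records (the names and grades are made up, and the variables compact, pretty and offense exist only for this example). jq -c '.' shows that all three parse to the same array, while wc -l reports a different line count for each:

```shell
# Layout 1: the whole array on a single line
compact='[{"name":"ada","grade":99},{"name":"bob","grade":85},{"name":"cy","grade":97}]'

# Layout 2: one object per line, wrapped in brackets
pretty='[
  {"name": "ada", "grade": 99},
  {"name": "bob", "grade": 85},
  {"name": "cy", "grade": 97}
]'

# Layout 3: line breaks wherever JSON tolerates whitespace
offense='[{"name"
:"ada",
"grade":99},{"name":"bob"
,"grade":
85},
{"name":"cy","grade":97
}]'

for doc in "$compact" "$pretty" "$offense"; do
  printf '%s\n' "$doc" | jq -c '.'   # same compact array every time
  printf '%s\n' "$doc" | wc -l       # 1, then 5, then 7: never 3
done
```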
What’s the diff?
The difference is a world of frustration. That first beautiful example is JSON
lines: one logical object per line, with the order implicit in the line
numbering of the file. The other three are pure JSON. But that purity comes at
a cost: all three are semantically identical, yet classic UNIX tools like
grep, wc, and diff couldn’t care less about JSON’s whitespace ideology. For
example, wc -l would give you three different answers for the pure JSON, none
of them equal to the number of objects in the list. That’s fine if you’re going
to be loading this file into your program as configuration, but if you are
loading data, for the love of your own sanity use JSON lines. With JSON lines
you get all sorts of free information, like wc -l giving you the list length
and grep returning single, usable JSON objects.
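A small sketch of that free information, using invented records held in a shell variable (records stands in for the contents of a .jsonl file):

```shell
records='{"name":"ada","grade":99}
{"name":"bob","grade":85}
{"name":"cy","grade":97}'

# The line count is the record count
printf '%s\n' "$records" | wc -l

# Each grep hit is a complete JSON object that jq can take from there
printf '%s\n' "$records" | grep '"bob"' | jq '.grade'
```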
There are plenty of resources for loading newline-delimited JSON. The easiest
to remember is jsonlines.org, which also has info on the origin of the term and
the preferred extension (.jsonl). These days some programming languages have
packages for loading JSON lines, even though it is usually easy to do using
their standard JSON and line-reading primitives.
JSON <-> JSONlines interoperability with jq
Here’s how to go back and forth between formats with jq (note the -c just makes the output compact, necessary for jsonlines but not for pretty printing):
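A minimal sketch of both directions, with invented records fed in on stdin instead of from files:

```shell
# JSON array -> JSONlines: .[] emits each element, -c keeps each on one line
echo '[{"name":"ada","grade":99},{"name":"bob","grade":85}]' | jq -c '.[]'

# JSONlines -> JSON array: -s (--slurp) gathers every input into one array
printf '%s\n' '{"name":"ada","grade":99}' '{"name":"bob","grade":85}' | jq -c -s '.'
```

Dropping -c from the second command gives the same array pretty-printed.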
This is great! What about more terrible formats that don’t have names, for
instance the classic “I printed JSON objects one after another with no newline,
or just a space”, or the even crazier, yet common, “I printed Python dicts
instead of JSON”? Well, if you have concatenated JSON, with or without spaces
between the objects, jq can save you with the --slurp option, whose short form
is -s.
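A sketch with three invented objects jammed together, two of them with no separator at all:

```shell
# jq reads back-to-back objects as a stream; -c prints one per line (JSONlines)
echo '{"grade":99}{"grade":85} {"grade":97}' | jq -c '.'

# Adding --slurp collects the same stream into a single JSON array
echo '{"grade":99}{"grade":85} {"grade":97}' | jq -c --slurp '.'
```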
Python dict printed as a single JSON blob into JSONlines (new)
If you find a Python dictionary written to a file as JSON, you end up with a possibly large single-line file holding one big object. To turn this into JSONlines, have jq emit each entry of that object on its own line.
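One plausible incantation, sketched on an invented dictionary keyed by student id (the ids, names and grades are made up): .[] drops the keys and emits each value, while to_entries[] keeps each key alongside its value.

```shell
# Values only: one line per entry, keys discarded
echo '{"s1":{"name":"ada","grade":99},"s2":{"name":"bob","grade":85}}' | jq -c '.[]'

# Keys kept: each line is a {"key": ..., "value": ...} pair
echo '{"s1":{"name":"ada","grade":99},"s2":{"name":"bob","grade":85}}' | jq -c 'to_entries[]'
```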
Python dicts just blindly printed
What about that other pesky issue of printed Python dicts (aka not JSONified, with a bunch of u’thing’s)? Well, if you printed a dict and you want to read it back I feel bad for you, but it happens, and there’s a solution on Stack Overflow for you.
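Independently of that answer, one common approach is to let Python parse its own literal syntax with ast.literal_eval and re-emit real JSON. The sketch below is an assumption rather than the linked solution; it supposes python3 is on the PATH and that every input line holds exactly one printed dict (the sample dict is invented):

```shell
# Invented input: a dict as Python 2 would print it, which is not valid JSON
echo "{u'name': u'ada', 'grade': 99}" |
  python3 -c '
import ast, json, sys
for line in sys.stdin:
    line = line.strip()
    if line:
        # literal_eval only parses literals; it never runs code
        print(json.dumps(ast.literal_eval(line)))
'
```

The result is one JSON object per line, so it can be piped straight into jq.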
Speed round of jq tips
Now for the fun stuff. The jq manual can be a bit dicey when it comes to explanations of how to do simple stuff; it’s more of an index than a tutorial. While I don’t have the patience to write up a true tutorial, I’d love to share some of the easy tips that you can use to process and verify JSON at the command line.
remove keys and everything under them
Use del, which takes the path or paths to drop.
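For instance, with an invented record that carries a nested meta object:

```shell
# Remove the meta key and everything under it
echo '{"name":"ada","grade":99,"meta":{"source":"import","rows":3}}' | jq -c 'del(.meta)'

# Several paths can go in one call, nested ones included
echo '{"name":"ada","grade":99,"meta":{"source":"import","rows":3}}' | jq -c 'del(.grade, .meta.rows)'
```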
if statements to keep or drop objects
If you can figure out a way to create a boolean, the following tip lets you
turn it into a filter to remove or keep a piece of JSON. There are cooler ways
to do this in jq, but this is a relatively obvious way to do it.
The idiom is: if (condition_to_pass_through) then . else empty end
It works like so, where in this example we want the grades greater than or
equal to 97 to be printed. The keyword empty does what it says… it emits
nothing at all. Other less verbose ways to do this are left as an exercise:
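With invented records piped in as JSONlines, the idiom looks like this:

```shell
# Objects that pass the condition go through as-is; the rest yield nothing
printf '%s\n' '{"name":"ada","grade":99}' '{"name":"bob","grade":85}' '{"name":"cy","grade":97}' |
  jq -c 'if .grade >= 97 then . else empty end'
```

Only the ada and cy records come out the other end; bob’s 85 is dropped.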
piping things
Here’s where things get complicated but fun. You may have thought pipes were just for use in the shell, but nope, you can pipe things within jq.
For example:
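Here is a sketch with invented input in which the middle record is deliberately not an object, so the error it triggers is only a stand-in for whatever a real run might print:

```shell
# Invented input: the second line is a bare string, not an object
input='{"student":{"name":"ada","grade":99}}
"not an object"
{"student":{"name":"cy","grade":97}}'

# Inside the quotes, | feeds the output of .student into .grade.
# The bare string cannot be indexed, so jq complains on stderr, carries on
# with the next input, and exits non-zero at the end (hence the || true).
printf '%s\n' "$input" | jq '.student | .grade' || true

# Same pipeline with the complaint sent to /dev/null
printf '%s\n' "$input" | jq '.student | .grade' 2> /dev/null || true
```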
Oh, but that error is so ugly, you say. Luckily that error was printed on stderr, so you can make it disappear with the illusionist’s friend, redirection: 2> /dev/null
What next?
Well, this barely scratches the surface of what jq can do. If you have never
ever used jq, I’d recommend learning how to do all of these things in rq, which
can handle input and output of more file formats. But like many beloved
things, jq is currently more popular and perhaps a little easier to install
on some systems, so that’s why I wrote this.
I have found that jq’s usefulness trails off at about 120 characters of
commands between the single quotes; at that point I tend to think about using
a scripting language to handle the job. That said, there are many things that
jq can do in a single line that would otherwise require a multiline script. For
things you’re only running once, this can be a nice advantage to have.
Also, with data stores like MongoDB and PostgreSQL (with JSONB and operators like ->, ->>, etc.), one can find command line queries that translate into DB queries without too much trouble.
Have fun!