Story of a Software Engineer
It was your average Wednesday afternoon, and I was working my job. My specific task on this day was quite simple: document the custom Vue components that make up most of our product’s UI.
This should have been a relatively easy task and for the most part it was, but I had an issue. Some of these components had some really obscure properties that could influence their behavior, and seeing as much of the codebase was written 10 years ago by utter idiots, the code implementing these properties was really hard to read.
I decided that instead of trying to study the definitions of these properties, it would be quite a bit easier to study their usages. But how do I find them? Our codebase is hundreds of thousands of lines of code, and these properties have very generic names such as ‘browser’. Additionally, while the components are easy to search for, they’re used in hundreds of places, and such properties may only be used once or twice.
The solution? I thought it would be the trusty tool in every hacker’s toolbelt: grep.
The Downfall of Grep
I thought that grep would be my saviour. The tool that would answer the call to find me the usages I so desired. So I whipped that baby out and went straight to work:
You can probably tell from the fact I’m writing this post that this did not work. If you’ve ever worked with Vue or something similar, you might even be able to figure out why. For those unfamiliar with the frontend (you’re a treasure that must be preserved), allow me to show you something that is all too common in a Vue codebase:
The issue here is clear: the property we’re searching for (‘browser’) is on an entirely different line from the component we’re searching for (‘<date-input>’). It’s not enough to search for just the component, because it’s used everywhere but only a few rare usages interest me, and it’s not enough to search for just the attribute, because many different components have attributes of the same name (and no, they don’t have the same behavior; the codebase is shit).
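The failure is easy to reproduce outside of any particular tool. Here is a minimal Python sketch, assuming a made-up template (only the ‘date-input’ and ‘browser’ names come from this post; the other tags and attributes are hypothetical). It tests a pattern line by line the way a line-oriented search would, and then runs a tag-shaped pattern over the whole text instead:

```python
import re

# Hypothetical template: only <date-input> and 'browser' come from the
# post, every other tag and attribute is invented for illustration.
TEMPLATE = """\
<date-input
    v-model="startDate"
    browser
/>
<text-input browser />
<date-input v-model="endDate" />
"""

# Line-oriented search: each line is tested on its own, so the tag
# name and the attribute never appear together.
line_pattern = re.compile(r"<date-input.*browser")
line_hits = [ln for ln in TEMPLATE.splitlines() if line_pattern.search(ln)]

# Shape-oriented search: [^>]* happily crosses newlines, so the whole
# opening tag is treated as one unit regardless of how it is wrapped.
tag_pattern = re.compile(r"<date-input\b[^>]*\bbrowser\b[^>]*>")
tag_hits = [m.group(0) for m in tag_pattern.finditer(TEMPLATE)]

print(line_hits)      # the line-by-line search finds nothing
print(len(tag_hits))  # the tag-shaped search finds the one usage
```

The tag pattern is deliberately naive: an attribute value containing ‘>’ would cut the match short, so treat it as an illustration of the idea rather than a robust template parser.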
What I need is a tool that will let me search for patterns that span multiple lines.
That quote is from the abstract of Structural Regular Expressions, a paper written by the one and only Rob Pike back in 1987. It describes an idea by which we stop assuming that all data is organized in lines, and instead use regular expressions to define the shapes comprising our data.
I had read this paper some years ago and it had always sat in the back of my mind. I had even toyed around in the past with a grep that wasn’t strictly line-oriented, but it was very bare-bones and lacked basic faculties such as reporting the positions of matches, something I desperately needed. So over the following few days I made major changes, rewrote lots of the code, and overall turned my tool, grab, into a staple part of my hacker’s toolbelt.
How Grab Finds Text
If you’re familiar with the UNIX environment, you’re probably used to querying text with tools such as grep and their regular expressions. These are the same regular expressions we as programmers all know and love, but with one important (yet often overlooked) characteristic: you cannot match the newline.
The grab utility moves away from this limiting paradigm; the newline is treated no differently from any other character you want to match. Want to match an entire paragraph of text? The pattern is as simple as ‘[^\n].+?(?=\n\n|$)’. It may look complicated if you’re new to regular expressions (PCREs to be specific), but it’s really quite simple. You just match a non-newline character, and then as few further characters as it takes to reach either a double newline or the end of input.
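If you want to poke at the paragraph pattern without grab installed, here is a small sketch using Python’s re module. One assumption to flag: Python’s ‘.’ does not match a newline by default, so I pass re.DOTALL to get the newline-is-just-a-character behaviour described above. The sample text is made up.

```python
import re

# The paragraph pattern from the post. re.DOTALL is my assumption about
# how '.' has to behave for this to work; without it, Python's '.'
# stops at every newline.
PARAGRAPH = re.compile(r"[^\n].+?(?=\n\n|$)", re.DOTALL)

# Made-up sample input with three paragraphs.
TEXT = "First paragraph,\nsecond line.\n\nSecond paragraph.\n\n\nThird one.\n"

paragraphs = [m.group(0) for m in PARAGRAPH.finditer(TEXT)]
for p in paragraphs:
    print(repr(p))
```

Note that the lazy ‘.+?’ needs at least one character after the first, so a paragraph consisting of a single character would be skipped by this pattern.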
On its own this isn’t too amazing though. The great thing about grep is that it doesn’t just show you matches; it shows them to you in the context of a complete line. grab solves this in the same way described in Rob Pike’s paper: chaining operations.
Say we want to iterate not over lines but over paragraphs. We can use the following pattern:
Here we’re using the ‘x’ operator. It iterates over all occurrences of the pattern. In this case we’re iterating over all paragraphs in our input. Maybe we want to see all paragraphs which contain doubled words (for example: ‘the the’), a common typo found in text files. For this we can use the ‘g’ operator:
The fundamental difference between the two operators is that the ‘x’ operator specifies the structure to iterate over. In the context of grep these are lines, but in grab they can be whatever you want. The ‘g’ operator on the other hand doesn’t modify the structure of the matches returned to you at all; it simply acts as a filter, selecting the matches which match the given regular expression.
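The split between the two operators can be modelled in a few lines of Python. This is only a sketch of the idea, not how grab is implemented: the function names x and g simply mirror the operators, the doubled-word pattern is my own guess at a reasonable one, and the sample text is invented.

```python
import re
from typing import Iterable, Iterator

def x(pattern: str, selections: Iterable[str]) -> Iterator[str]:
    """Restructure: yield every piece of each selection matching `pattern`."""
    regex = re.compile(pattern, re.DOTALL)
    for text in selections:
        for match in regex.finditer(text):
            yield match.group(0)

def g(pattern: str, selections: Iterable[str]) -> Iterator[str]:
    """Filter: pass a selection through untouched if it contains a match."""
    regex = re.compile(pattern, re.DOTALL)
    return (text for text in selections if regex.search(text))

PARAGRAPH = r"[^\n].+?(?=\n\n|$)"  # the paragraph pattern from the post
DOUBLED = r"\b(\w+)\s+\1\b"        # assumed doubled-word pattern

TEXT = (
    "This paragraph is fine.\n\n"
    "This one has the the typo\nspread over two lines.\n\n"
    "Last one is is broken too.\n"
)

# Iterate over paragraphs, then keep the ones with a doubled word.
result = list(g(DOUBLED, x(PARAGRAPH, [TEXT])))
for paragraph in result:
    print(paragraph, end="\n\n")
```

The point to notice is that g never shortens or splits what it is given: whatever x selected comes out whole or not at all.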
Here’s an interactive example:
This is almost perfect; there’s just one bit missing. In my interactive example I’ve shown how you can use the power of grab to find paragraphs in your files containing doubled words. This is really handy if you find yourself writing websites, documentation, or other long-form text.
Given my example though, how easily were you able to spot the doubled words? They probably didn’t stick out to you right away, unlike if they had been highlighted in some bright flashy color. It is for this reason that the ‘h’ operator exists. This operator is unique in that it does not change the given selections at all. Any matches made by previous occurrences of ‘x’ and ‘g’ will be displayed the same with and without the use of ‘h’.
The ‘h’ operator is purely for the user. By using this operator you can specify a pattern for which matching text must be highlighted. Let’s apply it to the previous example and see how the doubled words are made instantly obvious to the user:
There is an obvious problem here: the duplication of the regular expression provided to the ‘g’ and ‘h’ operators. It is extremely common that you will want to highlight text that was just matched by a ‘g’ operator. Like, really common. So common in fact that the ‘h’ operator supports a shorthand syntax for this exact situation: h//. Giving an empty regular expression as an argument to an operator is illegal, with the exception of the ‘h’ operator. When this operator is given an empty argument, it assumes the regular expression of the previous operator:
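To make the behaviour of ‘h’ and its empty-pattern shorthand concrete, here is a rough Python model. Everything in it is an assumption for illustration: the function name h, the `previous` parameter standing in for “the pattern of the previous operator”, the doubled-word pattern, and the choice of bold red as the highlight colour.

```python
import re
from typing import Optional

RED, RESET = "\x1b[1;31m", "\x1b[0m"  # arbitrary ANSI colour choice

def h(pattern: str, selection: str, previous: Optional[str] = None) -> str:
    """Wrap matches of `pattern` in colour codes; the text is otherwise untouched.

    An empty pattern falls back to `previous`, modelling the h// shorthand.
    """
    if pattern == "":
        if previous is None:
            raise ValueError("an empty pattern needs a previous operator")
        pattern = previous
    return re.sub(
        pattern,
        lambda m: RED + m.group(0) + RESET,
        selection,
        flags=re.DOTALL,
    )

DOUBLED = r"\b(\w+)\s+\1\b"  # assumed doubled-word pattern
paragraph = "This one has the the typo\nspread over two lines."

# Equivalent in spirit to following a 'g' with h//.
highlighted = h("", paragraph, previous=DOUBLED)
print(highlighted)
```

Stripping the colour codes from the result gives back the original paragraph, which is the property that sets ‘h’ apart from ‘x’ and ‘g’.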
So… what was the final solution to my problem? How did I find all the <date-input> tags in my job’s codebase that were passed the ‘browser’ attribute? Well here’s how:
Quick, simple, and elegant. Just the way I like it!
Here I’ve shown you the 3 main operators: ‘x’, ‘g’, and ‘h’. These are not all, however! Each operator also has a capital variant (‘X’, ‘G’, ‘H’) which behaves the same, except that instead of working on text that matches the given pattern, it works on text which doesn’t match it.
These operators allow for more precise pattern matching. For example, a pattern to match all numbers which contain a ‘3’ but which aren’t ‘1337’ could be written as x/[0-9]+/ g/3/ G/^1337$/.
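As a sanity check of that chain, here is the same logic modelled in Python. Again this is a sketch under my own assumptions rather than grab’s implementation: the functions x, g, and G just mirror the operator names, and the input string is invented.

```python
import re
from typing import Iterable, Iterator

def x(pattern: str, selections: Iterable[str]) -> Iterator[str]:
    """Yield every piece of each selection that matches `pattern`."""
    for text in selections:
        for match in re.finditer(pattern, text):
            yield match.group(0)

def g(pattern: str, selections: Iterable[str]) -> Iterator[str]:
    """Keep the selections that contain a match of `pattern`."""
    return (text for text in selections if re.search(pattern, text))

def G(pattern: str, selections: Iterable[str]) -> Iterator[str]:
    """Keep the selections that do NOT contain a match of `pattern`."""
    return (text for text in selections if not re.search(pattern, text))

TEXT = "12 1337 33 3 4 13370 73"  # made-up input

# x/[0-9]+/ g/3/ G/^1337$/
numbers = list(G(r"^1337$", g(r"3", x(r"[0-9]+", [TEXT]))))
print(numbers)
```

Because every selection is searched as its own string, ‘^’ and ‘$’ anchor to the start and end of each number rather than to the whole input, which is why ‘13370’ survives the last step while ‘1337’ does not.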