Reinvent The Wheel!

  • You have to do what must be done. Nobody is going to ask you, “why didn’t you make it?”. It’s either do it or not. Do not think about what you’re feeling, do it no matter what.

    Haroon Khan

    Story of a Software Engineer

    It was your average Wednesday afternoon, and I was working my job. My specific task on this day was quite simple: document the custom Vue components that make up most of our product’s UI.

    This should have been a relatively easy task, and for the most part it was, but I had an issue. Some of these components had some really obscure properties that could influence their behavior, and seeing as much of the codebase was written 10 years ago by utter idiots, the code implementing these properties is really hard to read.

    I decided that instead of trying to study the definitions of these properties, it would be quite a bit easier to study their usage. But how do I find them? Our codebase is hundreds of thousands of lines of code, and these properties have very generic names such as ‘browser’. Additionally, while the components are easy to search for, they’re used in hundreds of places, and such properties may only be used once or twice.

    The solution? I thought it would be the trusty tool in every hacker’s toolbelt: grep.

    The Downfall of Grep

    I thought that grep would be my saviour. The tool that would answer the call to find me the usages I so desired. So I whipped that baby out and went straight to work:

    $ git grep '<date-input.*browser.*>'

    You can probably tell from the fact I’m writing this post that this did not work. If you’ve ever worked with Vue or something similar, you might even be able to figure out why. For those unfamiliar with the frontend (you’re a treasure that must be preserved), allow me to show you something that is all too common in a Vue codebase:

    <date-input
    	v-model="date"
    	class="foo bar"
    	:browser="true"
    	:placeholder="today"
    	required
    />

    The issue here is clear: the property we’re searching for (‘browser’) is on an entirely different line from the component we’re searching for (‘<date-input>’). It’s not enough to search for just the component, because it’s used everywhere but only a few rare usages interest me. It’s also not enough to search for just the attribute, because many different components have attributes of the same name (and no, they don’t have the same behavior; the codebase is shit).

    What I need is a tool that will let me search for patterns that span multiple lines.

    Introducing Grab

    The current UNIX® text processing tools are weakened by the built-in concept of a line. There is a simple notation that can describe the ‘shape’ of files when the typical array-of-lines picture is inadequate. That notation is regular expressions. Using regular expressions to describe the structure in addition to the contents of files has interesting applications, and yields elegant methods for dealing with some problems the current tools handle clumsily. When operations using these expressions are composed, the result is reminiscent of shell pipelines.

    Rob Pike

    That quote is from the abstract of Structural Regular Expressions, a paper written by the one and only Rob Pike back in 1987. It describes an idea by which we stop assuming that all data is organized in lines, and instead use regular expressions to define the shapes comprising our data.

    I had read this paper some years ago, and it had always sat in the back of my mind. I had even toyed around in the past with an implementation of grep that wasn’t strictly line-oriented, but it was very bare-bones and lacked basic faculties such as reporting the positions of matches, something I desperately needed.

    So over the following few days I made major changes, rewrote lots of the code, and overall turned my tool, grab, into a staple part of my hacker’s toolbelt.

    How Grab Finds Text

    If you’re familiar with the UNIX environment, you’re probably used to querying text with tools such as sed and awk using regular expressions. These are the same regular expressions we as programmers all know and love, but with one important — yet often overlooked — characteristic: you cannot match the newline.

    The grab utility moves away from this limiting paradigm; the newline is treated no differently from any other character you want to match. Want to match an entire paragraph of text? The pattern is as simple as ‘[^\n].+?(?=\n\n|$)’. It may look complicated if you’re new to regular expressions (PCREs, to be specific), but it’s really quite simple. You just match a non-newline character, and then keep matching characters until you reach either a double newline or the end of input.
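
    To get a feel for the pattern outside of grab, here is a rough sketch using Python’s re module. It assumes that ‘.’ is allowed to match the newline (re.DOTALL in Python); the sample text is made up for illustration, and Python’s regex flavor is close to but not identical to PCRE.

```python
import re

# DOTALL lets '.' match '\n', which mimics treating the newline like any
# other character. This flag choice is an assumption made for the sketch.
PARAGRAPH = re.compile(r"[^\n].+?(?=\n\n|$)", re.DOTALL)

# Hypothetical sample input: two paragraphs separated by an empty line.
text = (
    "Hello world, this is\n"
    "a paragraph.\n"
    "\n"
    "This is also a paragraph\n"
    "but it contains doubled\n"
    "doubled words.\n"
)

# Each match is one whole paragraph, newlines included.
paragraphs = PARAGRAPH.findall(text)
for paragraph in paragraphs:
    print(repr(paragraph))
```

    The lazy ‘.+?’ grows one character at a time, and the lookahead stops it at the first empty line or at the end of the input without consuming either.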

    On its own this isn’t too amazing though. The great thing about grep is that it doesn’t just show you matches; it shows them to you in the context of a complete line. grab solves this in the same way described in Rob Pike’s paper: chaining operations.

    Say we want to iterate not over lines but over paragraphs. We can use the following pattern:

    x/[^\n].+?(?=\n\n|$)/

    Here we’re using the ‘x’ operator. It iterates over all occurrences of the pattern. In this case we’re iterating over all paragraphs in our input. Maybe we want to see all paragraphs which contain doubled words (for example: ‘the the’), a common typo found in text files. For this we can use the ‘g’ operator:

    x/[^\n].+?(?=\n\n|$)/ g/(\b\w+\b)\s+\1/

    The fundamental difference between the two operators is that the ‘x’ operator specifies the structure to iterate over. In the context of grep these are lines, but in grab they can be whatever you want. The ‘g’ operator on the other hand doesn’t modify the structure of the matches returned to you at all; it simply acts as a filter, selecting the matches which match the given regular expression.
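
    One way to picture the difference is as two small functions chained together. This is only a Python sketch of the idea, not how grab is implemented; the function names x and g simply mirror the operators, and the sample text is made up.

```python
import re
from typing import Iterable, Iterator


def x(pattern: str, selections: Iterable[str]) -> Iterator[str]:
    """Split each selection into new, smaller selections: one per match."""
    regex = re.compile(pattern, re.DOTALL)
    for selection in selections:
        for match in regex.finditer(selection):
            yield match.group(0)


def g(pattern: str, selections: Iterable[str]) -> Iterator[str]:
    """Pass a selection through untouched, but only if it contains a match."""
    regex = re.compile(pattern, re.DOTALL)
    return (s for s in selections if regex.search(s))


# Hypothetical sample input.
text = (
    "Hello world, this is\n"
    "a paragraph.\n"
    "\n"
    "This is also a paragraph\n"
    "but it contains doubled\n"
    "doubled words.\n"
)

# Equivalent in spirit to: x/[^\n].+?(?=\n\n|$)/ g/(\b\w+\b)\s+\1/
paragraphs = x(r"[^\n].+?(?=\n\n|$)", [text])
doubled = list(g(r"(\b\w+\b)\s+\1", paragraphs))
print(doubled)
```

    Note how x changes what a selection is, while g only decides whether a selection survives.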

    Here’s an interactive example:

    $ cat foo
    Hello world, this is
    a paragraph.
    
    This is also a paragraph
    but it contains doubled
    doubled words.
    $ grab 'x/[^\n].+?(?=\n\n|$)/ g/(\b\w+\b)\s+\1/' foo
    This is also a paragraph
    but it contains doubled
    doubled words.
    $ # Just like grep, you can display match positions
    $ grab -f '…' foo
    foo:4:1:This is also a paragraph
    but it contains doubled
    doubled words.
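
    The ‘4:1’ in that output is the line and column where the match begins. A position like that can be derived from the match’s offset by counting newlines, as in this Python sketch. The helper name is hypothetical, and whether grab counts columns in bytes or in characters is not something I’m asserting here; the sketch counts characters.

```python
def line_and_column(text: str, offset: int) -> tuple[int, int]:
    """Convert a 0-based offset into a 1-based (line, column) pair."""
    # Every newline before the offset pushes the match down one line.
    line = text.count("\n", 0, offset) + 1
    # rfind() returns -1 when no newline precedes the offset, which makes
    # the column come out as offset + 1 on the first line.
    last_newline = text.rfind("\n", 0, offset)
    column = offset - last_newline
    return line, column


# Hypothetical sample input, matching the file used in the example.
text = (
    "Hello world, this is\n"
    "a paragraph.\n"
    "\n"
    "This is also a paragraph\n"
    "but it contains doubled\n"
    "doubled words.\n"
)

position = line_and_column(text, text.index("This is also"))
print(position)
```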

    This is almost perfect; there’s just one bit missing. In my interactive example I’ve shown how you can use the power of grab to find paragraphs in your files containing doubled words. This is really handy if you find yourself writing websites, documentation, or other long-form written content.

    Given my example though, how easily were you able to spot the doubled words? They probably didn’t stick out to you right away, unlike if they had been highlighted in some bright flashy color. It is for this reason that the ‘h’ operator exists. This operator is unique in that it does not change the given selections at all. Any matches made by previous occurrences of ‘x’ and ‘g’ will be displayed the same with and without the use of ‘h’.

    The ‘h’ operator is purely for the user. By using this operator you can specify a pattern for which matching text must be highlighted. Let’s apply it to the previous example and see how the doubled words are made instantly obvious to the user:

    $ grab 'x/[^\n].+?(?=\n\n|$)/ g/(\b\w+\b)\s+\1/ h/(\b\w+\b)\s+\1/' foo
    This is also a paragraph
    but it contains doubled
    doubled words.

    There is an obvious problem here: the duplication of the regular expression provided to the ‘g’ and ‘h’ operators. It is extremely common that you will want to highlight text that was just matched by a ‘g’ operator. Like, really common. So common in fact that the ‘h’ operator supports a shorthand syntax for this exact situation: h//. Giving an empty regular expression as an argument to an operator is illegal with the exception of the ‘h’ operator. When this operator is given an empty argument, it assumes the regular expression of the previous operator:

    $ grab 'x/[^\n].+?(?=\n\n|$)/ g/(\b\w+\b)\s+\1/ h//' foo
    This is also a paragraph
    but it contains doubled
    doubled words.
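
    Since plain text can’t show colors, here is a Python sketch of what highlighting amounts to: wrapping the matched text in terminal escape codes while leaving the selection otherwise untouched. The function name, the choice of bold red, and the way the empty pattern falls back to the previous one are assumptions for illustration, not grab’s actual code.

```python
import re

BOLD_RED = "\x1b[1;31m"  # ANSI escape sequence: bold, red foreground
RESET = "\x1b[0m"        # ANSI escape sequence: reset all attributes


def h(pattern: str, previous_pattern: str, selection: str) -> str:
    """Wrap every match in color codes; an empty pattern reuses the previous one."""
    regex = re.compile(pattern or previous_pattern, re.DOTALL)
    return regex.sub(lambda m: f"{BOLD_RED}{m.group(0)}{RESET}", selection)


doubled_words = r"(\b\w+\b)\s+\1"
selection = "but it contains doubled\ndoubled words."

# Equivalent in spirit to: g/(\b\w+\b)\s+\1/ h//
highlighted = h("", doubled_words, selection)
print(highlighted)
```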

    Final Solution

    So… what was the final solution to my problem? How did I find all the <date-input> tags in my job’s codebase that were passed the ‘browser’ attribute? Well, here’s how:

    $ grab 'x/<date-input.*?>/ g/\bbrowser\b/ h//' foo
    

    Quick, simple, and elegant. Just the way I like it!
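
    To make the contrast with my first grep attempt concrete, here is a Python sketch that runs both strategies over a made-up Vue template. The template contents are hypothetical, and re.DOTALL stands in for grab letting ‘.’ match the newline.

```python
import re

# Hypothetical template: only the first <date-input> uses 'browser'.
source = """<template>
  <date-input
    v-model="date"
    class="foo bar"
    :browser="true"
    :placeholder="today"
    required
  />
  <date-input v-model="other" />
  <text-input :browser="false" />
</template>
"""

# Line-oriented, like my grep attempt: the tag name and the attribute
# never share a line, so nothing is found.
line_hits = [
    line
    for line in source.splitlines()
    if re.search(r"<date-input.*browser.*>", line)
]

# Structural: first select whole tags, then filter them by attribute.
tags = re.findall(r"<date-input.*?>", source, re.DOTALL)    # x/<date-input.*?>/
hits = [tag for tag in tags if re.search(r"\bbrowser\b", tag)]  # g/\bbrowser\b/

print(len(line_hits), len(tags), len(hits))
```

    The <text-input> tag also has a ‘browser’ attribute, but it is never selected in the first place, so the filter never sees it.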

    Additional Operators

    Here I’ve shown you the three main operators: ‘x’, ‘g’, and ‘h’. These are not all, however! Each operator also has a capital variant (‘X’, ‘G’, ‘H’) which behaves the same, except that instead of working on text that matches the given pattern, it works on text which doesn’t match the given pattern.

    These operators allow for better pattern matching. For example, a pattern to match all numbers which contain a ‘3’ but which aren’t ‘1337’ could be written as x/[0-9]+/ g/3/ G/^1337$/.
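
    As a rough Python sketch of that last chain, the inverse filter is just the regular filter with a ‘not’ in front. The input string is made up, and the sketch assumes that ‘^’ and ‘$’ anchor to the current selection rather than to the whole input, which is what the example pattern implies.

```python
import re

# Hypothetical input containing a mix of numbers.
text = "12 1337 31337 3 42 133"

numbers = re.findall(r"[0-9]+", text)                              # x/[0-9]+/
with_three = [n for n in numbers if re.search(r"3", n)]            # g/3/
result = [n for n in with_three if not re.search(r"^1337$", n)]    # G/^1337$/

print(result)
```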