While reading this article, I ran into a concept named Jackson Structured Programming (JSP). JSP is mainly a design technique (with its own diagram syntax and CASE tools), but one of its main insights is a very intuitive programming pattern. This is what I want to focus on in this post.
Consider the following problem:
A file is a sorted list of lines. We want to produce a report from this file, displaying each kind of line only once, along with a count of how many times it appears in the file.
For example, take this file:
```
aaaa
bbbb
bbbb
bbbb
cccc
cccc
dddd
dddd
dddd
dddd
eeee
ffff
ffff
gggg
```
The report produced from it would be:
```
aaaa: 1
bbbb: 3
cccc: 2
dddd: 4
eeee: 1
ffff: 2
gggg: 1
```
Michael A. Jackson, the original developer of the JSP technique, has an interesting paper on this subject called "Getting It Wrong – A Cautionary Tale" (PDF link), in which he explains the common way to solve this problem and its pitfalls. The code samples are in COBOL, but other than that the article is quite accessible. I encourage you to read it, or at least think about how you would write the code to solve this task before reading on.
There’s no catch here. The problem is very simple – there’s no magic algorithm that’s needed to solve it. This post deals with the particular implementation technique at the lowest level.
When one thinks about a solution, most likely the first line of thought is "OK, so we go over the file line by line, and for each line…". That’s correct, of course, but stated like this it hints at a code pattern. In Python, for example, such code is likely to be based on:
```python
for line in stream:
    # do something
```
Here’s the full implementation, translating the code sample in the Wikipedia page on JSP linked earlier into Python:
```python
def process_normal(stream):
    """ Process the stream using "normal" looping techniques """
    count = 0
    first_line_of_group = None
    for line in stream:
        if first_line_of_group and first_line_of_group != line:
            print_line_with_count(first_line_of_group, count)
        if not first_line_of_group or first_line_of_group != line:
            count = 0
            first_line_of_group = line
        count += 1
    if first_line_of_group:
        print_line_with_count(first_line_of_group, count)
```
This code is, as far as I can tell, correct. However, it looks a bit complex. It’s not bad, and it’s not hard to understand, but there’s a better way.
Here it is. All it does is restructure the code a bit, deviating from the "normal" looping pattern: it reads one line in advance, and maintains the invariant that whenever the main loop cycles, the next line is already available in line:
```python
def process_jsp(stream):
    """ Process the stream using the JSP looping technique
        with read-ahead
    """
    line = stream.readline()
    while line:
        count = 0
        first_line_of_group = line
        while line and line == first_line_of_group:
            count += 1
            line = stream.readline()
        print_line_with_count(first_line_of_group, count)
```
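Both listings call a print_line_with_count helper that isn't shown. A minimal sketch of it (the name comes from the listings; the body here is my assumption about what it does):

```python
def print_line_with_count(line, count):
    # Strip the trailing newline that readline() leaves on each
    # line, so the report reads "aaaa: 1" on a single line.
    print('%s: %d' % (line.rstrip('\n'), count))

print_line_with_count('aaaa\n', 1)  # prints: aaaa: 1
```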
I don’t know about you, but it just strikes me how much simpler this code is to comprehend. You can almost read the code aloud and just understand it. Simpler to comprehend, hence much less likely to contain bugs (which is the point Jackson is trying to make in his paper).
At this point I could stop and say – "oh, cool, a nice coding pattern that makes some problems more natural to express". But I’m not satisfied. I’m trying to understand what happened here. Why did such a seemingly simple code change make the algorithm much easier to express?
Here’s what I think. The "normal" approach essentially implements a state machine. For each line we process, we’re in some state, and the loop body acts according to that state. Since the state is not entirely trivial, handling it requires more than a minimal amount of code.
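To make "implements a state machine" concrete, here is a rewrite of the same grouping logic with the state named explicitly (this is my own illustration, not code from the paper; it collects the groups into a list instead of printing them):

```python
def process_explicit_state(stream):
    # Same logic as the "normal" version, with the implicit state
    # spelled out: we are either BEFORE any group, or INSIDE one.
    BEFORE_FIRST_GROUP, INSIDE_GROUP = 'before', 'inside'
    state = BEFORE_FIRST_GROUP
    groups = []
    for line in stream:
        if state == BEFORE_FIRST_GROUP:
            current, count = line, 1
            state = INSIDE_GROUP
        elif line == current:
            count += 1
        else:
            # The current group ended; record it and start a new one.
            groups.append((current, count))
            current, count = line, 1
    if state == INSIDE_GROUP:
        groups.append((current, count))
    return groups
```

Every branch of the loop body is there to interrogate or update the state; that bookkeeping is exactly what the read-ahead version makes disappear.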
On the other hand, in the JSP approach, most of the state is implicitly encoded in the code itself. Let me repeat that in other words – the state is still there, but it’s implicit. Splitting the line reading to two places and keeping the invariant on the main loop allows us to express the state in the flow itself, instead of with explicit conditions.
And I have déjà vu here, because I’ve already written about something like this – in the article "Co-routines as an alternative to state machines". That article explains how, by using co-routines, we can avoid an explicit state machine and thus considerably simplify the code. By writing the code in a certain way, the state becomes implicitly encoded in the control flow instead.
To be frank I’m not yet 100% sure on the connection – after all, the code presented in process_jsp is not really a co-routine. But I can’t escape the hunch that these concepts are related. If you have any insights on this, please share them!
Another similarity I see is with the way recursive-descent parsers are coded (I’ve written a few posts on RD parsers in the past, for example this one). An RD parser consists of a set of mutually-calling functions that represent grammar constructs (for example "Statement", "Identifier", "Binary operation", etc.). Each such function assumes the current token is already available, without having to read it from the input itself. When the parsing of a construct finishes, the next token is read – for the sake of the next parsing function, not the current one. Trying to code an RD parser differently – having each parsing function fetch its own token before looking at it – results in much more complicated code. I think this is somewhat similar to the pattern presented here.
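The invariant is the same read-ahead: each parsing function finds the current token already loaded, and reads ahead only when it is done with its construct. A toy sketch of that convention, for a hypothetical grammar of numbers separated by '+' (my own illustration, not taken from the linked posts):

```python
class Parser:
    # Toy recursive-descent parser for sums like "1 + 2 + 3".
    # Invariant: self.cur always holds the current token; each method
    # consumes its own construct and reads ahead for the next one.
    def __init__(self, tokens):
        self.tokens = iter(tokens)
        self.cur = next(self.tokens, None)  # read-ahead: load first token

    def advance(self):
        self.cur = next(self.tokens, None)

    def parse_sum(self):
        # On entry, self.cur is the first token of the sum.
        total = self.parse_number()
        while self.cur == '+':
            self.advance()  # consume '+', read ahead for the operand
            total += self.parse_number()
        return total

    def parse_number(self):
        value = int(self.cur)
        self.advance()  # done with this construct; load the next token
        return value

print(Parser(['1', '+', '2', '+', '3']).parse_sum())  # prints: 6
```

As in process_jsp, the token reading is split between initialization and the end of each construct, so every function can test self.cur directly instead of juggling a "have I read the token yet?" state.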
To conclude, "a-ha!" moments can often come from examining the simplest things. Personally, I feel that understanding the coding pattern presented in this post helps me see things more clearly, and helps me understand other concepts a bit better. I had long been aware that simply changing the order of operations in such algorithms (i.e. process first, then read the next input) can make the code simpler, and I have used the technique myself, but I feel that only now do I also understand why this is so. I hope I have managed to get the point across.