An observation on writing line-processing loop code

October 28th, 2011 at 3:13 pm

While reading this article, I ran into a concept named Jackson Structured Programming (JSP). JSP is mainly a design technique (with its own diagram syntax and CASE tools), but one of its main insights is a very intuitive programming pattern. This is what I want to focus on in this post.

Consider the following problem:

A file is a sorted list of lines. We want to produce a report from this file, displaying each kind of line only once, along with a count of how many times it appears in the file.

For example, take this file:


The report produced from it would be:

aaaa: 1
bbbb: 3
cccc: 2
dddd: 4
eeee: 1
ffff: 2
gggg: 1

Michael A. Jackson, the original developer of the JSP technique, has an interesting paper on this subject called "Getting It Wrong – A Cautionary Tale" (PDF link), in which he explains the common way to solve this problem and its pitfalls. The code samples are in COBOL, but other than that the article is quite accessible. I encourage you to read it, or at least think about how you would write the code to solve this task before reading on.

There’s no catch here. The problem is very simple – there’s no magic algorithm that’s needed to solve it. This post deals with the particular implementation technique at the lowest level.

When one thinks about a solution, most likely the first line of thought is "OK, so we go over the file line by line, and for each line…". That’s correct, of course, but stated like this it hints at a code pattern. In Python, for example, such code is likely to be based on:

for line in stream:
  # do something

Here’s the full implementation, translating the code sample in the Wikipedia page on JSP linked earlier into Python:

def process_normal(stream):
    """ Process the stream using "normal" looping techniques
    count = 0
    first_line_of_group = None
    for line in stream:
        if first_line_of_group and first_line_of_group != line:
            print_line_with_count(first_line_of_group, count)
        if not first_line_of_group or first_line_of_group != line:
            count = 0
            first_line_of_group = line
        count += 1
    if first_line_of_group:
        print_line_with_count(first_line_of_group, count)

This code is, as far as I can tell, correct. However, it looks a bit complex. It’s not bad, and it’s not hard to understand, but there’s a better way.

Here it is. All it does is restructure the code a bit, deviating from the "normal" looping pattern, instead reading one line in advance and then keeping an invariant that when the main loop cycles it already has the next line handy in line:

def process_jsp(stream):
    """ Process the stream using the JSP looping technique with read-ahead
    line = stream.readline()
    while line:
        count = 0
        first_line_of_group = line
        while line and line == first_line_of_group:
            count += 1
            line = stream.readline()
        print_line_with_count(first_line_of_group, count)

I don’t know about you, but it just strikes me how much simpler this code is to comprehend. You can almost read the code aloud and just understand it. Simpler to comprehend, hence much less likely to contain bugs (which is the point Jackson is trying to make in his paper).

At this point I could stop and say – "oh, cool, a nice coding pattern that makes some problems more natural to express". But I’m not satisfied. I’m trying to understand what happened here. Why did such a seemingly simple code change make the algorithm much easier to express?

Here’s what I think. The "normal" approach just implements a state machine. "For each line…" we have a state. The loop body then just acts according to the state. Since the state is not entirely trivial, to handle it also requires more than a minimal amount of code.

On the other hand, in the JSP approach, most of the state is implicitly encoded in the code itself. Let me repeat that in other words – the state is still there, but it’s implicit. Splitting the line reading to two places and keeping the invariant on the main loop allows us to express the state in the flow itself, instead of with explicit conditions.

And I have Déjà vu here, because I’ve already written about something like this – in the article "Co-routines as an alternative to state machines". That article explains how, by using co-routines, we can avoid an explicit state machine and thus considerably simplify the code. By writing the code in a certain way, the state becomes implicitly encoded in the control flow instead.

To be frank I’m not yet 100% sure on the connection – after all, the code presented in process_jsp is not really a co-routine. But I can’t escape the hunch that these concepts are related. If you have any insights on this, please share them!

Another familiarity I see here is with the method of coding recursive-descent parsers (I’ve written a few posts on RD parsers in the past, for example this one). When implementing an RD parser, the code consists of a set of mutual-calling functions that represent grammar structures (for example "Statement", "Identifier", "Binary operation", etc.) Each such function assumes the current token is available, without it having to read it from the input. When the parsing of a construct finishes, the next token is read – for the sake of the next parsing function, not this one. Trying to code a RD parser differently – having each parsing function getting its own token prior to looking at it – results in much more complicated code. I think this is somewhat similar to the pattern presented here.

To conclude, "a-ha!" moments can often come from examining the simplest things. Personally I feel that understanding the coding pattern presented in this post helps me see things more clearly, and helps understand other concepts a bit better. I was aware for a long time that simply changing the order of operations in such algorithms (i.e. process first, then read next input) may make the code simpler, and also used the technique myself, but I feel that only now I also understand why this is so. I hope I have managed to get the point across :-)

Related posts:

  1. Parsing C: more on #line directives
  2. lesson for today: caveat in C++ line-reading
  3. Processing XML in Python with ElementTree
  4. using a cache to speed-up code
  5. Book review: “Beautiful Code”, edited by Andy Oram & Greg Wilson

9 Responses to “An observation on writing line-processing loop code”

  1. MikeNo Gravatar Says:

    Python has a neat trick with iter() for calling a function on each iteration, so the “while True: line = file.readline(); if not line: break” can be simplified to “for line in iter(file.readline, ”):”, which I think last example might benefit from.

    And while being far from a computer scientist, such transformation seem to me like splitting a single (and more complex) state (machine) into a multiple simplier ones, kinda akin to structured programming vs “global variables + goto”, which I thik is generally a good thing to do.

  2. FredrikNo Gravatar Says:

    read-ahead and the corresponding write pattern are fundamental tools to preserve your sanity, but in modern Python, itertools.groupby provides a nice abstraction that covers many uses of those patterns. Your example becomes:

    for key, run in itertools.groupby(stream):
      print_line_with_count(key, len(list(run)))

    or a bit more robust for large runs:

    for key, run in itertools.groupby(stream):
      print_line_with_count(key, sum(1 for item in run))

  3. elibenNo Gravatar Says:


    Yep, groupby surely is the right way to do this in Pyhon. I was just using this problem to demonstrate the pattern, of course. The pattern itself is general.

  4. MarsNo Gravatar Says:

    When going through this article, Awk comes upon my mind. Awk abstracts this pattern very well.

  5. MarsNo Gravatar Says:
    BEGIN { 
    	getline line
    	count = 1;
    	if (line == $0) { 
    	} else {
    		print line " : " count;
    		count = 1;
    	line = $0;
    END {
    	print line " : " count;

    So clear. However, I think what you trying to say is more general strategy to handle control flow.
    Thanks for stimulating me.

  6. elibenNo Gravatar Says:


    Thanks for bringing this up, it’s a very interesting example. Last time I touched Awk was 10 years ago or so :-)

    The code sample you posted is a direct translation of this pattern, relying on the fact that there’s an implicit getline $0 after each iteration of the implicit loop (I hope I’m remembering Awk correctly), and there’s also another explicit getline line in the beginning. So I agree, this is a nice translation of the idiom to Awk.

  7. bryaneNo Gravatar Says:

    Mars, et al:
    There’s an easier way in AWK, using associative arrays and relying on things being initialized to zero:
    { X[$0] = X[$0] + 1; }
    END { for (x in X) print x, X[x]; }

    This goes more towards looking at the program a different way (and assumes that there is exactly one set of lines containing a particular value). Said another way, this awk program doesn’t require the input to be sorted – which probably simplifies input preprocessing for this step :)

  8. bryaneNo Gravatar Says:

    Whoops. I noticed that the first AWK example, from Mars, has an error if the input is empty. It will still produce one line of output because there’s no indicator of whether a line was actually seen in the END block. I didn’t have my caffeine before.

  9. MichaelNo Gravatar Says:
    $ cat lines | sort | uniq -c
       1 aaaa
       3 bbbb
       2 cccc
       4 dddd
       1 eeee
       2 ffff
       1 gggg

Leave a Reply

To post code with preserved formatting, enclose it in `backticks` (even multiple lines)