conversion tools that don't scale - Eli Bendersky's website

Conversion tools from one format to another are very common in today's plethora-of-formats world. But sometimes the formats are so different that tools to do the conversions don't scale.

One tool I wrote at work, which converts one simulation log format (text) to another (binary), has this problem. It all worked nice and fine until one of the users decided to run an 800 MB file (27 million lines, LOL, wc -l took 3 minutes to run) through it. At that point my program showed that it doesn't scale, and crashed.

The reason is obvious: I read the whole file into memory, process it and spew the output - and with a file that large, memory runs out. Reading everything up front seemed logical because of the difference between the formats: I must know some "future" information, found only later in the input, prior to writing the output.

The solution is not simple. I must rewrite my tool to do a first pass over the file, collecting all the "future" information it needs. A second pass will then read the input and write the output right away, without holding the whole file in memory. This is way different from how the tool works now - BUMMER.
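To make the two-pass idea concrete, here is a minimal sketch in Python. Both formats in it are made up for illustration, since the real ones aren't described here: I assume each text line is "time signal value", and that the binary output must begin with a table of all signal names (that table plays the role of the "future" information). The function name convert and the record layout are likewise assumptions, not the actual tool.

```python
import struct


def convert(src_path, dst_path):
    """Two-pass conversion of a hypothetical text log to a hypothetical binary format."""
    # Pass 1: stream the input and collect only the "future" information.
    # In this made-up format that is the set of signal names, which the
    # binary header must list before any record is written.
    ids = {}
    with open(src_path, "r") as src:
        for line in src:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            ids.setdefault(fields[1], len(ids))

    # Pass 2: write the header, then re-read the input and emit one
    # record per line, so nothing proportional to the file size is kept.
    with open(src_path, "r") as src, open(dst_path, "wb") as dst:
        dst.write(struct.pack("<I", len(ids)))  # number of signals
        for name in ids:  # dicts keep insertion order, matching the ids
            raw = name.encode("utf-8")
            dst.write(struct.pack("<H", len(raw)) + raw)
        for line in src:
            fields = line.split()
            if len(fields) != 3:
                continue
            time, name, value = fields
            # Record: 8-byte time, 4-byte signal id, 8-byte float value.
            dst.write(struct.pack("<QId", int(time), ids[name], float(value)))
```

The price is reading the input twice, but memory use now depends on the number of distinct signals rather than on the number of lines, which is what matters for a 27-million-line file.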