Tags Python

I think I've found a bug in one of the modules of the standard Python distribution (version 2.5.2)

The problem is with the SequenceMatcher class of the difflib module - a useful tool for computing the differences between sequences and files that works similarly to the diff command line program.

After importing SequenceMatcher:

from difflib import SequenceMatcher

ratio is supposed to return the ratio of relative similarity between two sequences. For example:

>>> SequenceMatcher(None, [4, 5, 6], [4, 5, 7]).ratio()

So far so good. But for sequences 200 elements or longer, ratio breaks:

>>> SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()

This was supposed to return something very close to 1.0, like this run with a sequence just a tad shorter:

>>> SequenceMatcher(None, [4] + [5] * 199, [5] * 199).ratio()

Since it's so unusual, I just had to look at the code of SequenceMatcher to see what's going on. Sure enough, the number 200 plays a role there. It is the amount above which an item is considered popular, and invokes a special optimization. I suppose there's a bug somewhere in this optimization.

Other methods of SequenceMatcher suffer from this problem as well, so I suppose it's a bug that makes SequenceMatcher almost unusable for long sequences. This is strange, because I can't understand how no one has noticed this before.

By the way, I tried filing a bug in the Python SF tracker, but apparently I don't have permissions for that (even as a registered SF user).