Bug in Python’s difflib module ?

December 9th, 2008 at 10:07 pm

I think I’ve found a bug in one of the modules of the standard Python distribution (version 2.5.2)

The problem is with the SequenceMatcher class of the difflib module – a useful tool for computing the differences between sequences and files that works similarly to the diff command line program.

After importing SequenceMatcher:

from difflib import SequenceMatcher

ratio is supposed to return the ratio of relative similarity between two sequences. For example:

>>> SequenceMatcher(None, [4, 5, 6], [4, 5, 7]).ratio()
0.666666666667

So far so good. But for sequences 200 elements or longer, ratio breaks:

>>> SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()
0.0

This was supposed to return something very close to 1.0, like this run with a sequence just a tad shorter:

>>> SequenceMatcher(None, [4] + [5] * 199, [5] * 199).ratio()
0.997493734336

Since it’s so unusual, I just had to look at the code of SequenceMatcher to see what’s going on. Sure enough, the number 200 plays a role there. It is the amount above which an item is considered popular, and invokes a special optimization. I suppose there’s a bug somewhere in this optimization.

Other methods of SequenceMatcher suffer from this problem as well, so I suppose it’s a bug that makes SequenceMatcher almost unusable for long sequences. This is strange, because I can’t understand how no one has noticed this before.

By the way, I tried filing a bug in the Python SF tracker, but apparently I don’t have permissions for that (even as a registered SF user).

Related posts:

  1. Book review: “Galapagos” by Kurt Vonnegut
  2. Contributing to Python
  3. Getters and setters in Python

3 Responses to “Bug in Python’s difflib module ?”

  1. Michael SchurterNo Gravatar Says:

    I believe http://bugs.python.org/ is the place to file bugs now.

  2. Rene DudfieldNo Gravatar Says:

    hi,

    Perhaps file another bug about how the old bug tracker needs to point to the new bug tracker.

    Imagine how many bugs are being left unreported because of this issue? Probably one of the most important python bugs there is.

  3. elibenNo Gravatar Says:

    Thanks for the tip, I’ve opened the issue:

    http://bugs.python.org/issue4622