I think I’ve found a bug in one of the modules of the standard Python distribution (version 2.5.2).
The problem is with the SequenceMatcher class of the difflib module – a useful tool for computing the differences between sequences and files, similar in spirit to the diff command-line program.
After importing SequenceMatcher:
from difflib import SequenceMatcher
ratio is supposed to return a measure of the relative similarity between two sequences, as a float in the range [0, 1]. For example:
>>> SequenceMatcher(None, [4, 5, 6], [4, 5, 7]).ratio()
0.666666666667
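For reference, the difflib documentation defines this value as 2.0*M / T, where T is the total number of elements in both sequences and M is the number of matches. A quick sketch confirms the formula for the example above:

```python
from difflib import SequenceMatcher

a, b = [4, 5, 6], [4, 5, 7]
m = SequenceMatcher(None, a, b)

# Sum the sizes of all matching blocks; the list always ends with a
# zero-length sentinel block, which does not affect the sum.
matches = sum(block.size for block in m.get_matching_blocks())

# ratio() is documented as 2.0 * M / T, where T is the total number
# of elements in both sequences and M is the number of matches.
expected = 2.0 * matches / (len(a) + len(b))  # 2 matches -> 2/3

assert abs(expected - m.ratio()) < 1e-12
```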
So far so good. But for sequences 200 elements or longer, ratio breaks:
>>> SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()
0.0
This was supposed to return something very close to 1.0, like this run with a sequence just a tad shorter:
>>> SequenceMatcher(None, [4] + [5] * 199, [5] * 199).ratio()
0.997493734336
Since it’s so unusual, I just had to look at the code of SequenceMatcher to see what’s going on. Sure enough, the number 200 plays a role there: once a sequence is at least 200 elements long, any element that appears in more than 1% of it is considered popular, which invokes a special optimization that treats the element as junk. I suppose there’s a bug somewhere in this optimization.
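A small sketch shows the heuristic at work. Note that b2j is an undocumented internal attribute of SequenceMatcher (the index mapping elements of the second sequence to their positions), so this is for illustration only:

```python
from difflib import SequenceMatcher

# The second sequence has exactly 200 elements, all equal to 5, so 5
# crosses the "popular" threshold and is dropped from the match index.
m = SequenceMatcher(None, [4] + [5] * 200, [5] * 200)

# b2j is the internal index mapping elements of the second sequence to
# their positions; popular elements are silently removed from it.
print(5 in m.b2j)  # False: 5 was classified as popular
print(m.ratio())   # 0.0: with 5 gone, nothing matches at all
```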
Other methods of SequenceMatcher suffer from this problem as well, so I suppose it’s a bug that makes SequenceMatcher almost unusable for long sequences. This is strange, because I can’t understand how no one has noticed it before.
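For instance, get_matching_blocks finds nothing at all for such sequences – it returns only the zero-length sentinel block that always terminates its result (a sketch using the same kind of 200-element input as above):

```python
from difflib import SequenceMatcher

m = SequenceMatcher(None, [4] + [5] * 200, [5] * 200)

# Only the zero-length sentinel block that terminates the list is
# returned; no real matching blocks were found.
blocks = m.get_matching_blocks()
print(blocks)  # a single block with size=0
```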
By the way, I tried filing a bug in the Python SF tracker, but apparently I don’t have permissions for that (even as a registered SF user).