grep through code history with Git, Mercurial or SVN

May 22nd, 2012 at 3:51 am

A problem that sometimes comes up with source-controlled code is finding the revision in which some line was deleted, or otherwise modified in a way that blame can’t decipher. In other words, we want to grep over all revisions of some file to learn which revisions contain a certain pattern. Note that the goal is not to search the commit log (which is trivial), but rather the code itself.

Well, if you’re using Mercurial or Git, you’re lucky because both provide built-in methods for doing this.

With Mercurial, use hg grep.

With Git, you can either use git grep in conjunction with git rev-list, or git log -S (more details in this Stack Overflow thread).
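As a rough sketch of the first Git approach, here is one way to combine git rev-list and git grep from Python. The names run_git and git_history_grep, and the injectable run parameter, are my own illustration and not part of Git. The sketch assumes Python 3.5+ and that it is run from inside a Git working tree.

```python
import subprocess

def run_git(args):
    """Run a git command given as a list of tokens.

    Returns (exit code, stdout). Hypothetical helper, not part of Git.
    """
    proc = subprocess.run(['git'] + args, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, universal_newlines=True)
    return proc.returncode, proc.stdout

def git_history_grep(path, pattern, run=run_git):
    """Return the commits (in rev-list order) where path matches pattern."""
    # List the commits, on any branch, that touched path.
    _, out = run(['rev-list', '--all', '--', path])
    hits = []
    for rev in out.split():
        # With -q, git grep prints nothing and exits with status 0
        # when at least one line matches.
        code, _ = run(['grep', '-q', '-E', '-e', pattern, rev, '--', path])
        if code == 0:
            hits.append(rev)
    return hits
```

Like the SVN script further down, this only looks at commits that touched the file. If you only care about commits that added or removed an occurrence of a string, git log -S is probably the shorter route.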

What about Subversion, though? SVN, to the best of my knowledge, does not have this functionality built-in. Moreover, SVN’s design makes this task inherently slow because only the latest revision is actually kept on your machine (unless the repository is local), so you have to ask the server for each earlier revision. That’s a lot of network traffic.

That said, if you’re willing to tolerate the slowness (and sometimes there’s no choice!), then the following script – svnrevgrep – makes it as simple as with Git or Mercurial:

import re, sys, subprocess

def run_command(cmd):
    """ Run a command (split on whitespace), return its stdout output.
    """
    return subprocess.check_output(cmd.split(), universal_newlines=True)

def svnrevgrep(filename, s):
    """ Go over all revisions of filename, checking if s can be found
        in them.
    """
    log = run_command('svn log ' + filename)
    # Match only the "rNNN | ..." header lines of the log, so that revision
    # numbers mentioned inside commit messages are not picked up.
    for rev in re.findall(r'^r(\d+) \|', log, flags=re.MULTILINE):
        cmd = 'svn cat -r %s %s' % (rev, filename)
        contents = run_command(cmd)
        print('r%s: %s' % (rev, 'found' if re.search(s, contents)
                                        else 'not found'))

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print('Usage: %s <path> <regex>' % sys.argv[0])
    else:
        svnrevgrep(sys.argv[1], sys.argv[2])

It basically goes over all revisions of the file starting with the most recent one and looks for the pattern.
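The revision-extraction step can be tried on its own, without a server. The sketch below runs on a made-up svn log output (the sample text and both helper names are hypothetical) and shows one possible way to pull revision numbers out of the header lines only, and to build the svn cat command as a list of tokens so that a filename containing spaces stays in one piece.

```python
import re

# Made-up `svn log` output, for illustration only.
SAMPLE_LOG = """\
------------------------------------------------------------------------
r42 | alice | 2012-05-20 10:00:00 +0000 (Sun, 20 May 2012) | 1 line

Revert r17, it broke the build
------------------------------------------------------------------------
r17 | bob | 2012-05-01 09:30:00 +0000 (Tue, 01 May 2012) | 1 line

Add caching
------------------------------------------------------------------------
"""

def revisions_in_log(log):
    """Return revision numbers from svn log header lines, in log order."""
    # Anchoring on the "rNNN | " header skips revision numbers that are
    # merely mentioned in commit messages (like "Revert r17" above).
    return re.findall(r'^r(\d+) \|', log, flags=re.MULTILINE)

def svn_cat_command(rev, filename):
    """Build the svn cat command as a list of tokens (no splitting)."""
    return ['svn', 'cat', '-r', rev, filename]

print(revisions_in_log(SAMPLE_LOG))
print(svn_cat_command('42', 'my file.py'))
```

A token list like this can be handed to subprocess.check_output directly, with no need to split a command string.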

Note that while one could imagine using some kind of binary searching to find the first revision in which the regex appears (or doesn’t), this won’t work in the general case because code sometimes is added, then deleted, then re-added, then deleted again (this happens when refactoring or when reverting problematic commits).
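To see why, here is a toy illustration. The history below is invented (five revisions of a one-file project, oldest first), and both function names are mine. A line is added, removed, and later re-added; a linear scan finds every revision containing it, while a binary search for "the first match" lands on the wrong revision.

```python
import re

# Hypothetical history of one file, oldest revision first. The line
# "use_cache = True" is added, removed, then re-added.
HISTORY = [
    "x = 1",
    "x = 1\nuse_cache = True",
    "x = 1",
    "x = 1",
    "x = 1\nuse_cache = True",
]

def linear_scan(history, pattern):
    """Indices of all revisions whose contents match pattern."""
    return [i for i, text in enumerate(history) if re.search(pattern, text)]

def bisect_first_match(history, pattern):
    """Binary search for the first matching revision.

    Only valid if the history looks like 'no match, ..., no match,
    match, ..., match', which real histories do not guarantee.
    """
    lo, hi = 0, len(history) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if re.search(pattern, history[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo

print(linear_scan(HISTORY, 'use_cache'))         # revisions 1 and 4
print(bisect_first_match(HISTORY, 'use_cache'))  # 4, missing revision 1
```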

Finally, if you find yourself doing the above frequently for a given repository, you may be better off with:

git svn clone <path>
git grep <...>

3 Responses to “grep through code history with Git, Mercurial or SVN”

  1. Chris Says:

    This will fail on filenames with spaces in them. You should pass a list of command line tokens to run_command() directly (e.g. run_command(['svn', 'log', filename])) and drop the split().

    Also, be aware that subprocess.check_output() needs Python >= 2.7.

  2. eliben Says:

    Chris,

    I agree about the spaces. This script is mainly aimed at Linux, where spaces in filenames are rare, but as you mentioned, it can easily be modified to support filenames with spaces.

  3. jonathan Says:

    Unbelievably useful, thanks!
