Perl’s “guess if file is text or binary” implemented in Python

October 19th, 2011 at 8:06 am

Perl is a favorite language of system administrators for many reasons; for example, it has some built-in features very well suited for sysadmin scripts. One such feature is the -T and -B file test operators [1]. Recently I needed a similar feature in a Python script I was writing. Since Python doesn’t have it built-in, I became curious about how it works in Perl. Hence this post.

Here’s the relevant bit from the output of perldoc -f -B:

The "-T" and "-B" switches work as follows. The first block or
so of the file is examined for odd characters such as strange
control codes or characters with the high bit set. If too many
strange characters (>30%) are found, it’s a "-B" file; otherwise
it’s a "-T" file. Also, any file containing null in the first
block is considered a binary file. [...]

OK, that appears to be a reasonable heuristic. But "odd characters" and "strange control codes or characters" sound too vague, so I decided to take a peek at the source code of Perl 5.10 [2] to see what it actually does. This functionality is implemented in the function pp_fttext, which resides in pp_sys.c in the root directory of the source distribution. The -T and -B operators work on both file names and file handles, but for the sake of simplicity I will ignore this distinction, as well as some minor corner cases. Here’s what the code does:

  • If the file is empty, it’s considered text.
  • Otherwise, a buffer of up to 512 bytes is read from the file – this buffer will be examined for the heuristic.
  • The variable odd is initialized to 0. It will count the chars that don’t appear to be text.
  • Main loop – each byte of the buffer is examined in order:
    • If the byte is \x00, the heuristic immediately declares the file is binary and the function is finished.
    • The high bit of the byte is examined.
    • If the high bit is 0: odd is incremented if the byte is below ASCII code 32 (space) and not one of the known special values, such as newline or tab. In other words, if the byte isn’t an ASCII textual character.
    • If the high bit is 1: if this byte appears to be the first byte of a UTF-8 encoding of a code point above U+007F, attempt to decode the following bytes and see if they form a valid UTF-8 sequence. If they do, advance the loop pointer past this sequence. Otherwise, increment odd and proceed with the next character.
  • The main loop has ended, and now odd contains the number of chars that don’t appear to be textual in the buffer.
  • If odd is higher than 30% of the length of the buffer, the file is considered binary. Otherwise, it’s considered text.

Here’s an implementation in Python of this heuristic [3], ignoring the UTF-8 case (meaning that if this implementation encounters true UTF-8 chars, it will count them as "odd"):

import sys
PY3 = sys.version_info[0] == 3

# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr

_text_characters = (
        b''.join(int2byte(i) for i in range(32, 127)) +
        b'\n\r\t\f\b')

def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
        by reading a single block of bytes from the file.
        If more than 30% of the chars in the block are non-text, or there
        are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30
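
To use it, the file should be opened in binary mode, so that the block read is a bytes object on Python 3 (otherwise the b'\x00' check would raise a TypeError there). A minimal invocation could look like this, with the file name being just a placeholder:

with open('somefile', 'rb') as fileobj:
    # True for a file that looks like text, False for one that looks binary
    print(istextfile(fileobj))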

The main point of interest here is the usage of the translate method, and particularly its second (optional) argument to delete a set of chars. Since translate is implemented in C, this method should be quite fast. Naturally, adding UTF-8 detection here shouldn’t be too hard, if required.

Note also that this code was written to run on both Python 2 and Python 3 without changes.
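
For illustration, here is one possible way to add the UTF-8 awareness mentioned above. It’s only a rough sketch, not a faithful port of Perl’s byte-by-byte scan: it tries to decode the whole block as UTF-8 up front, and only falls back to the byte-level heuristic when that fails. The name istextfile_utf8 is mine, and it reuses _text_characters from above:

def istextfile_utf8(fileobj, blocksize=512):
    """ Like istextfile, but doesn't count valid multi-byte UTF-8
        sequences as 'odd' characters.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True
    try:
        decoded = block.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8 (note: the block may simply have been cut in
        # the middle of a multi-byte sequence); fall back to the plain
        # byte-level heuristic from istextfile.
        nontext = block.translate(None, _text_characters)
        return float(len(nontext)) / len(block) <= 0.30
    # The block decoded cleanly: count only control characters that
    # aren't common whitespace as 'odd'; code points above U+007F are
    # treated as text.
    odd = sum(1 for ch in decoded
              if ord(ch) < 32 and ch not in '\n\r\t\f\b')
    return float(odd) / len(decoded) <= 0.30

This diverges a bit from Perl’s behavior (Perl validates UTF-8 sequence by sequence inside the main loop rather than decoding the whole block), but it keeps the spirit of the heuristic.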

[1] And all file test operators in general, as they enable very succinct code for querying simple attributes of files.
[2] ಠ_ಠ and my eyes are still bleeding! Appreciate the LOTR quotes though.
[3] Loosely based on this recipe by Andrew Dalke: http://code.activestate.com/recipes/173220-test-if-a-file-or-string-is-text-or-binary/

10 Responses to “Perl’s “guess if file is text or binary” implemented in Python”

  1. nip Says:

    “odd” seems like a weird choice of variable name. May I suggest “quaint”?

  2. Andrew Dalke Says:

    You jogged an old memory of mine! Here’s a recipe I wrote nearly 9 years ago: http://code.activestate.com/recipes/173220-test-if-a-file-or-string-is-text-or-binary/ . Looks like I omitted the “\f” character, and a commenter pointed out a division bug in my code.

  3. Mark Says:

    It may be nice if it supported UTF/UCS also.
    It could also check for a BOM (though a BOM is not always present)

    For example, UTF-16-encoded text will have a 0 byte roughly every second byte for plain characters/numbers.

    I guess it depends on what the main use of the functionality is :)

    Thanks for the interesting post.

  4. eliben Says:

    nip,

    But, “quaint” is such a peculiar word. Maybe it should be named “queer”?

    Andrew,

    Note that I point to your recipe in footnote 3. Thanks for writing it, I used it as the basis for my implementation.

    Mark,

    Is there a way to automatically detect UTF-16 encoded files?

  5. Andrew Dalke Says:

    Oh! No, I didn’t see footnote #3. I’m glad you found it useful.

    Regarding automatically detecting UTF-16; there used to be the ‘chardet’ module, but the author recently removed it and all other content from his web site. There’s a mirror at https://github.com/ramalho/chardet . It detects the character encoding using a set of state machines.

  6. Rene Dudfield Says:

    Hey,

    I guess you should seek back to the previous position of the file too?

    cheers,

  7. eliben Says:

    Rene,

    Yes, definitely, if this file object is going to be later reused. I just wanted to focus on the heuristic here.

  8. Elazar Says:

    Your code will miss the common case of .run installer files, which are bash files with binary information tucked at their tail.

    http://megastep.org/makeself/

    Maybe you can also peek at the tail?

  9. eliben Says:

    Elazar,

    Frankly, I’m not familiar with this file format! Regardless, in this post I’m just trying to reproduce the algorithm built into Perl, not invent a new one, and obviously Perl didn’t deem such files important enough to read from the tail as well.

  10. Patrick (G) Says:

    Given UTF-16 text files, a single null byte shouldn’t indicate a binary file, but a sequence of two or more null bytes should.

    Also, for detecting UTF-16, you might want to test for a Byte Order Mark and take its presence as indicating a Unicode text file.
