Perl’s “guess if file is text or binary” implemented in Python

October 19th, 2011 at 8:06 am

Perl is a favorite language of system administrators for many reasons; for example, it has some built-in features very well suited for sysadmin scripts. One such feature is the -T and -B file test operators [1]. Recently I needed a similar feature in a Python script I was writing. Since Python doesn’t have it built-in, I became curious about how it works in Perl. Hence this post.

Here’s the relevant bit from the output of perldoc -f -B:

The "-T" and "-B" switches work as follows. The first block or
so of the file is examined for odd characters such as strange
control codes or characters with the high bit set. If too many
strange characters (>30%) are found, it’s a "-B" file; otherwise
it’s a "-T" file. Also, any file containing null in the first
block is considered a binary file. [...]

OK, that appears to be a reasonable heuristic. But "odd characters" and "strange control codes or characters" sound too vague, so I decided to take a peek at the source code of Perl 5.10 [2] to see what it actually does. This functionality is implemented in the function pp_fttext, which resides in pp_sys.c in the root directory of the source distribution. The -T and -B operators work on both file names and file handles, but for the sake of simplicity I will ignore this distinction, as well as some minor corner cases. Here’s what the code does:

  • If the file is empty, it’s considered text.
  • Otherwise, a buffer of up to 512 bytes is read from the file – this buffer will be examined for the heuristic.
  • The variable odd is initialized to 0. It will count the chars that don’t appear to be text.
  • Main loop – each byte of the buffer is examined in order:
    • If the byte is \x00, the heuristic immediately declares the file is binary and the function is finished.
    • The high bit of the byte is examined.
    • If the high bit is 0: odd is incremented if the byte is below ASCII code 32 (space) and not one of the known special values, such as newline or tab. In other words, if the byte isn’t an ASCII textual character.
    • If the high bit is 1: if this byte appears to be the first byte of a UTF-8 encoding of a code point above U+007F, attempt to decode the following bytes and see if they form a valid UTF-8 sequence. If they do, advance the loop pointer past this sequence. Otherwise, increment odd and proceed with the next character.
  • The main loop has ended, and now odd contains the number of chars that don’t appear to be textual in the buffer.
  • If odd is higher than 30% of the length of the buffer, the file is considered binary. Otherwise, it’s considered text.

Here’s an implementation in Python of this heuristic [3], ignoring the UTF-8 case (meaning that if this implementation encounters true UTF-8 chars, it will count them as "odd"):

import sys
PY3 = sys.version_info[0] == 3

# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr

_text_characters = (
        b''.join(int2byte(i) for i in range(32, 127)) +
        b'\n\r\t\f\b')

def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
        by reading a single block of bytes from the file.
        If more than 30% of the chars in the block are non-text, or there
        are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30
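
To use it, the file should be opened in binary mode, so that the block read is a bytes object on Python 3 (otherwise the b'\x00' check would raise a TypeError there). A minimal invocation could look like this, with the file name being just a placeholder:

with open('somefile', 'rb') as fileobj:
    # True for a file that looks like text, False for one that looks binary
    print(istextfile(fileobj))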

The main point of interest here is the usage of the translate method, and particularly its second (optional) argument to delete a set of chars. Since translate is implemented in C, this method should be quite fast. Naturally, adding UTF-8 detection here shouldn’t be too hard, if required.

Note also that this code was written to run on both Python 2 and Python 3 without changes.
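
For illustration, here is one possible way to add the UTF-8 awareness mentioned above. It’s only a rough sketch, not a faithful port of Perl’s byte-by-byte scan: it tries to decode the whole block as UTF-8 up front, and only falls back to the byte-level heuristic when that fails. The name istextfile_utf8 is mine, and it reuses _text_characters from above:

def istextfile_utf8(fileobj, blocksize=512):
    """ Like istextfile, but doesn't count valid multi-byte UTF-8
        sequences as 'odd' characters.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True
    try:
        decoded = block.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8 (note: the block may simply have been cut in
        # the middle of a multi-byte sequence); fall back to the plain
        # byte-level heuristic from istextfile.
        nontext = block.translate(None, _text_characters)
        return float(len(nontext)) / len(block) <= 0.30
    # The block decoded cleanly: count only control characters that
    # aren't common whitespace as 'odd'; code points above U+007F are
    # treated as text.
    odd = sum(1 for ch in decoded
              if ord(ch) < 32 and ch not in '\n\r\t\f\b')
    return float(odd) / len(decoded) <= 0.30

This diverges a bit from Perl’s behavior (Perl validates UTF-8 sequence by sequence inside the main loop rather than decoding the whole block), but it keeps the spirit of the heuristic.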

[1] And all file test operators in general, as they enable very succinct code for querying simple attributes of files.
[2] ಠ_ಠ and my eyes are still bleeding! Appreciate the LOTR quotes though.
[3] Loosely based on this recipe by Andrew Dalke: http://code.activestate.com/recipes/173220-test-if-a-file-or-string-is-text-or-binary/

10 Responses to “Perl’s “guess if file is text or binary” implemented in Python”

  1. nip Says:

    “odd” seems like a weird choice of variable name. May I suggest “quaint”?

  2. Andrew Dalke Says:

    You jogged an old memory of mine! Here’s a recipe I wrote nearly 9 years ago: http://code.activestate.com/recipes/173220-test-if-a-file-or-string-is-text-or-binary/ . Looks like I omitted the “\f” character, and a commenter pointed out a division bug in my code.

  3. Mark Says:

    It may be nice if it supported UTF/UCS also.
    It could also check for a BOM (though a BOM is not always present)

    For example, UTF-16-encoded text will have a 0 byte roughly every second byte for plain characters/numbers.

    I guess it depends on what the main use of the functionality is :)

    Thanks for the interesting post.

  4. eliben Says:

    nip,

    But, “quaint” is such a peculiar word. Maybe it should be named “queer”?

    Andrew,

    Note that I point to your recipe in footnote 3. Thanks for writing it, I used it as the basis for my implementation.

    Mark,

    Is there a way to automatically detect UTF-16 encoded files?

  5. Andrew Dalke Says:

    Oh! No, I didn’t see footnote #3. I’m glad you found it useful.

    Regarding automatically detecting UTF-16; there used to be the ‘chardet’ module, but the author recently removed it and all other content from his web site. There’s a mirror at https://github.com/ramalho/chardet . It detects the character encoding using a set of state machines.

  6. Rene Dudfield Says:

    Hey,

    I guess you should seek back to the previous position of the file too?

    cheers,

  7. eliben Says:

    Rene,

    Yes, definitely, if this file object is going to be later reused. I just wanted to focus on the heuristic here.

  8. Elazar Says:

    Your code will miss the common case of .run installer files, which are bash files with binary information tucked at their tail.

    http://megastep.org/makeself/

    Maybe you can also peek at the tail?

  9. eliben Says:

    Elazar,

    Frankly, I’m not familiar with this file format! Regardless, in this post I’m just trying to reproduce the algorithm built into Perl, not invent a new one, and obviously Perl didn’t deem such files important enough to read from the tail as well.

  10. Patrick (G) Says:

    Given UTF-16 text files, a single null byte shouldn’t indicate a binary file, but a sequence of two or more null bytes should.

    Also, for detecting UTF-16, you might want to test for a Byte Order Mark and take its presence as indicating a Unicode text file.
