Perl’s “guess if file is text or binary” implemented in Python
October 19th, 2011 at 8:06 am
Perl is a favorite language of system administrators for many reasons; for example, it has some built-in features very well suited for sysadmin scripts. One such feature is the -T and -B file test operators [1]. Recently I needed a similar feature in a Python script I was writing. Since Python doesn’t have it built-in, I became curious about how it works in Perl. Hence this post.
Here’s the relevant bit from the output of perldoc -f -B:
The "-T" and "-B" switches work as follows. The first block or
so of the file is examined for odd characters such as strange
control codes or characters with the high bit set. If too many
strange characters (>30%) are found, it’s a "-B" file; otherwise
it’s a "-T" file. Also, any file containing null in the first
block is considered a binary file. [...]
OK, that appears to be a reasonable heuristic. But "odd characters" and "strange control codes or characters" sounds too vague, so I decided to take a peek at the source code of Perl 5.10 [2] to see what it actually does. This functionality is implemented in function pp_fttext, which resides in pp_sys.c in the root directory of the source distribution. The -T and -B operators work on both file names and file handles, but for the sake of simplicity I will ignore this distinction, as well as some minor corner cases. Here’s what the code does:
- If the file is empty, it’s considered text.
- Otherwise, a buffer of up to 512 bytes is read from the file – this buffer will be examined for the heuristic.
- The variable odd is initialized to 0. It will count the chars that don’t appear to be text.
- Main loop – each byte of the buffer is examined in order:
  - If the byte is \x00, the heuristic immediately declares the file is binary and the function is finished.
  - The high bit of the byte is examined.
    - If the high bit is 0: odd is incremented if the byte is below ASCII code 32 (space) and not one of the known special values, such as newline or tab. In other words, if the byte isn’t an ASCII textual character.
    - If the high bit is 1: if this byte appears to be the first byte in the UTF-8 encoding of a code point above U+007F, attempt to decode the next bytes and see if they form a valid UTF-8 sequence. If they do, advance the loop pointer past this sequence. Otherwise, increment odd and proceed with the next byte.
- The main loop has ended, and now odd contains the number of chars in the buffer that don’t appear to be textual.
- If odd is higher than 30% of the length of the buffer, the file is considered binary. Otherwise, it’s considered text.
Here’s an implementation in Python of this heuristic [3], ignoring the UTF-8 case (meaning that if this implementation encounters true UTF-8 chars, it will count them as "odd"):
import sys

PY3 = sys.version_info[0] == 3

# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr

_text_characters = (
    b''.join(int2byte(i) for i in range(32, 127)) +
    b'\n\r\t\f\b')


def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
        by reading a single block of bytes from the file.
        If more than 30% of the chars in the block are non-text, or there
        are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30
The main point of interest here is the usage of the translate method, and particularly its second (optional) argument to delete a set of chars. Since translate is implemented in C, this method should be quite fast. Naturally, adding UTF-8 detection here shouldn’t be too hard, if required.
Note also that this code was written to run on both Python 2 and Python 3 without changes.

| [1] | And all file test operators in general, as they enable very succinct code for querying simple attributes of files. |
| [2] | ಠ_ಠ and my eyes are still bleeding! Appreciate the LOTR quotes though. |
| [3] | Loosely based on this recipe. |

October 19th, 2011 at 18:09
“odd” seems like a weird choice of variable name. May I suggest “quaint”?
October 19th, 2011 at 23:44
You jogged an old memory of mine! Here’s a recipe I wrote nearly 9 years ago: http://code.activestate.com/recipes/173220-test-if-a-file-or-string-is-text-or-binary/ . Looks like I omitted the “\f” character, and a commenter pointed out a division bug in my code.
October 20th, 2011 at 06:04
It may be nice if it supported UTF/UCS also.
It could also check for a BOM (though a BOM is not always present).
For example, UTF-16 encoded text will have a 0 byte almost every second character for plain characters/numbers.
I guess it depends on what the main use of the functionality is.
Thanks for the interesting post.
October 20th, 2011 at 08:31
nip,
But, “quaint” is such a peculiar word. Maybe it should be named “queer”?
Andrew,
Note that I point to your recipe in footnote 3. Thanks for writing it, I used it as the basis for my implementation.
Mark,
Is there a way to automatically detect UTF-16 encoded files?
October 20th, 2011 at 13:03
Oh! No, I didn’t see footnote #3. I’m glad you found it useful.
Regarding automatically detecting UTF-16; there used to be the ‘chardet’ module, but the author recently removed it and all other content from his web site. There’s a mirror at https://github.com/ramalho/chardet . It detects the character encoding using a set of state machines.
October 20th, 2011 at 13:08
Hey,
I guess you should seek back to the previous position of the file too?
cheers,
October 20th, 2011 at 16:03
Rene,
Yes, definitely, if this file object is going to be later reused. I just wanted to focus on the heuristic here.
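A minimal sketch of that, assuming the file object is seekable and opened in binary mode. `peek_block` is a hypothetical helper, not part of the code in the post; the heuristic would then be applied to the block it returns.

```python
import io


def peek_block(fileobj, blocksize=512):
    """Read up to blocksize bytes, then restore the original position."""
    pos = fileobj.tell()
    try:
        return fileobj.read(blocksize)
    finally:
        # Seek back even if read() raises, so the caller's position
        # is left untouched.
        fileobj.seek(pos)


# Usage: the stream position is the same before and after peeking.
stream = io.BytesIO(b'header line\n' + b'payload ' * 200)
stream.readline()
block = peek_block(stream)
```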
October 21st, 2011 at 12:28
Your code will miss the common case of .run installer files, which are bash files with binary information tucked at their tail. http://megastep.org/makeself/
Maybe you can also peek at the tail?
October 21st, 2011 at 14:03
Elazar,
Frankly, I’m not familiar with this file format! Regardless, in this post I’m just trying to reproduce the algorithm built into Perl, not invent a new one, and obviously Perl didn’t deem such files important enough to read from the tail as well.
October 21st, 2011 at 20:43
Given UTF-16 text files, a single null byte shouldn’t indicate a binary file, but a sequence of two or more null bytes should.
Also, for detecting UTF-16, you might want to test for a Byte Order Mark and take its presence as indicating a Unicode text file.
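The BOM test could be sketched roughly like this, using the BOM constants from Python's standard codecs module. `detect_bom` is a hypothetical name, and the idea of running it before the NUL-byte check is an assumption about how it would be wired in, not something Perl does.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM begins with
# the same two bytes as the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
)


def detect_bom(block):
    """Return the encoding name implied by a leading BOM, or None."""
    for bom, name in _BOMS:
        if block.startswith(bom):
            return name
    return None
```

A caller could treat any non-None result as "text" and skip the byte-counting heuristic. As noted earlier in the thread, a BOM is not always present, so a None result says nothing either way.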