Introduction

This article is the first in a short series in which I intend to explain how CPython [1] implements and uses symbol tables in its quest to compile Python source code into bytecode. In this part I will explain what a symbol table is and show how the general concepts apply to Python. In the second part I will delve into the implementation of symbol tables in the core of CPython.

So what is a symbol table?

As usual, it's hard to beat Wikipedia for a succinct definition:

In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier in a program's source code is associated with information relating to its declaration or appearance in the source, such as its type, scope level and sometimes its location.

Symbol tables are used by practically all compilers. They're especially important in statically typed languages where all variables have types and type checking by the compiler is an important part of the front-end.

Consider this C code:

int main()
{
    int aa, bb;

    bb = *aa;

    {
        int* aa;
        bb = *aa;
    }

    return 0;
}

There are two distinct assignments of the form bb = *aa here, but only the second one is legal. The compiler throws the following error when it sees the first one:

error: invalid type argument of ‘unary *’ (have ‘int’)

How does the compiler know that the * operator is given an argument of type int, which is invalid for this operator? The answer is: the symbol table. The compiler sees *aa and asks itself what the type of aa is. To answer this question it consults the symbol table it constructed earlier. The symbol table contains the declared types for all variables the compiler encountered in the code.

This example demonstrates another important concept - for most languages a single symbol table with information about all variables won't do. The second assignment is valid, because in the internal scope created by the curly braces aa is redefined to be of a pointer type. Thus, to correctly compile such code the C compiler has to keep a separate symbol table per scope [2].

A digression: "variables" in Python

So far I've been using the term "variable" liberally. Just to be on the safe side, let's clarify what is meant by variable in Python. Formally, Python doesn't really have variables in the sense C has. Rather, Python has symbolic names bound to objects:

aa = [1, 2, 3]
bb = aa
aa[0] = 666

In this code, aa is a name bound to a list object. bb is a name bound to the same object. The third line modifies the list through aa, and if we print out bb we'll see the modified list as well.
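
We can verify this in the interpreter - a tiny sketch (the comments show what I'd expect to see printed):

aa = [1, 2, 3]
bb = aa            # bb is now another name bound to the same list object
aa[0] = 666        # mutate the object through the name aa

print(bb)          # [666, 2, 3] - the change is visible through bb too
print(aa is bb)    # True - both names are bound to one and the same object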

Now, once this is understood, I will still use the term "variable" from time to time since it's occasionally convenient and everybody's used to it anyway.

Symbol tables for Python code

Alright, so symbol tables are very useful for type checking. But Python doesn't have compile-time type checking (duck typing FTW!), so what does CPython need symbol tables for?

The CPython compiler still has to resolve what kinds of variables are used in the code. Variables in Python can be local, global or even bound by a lexically enclosing scope. For example:

def outer(aa):
    def inner():
        bb = 1
        return aa + bb + cc
    return inner

The function inner uses three variables: aa, bb and cc. They're all different from Python's point of view: aa is lexically bound in outer, bb is locally bound in inner itself, and cc is not bound anywhere in sight, so it's treated as global. The bytecode generated for inner clearly shows the different treatment of these variables:

5           0 LOAD_CONST               1 (1)
            3 STORE_FAST               0 (bb)

6           6 LOAD_DEREF               0 (aa)
            9 LOAD_FAST                0 (bb)
           12 BINARY_ADD
           13 LOAD_GLOBAL              0 (cc)
           16 BINARY_ADD
           17 RETURN_VALUE

As you can see, different opcodes are used for loading the variables onto the stack prior to applying BINARY_ADD. LOAD_DEREF is used for aa, LOAD_FAST for bb and LOAD_GLOBAL for cc.
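
If you want to reproduce such a listing yourself, the standard dis module can disassemble the code object of the nested function - a quick sketch (the exact opcodes, offsets and numbering vary between CPython versions, so your output may not match the listing above byte for byte):

import dis

def outer(aa):
    def inner():
        bb = 1
        return aa + bb + cc
    return inner

# outer() returns the inner closure; dis.dis disassembles its code object.
dis.dis(outer(42))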

At this point, there are three different directions we can pursue on our path to deeper understanding of Python:

  1. Figure out the exact semantics of variables in Python - when are they local, when are they global and what exactly makes them lexically bound.
  2. Understand how the CPython compiler knows the difference.
  3. Learn about the different bytecode opcodes for these variables and how they affect the way the VM runs code.

I won't even try going into (1) since it's a broad topic completely out of the scope of this article. There are plenty of resources online - start with the official documentation and continue Googling until you're fully enlightened. (3) is also out of scope, as I'm currently focusing on the front-end of CPython. If you're interested, there's an excellent series of in-depth articles on Python focusing on the back-end, with a nice treatment of this very issue.

To answer (2) we need to understand how CPython uses symbol tables, which is what this series of articles is about.

Where symbol tables fit in

A high-level view of the front-end of CPython is:

  1. Parse source code into a parse tree
  2. Transform parse tree into an Abstract Syntax Tree
  3. Transform AST into a Control Flow Graph
  4. Emit bytecode based on the Control Flow Graph

Symbol tables are created in step 3. The compiler builds a symbol table from the AST representing the Python source code. This symbol table, in conjunction with the AST, is then used to generate the control flow graph (CFG) and ultimately the bytecode.
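
Much of this pipeline can be poked at from pure Python - a rough sketch (the CFG of step 3 is internal to the compiler and isn't exposed, but the symbol table built at that stage is, via the symtable module introduced in the next section):

import ast
import dis
import symtable

source = "xx = 1\nprint(xx + 2)\n"

tree = ast.parse(source)                                # step 2: the AST
table = symtable.symtable(source, '<string>', 'exec')   # built during step 3
code = compile(source, '<string>', 'exec')              # steps 3-4: bytecode

print(ast.dump(tree))
print(table.get_identifiers())   # identifiers known to the top-level table
dis.dis(code)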

Exploring the symbol table

CPython does a great job exposing some of its internals via standard-library modules [3]. The symbol table is yet another internal data structure that can be explored from the outside, in pure Python code, with the help of the symtable module. From its description:

Symbol tables are generated by the compiler from AST just before bytecode is generated. The symbol table is responsible for calculating the scope of every identifier in the code. symtable provides an interface to examine these tables.

The symtable module provides a lot of information on the various identifiers encountered in Python code. Apart from reporting their scope, it allows us to find out which identifiers are referenced in their scope, which are assigned in their scope, which define new namespaces (like functions), and so on. To help with exploring the symbol table, I've written the following function that simplifies working with the module:

import symtable

def describe_symbol(sym):
    """Print the name of a symtable.Symbol and the properties set on it."""
    assert isinstance(sym, symtable.Symbol)
    print("Symbol:", sym.get_name())

    # Each property is exposed as an is_<prop>() predicate on the Symbol.
    for prop in [
            'referenced', 'imported', 'parameter',
            'global', 'declared_global', 'local',
            'free', 'assigned', 'namespace']:
        if getattr(sym, 'is_' + prop)():
            print('    is', prop)
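
describe_symbol expects symtable.Symbol objects. To obtain them for inner, we can build the module-level symbol table and walk down into the nested scopes - a minimal driver sketch (this isn't the article's exact code, and the order in which the symbols come out may differ between CPython versions):

code = '''
def outer(aa):
    def inner():
        bb = 1
        return aa + bb + cc
    return inner
'''

table = symtable.symtable(code, '<string>', 'exec')    # module-level scope
outer_table = table.get_children()[0]                  # scope of outer
inner_table = outer_table.get_children()[0]            # scope of inner

for sym in inner_table.get_symbols():
    describe_symbol(sym)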

Let's see what it has to say about the inner function from the example above:

Symbol: aa
    is referenced
    is free
Symbol: cc
    is referenced
    is global
Symbol: bb
    is referenced
    is local
    is assigned

Indeed, we see that the symbol table marks aa as lexically bound, or "free" (more on this in the next section), bb as local and cc as global. It also tells us that all these variables are referenced in the scope of inner and that bb is assigned in that scope.

The symbol table contains other useful information as well. For example, it can help distinguish between explicitly declared globals and implicit globals:

def outer():
    global gg
    return ff + gg

In this code both ff and gg are global to outer, but only gg was explicitly declared global. The symbol table knows this - the output of describe_symbol for this function is:

Symbol: gg
    is referenced
    is global
    is declared_global
Symbol: ff
    is referenced
    is global
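
The same distinction can be queried directly by looking up the symbols in the scope of outer - another small sketch along the lines of the driver above:

code = '''
def outer():
    global gg
    return ff + gg
'''

table = symtable.symtable(code, '<string>', 'exec')
outer_table = table.get_children()[0]                   # scope of outer

print(outer_table.lookup('gg').is_declared_global())    # True - explicit 'global gg'
print(outer_table.lookup('ff').is_declared_global())    # False - only implicitly global
print(outer_table.lookup('ff').is_global())             # True nonetheless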

Free variables

Unfortunately, there's a shorthand in the core of CPython that may initially confuse readers as to what exactly constitutes a "free" variable. Fortunately, it's a very slight confusion that's easy to clear up. The execution model reference says:

If a variable is used in a code block but not defined there, it is a free variable.

This is consistent with the formal definition. In the source, however, "free" is actually used as a shorthand for "lexically bound free variable" (i.e. variables for which a binding has been found in an enclosing scope), with "global" being used to refer to all remaining free variables. So when reading the CPython source code it is important to remember that the full set of free variables includes both the variables tagged specifically as "free", as well as those tagged as "global".

Thus, to avoid confusion, I say "lexically bound" when I want to refer to the variables CPython actually treats as "free".
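
The shorthand is also visible directly in the symtable module: function scopes expose get_frees(), get_globals() and get_locals() methods. Reusing the inner_table from the driver sketch earlier (my own helper, not something from the article), the three variables of inner land in three different buckets:

print(inner_table.get_frees())    # ('aa',) - "free" in the CPython shorthand sense
print(inner_table.get_globals())  # ('cc',) - free in the formal sense, but tagged global
print(inner_table.get_locals())   # ('bb',)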

Catching errors

Although Python is duck-typed, some things can still be enforced at compile-time. The symbol table is a powerful tool allowing the compiler to catch some errors [4].

For example, it's not allowed to declare function parameters as global:

def outer(aa):
    global aa

When compiling this function, the error is caught while constructing the symbol table:

Traceback (most recent call last):
  File "symtab_1.py", line 33, in <module>
    table = symtable.symtable(code, '<string>', 'exec')
  File "symtable.py", line 13, in symtable
    raw = _symtable.symtable(code, filename, compile_type)
  File "<string>", line 2
SyntaxError: name 'aa' is parameter and global

The symbol table is useful here since it knows that aa is a parameter in the scope of outer, so when global aa is encountered it's a sure sign of an error.

Other errors are also caught while the symbol table is built: duplicate argument names in functions, using import * inside functions, returning values inside generators, and a few more.
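
For instance, feeding a function with a duplicate parameter name to symtable.symtable trips the same machinery - a sketch (the exact wording of the message may vary between CPython versions):

import symtable

try:
    symtable.symtable("def ff(aa, aa): pass", '<string>', 'exec')
except SyntaxError as err:
    print(err)    # e.g. "duplicate argument 'aa' in function definition"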

Conclusion

This article serves mainly as an introduction for the next one, where I plan to explore the actual implementation of symbol tables in the core of CPython. A symbol table is a tool designed to solve some problems for the compiler, and I hope this article did a fair job describing a few of these problems and related terminology.


Special thanks to Nick Coghlan for reviewing this article.

[1] You'll note that in this article I'm using the terms Python and CPython interchangeably. They're not the same - by Python I mean the language (version 3.x) and by CPython I mean the official C implementation of the compiler + VM. There are several implementations of the Python language in existence, and while they all implement the same specification, they may do it differently.
[2] It's even more complex than that, but we're not here to talk about C compilers. In the next article I will explain the exact structure of the symbol tables used by CPython.
[3] I've previously discussed how to use the ast module to tap into the compilation process of CPython.
[4] Actually, if you grep the CPython source, you'll find that a good proportion of the SyntaxError exceptions thrown by the compiler come from Python/symtable.c.