Tags Python

In Python, a generator is a function that returns an iterator object. There are a couple of ways to do this, but the most elegant and common one is to use the yield statement.

For example, here is a simple synthetic example:

def pyrevgen(seq):
    for i, elem in enumerate(reversed(seq)):
        yield i, elem

The pyrevgen function is a generator. Given any sequence, it returns an iterator that yields the sequences's elements in reversed order, and also enumerates them. For example:

>>> for i, e in pyrevgen(['a', 'b', 'c']):
...   print(i, e)
...
0 c
1 b
2 a

The goal of this post is to demonstrate how to achieve the same effect using the Python C API; in other words, in a C extension module. The focus is on Python 3 - for Python 2 the principle is the same although there may be some differences in the details.

yield is a very powerful construct in Python. In C there is no equivalent ability (unless you use some co-routine macro-fu, but this is outside our scope here). Therefore, we have to explicitly return an iterator object and let it handle the details of the iteration.

To write an iterator in Python we have to create a class that implements the __iter__ and __next__ special methods. The equivalent methods in the C API are tp_iter and tp_iternext, respectively.

We'll create a new extension module, named spam. It will export a single object - the revgen type which will be callable similarly to the Python code above. In other words, the client Python code will be able to do:

import spam
for i, e in spam.revgen(['a', 'b', 'c']):
  print(i, e)

Let's begin (a link to the full source code is available at the end of this post):

PyTypeObject PyRevgen_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "revgen",                       /* tp_name */
    sizeof(RevgenState),            /* tp_basicsize */
    0,                              /* tp_itemsize */
    (destructor)revgen_dealloc,     /* tp_dealloc */
    0,                              /* tp_print */
    0,                              /* tp_getattr */
    0,                              /* tp_setattr */
    0,                              /* tp_reserved */
    0,                              /* tp_repr */
    0,                              /* tp_as_number */
    0,                              /* tp_as_sequence */
    0,                              /* tp_as_mapping */
    0,                              /* tp_hash */
    0,                              /* tp_call */
    0,                              /* tp_str */
    0,                              /* tp_getattro */
    0,                              /* tp_setattro */
    0,                              /* tp_as_buffer */
    Py_TPFLAGS_DEFAULT,             /* tp_flags */
    0,                              /* tp_doc */
    0,                              /* tp_traverse */
    0,                              /* tp_clear */
    0,                              /* tp_richcompare */
    0,                              /* tp_weaklistoffset */
    PyObject_SelfIter,              /* tp_iter */
    (iternextfunc)revgen_next,      /* tp_iternext */
    0,                              /* tp_methods */
    0,                              /* tp_members */
    0,                              /* tp_getset */
    0,                              /* tp_base */
    0,                              /* tp_dict */
    0,                              /* tp_descr_get */
    0,                              /* tp_descr_set */
    0,                              /* tp_dictoffset */
    0,                              /* tp_init */
    PyType_GenericAlloc,            /* tp_alloc */
    revgen_new,                     /* tp_new */
};


static struct PyModuleDef spammodule = {
   PyModuleDef_HEAD_INIT,
   "spam",                  /* m_name */
   "",                      /* m_doc */
   -1,                      /* m_size */
};


PyMODINIT_FUNC
PyInit_spam(void)
{
    PyObject *module = PyModule_Create(&spammodule);
    if (!module)
        return NULL;

    if (PyType_Ready(&PyRevgen_Type) < 0)
        return NULL;
    Py_INCREF((PyObject *)&PyRevgen_Type);
    PyModule_AddObject(module, "revgen", (PyObject *)&PyRevgen_Type);

    return module;
}

This is standard code for creating a new module and a new type in it. The module initialization function (PyInit_spam) adds a single object to the module, named revgen. This object is the PyRevgen_Type type. By "calling" this object, the user is able to create new instances of the type.

The following structure ("subclass" of PyObject) is going to represent an instance of revgen:

/* RevgenState - reverse generator instance.
 *
 * sequence: ref to the sequence that's being iterated
 * seq_index: index of the next element in the sequence to yield
 * enum_index: next enumeration index to yield
 *
 * In pseudo-notation, the yielded tuple at each step is:
 *  enum_index, sequence[seq_index]
 *
*/
typedef struct {
    PyObject_HEAD
    Py_ssize_t seq_index, enum_index;
    PyObject *sequence;
} RevgenState;

The most interesting thing to note here is the reference to the sequence we're iterating on. The iterator needs this instance to be able to access the sequence each time next is called on it.

And here is the function responsible for creating new instances. It is assigned to the tp_new slot of the type. Note that we don't assign tp_init, so the default "do-nothing" initializer will be used:

static PyObject *
revgen_new(PyTypeObject *type, PyObject *args, PyObject *kwargs)
{
    PyObject *sequence;

    if (!PyArg_UnpackTuple(args, "revgen", 1, 1, &sequence))
        return NULL;

    /* We expect an argument that supports the sequence protocol */
    if (!PySequence_Check(sequence)) {
        PyErr_SetString(PyExc_TypeError, "revgen() expects a sequence");
        return NULL;
    }

    Py_ssize_t len = PySequence_Length(sequence);
    if (len == -1)
        return NULL;

    /* Create a new RevgenState and initialize its state - pointing to the last
     * index in the sequence.
    */
    RevgenState *rgstate = (RevgenState *)type->tp_alloc(type, 0);
    if (!rgstate)
        return NULL;

    Py_INCREF(sequence);
    rgstate->sequence = sequence;
    rgstate->seq_index = len - 1;
    rgstate->enum_index = 0;

    return (PyObject *)rgstate;
}

This is a straightforward tp_new implementation. It makes sure the object it's expected to iterate on is a sequence, and initializes the state to prepare returning the last element in the first next call. The corresponding de-allocation function is also unsurprising:

static void
revgen_dealloc(RevgenState *rgstate)
{
    /* We need XDECREF here because when the generator is exhausted,
     * rgstate->sequence is cleared with Py_CLEAR which sets it to NULL.
    */
    Py_XDECREF(rgstate->sequence);
    Py_TYPE(rgstate)->tp_free(rgstate);
}

Now what's left to see is the implementations of the tp_iter and tp_iternext slots. Since our type is an iterator, it can just assign PyObject_SelfIter function for the tp_iter slot. tp_iternext is where the interesting work is happening. This is what gets invoked when the next built-in is called on the iterator, and by extension also when the iterator is consumed by a for loop:

static PyObject *
revgen_next(RevgenState *rgstate)
{
    /* seq_index < 0 means that the generator is exhausted.
     * Returning NULL in this case is enough. The next() builtin will raise the
     * StopIteration error for us.
    */
    if (rgstate->seq_index >= 0) {
        PyObject *elem = PySequence_GetItem(rgstate->sequence,
                                            rgstate->seq_index);
        /* Exceptions from PySequence_GetItem are propagated to the caller
         * (elem will be NULL so we also return NULL).
        */
        if (elem) {
            PyObject *result = Py_BuildValue("lO", rgstate->enum_index, elem);
            rgstate->seq_index--;
            rgstate->enum_index++;
            return result;
        }
    }

    /* The reference to the sequence is cleared in the first generator call
     * after its exhaustion (after the call that returned the last element).
     * Py_CLEAR will be harmless for subsequent calls since it's idempotent
     * on NULL.
    */
    rgstate->seq_index = -1;
    Py_CLEAR(rgstate->sequence);
    return NULL;
}

The most important point to remember here is this - the state of the iteration should be wholly kept in the iterator object. Comparing to the Python implementation, this means quite a bit more work. The Python yield statement allows us to use the Python interpreter itself to keep the state of execution for us. This is what makes co-routines in Python so powerful - very little explicit state has to be kept manually. As I mentioned in the beginning of the post, in C extensions we don't have this luxury, so we have to go the manual way. Since the current example is very simple and linear, this is relatively easy. In more complex examples, some ingenuity is often required to design the state object and tp_iternext function correctly.

The full code of this post, together with a simple Python test script and a setup.py for building the extension with distutils is available for download here.