Less copies in Python with the buffer protocol and memoryviews

November 28th, 2011 at 7:48 am

For one of the hobby projects I’m currently hacking on, I recently had to do a lot of binary data processing in memory. Large chunks of data are being read from a file, then examined and modified in memory and finally used to write some reports.

This made me think about the most efficient way to read data from a file into a modifiable memory chunk in Python. As we all know, the standard file read method, for a file opened in binary mode, returns a bytes object [1], which is immutable:

# Snippet #1

f = open(FILENAME, 'rb')
data = f.read()
# oops: TypeError: 'bytes' object does not support item assignment
data[0] = 97

This reads the whole contents of the file into data – a bytes object which is read only. But what if we now want to perform some modifications on the data? Then, we need to somehow get it into a writable object. The most straightforward writable data buffer in Python is a bytearray. So we can do this:

# Snippet #2

f = open(FILENAME, 'rb')
data = bytearray(f.read())
data[0] = 97 # OK!

Now, the bytes object returned by f.read() is passed into the bytearray constructor, which copies its contents into an internal buffer. Since data is a bytearray, we can manipulate it.

Although it appears that the goal has been achieved, I don’t like this solution. The extra copy made by bytearray is bugging me. Why is this copy needed? f.read() just returns a throwaway buffer we don’t need anyway – can’t we just initialize the bytearray directly, without copying a temporary buffer?

http://eli.thegreenplace.net/wp-content/uploads/2011/yes-we-can-thumb.jpg

This use case is one of the reasons the Python buffer protocol exists.

The buffer protocol – introduction

The buffer protocol is described in the Python documentation and in PEP 3118 [2]. Briefly, it provides a way for Python objects to expose their internal buffers to other objects. This is useful to avoid extra copies and for certain kinds of sharing. There are many examples of the buffer protocol in use. In the core language – in builtin types such as bytes and bytearray, in the standard library (for example array.array and ctypes) and 3rd party libraries (some important Python libraries such as numpy and PIL rely extensively on the buffer protocol for performance).

There are usually two or more parties involved in each protocol. In the case of the Python buffer protocol, the parties are a "producer" (or "provider") and a "consumer". The producer exposes its internals via the buffer protocol, and the consumer accesses those internals.

Here I want to focus specifically on one use of the buffer protocol that’s relevant to this article. The producer is the built-in bytearray type, and the consumer is a method in the file object named readinto.

A more efficient way to read into a bytearray

Here’s the way to do what Snippet #2 did, just without the extra copy:

# Snippet #3

f = open(FILENAME, 'rb')
data = bytearray(os.path.getsize(FILENAME))
f.readinto(data)

First, a bytearray is created and pre-allocated to the size of the data we’re going to read into it. The pre-allocation is important – since readinto directly accesses the internal buffer of bytearray, it won’t write more than has been allocated. Next, the file.readinto method is used to read the data directly into the bytearray’s internal storage, without going via temporary buffers.

The result: this code runs ~30% faster than snippet #2 [3].

Variations on the theme

Other objects and modules could be used here. For example, the built-in array.array class also supports the buffer protocol, so it can also be written and read from a file directly and efficiently. The same goes for numpy arrays. On the consumer side, the socket module can also read directly into a buffer with the read_into method. I’m sure that it’s easy to find many other sample uses of this protocol in Python itself and some 3rd partly libraries – if you find something interesting, please let me know.

The buffer protocol – implementation

Let’s see how Snippet #3 works under the hood using the buffer protocol [4]. We’ll start with the producer.

bytearray declares that it implements the buffer protocol by filling the tp_as_buffer slot of its type object [5]. What’s placed there is the address of a PyBufferProcs structure, which is a simple container for two function pointers:

typedef int (*getbufferproc)(PyObject *, Py_buffer *, int);
typedef void (*releasebufferproc)(PyObject *, Py_buffer *);
/* ... */
typedef struct {
     getbufferproc bf_getbuffer;
     releasebufferproc bf_releasebuffer;
} PyBufferProcs;

bf_getbuffer is the function used to obtain a buffer from the object providing it, and bf_releasebuffer is the function used to notify the object that the provided buffer is no longer needed.

The bytearray implementation in Objects/bytearrayobject.c initializes an instance of PyBufferProces thus:

static PyBufferProcs bytearray_as_buffer = {
    (getbufferproc)bytearray_getbuffer,
    (releasebufferproc)bytearray_releasebuffer,
};

The more interesting function here is bytearray_getbuffer:

static int
bytearray_getbuffer(PyByteArrayObject *obj, Py_buffer *view, int flags)
{
    int ret;
    void *ptr;
    if (view == NULL) {
        obj->ob_exports++;
        return 0;
    }
    ptr = (void *) PyByteArray_AS_STRING(obj);
    ret = PyBuffer_FillInfo(view, (PyObject*)obj, ptr, Py_SIZE(obj), 0, flags);
    if (ret >= 0) {
        obj->ob_exports++;
    }
    return ret;
}

It simply uses the PyBuffer_FillInfo API to fill the buffer structure passed to it. PyBuffer_FillInfo provides a simplified method of filling the buffer structure, which is suitable for unsophisticated objects like bytearray (if you want to see a more complex example that has to fill the buffer structure manually, take a look at the corresponding function of array.array).

On the consumer side, the code that interests us is the buffered_readinto function in Modules\_io\bufferedio.c [6]. I won’t show its full code here since it’s quite complex, but with regards to the buffer protocol, the flow is simple:

  1. Use the PyArg_ParseTuple function with the w* format specifier to parse its argument as a R/W buffer object, which itself calls PyObject_GetBuffer – a Python API that invokes the producer’s "get buffer" function.
  2. Read data from the file directly into this buffer.
  3. Release the buffer using the PyBuffer_Release API [7], which eventually gets routed to the bytearray_releasebuffer function in our case.

To conclude, here’s what the call sequence looks like when f.readinto(data) is executed in the Python code:

buffered_readinto
|
\--> PyArg_ParseTuple(..., "w*", ...)
|    |
|    \--> PyObject_GetBuffer(obj)
|         |
|         \--> obj->ob_type->tp_as_buffer->bf_getbuffer
|
|--> ... read the data
|
\--> PyBuffer_Release
     |
     \--> obj->ob_type->tp_as_buffer->bf_releasebuffer

Memory views

The buffer protocol is an internal implementation detail of Python, accessible only on the C-API level. And that’s a good thing, since the buffer protocol requires certain low-level behavior such as properly releasing buffers. Memoryview objects were created to expose it to a user’s Python code in a safe manner:

memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.

The linked documentation page explains memoryviews quite well and should be immediately comprehensible if you’ve reached so far in this article. Therefore I’m not going to explain how a memoryview works, just show some examples of its use.

It is a known fact that in Python, slices on strings and bytes make copies. Sometimes when performance matters and the buffers are large, this is a big waste. Suppose you have a large buffer and you want to pass just half of it to some function (that will send it to a socket or do something else [8]). Here’s what happens (annotated Python pseudo-code):

mybuf = ... # some large buffer of bytes
func(mybuf[:len(mybuf)//2])
  # passes the first half of mybuf into func
  # COPIES half of mybuf's data to a new buffer

The copy can be expensive if there’s a lot of data involved. What’s the alternative? Using a memoryview:

mybuf = ... # some large buffer of bytes
mv_mybuf = memoryview(mybuf) # a memoryview of mybuf
func(mv_mybuf[:len(mv_mybuf)//2])
  # passes the first half of mybuf into func as a "sub-view" created
  # by slicing a memoryview.
  # NO COPY is made here!

A memoryview behaves just like bytes in many useful contexts (for example, it supports the mapping protocol) so it provides an adequate replacement if used carefully. The great thing about it is that it uses the buffer protocol beneath the covers to avoid copies and just juggle pointers to data. The performance difference is dramatic – I timed a 300x speedup on slicing out a half of a 1MB bytes buffer when using a memoryview as demonstrated above. And this speedup will get larger with larger buffers, since it’s O(1) vs. the O(n) of copying.

But there’s more. On writable producers such as bytearray, a memoryview creates a writable view that can be modified:

>>> buf = bytearray(b'abcdefgh')
>>> mv = memoryview(buf)
>>> mv[4:6] = b'ZA'
>>> buf
bytearray(b'abcdZAgh')

This gives us a way to do something we couldn’t achieve by any other means – read from a file (or receive from a socket) directly into the middle of some existing buffer [9]:

buf = bytearray(...) # pre-allocated to the needed size
mv = memoryview(buf)
numread = f.readinto(mv[some_offset:])

Conclusion

This article demonstrated the Python buffer protocol, showing both how it works and what it can be used for. The main use of the buffer protocol to the Python programmer is optimization of certain patterns of coding, by avoiding unnecessary data copies.

Any mention of optimization in Python code is sure to draw fire from people claiming that if I want to write fast code, I shouldn’t use Python at all. But I disagree. Python these days is fast enough to be suitable for many tasks that were previously only in the domain of C/C++. I want to keep using it while I can, and only resort to low-level C/C++ when I must.

Employing the buffer protocol to have zero-copy buffer manipulations on the Python level is IMHO a huge boon that can stall (or even avoid) the transition of some performance-sensitive code from Python to C. That’s because when dealing with data processing, we often use a lot of C APIs anyway, the only Python overhead being the passing of data between these APIs. A speed boost in this code can make a huge difference and bring the Python code very close to the performance we could have with plain C.

The article also gave a glimpse into one aspect of the implementation of Python, hopefully showing that it’s not difficult at all to dive right into the code and understand how Python does something it does.

http://eli.thegreenplace.net/wp-content/uploads/hline.jpg

[1] In this article I’m focusing on the latest Python 3.x, although most of it also applies to Python 2.7
[2] The buffer protocol existed in Python prior to 2.6, but was then greatly enhanced. The PEP also describes the change that was made to the buffer protocol with the move to Python 3.x (and later backported to the 2.x line starting with 2.6).
[3]

This is on the latest build of Python 3.3. Roughly the same speedup can be seen on Python 2.7. With Python 3.2 there appears to be a speed regression that makes the two snippets perform similarly, but it has been fixed in 3.3

Another note on the benchmarking: it’s recommended to use large files (say, ~100 MB and up) to get reliable measurements. For small files too many irrelevant factors come into play and offset the benchmarks. In addition, the code should be run in a loop to avoid differences due to warm/cold disk cache issues. I’m using the timeit module, which is perfect for this purpose.

[4] All the code displayed here is taken from the latest development snapshot of Python 3.3 (default branch).
[5] Type objects are a fascinating aspect of Python’s implementation, and I hope to cover it in a separate article one day. Briefly, it allows Python objects to declare which services they provide (or, in other terms, which interfaces they implement). More information can be found in the documentation.
[6] Since the built-in open function, when asked to open a file in binary mode for reading, returns an io.BufferedReader object by default. This can be controlled with the buffering argument.
[7] Releasing the buffer structure is an important part of the buffer protocol. Each time it’s requested for a buffer, bytearray increments a reference count, and decrements it when the buffer is released. While the refcount is positive (meaning that there are consumer objects directly relying on the internal buffer), bytearray won’t agree to resize or do other operations that may invalidate the internal buffer. Otherwise, this would be an avenue for insidious memory bugs.
[8] Networking code is actually a common use case. When sending data over sockets, it’s frequently sliced and diced to build frames. This can involve a lot of copying. Other data-munging applications such as encryption and compression are also culprits.
[9] This code snippet was borrowed from Antoine Pitrou’s post on python-dev.

Related posts:

  1. Length-prefix framing for protocol buffers
  2. Boost.Asio with Protocol Buffers code sample
  3. Zipped dump of a SQLite database with Python
  4. Python objects, types, classes, and instances – a glossary
  5. Perl’s “guess if file is text or binary” implemented in Python

23 Responses to “Less copies in Python with the buffer protocol and memoryviews”

  1. Rene DudfieldNo Gravatar Says:

    Great post, as usual.

    Avoiding memory copies will often give you the biggest improvement in performance for many programs these days. Since programs are mostly waiting on IO, where the IO can be ram as well as disk/network. Avoiding allocating big chunks of memory, and copying big chunks of memory will therefore give you nice performance improvements.

    Another related technique is to use the mmap module. This can save you even more memory in some cases. If your file system has the file in cache, then using a mmap will allow you to use that memory allocated to cache, and you will not need to allocate any memory at all in your own program. Mmap can be managed into other memory view type python objects too.

    PyOpenGL and pygame are other python modules that make use of these techniques.

  2. elibenNo Gravatar Says:

    Rene,

    Thanks for the mmap bits – that’s very interesting.

  3. eryksunNo Gravatar Says:

    If you had data that you wanted to modify using slicing and indexing but also with a fast filelike interface, you could read it into a BytesIO object via its writable memoryview returned by getbuffer(). But the problem is the buffer can’t grow while the memoryview exists. Is there a simple way to initialize a BytesIO object to a certain size, similar to bytearray(size)? All I can think to do is repeatedly write(b'\0'), but I’d rather use an initializer implemented efficiently in C.

  4. elibenNo Gravatar Says:

    eryksun,

    I’m not aware of a way to directly pre-allocate a BytesIO, however its write method accepts a bytearray, which itself can be initialized as you mention. But this is also a copy, so I don’t know if it’s gaining anything.

  5. eryksunNo Gravatar Says:

    It occurs to me now that one could allocate a large bytes chunk (say 1 MiB), and a smaller one for the remainder, to reduce the number of loop iterations. I tested this with timeit. For a 128 MiB target, initializing the BytesIO buffer by writing a byte at a time took about a hundred times longer than using a pre-allocated bytes object. In contrast, writing a 1 MiB buffer 128 times was only 4 times slower. As expected the bottleneck is the Python loop and function calls.

  6. David BeazleyNo Gravatar Says:

    Thanks for the great post! I wanted to offer one bit of caution about one aspect of this. For data structures such as numpy arrays, memoryview objects don’t expose the data directly as a raw sequence of bytes. Instead, the memoryview object preserves the underlying array structure including dimensions and item size. Here is a simple example that illustrates this:

    >>> import numpy
    >>> a = numpy.array([1,2,3,4])
    >>> m = memoryview(a)
    >>> len(m)
    4
    >>> m[0]
    b'\x01\x00\x00\x00\x00\x00\x00\x00'
    >>> m.itemsize
    8
    >>> m.shape
    (4,)
    >>>

    Multidimensional arrays may not work very well at all:

    >>> a = numpy.array([[1,2,3],[4,5,6]])
    >>> m = memoryview(a)
    >>> len(m)
    2
    >>> m[0]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: cannot make memory view from a buffer with a NULL data pointer
    >>> m[0:2]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NotImplementedError
    >>>

    This view of memory can be extremely problematic to work with if you try to directly mix arrays with I/O operations such as those on sockets (recv_into, send, etc.). Part of the problem is that socket operations might fragment the data in a way that is unaligned with the underlying array structure. For example, the recv_into() method might only receive partial data whose length is not an even multiple of the array item size (e.g., you receive 15 bytes of data into an array of floats where each item is 8 bytes in size). If this happens, you don’t get an error. Moreover, there is (apparently) no simple way to resume receiving data where you left off because there is no means for indicating a partial item offset (or raw byte position) via the memoryview interface.

    I spent some time exploring this awhile back because I was interested in the idea of zero-copy message passing of numpy arrays through sockets. The outcome of that experiment was that it simply wasn’t possible using pure Python code and the buffer/memoryview interface. There were too many problems with data fragmentation and misalignment with the underlying array structure to make it work. Maybe I’m missing something obvious, but at the time it made my head hurt and I just gave up. I would love to know if any of the numpy developers have any thoughts or experience with this.

  7. elibenNo Gravatar Says:

    David,

    Interesting comment. I never considered this, since I’m not really doing work with numpy (my interest in the buffer protocol is more on the networking / binary file crunching side), but what you’re saying makes a lot of sense.

    I wonder if some kind of “buffer stealing” mechanism would work for numpy. E.g. you build a buffer on the side and zero-copy read into it from a socket. When you’re done, you instantiate a numpy array and tell it – here’s your internal buffer, take it! This may be wishful thinking since this kind of interface is C-ish and not really something done in Python (I’m not even sure how to describe its semantics).

  8. eryksunNo Gravatar Says:

    David,

    Regarding multidimensional array views — subscripting isn’t implemented at all. See here and here in memoryobject.c.

  9. David BeazleyNo Gravatar Says:

    There might be some way to hack buffer stealing via ctypes or some other low-level mechanism. However, once you start doing that it, you get into all sorts of issues about who owns which memory and so forth (a good way to get a segmentation fault).

    I almost wonder if the problem would be better solved by extending the recv_into() method so that it allows a raw byte-offset to be supplied so that you can resume receiving where you left off. That’s really the only clean solution that I can think of.

  10. elibenNo Gravatar Says:

    David,

    If you think such an addition will solve the problem (I can’t really judge since I’m not as familiar with the details as you are), then by all means propose it to python-ideas. I will wholeheartedly support it, since it’s really easy to implement and harmless.

  11. eryksunNo Gravatar Says:

    numpy has frombuffer. If you pass it a bytes object you get a read-only array. If you pass it a bytearray or array.array you get a writeable array shared between the two objects.

  12. Elazar LeibovichNo Gravatar Says:

    memoryview looks similar to Go (golang) slices.

  13. Dave RawksNo Gravatar Says:

    Your example in snippet #3 has a race condition where the file can be written to between the time that it is stat’d for size and the time it is actually read into the byte array. Best case scenario is that you are ONLY reading the file and something else is appending to it (e.g. a log file) and your read just gets a truncated version of the file at the length of the stat, worst case scenario is that the file is actually shorter than the stat’d size at the time you do the read and then your read blocks, perhaps indefinitely.

  14. elibenNo Gravatar Says:

    Dave,

    You’re right, but I just left the whole issue of concurrent access to the file out of this article, since my point was to focus on the buffers. The file read is just an example – I imagine that for the case of concurrency all the usual solutions apply.

  15. Sivaram KannanNo Gravatar Says:

    Great article. Excatly what I was looking for. I was trying to understand how to read a file to a buffer and send the data across the network with Asyncore for almost a week now.Not much information available linking File Streams – Asyncore – buffers, got some clarity regarding my requirement reading this. Thanks for the article.

  16. laike9mNo Gravatar Says:

    Eli, thank you for your great article!
    How can I subscribe to your blog?

  17. elibenNo Gravatar Says:

    @laike9m,

    http://eli.thegreenplace.net/feed/

  18. TimNo Gravatar Says:

    This is a great explanation about memoryview, thank you. I have a problem though which I have not been able to find a solution to anywhere. What I need to do is to sort slices of a very large string which represents a genome and an end symbol. A short example would be AAAATGGGAC$. So far, I have done this by creating a range list of the length of the genome, and then using slice indexing on a buffer object of the genome:

    genome = AAAATGGGAC$
    order = range(len(genome))
    suffix_order = sorted(order, key=lambda x: buffer(data, x))

    This works perfectly with the old buffer object, but as I read that this was being discontinued, I decided to try it with memoryview instead with the following code for suffix_order:

    suffix_order = sorted(order, key=lambda x: memoryview(a)[x:])

    This unfortunately fails as the sorting cannot be done on the content of the slice, but on the memory location instead. I have seen in forums that a solution is to use the tobytes() function on the memoryview object for any comparisons, but this kind of defeats the purpose as it creates a new object in memory and has a detrimental performance on the speed. When dealing with strings of length > 100,000 – this workaround is unusable.

    The instructor of my course suggested using string pointers to compare the different substrings of the genome. I have never used pointers before, and do not know any languages like C or C++. I don’t want to have to rely on buffer() if it is to be discontinued, so I was wondering if you know of any other solutions to this problem?

  19. elibenNo Gravatar Says:

    @Tim

    Are you trying to sort the slice of genome itself, or use it in comparison? If the latter, then how?

  20. TimNo Gravatar Says:

    What I am trying to do is to construct a suffix array of the genome. I don’t know if you know anything about this, so apologies if this is simplistic (I am only just learning it myself!). So I need to create all possible suffixes of a genome, and then sort the indices of where they appear in the genome. So for example with GCAT$:

    0 - GCAT$
    1 - CAT$
    2 - AT$
    3 - T$
    4 - $
    
    4 - $
    2 - AT$
    1 - CAT$
    0 - GCAT$
    3 - T$

    So my function would need to return a list of [4, 2, 1, 0, 3]. So, in answer to your question, I need to sort all the different slices of the same genome with each other.

  21. elibenNo Gravatar Says:

    @Tim,

    At this point it really makes more sense to take this to Stack Overflow :)

  22. TimNo Gravatar Says:

    Will do, thanks for your replies. I only came here as this was a really nice description of memoryview here. :)

  23. SteveADNo Gravatar Says:

    Well, you can use Long Path Tool for such issues, it works good.

Leave a Reply

To post code with preserved formatting, enclose it in `backticks` (even multiple lines)