Less copies in Python with the buffer protocol and memoryviews
November 28th, 2011 at 7:48 amFor one of the hobby projects I’m currently hacking on, I recently had to do a lot of binary data processing in memory. Large chunks of data are being read from a file, then examined and modified in memory and finally used to write some reports.
This made me think about the most efficient way to read data from a file into a modifiable memory chunk in Python. As we all know, the standard file read method, for a file opened in binary mode, returns a bytes object [1], which is immutable:
# Snippet #1
f = open(FILENAME, 'rb')
data = f.read()
# oops: TypeError: 'bytes' object does not support item assignment
data[0] = 97
This reads the whole contents of the file into data – a bytes object which is read only. But what if we now want to perform some modifications on the data? Then, we need to somehow get it into a writable object. The most straightforward writable data buffer in Python is a bytearray. So we can do this:
# Snippet #2
f = open(FILENAME, 'rb')
data = bytearray(f.read())
data[0] = 97 # OK!
Now, the bytes object returned by f.read() is passed into the bytearray constructor, which copies its contents into an internal buffer. Since data is a bytearray, we can manipulate it.
Although it appears that the goal has been achieved, I don’t like this solution. The extra copy made by bytearray is bugging me. Why is this copy needed? f.read() just returns a throwaway buffer we don’t need anyway – can’t we just initialize the bytearray directly, without copying a temporary buffer?
![]()
This use case is one of the reasons the Python buffer protocol exists.
The buffer protocol – introduction
The buffer protocol is described in the Python documentation and in PEP 3118 [2]. Briefly, it provides a way for Python objects to expose their internal buffers to other objects. This is useful to avoid extra copies and for certain kinds of sharing. There are many examples of the buffer protocol in use. In the core language – in builtin types such as bytes and bytearray, in the standard library (for example array.array and ctypes) and 3rd party libraries (some important Python libraries such as numpy and PIL rely extensively on the buffer protocol for performance).
There are usually two or more parties involved in each protocol. In the case of the Python buffer protocol, the parties are a "producer" (or "provider") and a "consumer". The producer exposes its internals via the buffer protocol, and the consumer accesses those internals.
Here I want to focus specifically on one use of the buffer protocol that’s relevant to this article. The producer is the built-in bytearray type, and the consumer is a method in the file object named readinto.
A more efficient way to read into a bytearray
Here’s the way to do what Snippet #2 did, just without the extra copy:
# Snippet #3
f = open(FILENAME, 'rb')
data = bytearray(os.path.getsize(FILENAME))
f.readinto(data)
First, a bytearray is created and pre-allocated to the size of the data we’re going to read into it. The pre-allocation is important – since readinto directly accesses the internal buffer of bytearray, it won’t write more than has been allocated. Next, the file.readinto method is used to read the data directly into the bytearray’s internal storage, without going via temporary buffers.
The result: this code runs ~30% faster than snippet #2 [3].
Variations on the theme
Other objects and modules could be used here. For example, the built-in array.array class also supports the buffer protocol, so it can also be written and read from a file directly and efficiently. The same goes for numpy arrays. On the consumer side, the socket module can also read directly into a buffer with the read_into method. I’m sure that it’s easy to find many other sample uses of this protocol in Python itself and some 3rd partly libraries – if you find something interesting, please let me know.
The buffer protocol – implementation
Let’s see how Snippet #3 works under the hood using the buffer protocol [4]. We’ll start with the producer.
bytearray declares that it implements the buffer protocol by filling the tp_as_buffer slot of its type object [5]. What’s placed there is the address of a PyBufferProcs structure, which is a simple container for two function pointers:
typedef int (*getbufferproc)(PyObject *, Py_buffer *, int);
typedef void (*releasebufferproc)(PyObject *, Py_buffer *);
/* ... */
typedef struct {
getbufferproc bf_getbuffer;
releasebufferproc bf_releasebuffer;
} PyBufferProcs;
bf_getbuffer is the function used to obtain a buffer from the object providing it, and bf_releasebuffer is the function used to notify the object that the provided buffer is no longer needed.
The bytearray implementation in Objects/bytearrayobject.c initializes an instance of PyBufferProces thus:
static PyBufferProcs bytearray_as_buffer = {
(getbufferproc)bytearray_getbuffer,
(releasebufferproc)bytearray_releasebuffer,
};
The more interesting function here is bytearray_getbuffer:
static int
bytearray_getbuffer(PyByteArrayObject *obj, Py_buffer *view, int flags)
{
int ret;
void *ptr;
if (view == NULL) {
obj->ob_exports++;
return 0;
}
ptr = (void *) PyByteArray_AS_STRING(obj);
ret = PyBuffer_FillInfo(view, (PyObject*)obj, ptr, Py_SIZE(obj), 0, flags);
if (ret >= 0) {
obj->ob_exports++;
}
return ret;
}
It simply uses the PyBuffer_FillInfo API to fill the buffer structure passed to it. PyBuffer_FillInfo provides a simplified method of filling the buffer structure, which is suitable for unsophisticated objects like bytearray (if you want to see a more complex example that has to fill the buffer structure manually, take a look at the corresponding function of array.array).
On the consumer side, the code that interests us is the buffered_readinto function in Modules\_io\bufferedio.c [6]. I won’t show its full code here since it’s quite complex, but with regards to the buffer protocol, the flow is simple:
- Use the PyArg_ParseTuple function with the w* format specifier to parse its argument as a R/W buffer object, which itself calls PyObject_GetBuffer – a Python API that invokes the producer’s "get buffer" function.
- Read data from the file directly into this buffer.
- Release the buffer using the PyBuffer_Release API [7], which eventually gets routed to the bytearray_releasebuffer function in our case.
To conclude, here’s what the call sequence looks like when f.readinto(data) is executed in the Python code:
buffered_readinto
|
\--> PyArg_ParseTuple(..., "w*", ...)
| |
| \--> PyObject_GetBuffer(obj)
| |
| \--> obj->ob_type->tp_as_buffer->bf_getbuffer
|
|--> ... read the data
|
\--> PyBuffer_Release
|
\--> obj->ob_type->tp_as_buffer->bf_releasebuffer
Memory views
The buffer protocol is an internal implementation detail of Python, accessible only on the C-API level. And that’s a good thing, since the buffer protocol requires certain low-level behavior such as properly releasing buffers. Memoryview objects were created to expose it to a user’s Python code in a safe manner:
memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.
The linked documentation page explains memoryviews quite well and should be immediately comprehensible if you’ve reached so far in this article. Therefore I’m not going to explain how a memoryview works, just show some examples of its use.
It is a known fact that in Python, slices on strings and bytes make copies. Sometimes when performance matters and the buffers are large, this is a big waste. Suppose you have a large buffer and you want to pass just half of it to some function (that will send it to a socket or do something else [8]). Here’s what happens (annotated Python pseudo-code):
mybuf = ... # some large buffer of bytes
func(mybuf[:len(mybuf)//2])
# passes the first half of mybuf into func
# COPIES half of mybuf's data to a new buffer
The copy can be expensive if there’s a lot of data involved. What’s the alternative? Using a memoryview:
mybuf = ... # some large buffer of bytes
mv_mybuf = memoryview(mybuf) # a memoryview of mybuf
func(mv_mybuf[:len(mv_mybuf)//2])
# passes the first half of mybuf into func as a "sub-view" created
# by slicing a memoryview.
# NO COPY is made here!
A memoryview behaves just like bytes in many useful contexts (for example, it supports the mapping protocol) so it provides an adequate replacement if used carefully. The great thing about it is that it uses the buffer protocol beneath the covers to avoid copies and just juggle pointers to data. The performance difference is dramatic – I timed a 300x speedup on slicing out a half of a 1MB bytes buffer when using a memoryview as demonstrated above. And this speedup will get larger with larger buffers, since it’s O(1) vs. the O(n) of copying.
But there’s more. On writable producers such as bytearray, a memoryview creates a writable view that can be modified:
>>> buf = bytearray(b'abcdefgh')
>>> mv = memoryview(buf)
>>> mv[4:6] = b'ZA'
>>> buf
bytearray(b'abcdZAgh')
This gives us a way to do something we couldn’t achieve by any other means – read from a file (or receive from a socket) directly into the middle of some existing buffer [9]:
buf = bytearray(...) # pre-allocated to the needed size
mv = memoryview(buf)
numread = f.readinto(mv[some_offset:])
Conclusion
This article demonstrated the Python buffer protocol, showing both how it works and what it can be used for. The main use of the buffer protocol to the Python programmer is optimization of certain patterns of coding, by avoiding unnecessary data copies.
Any mention of optimization in Python code is sure to draw fire from people claiming that if I want to write fast code, I shouldn’t use Python at all. But I disagree. Python these days is fast enough to be suitable for many tasks that were previously only in the domain of C/C++. I want to keep using it while I can, and only resort to low-level C/C++ when I must.
Employing the buffer protocol to have zero-copy buffer manipulations on the Python level is IMHO a huge boon that can stall (or even avoid) the transition of some performance-sensitive code from Python to C. That’s because when dealing with data processing, we often use a lot of C APIs anyway, the only Python overhead being the passing of data between these APIs. A speed boost in this code can make a huge difference and bring the Python code very close to the performance we could have with plain C.
The article also gave a glimpse into one aspect of the implementation of Python, hopefully showing that it’s not difficult at all to dive right into the code and understand how Python does something it does.

| [1] | In this article I’m focusing on the latest Python 3.x, although most of it also applies to Python 2.7 |
| [2] | The buffer protocol existed in Python prior to 2.6, but was then greatly enhanced. The PEP also describes the change that was made to the buffer protocol with the move to Python 3.x (and later backported to the 2.x line starting with 2.6). |
| [3] |
This is on the latest build of Python 3.3. Roughly the same speedup can be seen on Python 2.7. With Python 3.2 there appears to be a speed regression that makes the two snippets perform similarly, but it has been fixed in 3.3 Another note on the benchmarking: it’s recommended to use large files (say, ~100 MB and up) to get reliable measurements. For small files too many irrelevant factors come into play and offset the benchmarks. In addition, the code should be run in a loop to avoid differences due to warm/cold disk cache issues. I’m using the timeit module, which is perfect for this purpose. |
| [4] | All the code displayed here is taken from the latest development snapshot of Python 3.3 (default branch). |
| [5] | Type objects are a fascinating aspect of Python’s implementation, and I hope to cover it in a separate article one day. Briefly, it allows Python objects to declare which services they provide (or, in other terms, which interfaces they implement). More information can be found in the documentation. |
| [6] | Since the built-in open function, when asked to open a file in binary mode for reading, returns an io.BufferedReader object by default. This can be controlled with the buffering argument. |
| [7] | Releasing the buffer structure is an important part of the buffer protocol. Each time it’s requested for a buffer, bytearray increments a reference count, and decrements it when the buffer is released. While the refcount is positive (meaning that there are consumer objects directly relying on the internal buffer), bytearray won’t agree to resize or do other operations that may invalidate the internal buffer. Otherwise, this would be an avenue for insidious memory bugs. |
| [8] | Networking code is actually a common use case. When sending data over sockets, it’s frequently sliced and diced to build frames. This can involve a lot of copying. Other data-munging applications such as encryption and compression are also culprits. |
| [9] | This code snippet was borrowed from Antoine Pitrou’s post on python-dev. |
Related posts:

November 28th, 2011 at 10:46
Great post, as usual.
Avoiding memory copies will often give you the biggest improvement in performance for many programs these days. Since programs are mostly waiting on IO, where the IO can be ram as well as disk/network. Avoiding allocating big chunks of memory, and copying big chunks of memory will therefore give you nice performance improvements.
Another related technique is to use the mmap module. This can save you even more memory in some cases. If your file system has the file in cache, then using a mmap will allow you to use that memory allocated to cache, and you will not need to allocate any memory at all in your own program. Mmap can be managed into other memory view type python objects too.
PyOpenGL and pygame are other python modules that make use of these techniques.
November 28th, 2011 at 11:11
Rene,
Thanks for the
mmapbits – that’s very interesting.November 28th, 2011 at 13:47
If you had data that you wanted to modify using slicing and indexing but also with a fast filelike interface, you could read it into a
BytesIOobject via its writablememoryviewreturned bygetbuffer(). But the problem is the buffer can’t grow while thememoryviewexists. Is there a simple way to initialize aBytesIOobject to a certain size, similar tobytearray(size)? All I can think to do is repeatedlywrite(b'\0'), but I’d rather use an initializer implemented efficiently in C.November 28th, 2011 at 14:13
eryksun,
I’m not aware of a way to directly pre-allocate a
BytesIO, however itswritemethod accepts abytearray, which itself can be initialized as you mention. But this is also a copy, so I don’t know if it’s gaining anything.November 28th, 2011 at 15:03
It occurs to me now that one could allocate a large
byteschunk (say 1 MiB), and a smaller one for the remainder, to reduce the number of loop iterations. I tested this withtimeit. For a 128 MiB target, initializing theBytesIObuffer by writing a byte at a time took about a hundred times longer than using a pre-allocatedbytesobject. In contrast, writing a 1 MiB buffer 128 times was only 4 times slower. As expected the bottleneck is the Python loop and function calls.November 28th, 2011 at 15:30
Thanks for the great post! I wanted to offer one bit of caution about one aspect of this. For data structures such as numpy arrays, memoryview objects don’t expose the data directly as a raw sequence of bytes. Instead, the memoryview object preserves the underlying array structure including dimensions and item size. Here is a simple example that illustrates this:
Multidimensional arrays may not work very well at all:
This view of memory can be extremely problematic to work with if you try to directly mix arrays with I/O operations such as those on sockets (recv_into, send, etc.). Part of the problem is that socket operations might fragment the data in a way that is unaligned with the underlying array structure. For example, the recv_into() method might only receive partial data whose length is not an even multiple of the array item size (e.g., you receive 15 bytes of data into an array of floats where each item is 8 bytes in size). If this happens, you don’t get an error. Moreover, there is (apparently) no simple way to resume receiving data where you left off because there is no means for indicating a partial item offset (or raw byte position) via the memoryview interface.
I spent some time exploring this awhile back because I was interested in the idea of zero-copy message passing of numpy arrays through sockets. The outcome of that experiment was that it simply wasn’t possible using pure Python code and the buffer/memoryview interface. There were too many problems with data fragmentation and misalignment with the underlying array structure to make it work. Maybe I’m missing something obvious, but at the time it made my head hurt and I just gave up. I would love to know if any of the numpy developers have any thoughts or experience with this.
November 28th, 2011 at 16:04
David,
Interesting comment. I never considered this, since I’m not really doing work with numpy (my interest in the buffer protocol is more on the networking / binary file crunching side), but what you’re saying makes a lot of sense.
I wonder if some kind of “buffer stealing” mechanism would work for numpy. E.g. you build a buffer on the side and zero-copy read into it from a socket. When you’re done, you instantiate a numpy array and tell it – here’s your internal buffer, take it! This may be wishful thinking since this kind of interface is C-ish and not really something done in Python (I’m not even sure how to describe its semantics).
November 28th, 2011 at 16:35
David,
Regarding multidimensional array views — subscripting isn’t implemented at all. See here and here in memoryobject.c.
November 28th, 2011 at 16:41
There might be some way to hack buffer stealing via ctypes or some other low-level mechanism. However, once you start doing that it, you get into all sorts of issues about who owns which memory and so forth (a good way to get a segmentation fault).
I almost wonder if the problem would be better solved by extending the recv_into() method so that it allows a raw byte-offset to be supplied so that you can resume receiving where you left off. That’s really the only clean solution that I can think of.
November 28th, 2011 at 16:49
David,
If you think such an addition will solve the problem (I can’t really judge since I’m not as familiar with the details as you are), then by all means propose it to python-ideas. I will wholeheartedly support it, since it’s really easy to implement and harmless.
November 28th, 2011 at 17:49
numpy has
frombuffer. If you pass it abytesobject you get a read-only array. If you pass it abytearrayorarray.arrayyou get a writeable array shared between the two objects.November 28th, 2011 at 18:20
memoryview looks similar to Go (golang) slices.
November 28th, 2011 at 21:15
Your example in snippet #3 has a race condition where the file can be written to between the time that it is stat’d for size and the time it is actually read into the byte array. Best case scenario is that you are ONLY reading the file and something else is appending to it (e.g. a log file) and your read just gets a truncated version of the file at the length of the stat, worst case scenario is that the file is actually shorter than the stat’d size at the time you do the read and then your read blocks, perhaps indefinitely.
November 28th, 2011 at 22:02
Dave,
You’re right, but I just left the whole issue of concurrent access to the file out of this article, since my point was to focus on the buffers. The file read is just an example – I imagine that for the case of concurrency all the usual solutions apply.
January 12th, 2012 at 08:19
Great article. Excatly what I was looking for. I was trying to understand how to read a file to a buffer and send the data across the network with Asyncore for almost a week now.Not much information available linking File Streams – Asyncore – buffers, got some clarity regarding my requirement reading this. Thanks for the article.