Directed graph traversal, orderings and applications to data-flow analysis

When traversing trees with DFS, orderings are easy to define: we have pre-order, which visits a node before recursing into its children; in-order, which visits a node in between recursing into its left and right children; and post-order, which visits a node after recursing into its children.

However, for directed graphs, these orderings are not as natural and slightly different definitions are used. In this article I want to discuss the various directed graph orderings and their implementations. I also want to mention some applications of directed graph traversals to data-flow analysis.

Directed graphs are my focus here, since these are most useful in the applications I'm interested in. When a directed graph is known to have no cycles, I may refer to it as a DAG (directed acyclic graph). When cycles are allowed, undirected graphs can simply be modeled as directed graphs in which each undirected edge turns into a pair of directed edges.

Depth-first search and pre-order

Let's start with the algorithm that forms the basis for everything discussed in this article - depth-first search (DFS). Here is a simple implementation of DFS in Python (the full code is available here):

def dfs(graph, root, visitor):
    """DFS over a graph.

    Start with node 'root', calling 'visitor' for every visited node.
    """
    visited = set()
    def dfs_walk(node):
        visited.add(node)
        visitor(node)
        for succ in graph.successors(node):
            if succ not in visited:
                dfs_walk(succ)
    dfs_walk(root)

This is the baseline implementation, and we'll see a few variations on it to implement different orderings and algorithms. You may think that, due to the way the algorithm is structured, it implements pre-order traversal. It certainly looks like it. In fact, pre-order traversal on graphs is defined as the order in which the aforementioned DFS algorithm visits the nodes. However, there's a subtle but important difference from tree pre-order. Whereas in trees we may assume that in pre-order traversal we always see a node before all its successors, this isn't true for graph pre-order. Consider this graph:

Directed Graph

If we print the nodes in the order visited by DFS, we may get something like: A, C, B, D, E, T. So B comes before T, even though B is T's successor [1].
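To make this concrete, here's a minimal sketch of my own (the article's full code defines its own graph class) with an edge set I reconstructed to be consistent with the orderings quoted in the text; the original diagram may differ:

```python
def dfs(graph, root, visitor):
    """DFS over a graph, calling 'visitor' for every visited node."""
    visited = set()
    def dfs_walk(node):
        visited.add(node)
        visitor(node)
        for succ in graph.successors(node):
            if succ not in visited:
                dfs_walk(succ)
    dfs_walk(root)

class Graph:
    """Minimal directed graph: adjacency lists with ordered successors."""
    def __init__(self, edges):
        self.adj = {}
        for u, v in edges:
            self.adj.setdefault(u, []).append(v)
            self.adj.setdefault(v, [])
    def successors(self, node):
        return self.adj[node]

# Edges reconstructed from the orderings quoted in the text.
g = Graph([('A', 'C'), ('A', 'T'), ('C', 'B'),
           ('C', 'E'), ('B', 'D'), ('T', 'B')])

pre = []
dfs(g, 'A', pre.append)
print(pre)  # ['A', 'C', 'B', 'D', 'E', 'T']
```

Note that B (visited via C) indeed appears before T, even though the edge T->B exists.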

We'll soon see that other orderings do provide some useful guarantees.

Post-order and reverse post-order

To present other orderings and algorithms, we'll take the dfs function above and tweak it slightly. Here's a post-order walk. It has the same general structure as dfs, but it manages additional information (order list) and doesn't call a visitor function:

def postorder(graph, root):
    """Return a post-order ordering of nodes in the graph."""
    visited = set()
    order = []
    def dfs_walk(node):
        visited.add(node)
        for succ in graph.successors(node):
            if succ not in visited:
                dfs_walk(succ)
        order.append(node)
    dfs_walk(root)
    return order

This algorithm adds a node to the order list when its traversal is fully finished; that is, when all its outgoing edges have been visited. Unlike pre-order, here it's actually ensured - in the absence of cycles - that for two nodes V and W, if there is a path from W to V in the graph, then V comes before W in the list [2].

Reverse post-order (RPO) is exactly what its name implies. It's the reverse of the list created by post-order traversal. In reverse post-order, if there is a path from V to W in the graph, V appears before W in the list. Note that this is actually a useful guarantee - we see a node before we see any other nodes reachable from it; for this reason, RPO is useful in many algorithms.

Let's see the orderings produced by pre-order, post-order and RPO for our sample DAG:

  • Pre-order: A, C, B, D, E, T
  • Post-order: D, B, E, C, T, A
  • RPO: A, T, C, E, B, D

This example should help dispel a common confusion about these traversals - RPO is not the same as pre-order. While pre-order is simply the order in which DFS visits a graph, RPO actually guarantees that we see a node before all of its successors (again, this is in the absence of cycles, which we'll deal with later). So if you have an algorithm that needs this guarantee and you could simply use pre-order on trees, when you're dealing with DAGs it's RPO, not pre-order, that you're looking for.
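These properties are easy to check directly. The snippet below is a self-contained sketch (the adjacency lists are my reconstruction of the sample DAG from the orderings quoted above; the original diagram may differ):

```python
def postorder(succs, root):
    """Post-order over a graph given as {node: [successors]}."""
    visited, order = set(), []
    def dfs_walk(node):
        visited.add(node)
        for succ in succs[node]:
            if succ not in visited:
                dfs_walk(succ)
        order.append(node)  # appended only after all successors are done
    dfs_walk(root)
    return order

# Adjacency lists reconstructed from the orderings quoted in the text.
succs = {'A': ['C', 'T'], 'C': ['B', 'E'], 'B': ['D'],
         'T': ['B'], 'D': [], 'E': []}

post = postorder(succs, 'A')
rpo = post[::-1]  # reverse post-order is just the reversed list
print(post)  # ['D', 'B', 'E', 'C', 'T', 'A']
print(rpo)   # ['A', 'T', 'C', 'E', 'B', 'D']
```

In the RPO list, every node indeed appears before all of its successors, which is exactly the guarantee pre-order fails to provide.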

Topological sort

In fact, the RPO of a DAG has another name - topological sort. Indeed, listing a DAG's nodes such that a node comes before all its successors is precisely what sorting it topologically means. If nodes in the graph represent operations and edges represent dependencies between them (an edge from V to W meaning that V must happen before W) then topological sort gives us an order in which we can run the operations such that all the dependencies are satisfied (no operation runs before its dependencies).
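As a quick illustration (a toy example of my own, not from the article), reversing a post-order walk of a dependency DAG yields a valid execution schedule:

```python
def topological_sort(succs, roots):
    """Topological sort of a DAG given as {node: [successors]}."""
    visited, order = set(), []
    def dfs_walk(node):
        visited.add(node)
        for succ in succs[node]:
            if succ not in visited:
                dfs_walk(succ)
        order.append(node)
    for root in roots:
        if root not in visited:
            dfs_walk(root)
    return order[::-1]   # reversed post-order == topological order

# Hypothetical build steps; an edge u -> v means "u must run before v".
deps = {'fetch': ['compile'], 'compile': ['test', 'package'],
        'test': [], 'package': []}
schedule = topological_sort(deps, ['fetch'])
print(schedule)  # ['fetch', 'compile', 'package', 'test']
```

Every operation appears in the schedule only after all of its prerequisites.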

DAGs with multiple roots

So far the examples and algorithms demonstrated here show graphs that have a single, identifiable root node. This is not always the case for realistic graphs, though it is most likely the case for graphs used in compilation because a program has a single entry function (for call graphs) and a function has a single entry instruction or basic block (for control-flow and data-flow analyses of functions).

Once we leave the concept of "one root to rule them all" behind, it's not even clear how to traverse a graph like the example used in the article so far. Why start from node A? Why not B? And how would the traversal look if we did start from B? We'd then only be able to discover B itself and D, and would have to restart to discover other nodes.

Here's another graph, where not all nodes are connected with each other. This graph has no pre-defined root [3]. How do we still traverse all of it in a meaningful way?

Directed Multigraph

The idea is a fairly simple modification of the DFS traversal shown earlier. We pick an arbitrary unvisited node in the graph, treat it as a root and start dutifully traversing. When we're done, we check if there are any unvisited nodes remaining in the graph. If there are, we pick another one and restart. So on and so forth until all the nodes are visited:

def postorder_unrooted(graph):
    """Unrooted post-order traversal of a graph.

    Restarts traversal as long as there are undiscovered nodes. Returns a list
    of lists, each of which is a post-order ordering of nodes discovered while
    restarting the traversal.
    """
    allnodes = set(graph.nodes())
    visited = set()
    orders = []
    def dfs_walk(node):
        visited.add(node)
        for succ in graph.successors(node):
            if succ not in visited:
                dfs_walk(succ)
        orders[-1].append(node)
    while len(allnodes) > len(visited):
        # While there are still unvisited nodes in the graph, pick one at
        # random and restart the traversal from it.
        remaining = allnodes - visited
        root = remaining.pop()
        orders.append([])
        dfs_walk(root)
    return orders

To demonstrate this approach, postorder_unrooted returns not just a single list of nodes ordered in post-order, but a list of lists. Each list is a post-order ordering of a set of nodes that were visited together in one iteration of the while loop. Running on the new sample graph, I get this:

[['d', 'b', 'e', 'c'],
 ['j', 'f'],
 ['r', 'k'],
 ['p', 'q'],
 ['t', 'x']]

There's a slight problem with this approach. While relative ordering between entirely unconnected pieces of the graph (like K and X) is undefined, the ordering within each connected piece may be defined. For example, even though we discovered T and X separately from D, B, E, C, there is a natural ordering between them. So a more sophisticated algorithm would attempt to amend this by merging the orders when that makes sense. Alternatively, we can look at all the nodes in the graph, find the ones without incoming edges and use these as roots to start from.
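The second alternative can be sketched as follows (my own variation, assuming the graph is given as a dict of successor lists; nodes that are only reachable via cycles are picked up in a final sweep):

```python
def postorder_from_entry_nodes(succs):
    """Post-order over a graph given as {node: [successors]}.

    Starts from nodes with no incoming edges; any nodes still
    undiscovered afterwards (e.g. on cycles) are picked up last.
    """
    has_incoming = {v for targets in succs.values() for v in targets}
    visited, order = set(), []
    def dfs_walk(node):
        visited.add(node)
        for succ in succs[node]:
            if succ not in visited:
                dfs_walk(succ)
        order.append(node)
    for node in succs:
        if node not in has_incoming and node not in visited:
            dfs_walk(node)
    for node in succs:   # leftovers: nodes with no edge-free entry point
        if node not in visited:
            dfs_walk(node)
    return order

# A hypothetical multi-root graph (not the article's exact diagram).
succs = {'x': ['t'], 't': ['b'], 'b': ['d'], 'd': [],
         'k': ['r'], 'r': []}
print(postorder_from_entry_nodes(succs))  # ['d', 'b', 't', 'x', 'r', 'k']
```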

Directed graphs with cycles

Cycles in directed graphs present a problem for many algorithms, but there are important applications where avoiding cycles is unfeasible, so we have to deal with them. Here's a sample graph with a couple of cycles:

Directed Graph with cycles

Cycles break almost all of the nice-and-formal definitions of traversals stated so far in the article. In a graph with cycles, there is - by definition - no ordering of the nodes on a cycle in which we can say that one comes before the other. In post-order, nodes can be visited before all their outgoing edges have been visited. Topological sort is simply undefined on graphs with cycles.

Our post-order search will still run and terminate, visiting the whole graph, but the order in which it visits the nodes is just not as clearly defined. Here's what we get from postorder, starting with X as root:

['m', 'g', 'd', 'e', 'c', 'b', 't', 'x']

This is a perfectly valid post-order visit on a graph with cycles, but something about it bothers me. Note that M is returned before G and D. True, this is just how it is with cycles [4], but can we do better? After all, there's no actual path from G and D to M, while a path in the other direction exists. So could we somehow "approximate" a post-order traversal of a graph with cycles, such that at least some high-level notion of order stays consistent (outside actual cycles)?

Strongly Connected Components

The answer to the last question is yes, and the concept that helps us here is Strongly Connected Components (SCCs, hereafter). A graph with cycles is partitioned into SCCs, such that every component is strongly connected - every node within it is reachable from every other node. In other words, every component is a loop in the graph; we can then create a DAG of components [5]. Here's our loopy graph again, with the SCCs identified. All the nodes that don't participate in loops can be considered as single-node components of themselves:

Directed Graph with cycles and SCCs

Having this decomposition of the graph available, we can now once again run meaningful post-order visits on the SCCs and even sort them topologically.

There are a number of efficient algorithms available for decomposing a graph into SCCs; I will leave this topic to some other time.

3-color DFS and edge classification

As we've seen above, while our postorder function manages to visit all nodes in a graph with cycles, it doesn't do it in a sensible order and doesn't even detect cycles. We'll now examine a variation of the DFS algorithm that keeps track of more information during traversal, which helps it analyze cyclic graphs with more insight.

This version of graph DFS is actually something you will often find in books and other references; it colors every node white, grey or black, depending on how much traversal was done from it at any given time. This is why I call it the "3-color" algorithm, as opposed to the basic "binary" or "2-color" algorithm presented here earlier (any node is either in the "visited" set or not).

The following function implements this. The color dict replaces the visited set and tracks three different states for each node:

  1. Node not in the dict: not discovered yet ("white").
  2. Node discovered but not finished yet ("grey").
  3. Node is finished - all its outgoing edges were finished ("black").

def postorder_3color(graph, root):
    """Return a post-order ordering of nodes in the graph.

    Prints CYCLE notifications when graph cycles ("back edges") are discovered.
    """
    color = dict()
    order = []
    def dfs_walk(node):
        color[node] = 'grey'
        for succ in graph.successors(node):
            if color.get(succ) == 'grey':
                print('CYCLE: {0}-->{1}'.format(node, succ))
            if succ not in color:
                dfs_walk(succ)
        order.append(node)
        color[node] = 'black'
    dfs_walk(root)
    return order

If we run it on the loopy graph, we get the same order as the one returned by postorder, with the addition of:

CYCLE: m-->c
CYCLE: g-->d

The edges between M and C and between G and D are called back edges. This falls under the realm of edge classification. In addition to back edges, graphs can have tree edges, cross edges and forward edges. All of these can be discovered during a DFS visit. Back edges are the most important for our current discussion since they help us identify cycles in the graph.
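For the curious, here's a sketch of full edge classification (my own addition, not code from the article), using the 3-color scheme plus discovery times: an edge node->succ is a tree edge when succ is white, a back edge when grey; when succ is already black, the edge is a forward edge if succ was discovered during node's exploration (later discovery time), and a cross edge otherwise:

```python
def classify_edges(succs, root):
    """Classify edges reachable from 'root' as tree/back/forward/cross."""
    color = {}       # missing: white; 'grey': in progress; 'black': done
    disc = {}        # discovery time of each node
    time = [0]
    kinds = {}
    def dfs_walk(node):
        color[node] = 'grey'
        disc[node] = time[0]
        time[0] += 1
        for succ in succs[node]:
            if succ not in color:            # white: first time we see it
                kinds[(node, succ)] = 'tree'
                dfs_walk(succ)
            elif color[succ] == 'grey':      # still on the DFS stack
                kinds[(node, succ)] = 'back'
            elif disc[node] < disc[succ]:    # finished, discovered after us
                kinds[(node, succ)] = 'forward'
            else:                            # finished, discovered before us
                kinds[(node, succ)] = 'cross'
        color[node] = 'black'
    dfs_walk(root)
    return kinds

# A toy graph (my own) exhibiting all four edge kinds.
succs = {'a': ['b', 'c', 'd'], 'b': ['d'], 'c': ['d'], 'd': ['a']}
print(classify_edges(succs, 'a'))
```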

Data-flow analysis

This section is optional if you only wanted to learn about graph traversals. It assumes non-trivial background on compilation.

In compilation, data-flow analysis is an important technique used for many optimizations. The compiler analyzes the control-flow graph (CFG) of a program, reasoning about how values can potentially change [6] through its basic blocks.

Without getting into the specifics of data-flow analysis problems and algorithms, I want to mention a few relevant observations that bear strong relation to the earlier topics of this post. Broadly speaking, there are two kinds of data-flow problems: forward and backward problems. In forward problems, we use information learned about basic blocks to analyze their successors. In backward problems, it's the other way around - information learned about basic blocks is used to analyze their predecessors.

Therefore, it shouldn't come as a surprise that the most efficient way to run forward data-flow analyses is in RPO. In the absence of cycles, RPO (topological order) makes sure we've seen all of a node's predecessors before we get to it. This means we can perform the analysis in a single pass over the graph.

With cycles in the graph, this is no longer true, but RPO still guarantees the fastest convergence - in graphs with cycles data-flow analysis is iterative until a fixed point is reached [7].
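To sketch the shape of such an iterative forward analysis (this is my own toy example, not from the article; dominator computation is one classic forward problem), here's a fixed-point computation that visits nodes in RPO on each pass:

```python
def dominators(succs, entry):
    """Iterative forward data-flow: dominator sets, visited in RPO."""
    # Compute RPO of the CFG first.
    visited, post = set(), []
    def walk(node):
        visited.add(node)
        for s in succs[node]:
            if s not in visited:
                walk(s)
        post.append(node)
    walk(entry)
    rpo = post[::-1]

    # Predecessor lists, needed by any forward analysis.
    preds = {n: [] for n in rpo}
    for n in rpo:
        for s in succs[n]:
            preds[s].append(n)

    dom = {n: set(rpo) for n in rpo}   # start from the "everything" set
    dom[entry] = {entry}
    changed = True
    while changed:                     # iterate to a fixed point
        changed = False
        for n in rpo:
            if n == entry:
                continue
            new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

# A diamond-shaped CFG with a loop edge D -> B.
cfg = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': ['B']}
print(dominators(cfg, 'A'))
```

Because nodes are processed in RPO, most predecessor facts are already up to date when a node is visited, so the loop converges in very few passes.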

For a similar reason, the most efficient way to run backward data-flow analysis is post-order. In the absence of cycles, postorder makes sure that we've seen all of a node's successors before we get to it.

One final note: some resources recommend using RPO on a reverse CFG (the CFG with all its edges inverted) instead of post-order on the original graph for backward data-flow. If your initial reaction is to wonder whether the two are equivalent - that's not a bad guess, but it's wrong. With cycles, the two orders can be subtly different. Consider this graph, for example:

ABCD graph

Assuming A is the entry node, post-order would return: D, C, B, A. Now let's invert all the edges:

ABCD graph with inverted nodes

Here D would be the entry node. RPO on this graph is: D, B, C, A. This is not the same as post-order on the original graph - the order of the nodes in the cycle is different.
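This difference can be reproduced in code. The edge set below is my reconstruction from the orders quoted in the text (A->B, B->D, B->C, C->B; the original diagram may differ):

```python
def postorder(succs, root):
    """Post-order over a graph given as {node: [successors]}."""
    visited, order = set(), []
    def walk(node):
        visited.add(node)
        for s in succs[node]:
            if s not in visited:
                walk(s)
        order.append(node)
    walk(root)
    return order

def inverted(succs):
    """Reverse every edge, preserving insertion order."""
    inv = {n: [] for n in succs}
    for u, vs in succs.items():
        for v in vs:
            inv[v].append(u)
    return inv

# Edge set reconstructed to match the orders quoted in the text.
succs = {'A': ['B'], 'B': ['D', 'C'], 'C': ['B'], 'D': []}

po = postorder(succs, 'A')
rpo_of_reverse = postorder(inverted(succs), 'D')[::-1]
print(po)              # ['D', 'C', 'B', 'A']
print(rpo_of_reverse)  # ['D', 'B', 'C', 'A']
```

The two orders agree on the endpoints but swap B and C - exactly the nodes on the cycle.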

So here's a puzzle: why is RPO on the reverse graph recommended by some resources over post-order?

If I had to guess, I'd say that looking at the sample graph in this section, it makes some sense to examine B before C, because more of its successors were already visited. If we visit C before B, we don't know much about successors yet because B wasn't visited - and B is the only successor of C. But if we visit B before C, at least D (a successor of B in the original graph) was already visited. This may help faster convergence. If you have additional ideas, please let me know.

[1]This depends on the order in which edges were added to the graph, the data structure the graph uses to store these edges, and the order in which the graph's successors method returns node successors. The abstract algorithm doesn't enforce any of this.
[2]There may be other reasons for V to come before W - for example if both are root nodes.
[3]Its leftmost part has the same structure as our initial sample graph, with the slight difference of replacing A by X. I did this to shuffle the traversal order a bit; in my implementation, it happens to depend on the order in which Python's set.pop returns its elements.
[4]It should be an easy exercise to figure out how the traversal got to return M first here.
[5]This is only useful for fairly regular graphs, of course. If all nodes in a graph are reachable from one another (the whole graph is strongly connected), the SCC decomposition won't help much, since the whole graph will be a single component. But then we also wouldn't have the problem of discovering M before G and D as in the sample graph.
[6]I'm saying "potentially" because this is static analysis. In the general case, the compiler cannot know which conditions will be true and false at runtime - it can only use information available to it statically, before the program actually executes.
[7]Since data-flow sets form a partial order and the algorithms have to satisfy some additional mathematical guarantees, the analysis will converge even if we visit nodes randomly. However, picking the right order may result in significantly faster convergence - which is important for compilation time.

Summary of reading: July - September 2015

  • "The Whole-Brain Child" by Daniel J. Siegel and Tina Payne Bryson - one of those books that smears a handful of good ideas over two hundred pages. The "scientific" angle the authors try to take is, IMHO somewhat light on real hard science - most of it is just in the realm of naturally intuitive ideas and pop-culture stereotypes (the whole right-brain vs. left-brain issue, for example). That said, even though most of the book is Tony Robbins-style general feel-good advice, there are a few interesting suggestions in it that are worth reading the book for.
  • "The Fabric of the Cosmos" by Brian Greene - an information packed whirlwind through modern physics. On the textbook-to-pop science line, this books lays much farther ahead from pop science than other similar books, such as those by Simon Singh, by delving much deeper into the details. I particularly enjoyed Greene's explanation of special relativity and the second law of thermodynamics. OTOH, the explanation of entanglement and Bell's theorem could be a bit better (though I'll admit I don't know if that's possible without getting into real math). On the average the book is very good, though. I enjoyed the last part of the book somewhat less - as soon as things get into string theory and even stranger theories (branes, holograms, etc) they become almost impossible to understand intuitively, because they don't represent anything human intuition can grasp. Greene admits that these theories are usually studied through their mathematical equations because none of their physics is obvious (they're not even experimentally tested yet).
  • "Particle Physics for Non-Physicists" by Steven Pollock (audio course) - a laymen overview of particle physics, focusing on the history of discovering the particle zoo that currently comprises the standard model. This is somewhat more advanced than the other "introductory" modern physics courses I've heard so far, and as such I also feel it approaches the limit of what can be effectively taught in such a course for laymen. The topic of particle physics is super complex, and explaining it without extensive math is challenging. In an audio setting, explaining it without using any visual aids like diagrams and drawings is even more challenging. So don't expect to learn much beyond a very trivial glimpse of the subject from here. That said, the course is well put and the lecturer is good, so it's not a bad use of one's time. It definitely does a decent job of putting all the possible particles in some order and explain their main characteristics and how (and why) they were discovered.
  • "The Martian" by Andy Weir - what a great book! A Mars exploration mission goes wrong and an astronaut is left stranded on the desolate red planet for many months. Science fiction at its best, IMHO, is one that's at least somewhat plausible. It's what really incites imagination - things that could actually happen in the not-too-far future. I think this book is particularly fun for engineers who like to do back-of-the-envelope computations and get excited by duct-tape-and-zipties creative solutions to problems (a-la MacGyver). One of my favorite childhood books was The Mysterious Island by Jules Verne. "The Martian" is finally something to match it, but transported onto a much more modern setting.
  • "Einstein's Relativity and the Quantum Revolution" by Richard Wolfson (audio course) - a tour through modern physics for non-scientists. As is usual with such books/courses, the simpler topics in the beginning get a thorough treatment, while the more complex topics later on get mostly skimmed. That said, one can't expect to get a good understanding of all of modern physics from a single course (truly, a whole physics degree is needed for that). Overall, I liked the course - prof. Wolfson is pretty good at explaining things, and I found his explanation of special relativity interesting and easy to follow. It's a shame the last few lectures are in sprint-mode, covering important topics like particle physics only on a very superficial story-telling level. I'd gladly hear more in-depth lectures from prof. Wolfson on these topics.
  • "Coding the Matrix - Linear Algebra through Computer Science applications" by Philip N. Klein - an introduction to linear algebra with emphasis on implementing its fundamental algorithms (in Python). This book is a companion to a Coursera course by the same author. Even though it's an application-oriented book, it doesn't give up rigor and most of the propositions are proven. Also, the computation angle is nice and unusual for a math textbook. However, as a standalone book (I didn't take the course) it's lacking, IMHO. I found it poorly organized and overly verbose. Also, the formatting is terrible - the book seems to be some sort of self-printing, it's basically A4 pages glued together (too tall to fit on my bookshelf!). The diagrams are often unintelligible, and code samples sometimes run off their boxes. The author's over-use of list comprehensions in Python doesn't help in this aspect :-) This being the first edition and the author's maintaining a large errata, I believe many of these problems will be fixed in future editions. To conclude, while the book's practical aspect makes it an interesting addition to one's bookshelf, I wouldn't recommend it for self-study.
  • "What is Relativity" by Jeffrey Bennett - A layman's explanation of the theory of relativity, split about half-half between special and general relativity. While the book isn't bad, I didn't find it particularly great either. Some parts of the explanation on special relativity are lacking (for example, length contraction if pretty badly explained), and the general relativity part is too handwavy. I realize it's not easy to explain GR without diving into the heavy math involved, but I expected more.

Memory layout of multi-dimensional arrays

When working with multi-dimensional arrays, one important decision programmers have to make fairly early on in the project is what memory layout to use for storing the data, and how to access such data in the most efficient manner. Since computer memory is inherently linear - a one-dimensional structure, mapping multi-dimensional data on it can be done in several ways. In this article I want to examine this topic in detail, talking about the various memory layouts available and their effect on the performance of the code.

Row-major vs. column-major

By far the two most common memory layouts for multi-dimensional array data are row-major and column-major.

When working with 2D arrays (matrices), row-major vs. column-major are easy to describe. The row-major layout of a matrix puts the first row in contiguous memory, then the second row right after it, then the third, and so on. Column-major layout puts the first column in contiguous memory, then the second, etc.

Higher dimensions are a bit more difficult to visualize, so let's start with some diagrams showing how 2D layouts work.

2D row-major

First, some notes on the nomenclature of this article. Computer memory will be represented as a linear array with low addresses on the left and high addresses on the right. Also, we're going to use programmer notation for matrices: rows and columns start with zero, at the top-left corner of the matrix. Row indices go over rows from top to bottom; column indices go over columns from left to right.

As mentioned above, in row-major layout, the first row of the matrix is placed in contiguous memory, then the second, and so on:

Row major 2D

Another way to describe row-major layout is that column indices change the fastest. This should be obvious by looking at the linear layout at the bottom of the diagram. If you read the element index pairs from left to right, you'll notice that the column index changes all the time, and the row index only changes once per row.

For programmers, another important observation is that given a row index row and a column index col, the offset of the element they denote in the linear representation is:

offset = row * NCOLS + col

Where NCOLS is the number of columns per row in the matrix. It's easy to see this equation fits the linear layout in the diagram shown above.

2D column-major

Describing column-major 2D layout is just taking the description of row-major and replacing every appearance of "row" by "column" and vice versa. The first column of the matrix is placed in contiguous memory, then the second, and so on:

Column major 2D

In column-major layout, row indices change the fastest. The offset of an element in column-major layout can be found using this equation:

offset = col * NROWS + row

Where NROWS is the number of rows per column in the matrix.
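Both 2D formulas translate directly into code; a trivial sketch:

```python
def offset_row_major(row, col, ncols):
    """Offset of element [row][col] in row-major layout."""
    return row * ncols + col

def offset_col_major(row, col, nrows):
    """Offset of element [row][col] in column-major layout."""
    return col * nrows + row

# Element [1][2] of a 3x4 matrix:
print(offset_row_major(1, 2, ncols=4))  # 6
print(offset_col_major(1, 2, nrows=3))  # 7
```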

Beyond 2D - indexing and layout of N-dimensional arrays

Even though matrices are the most common multi-dimensional arrays programmers deal with, they are by no means the only ones. The notion of multi-dimensional arrays is fully generalizable to more than 2 dimensions. These entities are commonly called "N-D arrays" or "tensors".

When we move to 3D and beyond, it's best to leave the row/column notation of matrices behind. This is because this notation doesn't easily translate to 3 dimensions due to a common confusion between rows, columns and the Cartesian coordinate system. In 4 dimensions and above, we lose any purely-visual intuition to describe multi-dimensional entities anyway, so it's best to stick to a consistent mathematical notation instead.

So let's talk about some arbitrary number of dimensions d, numbered from 1 to d. For each dimension k, N_k is the size of the dimension. Also, the index of an element in dimension k is n_k. For example, in the latest matrix diagram above (where column-major layout is shown), we have d = 2. If we choose dimension 1 to be the row and dimension 2 to be the column, then N_1 is the number of rows and N_2 the number of columns, and the element in the bottom-left corner of the matrix has n_1 = N_1 - 1 and n_2 = 0.

In row-major layout of multi-dimensional arrays, the last index is the fastest changing. In case of matrices the last index is columns, so this is equivalent to the previous definition.

Given a d-dimensional array, with the notation shown above, we compute the memory location of an element from its indices as:

offset = n_d + N_d * (n_{d-1} + N_{d-1} * (n_{d-2} + ... + N_2 * n_1))

For a matrix, d = 2, this reduces to:

offset = n_2 + N_2 * n_1

Which is exactly the formula we've seen above for row-major layout (with n_1 the row index, n_2 the column index and N_2 the number of columns), just using a slightly more formal notation.

Similarly, in column-major layout of multi-dimensional arrays, the first index is the fastest changing. Given a d-dimensional array, we compute the memory location of an element from its indices as:

offset = n_1 + N_1 * (n_2 + N_2 * (n_3 + ... + N_{d-1} * n_d))

And again, for a matrix with d = 2 this reduces to the familiar:

offset = n_1 + N_1 * n_2
Example in 3D

Let's see how this works out in 3D, which we can still visualize. Assuming 3 dimensions: rows, columns and depth. The following diagram shows the memory layout of a 3D array with N_1 = N_2 = N_3 = 3, in row-major:

Row major 3D

Note how the last dimension (depth, in this case) changes the fastest and the first (row) changes the slowest. The offset for a given element is:

offset = n_3 + N_3 * (n_2 + N_2 * n_1)

For example, the offset of the element with indices 2,1,1 is 1 + 3 * (1 + 3 * 2) = 22.

As an exercise, try to figure out how this array would be laid out in column-major order. But beware - there's a caveat! The term column-major may lead you to believe that columns are the slowest-changing index, but this is wrong. The last index is the slowest changing in column-major, and the last index here is depth, not columns. In fact, columns would be right in the middle in terms of change speed. This is exactly why in the discussion above I suggested dropping the row/column notation when going above 2D. In higher dimensions it becomes confusing, so it's best to refer to the relative change rate of the indices, since these are unambiguous.

In fact, one could conceive a sort of hybrid (or "mixed") layout where the second dimension changes faster than the first or the third. This would be neither row-major nor column-major, but in itself it's a consistent and perfectly valid layout that may benefit some applications. More details on why we would choose one layout over another are later in the article.

History: Fortran vs. C

While knowing which layout a particular data set is using is critical for good performance, there's no single answer to the question of which layout "is better" in general. It's not much different from the big-endian vs. little-endian debate; what's important is to pick a consistent standard and stick to it. Unfortunately, as almost always happens in the world of computing, different programming languages and environments picked different standards.

Among the programming languages still popular today, Fortran was definitely one of the pioneers. And Fortran (which is still very important for scientific computing) uses column-major layout. I read somewhere that the reason for this is that column vectors are more commonly used and considered "canonical" in linear algebra computations. Personally I don't buy this, but you can make your own judgement.

A slew of modern languages follow Fortran's lead - Matlab, R, Julia, to name a few. One of the strongest reasons for this is that they want to use LAPACK - a fast Fortran library for linear algebra, so using Fortran's layout makes sense.

On the other hand, C and C++ use row-major layout. Following their example are a few other popular languages such as Python, Pascal and Mathematica. Since multi-dimensional arrays are a first-class type in the C language, the standard defines the layout very explicitly in section [1].

In fact, having the first index change the slowest and the last index change the fastest makes sense if you think about how multi-dimensional arrays in C are indexed.

Given the declaration:

int x[3][5];

Then x is an array of 3 elements, each of which is an array of 5 integers. x[1] is the address of the second array of 5 integers contained in x, and x[1][4] is the fifth integer of the second 5-integer array in x. These indexing rules imply row-major layout.

None of this is to say that C could not have chosen column-major layout. It could, but then its multi-dimensional array indexing rules would have to be different as well. The result could be just as consistent as what we have now.

Moreover, since C lets you manipulate pointers, you can decide on the layout of data in your program by computing offsets into multi-dimensional arrays on your own. In fact, this is how most C programs are written.
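That idiom - one flat allocation plus hand-computed offsets - looks like this in a Python sketch (my own illustration, with a flat list standing in for the malloc'd buffer):

```python
NROWS, NCOLS = 3, 4
buf = [0] * (NROWS * NCOLS)        # one flat allocation, like malloc in C

def set2d(row, col, value):
    buf[row * NCOLS + col] = value  # manual row-major offset computation

def get2d(row, col):
    return buf[row * NCOLS + col]

set2d(2, 3, 99)
print(get2d(2, 3))   # 99
print(buf[11])       # 99 -- offset 2*4 + 3 == 11
```

Choosing column-major instead is just a matter of swapping the offset formula; the buffer itself doesn't care.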

Memory layout example - numpy

So far we've discussed memory layout purely conceptually - using diagrams and mathematical formulae for index computations. It's worthwhile to see a "real" example of how multi-dimensional arrays are stored in memory. For this purpose, the Numpy library of Python is a great tool since it supports both layout kinds and is easy to play with from an interactive shell.

The numpy.array constructor can be used to create multi-dimensional arrays. One of the parameters it accepts is order, which is either "C" for C-style layout (row-major) or "F" for Fortran-style layout (column-major). "C" is the default. Let's see how this looks:

In [42]: ar2d = numpy.array([[1, 2, 3], [11, 12, 13], [10, 20, 40]], dtype='uint8', order='C')
In [43]: '  '.join(str(ord(x)) for x in ar2d.tostring())
Out[43]: '1  2  3  11  12  13  10  20  40'

In "C" order, elements of rows are contiguous, as expected. Let's try Fortran layout now:

In [44]: ar2df = numpy.array([[1, 2, 3], [11, 12, 13], [10, 20, 40]], dtype='uint8', order='F')

In [45]: '  '.join(str(ord(x)) for x in ar2df.data)
Out[45]: '1  11  10  2  12  20  3  13  40'
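Another way to see the difference is through numpy's strides attribute, which records how many bytes separate consecutive elements along each dimension (this check is mine, not from the original session):

```python
import numpy as np

ar2d = np.array([[1, 2, 3], [11, 12, 13], [10, 20, 40]],
                dtype='uint8', order='C')
ar2df = np.array([[1, 2, 3], [11, 12, 13], [10, 20, 40]],
                 dtype='uint8', order='F')

# strides[d] = bytes to step in memory to advance index d by one.
print(ar2d.strides)   # (3, 1): within a row, elements are adjacent
print(ar2df.strides)  # (1, 3): within a column, elements are adjacent
```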

For a more complex example, let's encode the following 3D array as a numpy.array and see how it's laid out:

Numeric 3D array

This array has 2 rows (first dimension), 4 columns (second dimension) and depth 2 (third dimension). As a nested Python list, this is its representation:

In [47]: lst3d = [[[1, 11], [2, 12], [3, 13], [4, 14]], [[5, 15], [6, 16], [7, 17], [8, 18]]]

And the memory layout, in both C and Fortran orders:

In [50]: ar3d = numpy.array(lst3d, dtype='uint8', order='C')

In [51]: '  '.join(str(ord(x)) for x in ar3d.data)
Out[51]: '1  11  2  12  3  13  4  14  5  15  6  16  7  17  8  18'

In [52]: ar3df = numpy.array(lst3d, dtype='uint8', order='F')

In [53]: '  '.join(str(ord(x)) for x in ar3df.data)
Out[53]: '1  5  2  6  3  7  4  8  11  15  12  16  13  17  14  18'

Note that in C layout (row-major), the first dimension (rows) changes the slowest while the third dimension (depth) changes the fastest. In Fortran layout (column-major) the first dimension changes the fastest while the third dimension changes the slowest.
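These orderings follow directly from the offset formulas. For a (2, 4, 2) array of 1-byte elements, element (i, j, k) sits at byte offset i*8 + j*2 + k in C order and i + j*2 + k*8 in Fortran order; a quick numpy check (the formulas are written out by me) confirms this:

```python
import numpy as np

lst3d = [[[1, 11], [2, 12], [3, 13], [4, 14]],
         [[5, 15], [6, 16], [7, 17], [8, 18]]]
ar3d = np.array(lst3d, dtype='uint8', order='C')
ar3df = np.array(lst3d, dtype='uint8', order='F')

c_bytes = ar3d.tobytes('A')   # 'A': bytes in the array's own memory order
f_bytes = ar3df.tobytes('A')

for i in range(2):
    for j in range(4):
        for k in range(2):
            # C order: last index fastest; Fortran: first index fastest.
            assert c_bytes[i * 8 + j * 2 + k] == lst3d[i][j][k]
            assert f_bytes[i + j * 2 + k * 8] == lst3d[i][j][k]
print("offset formulas verified")
```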

Performance: why it's worth caring which layout your data is in

After reading the article thus far, one may wonder why any of this matters. Isn't it just another divergence of standards, à la endianness? As long as we all agree on the layout, isn't this just a boring implementation detail? Why would we care about this?

The answer is: performance. We're talking about numerical computing here (number crunching on large data sets) where performance is almost always critical. It turns out that matching the way your algorithm works with the data layout can make or break the performance of an application.

The short takeaway is: always traverse the data in the order it was laid out. If your data sits in memory in row-major layout, iterate over each row before going to the next one, and so on. The rest of the section will explain why this is so, and will also present a benchmark with some measurements to get a feel for the consequences of this decision.

There are two aspects of modern computer architecture that have a large impact on code performance and are relevant to our discussion: caching and vector units. When we iterate over each row of a row-major array, we access the array sequentially. This pattern has spatial locality, which makes the code perfect for cache optimization. Moreover, depending on the operations we do with the data, the CPU's vector unit can kick in since it also requires consecutive access.

Graphically, it looks something like the following diagram. Let's say we have the array int array[128][128], and we iterate over each row, jumping to the next one when all the columns in the current one have been visited. The number within each gray cell is the memory address - it grows by 4 since this is an array of integers. The blue numbered arrow enumerates accesses in the order they are made:

Row access pattern

Here, the optimal usage of caching and vector instructions should be obvious. Since we always access elements sequentially, this is the perfect scenario for the CPU's caches to kick in - we will always hit the cache. In fact, we always hit the fastest cache - L1, because the CPU correctly pre-fetches all data ahead.

Moreover, since we always read one 32-bit word [2] after another, we can leverage the CPU's vector units to load the data (and perhaps process it later). The purple arrows show how this can be done with SSE vector loads that grab 128-bit chunks (four 32-bit words) at a time. In actual code, this can either be done with intrinsics or by relying on the compiler's auto-vectorizer (as we will soon see in an actual code sample).

Contrast this with accessing this row-major data one column at a time, iterating over each column before moving to the next one:

Column access pattern

We lose spatial locality here, unless the array is very narrow. If there are few columns, consecutive rows may be found in the cache. However, in more typical applications the arrays are large and when access #2 happens it's likely that the memory it accesses is nowhere to be found in the cache. Unsurprisingly, we also lose the vector units since the accesses are not made to consecutive memory.

But what should you do if your algorithm needs to access data column-by-column rather than row-by-row? Very simple! This is precisely what column-major layout is for. With column-major data, this access pattern will hit all the same architectural sweetspots we've seen with consecutive access on row-major data.
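If the column-by-column data arrives in row-major form and you can't change the producer, one option (shown here with numpy; a one-off conversion I'm sketching, not something the article's benchmark does) is to copy it into column-major order once up front, then iterate:

```python
import numpy as np

a = np.arange(12, dtype='uint8').reshape(3, 4)  # C (row-major) by default
af = np.asfortranarray(a)                       # same values, column-major copy

print(a.flags['C_CONTIGUOUS'])    # True
print(af.flags['F_CONTIGUOUS'])   # True

# In memory, the Fortran copy stores column 0 first:
print(list(af.tobytes('A')[:3]))  # [0, 4, 8]
```

The copy costs one pass over the data; every subsequent column-wise traversal then enjoys the sequential-access benefits described above.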

The diagrams above should be convincing enough, but let's do some actual measurements to see just how dramatic these effects are.

The full code for the benchmark is available here, so I'll just show a few selected snippets. We'll start with a basic matrix type laid out in linear memory:

// A simple Matrix of unsigned integers laid out row-major in a 1D array. M is
// number of rows, N is number of columns.
struct Matrix {
  unsigned* data = nullptr;
  size_t M = 0, N = 0;
};

The matrix uses row-major layout: element (row, col) of matrix x is accessed with the C expression x.data[row * x.N + col].

Here's a function that adds two such matrices together, using a "bad" access pattern - iterating over the rows in each column before going to the next column. The access pattern is easy to spot in the C code - the inner loop iterates over the faster-changing index, and in this case it's rows:

void AddMatrixByCol(Matrix& y, const Matrix& x) {
  assert(y.M == x.M);
  assert(y.N == x.N);

  for (size_t col = 0; col < y.N; ++col) {
    for (size_t row = 0; row < y.M; ++row) {
      y.data[row * y.N + col] += x.data[row * x.N + col];
    }
  }
}

And here's a version that uses a better pattern, iterating over the columns in each row before going to the next row:

void AddMatrixByRow(Matrix& y, const Matrix& x) {
  assert(y.M == x.M);
  assert(y.N == x.N);

  for (size_t row = 0; row < y.M; ++row) {
    for (size_t col = 0; col < y.N; ++col) {
      y.data[row * y.N + col] += x.data[row * x.N + col];
    }
  }
}

How do the two access patterns compare? Based on the discussion in this article, we'd expect the by-row access pattern to be faster. But how much faster? And what role does vectorization play vs. efficient usage of cache?

To test this, I ran the access patterns on matrices of various sizes, and added a variation of the by-row pattern where vectorization is disabled [3]. Here are the results; the vertical bars represent bandwidth - how many billions of items (32-bit words) were processed (added) per second by the given function.

Benchmark results

Some observations:

  • For matrix sizes above 64x64, by-row access is significantly faster than by-column (6-8x, depending on size). In the case of 64x64, what I believe happens is that both matrices fit into the 32-KB L1 cache of my machine, so the by-column pattern actually manages to find the next row in cache. For larger sizes the matrices no longer fit in L1, so the by-column version has to go to L2 frequently.
  • The vectorized version beats the non-vectorized one by 2-3x in all cases. On large matrices the speedup is a bit smaller; I think this is because at 256x256 and beyond the matrices no longer fit in L2 (my machine has 256 KB of it) and need slower memory accesses. So the CPU spends a bit more time waiting for memory on average.
  • The overall speedup of the vectorized by-row access over the by-column access is enormous - up to 25x for large matrices.

I'll have to admit that, while I expected the by-row access to be faster, I didn't expect it to be this much faster. Clearly, choosing the proper access pattern for the memory layout of the data is absolutely crucial for the performance of an application.

Conclusion

This article examined the issue of multi-dimensional array layout from multiple angles. The main takeaway is: know how your data is laid out and access it accordingly. In C-based programming languages, even though the default layout for 2D-arrays is row-major, when we use pointers to dynamically allocated data, we are free to choose whatever layout we like. After all, multi-dimensional arrays are just a logical abstraction above a linear storage system.

Due to the wonders of modern CPU architectures, choosing the "right" way to access multi-dimensional data may result in colossal speedups; therefore, this is something that should always be on the programmer's mind when working on large multi-dimensional data sets.

[1]Taken from draft n1570 of the C11 standard.
[2]The term "word" used to be clearly associated with a 16-bit entity at some point in the past (with "double word" meaning 32 bits and so on), but these days it's too overloaded. In various references online you'll find "word" to be anything from 16 to 64 bits, depending on the CPU architecture. So I'm going to deliberately side-step the confusion by explicitly mentioning the bit size of words.
[3]See the benchmark repository for the full details, including function attributes and compiler flags. A special thanks goes to Nadav Rotem for helping me think through an issue I was initially having due to g++ ignoring my no-tree-vectorize attribute when inlining the function into the benchmark. I turned off inlining to fix this.