Tags C & C++

The problem

We want to use a very large array for some computations. When created, all the elements of this array have to be initialized to some value. We'll only use a few values from the array1, so we don't want the runtime of the algorithm to be dominated by the initialization time.

In other words, we want to create and access an initialized array in constant time2.

How can this be done ? (Hint: we may use extra memory)

A trivial non-solution

One wrong solution that often comes to mind is:

Keep another array, where index i holds some known value if and only if our data array was initialized at index i.

This won't work, because when we create an array, it is full of random data (garbage that was in the memory at the time of creation). This random data can contain any "special" value we might choose to mark initialized indices, and the idea breaks down.

Solution

For a n-element array (vec), we'll use two additional n-element arrays and a pointer3. The arrays are called from and to and the pointer is called top. init_val is the value we want to initialize the array with.

  1. When an element i from vec is accessed, we'll know whether it has been initialized if from[i] < top and to[from[i]] = i. If this condition is satisfied, we return vec[i]. Otherwise, we return init_val.
  2. Initially, top is 0
  3. The array element i is first accessed by the code:
from[i] = top
to[top] = i
vec[i] = init_val
top++

Understanding the algorithm

Let's assume we want an array with size 10. This is the initial state of the array and helpers (X denotes a memory location with arbitrary contents):

vec:  X X X X X X X X X X
from: X X X X X X X X X X
to:   X X X X X X X X X X
top:  0

Since from and to contain unsigned values, the condition from[i] < top can never be true while top = 0. So all read accesses at this point will result in init_val.

Now suppose we write into the vector. Let's say vec[4] <- 3. According to the algorithm, the next state will be:

vec:  X X X X 3 X X X X X
from: X X X X 0 X X X X X
to:   4 X X X X X X X X X
top:  1

Now accesses to i = 4 will satisfy the conditions from[4] = 0 < 1 and to[from[4]] = 4, so the value of vec[4] can be returned. Accesses to other indices will fail, because the only i that will pass the from[i] < top is an i for which from[i] = 0, but only i = 4 will also fulfill the second condition to[from[i]] = i.

Let's work through another step. Let's say we write vec[9] <- 6. The next state will be:

vec:  X X X X 3 X X X X 6
from: X X X X 0 X X X X 1
to:   4 9 X X X X X X X X
top:  2

Again, the accesses to initialized indices, i = 4, 9 will satisfy the conditions. Accesses to other indices will fail. Suppose for example an access to i = 7. Is from[7] < 2 ? Let's assume that by chance it is. It must be either 0 or 1 then. But for both, to points to the only really initialized fields, which are either 4 or 9, so the second condition will not be satisfied.

I hope it's now clear how to, from and top interact to ensure that only initialized values will be actually returned. They kind-of close a loop - i points to from, from points to to and to points back to i. top makes sure that the location in to was already initialized by the algorithm. Since the relevant locations in from and to are always in a controlled state, this algorithm can never be fooled by random values that were just lying around in the memory.

A C++ implementation

C++ is a convenient language to implement this algorith in, because:

  1. It can actually create un-initialized arrays in almost constant time with new, unlike the higher-level languages like Python, in which arrays (lists) are initialized.
  2. It can encapsulate the whole data structure in a convenient class, and hide all the helper-data manipulation that goes on behind the scenes.

So here's the class:

#include 
#include 

using namespace std;


class InitializedVector
{
public:
    InitializedVector(size_t len_, int init_val_ = 0)
    {
        vec = new int[len_];
        
        from = new size_t[len_];
        to = new size_t[len_];
        top = 0;
        init_val = init_val_;
        len = len_;
    }

    int& operator[](size_t n)  
    {
        if (from[n] < top && to[from[n]] == n)
            return vec[n];
        else
        {
            from[n] = top;
            to[top] = n;
            vec[n] = init_val;
            top++;
            
            return vec[n];
        }
    }

    ~InitializedVector()
    {
        delete[] vec;
        delete[] from;
        delete[] to;
    }
    
private:
    int* vec;
    size_t* from;
    size_t* to;
    size_t top;
    int init_val;
    size_t len;
};

And some test code to show that it works:

int main()
{
    cout << "Hello there" << endl << endl;

    InitializedVector iv(10, 2222);

    cout << iv[4] << endl;
    cout << iv[9] << endl;

    iv[4] = 142;

    cout << iv[4] << endl;
    cout << iv[9] << endl;
    
    return 0;
}

In principle, we could make read accesses a bit more efficient, by skipping the complex initialization and just returning init_val. In this C++ implementation this isn't possible, however, because when you overload operator [], you can't discriminate between read and write accesses to the element.

1 Such a data structure is usually called sparse.

2 Dynamic allocation routines (malloc and its progeny) don't run in constant time in the worst case (when memory is fragmented and the requested block is large). In practice, however, we can assume it's very close to constant time for most real use-cases.

3 I've picked this algorithm in John Bentley's Programming Pearls, 2nd Ed.. He attributes it to Aho, Hopcroft and Ullman's Design and Analysis of Computer Algorithms. After some Googling I've also found references to it in James Storer's Data structures and algorithms.