## Initializing an array in constant time

August 23rd, 2008 at 11:48 am

#### The problem

We want to use a very large array for some computations. When created, all the elements of this array have to be initialized to some value. We’ll only use a few values from the array1, so we don’t want the runtime of the algorithm to be dominated by the initialization time.

In other words, we want to create and access an initialized array in constant time2.

How can this be done ? (Hint: we may use extra memory)

#### A trivial non-solution

One wrong solution that often comes to mind is:

Keep another array, where index `i` holds some known value if and only if our data array was initialized at index `i`.

This won’t work, because when we create an array, it is full of random data (garbage that was in the memory at the time of creation). This random data can contain any “special” value we might choose to mark initialized indices, and the idea breaks down.

#### Solution

For a n-element array (`vec`), we’ll use two additional n-element arrays and a pointer3. The arrays are called `from` and `to` and the pointer is called `top`. `init_val` is the value we want to initialize the array with.

1. When an element `i` from `vec` is accessed, we’ll know whether it has been initialized if `from[i] < top and to[from[i]] = i`. If this condition is satisfied, we return `vec[i]`. Otherwise, we return `init_val`.
2. Initially, `top` is 0
3. The array element `i` is first accessed by the code:
```from[i] = top
to[top] = i
vec[i] = init_val
top++
```

#### Understanding the algorithm

Let’s assume we want an array with size 10. This is the initial state of the array and helpers (`X` denotes a memory location with arbitrary contents):

```vec:  X X X X X X X X X X
from: X X X X X X X X X X
to:   X X X X X X X X X X
top:  0
```

Since `from` and `to` contain unsigned values, the condition `from[i] < top` can never be true while `top = 0`. So all read accesses at this point will result in `init_val`.

Now suppose we write into the vector. Let’s say `vec[4] <- 3`. According to the algorithm, the next state will be:

```vec:  X X X X 3 X X X X X
from: X X X X 0 X X X X X
to:   4 X X X X X X X X X
top:  1
```

Now accesses to `i = 4` will satisfy the conditions `from[4] = 0 < 1` and `to[from[4]] = 4`, so the value of `vec[4]` can be returned. Accesses to other indices will fail, because the only `i` that will pass the `from[i] < top` is an `i` for which `from[i] = 0`, but only `i = 4` will also fulfill the second condition `to[from[i]] = i`.

Let’s work through another step. Let’s say we write `vec[9] <- 6`. The next state will be:

```vec:  X X X X 3 X X X X 6
from: X X X X 0 X X X X 1
to:   4 9 X X X X X X X X
top:  2
```

Again, the accesses to initialized indices, `i = 4, 9` will satisfy the conditions. Accesses to other indices will fail. Suppose for example an access to `i = 7`. Is `from[7] < 2` ? Let’s assume that by chance it is. It must be either 0 or 1 then. But for both, `to` points to the only really initialized fields, which are either 4 or 9, so the second condition will not be satisfied.

I hope it’s now clear how `to`, `from` and `top` interact to ensure that only initialized values will be actually returned. They kind-of close a loop – `i` points to `from`, `from` points to `to` and `to` points back to `i`. `top` makes sure that the location in `to` was already initialized by the algorithm. Since the relevant locations in `from` and `to` are always in a controlled state, this algorithm can never be fooled by random values that were just lying around in the memory.

#### A C++ implementation

C++ is a convenient language to implement this algorith in, because:

1. It can actually create un-initialized arrays in almost constant time with `new`, unlike the higher-level languages like Python, in which arrays (lists) are initialized.
2. It can encapsulate the whole data structure in a convenient class, and hide all the helper-data manipulation that goes on behind the scenes.

So here’s the class:

```#include
#include

using namespace std;

class InitializedVector
{
public:
InitializedVector(size_t len_, int init_val_ = 0)
{
vec = new int[len_];

from = new size_t[len_];
to = new size_t[len_];
top = 0;
init_val = init_val_;
len = len_;
}

int& operator[](size_t n)
{
if (from[n] < top && to[from[n]] == n)
return vec[n];
else
{
from[n] = top;
to[top] = n;
vec[n] = init_val;
top++;

return vec[n];
}
}

~InitializedVector()
{
delete[] vec;
delete[] from;
delete[] to;
}

private:
int* vec;
size_t* from;
size_t* to;
size_t top;
int init_val;
size_t len;
};
```

And some test code to show that it works:

```int main()
{
cout << "Hello there" << endl << endl;

InitializedVector iv(10, 2222);

cout << iv[4] << endl;
cout << iv[9] << endl;

iv[4] = 142;

cout << iv[4] << endl;
cout << iv[9] << endl;

return 0;
}
```

In principle, we could make read accesses a bit more efficient, by skipping the complex initialization and just returning `init_val`. In this C++ implementation this isn't possible, however, because when you overload `operator []`, you can't discriminate between read and write accesses to the element.

1 Such a data structure is usually called sparse.

2 Dynamic allocation routines (`malloc` and its progeny) don't run in constant time in the worst case (when memory is fragmented and the requested block is large). In practice, however, we can assume it's very close to constant time for most real use-cases.

3 I've picked this algorithm in John Bentley's Programming Pearls, 2nd Ed.. He attributes it to Aho, Hopcroft and Ullman's Design and Analysis of Computer Algorithms. After some Googling I've also found references to it in James Storer's Data structures and algorithms.

Related posts:

### 13 Responses to “Initializing an array in constant time”

1. Andrew Says:

s/just laying around/just lying around

“It’s bad grammar baby, yes it is…” — Dan Hicks & His Hot Licks

2. eliben Says:

Andrew,

Fixed, thanks.

3. Ankur Says:

I have a doubt, instead of using to and from arrays, a bitmap of size N (say arr[n]) would be space economical. memset the arr once and it can store 0,1 for initialization/ uninitialization status respectively.

4. eliben Says:

@Ankur,

`memset` is the problem here – it’s not constant time, it’s O(n)

5. Sam V Says:

Good one..Nice explanation…

6. raghu Says:

While this is an interesting solution – you might also want to consider using calloc which initializes your array tp 0 in constant time. This is faster than a matched malloc / memset – because calloc on most modern operating systems is mapped to a zero page.

For illustration – lets allocate a 1 MB array via calloc. calloc returns the address of a 1MB block of contiguous virtual memory. Note: This is not 1 MB in RAM but 1MB in the virtual address space. This entire 1MB of contiguous address space is mapped to a 4KB page consisting of nothing but 0′s (called the zero page). So – if you iterate over your array – you will read nothing but 0′s.

Now when you change a value in the array, the OS’s virtual memory manager implements copy on write semantics and then actually allocates an area in RAM to back that particular page of the 1MB block.

The implication of this is that calloc will be way faster than malloc/memset.
Again – like i said – this is OS/CRT dependant – so please check the calloc implementation on your platform.

7. eliben Says:

raghu

Sounds interesting, thanks for sharing. What OSes implement this?

8. Gagan Says:

Hi,

I tried to reproduce this algorithm using C on OS X and Linux and compiling using GCC.

On OS X gcc –version returns
`i686-apple-darwin11-llvm-gcc-4.2`

On Linux it is
`gcc (SUSE Linux) 4.7.1 20120723`

(Basic difference being OS X GCC is llvm based)

My problem is how do I instantiate an array which is not guaranteed to be initialized to all 0s ?

Currently I am using the following way, for e.g. for the TO array
```unsigned int* TO; TO = (unsigned int*)(malloc(K*sizeof(unsigned int)));```

Basically, the TO array is also recording the sequence in which the VEC array was written.

On both the platforms the array is initialized with 0s, thus this could fail if we wrote the 0th location to begin with ?

9. Yoed Says:

consider the following:

1. when top == len you can start using it like a normal array

2. you can separate the [] operator into 2 more efficient implementations like this:

``` int& operator[](size_t n); //- for writing int operator[](size_t n) const; //- for reading ```

The const one will be used by default whenever possible and there you can just return init_val.

Also, you’re missing functionality to re-initialize the array in constant time by setting init_val to a new value and setting top to 0. This would really be constant time.

Otherwise cheers! it’s been nostalgic, this was one of my favorites in college.

10. s Says:

“Now accesses to i = 4 will satisfy the conditions from[3** ] = 0 < 1 and to[from[4]] = 4"

Incorrect, should be

Now accesses to i = 4 will satisfy the conditions from[4 **] = 0 < 1 and to[from[4]] = 4, so the

11. eliben Says:

@s: fixed, thanks

12. John Says:

What if at the very start, when from[] and to[] are uninitialized, they contain garbage data that fulfills the requirements necessary to assert that vec[i] has been set? It’s a rare edge case, but isn’t it technically possible?

13. eliben Says:

@John,

`top` was created just for that…