When writing concurrent code, a lock is one of the most critical tools programmers have in their toolbox. For the majority of applications adding a lock to protect access to some data is all you need. However, for some performance-critical pieces of code, a lock may introduce significant overhead and its use has to be carefully optimized; for example, lock granularity is frequently tweaked to make sure threads hold locks for the minimal amount of time necessary.

A common optimization for locks is using reader-writer locks, and this post discusses the concept and provides multiple implementations. I'm using Go as the programming language because of its great support for concurrency and the large number of building blocks available in the standard library, but the concepts are easily translatable to any programming language [1].

All the code for this post is available on GitHub. See the appendix at the end of the post for some notes on testing and benchmarking methodology.

The purpose of this post is purely educational; Go already has an excellent reader-writer lock implementation with sync.RWMutex. In fact, we'll be looking at how it works in this post.

[Image: a lock with several read keys and one large write key]

Motivation for having reader-writer locks

Reader-writer locks (RW locks from here on) were created from the observation that multiple threads can read shared data concurrently, as long as no one is modifying that data while it's being read. Regular locks don't distinguish between "lock for reading" and "lock for writing", so when multiple threads are reading data, each will still have to lock it, producing needless serialization. In other words, necessary exclusion between readers and writers leads to unnecessary exclusion between multiple readers.

RW locks fix this issue. Instead of having a single lock method, they have two - one for readers and one for writers. When readers enter the critical section they invoke the reader lock (and then reader unlock on exit); when writers enter the critical section they invoke the writer lock (and then writer unlock on exit). With this scheme, as long as there are no writers in the critical section, multiple readers can interact with the data simultaneously.
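
To make the scheme concrete, here's a minimal usage sketch with the standard library's sync.RWMutex (the shared map and the goroutine counts are just illustrative):

package main

import (
    "fmt"
    "sync"
)

func main() {
    var mu sync.RWMutex
    data := map[string]int{"hits": 0}
    var wg sync.WaitGroup

    // A writer takes the exclusive (writer) lock to modify the shared data.
    wg.Add(1)
    go func() {
        defer wg.Done()
        mu.Lock()
        data["hits"]++
        mu.Unlock()
    }()

    // Multiple readers can hold the reader lock at the same time.
    for i := 0; i < 3; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            mu.RLock()
            _ = data["hits"]
            mu.RUnlock()
        }()
    }

    wg.Wait()
    fmt.Println("hits:", data["hits"])
}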

A simple implementation

Let's start with a simple implementation that uses a counter. All RW locks in this post will implement this interface:

type RWLocker interface {
    RLock()
    RUnlock()
    WLock()
    WUnlock()
}

Having a single interface is very useful for testing - see the appendix for more details on that. The names of the methods in the interface should be self-explanatory: reader lock/unlock, writer lock/unlock.
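
For example, the standard library's sync.RWMutex can be adapted to this interface with a thin wrapper (the wrapper type below is just an illustration; the name is mine and may differ from the accompanying repository):

// RWMutexWrapper adapts sync.RWMutex to the RWLocker interface, so it can
// be tested and benchmarked alongside the implementations in this post.
type RWMutexWrapper struct {
    m sync.RWMutex
}

func (w *RWMutexWrapper) RLock()   { w.m.RLock() }
func (w *RWMutexWrapper) RUnlock() { w.m.RUnlock() }
func (w *RWMutexWrapper) WLock()   { w.m.Lock() }
func (w *RWMutexWrapper) WUnlock() { w.m.Unlock() }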

The simple counter-based implementation uses a mutex and a counter:

type ReaderCountRWLock struct {
    m           sync.Mutex
    readerCount int
}

The counter keeps track of the number of readers that are holding the lock. The reader locking/unlocking is straightforward [2]:

func (l *ReaderCountRWLock) RLock() {
    l.m.Lock()
    l.readerCount++
    l.m.Unlock()
}

func (l *ReaderCountRWLock) RUnlock() {
    l.m.Lock()
    l.readerCount--
    l.m.Unlock()
}

The writer is a bit trickier. A writer attempting to take the lock will have to wait until there are no readers inside. Here's a simple, though inefficient, way to do this:

func (l *ReaderCountRWLock) WLock() {
    for {
        l.m.Lock()
        if l.readerCount > 0 {
            l.m.Unlock()
        } else {
            break
        }
    }
}

The writer locks the mutex and checks if there are readers inside the lock. If there are, it releases the mutex and tries again - this is called spinning. If there are no readers, WLock returns with the mutex still locked, so readers won't be able to take the lock. A writer unlock is trivial:

func (l *ReaderCountRWLock) WUnlock() {
    l.m.Unlock()
}

This implementation is the simplest I could come up with. Its performance could be better, though. If readers are inside the lock, a writer will spin taking and releasing the mutex and "burn" CPU cycles. We could optimize this lock significantly if only we had some way for a writer to wait more efficiently.

Efficient waiting with a condition variable

Condition variables are a great fit here: the pattern of "take a mutex, check a condition, release the mutex if it's not ready yet" is exactly what they are designed to optimize. A more advanced version of the RW lock will use one:

type ReaderCountCondRWLock struct {
    readerCount int
    c           *sync.Cond
}

If you're wondering where the mutex is here, don't worry. In Go, a sync.Cond has a mutex embedded in it, so we get both. This struct is going to need a constructor:

func NewReaderCountCondRWLock() *ReaderCountCondRWLock {
    return &ReaderCountCondRWLock{0, sync.NewCond(new(sync.Mutex))}
}

Here's the reader for this lock:

func (l *ReaderCountCondRWLock) RLock() {
    l.c.L.Lock()
    l.readerCount++
    l.c.L.Unlock()
}

func (l *ReaderCountCondRWLock) RUnlock() {
    l.c.L.Lock()
    l.readerCount--
    if l.readerCount == 0 {
        l.c.Signal()
    }
    l.c.L.Unlock()
}

The only difference from the simple RW lock is that we signal the condition variable when the last reader is releasing the lock. Now the writer lock can be implemented as follows:

func (l *ReaderCountCondRWLock) WLock() {
    l.c.L.Lock()
    for l.readerCount > 0 {
        l.c.Wait()
    }
}

The Wait is still in a loop because it's entirely possible that between the time a reader signaled the condition variable and the time we got the lock, some other reader got to the lock first.

The unlock is:

func (l *ReaderCountCondRWLock) WUnlock() {
    l.c.Signal()
    l.c.L.Unlock()
}

Can you figure out why the Signal is needed here? Without it, we'll get a deadlock under moderately heavy load. The answer is in footnote [3].

This implementation is much more efficient than the first one, because the spin-loop is avoided. While there's still a loop around Wait, it's of a very different nature - Wait still blocks, and the loop re-runs only in rare cases when there's a true race to the lock.

Here, a note on benchmarking methodology is apt. I'm saying "more efficient" without providing evidence, because benchmarking such code is extremely hard and use-case dependent. Results will vary dramatically based on the specific workload. The appendix describes the benchmarking methodology I used for this code, and you should be able to reproduce my results.

Counting semaphore

Another variation of the RW lock, which I find particularly elegant and simple, uses a counting semaphore. In Go, the package golang.org/x/sync/semaphore provides the Weighted type, which implements this: a semaphore where you can acquire and release a numeric "weight". Here's our type, a constructor and the reader methods:

const maxWeight int64 = 1 << 30

type SemaRWLock struct {
    s *semaphore.Weighted
}

// NewSemaRWLock creates a new SemaRWLock.
func NewSemaRWLock() *SemaRWLock {
    return &SemaRWLock{semaphore.NewWeighted(maxWeight)}
}

func (l *SemaRWLock) RLock() {
    l.s.Acquire(context.Background(), 1)
}

func (l *SemaRWLock) RUnlock() {
    l.s.Release(1)
}

A writer simply grabs the whole maxWeight, guaranteeing that only a single writer can hold the semaphore, mutually exclusive with any other writers or readers:

func (l *SemaRWLock) WLock() {
    l.s.Acquire(context.Background(), maxWeight)
}

func (l *SemaRWLock) WUnlock() {
    l.s.Release(maxWeight)
}

While this is a really simple and clean solution, I found its performance wanting. It could be that the implementation of semaphore.Weighted is just not a good fit for my specific benchmark, but this approach performed vastly worse than even the simple spin-loop implementation.

Reader preference vs. Writer preference

You may have noticed a common theme for all the implementations presented so far. They all make it rather difficult for writers to get in when many readers are active. For example, the very first implementation requires readerCount to be at 0 for a writer to get in. Imagine there are 2 active readers and a waiting writer; the writer waits for both readers to release the lock, but while it's waiting other readers can come in and grab it. It's like waiting at a stop sign to cross a very busy street where every car has the right of way over you - it may take you quite a while.

This is called reader preference, or - more dramatically - writer starvation. It's often not what we want since it can add significant delay to updates in the system. It's great to have many readers working concurrently without blocking each other, but it's not as appealing if they are mostly reading stale data.

Let's now look at a couple of writer-preferring implementations.

A simple writer-preferring RW lock

This implementation is adapted from Wikipedia. We'll start with the struct and the constructor:

type WritePreferRWLock struct {
    readerCount int
    hasWriter   bool
    c           *sync.Cond
}

func NewWritePreferRWLock() *WritePreferRWLock {
    return &WritePreferRWLock{0, false, sync.NewCond(new(sync.Mutex))}
}

Here readerCount is still the number of readers holding the lock, but we're adding a new field - hasWriter; it's true whenever a writer has claimed the lock, either waiting for readers to clear out or actively writing. c is a condition variable and an associated mutex that implement the actual exclusion, as well as efficient waiting. Let's start with the reader:

func (l *WritePreferRWLock) RLock() {
    l.c.L.Lock()
    for l.hasWriter {
        l.c.Wait()
    }
    l.readerCount++
    l.c.L.Unlock()
}

func (l *WritePreferRWLock) RUnlock() {
    l.c.L.Lock()
    l.readerCount--
    if l.readerCount == 0 {
        l.c.Broadcast()
    }
    l.c.L.Unlock()
}

As before, it's critical for correctness to only examine the shared data when a mutex is held. When a reader comes in, it first checks if writers are waiting to grab the lock; if yes, it will wait on the condition variable. This is where the writer preference is embodied - unlike previous implementations, in this one readers give writers the right of way. When no writers are waiting, the reader increments the reader count.

When releasing the lock, the last reader out will broadcast l.c to wake up any writers that could be waiting [4].

Here's the writer side:

func (l *WritePreferRWLock) WLock() {
    l.c.L.Lock()
    for l.hasWriter {
        l.c.Wait()
    }
    l.hasWriter = true
    for l.readerCount > 0 {
        l.c.Wait()
    }
    l.c.L.Unlock()
}

func (l *WritePreferRWLock) WUnlock() {
    l.c.L.Lock()
    l.hasWriter = false
    l.c.Broadcast()
    l.c.L.Unlock()
}

A writer starts by checking that no other writer is active. Note that unlike in previous implementations, here writers don't hold the mutex between WLock and WUnlock; instead, the mutex is only used to control access to the shared struct, and the hasWriter field plays the double role of "writer waiting to get lock" and "writer is using lock". Once there are no more writers, it flips hasWriter and waits for existing readers to clear out. Recall that setting hasWriter to true guarantees that no new readers will grab the lock.

In WUnlock, the writer unmarks hasWriter and broadcasts on the condition variable. We have to use Broadcast rather than Signal here because multiple readers can be waiting and we want to release all of them.

A more efficient writer-preferring RW lock

While relatively easy to understand, the implementation we've just seen doesn't perform very well - at least in my benchmarks. Therefore, I went looking for a more sophisticated writer-preferring implementation, and ended up exploring what Go itself uses for its sync.RWMutex [5]. This is the most complicated implementation in this post, so don't be discouraged if you don't get it on the first try. Feel free to experiment with the code and add some logging if you want to get a better grasp of what's going on.

The goal is to make lock taking - especially in readers - as fast as possible, while at the same time giving writers the right of way.

As usual, we'll start with the struct and constructor:

type WritePreferFastRWLock struct {
    w                sync.Mutex
    writerWait       chan struct{}
    readerWait       chan struct{}
    numPending       int32
    readersDeparting int32
}

const maxReaders int32 = 1 << 30

func NewWritePreferFastRWLock() *WritePreferFastRWLock {
    var l WritePreferFastRWLock
    l.writerWait = make(chan struct{})
    l.readerWait = make(chan struct{})
    return &l
}

Now I'll present the reader methods and explain the relevant fields along the way:

func (l *WritePreferFastRWLock) RLock() {
    if atomic.AddInt32(&l.numPending, 1) < 0 {
        <-l.readerWait
    }
}

func (l *WritePreferFastRWLock) RUnlock() {
    if r := atomic.AddInt32(&l.numPending, -1); r < 0 {
        if atomic.AddInt32(&l.readersDeparting, -1) == 0 {
            l.writerWait <- struct{}{}
        }
    }
}

The mutex w is not used by readers at all. Its sole purpose is to provide mutual exclusion between writers, so we'll get to it later. The most critical field in this implementation is numPending. It's used to mark the number of readers that are using the lock (like readerCount), but is sneakily used by writers as well. Writers subtract maxReaders from this field, so a negative value means a writer is using the lock. Access to the field is done with atomics - so no locking required.

The reader adds 1 to numPending (atomically). If the value of numPending is not negative, there are no waiting/active writers and the reader can proceed. This path is expected to be the most common; hence, it's extremely quick - only a single atomic add followed by a branch.

If numPending is negative it means a writer is waiting on the lock or using it, so the reader will give the writer the right of way. This is done by waiting on an unbuffered channel.

When a reader is done, it decrements numPending. If there are no writers, it's done - again, very quick for the common path. If a writer is waiting, the readersDeparting field is used to ensure that only the last departing reader releases the waiting writer. readersDeparting is set by a waiting writer, as we'll soon see.
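
To make the numPending convention concrete, here's a small illustrative helper (it's not part of the lock, just a sketch of the encoding): a negative value means a writer is pending or active, and adding maxReaders back recovers the reader count.

// decodeNumPending is purely illustrative: it interprets a numPending
// value under the convention used by WritePreferFastRWLock.
func decodeNumPending(n int32) (readers int32, writerActive bool) {
    if n < 0 {
        // A writer has subtracted maxReaders; adding it back yields the
        // number of readers currently accounted for.
        return n + maxReaders, true
    }
    return n, false
}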

Writer lock:

func (l *WritePreferFastRWLock) WLock() {
    l.w.Lock()
    r := atomic.AddInt32(&l.numPending, -maxReaders) + maxReaders
    if r != 0 && atomic.AddInt32(&l.readersDeparting, r) != 0 {
        <-l.writerWait
    }
}

The w mutex is used here to ensure that only one writer is inside the lock at any given time. The second line of the function is a bit tricky, since it does two things:

  1. It announces to readers that a writer is pending by subtracting maxReaders from numPending.
  2. It recovers the number of active readers by adding maxReaders back, storing the result in the local variable r.

Then, if there are any active readers (r != 0), it adds their number to readersDeparting - this tells departing readers how many of them must exit before the waiting writer can be released via writerWait. If some readers are still inside, the writer blocks on writerWait until the last departing reader lets it through. At this point the writer holds the lock exclusively (other writers are excluded by w, and new readers block on readerWait until the writer is done).

Writer unlock:

func (l *WritePreferFastRWLock) WUnlock() {
    r := atomic.AddInt32(&l.numPending, maxReaders)
    for i := 0; i < int(r); i++ {
        l.readerWait <- struct{}{}
    }
    l.w.Unlock()
}

Once again, the clever usage of numPending takes a bit of time to decipher here. By adding maxReaders, the writer tells future readers that there are no more writers using the lock. The remaining r is the number of readers that accumulated waiting for the lock while this writer was active (review RLock again). The writer now releases all of them by sending r dummy objects into readerWait. Finally it unlocks the writer exclusion lock.

Phew, that was quite a journey! While it's not necessary to fully understand this version of RW locking to get a good sense of how RW locks work, it can be quite a rewarding experience - so I do encourage you to play with the code a bit and study it more.

Appendix: Testing and benchmarking methodology

The testing and benchmarking code can be found here. Its emphasis is on testing correctness, so it runs a stress test with many readers and many writers running concurrently. They all have access to shared data - a long slice of integers. The slice begins in increasing sequence (0, 1, 2, 3, ...) and writers increment each element. Readers check for consistency - verifying that the elements in the slice are in the right order. If the lock implementation is wrong, readers will read interim state and will fail with an error. Since all the locks here implement RWLocker, testing them in a uniform way is very easy - interfaces are great for testing!
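
As a rough sketch of that scheme (the real test in the repository is more thorough; the goroutine counts and names here are illustrative, and data is assumed to start as 0, 1, 2, ...):

func stressTest(lock RWLocker, data []int, iterations int) {
    var wg sync.WaitGroup

    // Writers increment every element, keeping the slice strictly increasing.
    for w := 0; w < 3; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < iterations; i++ {
                lock.WLock()
                for j := range data {
                    data[j]++
                }
                lock.WUnlock()
            }
        }()
    }

    // Readers verify that they never observe a half-updated slice.
    for r := 0; r < 10; r++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < iterations; i++ {
                lock.RLock()
                for j := 1; j < len(data); j++ {
                    if data[j] != data[j-1]+1 {
                        panic("inconsistent read: lock is broken")
                    }
                }
                lock.RUnlock()
            }
        }()
    }

    wg.Wait()
}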

Each reader and writer does many iterations, so this is a fairly good stress test. I verified that minimal changes in the lock implementations cause the tests to fail immediately, and ran all tests with -count 100 and separately with -race.

Benchmarking is trickier. Each reader and each writer record the time it takes them to acquire the lock, and averages are printed out at the end of the test. The tricky part is representing a realistic workload; while my measurements seem reasonable for my specific testing scenario, there are many variables that characterize a workload - the number of writers, the number of readers, how often they take locks, how long their work inside the critical section takes, and so on. In short, YMMV.
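
Conceptually, each acquisition is timed like this (a simplified sketch, not the exact benchmark code from the repository):

// timedRLock takes the reader lock and reports how long the acquisition
// took; the per-goroutine durations are then accumulated and averaged.
func timedRLock(lock RWLocker) time.Duration {
    start := time.Now()
    lock.RLock()
    return time.Since(start)
}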


[1]

Quick disclaimer: in Go, channels are the first thing programmers reach for when coordinating between multiple goroutines, so locks are used much less frequently than in other languages. Nevertheless, they are still useful in many scenarios.

In addition, I'll be using the term threads to discuss multiple concurrent execution paths, to keep it more familiar for non-Go programmers.

[2]Here and elsewhere I'm skipping some extra error checking / assertions to keep the code shorter. Take a look at accompanying code sample for the full version that also includes comments.
[3]If there are two writers waiting on the condition variable when readers are present, one of them will take the lock and the other may still be blocked in Wait. Without additional readers signaling the condition variable, this other writer will be stuck in Wait forever. Therefore, it's important for writers to signal the condition variable too when they're done.
[4]

The Wikipedia sample I took this from says we should signal the condition variable, but this could be subtly wrong. While one writer could be blocked in l.c.Wait waiting for readerCount to drop to 0, another writer could be blocked waiting for hasWriter to become false; in addition, multiple readers can be blocked in l.c.Wait waiting for hasWriter to become false. With a single Signal it's not defined which goroutine will be woken up.

Interestingly, if we replace the Broadcast with Signal in RUnlock, it happens to work out fine in Go. That's because Go's sync.Cond maintains a FIFO queue of waiting goroutines, and the writer that got there first (i.e. the one that toggled hasWriter and is now waiting for readerCount to drop to 0) will be woken up. However, the documentation doesn't guarantee this, so it's safer (albeit somewhat less efficient) to use Broadcast.

[5]The following code sample is modified from the standard library implementation (Go version 1.12), which uses Go runtime-internal functions and types like runtime_SemacquireMutex; instead, I'll be using channels for a similar purpose. While the resulting implementation is slightly slower than sync.RWMutex, it's easier to understand.