The promises and challenges of std::async task-based parallelism in C++11

One of the biggest and most impactful changes C++11 heralds is a standardized threading library, along with a documented memory model for the language. While extremely useful and obviating the dilemma of non-portable code vs. third-party libraries for threading, this first edition of the threading libraries is not without kinks. This article is a brief overview of how C++11 tries to enable a "task-based parallelism" idiom with the introduction of std::async, and the challenges it runs into.

Warning: this article is opinionated, especially its last third or so. I'll be happy to get corrections and suggestions in comments or email.

Background - threads vs. tasks

When I'm talking about "thread-based parallelism", I mean manual, low-level management of threads. Something like using pthreads or the Windows APIs for threads directly. You create threads, launch them, "join" them, etc. Even though threads are an OS abstraction, this is as close as you can get to the machine. In such cases, the programmer knows (or had better know!) exactly how many threads are running at any given time, and has to take care of load-balancing the work between them.

"Task-based parallelism" refers to a higher level of abstraction, where the programmer manages "tasks" - chunks of work that have to be done, while the library (or language) presents an API to launch these tasks. It is then the library's job to launch threads, make sure there are not too few or too many of them, make sure the work is reasonably load-balanced, and so on. For better or worse, this gives the programmer less low-level control over the system, but also higher-level, more convenient and safer APIs to work with. Some will claim that this also leads to better performance, though this really depends on the application.

Threads and tasks in C++11

The C++11 thread library gives us a whole toolbox for working at the thread level. We have std::thread along with a horde of synchronization and signaling mechanisms, a well-defined memory model, thread-local data and atomic operations right there in the standard.

C++11 also tries to provide a set of tools for task-based parallelism, revolving around std::async. It succeeds in some respects, and fails in others. I will go ahead and say in advance that I believe std::async is a very nice tool to replace direct std::thread usage on the low level. On the other hand, it is not really a good task-based parallelism abstraction. The rest of the article will cover these claims in detail.

Using std::async as a smarter std::thread

While it's great to have std::thread in standard C++, it's a fairly low level construct. As such, its usage is often more cumbersome than we'd want, and also more error-prone than we'd want. Therefore, an experienced programmer would sit down and come up with a slightly higher-level abstraction that makes C++ threading a bit more pleasant and also safer. The good news is that someone has already written this abstraction, and even made it standard. It's called std::async.

Here's a simple example of using a worker thread to perform some work - in this case add up integers in a vector [1]:

void accumulate_block_worker(int* data, size_t count, int* result) {
  *result = std::accumulate(data, data + count, 0);
}

void use_worker_in_std_thread() {
  std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
  int result;
  std::thread worker(accumulate_block_worker,
                     v.data(), v.size(), &result);
  worker.join();
  std::cout << "use_worker_in_std_thread computed " << result << "\n";
}

Straightforward enough. The thread is created and then immediately joined (waited upon to finish in a blocking manner). The result is communicated back to the caller via a pointer argument, since a std::thread cannot have a return value. This already points at a potential issue: when we write computation functions in C++ we usually employ the return value construct, rather than taking results by reference/pointer. Say we had a function already that did work, and was used in serial code, and we want to launch it in a std::thread. Since that function most likely returns its value, we'd need to either write a new version of it, or create some sort of wrapper.
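To make the "wrapper" option concrete, here is a minimal sketch. The names accumulate_block and sum_in_thread_via_wrapper are made up for this illustration; the idea is simply to leave the value-returning function alone and let a lambda store its return value into a variable owned by the caller.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// An existing "serial" function that returns its result the natural way.
int accumulate_block(const int* data, std::size_t count) {
  return std::accumulate(data, data + count, 0);
}

// Hypothetical helper: runs accumulate_block in a std::thread by wrapping it
// in a lambda that writes the return value into a caller-owned variable.
int sum_in_thread_via_wrapper(const std::vector<int>& v) {
  int result = 0;
  std::thread worker([&v, &result] {
    result = accumulate_block(v.data(), v.size());
  });
  worker.join();  // result is only safe to read after the join
  return result;
}
```

It works, but every call site that wants to run a function this way needs similar glue, and the output variable has to outlive the thread.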

Here's an alternative using std::async and std::future:

int accumulate_block_worker_ret(int* data, size_t count) {
  return std::accumulate(data, data + count, 0);
}

void use_worker_in_std_async() {
  std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
  std::future<int> fut = std::async(
      std::launch::async, accumulate_block_worker_ret, v.data(), v.size());
  std::cout << "use_worker_in_std_async computed " << fut.get() << "\n";
}

I'm passing the std::launch::async policy explicitly - more on this in the latter part of the article. The main thing to note here is that now the actual function launched in a thread is written in a natural way, returning the value it computed; no by-pointer output arguments in sight. std::async takes the return type of the function and returns it wrapped in a std::future, which is another handy abstraction. Read more about futures and promises in concurrent programming on Wikipedia. In the code above, the waiting for the computation thread to finish happens when we call get() on the future.

I like how the future decouples the task from the result. In more complex code, you can pass the future somewhere else, and it encapsulates both the thread to wait on and the result you'll end up with. The alternative of using std::thread directly is more cumbersome, because there are two things to pass around.

Here is a contrived example, where a function launches threads but then wants to delegate waiting for them and getting the results to some other function. It represents many realistic scenarios where we want to launch tasks in one place but collect results in some other place. First, a version with std::thread:

// Demonstrates how to launch two threads and return two results to the caller
// that will have to wait on those threads. Gives half the input vector to
// one thread, and the other half to another.
std::vector<std::thread>
launch_split_workers_with_std_thread(std::vector<int>& v,
                                     std::vector<int>* results) {
  std::vector<std::thread> threads;
  threads.emplace_back(accumulate_block_worker, v.data(), v.size() / 2,
                       &((*results)[0]));
  threads.emplace_back(accumulate_block_worker, v.data() + v.size() / 2,
                       v.size() / 2, &((*results)[1]));
  return threads;
}

...

{
  // Usage
  std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
  std::vector<int> results(2, 0);
  std::vector<std::thread> threads =
      launch_split_workers_with_std_thread(v, &results);
  for (auto& t : threads) {
    t.join();
  }
  std::cout << "results from launch_split_workers_with_std_thread: "
            << results[0] << " and " << results[1] << "\n";
}

Note how the thread objects have to be propagated back to the caller (so the caller can join them). Also, the storage for the results has to be provided by the caller, because a vector local to the launching function would go out of scope while the threads may still be writing to it [2].

Now, the same operation using std::async and futures:

using int_futures = std::vector<std::future<int>>;

int_futures launch_split_workers_with_std_async(std::vector<int>& v) {
  int_futures futures;
  futures.push_back(std::async(std::launch::async, accumulate_block_worker_ret,
                               v.data(), v.size() / 2));
  futures.push_back(std::async(std::launch::async, accumulate_block_worker_ret,
                               v.data() + v.size() / 2, v.size() / 2));
  return futures;
}

...

{
  // Usage
  std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
  int_futures futures = launch_split_workers_with_std_async(v);
  std::cout << "results from launch_split_workers_with_std_async: "
            << futures[0].get() << " and " << futures[1].get() << "\n";
}

Once again, the code is cleaner and more concise. Bundling the thread handle with the result it's expected to produce just makes more sense.

If we want to implement more complex result sharing schemes, things get even trickier. Say we want two different threads to wait on the computation result. You can't just call join on a thread from multiple other threads. Or at least, not easily. A thread that was already joined will throw an exception if another join is attempted. With futures, we have std::shared_future, which wraps a std::future and permits concurrent access from multiple threads that may want to get the future's result.
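A small sketch of the std::shared_future approach follows. The helper name read_result_from_two_threads is an assumption made for this example, and the worker is redeclared with a const pointer so the block stands alone. Each consumer thread holds its own copy of the shared future and calls get() on it.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <thread>
#include <vector>

int accumulate_block_worker_ret(const int* data, std::size_t count) {
  return std::accumulate(data, data + count, 0);
}

// Hypothetical helper: two consumer threads each read the same result.
std::vector<int> read_result_from_two_threads(const std::vector<int>& v) {
  // share() transfers the shared state from the std::future into a
  // std::shared_future, which (unlike std::future) is copyable.
  std::shared_future<int> shared =
      std::async(std::launch::async, accumulate_block_worker_ret, v.data(),
                 v.size())
          .share();

  std::vector<int> seen(2, 0);
  // Each lambda captures its own copy of the shared_future; calling get()
  // on separate copies from different threads is allowed.
  std::thread a([shared, &seen] { seen[0] = shared.get(); });
  std::thread b([shared, &seen] { seen[1] = shared.get(); });
  a.join();
  b.join();
  return seen;
}
```

Unlike std::future::get(), which can only be called once, get() on a std::shared_future can be called repeatedly and returns the same stored value each time.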

Setting a timeout on retrieving task results

Say we launched a thread to do a computation. At some point we'll have to wait for it to finish in order to obtain the result. The wait may be trivial if we set the program up in a certain way, but it can actually take time in some situations. Can we set a timeout on this wait so that we don't block for too long? With the pure std::thread solution, it won't be easy. You can't set a timeout on the join() method, and other solutions are convoluted (such as setting up a "cooperative" timeout by sharing a condition variable with the launched thread).
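For a taste of what the "cooperative" approach involves, here is a rough sketch. The Completion struct and the run_with_timeout function are invented for this illustration, and the sleep stands in for real work. The launcher waits on a condition variable with a timeout, and the worker signals it when done.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical state shared between the launcher and the worker thread.
struct Completion {
  std::mutex m;
  std::condition_variable cv;
  bool done = false;
  int result = 0;
};

// Returns true and stores the result in *out if the worker finished within
// `timeout`; returns false otherwise.
bool run_with_timeout(std::chrono::milliseconds work_time,
                      std::chrono::milliseconds timeout, int* out) {
  Completion state;
  std::thread worker([&state, work_time] {
    std::this_thread::sleep_for(work_time);  // stand-in for real work
    {
      std::lock_guard<std::mutex> lock(state.m);
      state.result = 42;
      state.done = true;
    }
    state.cv.notify_one();
  });

  bool finished = false;
  {
    std::unique_lock<std::mutex> lock(state.m);
    // The predicate overload handles spurious wakeups for us.
    finished =
        state.cv.wait_for(lock, timeout, [&state] { return state.done; });
    if (finished) *out = state.result;
  }
  // The thread still has to be joined before `state` goes away, so this
  // sketch only bounds the wait for the result, not the whole call.
  worker.join();
  return finished;
}
```

That is a mutex, a condition variable, a flag and a result slot, all to emulate a single timed wait.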

With futures returned from std::async, nothing could be easier, since std::future has a wait_for() method that takes a timeout:

int accumulate_block_worker_ret(int* data, size_t count) {
  std::this_thread::sleep_for(std::chrono::seconds(3));
  return std::accumulate(data, data + count, 0);
}

int main(int argc, const char** argv) {
  std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
  std::future<int> fut = std::async(
      std::launch::async, accumulate_block_worker_ret, v.data(), v.size());
  while (fut.wait_for(std::chrono::seconds(1)) != std::future_status::ready) {
    std::cout << "... still not ready\n";
  }
  std::cout << "use_worker_in_std_async computed " << fut.get() << "\n";

  return 0;
}

Propagating exceptions between threads

If you're writing C++ code with exceptions enabled, you are kinda "living on the edge". You always have to keep a mischievous imaginary friend on your left shoulder who will remind you that at any point in the program an exception can be thrown and then "how are you handling it?". Threads add another dimension to this (already difficult) problem. What happens when a function launched in a std::thread throws an exception?

void accumulate_block_worker(int* data, size_t count, int* result) {
  throw std::runtime_error("something broke");
  *result = std::accumulate(data, data + count, 0);
}

...

{
  // Usage.
  std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
  int result;
  std::thread worker(accumulate_block_worker,
                     v.data(), v.size(), &result);
  worker.join();
  std::cout << "use_worker_in_std_thread computed " << result << "\n";
}

This:

terminate called after throwing an instance of 'std::runtime_error'
  what():  something broke
Aborted (core dumped)

Ah, silly me, I didn't catch the exception. Let's try this alternative usage:

try {
  std::thread worker(accumulate_block_worker,
                     v.data(), v.size(), &result);
  worker.join();
  std::cout << "use_worker_in_std_thread computed " << result << "\n";
} catch (const std::runtime_error& error) {
  std::cout << "caught an error: " << error.what() << "\n";
}

Nope:

terminate called after throwing an instance of 'std::runtime_error'
  what():  something broke
Aborted (core dumped)

What's going on? The exception is thrown in the worker thread, not in the thread that executes the try block, so the catch clause never sees it. And the C++ standard is clear about what happens when the function running in a std::thread exits with an uncaught exception: std::terminate() is called. So trying to catch the exception in another thread won't help.

While the example shown here is synthetic, there are many real-world cases where code executed in a thread can throw an exception. In a regular, non-threaded call, we may reasonably expect that this exception should be handled somewhere higher up the call stack. If the code runs in a thread, however, this assumption is broken.

It means that we should wrap the function running in the new thread in additional code that will catch all exceptions and somehow transfer them to the calling thread. Yet another "result" to return, as if returning the actual result of the computation wasn't cumbersome enough.
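A minimal sketch of such wrapping code might look like the following. It uses std::exception_ptr as the extra channel; the names throwing_worker and run_and_report_error are made up for this example.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>

void throwing_worker(int* result) {
  throw std::runtime_error("something broke");
  *result = 42;  // never reached
}

// Hypothetical helper: runs the worker in a std::thread, carries any
// exception back to the calling thread and re-throws it there. Returns the
// message of the exception caught in the calling thread, or "" if none.
std::string run_and_report_error() {
  int result = 0;
  std::exception_ptr error;  // the extra "result" channel for exceptions

  std::thread worker([&result, &error] {
    try {
      throwing_worker(&result);
    } catch (...) {
      // Capture whatever was thrown so it can cross the thread boundary.
      error = std::current_exception();
    }
  });
  worker.join();

  try {
    if (error) std::rethrow_exception(error);
  } catch (const std::runtime_error& e) {
    return e.what();
  }
  return "";
}
```

The exception now reaches the caller, at the cost of one more shared variable that has to be declared, captured and checked by hand.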

Once again, std::async to the rescue! Let's try this again:

int accumulate_block_worker_ret(int* data, size_t count) {
  throw std::runtime_error("something broke");
  return std::accumulate(data, data + count, 0);
}

...

{
  // Usage.
  std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
  try {
    std::future<int> fut = std::async(
        std::launch::async, accumulate_block_worker_ret, v.data(), v.size());
    std::cout << "use_worker_in_std_async computed " << fut.get() << "\n";
  } catch (const std::runtime_error& error) {
    std::cout << "caught an error: " << error.what() << "\n";
  }
}

Now we get:

caught an error: something broke

The exception was propagated to the calling thread through the std::future and re-thrown when its get() method is called.

This is also the place to mention that the C++11 thread library provides many low-level building blocks for implementing high-level threading and task constructs. Returning a std::future from std::async is a fairly high-level abstraction, tailored for a specific kind of task management. If you want to implement something more advanced, like a special kind of concurrent queue that manages tasks, you'll be happy to hear that tools like std::promise and std::packaged_task are right there in the standard library to make your life more convenient. They let you associate functions with futures, and set exceptions separately from real results on those futures. I'll leave a deeper treatment of these topics to another day.
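As a small taste of these building blocks, here is a sketch with hypothetical helper names (sum_with_packaged_task and promise_error_message). The first function ties a function to a future with std::packaged_task and runs it on a thread we manage ourselves; the second stores an exception, rather than a value, in a std::promise.

```cpp
#include <cstddef>
#include <exception>
#include <future>
#include <numeric>
#include <stdexcept>
#include <string>
#include <thread>
#include <utility>
#include <vector>

int accumulate_block_worker_ret(const int* data, std::size_t count) {
  return std::accumulate(data, data + count, 0);
}

// std::packaged_task: wraps a callable so that its return value (or its
// exception) lands in a future, while we decide where and when it runs.
int sum_with_packaged_task(const std::vector<int>& v) {
  std::packaged_task<int(const int*, std::size_t)> task(
      accumulate_block_worker_ret);
  std::future<int> fut = task.get_future();
  // packaged_task is move-only, so it is moved into the thread.
  std::thread worker(std::move(task), v.data(), v.size());
  int sum = fut.get();
  worker.join();
  return sum;
}

// std::promise: the producer side of a future. Here it delivers an
// exception instead of a value; get() re-throws it.
std::string promise_error_message() {
  std::promise<int> p;
  std::future<int> fut = p.get_future();
  p.set_exception(
      std::make_exception_ptr(std::runtime_error("something broke")));
  try {
    fut.get();
  } catch (const std::runtime_error& e) {
    return e.what();
  }
  return "";
}
```

A task queue could store such packaged tasks and hand the matching futures back to whoever submitted the work.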

... but is this real task-based parallelism?

So we've seen how std::async helps us write robust threaded programs with less code compared to "raw" std::threads. If your threading needs are covered by std::async, you should definitely use it instead of toiling to re-implement the same niceties with raw threads and other low-level constructs. But does std::async enable real task-based parallelism, wherein you can nonchalantly hand it functions and expect it to load-distribute them for you over some existing thread pool to use OS resources efficiently? Unfortunately, no. Well, at least in the current version of the C++ standard, not yet.

There are many problems. Let's start with the launch policy.

In all the samples shown above, I'm explicitly passing the async policy to std::async to circumvent the issue. async is not the only policy it supports. The other one is deferred, and the default is actually async | deferred, meaning that we leave it to the runtime to decide. Except that we shouldn't.

The deferred policy means that the task will run lazily on the calling thread only when get() is called on the future it returns. This is dramatically different from the async policy in many respects, so just letting the runtime choose either sounds like it may complicate programming. Consider the wait_for example I've shown above. Let's modify it to launch the accumulation task with a deferred policy:

int accumulate_block_worker_ret(int* data, size_t count) {
  std::this_thread::sleep_for(std::chrono::seconds(3));
  return std::accumulate(data, data + count, 0);
}

int main(int argc, const char** argv) {
  std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
  std::future<int> fut = std::async(
      std::launch::deferred, accumulate_block_worker_ret, v.data(), v.size());
  while (fut.wait_for(std::chrono::seconds(1)) != std::future_status::ready) {
    std::cout << "... still not ready\n";
  }
  std::cout << "use_worker_in_std_async computed " << fut.get() << "\n";

  return 0;
}

Running it:

$ ./using-std-future
... still not ready
... still not ready
... still not ready
... still not ready
... still not ready
... still not ready
... still not ready
^C

Oops, what's going on? The problem is that with the deferred policy, the call to wait_for on the future doesn't actually run the task. Only get() does. So we're stuck in an infinite loop. This can be fixed, of course (by also checking for a std::future_status::deferred status from wait_for()), but requires extra thinking and extra handling. It's not just a matter of not getting stuck in a loop, it's also a matter of what do we do in case the task is deferred? Handling both async and deferred tasks in the same caller code becomes tricky. When we use the default policy, we let the runtime decide when it wants to use deferred instead of async, so bugs like this may be difficult to find since they will only manifest occasionally under certain system loads.
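Here is one possible shape of that extra handling, as a sketch. The helper name wait_and_get and the 50 ms polling interval are arbitrary choices for this example. It stops polling as soon as wait_for reports a deferred task, and lets get() run the task on the spot.

```cpp
#include <chrono>
#include <future>

// Hypothetical helper: waits for a future that may have been launched with
// either policy. *polls counts how many times the timed wait expired.
template <typename T>
T wait_and_get(std::future<T>& fut, int* polls) {
  for (;;) {
    std::future_status status = fut.wait_for(std::chrono::milliseconds(50));
    if (status == std::future_status::deferred) {
      // Nothing is running in the background; polling would never end.
      break;
    }
    if (status == std::future_status::ready) {
      break;
    }
    ++*polls;  // timed out: the task is still running on another thread
  }
  // For a deferred task, this is where the work actually happens.
  return fut.get();
}
```

Note that in the deferred case the "timeout" is meaningless: the caller blocks inside get() for as long as the task takes.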

Tasks and TLS

The C++11 standard also added TLS support with the thread_local keyword, which is great because TLS is a useful technique that hadn't been standardized before. Let's try a synthetic example showing how it mixes with std::async's launch policies:

thread_local int tls_var;

int read_tls_var() {
  return tls_var;
}

int main(int argc, const char** argv) {
  tls_var = 50;

  std::future<int> fut = std::async(std::launch::deferred, read_tls_var);
  std::cout << "got from read_tls_var: " << fut.get() << "\n";
  return 0;
}

When run, this shows the value 50, because read_tls_var runs in the calling thread. If we change the policy to std::launch::async, it will instead show 0. That's because read_tls_var now runs in a new thread where tls_var wasn't set to 50 by main. Now imagine the runtime decides if your task runs in the same thread or another thread. How useful are TLS variables in this scenario? Not very much, unfortunately. Well, unless you love non-determinism and multi-threading Heisenbugs :-)

Tasks and mutexes

Here's another fun example, this time with mutexes. Consider this piece of code:

int task(std::recursive_mutex& m) {
  m.lock();
  return 42;
}

int main(int argc, const char** argv) {
  std::recursive_mutex m;
  m.lock();

  std::future<int> fut = std::async(std::launch::deferred, task, std::ref(m));
  std::cout << "got from task: " << fut.get() << "\n";
  return 0;
}

It runs and shows 42 because the same thread can lock a std::recursive_mutex multiple times. If we switch the launch policy to async, the program deadlocks because a different thread cannot lock a std::recursive_mutex while the calling thread is holding it. Contrived? Yes. Can this happen in real code - yes, of course. If you're thinking to yourself "he's cheating, what is this weird std::recursive_mutex example specifically tailored to show a problem...", I assure you that a regular std::mutex has its own problems. It has to be unlocked in the thread it was locked in. So if task unlocked a regular std::mutex that was locked by main instead, we'd also have an issue. Unlocking a mutex in a different thread is undefined behavior. With the default launch policy, this undefined behavior would happen just sometimes. Lovely.

Bartosz Milewski has some additional discussion of these problems here and also here. Note that they will haunt more advanced thread strategies as well. Thread pools reuse the same thread handles for different tasks, so they'll also have to face TLS and mutex thread-locality issues. Whatever the adopted solution ends up being, some additional constraints will have to be introduced to make sure it's not too easy to shoot yourself in the foot.

Is std::async fundamentally broken?

Due to the problems highlighted above, I'd consider the default launch policy of std::async broken and would never use it in production code. I'm not the only one thinking this way. Scott Meyers, in his "Effective Modern C++", recommends the following wrapper to launch tasks:

// Note: the deduced 'auto' return type requires C++14.
template <typename F, typename... Ts>
inline auto reallyAsync(F&& f, Ts&&... params) {
  return std::async(std::launch::async, std::forward<F>(f),
                    std::forward<Ts>(params)...);
}

Use this instead of raw std::async calls to ensure that the tasks are always launched in fresh threads, so that we can reason about our program more deterministically.
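The wrapper above relies on return type deduction for a plain auto function, which is a C++14 feature. If you are limited to C++11, one way to spell the return type is with a trailing decltype. This is a sketch under that assumption, with made-up names (really_async_cxx11, runs_on_another_thread), not the exact code from the book.

```cpp
#include <future>
#include <thread>
#include <utility>

// C++11-compatible variant: the return type is whatever std::async returns
// for these arguments, spelled out with a trailing decltype.
template <typename F, typename... Ts>
auto really_async_cxx11(F&& f, Ts&&... params)
    -> decltype(std::async(std::launch::async, std::forward<F>(f),
                           std::forward<Ts>(params)...)) {
  return std::async(std::launch::async, std::forward<F>(f),
                    std::forward<Ts>(params)...);
}

// Hypothetical check: returns true if the task ran on a thread other than
// the caller's.
bool runs_on_another_thread() {
  std::thread::id caller = std::this_thread::get_id();
  std::future<std::thread::id> fut =
      really_async_cxx11([] { return std::this_thread::get_id(); });
  return fut.get() != caller;
}
```

With the async policy forced, runs_on_another_thread() is expected to return true, since the task has to run as if on a new thread.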

The authors of gcc came to realize this as well, and switched the libstdc++ default launch policy to std::launch::async in mid-2015. In fact, as the discussion in the corresponding bug report highlights, std::async came close to being deprecated in the next C++ standard, since the standards committee realized it's not really possible to implement real task-based parallelism with it without non-deterministic and undefined behavior in some corner cases. And it's the role of the standards committee to ensure all corners are covered [3].

It's evident from online sources that std::async was a bit rushed into the C++11 standard, when the committee didn't have enough time to standardize a more comprehensive library solution such as thread pools. std::async was put there as a compromise, as part of a collection of low-level building blocks that could be used to build higher-level abstractions later. But actually, it can't be used that way. Or at least not easily. "Real" task-based parallel systems feature things like task migration between threads, work-stealing queues, etc. A system like that built on top of std::async will just keep hitting the problems highlighted above (TLS, mutexes, etc.) in real user code. A more comprehensive overhaul is required. Luckily, this is exactly what the standards committee is toiling on - robust high-level concurrency primitives for the C++17 version of the standard.

Conclusion and practical advice

This article started by expounding the virtues of std::async compared to plain std::threads, but finished by pointing out numerous problems with std::async that one needs to be aware of. So, what do we do?

I actually think that by being careful to stay within the well-defined limits of std::async, we can enjoy its benefits without running into the gotchas. Specifically:

Prefer std::async to std::thread. Futures are just too useful to ignore; especially if your code deals with exception handling, this is the only sane way to stay safe. Results provided by different threads should be wrapped in futures.
Always use the std::launch::async policy with std::async if you actually want multi-threading. Do not rely on the default policy. Do not use deferred unless you have very special needs. Remember that deferred is just syntactic sugar over holding a function pointer to call it later.
If you need a real thread pool or some other higher-level concurrency construct, use a library or roll your own. Standard objects like std::future, std::promise and std::packaged_task can be very helpful.

[1] Here and elsewhere, I'm trying to strip the code down to bare essentials, in order to demonstrate the actual threading concepts the article focuses on. C++ has a lot of complexities which I'm occasionally leaving behind, on purpose. For example, the accumulator worker discussed here is not very generic or STL-y. Rewriting it to be templated and acting on iterators instead of pointer + size is left as an exercise for the diligent reader.

Full code samples for this post are available at https://github.com/eliben/code-for-blog/tree/main/2016/std-async

[2] Alternatively, launch_split_workers_with_std_thread could return a vector of thread/result pairs. However, multiple return values in C++ are messy no matter how you go at them, so it wouldn't result in much cleaner code. If you want to say "let's put them together in a class", then you're getting close to implementing std::future yourself :-)

[3] To be completely fair, there's another problem with std::async that was the main driver for the call to deprecate it - the "waiting destructor" problem with the futures returned by std::async. There are many discussions online about this issue. A couple I recommend are this one by Scott Meyers and this SG1 paper by Nicolai Josuttis.

The gist of the issue is that a std::future returned by std::async will block in its destructor until the launched thread finishes. While this behavior is important in order to ensure we don't have a runaway thread that accesses deallocated data, it also has its problems since some code may not like being blocked unexpectedly. And recall that a destructor is also called when an exception happens - another complication. In addition to the links above, also read this other article by Meyers to get a clearer understanding of the issue.
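The effect is easy to observe with a small timing sketch. The function name fire_and_forget_elapsed_ms and the 200 ms sleep are arbitrary choices for this illustration. The code looks like "fire and forget", yet the enclosing scope cannot be left until the task is done.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Returns how long it took to get past a block that launches a task with
// std::async and never touches the returned future, in milliseconds.
long long fire_and_forget_elapsed_ms() {
  auto start = std::chrono::steady_clock::now();
  {
    std::future<void> fut = std::async(std::launch::async, [] {
      std::this_thread::sleep_for(std::chrono::milliseconds(200));
    });
    // No get() or wait() here. When fut is destroyed at the end of this
    // block, its destructor blocks until the task has completed.
  }
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
      .count();
}
```

On a typical run the measured time is roughly 200 ms or more, rather than close to zero as a "fire and forget" reading of the code might suggest.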

While the C++ standards committee came dangerously close to deprecating std::async for this reason, it seems that it has survived for now, with a proposal to have two different kinds of futures in the standard library, and changing std::async to return a waiting_future type, to mark this wait explicitly. In any case, be wary of this problem.