Depthwise separable convolutions for machine learning

Convolutions are an important tool in modern deep neural networks (DNNs). This post is going to discuss some common types of convolutions, specifically regular and depthwise separable convolutions. My focus will be on the implementation of these operations, showing from-scratch Numpy-based code to compute them and diagrams that explain how things work.

Note that my main goal here is to explain how depthwise separable convolutions differ from regular ones; if you're completely new to convolutions I suggest reading some more introductory resources first.

The code here is compatible with TensorFlow's definition of convolutions in the tf.nn module. After reading this post, the documentation of TensorFlow's convolution ops should be easy to decipher.

Basic 2D convolution

The basic idea behind a 2D convolution is sliding a small window (usually called a "filter") over a larger 2D array, and performing a dot product between the filter elements and the corresponding input array elements at every position.

Here's a diagram demonstrating the application of a 3x3 convolution filter to a 6x6 array, in 3 different positions. W is the filter, and the yellow-ish array on the right is the result; the red square shows which element in the result array is being computed.

Single-channel 2D convolution

The topmost diagram shows the important concept of padding: what should we do when the window goes "out of bounds" on the input array? There are several options, with the following two being the most common in DNNs:

  • Valid padding: in which only valid, in-bounds windows are considered. This also makes the output smaller than the input, because border elements can't be in the center of a filter (unless the filter is 1x1).
  • Same padding: in which we assume there's some constant value outside the bounds of the input (usually 0) and the filter is applied to every element. In this case the output array has the same size as the input array. The diagrams above depict same padding, which I'll keep using throughout the post.
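To make the size bookkeeping concrete, here's a tiny sketch (mine, not from the diagrams) of the output sizes the two padding options produce:

```python
# Output-size bookkeeping for the two padding options, for a square
# n x n input and a square f x f filter (odd f).
def output_size(n, f, padding):
    if padding == 'SAME':
        return n          # output matches input
    elif padding == 'VALID':
        return n - f + 1  # border elements can't be filter centers
    raise ValueError('unknown padding: %s' % padding)

print(output_size(6, 3, 'SAME'))   # 6
print(output_size(6, 3, 'VALID'))  # 4
```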

There are other options for the basic 2D convolution case. For example, the filter can be moving over the input in jumps of more than 1, thus not centering on all elements. This is called stride, and in this post I'm always using stride of 1. Convolutions can also be dilated (or atrous), wherein the filter is expanded with gaps between every element. In this post I'm not going to discuss dilated convolutions and other options - there are plenty of resources on these topics online.

Implementing the 2D convolution

Here is a full Python implementation of the simple 2D convolution. It's called "single channel" to distinguish it from the more general case in which the input has more than two dimensions; we'll get to that shortly.

This implementation is fully self-contained, and only needs Numpy to work. All the loops are fully explicit - I specifically avoided vectorizing them for efficiency, in order to maintain clarity:

import numpy as np

def conv2d_single_channel(input, w):
    """Two-dimensional convolution of a single channel.

    Uses SAME padding with 0s, a stride of 1 and no dilation.

    input: input array with shape (height, width)
    w: filter array with shape (fd, fd) with odd fd.

    Returns a result with the same shape as input.
    """
    assert w.shape[0] == w.shape[1] and w.shape[0] % 2 == 1

    # SAME padding with zeros: creating a new padded array to simplify index
    # calculations and to avoid checking boundary conditions in the inner loop.
    # padded_input is like input, but padded on all sides with
    # half-the-filter-width of zeros.
    padded_input = np.pad(input,
                          pad_width=w.shape[0] // 2,
                          mode='constant',
                          constant_values=0)

    output = np.zeros_like(input)
    for i in range(output.shape[0]):
        for j in range(output.shape[1]):
            # This inner double loop computes every output element, by
            # multiplying the corresponding window into the input with the
            # filter.
            for fi in range(w.shape[0]):
                for fj in range(w.shape[1]):
                    output[i, j] += padded_input[i + fi, j + fj] * w[fi, fj]
    return output
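As a quick sanity check, here's a compact restatement of the same function (the two filter loops replaced by a vectorized window sum, under a hypothetical name) together with the identity-filter property: a filter that's 1 at the center and 0 elsewhere should return the input unchanged:

```python
import numpy as np

def conv2d_single_channel_compact(input, w):
    # Same semantics as conv2d_single_channel in the post, with the two
    # innermost loops replaced by a vectorized window sum.
    pad = w.shape[0] // 2
    padded = np.pad(input, pad_width=pad, mode='constant')
    output = np.zeros_like(input)
    for i in range(output.shape[0]):
        for j in range(output.shape[1]):
            output[i, j] = np.sum(
                padded[i:i + w.shape[0], j:j + w.shape[1]] * w)
    return output

# A filter that's 1 at the center and 0 elsewhere acts as identity.
identity = np.zeros((3, 3))
identity[1, 1] = 1
x = np.arange(36.0).reshape(6, 6)
assert np.allclose(conv2d_single_channel_compact(x, identity), x)
```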

Convolutions in 3 and 4 dimensions

The convolution computed above works in two dimensions; yet, most convolutions used in DNNs are 4-dimensional. For example, TensorFlow's tf.nn.conv2d op takes a 4D input tensor and a 4D filter tensor. How come?

The two additional dimensions in the input tensor are channel and batch. A canonical example of channels is color images in RGB format. Each pixel has a value for red, green and blue - three channels overall. So instead of seeing it as a matrix of triples, we can see it as a 3D tensor where one dimension is height, another width and another channel (also called the depth dimension).

Batch is somewhat different. ML training - with stochastic gradient descent - is often done in batches for performance; we train the model not on a single sample at a time, but a "batch" of samples, usually some power of two. Performing all the operations in tandem on a batch of data makes it easier to leverage the SIMD capabilities of modern processors. So it doesn't have any mathematical significance here - it can be seen as an outer loop over all operations, performing them for a set of inputs and producing a corresponding set of outputs.

For filters, the 4 dimensions are height, width, input channel and output channel. Input channel is the same as the input tensor's; output channel collects multiple filters, each of which can be different.

This can be slightly difficult to grasp from text, so here's a diagram:

Multi-channel 2D convolution

In the diagram and the implementation I'm going to ignore the batch dimension, since it's not really mathematically interesting. So the input image has three dimensions - in this diagram height and width are 8 and depth is 3. The filter is 3x3 with depth 3. In each step, the filter is slid over the input in two dimensions, and all of its elements are multiplied with the corresponding elements in the input. That's 3x3x3=27 multiplications added into the output element.

Note that this is different from a 3D convolution, where a filter is moved across the input in all 3 dimensions; true 3D convolutions are not widely used in DNNs at this time.

So, to reiterate, to compute the multi-channel convolution as shown in the diagram above, we compute each of the 64 output elements as a dot-product of the filter with the relevant parts of the input tensor. This produces a single output channel. To produce additional output channels, we perform the convolution with additional filters. So if our filter has dimensions (3, 3, 3, 4) this means 4 different 3x3x3 filters. The output will thus have dimensions 8x8 for the spatials and 4 for depth.

Here's the Numpy implementation of this algorithm:

def conv2d_multi_channel(input, w):
    """Two-dimensional convolution with multiple channels.

    Uses SAME padding with 0s, a stride of 1 and no dilation.

    input: input array with shape (height, width, in_depth)
    w: filter array with shape (fd, fd, in_depth, out_depth) with odd fd.
       in_depth is the number of input channels, and has to be the same as
       input's in_depth; out_depth is the number of output channels.

    Returns a result with shape (height, width, out_depth).
    """
    assert w.shape[0] == w.shape[1] and w.shape[0] % 2 == 1

    padw = w.shape[0] // 2
    padded_input = np.pad(input,
                          pad_width=((padw, padw), (padw, padw), (0, 0)),
                          mode='constant',
                          constant_values=0)
    height, width, in_depth = input.shape
    assert in_depth == w.shape[2]
    out_depth = w.shape[3]
    output = np.zeros((height, width, out_depth))

    for out_c in range(out_depth):
        # For each output channel, perform 2d convolution summed across all
        # input channels.
        for i in range(height):
            for j in range(width):
                # Now the inner loop also works across all input channels.
                for c in range(in_depth):
                    for fi in range(w.shape[0]):
                        for fj in range(w.shape[1]):
                            w_element = w[fi, fj, c, out_c]
                            output[i, j, out_c] += (
                                padded_input[i + fi, j + fj, c] * w_element)
    return output
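For the curious, the inner loops can be collapsed with np.einsum; this is a sketch of an equivalent vectorized formulation (not part of the implementation above, shown just to demonstrate the shapes involved):

```python
import numpy as np

def conv2d_multi_channel_einsum(input, w):
    # input: (height, width, in_depth); w: (fd, fd, in_depth, out_depth).
    fd = w.shape[0]
    padw = fd // 2
    padded = np.pad(input, ((padw, padw), (padw, padw), (0, 0)),
                    mode='constant')
    height, width, _ = input.shape
    output = np.zeros((height, width, w.shape[3]))
    for i in range(height):
        for j in range(width):
            window = padded[i:i + fd, j:j + fd, :]
            # Contract filter height, width and input channel in one shot.
            output[i, j, :] = np.einsum('ijc,ijco->o', window, w)
    return output

x = np.random.rand(8, 8, 3)
w = np.random.rand(3, 3, 3, 4)
print(conv2d_multi_channel_einsum(x, w).shape)  # (8, 8, 4)
```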

There's an interesting point to note here w.r.t. TensorFlow's tf.nn.conv2d op. If you read its semantics you'll see a discussion of layout or data format, which is NHWC by default. NHWC simply means that the order of dimensions in a 4D tensor is:

  • N: batch
  • H: height (spatial dimension)
  • W: width (spatial dimension)
  • C: channel (depth)

NHWC is the default layout for TensorFlow; another commonly used layout is NCHW, because it's the format preferred by NVIDIA's DNN libraries. The code samples here follow the default.
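Converting between the two layouts is just a transpose; for example, with a hypothetical batch of 128x128 RGB images:

```python
import numpy as np

# A hypothetical batch of 32 RGB images of 128x128 pixels, in NHWC layout.
nhwc = np.zeros((32, 128, 128, 3))

# NHWC -> NCHW is a transpose that moves the channel dimension forward.
nchw = np.transpose(nhwc, (0, 3, 1, 2))
print(nchw.shape)  # (32, 3, 128, 128)
```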

Depthwise convolution

Depthwise convolutions are a variation on the operation discussed so far. In the regular 2D convolution performed over multiple input channels, the filter is as deep as the input and lets us freely mix channels to generate each element in the output. Depthwise convolutions don't do that - each channel is kept separate - hence the name depthwise. Here's a diagram to help explain how that works:

Depthwise 2D convolution

There are three conceptual stages here:

  1. Split the input into channels, and split the filter into channels (the number of channels between input and filter must match).
  2. For each of the channels, convolve the input with the corresponding filter, producing an output tensor (2D).
  3. Stack the output tensors back together.

Here's the code implementing it:

def depthwise_conv2d(input, w):
    """Two-dimensional depthwise convolution.

    Uses SAME padding with 0s, a stride of 1 and no dilation. A single output
    channel is used per input channel (channel_multiplier=1).

    input: input array with shape (height, width, in_depth)
    w: filter array with shape (fd, fd, in_depth)

    Returns a result with shape (height, width, in_depth).
    """
    assert w.shape[0] == w.shape[1] and w.shape[0] % 2 == 1

    padw = w.shape[0] // 2
    padded_input = np.pad(input,
                          pad_width=((padw, padw), (padw, padw), (0, 0)),
                          mode='constant',
                          constant_values=0)
    height, width, in_depth = input.shape
    assert in_depth == w.shape[2]
    output = np.zeros((height, width, in_depth))

    for c in range(in_depth):
        # For each input channel separately, apply its corresponding filter
        # to the input.
        for i in range(height):
            for j in range(width):
                for fi in range(w.shape[0]):
                    for fj in range(w.shape[1]):
                        w_element = w[fi, fj, c]
                        output[i, j, c] += (
                            padded_input[i + fi, j + fj, c] * w_element)
    return output

In TensorFlow, the corresponding op is tf.nn.depthwise_conv2d; this op has the notion of channel multiplier which lets us compute multiple outputs for each input channel (somewhat like the number of output channels concept in conv2d).
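Here's a sketch (mine, not from the post) of what a channel multiplier could look like on top of our implementation: the filter gains a fourth dimension, and the output depth becomes in_depth * channel_multiplier, matching the output shape documented for tf.nn.depthwise_conv2d. The interleaved output-channel ordering used here is an assumption for illustration:

```python
import numpy as np

def depthwise_conv2d_with_multiplier(input, w):
    """Sketch of depthwise convolution with a channel multiplier.

    input: (height, width, in_depth)
    w: (fd, fd, in_depth, channel_multiplier) with odd fd.

    Returns (height, width, in_depth * channel_multiplier).
    """
    fd, _, in_depth, mult = w.shape
    padw = fd // 2
    padded = np.pad(input, ((padw, padw), (padw, padw), (0, 0)),
                    mode='constant')
    height, width, _ = input.shape
    output = np.zeros((height, width, in_depth * mult))
    for c in range(in_depth):
        for m in range(mult):
            # Each (c, m) pair produces its own output channel.
            for i in range(height):
                for j in range(width):
                    window = padded[i:i + fd, j:j + fd, c]
                    output[i, j, c * mult + m] = np.sum(window * w[:, :, c, m])
    return output

x = np.random.rand(8, 8, 3)
w = np.random.rand(3, 3, 3, 2)
print(depthwise_conv2d_with_multiplier(x, w).shape)  # (8, 8, 6)
```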

Depthwise separable convolution

The depthwise convolution shown above is more commonly used in combination with an additional step to mix in the channels - depthwise separable convolution [1]:

Depthwise separable convolution

After completing the depthwise convolution, an additional step is performed: a 1x1 convolution across channels. This is exactly the same operation as the convolution in 3 dimensions discussed earlier - just with a 1x1 spatial filter. This step can be repeated multiple times for different output channels. The output channels all take the output of the depthwise step and mix it up with different 1x1 convolutions. Here's the implementation:

def separable_conv2d(input, w_depth, w_pointwise):
    """Depthwise separable convolution.

    Performs 2d depthwise convolution with w_depth, and then applies a pointwise
    1x1 convolution with w_pointwise on the result.

    Uses SAME padding with 0s, a stride of 1 and no dilation. A single output
    channel is used per input channel (channel_multiplier=1) in w_depth.

    input: input array with shape (height, width, in_depth)
    w_depth: depthwise filter array with shape (fd, fd, in_depth)
    w_pointwise: pointwise filter array with shape (in_depth, out_depth)

    Returns a result with shape (height, width, out_depth).
    """
    # First run the depthwise convolution. Its result has the same shape as
    # input.
    depthwise_result = depthwise_conv2d(input, w_depth)

    height, width, in_depth = depthwise_result.shape
    assert in_depth == w_pointwise.shape[0]
    out_depth = w_pointwise.shape[1]
    output = np.zeros((height, width, out_depth))

    for out_c in range(out_depth):
        for i in range(height):
            for j in range(width):
                for c in range(in_depth):
                    w_element = w_pointwise[c, out_c]
                    output[i, j, out_c] += depthwise_result[i, j, c] * w_element
    return output
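Note that the pointwise mixing stage is really just a matrix multiplication over the channel dimension; here's a vectorized sketch of that stage alone (not from the post), cross-checked against an einsum formulation:

```python
import numpy as np

# The depthwise result at each (i, j) is an in_depth-vector; the pointwise
# stage maps it to an out_depth-vector - a plain matrix multiplication.
height, width, in_depth, out_depth = 8, 8, 3, 4
depthwise_result = np.random.rand(height, width, in_depth)
w_pointwise = np.random.rand(in_depth, out_depth)

pointwise = depthwise_result.reshape(-1, in_depth) @ w_pointwise
pointwise = pointwise.reshape(height, width, out_depth)

# Cross-check against an einsum formulation of the same contraction.
expected = np.einsum('ijc,co->ijo', depthwise_result, w_pointwise)
assert np.allclose(pointwise, expected)
print(pointwise.shape)  # (8, 8, 4)
```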

In TensorFlow, this op is called tf.nn.separable_conv2d. Similarly to our implementation it takes two different filter parameters: depthwise_filter for the depthwise step and pointwise_filter for the mixing step.

Depthwise separable convolutions have become popular in DNN models recently, for two reasons:

  1. They have fewer parameters than "regular" convolutional layers, and thus are less prone to overfitting.
  2. With fewer parameters, they also require less operations to compute, and thus are cheaper and faster.

Let's examine the difference between the number of parameters first. We'll start with some definitions:

  • S: spatial dimension - width and height, assuming square inputs.
  • F: filter width and height, assuming square filter.
  • inC: number of input channels.
  • outC: number of output channels.

We also assume SAME padding as discussed above, so that the spatial size of the output matches the input.

In a regular convolution there are F*F*inC*outC parameters, because every filter is 3D and there's one such filter per output channel.

In depthwise separable convolutions there are F*F*inC parameters for the depthwise part, and then inC*outC parameters for the mixing part. It should be obvious that for a non-trivial outC, the sum of these two is significantly smaller than F*F*inC*outC.

Now on to computational cost. For a regular convolution, we perform F*F*inC operations at each position of the input (to compute the 2D convolution over 3 dimensions). For the whole input, the number of computations is thus F*F*inC*S*S and taking all the output channels we get F*F*inC*S*S*outC.

For depthwise separable convolutions we need F*F*inC*S*S operations for the depthwise part; then we need S*S*inC*outC operations for the mixing part. Let's use some real numbers to get a feel for the difference:

We'll assume S=128, F=3, inC=3, outC=16. For regular convolution:

  • Parameters: 3*3*3*16 = 432
  • Computation cost: 3*3*3*128*128*16 = ~7e6

For depthwise separable convolution:

  • Parameters: 3*3*3+3*16 = 75
  • Computation cost: 3*3*3*128*128+128*128*3*16 = ~1.2e6
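These numbers are straightforward to reproduce:

```python
# Reproducing the parameter and operation counts from the text.
S, F, inC, outC = 128, 3, 3, 16

params_regular = F * F * inC * outC
cost_regular = F * F * inC * S * S * outC

params_separable = F * F * inC + inC * outC
cost_separable = F * F * inC * S * S + S * S * inC * outC

print(params_regular, cost_regular)      # 432 7077888
print(params_separable, cost_separable)  # 75 1228800
```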

[1]The term separable comes from image processing, where spatially separable convolutions are sometimes used to save on computation resources. A spatial convolution is separable when the 2D convolution filter can be expressed as an outer product of two vectors. This lets us compute some 2D convolutions more cheaply. In the case of DNNs, the spatial filter is not necessarily separable but the channel dimension is separable from the spatial dimensions.

Summary of reading: January - March 2018

  • "Rule and Ruin" by Geoffrey Kabaservice - full title "The Downfall of Moderation and the Destruction of the Republican Party, from Eisenhower to the Tea Party". A decent history of the Republican party, with lots of details in some periods and much fewer in others (in particular skipping multiple years while Democrats are in power is questionable). Gives a good account of the tortuous and convoluted process of radicalization and turn to conservatism that started after WWII.
  • "Pachinko" by Min Jin Lee - a historical novel focusing on the lives of Korean immigrants in Japan in the 20th century, following an extended family for 4 generations from South Korea to various cities in Japan. Good book, though somewhat longer than it needed to be, IMHO.
  • "Atomic Adventures" by James Mahaffey - a set of stories about the crazier aspects of nuclear science, narrated by a physicist who himself played a role in cold fusion research. This is a curious case where a book is completely not what I expected it to be, and at the same time I thoroughly enjoyed reading it.
  • "Pitch Perfect: How to Say It Right the First Time, Every Time" by Bill McGowan - how to speak convincingly and avoid many traps. Although much of the book applies to significantly more "professional" speakers than me (for example executives having to do multiple speeches, folks who get interviewed by the media, discussion panel participators and moderators, etc), there's quite a bit of good advice for "day to day" conversations as well. The book is not perfect (it has the usual repetitiveness and fluff so common in this genre), but it's fairly good and insightful overall.
  • "Quantum Computing since Democritus" by Scott Aaronson - Wow, quite a mismatch of expectations here. I was expecting a light "popular science" read, and this isn't it; it's a recap of the author's graduate-level computer science course for students interested in complexity theory and quantum computing. For good measure it also mixes in some discussions of recent papers on these topics. I would probably have liked the book much more if I allotted ~100 hours to going through it (instead of the 10-15 as usual for a 400-page nonfiction), but as a "lightweight" read it completely missed the mark for me.
  • "Lab Girl" by Hope Jahren - an autobiography by Prof. Jahren, a geobiologist focusing on the study of plants. This is a really, really good book, deeply personal and informative not only on the topic of plants, but also friendship, grit in scientific research, the challenges of women in academia, and more.
  • "Digital Apollo: Human and Machine in Spaceflight" by David A. Mindell - an interesting perspective on human-machine interactions in the early stages of the NASA space program, focusing on the Apollo missions and their precursors. Apollo's computer seems so primitive in retrospect, but the realization this happened >50 years ago is poignant. The programming and testing challenges faced by Apollo SW engineers certainly sound recognizable. It's also interesting to read about the very early concerns of computers replacing humans in certain tasks - concerns that are quite prominent recently due to advances in AI.
  • "The Subtle Art of Not Giving a F*ck" by Mark Manson - as with any self-help book, YMMV. I personally didn't like this one and didn't find it insightful or novel in any particular way. I've actually stopped reading books of this kind a while ago, but seemingly got caught in the hype and stellar reviews with this one.
  • "Development as Freedom" by Amartya Sen - the author has won the Nobel prize for welfare economics, and here he lays out his main tenet of looking at more than per-capita income as a measure of a country's development. There's quite a bit of implied criticism of the US economic policies, and many interesting examples from around the world. There's more stuff in the book, general thoughts on economics outside the main tenet, as well. Overall a pretty interesting book, if somewhat academic - not easy to digest in a single cover-to-cover reading.
  • "Countdown to Zero Day: Stuxnet and the Launch of the World's First Digital Weapon" by Kim Zetter - a pretty good, if somewhat long-winded and repetitive, book documenting the discovery of the Stuxnet virus in 2010. I liked the author's reasonably technical descriptions of the security vulnerabilities the attackers exploited. There's not much actual details to rely upon, of course, due to the secrecy surrounding the attack. Interestingly, Snowden's leaks provided some clues into the background, but not much of it concrete.
  • "Evicted - Poverty and Profit in the American City" by Matthew Desmond - a poignant ethnography of "inner city" Milwaukee residents (most of them black women with young kids) who face severe poverty and frequent evictions. The author somehow managed to blend himself into that environment and follow people over several years. Very well-written book, though definitely not an easy read. Much food for thought here.
  • "From Mathematics to Generic Programming" by A.A. Stepanov and D.E. Rose - some introductory math concepts (mostly geometry and abstract algebra) with musings on programming in the style of "Programming Pearls". I think I understand what the authors were going for with this book, but IMHO they didn't hit the target - it's just not sufficiently cohesive. Moreover, the choice of C++ is puzzling as it requires "concepts" to express the ideas the authors are talking about; yet C++ doesn't support concepts yet (maybe in C++20) so many of the code samples in the book won't even compile.
  • "Madame Curie - A Biography" by Eve Curie - a fairly good biography of Marie Curie. It's a bit starry-eyed, which is not surprising given that the author is Marie's younger daughter, but good writing.
  • "Song of Solomon" by Toni Morrison - a novel following the life of a young man in an African-American community in Michigan in the middle of the 20th century, and his quest to find his family's roots. Can't say I liked this book, too weird for me. Definitely some good writing there, as well as interesting historical context, but it's just not my style.

Re-reads:
  • "Man's Search for Meaning" by Viktor Frankl
  • "The Mythical Man-Month" by Frederick P. Brooks Jr. - added some modern impressions to my super-old review from 2003.
  • "Coders at Work" by Peter Seibel

The Confusion Matrix in statistical tests

This winter was one of the worst flu seasons in recent years, so I found myself curious to learn more about the diagnostic flu tests available to doctors in addition to the usual "looks like bad cold but no signs of bacteria" strategy. There's a wide array of RIDTs (Rapid Influenza Diagnostic Tests) available to doctors today [1], and reading through the literature quickly gets you to decipher statements like:

Overall, RIDTs had a modest sensitivity of 62.3% and a high specificity of 98.2%, corresponding to a positive likelihood ratio of 34.5 and a negative likelihood ratio of 0.38. For the clinician, this means that although false-negatives are frequent (occurring in nearly four out of ten negative RIDTs), a positive test is unlikely to be a false-positive result. A diagnosis of influenza can thus confidently be made in the presence of a positive RIDT. However, a negative RIDT result is unreliable and should be confirmed by traditional diagnostic tests if the result is likely to affect patient management.

While I'd heard about statistical test quality measures like sensitivity before, there are too many terms here to remember for someone not dealing with these things routinely; this post is my attempt at documenting this understanding for future use.

A table of test outcomes

Let's say there is a condition with a binary outcome ("yes" vs. "no", 1 vs 0, or whatever you want to call it). Suppose we conduct a test that is designed to detect this condition; the test also has a binary outcome. The totality of outcomes can thus be represented with a 2-by-2 table, which is also called the Confusion Matrix.

Suppose 10000 patients get tested for flu; out of them, 9000 are actually healthy and 1000 are actually sick. For the sick people, a test was positive for 620 and negative for 380. For the healthy people, the same test was positive for 180 and negative for 8820. Let's summarize these results in a table:

Confusion matrix with numbers only

Now comes our first batch of definitions.

  • True Positive (TP): positive test result matches reality - person is actually sick and tested positive.
  • False Positive (FP): positive test result doesn't match reality - test is positive but the person is not actually sick.
  • True Negative (TN): negative test result matches reality - person is not sick and tested negative.
  • False Negative (FN): negative test result doesn't match reality - test is negative but the person is actually sick.

Folks get confused with these often, so here's a useful heuristic: positive vs. negative reflects the test outcome; true vs. false reflects whether the test got it right or got it wrong.

Since the rest of the definitions build upon these, here's the confusion matrix again now with them embedded:

Confusion matrix with TP, FP, TN, FN marked

Definition soup

Armed with these and N for the total population (10000 in our case), we are now ready to tackle the multitude of definitions statisticians have produced over the years to describe the performance of tests:

  • Prevalence: how common is the actual disease in the population
    • (FN+TP)/N
    • In the example: (380+620)/10000=0.1
  • Accuracy: how often is the test correct
    • (TP+TN)/N
    • In the example: (620+8820)/10000=0.944
  • Misclassification rate: how often the test is wrong
    • 1 - Accuracy = (FP+FN)/N
    • In the example: (180+380)/10000=0.056
  • Sensitivity or True Positive Rate (TPR) or Recall: when the patient is sick, how often does the test actually predict it correctly
    • TP/(TP+FN)
    • In the example: 620/(620+380)=0.62
  • Specificity or True Negative Rate (TNR): when the patient is not sick, how often does the test actually predict it correctly
    • TN/(TN+FP)
    • In the example: 8820/(8820+180)=0.98
  • False Positive Rate (FPR): probability of false alarm
    • 1 - Specificity = FP/(TN+FP)
    • In the example: 180/(8820+180)=0.02
  • False Negative Rate (FNR): miss rate, probability of missing a sickness with a test
    • 1 - Sensitivity = FN/(TP+FN)
    • In the example: 380/(620+380)=0.38
  • Precision or Positive Predictive Value (PPV): when the prediction is positive, how often is it correct
    • TP/(TP+FP)
    • In the example: 620/(620+180)=0.775
  • Negative Predictive Value (NPV): when the prediction is negative, how often is it correct
    • TN/(TN+FN)
    • In the example: 8820/(8820+380)=0.959
  • Positive Likelihood Ratio: odds of a positive prediction given that the person is sick (used with odds formulations of probability)
    • TPR/FPR
    • In the example: 0.62/0.02=31
  • Negative Likelihood Ratio: odds of a positive prediction given that the person is not sick
    • FNR/TNR
    • In the example: 0.38/0.98=0.388

The wikipedia page has even more.
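All of these definitions are one-liners given the four confusion matrix entries; here's the flu example worked out in code, as a sketch for self-checking the numbers above:

```python
# The flu example worked out from the four confusion matrix entries.
TP, FN = 620, 380
FP, TN = 180, 8820
N = TP + FN + FP + TN

prevalence = (TP + FN) / N
accuracy = (TP + TN) / N
sensitivity = TP / (TP + FN)                  # TPR, recall
specificity = TN / (TN + FP)                  # TNR
precision = TP / (TP + FP)                    # PPV
npv = TN / (TN + FN)
positive_lr = sensitivity / (1 - specificity)
negative_lr = (1 - sensitivity) / specificity

print(prevalence, accuracy)                   # 0.1 0.944
print(sensitivity, specificity)               # 0.62 0.98
print(round(positive_lr), round(negative_lr, 3))  # 31 0.388
```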

Deciphering our example

Now back to the flu test example this post began with. RIDTs are said to have sensitivity of 62.3%; this is just a clever way of saying that for a person with flu, the test will be positive 62.3% of the time. For people who do not have the flu, the test is more accurate since its specificity is 98.2% - only 1.8% of healthy people will be flagged positive.

The positive likelihood ratio is said to be 34.5; let's see how it was computed:

\[\frac{TPR}{FPR}=\frac{0.623}{1-0.982}=\frac{0.623}{0.018}\approx 34.6\]
This is to say - if the person is sick, odds are 35-to-1 that the test will be positive.

And the negative likelihood ratio is said to be 0.38:

\[\frac{FNR}{TNR}=\frac{1-0.623}{0.982}=\frac{0.377}{0.982}\approx 0.38\]
This is to say - if the person is not sick, odds are 1-to-3 that the test will be positive.

In other words, these flu tests are pretty good when a person is actually sick, but not great when the person is not sick. Which is exactly what the quoted paragraph at the top of the post ends up saying.

Back to Bayes

An astute reader will notice that the previous sections talk about the probability of test outcomes given sickness, when we're usually interested in the opposite - given a positive test, how likely is it that the person is actually sick.

My previous post on the Bayes theorem covered this issue in depth [2]. Let's recap, using the actual numbers from our example. The events are:

  • T: test is positive
  • T^C: test is negative
  • F: person actually sick with flu
  • F^C: person doesn't have flu

Sensitivity of 0.623 means P(T|F)=0.623; similarly, specificity is P(T^C|F^C)=0.982. We're interested in finding P(F|T), and we can use the Bayes theorem for that:

\[P(F|T)=\frac{P(T|F)P(F)}{P(T)}\]
Recall that P(F) is the prevalence of flu in the general population; for the sake of this example let's assume it's 0.1; we'll then compute P(T) by using the law of total probability as follows:

\[P(T)=P(T|F)P(F)+P(T|F^C)P(F^C)\]
Obviously, P(T|F^C)=1-P(T^C|F^C)=0.018, so:

\[P(T)=0.623\ast0.1 + 0.018\ast0.9=0.0785\]

And then:

\[P(F|T)=\frac{0.623\ast 0.1}{0.0785}=0.79\]

So the probability of having flu given a positive test and a 10% flu prevalence is 79%. The prevalence strongly affects the outcome! Let's plot P(F|T) as a function of P(F) for some reasonable range of values:

P(F|T) as function of prevalence
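A sketch of the computation behind this plot, using the sensitivity and specificity quoted above:

```python
# Computing P(F|T) via Bayes for a range of prevalence values, using the
# sensitivity and specificity quoted in the text.
sensitivity = 0.623   # P(T|F)
specificity = 0.982   # P(T^C|F^C)

def p_flu_given_positive(prevalence):
    # P(T) from the law of total probability, then Bayes.
    p_t = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    return sensitivity * prevalence / p_t

print(round(p_flu_given_positive(0.1), 2))   # 0.79
for prev in (0.01, 0.05, 0.2, 0.3):
    print(prev, round(p_flu_given_positive(prev), 3))
```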

Note how low the value of the test becomes with low disease prevalence - we've also observed this phenomenon in the previous post; there's a "tug of war" between the prevalence and the test's sensitivity and specificity. In fact, the official CDC guidelines page for interpreting RIDT results discusses this:

When influenza prevalence is relatively low, the positive predictive value (PPV) is low and false-positive test results are more likely. By contrast, when influenza prevalence is low, the negative predictive value (NPV) is high, and negative results are more likely to be true.

And then goes on to present a handy table for estimating PPV based on prevalence and specificity.

Naturally, the rapid test is not the only tool in the doctor's toolbox. Flu has other symptoms, and by observing them on the patient the doctor can increase their confidence in the diagnosis. For example, if the probability P(F|T) given 10% prevalence is 0.79 (as computed above), the doctor may be significantly less sure of the results if flu symptoms like cough and fever are not present. The CDC discusses this in more detail with an algorithm for interpreting flu results.

[1]Slower tests like full viral cultures are also available, and they are very accurate. The problem is that these tests take a long time to complete - days - so they're usually not very useful in treating the disease. Anti-viral medication is only useful in the first 48 hours after disease onset. RIDTs provide results within hours, or even minutes.
[2]In that post we didn't distinguish between sensitivity and specificity, but assumed they're equal at 90%. It's much more common for these measures to be different, but it doesn't actually complicate the computations.