The perils of unsigned iteration in C/C++

June 11th, 2010 at 6:17 am

C and C++ frequently coax you into using an unsigned type for iteration. Standard functions like strlen and the size method of containers (in C++) return size_t, which is an unsigned type, so to avoid conversion warnings you comply and iterate with a variable of the appropriate type. For example:

size_t len = strlen(some_c_str);
size_t i;
for (i = 0; i < len; ++i) {
  /* Do stuff with each char of some_c_str
  */
}

I’ve long been aware of one painful gotcha of using size_t for iteration – using it for iterating backwards. The following code will fail:

/* Warning: buggy code!
*/
size_t len = strlen(some_c_str);
size_t i;
for (i = len - 1; i >= 0; --i) {
  /* Do stuff with each char of some_c_str, backwards
  */
}

When i reaches 0 it’s still within bounds, so it will be decremented and become a huge positive number (probably 2^((sizeof(size_t)*8) - 1). Congratulations, we have an infinite loop.

Today I ran into another manifestation of this problem. This one is more insidious, because it happens only for some kinds of input. I wrote the following code because the operation had to consider each character in the string and the character after it:

/* Warning: buggy code!
*/
size_t len = strlen(some_c_str);
size_t i;
for (i = 0; i < len - 1; ++i) {
  /* Do stuff with some_c_str[i] and some_c_str[i+1].
  */
}

Can you spot the bug?

When some_c_str is empty, len is 0. Therefore, i is compared with the unsigned version of -1, which is that huge positive number again. What chance does poor i have against such a giant? It will just keep chugging along, well beyond the length of my string.

As I see it, to avoid the problem we can either:

  1. Use an int variable and cast the return value of strlen to int. This feels a bit dirty, especially in C++ where you’d have to use static_cast<int>.
  2. Just keep using unsigned types for iteration, but be extra careful and use various hacks to avoid the problematic corner cases.

None of these options is ideal, so if you have a better idea, let me know.

Edit 12.06.2010: Thanks everyone for the excellent comments! It’s obvious creative ways exist to overcome this problem for unsigned types. Still, it remains a gotcha even seasoned programmers stumble upon from time to time. It’s not surprising that many C/C++ style guides recommend keeping unsigned types for bitfields only, using plain ints for everything else.

Related posts:

  1. c/c++ annoyance – unsigned iteration
  2. complying with -Wall -pedantic -ansi
  3. Faster XML iteration with ElementTree
  4. a cool algorithm for counting ones in a bitstring
  5. Initialization of structures and arrays in C++

32 Responses to “The perils of unsigned iteration in C/C++”

  1. andNo Gravatar Says:

    i use

    [const] char *end = some_c_str + strlen(some_c_str);
    for ([const] char *ptr = some_c_str; ptr != end; ptr++) {
      /* Do stuff with each char of some_c_str
          using *ptr
      */
    }
  2. Ben BassNo Gravatar Says:

    It really only makes sense for equal / not equal comparisons with zero for unsigned variables, which gives your second example a bad smell – perhaps compilers could include this as a warning…?

    In general of course this is what C++’s STL iterators and algorithms are for… but beside that (and for C code), I’d probably tend to use pointer equality comparison rather than indexing into an array, and avoid the index altogether (this is effectively what the relevant STL algorithms do anyway);

    char* ptr;
    for (ptr = some_c_str; ptr != some_c_str+len; ptr++) {
    /* do stuff with *ptr */
    }

    maybe this is a bit of a hack, but I think often introducing an index variable at all causes problems (personally I hate C arrays, but probably more people hate pointers…)

  3. Thomas GuestNo Gravatar Says:

    The usual form for iterating backwards would be:

    size_t const len = strlen(some_c_str);
    size_t i = len;
    while (i-- != 0) {
      /* Do stuff with each char of some_c_str, backwards */
    }

    In the second case, when you want to process some_c_str[i] and some_c_str[i+1], you need to check that makes sense before entering the loop, so check the length of the string first.

  4. Mr.TNo Gravatar Says:

    size_t len = strlen(some_c_str);
    // warning: correct code follows
    for (size_t i = len; i–;) {
    /* Do stuff with each char of some_c_str, backwards
    */
    }
    // i not longer accessible outside the for loop ;)

  5. AntoineNo Gravatar Says:

    The signed version of size_t is ssize_t, not int. It can make a big difference (literally) on 64-bit machines :)

  6. FrancescoNo Gravatar Says:

    Since you mention C++, in many cases you can use iterators which shield you from these gotchas. I agree, static_cast would be a code smell :-)
    Obviously iterators have their own costs/quirks it may or may not be applicable to your specific case / codebase.

  7. elibenNo Gravatar Says:

    @and and @Ben Bass,

    This won’t help cleanly access a char and the next char. Simple iteration is not a problem, especially over a string where I could simply use:

    char* ptr = some_c_str;
    while (*ptr) {
       // do something with *ptr
       ptr++;
    }

    @Thomas & @Mr. T,

    Thanks, an interesting approach. I must admit I haven’t seen a lot of real code that uses this idiom, and on first sight it raises an eyebrow as your subconscience battles with the fact that it indeed never makes you point outside the string.

    @Francesco,

    Yes, using a reverse iterator with rbegin() and rend() would help. But not for the second problem :-)

  8. tbpNo Gravatar Says:
    for (size_t i=std::strlen(some_c_str); i; i--) {
    	// dereference i-1
    }

    If you need to abstract that into iterators because it’s too sharp a tool, maybe it’s time for you to reconsider your choice of language.
    The only peril of unsigned loop counters is, possibly, making impossible for the compiler to prove their boundedness.

  9. JohnNo Gravatar Says:

    For the first, why not take into account that len?

    size_t len = strlen(some_c_str);
    size_t i;
    for (i = len - 1; i < len; --i) {
      /* Do stuff with each char of some_c_str, backwards
      */
    }

    For the second I think you’ll just have to make sure len isn’t zero.

    size_t len = strlen(some_c_str);
    size_t i;
    for (i = 0; len > 0 && i < len - 1; ++i) {
      /* Do stuff with some_c_str[i] and some_c_str[i+1].
      */
    }
  10. JohnNo Gravatar Says:

    That should read:

    For the first, why not take into account that less-than zero is the same as greater-than len?

  11. Matt EdlefsenNo Gravatar Says:

    Your loop invariant is that both i and i+1 must be valid indexes into some_c_str. It doesn’t make sense to talk about negative sizes so just work in positives.

    for (i = 0; (i  + 1) < len; ++i) {
         // do stuff
    }
  12. Brian BloniarzNo Gravatar Says:

    One cool way to make these bugs less subtle IMHO is to use an integer class that won’t silently overflow.

    I’ve used the SafeInt library from microsoft before, it works great: http://blogs.msdn.com/b/david_leblanc/archive/2008/09/30/safeint-3-on-codeplex.aspx

    SafeInt<size_t> is just like a size_t, except that it
    (a) throws on overflow
    (b) disallows implicit conversions to other integer types
    It also works fine on GCC.

  13. Laurie CheersNo Gravatar Says:

    In this case the simplest fix is:

    size_t len = strlen(some_c_str);
    size_t i;
    for (i = 0; i+1 < len; ++i) {
    /* Do stuff with some_c_str[i] and some_c_str[i+1].
    */
    }

  14. Duncan BeeversNo Gravatar Says:

    JavaScript doesn’t have an unsigned type, so the initial problem affect that dialect of ECMAScript, though I have been bitten by this in ActionScript, which does support unsigned types.

    Although most ECMAScript implementations offer forEach-equivalents, the while-loop, length-down-to-zero pattern is becoming more popular in the JavaScript community as that construct performs slightly faster in most implementations.

    var i = collection.length;
    while(i--) {
      doSomething(collection[i]);
    }

    The main caveat here, which is not just limited to this approach, is that JavaScript only has function-local, and not block-local scope.

  15. PaulNo Gravatar Says:

    Couldn’t you replace i < len-1 by i+1 < len?

  16. anttiNo Gravatar Says:
    for(char const* b = str, * e = str + strlen(str); b != e && (b + 1) != e; ++b) {
      f(b[0], b[1]);
    }

    equivalently

    for(char const* b = str, * b1 = str + 1, * e = str + strlen(str); b != e && b1 != e; ++b) {
      f(*b, *b1);
    }
  17. anttiNo Gravatar Says:

    Oops, forgot ++b1 from the second one :)

  18. SegherNo Gravatar Says:

    You write an unsigned down-count loop like:

    for (i = len – 1; i < len; i–)

    and your second example should use:

    for (i = 0; i+1 < len; i++)

    In unsigned, "a < b" is not equivalent to "a-c < b-c".

  19. SegherNo Gravatar Says:

    You write an unsigned down-count loop like:

    for (i = len - 1; i < len; i--)

    and your second example should use:

    for (i = 0; i+1 < len; i++)

    The actual condition you need is “i<len && i+1 < len", but that simplifies to just
    "i+1 < len". In unsigned arithmetic, "a < b" is not equivalent to "a-c < b-c", so
    this is not the same condition as "i < len-1".

  20. R Samuel KlatchkoNo Gravatar Says:

    Here’s a clean way to write that code without changing to signed or adding lots of extra checks:

    for (i = 0; i + 1 < len; ++i) {
  21. mehNo Gravatar Says:

    size_t len = strlen(some_c_str);
    size_t i;
    for (i = 0; i < len; i++) {
    /* Do stuff with some_c_str[i] and some_c_str[i+1].
    */
    }

  22. Ashutosh MehraNo Gravatar Says:

    One way to “solve” the problem might be to do use “i + 1″ instead of “len – 1″ in the comparison:

    size_t len = strlen(some_c_str);
    size_t i;
    for (i = 0; i + 1 < len; ++i) {
      /* Do stuff with some_c_str[i] and some_c_str[i+1]. */
    }

    This may cause extra move/increment instructions, unless they can be optimized away.

  23. Brian HurtNo Gravatar Says:

    Don’t use int’s as loop iterator variables! This introduces a rather bad 32-bit limitation in your code, were it ever to run on 64-bit machines. On 64-bit machines, it is quite likely already you will encounter arrays with more than 2 billion elements- and on most 64-bit C/C++ compilers, ints are still 32 bits. Java is hitting this problem hard (see all the blog posts on how hard it is to write a correct binary search- almost all the problems they talk about are due to very large array indicies that overflow 32-bit arithmetic). It’s only when counting down are unsigned values a problem.

  24. Joseph GarvinNo Gravatar Says:

    If you used a signed loop variable instead, would the bugs be any better? Looping 2 billion times or not looping at all still leaves a bug. At least if it’s looping you have plenty of time to break into the debugger and see where it’s stuck.

  25. mosNo Gravatar Says:

    I also prefer using the beginning and end of the string as my iterator bounds.

    const char *e = some_str[strlen(some_str)];
    for(char *c = some_str[0]; c != e; ++c)
    {
       ...
    }
  26. Roger PateNo Gravatar Says:

    @eliben: When using iterators instead of indexes (as ‘and’ and Ben Bass comment), you just use p[1] instead of s[i+1] to access the next element.

  27. DanielNo Gravatar Says:

    I write your last example

    for (i = 0; i+1 < len; ++i) ...
    when i is unsigned. I don’t like to use the postfix operators.

  28. elibenNo Gravatar Says:

    @Roger Pate,

    But then you can’t iterate until end itself, but must iterate until end - 1.

    @Matt, Laurie and others,

    Using i + 1 < len is an interesting approach – I didn’t think about it at all. You must admit that it’s not an idiomatic way to loop, however.

  29. Sean HsienNo Gravatar Says:

    Rather than using i + 1 < len, why not sanitize len before the loop, which saves you an increment each loop as well? For readability, you can store sanitized value in another variable if you like.

    size_t len = strlen(some_c_str);
    if (len > 0)
        --len;
    size_t i;
    for (i = 0; i < len; ++i) {
        /* Do stuff with some_c_str[i] and some_c_str[i+1]. */
    }

    Alternatively, you can loop by referring to the 2nd value of the pair:

    size_t len = strlen(some_c_str);
    size_t i;
    for (i = 1; i < len; ++i) {
        /** note the difference here... **/
        /* Do stuff with some_c_str[i-1] and some_c_str[i]. */
    }
  30. Sean HsienNo Gravatar Says:

    Sorry my first example won’t work, but the 2nd should :)

  31. MarcoNo Gravatar Says:

    Just error out of the function early before the loop:

    if (len == 0) return;

  32. Marc van LeeuwenNo Gravatar Says:

    I almost always loop using unsigned variables if that is appropriate. For decreasing loops over the interval [a,b) I use the following idiom, which curiously I have never seen used by others:
    for (unsigned i=b; i-->a; )
    { // stuff using i here
    }

    This works well for any possible a,b, including a=0. If a<=b, then it is equivalent to a version with i-- != a that I saw in the comments above, but i --> a (I often use spacing like that) just look so cute.

Leave a Reply

To post code with preserved formatting, enclose it in `backticks` (even multiple lines)