How I stopped worrying and switched to C++ for my Bob Scheme VM

Part of Bob Scheme is "BareVM" - a C++ implementation of the Bob virtual machine. After completing the Bob implementation in Python (including a VM), it was important for me to also re-implement the VM part in a lower-level language such as C or C++, for a number of reasons:

"Real" VMs are implemented in low-level languages, usually C or C++, and I wanted to experience the challenges involved in such an implementation.
The serialization format I created for Bob's bytecode (heavily influenced by Python's marshal format) was meant to be truly cross-tool, and what better way to prove it than to write the VM in a different language from the compiler and pass the bytecode between them in serialized form?
An important part of the implementation of a language like Scheme is memory management, which usually means garbage collection. Implementing it in Python was cheating, because Python is garbage collected itself, so I didn't really have to do anything special. Just discard the implementation entities representing Scheme objects, and the Python GC will take care of them. The same isn't true for a C/C++ implementation, where a garbage collector has to be coded explicitly.

Having decided to do this, the next logical step was to decide which low-level language to use. The choice naturally came down to C or C++. My initial leaning was toward C, because unlike C++, I actually like C. Besides, I planned to model it after the VM running Python itself. And so I started writing it in C.

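But pretty quickly it dawned on me that I might have taken the wrong direction. Greenspun's tenth rule says that any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp. I once heard about a variation of it that replaces Common Lisp with C++. And this was exactly what was happening in my C BareVM implementation.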

Leave aside the data structures. Yes, I had to implement a dynamic string, a hash table and a stack in C just to get started. But that's not too bad. What was too bad is that I found myself imitating a real object-oriented type system in C. Yes, Python has such a system. Yes, there's GObject. Yes, it works, and it's fast. But it's hell to implement, and the nagging thought "just use C++ and be done with it" didn't leave me.
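To give a feel for what "imitating an object system" means in practice, here is a rough sketch of the usual pattern: a hand-written type table with function pointers, manual inheritance links, and a manual instance check. It is written in the C-like subset of C++ to match the other samples here. All the names (ObjType, Obj, BoolObj, is_instance) are made up for illustration and are not taken from the old C BareVM.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

struct Obj;

// Every "class" is a hand-maintained table (hypothetical layout).
struct ObjType {
    const char* name;
    const ObjType* base;  // manual single inheritance
    void (*repr)(const Obj* self, char* buf, std::size_t n);  // manual "virtual method"
};

// Every object starts with a pointer to its type table.
struct Obj {
    const ObjType* type;
};

struct BoolObj {
    Obj head;   // must come first so a BoolObj* can be viewed as an Obj*
    int value;
};

static void bool_repr(const Obj* self, char* buf, std::size_t n)
{
    const BoolObj* b = reinterpret_cast<const BoolObj*>(self);
    std::snprintf(buf, n, "%s", b->value ? "#t" : "#f");
}

static const ObjType object_type = {"object", nullptr, nullptr};
static const ObjType bool_type = {"boolean", &object_type, bool_repr};

// Hand-rolled stand-in for dynamic_cast: walk the chain of base types.
static bool is_instance(const Obj* o, const ObjType* t)
{
    assert(o != nullptr);
    for (const ObjType* cur = o->type; cur != nullptr; cur = cur->base) {
        if (cur == t) {
            return true;
        }
    }
    return false;
}
```

Every new type needs its own table, its own casts and its own discipline about field order, which is precisely the bookkeeping a C++ compiler does for free.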

So, I switched to C++. You can still find a partial BareVM C implementation lying in the Mercurial troves of Bob (under experimental/old_barevm). Once the switch was made, I immediately felt much better. I could throw away all the data structures and just use the STL. I could throw away my half-baked object system and just use... the language itself.

Another aspect is memory management. In C++, I can just have a base class named BobObject (it's actually an abstract class) which implements the operators new and delete, which in turn call the allocator underneath. The allocator gets memory for the object and registers it in an internal list of "live objects", which later serves as the basis for running a mark-and-sweep GC cycle. Some scattered code samples:

class BobObject
{
public:
    BobObject();
    virtual ~BobObject() = 0;
    // [...] skipping code
    // Class-level allocation hooks, inherited by every BobObject subclass
    void* operator new(size_t sz);
    void operator delete(void* p);
    // [...] skipping code
};

void* BobObject::operator new(size_t sz)
{
    return BobAllocator::get().allocate_object(sz);
}

void BobObject::operator delete(void* p)
{
    BobAllocator::get().release_object(p);
}
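The samples above don't show the allocator itself, so here is a minimal sketch of how a "live objects" list can drive a mark-and-sweep cycle. This is an illustration under assumptions, not Bob's actual allocator. The names GcObject, GcAllocator, GcPair and demo_collect are hypothetical. Only the allocate_object/release_object method names mirror the calls shown above, and their bodies here are guesses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical base class playing the role of BobObject.
class GcObject {
public:
    GcObject() : m_marked(false) {}
    virtual ~GcObject() {}
    static void* operator new(std::size_t sz);
    static void operator delete(void* p);

    // Mark this object and everything reachable from it.
    void mark() {
        if (m_marked) return;  // already visited; this also stops cycles
        m_marked = true;
        mark_children();
    }
    bool marked() const { return m_marked; }
    void clear_mark() { m_marked = false; }
protected:
    virtual void mark_children() {}  // leaf objects reference nothing
private:
    bool m_marked;
};

class GcAllocator {
public:
    static GcAllocator& get() { static GcAllocator instance; return instance; }

    void* allocate_object(std::size_t sz) {
        void* p = std::malloc(sz);
        if (!p) throw std::bad_alloc();
        // Assumption: single inheritance, so the GcObject base sits at the
        // start of the block and the raw address can be stored as GcObject*.
        m_live.push_back(static_cast<GcObject*>(p));
        return p;
    }

    void release_object(void* p) {
        // Unregister if still listed (explicit delete), then free the memory.
        auto it = std::find(m_live.begin(), m_live.end(), static_cast<GcObject*>(p));
        if (it != m_live.end()) m_live.erase(it);
        std::free(p);
    }

    // One mark-and-sweep cycle; returns the number of objects freed.
    std::size_t collect(const std::vector<GcObject*>& roots) {
        for (GcObject* r : roots) { assert(r != nullptr); r->mark(); }
        std::vector<GcObject*> survivors, dead;
        for (GcObject* o : m_live) {
            if (o->marked()) { o->clear_mark(); survivors.push_back(o); }
            else dead.push_back(o);
        }
        m_live.swap(survivors);             // sweep: keep only marked objects
        for (GcObject* o : dead) delete o;  // ends up in release_object
        return dead.size();
    }

    std::size_t live_count() const { return m_live.size(); }
private:
    std::vector<GcObject*> m_live;
};

inline void* GcObject::operator new(std::size_t sz) { return GcAllocator::get().allocate_object(sz); }
inline void GcObject::operator delete(void* p) { GcAllocator::get().release_object(p); }

// Hypothetical object with references, so marking has something to follow.
class GcPair : public GcObject {
public:
    GcPair(GcObject* first, GcObject* second) : m_first(first), m_second(second) {}
protected:
    void mark_children() override {
        if (m_first) m_first->mark();
        if (m_second) m_second->mark();
    }
private:
    GcObject* m_first;
    GcObject* m_second;
};

std::size_t demo_collect() {
    GcObject* a = new GcPair(nullptr, nullptr);
    GcObject* b = new GcPair(a, nullptr);  // b keeps a alive
    new GcPair(b, nullptr);                // garbage: nothing refers to it
    return GcAllocator::get().collect({b});
}
```

In this sketch the roots are passed in explicitly. In a real VM they would come from wherever live references are held, such as the value stack and the environments.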

Now comes the good part. I can define some real Scheme objects, for example a boolean:

class BobBoolean : public BobObject
{
public:
    BobBoolean(bool value)
        : m_value(value)
    {}

    ~BobBoolean()
    {}

    bool value() const {return m_value;}
    std::string repr() const;
    bool equals_to(const BobObject& other) const;

private:
    bool m_value;
};

Naturally, a boolean just encapsulates a bool value. Here's one example of it being created:

static BobObject* symbol_p(BuiltinArgs& args)
{
    verify_numargs(args, 1, "symbol?");
    BobSymbol* sym = dynamic_cast<BobSymbol*>(args[0]);
    return new BobBoolean(sym != 0);
}

This is Scheme's symbol? built-in. All it does is check whether its single argument is actually a BobSymbol. It returns a boolean by simply creating a new BobBoolean object on the heap with new. Since BobBoolean doesn't define its own operator new, the compiler looks in its parent, BobObject. BobObject does define operator new, so that one ends up being called, and the object is correctly created and registered by the memory allocator. This new has no corresponding delete: the memory will be freed automatically by a GC cycle once the object is no longer reachable. Sweet, isn't it?
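The lookup rule described above is easy to check in isolation. The following small sketch uses hypothetical names (Tracked, Flag, g_tracked_allocations) and a plain counter in place of a real allocator. It shows that a new-expression on a derived class picks up the base class's operator new, and that writing ::new explicitly bypasses it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

static int g_tracked_allocations = 0;  // hypothetical counter, stands in for an allocator

struct Tracked {
    virtual ~Tracked() {}
    static void* operator new(std::size_t sz) {
        ++g_tracked_allocations;
        void* p = std::malloc(sz);
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void* p) { std::free(p); }
};

// Defines no operator new of its own, just like BobBoolean.
struct Flag : Tracked {
    explicit Flag(bool v) : value(v) {}
    bool value;
};

// Returns how many times Tracked::operator new ran.
int count_tracked_allocations() {
    int before = g_tracked_allocations;

    Tracked* f = new Flag(true);   // class-scope lookup finds Tracked::operator new
    assert(f != nullptr);
    delete f;                      // and Tracked::operator delete

    Flag* g = ::new Flag(false);   // the global form skips the class-level operators
    ::delete g;

    return g_tracked_allocations - before;
}
```

The practical consequence for a design like this one is that objects must be created with a plain new for the registration to happen.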

I'm not saying that these would be particularly hard to implement in C. They wouldn't. I just felt uncomfortable sitting there and reimplementing the built-in facilities of C++ on my own. Getting my head "into" C++ doesn't automatically mean I should drown in a steaming heap of template metaprogramming. I carefully chose the C++ features I needed for this project and just used them. This saved me a lot of work and also made the code clearer (because the reader doesn't have to learn and understand a whole new home-cooked object system as a prerequisite).

So this post is not to be seen as a flame against C and for C++. Just a nostalgic account of language choice in one specific project. A war story, if you will. The moral, as it so often turns out to be, is to use the right tool for the job at hand.