When it comes to databases, I'm a complete n00b. It's strange, but in 10 years of programming, I've never really had to use them more than superficially [1]. This is probably because I've never done any serious web or enterprise programming. Anyway, sometimes I feel bad about this, because databases is one of those parts of basic know-how every programmer should know at least something about. Moreover, with today's powerful and free tools, learning databases is easier than ever.

So I've decided to learn some more about real databases (DBMS). I want this tool to be in my toolbox the next time I have to do something.

I'll begin with writing down why databases are needed at all. As any self-respecting n00b would ask - can't I just use flat files (i.e. comma-separated values ?)

Well, no. Here's a partial list (inspired by this) of things you can't do with flat files [2] and must use databases for.

Simultanous access

When several processes want to access a database simultaneously (this happens all the time in web applications), a flat file is not good enough [3]. Suppose that two processes attempt to insert a new entry to the database at the same time. If it's just a file, one of the following scenarios can happen:

  1. Both inserts succeed
  2. One of the inserts is lost
  3. Information from the two inserts is mixed together so that both are corrupted.

Obviously, only (1) is acceptable, but with flat files (2) and (3) can (and will!) happen as well.

A DBMS allows transparent simultaneous access to the database that keeps it in a consistent state, no matter which request came when. The implementations of this are various, from simple file locking (as done by SQLite) to more sophisticated in-process locking as done by full-fledged database servers such as MySQL.

Transactions

Suppose that your database keeps money accounts for users (you don't have to run a bank, it can be an online fantasy game). A request arrived to transver $1,000 from John to Sally. You debit $1,000 from John, but before you have time to credit Sally, the power fails and your computer turns off. What state are the accounts in ? Can you be sure that it's consistent ? No. Depending on your flat-file structure and access method, it may be that Jonh now has $1,000 less in his account, but Sally didn't receive the money.

Sure, if you're running an online game you can just blame it on the "very rare emergency" and work something out, but this excuse won't fly if you're running a real bank and was in the middle of transacting $1M from one corporation full of angry-and-eager lawyers to another.

DBMSs implement transactions. This is far from simple to do correctly. See, for example, this explanation from SQLite's documentation.

Complex data

Keeping your grocery list in a database is perhaps an overkill, but real-world data is complex. Consider the database Wordpress keeps. There are posts, there are comments, and there are users. The posts table in the DB has about 20 columns. Comments are also quite complex and are linked to posts. The same with users. How do you keep so much inter-linked data in a simple flat-file. Managing it would be a nightmare. You might be tempted to split it to several inter-linked files (to simulate DB tables), but this would just make simultaneous access and transactions even more hard to manage.

DBMSs hold data in row/column tables of data which can be inter-linked (rows from one table can be pointing to other tables).

Performance

Imagine that your flat-file database has grown really large. When you have to search it often for entries, a lot of time is spent, since searching in a flat-file database is essentially linear.

DBMSs implement sophisticated indexing schemes to allow you maximal performance (as long as you structure your database in a reasonable manner).

Decoupling

Let's admit it, we've all implemented at least a few home-cooked flat-file "databases" in our career. Sometimes a simple file is just enough. But what about the formats of these files ? What about the code to access them ? If I've "invented" at least 10 flat-file formats and methods for reading them in my short career, how many such methods exist in the whole programming community ? A million ? More ?

DBMSs allow decoupling the problem to different levels of abstraction:

  • A database is one level. You access it using SQL, a declarative DSL that is common to almost all popular DBMSs. All you have to learn is SQL, and you can switch DBMSs easily. Has your application overgrown SQLite and you must graduate to MySQL or Oracle ? Easy. As long as all your code stays above the SQL abstraction, the transition will be simple.
  • The application is another level. Your code accesses the database using SQL, so you don't have to invent yet another format and a reader/writer for it.

Conclusion

OK, now I've convinced myself that real databases have their uses. It's time to go and learn how to use them effectively. I will try to write down as much as possible in the blog, to help myself remember for future needs.

[1]A good example of superficial use is poking around the MySQL database Wordpress created for my blog.
[2]That is, without writing part of a database engine in your application.
[3]I have a great story to tell about this - how my friend and I implemented a dynamic website in Perl / CGI back in 1999, and how our use of "flat file" databases made us lose our data almost every day. But it's long, and perhaps I'll leave it for another post.