Yesterday I ran into Joel's article on Unicode and character sets. I read it a couple of years ago, but now I want to link to it here to "archive it". It really is an interesting article, and I highly recommend it to all programmers. Joel really cleared my head about character encodings. Why do some languages (Spanish, for example) show up well in text files, while others (Russian) don't? What are those "Windows-1255" (Hebrew), "ISO-8859-1" (Western European, a.k.a. Latin-1) and "KOI8-R" (Cyrillic) encodings? Where does Unicode come into all of this? The most important message from the article is:
It does not make sense to have a string without knowing what encoding it uses
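To see what that means in practice, here is a small sketch in Python (my choice of language, not something from the article). It takes one sequence of bytes and reads it under three of the encodings mentioned above. The variable names are made up for illustration.

```python
# One sequence of bytes, three different readings.
# "Русский" is stored here as KOI8-R, one byte per letter.
raw = "Русский".encode("koi8_r")

# Decoded with the encoding it was written in, the text comes back intact.
as_koi8 = raw.decode("koi8_r")

# Decoded as Latin-1, every byte maps to some Western European character,
# so nothing fails, but the result is unreadable.
as_latin1 = raw.decode("latin-1")

# Decoded as Windows-1255, most bytes turn into Hebrew letters or marks;
# bytes with no meaning there become U+FFFD replacement characters.
as_hebrew = raw.decode("cp1255", errors="replace")

print(as_koi8)
print(as_latin1)
print(as_hebrew)
```

The bytes never change. Only the encoding you assume decides which characters you get, which is why a string without a known encoding is meaningless.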
In plain text files, a special byte order mark at the start of the file can indicate which Unicode encoding the file uses. In XML there is the encoding attribute in the XML declaration at the top of the document. In HTML there are content-type meta tags, et cetera. I also found out that my blog uses UTF-8, which is the encoding WordPress generates its pages in. This is very good, since it means I can write text in a whole lot of languages that anyone can see: עברית (Hebrew), Español (Spanish), Русский (Russian).
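As for the byte order mark in plain text files, checking for one could look roughly like this. It is only a sketch in Python: the function name sniff_bom and the BOMS table are hypothetical, while the BOM constants come from Python's standard codecs module. A file without a mark tells you nothing, so the function returns None in that case rather than guessing.

```python
import codecs

# Known byte order marks and the encoding each one signals.
# UTF-32 comes first because the UTF-32 LE mark starts with the
# same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, mark_length), or (None, 0) if no mark is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Usage: a UTF-8 file that starts with a byte order mark.
data = codecs.BOM_UTF8 + "עברית".encode("utf-8")
encoding, skip = sniff_bom(data)
if encoding is not None:
    print(encoding, data[skip:].decode(encoding))
```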