Unicode and UTF-8 Encoding
[Web Developers: Encoding problems with special characters? Unicode, UTF-8]
Web developers dealing with languages other than English often experience problems with ‘special characters’. But character encoding problems are not limited to other languages: the Trademark Sign, the Registered Sign, the Euro Currency Sign, and many other commonly used symbols can cause trouble too.
For developers, Unicode and the Unicode Transformation Formats are the solution, provided character encoding is used properly and consistently.
The standard ASCII character set defines 128 characters (code points 0 to 127). Special characters are found above 127.
Over the years, many character sets (code pages) have been developed for different languages and purposes. For example, iso-8859-1 (Western Europe) shares its first 128 characters with ASCII, but beyond that it adds many special characters used in European languages.
But not for all European languages!
Different platforms (e.g. Windows and Apple, among many others) store one and the same character (at least you see the same character on screen) as different byte values, depending on which code page is used. That makes data exchange difficult, since you always have to know the correct code page.
Unicode is a standard that solves the problems of multilingual content and dozens of different code pages. It is designed to cover every character in every writing system, and it assigns a unique number (a code point) to each character.
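To illustrate, here is a short Python sketch that prints the unique code point of a few characters, independent of any code page (the sample characters are my own choice):

```python
# Every character has exactly one Unicode code point, no matter
# which platform or code page the text came from.
for ch in ("A", "®", "€", "こ"):
    print(f"{ch!r} -> U+{ord(ch):04X}")
# 'A' -> U+0041, '®' -> U+00AE, '€' -> U+20AC, 'こ' -> U+3053
```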
UTF-8, the Unicode Transformation Format
UTF stands for Unicode Transformation Format and comes in several flavors: utf-8, utf-16, utf-32, and others. UTF-32 always uses 32 bits (4 bytes) per character, which makes it a poor choice for common web use.
utf-8 and utf-16 use a variable number of bytes to represent a character: utf-8 uses at least 1 byte, utf-16 at least 2 bytes. Let’s look at utf-8, the most common encoding format for web pages, XML files, etc.
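A quick Python comparison makes the size difference between the three encodings concrete (the `-le` variants are used here simply to avoid the byte-order mark that the plain `utf-16`/`utf-32` codecs prepend):

```python
# Byte counts for the same 3-character text in the common Unicode encodings.
text = "A®€"  # 1-byte, 2-byte, and 3-byte characters in UTF-8
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(text.encode(enc)), "bytes")
# utf-8: 6 bytes, utf-16-le: 6 bytes, utf-32-le: 12 bytes
```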
For example, every character from 0 to 127 (the standard ASCII set) uses just 1 byte. From decimal 128 (0x80) on, more bytes are needed:
- Range 0x80 to 0x7FF needs 2 bytes.
- Ranges 0x800 to 0xD7FF and 0xE000 to 0xFFFF need 3 bytes.
- Range 0x10000 to 0x10FFFF needs 4 bytes.
All ranges are according to the Unicode specification; see unicode.org for further information.
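These ranges can be checked directly in Python; the snippet below picks one representative character from each UTF-8 length class (the sample characters are my own choice):

```python
# One representative character per UTF-8 byte-length range.
samples = [
    ("A", 1),   # U+0041, ASCII range 0x00-0x7F
    ("®", 2),   # U+00AE, range 0x80-0x7FF
    ("€", 3),   # U+20AC, range 0x800-0xFFFF
    ("𝄞", 4),   # U+1D11E, range 0x10000-0x10FFFF
]
for ch, expected in samples:
    n = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} uses {n} byte(s)")
    assert n == expected
```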
Behind utf-8 and utf-16, which both use a variable byte width, stands a bit-shifting algorithm that encodes each value. How this algorithm works in detail is not relevant for web developers; they should just know that a Unicode value such as hex AE (the Registered Sign) may be stored as 2 bytes, as follows.
The Registered Sign with decimal code 174 needs 2 bytes in utf-8.
It’s encoded (and stored) by utf-8 as “C2 AE”.
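You can verify this 2-byte sequence yourself; a minimal Python check:

```python
# The Registered Sign (U+00AE, decimal 174) encoded as UTF-8.
encoded = "®".encode("utf-8")
print(" ".join(f"{b:02X}" for b in encoded))  # prints "C2 AE"
```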
The important part is that if you choose to encode your web page as utf-8 (which is recommended), then you HAVE to tell the client (the browser) which encoding format you used. And you have to be consistent within your website, particularly when handling e.g. XML content that is read into HTML pages.
How to tell the browser you want UTF-8 encoding
There are various means:
- Your web server should deliver an HTTP header with default encoding information. ASP.NET handles this via its globalization settings, which you can change, of course.
- If your web server does not do this, set the header from your Java or ASP.NET code.
- Your html page should contain a meta tag like:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
- Check that all relevant XSL Stylesheets state the encoding
- Check that all relevant XML files state the encoding
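For reference, here is a sketch of what these declarations look like; these are fragments for three different file types, not a single document (HTML5 also allows a shorter meta form):

```html
<!-- HTML5 shorthand for the http-equiv meta tag -->
<meta charset="utf-8">

<!-- XML declaration at the top of an XML file -->
<?xml version="1.0" encoding="UTF-8"?>

<!-- output element in an XSLT stylesheet -->
<xsl:output method="html" encoding="UTF-8"/>
```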
The HTTP header, the server-side code, and the meta tag should instruct the browser which decoding to use.
A user may override this manually, but this cannot be controlled by the web developer.
If there is an inconsistency between how your data is actually encoded and what you tell the browser, then you get the well-known issue of weird additional characters next to, say, your Registered Sign.
NOTE: If you work with ASP.NET, there are two things to consider that I haven’t covered here: the encoding of your source files, .aspx as well as .cs files. I won’t go deep into this right here, since it’s a topic of its own and this article is more general. A brief summary: for your source files, Visual Studio offers the Advanced Save Options dialog, where you can set the type of encoding you need. For everything else, you can control the output and file encoding with the globalization settings.
I’ll cover the topic “Globalization” another time.
Back to our example: the Registered Sign (R in a circle)
The following is what you see in your HTML output (source view in a browser) when everything went OK and the browser recognized your encoding correctly.
® (encoded as UTF-8; you’ll actually see a circle with an R inside)
&#174; (encoding: decimal numeric character reference)
&#xAE; (encoding: hex numeric character reference)
&reg; (encoding: HTML named entity)
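All the entity forms resolve to the same character; Python’s standard library can demonstrate this:

```python
import html

# Decimal, hex, and named HTML entity references for the Registered Sign
# all decode to the same Unicode character.
for ref in ("&#174;", "&#xAE;", "&reg;"):
    print(ref, "->", html.unescape(ref))  # each prints "®"
```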
Since we know that the decimal representation is 174 (above 127), we also know that utf-8 will use 2 bytes to store the value: “C2 AE”.
If you use utf-8 as the encoding and the Registered Sign is saved as this 2-byte combination, but you provide the browser with misleading information, e.g. iso-8859-1, then you will see the following: an A with a circumflex on it (Â) followed by the R in the circle. To solve it, declare utf-8 correctly in the HTTP header, the server-side code, and/or the meta tag.
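This exact mojibake effect can be reproduced in Python by encoding as utf-8 but decoding as iso-8859-1:

```python
# Simulate a browser that was (wrongly) told the page is iso-8859-1:
# the 2 UTF-8 bytes of "®" are interpreted as 2 separate characters.
mojibake = "®".encode("utf-8").decode("iso-8859-1")
print(mojibake)  # prints "Â®"
```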
Note: a browser or search engine robot may decide which declaration takes precedence. It should be the HTTP header, but I know that some robots use the meta tag, since they assume that is the place where the web designer can influence the output type. So better be careful and consistent. And all that theoretical knowledge will not help if you don’t check how your source files are actually stored, so find out whether your editor has a setting to change the encoding format. It should have one; even Notepad can do the job.
I hope that explains the additional ‘trash’ characters you might see when using e.g. the Registered Sign.
Here are some samples:
こんにちは (konnichi-wa), Japanese, “Hello; Good afternoon”
® (Registered Sign)
äüöß (German umlauts and ß)
Cheers, best regards,