Specify LANG in a UTF-8 Web Page!

More and more Web pages are now encoded in UTF-8, which is good. What is not good is that Web developers often forget to specify the language of their pages: even Google has this problem. In the usual case, where developers test and view the pages on their own systems, no problem is visible. However, omitting the language neglects a real necessity, and the consequences show up easily in this internationalized world.

I mentioned Google, the giant, so let’s look at it first. The following is a screenshot of www.google.com under Ubuntu Linux 5.10 (locale en_US.UTF-8) when accessed from China (hence the automatic redirection to the Chinese page):

Notice that the text looks a little strange? Two different typefaces are in fact used: a Japanese font (the bigger one) and a Chinese font (the smaller one). The reason is that, since no language is specified, the browser does not know which East Asian font to use, and the style ‘font-family:arial,sans-serif’ in the stylesheet does not help much either. When a character exists in Arial, Arial is used. If it does not (a Chinese character here), the browser checks which language the character belongs to and renders it with the default sans-serif font of that language. The problem is that some characters exist in several languages, and ‘/’ is just in this category. It seems that Firefox prefers Japanese over Chinese: the Japanese font is used for such shared characters, and the (Simplified) Chinese font is used only when a character exists in Chinese but not in Japanese. Hence the ugly display. I can reproduce the exact effect here (if you are using Firefox under Ubuntu, or at least Linux, of course) by setting the lang attribute of the text to ‘ja’, and the result is ‘网页’.
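
The effect can be sketched in markup as follows. This is a minimal illustration, not Google's actual page source: the same two characters are written three times, once with no language, once tagged as Japanese, and once tagged as Simplified Chinese. The surrounding div and p elements are assumptions made for the demonstration, and which fonts actually appear depends on the browser and on the fonts installed.

```html
<!-- Hypothetical test fragment; assumes the page is served as UTF-8. -->
<div style="font-family: arial, sans-serif">
  <!-- No lang: the browser has to guess which East Asian font to fall back to. -->
  <p>网页</p>
  <!-- Tagged as Japanese: a Japanese font is tried first for shared characters. -->
  <p lang="ja">网页</p>
  <!-- Tagged as Simplified Chinese: both characters should come from one Chinese font. -->
  <p lang="zh-CN">网页</p>
</div>
```

If the first and third paragraphs look different on your system, the browser's guess for untagged text is not Simplified Chinese.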

Why was the lang attribute not needed with the old encodings such as ISO-8859-1, GB2312, Big5, and SJIS? Because the browser could easily deduce the language of the Web page from the character set (well, not exactly, but English and other Western languages generally use the same Latin-1 font). Why has this problem gone unnoticed by Web developers? Because, when the language is not explicitly specified, the browser assumes a default language for a UTF-8 page according to the regional setting or locale of the system, which often matches the intended language quite well. So Web developers are generally unaware of the issue. In the Google case, the display will be fine if the locale is zh_CN.UTF-8 (Chinese, as in Mainland China, encoded in UTF-8), and you can check the effect here.
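
To make the difference concrete, compare two alternative charset declarations, shown here as a sketch in HTML 4.01 syntax (only one of them would appear in a real page). The legacy declaration implicitly narrows down the language, while the UTF-8 one carries no such hint.

```html
<!-- Legacy encoding: GB2312 is used for Simplified Chinese,
     so the browser can infer a suitable font from it. -->
<meta http-equiv="Content-Type" content="text/html; charset=GB2312">

<!-- UTF-8 covers every script, so it says nothing about the language;
     the language has to be stated separately with the lang attribute. -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
```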

What if fonts are specified? Well, first, fonts are not universal across platforms (Linux has no Arial, for example). More importantly, specifying fonts does not always work, because of ambiguous-width characters. The most frequent annoyance with UTF-8 pages is English pages that do not say that they are English. Take a look at the following three lines (the first two have lang set to ‘zh-CN’, the second also has ‘10pt Arial’ specified in its style, and the third has lang set to ‘en’ and ‘10pt Arial’ specified):

Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences
Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences
Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences
Click here (my screen shot) if you do not see any problem above. Also, if you are a Linux Firefox user, chances are that you have not configured your browser to use the Chinese fonts correctly. On Ubuntu Linux 5.10, the serif and sans-serif fonts for Simplified Chinese should be ‘AR PL Sungti GB’ and ‘AR PL KaitiM GB’, respectively.

The first two lines, as Internet Explorer users with East Asian language support installed will see, resemble what I often really see on the Web. I hope you now see how ugly it is and understand why I am taking the trouble to write this article. The problem arises because the default language of my Windows box is ‘Chinese (PRC)’, so the browser assumes that the language is ‘zh-CN’. Since the context is Chinese, the browser also assumes that the left and right quotation marks, which have ambiguous width, should be rendered full width, and the default Chinese font is used for them instead of Arial. But the context should not be Chinese, right? So always spell out your language, so that browsers do not make wild and wrong guesses.
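
The three test lines discussed above could be marked up roughly as follows. This is a sketch under stated assumptions: the use of p elements is a guess, and only the lang values and the ‘10pt Arial’ style come from the description in the text.

```html
<!-- Line 1: declared as Chinese, no font specified. -->
<p lang="zh-CN">Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences</p>

<!-- Line 2: declared as Chinese, Arial requested; the ambiguous-width
     quotation marks may still be drawn full width in a Chinese font. -->
<p lang="zh-CN" style="font: 10pt Arial">Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences</p>

<!-- Line 3: declared as English, Arial requested; the quotation marks
     are expected to be rendered in Arial like the rest of the text. -->
<p lang="en" style="font: 10pt Arial">Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences</p>
```

A page with no lang at all behaves like one of the first two lines whenever the visitor's system default language is Chinese.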

What is the recommendation, then? If your Web page is encoded in UTF-8 and does not yet have lang specified, set it to your default language so that other people’s browsers will render the text the same way as yours. In many cases you just need to add it to the outermost html tag, as in ‘<html lang="en">’ (to my disappointment, in some cases this is not enough, and you have to add the attribute to some inner HTML tag, like ‘<table>’; this also means testing is necessary...). If other languages (say, Chinese) are used as well, wrap them in something like ‘<span lang="zh-CN">...</span>’. Not difficult, right? In fact, the rather-too-verbose HTML output from Microsoft Office has been correct in this respect for quite a few years: the language attributes of your Word text are preserved very well in its HTML incarnation.
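
Putting the recommendation together, a minimal page might look like the sketch below. The title, the sample sentence, and the table content are placeholders invented for illustration, and the HTML 4.01 doctype is an assumption. The repeated lang on the table is only needed in the cases where the attribute on the html tag turns out not to be enough.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<!-- The default language of the whole page is declared on the root element. -->
<html lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>Example page</title>
</head>
<body>
  <!-- Text in another language is wrapped and tagged individually. -->
  <p>The Chinese word for “Web page” is <span lang="zh-CN">网页</span>.</p>

  <!-- Repeat lang on an inner element if testing shows it is not inherited. -->
  <table lang="en">
    <tr><td>Some “quoted” English text</td></tr>
  </table>
</body>
</html>
```

Viewing such a page under several system locales (for example en_US.UTF-8 and zh_CN.UTF-8) is a simple way to confirm that the rendering no longer depends on the visitor's settings.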

Finally, some formal references. The HTML 4.01 Specification describes the lang attribute here, and the words most relevant to the problem I am discussing are (my emphasis):

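Some situations where author-supplied language information may be helpful include:

Assisting search engines
Assisting speech synthesizers
Helping a user agent select glyph variants for high quality typography
Helping a user agent choose a set of quotation marks
Helping a user agent make decisions about hyphenation, ligatures, and spacing
Assisting spell checkers and grammar checkers

Long live the Web! Long live UTF-8 (as long as you specify your lang)!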

2006-3-28, written by Wu Yongwei
2006-5-9, last updated by Wu Yongwei

This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 Licence.
