More and more Web pages are now encoded in UTF-8, which is good. What is not good is that Web developers often forget to specify the language of their pages; even Google has this problem. In the normal case, where developers test and view the pages on their own machines, no problem is visible. However, leaving out the language ignores a real requirement, and the omission shows up easily in this internationalized world.
I mentioned Google, the Big Giant. Let’s look at it first. The following is a screen shot of www.google.com under Ubuntu Linux 5.10 (locale en_US.UTF-8) when accessed from China (hence the automatic redirection to the Chinese page):
See that the text looks a little strange? There really are two different typefaces in use: one is a Japanese font (the bigger one), and the other is a Chinese font (the smaller one). The reason? Since no language is specified, the browser does not know which international font to use, and the style ‘’ in its stylesheet does not help much either. The result is that when a character exists in Arial, Arial is used. If it does not (as with a Chinese character here), the browser checks which language the character belongs to and renders it in the default sans-serif font for that language. The problem is that some characters exist in several languages, and ‘网/网’ falls into this category. Firefox seems to prefer Japanese over Chinese: the Japanese font is used for such characters, and the (Simplified) Chinese font is used only when a character exists in Chinese but not in Japanese. Hence the ugly display.
I can reproduce the exact effect here (if you are using Firefox under Ubuntu, or at least Linux, of course) by setting the lang attribute of the text to ‘ja’; the result is ‘网页’.
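A minimal sketch of markup that should trigger this effect is shown below. It is an illustration only, not the source of this page: the p and span wrappers are assumptions, and which fonts actually get picked depends on the browser and on the fonts installed.

```html
<!-- Illustrative only: the same two characters tagged with two different languages. -->
<p>
  <!-- Tagged as Japanese: a browser may pick its default Japanese font. -->
  <span lang="ja">网页</span>
  <!-- Tagged as Simplified Chinese: a browser may pick its default Chinese font. -->
  <span lang="zh-CN">网页</span>
</p>
```

If the two spans render in visibly different typefaces, the browser is choosing fonts by language rather than by character alone.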
Why was the lang attribute not needed when we used the old encodings ISO-8859-1, GB2312, Big5, SJIS, etc.? Because the browser could easily deduce the language of the Web page from the character set (well, not exactly, but English and other Western languages generally use the same Latin-1 font). Why has this problem gone unnoticed by Web developers? Because when the language of a UTF-8 Web page is not explicitly specified, the browser assumes a default language according to the regional setting/locale of the system, which often matches the intended language quite well. So Web developers are generally not aware of the issue. In the Google case, the display is fine if the locale is zh_CN.UTF-8 (Chinese, as in Mainland China, encoded in UTF-8), and you can check the effect here.
What if fonts are specified? Well, first, fonts are not universal across all platforms (Linux does not ship with Arial, for example). More importantly, specifying fonts does not always work, because of the existence of ambiguous-width characters. The most frequent pain with UTF-8 pages comes from English ones that do not declare that they are English. Take a look at the following three lines (the first two have lang set to ‘zh-CN’, the second also has ‘10pt Arial’ specified in style, and the third has lang set to ‘en’ and ‘10pt Arial’ specified):
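For readers who want to reproduce the test, the three lines could be marked up roughly as follows. This is a sketch based on the description above, not the actual source of this page; the p elements and the font shorthand are assumptions.

```html
<!-- Hypothetical markup for the three test lines. -->
<!-- 1: Chinese context, no font specified. -->
<p lang="zh-CN">Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences</p>
<!-- 2: Chinese context, 10pt Arial requested. -->
<p lang="zh-CN" style="font: 10pt Arial">Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences</p>
<!-- 3: English context, 10pt Arial requested. -->
<p lang="en" style="font: 10pt Arial">Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences</p>
```

The three lines themselves follow.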
Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences
Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences
Sun’s T2000 “Coolthreads” Server: First Impressions and Experiences
Click here (my screen shot) if you do not find any problems above. Also, if you are a Linux Firefox user, chances are that you have not configured your browser to use the Chinese fonts correctly. On Ubuntu Linux 5.10, the serif and sans-serif fonts for Simplified Chinese should be ‘AR PL Sungti GB’ and ‘AR PL KaitiM GB’, respectively.
The first two lines, as Internet Explorer users with East Asian language support installed will see, often look like what I really see on the Web. I hope you now see how ugly it is and understand why I am taking the trouble to write this article. The problem is caused by the fact that the default language of my Windows box is ‘Chinese (PRC)’, so the browser assumes my lang is ‘zh-CN’. Since the context is Chinese, the browser also assumes that the left and right quotes, which have ambiguous width, should be rendered full width, and the default Chinese font is used instead of Arial. But the context should not be Chinese, right? So always spell out your language, so that browsers do not make wild and wrong guesses.
What is the recommendation, then? If your Web page is encoded in UTF-8 and does not yet have lang specified, just set it to your default language somewhere so that other people’s browsers will render the text the same way as yours. In many cases you just need to add it to the outermost html tag, as in ‘<html lang="en">’ (to my disappointment, in some cases this is not enough: you have to add the attribute to some inner HTML tag, like ‘<table>’, which also means that testing is necessary). If other languages (say, Chinese) are used as well, wrap them with something like ‘<span lang="zh-CN">...</span>’. Not difficult, right? In fact, the rather-too-verbose HTML output from Microsoft Office has been correct in this respect for quite a few years: the language attributes of your Word text are preserved very well in its HTML incarnation.
Finally, some formal references. The HTML 4.01 Specification describes the lang attribute here, and the most relevant words about the problem I am discussing are (my emphasis):
Some situations where author-supplied language information may be helpful include:
- Assisting search engines
- Assisting speech synthesizers
- Helping a user agent select glyph variants for high quality typography
- Helping a user agent choose a set of quotation marks
- Helping a user agent make decisions about hyphenation, ligatures, and spacing
- Assisting spell checkers and grammar checkers
Long live the Web! Long live UTF-8 (as long as you specify your lang)!
2006-3-28, written by Wu Yongwei
2006-5-9, last updated by Wu Yongwei
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 Licence.