Multibyte functions are useful in Far East countries, such as China (including Taiwan and Hong Kong), Korea, and Japan. The code pages of these countries are generally compatible with ASCII for character codes below 0x80, while the meaning of a code at or above 0x80 (and perhaps of the byte that immediately follows it) varies from code page to code page. For example, the name of the current Chinese Premier, Mr Zhu Rongji, is encoded in code page 936 (GBK) as 0xD6EC 0xE946 0xBBF9. Please note that the second character, Rong, is encoded as 0xE946, whose second byte is lower than 0x80 (however, the most common characters in CP 936 have both bytes greater than 0x80).
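To make the byte layout concrete, here is a minimal sketch in portable C that splits a CP 936 byte string into character codes. It does not use the Microsoft run-time library; the function name gbk_split and the lead-byte range 0x81..0xFE are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper, not part of any library.
   Assumption: in CP 936 a byte in 0x81..0xFE is a lead byte and is
   followed by exactly one trail byte, whatever the trail byte's value. */
size_t gbk_split(const char *s, unsigned int *codes, size_t max_codes)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != '\0' && n < max_codes) {
        if (*p >= 0x81 && *p <= 0xFE && p[1] != '\0') {
            /* Double-byte character: combine lead and trail bytes. */
            codes[n++] = ((unsigned int)p[0] << 8) | p[1];
            p += 2;
        } else {
            /* ASCII byte, or a dangling lead byte at the end. */
            codes[n++] = *p;
            p += 1;
        }
    }
    return n;   /* number of characters stored in codes[] */
}
```

Given the six bytes D6 EC E9 46 BB F9, the sketch stores three codes. The byte 0x46 would be the letter F on its own, but here it is consumed as the trail byte of 0xE946.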
To use multibyte functions, the first step is to set the code page. Of course, if you are using the OS default code page, this step can be omitted. You should use the _setmbcp function.
Besides an explicit code page number such as 936, _setmbcp accepts the following constants:
_MB_CP_SBCS = 0: Use single-byte code page
_MB_CP_OEM = -2: Use OEM code page obtained from operating system at program startup
_MB_CP_ANSI = -3: Use ANSI code page obtained from operating system at program startup
_MB_CP_LOCALE = -4: Use the current locale's code page obtained from a previous call to setlocale
_MB_CP_SBCS is equivalent to the "C" locale, where all char data types are one byte. _MB_CP_OEM seems rarely used. By default, the run-time system automatically sets the multibyte code page to the system-default ANSI code page, so _MB_CP_ANSI is generally not used either. _MB_CP_LOCALE unifies the multibyte code page setting with the locale setting. The following simple program shows that both the locale and the multibyte code page are set to those of Taiwan (CP 950):
    #include <locale.h>
    #include <mbctype.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        setlocale(LC_ALL, "Chinese_Taiwan");
        _setmbcp(_MB_CP_LOCALE);
        printf("%d\n", _getmbcp());
        return 0;
    }

It is trivial to mention that the _getmbcp function is used to get the code page.
There are _ismbb routines provided for byte classification. They are implemented both as functions and as macros. For the macro versions, Microsoft uses a look-up array, _mbctype, to get indexed attributes and then uses bit masks to get the needed information. The bit masks are _MS (0x01), _MP (0x02), _M1 (0x04), _M2 (0x08), _SBUP (0x10), and _SBLOW (0x20). The _ismbb macros use an expression such as ((_mbctype+1)[(unsigned char)(_c)] & Bitmask) to test:
_ismbbkalnum tests with _MS
_ismbbkprint tests with (_MS|_MP)
_ismbbkpunct tests with _MP
_ismbbkana tests with (_MS|_MP) (CP 932 only)
_ismbblead tests with _M1
_ismbbtrail tests with _M2
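The table-plus-mask technique can be sketched in portable C as follows. This is not Microsoft's implementation: the names attr_table, init_attr_table, ATTR_LEAD, ATTR_TRAIL, IS_LEAD and IS_TRAIL are hypothetical, the table is filled with assumed GBK byte ranges, and reserving slot 0 for EOF is my reading of why the macros index from table+1.

```c
/* Hypothetical attribute bits, playing the roles of _M1 and _M2. */
#define ATTR_LEAD  0x04
#define ATTR_TRAIL 0x08

/* 257 entries: slot 0 is reserved for EOF (-1), slots 1..256 hold the
   attributes of byte values 0..255 (an assumed layout). */
static unsigned char attr_table[257];

/* Fill the table with assumed GBK ranges:
   lead bytes 0x81..0xFE, trail bytes 0x40..0x7E and 0x80..0xFE. */
void init_attr_table(void)
{
    int b;

    attr_table[0] = 0;
    for (b = 0; b < 256; b++) {
        unsigned char attr = 0;
        if (b >= 0x81 && b <= 0xFE)
            attr |= ATTR_LEAD;
        if (b >= 0x40 && b <= 0xFE && b != 0x7F)
            attr |= ATTR_TRAIL;
        attr_table[b + 1] = attr;
    }
}

/* Same shape as the expression quoted above: skip the EOF slot,
   index by the byte value, then mask out the attribute of interest. */
#define IS_LEAD(c)  ((attr_table + 1)[(unsigned char)(c)] & ATTR_LEAD)
#define IS_TRAIL(c) ((attr_table + 1)[(unsigned char)(c)] & ATTR_TRAIL)
```

After init_attr_table() has run, IS_LEAD(0xEC) yields the mask value 4 rather than 1, because the macro returns the masked bit itself. The cast to unsigned char keeps the index valid even when the argument is a negative char.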
The _ismbbxxx functions combine the corresponding _ismbbkxxx function and isxxx function. E.g. _ismbbalnum(_c) is defined as (((_ctype+1)[(unsigned char)(_c)] & (_ALPHA|_DIGIT))||_ismbbkalnum(_c)).
Of these routines I only want to explain _ismbblead and _ismbbtrail. The former tests whether an integer can be the first byte of a multibyte character, and the latter tests whether an integer can be the second byte of a multibyte character. The functions _ismbslead and _ismbstrail (they have no inline versions) are also interesting in that they test whether a given character in a string is a lead byte or a trail byte depending on context. For example,
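    #include <mbctype.h>
    #include <mbstring.h>   /* declares _ismbslead */
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        char s[] = "\326\354"; /* ZHU1 */
        _setmbcp(936); /* Chinese GBK */
        /* Context-free test: can this byte value be a lead byte? */
        printf("%d\n", _ismbblead((unsigned char)s[1]));
        /* Context-sensitive test: is s[1] a lead byte within s? */
        printf("%d\n", _ismbslead((const unsigned char *)s,
                                  (const unsigned char *)&s[1]));
        return 0;
    }

The output is 4 (or 1 if you use the function version of _ismbblead) and 0, which shows that 0354 (0xEC) can be the lead byte in a Chinese string, but here it is not the lead byte.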
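The context-dependent test can be imitated in portable C by scanning from the start of the string. The sketch below is only an illustration of the idea, not the library's code; is_lead_in_context and gbk_can_lead are hypothetical names, and the lead-byte range 0x81..0xFE is an assumption for CP 936.

```c
#include <stddef.h>

/* Assumption: a CP 936 lead byte lies in 0x81..0xFE. */
static int gbk_can_lead(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

/* Hypothetical helper: returns 1 if the byte at pos starts a
   double-byte character when str is read from its beginning,
   and 0 otherwise. pos must point into str. */
int is_lead_in_context(const char *str, const char *pos)
{
    const unsigned char *p = (const unsigned char *)str;
    const unsigned char *target = (const unsigned char *)pos;

    while (p <= target && *p != '\0') {
        if (gbk_can_lead(*p)) {
            if (p == target)
                return 1;       /* a character starts exactly here */
            p++;                /* step onto the trail byte */
            if (*p == '\0')
                break;          /* dangling lead byte at the end */
        }
        p++;                    /* step past a single byte or trail byte */
    }
    return 0;
}
```

For the string "\326\354", the byte at index 0 is reported as a lead byte and the byte at index 1 is not, even though 0xEC falls inside the lead-byte range. The scan has to start at the beginning of the string because a byte's role depends on everything before it.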
More interesting multibyte string functions are defined in mbstring.h. They include equivalents to standard string routines (taking into account the properties of multibyte characters), character type routines, and conversion routines. We shall name one from each category for explanation.
_mbslen is the multibyte counterpart of strlen, but it returns the number of multibyte characters instead of the number of single-byte characters. For example,
    #include <mbctype.h>
    #include <mbstring.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char *argv[])
    {
        char s[] = "\326\354"; /* ZHU1 */
        _setmbcp(936); /* Chinese GBK */
        /* Both functions return size_t; cast to match %d */
        printf("%d\n", (int)strlen(s));
        printf("%d\n", (int)_mbslen((const unsigned char *)s));
        return 0;
    }

The output is 2 and 1. Another function, _mbstrlen (defined in stdlib.h), uses the locale (instead of code page) information to count the number of multibyte characters.
The byte types are denoted by the following constants:
_MBC_SINGLE = 0: denotes valid single byte char
_MBC_LEAD = 1: denotes lead byte
_MBC_TRAIL = 2: denotes trailing byte
_MBC_ILLEGAL = (-1): denotes illegal byte
The following program demonstrates the mbstowcs function:

    /* mbstowcs.c - Demonstration of the mbstowcs function */
    #include <stdio.h>
    #include <stdlib.h>
    #include <locale.h>

    int main()
    {
        wchar_t wcstr[64];
        char mbstr[] = "ABC - A Unicode conversion test";
        size_t n, i;

        // Set the locale used for the conversion
        //
        // NB: mbstowcs is affected by locale information. By the way, the
        // MBSTOWCS example given in MSDN makes no sense at all.
        //
        setlocale(LC_ALL, "English");

        // Convert the string
        n = mbstowcs(wcstr, mbstr, _mbstrlen(mbstr));
        if (n == (size_t)-1)    // invalid multibyte sequence
            return 1;

        // Output the Unicode string to the screen
        for( i = 0; i < n; i++ )
            printf( "%.4X\t", (unsigned)wcstr[i] );
        return 0;
    }

The output is:
    0041 0042 0043 0020 002D 0020 0041 0020
    0055 006E 0069 0063 006F 0064 0065 0020
    0063 006F 006E 0076 0065 0072 0073 0069
    006F 006E 0020 0074 0065 0073 0074
Try changing the content of mbstr and setting the appropriate locale. The result will be more interesting.
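When experimenting with other strings and locales, it helps to guard the conversion against invalid input and buffer overrun. The following is a minimal sketch using only standard C; convert_to_wide is a hypothetical wrapper name, and the choice to truncate silently when the buffer is too small is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper around the standard mbstowcs function.
   Converts src according to the current locale into dst, which has
   room for cap wide characters. The result is always terminated.
   Returns the number of wide characters written (excluding the
   terminator), or (size_t)-1 if cap is 0 or src contains an invalid
   multibyte sequence. Input that does not fit is truncated. */
size_t convert_to_wide(wchar_t *dst, size_t cap, const char *src)
{
    size_t n;

    if (cap == 0)
        return (size_t)-1;

    /* Leave one element free for the terminating L'\0'. */
    n = mbstowcs(dst, src, cap - 1);
    if (n == (size_t)-1)
        return (size_t)-1;

    dst[n] = L'\0';
    return n;
}
```

With a pure ASCII string in the default "C" locale, each byte maps to the wide character of the same value. With a multibyte string, call setlocale first so that mbstowcs knows which code page to apply.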
OK. I am a little tired now. Enough for today.
2001-12-16, written by Wu Yongwei
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 Licence.