Multibyte Functions in Microsoft C Run-time

Many C run-times provide locale- and multibyte-related functions. However, since the relevant information is part of the operating system, the versions provided by the OS vendor are generally the most complete. This article looks at the multibyte functions provided in the Microsoft C Run-time.

Multibyte functions are useful in Far East countries, such as China (including Taiwan and Hong Kong), Korea, and Japan. The code pages of these countries are generally compatible with ASCII for character codes below 0x80, while the meaning of a code at 0x80 or above (and perhaps of the byte that immediately follows it) varies from code page to code page. For example, the name of the current Chinese Premier, Mr Zhu Rongji, is encoded in Codepage 936 (GBK) as 0xD6EC 0xE946 0xBBF9. Note that the second character, Rong, is encoded as 0xE946, whose second byte is lower than 0x80 (the most common characters in CP 936, however, have both bytes greater than 0x80).

To use multibyte functions, the first step is to set the code page with the _setmbcp function. If you are using the OS default code page, this step can be omitted.

int _setmbcp(int codepage);: Sets a new multibyte code page.

The codepage argument can be set to any valid code page value, or one of the following predefined constants:

_MB_CP_SBCS = 0 Use single-byte code page
_MB_CP_OEM = -2 Use OEM code page obtained from operating system at program startup
_MB_CP_ANSI = -3 Use ANSI code page obtained from operating system at program startup
_MB_CP_LOCALE = -4 Use the current locale's code page obtained from a previous call to setlocale

_MB_CP_SBCS is equivalent to the "C" locale, where every character occupies one byte. _MB_CP_OEM seems rarely used. By default, the run-time system automatically sets the multibyte code page to the system-default ANSI code page, so _MB_CP_ANSI is generally not used either. _MB_CP_LOCALE unifies the multibyte code page setting with the locale setting. The following simple program shows that both the locale and the multibyte code page are set to those of Taiwan (CP 950):

#include <locale.h>
#include <mbctype.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    setlocale(LC_ALL, "Chinese_Taiwan");
    _setmbcp(_MB_CP_LOCALE);
    printf("%d\n", _getmbcp());
    return 0;
}

As you would expect, the _getmbcp function is used to get the code page:

int _getmbcp(void);: Returns the current multibyte code page.

There are many _ismbb routines provided for byte classification. They are implemented both as functions and as macros. For the macro versions, Microsoft uses a look-up array, _mbctype, to get the attributes of a byte and then applies bit masks to extract the needed information. The bit masks are _MS (0x01), _MP (0x02), _M1 (0x04), _M2 (0x08), _SBUP (0x10), and _SBLOW (0x20). The _ismbb macros use an expression such as ((_mbctype+1)[(unsigned char)(_c)] & Bitmask) to test:

_ismbbkalnum tests with _MS
_ismbbkprint tests with (_MS|_MP)
_ismbbkpunct tests with _MP
_ismbbkana tests with (_MS|_MP) (CP 932 only)
_ismbblead tests with _M1
_ismbbtrail tests with _M2

The other _ismbbxxx routines combine the corresponding _ismbbkxxx and isxxx tests. E.g. _ismbbalnum(_c) is defined as (((_ctype+1)[(unsigned char)(_c)] & (_ALPHA|_DIGIT))||_ismbbkalnum(_c)).

Of these routines I only want to explain _ismbblead and _ismbbtrail. The former tests whether an integer can be the first byte of a multibyte character, and the latter tests whether an integer can be the second byte of a multibyte character. The functions _ismbslead and _ismbstrail (they have no inline versions) are also interesting in that they test whether a given byte in a string is a lead byte or a trail byte depending on context. For example,

#include <mbctype.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    unsigned char s[] = "\326\354"; /* ZHU1 */
    _setmbcp(936);          /* Chinese GBK */
    printf("%d\n", _ismbblead(s[1]));
    printf("%d\n", _ismbslead(s, &s[1]));
    return 0;
}

The output is 4 (or 1 if you use the function version of _ismbblead) and 0, which shows that 0354 (0xEC) can be a lead byte in a Chinese string, but in this string it is not one.

More interesting multibyte string functions are declared in mbstring.h. They include equivalents to standard string routines (taking into account the properties of multibyte characters), character type routines, and conversion routines. We shall pick one from each category to explain.

size_t _mbslen(const unsigned char *string);: Returns the number of multibyte characters in string.

It is similar to strlen, but it returns the number of multibyte characters rather than the number of bytes. For example,

#include <mbctype.h>
#include <mbstring.h>
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[])
{
    char s[] = "\326\354"; /* ZHU1 */
    _setmbcp(936);          /* Chinese GBK */
    printf("%d\n", (int)strlen(s));
    printf("%d\n", (int)_mbslen((const unsigned char *)s));
    return 0;
}

The output is 2 and 1. Another function, _mbstrlen (declared in stdlib.h), uses the locale (instead of the multibyte code page) to count the number of multibyte characters.

int _mbbtype(unsigned char c, int type);: Returns the type of a byte, based on the type of the previous byte.

The return values are defined in mbctype.h:

_MBC_SINGLE = 0 denotes valid single byte char
_MBC_LEAD = 1 denotes lead byte
_MBC_TRAIL = 2 denotes trailing byte
_MBC_ILLEGAL = (-1) denotes illegal byte

For detailed usage, please consult MSDN.

size_t mbstowcs(wchar_t *wcstr, const char *mbstr, size_t count);: Converts a sequence of multibyte characters to a corresponding sequence of wide characters.

Let's see the example now:

/* mbstowcs.c - Demonstration of the mbstowcs function */
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
int main()
{
    wchar_t wcstr[64];
    char    mbstr[] = "ABC - A Unicode conversion test";
    size_t n, i;
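    // Set the locale used for the conversion. "English" matches the
    // ASCII test string; use a Far East locale for multibyte text.
    //
    // NB: mbstowcs is affected by locale information. By the way, the
    // MBSTOWCS example given in MSDN makes no sense at all.
    //
    setlocale(LC_ALL, "English");
    // Convert the string (no terminating null is stored, since count
    // equals the number of characters)
    n = mbstowcs(wcstr, mbstr, _mbstrlen(mbstr));
    if (n == (size_t)-1)
        return 1;   // invalid multibyte character
    // Output the Unicode string to the screen
    for( i = 0; i < n; i++ )
        printf( "%.4X\t", (unsigned)wcstr[i] );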
    return 0;
}

The output is

0041    0042    0043    0020    002D    0020    0041    0020    0055    006E
0069    0063    006F    0064    0065    0020    0063    006F    006E    0076
0065    0072    0073    0069    006F    006E    0020    0074    0065    0073
0074

It seems logical, but not very interesting, right? I originally used Chinese characters, but Chinese characters cannot be displayed in an ISO-8859-1 page :-(. However, if you are using a Far East-capable version of Windows (I confirmed that English Windows 95 will not do), you may try giving a suitable value to mbstr and setting the appropriate locale. The result will be more interesting.

OK. I am a little tired now. Enough for today.

2001-12-16, written by Wu Yongwei

This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 Licence.
