Character codes used for handling characters and character strings can be categorized into two groups:
File code is used for text data exchange and for storing data in a file. It has a fixed byte ordering, Big Endian, regardless of the byte ordering of the underlying system. Codesets like UTF-8, EUC, single-byte codesets, BIG5, Shift-JIS, PCK, GBK, GB18030, and so on come under this category. In the context of the functions described in this section, the term multibyte character is a general term that refers to a character in the codeset of the current locale, even though that codeset might in some cases be a single-byte codeset.
Process code is a fixed-width representation of a character used for internal processing. It is in the native byte ordering of the platform, which can be either Big Endian or Little Endian. Encodings like UTF-32, UCS-2, and UCS-4 can be wide-character encodings.
Conversion between multibyte data and wide-character data is often necessary. When a program takes input from a file, the multibyte data in the file is converted into wide-character process code by using input functions like fscanf() and fwscanf() or by using conversion functions like mbtowc() and mbsrtowcs() after the input. To convert output data from wide-character format to multibyte character format, use output functions like fwprintf() and fprintf() or apply conversion functions like wctomb() and wcsrtombs() before the output.
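As a minimal sketch of the conversion described above, the following helper calls mbsrtowcs() twice: once with a null destination to count the wide characters, and once to convert. The name mbs_to_wcs() is hypothetical and introduced here only for illustration. The sketch assumes that the program has already selected a locale with setlocale().

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert a multibyte string in the codeset of the
 * current locale to a newly allocated wide-character string.
 * Returns NULL on an invalid sequence or on allocation failure.
 * The caller frees the result.
 */
wchar_t *mbs_to_wcs(const char *mbs)
{
    mbstate_t st;
    const char *src = mbs;
    size_t len;
    wchar_t *wcs;

    /* First pass: a null destination only counts the wide characters. */
    memset(&st, 0, sizeof st);
    len = mbsrtowcs(NULL, &src, 0, &st);
    if (len == (size_t)-1)
        return NULL;

    wcs = malloc((len + 1) * sizeof *wcs);
    if (wcs == NULL)
        return NULL;

    /* Second pass: convert, including the terminating null wide character. */
    src = mbs;
    memset(&st, 0, sizeof st);
    mbsrtowcs(wcs, &src, len + 1, &st);
    return wcs;
}
```

The reverse direction can be handled the same way with wcsrtombs().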
Functions for handling characters, wide characters, and corresponding data types are described in the following sections.
The ISO/IEC 9899 standard defines the term "wide character" and the wchar_t and wint_t data types.
A wide character is a representation of a single character that fits into an object of type wchar_t.
The wchar_t is an integer type capable of representing all characters for all supported locales.
The wint_t is an integer type capable of storing any valid value of wchar_t or WEOF.
A wide-character string (also wide string or process code string) is a sequence of wide characters terminated by a null wide character code.
In Oracle Solaris, the internal form of wchar_t is specific to a locale. In the Oracle Solaris Unicode locales, wchar_t has the UTF-32 Unicode encoding form, and other locales have different representations.
For more information, see the stddef.h(3HEAD) and wchar.h(3HEAD) man pages.
The following functions are used for character classification and return a non-zero value for true, and 0 for false. With the exception of the isascii() function, these functions are locale sensitive, specifically to the LC_CTYPE category of the current locale.
Test for an alphabetic character
Test for an alphanumeric character
Test for a 7-bit US-ASCII character
Test for a blank character
Test for a control character
Test for a decimal digit
Test for a visible character
Test for a lowercase letter
Test for a printable character
Test for a punctuation character
Test for a white-space character
Test for an uppercase letter
Test for a hexadecimal digit
These functions should not be used in a locale with a multibyte codeset, such as UTF-8. Use the wide-character classification functions described in the following section for multibyte codesets.
The behavior of some of these functions also depends on the compiler options used at compile time. The ctype(3C) man page describes the "Default" and "Standard conforming" behaviors for the isalpha(), isgraph(), isprint(), and isxdigit() functions. For example, the isalpha() function is defined as follows:
Default: Tests for any character for which isupper() or islower() is true.
Standard conforming: Tests for any character for which isupper() or islower() is true, or any character that is one of the current locale-defined set of characters for which none of iscntrl(), isdigit(), ispunct(), or isspace() is true. In the C locale, isalpha() returns true only for the characters for which isupper() or islower() is true.
This has consequences for languages or alphabets that have no case for their letters (also called unicase), such as Arabic, Hebrew, or Thai. For alphabetic characters such as aleph (0xE0) in the Hebrew legacy locale he_IL.ISO8859-8, the functions isupper() and islower() always return false. Therefore, with the default behavior, the isalpha() function also returns false for such characters. If compiler options are enabled for the standard conforming behavior, the isalpha() function returns true for such characters. For more information, see the isalpha(3C) and standards(5) man pages.
See also the Oracle Developer Studio 12.6: C User's Guide and the ctype(3C) and SUSv3(5) man pages.
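The following sketch shows one common way to call a byte-oriented classification function safely. The helper name count_alpha() is hypothetical. The cast to unsigned char is there because the ctype functions expect a value representable as unsigned char (or EOF), and a plain char can be negative for bytes above 0x7F. The result depends on the LC_CTYPE category in effect and, as noted above, is meaningful only in single-byte codesets.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the bytes in s that isalpha() classifies as
 * alphabetic in the current locale. Intended for single-byte codesets only.
 */
size_t count_alpha(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* Cast first: passing a negative char value is undefined. */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```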
The following man pages describe functions that classify wide characters and return a non-zero value for TRUE, and 0 for FALSE. These functions check the given wide character against named character classes, such as alpha, lower, or jkana, which are defined in the LC_CTYPE category of the current locale. Therefore, these functions are locale sensitive.
Test for an alphabetic wide character
Test for an alphanumeric wide character
Test whether a wide character represents a 7-bit US-ASCII character
Test for a blank wide character
Test for a control wide character
Test for a decimal digit wide character
Test for a visible wide character
Test for a lowercase letter wide character
Test for a printable wide character
Test for a punctuation wide character
Test for a white-space wide character
Test for an uppercase letter wide character
Test for a hexadecimal digit wide character
Test for a wide character representing an English language character, excluding US-ASCII characters
Test for a wide character representing an ideographic language character, excluding US-ASCII characters
Test for a wide character representing a digit, excluding US-ASCII characters
Test for a wide character representing a phonetic language character, excluding US-ASCII characters
Test for a wide character representing a special language character, excluding US-ASCII characters
The following character classes are defined in all the locales:
alnum
alpha
blank
cntrl
digit
graph
lower
print
punct
space
upper
xdigit
The isenglish(), isideogram(), isnumber(), isphonogram(), and isspecial() are legacy Oracle Solaris specific wide-character classification functions. The character classes for these functions are defined only in the following Asian locales: ko_KR.EUC, zh_CN.EUC, zh_CN.GBK, zh_CN.GB18030, zh_HK.BIG5HK, zh_TW.BIG5, and zh_TW.EUC and their variants. The return values will always be false when used in other locales including Unicode locales.
You can query for a specific character class in a generic way by using the following functions:
Define character class
Test character for specified class
In the following example, calls to the iswctype() and wctype() functions are used to check whether the given Unicode character belongs to the jhira character class. The jhira character class contains the characters of the Japanese Hiragana script.
wchar_t wc;
int ret;

setlocale(LC_ALL, "ja_JP.UTF-8");

/* "\xe3\x81\xba" is UTF-8 for HIRAGANA LETTER PE */
ret = mbtowc(&wc, "\xe3\x81\xba", 3);
if (ret == -1) {
    /* Invalid character sequence. */
    :
}

if (iswctype((wint_t)wc, wctype("jhira"))) {
    /* %lc prints a wint_t argument as a wide character. */
    wprintf(L"'%lc' is a hiragana character.\n", (wint_t)wc);
}
The example will produce the following output:
'ぺ' is a hiragana character.
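The same pair of functions can be wrapped for any class name. The following sketch counts the wide characters of a string that belong to a named class. The helper name count_in_class() is hypothetical. The sketch relies on wctype() returning 0 when the class is not defined in the LC_CTYPE category of the current locale.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: count the wide characters in ws that belong to the
 * named character class. Returns -1 if the class is not defined in the
 * current locale.
 */
long count_in_class(const wchar_t *ws, const char *class_name)
{
    wctype_t desc = wctype(class_name);
    long count = 0;

    if (desc == (wctype_t)0)
        return -1;                      /* unknown class name */

    for (; *ws != L'\0'; ws++) {
        if (iswctype((wint_t)*ws, desc))
            count++;
    }
    return count;
}
```

Looking up the class descriptor once, outside the loop, avoids repeating the name lookup for every character.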
The following functions serve for mapping characters between character classes (character transliteration). If a mapping for a character is in the character class of the current locale, the functions return a transliterated character. These functions are locale sensitive.
Convert an uppercase character to lowercase
Convert a lowercase character to uppercase
Convert an uppercase wide character to lowercase
Convert a lowercase wide character to uppercase
The following functions provide a generic way to perform character transliteration:
Define character mapping
Wide-character mapping
For more information about related functions for Unicode strings, see Processing UTF-8 Strings.
Example 12 Transliteration of a Wide Character
The following code fragment shows how to use the towupper() function for transliterating a Unicode wide character to uppercase.
wchar_t wc;
int ret;

setlocale(LC_ALL, "cs_CZ.UTF-8");

/* "\xc5\x99" is UTF-8 for LATIN SMALL LETTER R WITH CARON */
ret = mbtowc(&wc, "\xc5\x99", 2);
if (ret == -1) {
    /* Invalid character sequence. */
    :
}

/* %lc prints a wint_t argument as a wide character. */
wprintf(L"'%lc' is uppercase of '%lc'.\n", towupper((wint_t)wc), (wint_t)wc);
The example will produce the following output:
'Ř' is uppercase of 'ř'.
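The generic wctrans() and towctrans() functions can express the same operation with the mapping chosen by name. The following sketch applies a named mapping to every character of a wide-character string in place. The helper name wcs_apply_mapping() is hypothetical. The mapping names "toupper" and "tolower" are the ones every locale is expected to provide.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: apply the named character mapping to each wide
 * character of ws in place. Returns 0 on success, or -1 if the mapping
 * is not defined in the current locale.
 */
int wcs_apply_mapping(wchar_t *ws, const char *mapping_name)
{
    wctrans_t desc = wctrans(mapping_name);

    if (desc == (wctrans_t)0)
        return -1;                      /* unknown mapping name */

    for (; *ws != L'\0'; ws++)
        *ws = (wchar_t)towctrans((wint_t)*ws, desc);
    return 0;
}
```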
The following functions are used for string comparison based on the collation data of the current locale:
String comparison using collating information
String transformation
Wide-character string comparison using collating information
Wide-character string transformation
For better performance when sorting large lists of strings, use the strxfrm() and strcmp() functions instead of the strcoll() function, and the wcsxfrm() and wcscmp() functions instead of the wcscoll() function.
When using the strxfrm() and wcsxfrm() functions, note that the transformed string is not in a human-readable form. The transformed strings are intended only as input to the strcmp() and wcscmp() function calls, respectively.
For more information, see the strcmp(3C) and wcscmp(3C) man pages.
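The performance advice above can be sketched as follows: each string is transformed once with strxfrm(), and the sort then compares the transformed keys with strcmp(). The names struct keyed, cmp_keyed(), and sort_collated() are hypothetical. The sketch assumes that the locale has already been set and that the array holds ordinary null-terminated strings.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical pairing of a transformed sort key with its original string. */
struct keyed {
    char *key;
    const char *orig;
};

static int cmp_keyed(const void *a, const void *b)
{
    const struct keyed *ka = a;
    const struct keyed *kb = b;

    /* Transformed keys compare correctly with plain strcmp(). */
    return strcmp(ka->key, kb->key);
}

/* Sort strs[0..n-1] in place by locale collation. Returns 0, or -1 on failure. */
int sort_collated(const char **strs, size_t n)
{
    struct keyed *tab = calloc(n ? n : 1, sizeof *tab);
    size_t i;
    int rc = -1;

    if (tab == NULL)
        return -1;

    for (i = 0; i < n; i++) {
        /* With a size of 0, strxfrm() only reports the key length. */
        size_t len = strxfrm(NULL, strs[i], 0);

        tab[i].key = malloc(len + 1);
        if (tab[i].key == NULL)
            goto out;
        strxfrm(tab[i].key, strs[i], len + 1);
        tab[i].orig = strs[i];
    }

    qsort(tab, n, sizeof *tab, cmp_keyed);
    for (i = 0; i < n; i++)
        strs[i] = tab[i].orig;
    rc = 0;
out:
    for (i = 0; i < n; i++)
        free(tab[i].key);               /* free(NULL) is harmless */
    free(tab);
    return rc;
}
```

The trade-off is memory for the keys in exchange for transforming each string once instead of on every comparison.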
The following functions are used for conversion between the codeset of the current locale (multibyte) and the process code (wide-character representation).
These functions are locale sensitive and depend on the LC_CTYPE category of the current locale. They return the same error on incomplete characters and illegal characters. For more information about illegal characters and incomplete characters, see Converting Codesets.
Get the number of bytes in a character
Convert a character to a wide-character code
Convert a character string to a wide-character string
Convert a wide-character code to a character
Convert a wide-character string to a character string
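As an illustration of the non-restartable functions, the following sketch counts the characters (not bytes) of a multibyte string by stepping through it with mblen(). The helper name mb_char_count() is hypothetical. Because these functions report the same error for illegal and incomplete characters, the sketch treats both as a failure.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return the number of characters in the multibyte
 * string s, interpreted in the codeset of the current locale, or -1 if an
 * illegal or incomplete character is found.
 */
long mb_char_count(const char *s)
{
    size_t left = strlen(s);
    long count = 0;

    (void)mblen(NULL, 0);               /* reset any internal shift state */

    while (left > 0) {
        int n = mblen(s, left);

        if (n <= 0)
            return -1;                  /* illegal or incomplete character */
        s += n;
        left -= (size_t)n;
        count++;
    }
    return count;
}
```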
The following functions are restartable, and can be used to handle incomplete character cases. These cases occur when an incomplete character reported from the previous call along with the additional bytes of the current call is a valid character. In order to store the state information required for this kind of processing, the functions either use a user-provided or an internal state structure of type mbstate_t. The mbsinit() function is used to detect whether an mbstate_t structure is in an initial state.
Determine the conversion object status
Get the number of bytes in a character (restartable)
Convert a character to a wide-character code (restartable)
Convert a character string to a wide-character string (restartable)
Convert a wide-character code to a character (restartable)
Convert a wide-character string to a character string (restartable)
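The following sketch shows how a caller-provided mbstate_t can carry an incomplete character from one buffer to the next, for example when data arrives in chunks that split a multibyte character. The helper name count_chunk() is hypothetical. The caller is assumed to zero-initialize the mbstate_t object before the first chunk.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Hypothetical helper: count the characters completed within one chunk of
 * multibyte input. A trailing incomplete character is remembered in *st
 * and is completed by the bytes of the next chunk.
 * Returns (size_t)-1 on an illegal sequence.
 */
size_t count_chunk(const char *buf, size_t len, mbstate_t *st)
{
    size_t count = 0;

    while (len > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, buf, len, st);

        if (n == (size_t)-1)
            return (size_t)-1;          /* illegal sequence */
        if (n == (size_t)-2)
            break;                      /* incomplete: all bytes kept in *st */
        if (n == 0)
            n = 1;                      /* a null character was decoded */
        buf += n;
        len -= n;
        count++;
    }
    return count;
}
```

After the last chunk, mbsinit(st) returning zero indicates that the input ended in the middle of a character.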
The following functions are used for conversion between the codeset of the current locale and the process code. They determine whether the integer-coded character can be represented in a single byte. If not, they return EOF and WEOF respectively.
Convert a wide character to a single-byte character, if possible
Convert a single-byte character to a wide character, if possible
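A small sketch of how the EOF return value can be used: the following helper narrows a wide-character string to bytes only if every character has a single-byte representation in the current locale. The helper name wcs_to_single_bytes() is hypothetical.

```c
#include <stdio.h>
#include <wchar.h>

/*
 * Hypothetical helper: copy ws into out as single bytes. Returns 0 on
 * success, or -1 if a character has no single-byte representation or the
 * output buffer is too small.
 */
int wcs_to_single_bytes(const wchar_t *ws, char *out, size_t outsz)
{
    size_t i = 0;

    if (outsz == 0)
        return -1;

    for (; *ws != L'\0'; ws++) {
        int c = wctob((wint_t)*ws);

        /* Keep one slot free for the terminating null byte. */
        if (c == EOF || i + 1 >= outsz)
            return -1;
        out[i++] = (char)c;
    }
    out[i] = '\0';
    return 0;
}
```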
The following functions are used to handle wide-character strings:
Get length of a fixed-sized wide-character string
Find the first occurrence of a wide character in a wide-character string
Find the last occurrence of a wide character in a wide-character string
Scan a wide-character string for a wide-character code
Concatenate two wide-character strings
Compare two wide-character strings
Copy a wide-character string
Copy part of a wide-character string
Copy a wide-character string, returning a pointer to its end
Get the length of a wide-character substring
Get the length of a complementary wide-character substring
Split a wide-character string into tokens
Find a wide-character substring
Get the number of column positions of a wide-character or wide-character string
Case-insensitive wide-character string comparison
Duplicate a wide-character string
The wcswcs() function was marked legacy and may be removed from the ISO/IEC 9899 standard in the future. Use the wcsstr() function instead.
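As a brief sketch of the recommended wcsstr() function, the following helper counts the non-overlapping occurrences of one wide-character string in another. The helper name count_occurrences() is hypothetical.

```c
#include <wchar.h>

/*
 * Hypothetical helper: count non-overlapping occurrences of needle in hay.
 * An empty needle is treated as having no occurrences.
 */
size_t count_occurrences(const wchar_t *hay, const wchar_t *needle)
{
    size_t nlen = wcslen(needle);
    size_t count = 0;
    const wchar_t *p = hay;

    if (nlen == 0)
        return 0;

    while ((p = wcsstr(p, needle)) != NULL) {
        count++;
        p += nlen;                      /* continue after this match */
    }
    return count;
}
```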
The functions for converting wide characters to numbers are as follows:
Convert a wide-character string to a long integer
Convert a wide-character string to an unsigned long integer
Convert a wide-character string to a floating-point number
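The following sketch shows one way to call wcstol() with error checking. The helper name parse_long() is hypothetical, and the choice to reject trailing characters is an assumption made for this example. The helper rejects empty input, trailing non-numeric characters, and values outside the range of long.

```c
#include <errno.h>
#include <wchar.h>

/*
 * Hypothetical helper: parse a decimal long from ws. Returns 0 and stores
 * the value in *out on success, or -1 if ws is not entirely a valid number
 * or the value is out of range.
 */
int parse_long(const wchar_t *ws, long *out)
{
    wchar_t *end;
    long v;

    errno = 0;
    v = wcstol(ws, &end, 10);

    if (end == ws)                      /* no digits were converted */
        return -1;
    if (*end != L'\0')                  /* trailing characters remain */
        return -1;
    if (errno == ERANGE)                /* value does not fit in a long */
        return -1;

    *out = v;
    return 0;
}
```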
The following man pages describe functions that perform in-memory operations on wide characters. They are wide-character equivalents of functions like memset(), memcpy(), and so on. These functions are not affected by the locale and all wchar_t values are treated identically.
Set wide characters in memory
Copy wide characters in memory
Copy wide characters in memory with overlapping areas
Compare wide characters in memory
Find a wide character in memory
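As a short sketch of these functions, the following helper opens a gap inside a wide-character buffer with wmemmove(), which is safe for overlapping areas, and fills the gap with wmemset(). The helper name winsert_gap() is hypothetical. The caller is assumed to guarantee that pos is not greater than len and that the buffer has room for len + gap wide characters.

```c
#include <wchar.h>

/*
 * Hypothetical helper: shift the wide characters buf[pos..len-1] right by
 * gap positions and fill the opened gap with the fill character.
 * len counts every element to preserve, including a terminating null wide
 * character if there is one.
 */
void winsert_gap(wchar_t *buf, size_t len, size_t pos, size_t gap,
    wchar_t fill)
{
    /* Source and destination overlap, so wmemcpy() would not be safe here. */
    wmemmove(buf + pos + gap, buf + pos, len - pos);
    wmemset(buf + pos, fill, gap);
}
```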
The following functions are used for wide-character input and output. These functions perform implicit conversion between file code (multibyte data) and internal process code (wide-character data).
Get a wide-character code from a stream
Get a wide character from a standard input stream
Get a wide-character string from a stream
Get a wide-character string from a standard input stream
Put a wide-character code on a stream
Put a wide-character code on the standard output stream
Put a wide-character string on a stream
Put a wide-character string on the standard output stream
Set the stream orientation to byte or wide-character
Push wide-character code back into the input stream
The following functions are used for formatting wide-character input and output:
Print formatted wide-character output
Wide-character formatted output of a stdarg argument list
Convert formatted wide-character input
Convert formatted wide-character input using a stdarg argument list
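As a minimal sketch of formatted wide-character output, the following helper formats a label and a count into a caller-supplied buffer with swprintf(), the buffer-oriented member of the wide-character printf family. The helper name format_count() is hypothetical. Note the %ls conversion, which takes a wide-character string argument.

```c
#include <wchar.h>

/*
 * Hypothetical helper: write "<label>: <count>" into buf, which can hold
 * bufsz wide characters. Returns the number of wide characters written,
 * not counting the terminating null, or a negative value if the result
 * does not fit.
 */
int format_count(wchar_t *buf, size_t bufsz, const wchar_t *label, int count)
{
    return swprintf(buf, bufsz, L"%ls: %d", label, count);
}
```

Unlike snprintf(), swprintf() reports a result that is too long as an error instead of returning the length that would have been needed.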
The functions marked with (*) were added to Oracle Solaris before the UNIX 98 standard that introduced the Multibyte Support Extension (MSE). They require inclusion of the widec.h header instead of the default wchar.h.