Character Set

The Open Group Base Specifications Issue 7
IEEE Std 1003.1, 2013 Edition
Copyright © 2001-2013 The IEEE and The Open Group

A.6 Character Set

A.6.1 Portable Character Set

The portable character set is listed in full so there is no dependency on the ISO/IEC 646:1991 standard (or historically ASCII) encoded character set, although the set is identical to the characters defined in the International Reference version of the ISO/IEC 646:1991 standard.

POSIX.1-2008 poses no requirement that multiple character sets or codesets be supported, leaving this as a marketing differentiation for implementors. Although multiple charmap files are supported, it is the responsibility of the implementation to provide the file(s); if only one is provided, only that one will be accessible using the localedef -f option.

The statement about invariance in codesets for the portable character set is worded to avoid precluding implementations where multiple incompatible codesets are available (for instance, ASCII and EBCDIC). The standard utilities cannot be expected to produce predictable results if they access portable characters that vary on the same implementation.
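The ASCII/EBCDIC case can be made concrete with a short sketch. It assumes Python's bundled codecs as stand-ins for implementation codesets (`cp037` for one EBCDIC variant, `iso8859-1` for an ASCII-compatible codeset); the helper names are illustrative and not part of the standard.

```python
# Portable character set as used here: eight control characters,
# <space>, and the 94 ASCII graphic characters.
CONTROLS = "\0\a\b\t\n\v\f\r"
PORTABLE = CONTROLS + "".join(chr(c) for c in range(0x20, 0x7F))


def portable_table(codec):
    """Map each portable character to its encoded bytes in `codec`."""
    return {ch: ch.encode(codec) for ch in PORTABLE}


def differing(codec_a, codec_b):
    """List the portable characters whose encodings differ between two codecs."""
    a, b = portable_table(codec_a), portable_table(codec_b)
    return [ch for ch in PORTABLE if a[ch] != b[ch]]


# ISO 8859-1 keeps the portable characters at their ASCII values.
same = differing("ascii", "iso8859-1")
# EBCDIC (cp037) moves most of them, including <slash> and <period>.
clash = differing("ascii", "cp037")

print("differences ascii vs iso8859-1:", len(same))
print("differences ascii vs cp037:", len(clash))
print("slash in cp037:", "/".encode("cp037").hex())
```

A utility that assumed the ASCII value for <slash> would therefore misread pathnames stored in the EBCDIC codeset, which is the kind of unpredictable result the paragraph above describes.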

Not all character sets need include the portable character set, but each locale must include it. For example, a Japanese-based locale might be supported by a mixture of character sets: JIS X 0201 Roman (a Japanese version of the ISO/IEC 646:1991 standard), JIS X 0208, and JIS X 0201 Katakana. Not all of these character sets include the portable characters, but at least one does (JIS X 0201 Roman).

A.6.2 Character Encoding

Encoding mechanisms based on single shifts, such as the EUC encoding used in some Asian and other countries, can be supported via the current charmap mechanism. With single-shift encoding, each character from a shifted character set is preceded by a shift code (SS2 or SS3). A complete EUC code, consisting of the portable character set (G0) and up to three additional character sets (G1, G2, G3), can be described using the current charmap mechanism; the encoding for each character in additional character sets G2 and G3 must then include its single-shift code. Other mechanisms to support locales based on encoding mechanisms such as locking shift are not addressed by this volume of POSIX.1-2008.
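As an illustration of the single-shift layout, the following sketch classifies encoded characters by their first byte. It assumes Python's `euc_jp` codec as a representative EUC codeset; the function names are hypothetical, and the G1 branch simply covers every lead byte that is neither a portable (G0) byte nor a shift code.

```python
SS2, SS3 = 0x8E, 0x8F  # single-shift codes used by EUC


def euc_code_set(encoded):
    """Classify one encoded EUC character by its first byte."""
    lead = encoded[0]
    if lead < 0x80:
        return "G0"  # portable character set, one byte
    if lead == SS2:
        return "G2"  # shift code followed by the character
    if lead == SS3:
        return "G3"  # shift code followed by the character
    return "G1"      # no shift code


def describe(text, codec="euc_jp"):
    """Return (character, hex encoding, code set) for each character."""
    rows = []
    for ch in text:
        encoded = ch.encode(codec)
        rows.append((ch, encoded.hex(), euc_code_set(encoded)))
    return rows


# LATIN CAPITAL LETTER A, HIRAGANA LETTER A, HALFWIDTH KATAKANA LETTER A
for row in describe("A\u3042\uff71"):
    print(row)
```

The half-width katakana character comes out as two bytes, `8e b1`: the shift code is part of the character's encoding, which is why a charmap entry for a G2 or G3 character has to include it.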

The encodings for <slash> and <period> are required to be the same across all locales, in part because pathname resolution requires recognition of these bytes. It is a fortunate accident that none of the common shift-based encodings used either <slash> or <period> as a valid second byte in a multi-byte character.
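The observation about second bytes can be checked empirically. The sketch below assumes two Japanese multi-byte codecs shipped with Python (`euc_jp` and `shift_jis`) as samples and scans only the Basic Multilingual Plane; it is a spot check under those assumptions, not a proof about every encoding.

```python
def trailing_bytes(codec):
    """Collect every byte value seen after the first byte of a
    multi-byte character in `codec` (code points below U+10000 only)."""
    seen = set()
    for cp in range(0x10000):
        try:
            encoded = chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue  # code point not representable in this codec
        if len(encoded) > 1:
            seen.update(encoded[1:])
    return seen


SLASH, PERIOD, BACKSLASH = 0x2F, 0x2E, 0x5C

for codec in ("euc_jp", "shift_jis"):
    tb = trailing_bytes(codec)
    print(codec,
          "slash:", SLASH in tb,
          "period:", PERIOD in tb,
          "backslash:", BACKSLASH in tb)
```

Neither codec uses the <slash> or <period> byte in a trailing position, so a byte-wise scan for pathname separators stays safe. The <backslash> byte, by contrast, does occur as a second byte in `shift_jis`, which shows that the property is not automatic for every portable character.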

A.6.3 C Language Wide-Character Codes

The standard does not specify how wide characters are encoded or provide a method for defining wide characters in a charmap. It specifies ways of translating between wide characters and multi-byte characters. The standard does not prevent an extension from providing a method to define wide characters.

IEEE Std 1003.1-2001/Cor 2-2004, item XBD/TC2/D6/13 is applied, adding a statement that the standard has no means of defining a wide-character codeset.

A.6.4 Character Set Description File

IEEE PASC Interpretation 1003.2 #196 is applied, removing three lines of text dealing with ranges of symbolic names using position constant values which had been erroneously included in the final IEEE P1003.2b draft standard.

IEEE Std 1003.1-2001/Cor 2-2004, item XBD/TC2/D6/14 is applied, correcting the example and adding a statement that the standard provides no means of defining a wide-character codeset.

IEEE Std 1003.1-2001/Cor 2-2004, item XBD/TC2/D6/15 is applied, allowing the value zero for the width value of WIDTH and WIDTH_DEFAULT. This is required to cover some existing locales.

State-Dependent Character Encodings

A requirement was considered that would force utilities to eliminate any redundant locking shifts, but this was left as a quality of implementation issue.

This change satisfies the following requirement from the ISO POSIX-2:1993 standard, Annex H.1:

The support of state-dependent (shift encoding) character sets should be addressed fully. See descriptions of these in XBD Character Encoding. If such character encodings are supported, it is expected that this will impact XBD Character Encoding, Locale, Regular Expressions, and the comm, cut, diff, grep, head, join, paste, and tail utilities.

The character set description file provides:

The capability to describe character set attributes (such as collation order or character classes) independent of character set encoding, and using only the characters in the portable character set. This makes it possible to create generic localedef source files for all codesets that share the portable character set (such as the ISO 8859 family or IBM Extended ASCII).
Standardized symbolic names for all characters in the portable character set, making it possible to refer to any such character regardless of encoding.

Implementations are free to choose their own symbolic names, as long as the names identified by the Base Definitions volume of POSIX.1-2008 are also defined; this provides support for already existing "character names".

The names selected for the members of the portable character set follow the ISO/IEC 8859-1:1998 standard and the ISO/IEC 10646-1:2000 standard. However, several commonly used UNIX system names occur as synonyms in the list:

The historical UNIX system names are used for control characters.
The word "slash" is given in addition to "solidus".
The word "backslash" is given in addition to "reverse-solidus".
The word "hyphen" is given in addition to "hyphen-minus".
The word "period" is given in addition to "full-stop".
For digits, the word "digit" is eliminated.
For letters, the words "Latin Capital Letter" and "Latin Small Letter" are eliminated.
The words "left brace" and "right brace" are given in addition to "left-curly-bracket" and "right-curly-bracket".
The names of the digits are preferred over the numbers to avoid possible confusion between '0' and 'O', and between '1' and 'l' (one and the letter ell).

The names for the control characters in XBD Character Set were taken from the ISO/IEC 4873:1991 standard.

The charmap file was introduced to resolve problems with the portability of, especially, localedef sources. POSIX.1-2008 assumes that the portable character set is constant across all locales, but does not prohibit implementations from supporting two incompatible codings, such as both ASCII and EBCDIC. Such dual-support implementations should have all charmaps and localedef sources encoded using one portable character set, in effect cross-compiling for the other environment. Naturally, charmaps (and localedef sources) are portable without transformation only between systems using the same encodings for the portable character set. They can, however, be transformed between two sets, since they use only a subset of the actual characters (the portable character set).

Moreover, the particular coded character set used for an application or an implementation does not necessarily imply different characteristics or collation; on the contrary, these attributes should in many cases be identical, regardless of codeset. The charmap provides the capability to define a common locale definition for multiple codesets (the same localedef source can be used for codesets with different extended characters; the ability in the charmap to define empty names allows for characters missing in certain codesets).
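The role of the charmap as the only codeset-specific layer can be sketched by generating charmap-style lines for the same symbolic names under two codesets. This is a minimal illustration: the name table is a small hypothetical subset, Python's `ascii` and `cp037` codecs stand in for the two codings, and the output is a fragment rather than a complete charmap file.

```python
# Hypothetical subset of symbolic names; a real charmap covers the
# whole portable character set.
SYMBOLIC = {"<A>": "A", "<zero>": "0", "<slash>": "/", "<period>": "."}


def charmap_lines(codec):
    """Render `<name> \\xHH` lines for the chosen codeset."""
    lines = []
    for name, ch in SYMBOLIC.items():
        # Multi-byte encodings would appear as concatenated constants.
        encoding = "".join("\\x%02x" % b for b in ch.encode(codec))
        lines.append("%-10s %s" % (name, encoding))
    return lines


for codec in ("ascii", "cp037"):
    print("# codeset:", codec)
    print("\n".join(charmap_lines(codec)))
```

A localedef source that refers only to `<slash>` or `<A>` is unaffected by which of the two fragments is in use; only the charmap changes.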

The <escape_char> declaration was added at the request of the international community to ease the creation of portable charmap files on terminals not implementing the default <backslash>-escape. The <comment_char> declaration was added at the request of the international community to eliminate the potential confusion between the <number-sign> and the pound sign.

The octal number notation with no leading zero required was selected to match that of awk and tr and is consistent with that used by localedef. To avoid confusion between an octal constant and the back-references used in localedef source, the octal, hexadecimal, and decimal constants must contain at least two digits. As single-digit constants are relatively rare, this should not impose any significant hardship. Provision is made for more digits to account for systems in which the byte size is larger than 8 bits. For example, a Unicode (ISO/IEC 10646-1:2000 standard) system that has defined 16-bit bytes may require six octal, four hexadecimal, and five decimal digits.
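The digit counts quoted for 16-bit bytes follow from the largest representable value, and the two-digit minimum is easy to enforce mechanically. The sketch below assumes the default <backslash> escape character and the `\d`, `\x`, and bare-octal constant forms; the function names and the regular expression are illustrative, not taken from any implementation.

```python
import re


def digits_needed(bits):
    """Digits required to write the largest `bits`-wide byte value
    in octal, hexadecimal, and decimal, in that order."""
    largest = (1 << bits) - 1
    return len("%o" % largest), len("%x" % largest), len("%d" % largest)


# Decimal (d), hexadecimal (x), or octal body, each with two or more digits.
_CONSTANT = re.compile(r"(d[0-9]{2,}|x[0-9A-Fa-f]{2,}|[0-7]{2,})\Z")


def parse_constant(text, escape_char="\\"):
    """Return the numeric value of one charmap constant, or raise ValueError."""
    if not text.startswith(escape_char):
        raise ValueError("missing escape character: %r" % text)
    match = _CONSTANT.match(text[len(escape_char):])
    if match is None:
        raise ValueError("not a constant of two or more digits: %r" % text)
    body = match.group(1)
    if body[0] == "d":
        return int(body[1:], 10)
    if body[0] == "x":
        return int(body[1:], 16)
    return int(body, 8)


print(digits_needed(8))   # 8-bit bytes
print(digits_needed(16))  # 16-bit bytes
print(parse_constant("\\d65"), parse_constant("\\x41"), parse_constant("\\101"))
```

With this rule a single-digit form such as `\7` is rejected as a constant, which leaves it free to be read as a back-reference in localedef source.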

The decimal notation is supported because some newer international standards define character values in decimal, rather than in the old column/row notation.

The charmap identifies the coded character sets supported by an implementation. At least one charmap must be provided, but no implementation is required to provide more than one. Likewise, implementations can allow users to generate new charmaps (for instance, for a new version of the ISO 8859 family of coded character sets), but do not have to do so. If users are allowed to create new charmaps, the system documentation describes the rules that apply (for instance, "only coded character sets that are supersets of the ISO/IEC 646:1991 standard IRV, no multi-byte characters").

This addition of the WIDTH specification satisfies the following requirement from the ISO POSIX-2:1993 standard, Annex H.1:

(9)

The definition of column position relies on the implementation's knowledge of the integral width of the characters. The charmap or LC_CTYPE locale definitions should be enhanced to allow application specification of these widths.

The character "width" information was first considered for inclusion under LC_CTYPE but was moved because it is more closely associated with the information in the charmap than information in the locale source (cultural conventions information). Concerns were raised that formalizing this type of information is moving the locale source definition from the codeset-independent entity that it was designed to be to a repository of codeset-specific information. A similar issue occurred with the <code_set_name>, <mb_cur_max>, and <mb_cur_min> information, which was resolved to reside in the charmap definition.

The width definition was added to the IEEE P1003.2b draft standard with the intent that the wcswidth() and/or wcwidth() functions (currently specified in the System Interfaces volume of POSIX.1-2008) be the mechanism to retrieve the character width information.
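To show the kind of information a WIDTH entry carries, including the zero width permitted for some characters, the following sketch approximates column widths from Unicode character properties. It is only an approximation under that assumption: on a conforming system the values come from the charmap through wcwidth() and wcswidth(), and the function names here are hypothetical.

```python
import unicodedata


def approx_width(ch):
    """Rough column width of one character: 0 for combining marks,
    2 for East Asian wide or fullwidth characters, otherwise 1."""
    if unicodedata.combining(ch):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def approx_string_width(text):
    """Sum of the per-character approximations."""
    return sum(approx_width(ch) for ch in text)


# 'e' + COMBINING ACUTE ACCENT + HIRAGANA LETTER A
sample = "e\u0301\u3042"
print([approx_width(ch) for ch in sample], approx_string_width(sample))
```

The combining accent contributes no columns of its own, which is the case the zero value for WIDTH and WIDTH_DEFAULT is meant to cover.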

UNIX ® is a registered Trademark of The Open Group.
POSIX ® is a registered Trademark of The IEEE.