The Open Group Base Specifications Issue 6
IEEE Std 1003.1, 2004 Edition
Copyright © 2001-2004 The IEEE and The Open Group

A.7 Locale

A.7.1 General

The description of locales is based on work performed in the UniForum Technical Committee, Subcommittee on Internationalization. Wherever appropriate, keywords are taken from the ISO C standard or the X/Open Portability Guide.

The value used to specify a locale with environment variables is the name specified as the name operand to the localedef utility when the locale was created. This provides a verifiable method to create and invoke a locale.

The "object" definitions need not be portable, as long as the "source" definitions are. Strictly speaking, source definitions are portable only between implementations using the same character set(s). Source definitions that use symbolic names only can, however, easily be ported between systems using different codesets, as long as the characters in the portable character set (see the Base Definitions volume of IEEE Std 1003.1-2001, Section 6.1, Portable Character Set) have common values between the codesets; this is frequently the case in historical implementations. Of course, this requires that the symbolic names used for characters outside the portable character set be identical between character sets. The definition of symbolic names for characters is outside the scope of IEEE Std 1003.1-2001, but is certainly within the scope of other standards organizations.

Applications can select the desired locale by invoking the setlocale() function (or equivalent) with the appropriate value. If the function is invoked with an empty string, the value of the corresponding environment variable is used. If the environment variable is not set or is set to the empty string, the implementation sets the appropriate environment as defined in the Base Definitions volume of IEEE Std 1003.1-2001, Chapter 8, Environment Variables.
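
As an illustration of the selection rule just described, the following minimal sketch asks setlocale() to consult the environment and falls back to the POSIX locale if that fails. The helper name select_locale_from_environment is hypothetical; only setlocale() and LC_ALL come from the standard.

```c
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: select the locale named by the environment.
 * Passing "" makes setlocale() consult the environment variables.
 * If that request cannot be honored, setlocale() returns NULL and
 * leaves the locale unchanged, so fall back to "POSIX" explicitly.
 */
const char *select_locale_from_environment(void)
{
    const char *name = setlocale(LC_ALL, "");

    if (name == NULL)
        name = setlocale(LC_ALL, "POSIX");
    return name;
}
```

The returned string is owned by the implementation and may be overwritten by a later call to setlocale(), so a caller that needs to keep it should copy it.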

A.7.2 POSIX Locale

The POSIX locale is equal to the C locale. To avoid being classified as a C-language function, the name has been changed to the POSIX locale; the environment variable value can be either "POSIX" or, for historical reasons, "C".
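
The equivalence of the two names can be observed from an application. The sketch below (the helper name has_period_radix is an assumption made for illustration) selects a locale by name and reports whether its radix character is a period, which holds for both "POSIX" and "C".

```c
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the named locale can be selected
 * and uses "." as its radix character, 0 otherwise.
 */
int has_period_radix(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL)
        return 0;
    return strcmp(localeconv()->decimal_point, ".") == 0;
}
```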

The POSIX definitions mirror the historical UNIX system behavior.

The use of symbolic names for characters in the tables does not imply that the POSIX locale must be described using symbolic character names, but merely that it may be advantageous to do so.

A.7.3 Locale Definition

The decision to separate the file format from the localedef utility description was only partially editorial. Implementations may provide other interfaces than localedef. Requirements on "the utility", mostly concerning error messages, are described in this way because they are meant to affect the other interfaces implementations may provide as well as localedef.

The text about POSIX2_LOCALEDEF does not mean that internationalization is optional; only that the functionality of the localedef utility is. REs, for instance, must still be able to recognize character class expressions such as "[[:alpha:]]". A possible analogy is with an applications development environment; while all conforming implementations must be capable of executing applications, not all need to have the development environment installed. The assumption is that the capability to modify the behavior of utilities (and applications) via locale settings must be supported. If the localedef utility is not present, then the only choice is to select an existing (presumably implementation-documented) locale. An implementation could, for example, choose to support only the POSIX locale, which would in effect limit the number of changes from historical implementations quite drastically. The localedef utility is still required, but would always terminate with an exit code indicating that no locale could be created. Supported locales must be documented using the syntax defined in this chapter. (This ensures that users can accurately determine what capabilities are provided. If the implementation decides to provide additional capabilities to the ones in this chapter, that is already provided for.)
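
The requirement that character class expressions remain usable can be exercised with the regcomp() and regexec() interfaces. The following is a sketch only; the helper name contains_alpha is hypothetical, and the result depends on the LC_CTYPE category of whatever locale is current when it is called.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if text contains a character that
 * matches [[:alpha:]] in the current locale, 0 if it does not, and
 * -1 if the expression could not be compiled.
 */
int contains_alpha(const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, "[[:alpha:]]", REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```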

If the option is present (that is, locales can be created), then the localedef utility must be capable of creating locales based on the syntax and rules defined in this chapter. This does not mean that the implementation cannot also provide alternate means for creating locales.

The octal, decimal, and hexadecimal notations are the same as those employed by the charmap facility (see the Base Definitions volume of IEEE Std 1003.1-2001, Section 6.4, Character Set Description File). To avoid confusion between an octal constant and a back-reference, the octal, hexadecimal, and decimal constants must contain at least two digits. As single-digit constants are relatively rare, this should not impose any significant hardship. Provision is made for more digits to account for systems in which the byte size is larger than 8 bits. For example, a Unicode (see the ISO/IEC 10646-1:2000 standard) system that has defined 16-bit bytes may require six octal, four hexadecimal, and five decimal digits. As with the charmap file, multi-byte characters are described in the locale definition file using "big-endian" notation for reasons of portability. There is no requirement that the internal representation in the computer memory be in this same order.
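
The two-digit minimum can be sketched as follows. The function parse_numeric_constant is a hypothetical helper, not part of any standard interface. It assumes the default backslash escape character and the prefixes x (hexadecimal) and d (decimal), with no prefix meaning octal. It does not enforce any upper limit on the number of digits, which a real implementation would tie to its byte size, and it assumes the letters a through f are contiguous in the execution character set.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: parse one constant such as \x20, \040, or
 * \d32 and return its value. Returns -1 if the text is not a valid
 * constant or has fewer than two digits.
 */
long parse_numeric_constant(const char *s)
{
    int base = 8;               /* no prefix letter means octal */
    long value = 0;
    size_t ndigits = 0;

    if (*s++ != '\\')
        return -1;
    if (*s == 'x') {
        base = 16;
        s++;
    } else if (*s == 'd') {
        base = 10;
        s++;
    }

    for (; *s != '\0'; s++, ndigits++) {
        int c = (unsigned char)*s;
        int digit;

        if (isdigit(c))
            digit = c - '0';
        else if (base == 16 && isxdigit(c))
            digit = tolower(c) - 'a' + 10;
        else
            return -1;
        if (digit >= base)      /* for example, 8 or 9 in an octal constant */
            return -1;
        value = value * base + digit;
    }
    /* A single digit is rejected so it cannot be mistaken for a back-reference. */
    return ndigits >= 2 ? value : -1;
}
```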

One of the guidelines used for the development of this volume of IEEE Std 1003.1-2001 is that characters outside the invariant part of the ISO/IEC 646:1991 standard should not be used in portable specifications. The backslash character is not in the invariant part; the number sign is, but with multiple representations: as a number sign, and as a pound sign. As far as general usage of these symbols is concerned, they are covered by the "grandfather clause", but for newly defined interfaces, the WG15 POSIX working group has requested that POSIX provide alternate representations. Consequently, while the default escape character remains the backslash and the default comment character is the number sign, implementations are required to recognize alternative representations, identified in the applicable source file via the <escape_char> and <comment_char> keywords.

LC_CTYPE

The LC_CTYPE category is primarily used to define the encoding-independent aspects of a character set, such as character classification. In addition, certain encoding-dependent characteristics are also defined for an application via the LC_CTYPE category. IEEE Std 1003.1-2001 does not mandate that the encoding used in the locale is the same as the one used by the application because an implementation may decide that it is advantageous to define locales in a system-wide encoding rather than having multiple, logically identical locales in different encodings, and to convert from the application encoding to the system-wide encoding on usage. Other implementations could require encoding-dependent locales.

In either case, the LC_CTYPE attributes that are directly dependent on the encoding, such as <mb_cur_max> and the display width of characters, are not user-specifiable in a locale source and are consequently not defined as keywords.
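
Although such attributes cannot be set in a locale source, an application can still query them. The sketch below reads MB_CUR_MAX after selecting a locale; the helper name max_bytes_per_char is hypothetical.

```c
#include <locale.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns the maximum number of bytes in a
 * character for the named locale's encoding, or 0 if the locale
 * cannot be selected. MB_CUR_MAX follows the LC_CTYPE category.
 */
size_t max_bytes_per_char(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return 0;
    return MB_CUR_MAX;
}
```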

Implementations may define additional keywords or extend the LC_CTYPE mechanism to allow application-defined keywords.

The text "The ellipsis specification shall only be valid within a single encoded character set" is present because it is possible to have a locale supported by multiple character encodings, as explained in the rationale for the Base Definitions volume of IEEE Std 1003.1-2001, Section 6.1, Portable Character Set. An example given there is of a possible Japanese-based locale supported by a mixture of the character sets JIS X 0201 Roman, JIS X 0208, and JIS X 0201 Katakana. Attempting to express a range of characters across these sets is not logical and the implementation is free to reject such attempts.

As the LC_CTYPE character classes are based on the ISO C standard character class definition, the category does not support multi-character elements. For instance, the German character <sharp-s> is traditionally classified as a lowercase letter. There is no corresponding uppercase letter; in proper capitalization of German text, the <sharp-s> will be replaced by "SS"; that is, by two characters. This kind of conversion is outside the scope of the toupper and tolower keywords.
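
The one-to-one nature of the mapping is visible in the toupper() interface, which takes one character and returns one character. The following sketch uppercases a string in place, so the length can never change. It assumes the C locale and, purely for illustration, an ISO 8859-1 style encoding in which <sharp-s> is the byte 0xDF; the helper name upcase_in_place is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: uppercase s in place, one character at a
 * time. A character with no single-character uppercase form (such
 * as <sharp-s>) is left as it is; it cannot expand to "SS".
 */
void upcase_in_place(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}
```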

Where IEEE Std 1003.1-2001 specifies that only certain characters can be specified, as for the keywords digit and xdigit, the specified characters must be from the portable character set, as shown. As an example, only the Arabic digits 0 through 9 are acceptable as digits.

The character classes digit, xdigit, lower, upper, and space have a set of automatically included characters. These only need to be specified if the character values (that is, encoding) differ from the implementation default values. It is not possible to define a locale without these automatically included characters unless some implementation extension is used to prevent their inclusion. Such a definition would not be a proper superset of the C locale, and thus, it might not be possible for the standard utilities to be implemented as programs conforming to the ISO C standard.

The definition of character class digit requires that only ten characters (the ones defining digits) can be specified; alternate digits (for example, Hindi or Kanji) cannot be specified here. However, the encoding may vary if an implementation supports more than one encoding.

The definition of character class xdigit requires that the characters included in character class digit are included here also and allows for different symbols for the hexadecimal digits 10 through 15.

The inclusion of the charclass keyword satisfies the following requirement from the ISO POSIX-2:1993 standard, Annex H.1:

The LC_CTYPE locale definition should be enhanced to allow user-specified additional character classes, similar in concept to the ISO C standard Multibyte Support Extension (MSE) iswctype() function.

This keyword was previously included in The Open Group specifications and is now mandated in the Shell and Utilities volume of IEEE Std 1003.1-2001.

The symbolic constant {CHARCLASS_NAME_MAX} was also adopted from The Open Group specifications. Applications portability is enhanced by the use of symbolic constants.
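
A class defined with charclass is reached from an application by name, through wctype() and iswctype(). The sketch below uses the standard class name "alpha" because a user-defined class name would depend on the locale in use; the helper name in_named_class is hypothetical.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: returns 1 if wc belongs to the named class,
 * 0 if it does not, and -1 if the current locale defines no class
 * with that name (wctype() returns zero in that case).
 */
int in_named_class(wint_t wc, const char *class_name)
{
    wctype_t desc = wctype(class_name);

    if (desc == (wctype_t)0)
        return -1;
    return iswctype(wc, desc) != 0;
}
```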

LC_COLLATE

The rules governing collation depend to some extent on the use. At least five different levels of increasingly complex collation rules can be distinguished:

  1. Byte/machine code order: This is the historical collation order in the UNIX system and many proprietary operating systems. Collation is here performed character by character, without any regard to context. The primary virtue is that it usually is quite fast and also completely deterministic; it works well when the native machine collation sequence matches the user expectations.

  2. Character order: On this level, collation is also performed character by character, without regard to context. The order between characters is, however, not determined by the code values, but by the user's expectations of the "correct" order between characters. In addition, such a (simple) collation order can specify that certain characters collate equally (for example, uppercase and lowercase letters).

  3. String ordering: On this level, entire strings are compared based on relatively straightforward rules. Several "passes" may be required to determine the order between two strings. Characters may be ignored in some passes, but not in others; the strings may be compared in different directions; and simple string substitutions may be performed before strings are compared. This level is best described as "dictionary" ordering; it is based on the spelling, not the pronunciation, or meaning, of the words.

  4. Text search ordering: This is a further refinement of the previous level, best described as "telephone book ordering"; some common homonyms (words spelled differently but with the same pronunciation) are collated together; numbers are collated as if they were spelled out, and so on.

  5. Semantic-level ordering: Words and strings are collated based on their meaning; entire words (such as "the") are eliminated; the ordering is not deterministic. This usually requires special software and is highly dependent on the intended use.

While the historical collation order formally is at level 1, for the English language it corresponds roughly to elements at level 2. The user expects to see the output from the ls utility sorted very much as it would be in a dictionary. While telephone book ordering would be an optimal goal for standard collation, this was ruled out as the order would be language-dependent. Furthermore, a requirement was that the order must be determined solely from the text string and the collation rules; no external information (for example, "pronunciation dictionaries") could be required.
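
The difference between level 1 and a simple level 2 order can be seen with two library comparisons. In the sketch below, strcmp() stands in for byte order and strcasecmp() stands in for a character order in which uppercase and lowercase letters collate equally; the two wrapper names are hypothetical. In a codeset where uppercase letters have lower code values than lowercase letters, "Zebra" sorts before "apple" in byte order but after it in the caseless order.

```c
#include <string.h>
#include <strings.h>

/* Level 1 (hypothetical wrapper): order strictly by code value. */
int byte_order(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* Rough level 2 (hypothetical wrapper): case differences are ignored. */
int caseless_order(const char *a, const char *b)
{
    return strcasecmp(a, b);
}
```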

As a result, the goal for the collation support is at level 3. This also matches the requirements for the Canadian collation order, as well as other, known collation requirements for alphabetic scripts. It specifically rules out collation based on pronunciation rules or based on semantic analysis of the text.

The syntax for the LC_COLLATE category source meets the requirements for level 3 and has been verified to produce the correct result with examples based on French, Canadian, and Danish collation order. Because it supports multi-character collating elements, it is also capable of supporting collation in codesets where a character is expressed using non-spacing characters followed by the base character (such as the ISO/IEC 6937:2001 standard).

The directives that can be specified in an operand to the order_start keyword are based on the requirements specified in several proposed standards and in customary use. The following is a rephrasing of rules defined for "lexical ordering in English and French" by the Canadian Standards Association (the text in square brackets is rephrased):

It is estimated that this part of IEEE Std 1003.1-2001 covers the requirements for all European languages, and no particular problems are anticipated with Slavic or Middle East character sets.

The Far East (particularly Japanese/Chinese) collations are often based on contextual information and pronunciation rules (the same ideogram can have different meanings and different pronunciations). Such collation, in general, falls outside the desired goal of IEEE Std 1003.1-2001. There are, however, several other collation rules (stroke/radical or "most common pronunciation") that can be supported with the mechanism described here.

The character order is defined by the order in which characters and elements are specified between the order_start and order_end keywords. Weights assigned to the characters and elements define the collation sequence; in the absence of weights, the character order is also the collation sequence.

The position keyword provides the capability to consider, in a compare, the relative position of characters not subject to IGNORE. As an example, consider the two strings "o-ring" and "or-ing". Assuming the hyphen is subject to IGNORE on the first pass, the two strings compare equal, and the position of the hyphen is immaterial. On the second pass, all characters except the hyphen are subject to IGNORE, and in the normal case the two strings would again compare equal. By taking position into account, the first collates before the second.
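
The "o-ring" example can be traced with a much simplified two-pass comparison. This is a sketch under stated assumptions, not how an implementation evaluates LC_COLLATE: the hyphen is the only special character, the first pass compares the remaining characters by code value, and the second pass decides only by which string has a hyphen at the lower position. All three function names are hypothetical.

```c
#include <stddef.h>

/* Pass 1: compare with every '-' skipped, as if subject to IGNORE. */
static int compare_ignoring_hyphens(const char *a, const char *b)
{
    for (;;) {
        while (*a == '-')
            a++;
        while (*b == '-')
            b++;
        if (*a != *b)
            return (unsigned char)*a < (unsigned char)*b ? -1 : 1;
        if (*a == '\0')
            return 0;
        a++;
        b++;
    }
}

/*
 * Pass 2 with "position": only '-' carries weight, and the string
 * whose hyphen occurs at the lower position collates first.
 */
static int compare_hyphen_positions(const char *a, const char *b)
{
    size_t i;

    for (i = 0; a[i] != '\0' && b[i] != '\0'; i++) {
        int ha = (a[i] == '-');
        int hb = (b[i] == '-');

        if (ha != hb)
            return ha ? -1 : 1;
    }
    return 0;
}

/* The second pass is consulted only if the first pass finds no difference. */
int two_pass_compare(const char *a, const char *b)
{
    int r = compare_ignoring_hyphens(a, b);

    return r != 0 ? r : compare_hyphen_positions(a, b);
}
```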

LC_MONETARY

The currency symbol does not appear in LC_MONETARY because it is not defined in the C locale of the ISO C standard.

The ISO C standard limits the size of decimal points and thousands delimiters to single-byte values. In locales based on multi-byte coded character sets, this cannot be enforced; IEEE Std 1003.1-2001 does not prohibit such characters, but makes the behavior unspecified (in the text "In contexts where other standards ...").

The grouping specification is based on, but not identical to, the ISO C standard. The -1 indicates that no further grouping is performed; it is the equivalent of {CHAR_MAX} in the ISO C standard.

The text "the value is not available in the locale" is taken from the ISO C standard and is used instead of the "unspecified" text in early proposals. There is no implication that omitting these keywords or assigning them values of "" or -1 produces unspecified results; such omissions or assignments eliminate the effects described for the keyword or produce zero-length strings, as appropriate.
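
From an application, "not available" shows up in the structure returned by localeconv(): string members are empty and char members hold CHAR_MAX. The following sketch checks both forms; the two helper names are hypothetical.

```c
#include <limits.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if a char-typed member of
 * struct lconv carries a value, 0 if it reports "not available".
 */
int lconv_value_available(char member)
{
    return member != CHAR_MAX;
}

/* Hypothetical helper: returns 1 if the current locale defines a currency symbol. */
int has_currency_symbol(void)
{
    const struct lconv *lc = localeconv();

    return lc->currency_symbol[0] != '\0';
}
```

In the C locale both helpers report that no value is available, because that locale defines no monetary formatting.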

The locale definition is an extension of the ISO C standard localeconv() specification. In particular, rules on how currency_symbol is treated are extended to also cover int_curr_symbol, and p_sep_by_space and n_sep_by_space have been augmented with the value 2, which places a <space> between the sign and the symbol. This has been updated to match the ISO/IEC 9899:1999 standard requirements and is an incompatible change from UNIX 98 and the ISO POSIX-2 standard and the ISO POSIX-1:1996 standard requirements. The following table shows the result of various combinations:

                                      p_sep_by_space = 2    p_sep_by_space = 1

p_cs_precedes = 1   p_sign_posn = 0                         ($ 1.25)
                    p_sign_posn = 1   + $1.25               +$ 1.25
                    p_sign_posn = 2   $1.25 +               $ 1.25+
                    p_sign_posn = 3   + $1.25               +$ 1.25
                    p_sign_posn = 4   $ +1.25               $+ 1.25

p_cs_precedes = 0   p_sign_posn = 0   (1.25 $)              (1.25 $)
                    p_sign_posn = 1   +1.25 $               +1.25 $
                    p_sign_posn = 2   1.25$ +               1.25 $+
                    p_sign_posn = 3   1.25+ $               1.25 +$
                    p_sign_posn = 4   1.25$ +               1.25 $+


The following is an example of the interpretation of the mon_grouping keyword. Assuming that the value to be formatted is 123456789 and the mon_thousands_sep is an apostrophe ('), then the following table shows the result. The third column shows the equivalent string in the ISO C standard that would be used by the localeconv() function to accommodate this grouping.

mon_grouping    Formatted Value    ISO C String

In these examples, the octal value of {CHAR_MAX} is 177.
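
The interpretation of mon_grouping can be sketched as a small routine. This is an illustration under stated assumptions, not the method any implementation is required to use. The function name apply_grouping is hypothetical; the group sizes are passed in the order they appear in the locale source; a trailing -1 stops grouping; otherwise the last size is repeated. A value of 0 is treated as "repeat the previous size", which is an assumption borrowed from the ISO C string form.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: insert sep into the digit string according
 * to groups[0..ngroups-1]. The first size applies to the group
 * nearest the decimal delimiter. Returns 0 on success, or -1 if the
 * result does not fit in out (capacity outsz) or in the scratch buffer.
 */
int apply_grouping(const char *digits, const int *groups, size_t ngroups,
                   char sep, char *out, size_t outsz)
{
    char rev[128];                  /* result, built right to left */
    size_t n = strlen(digits);      /* digits not yet consumed */
    size_t r = 0;                   /* characters stored in rev */
    size_t g = 0;                   /* index of the current group size */
    size_t i;
    int size = ngroups > 0 ? groups[0] : -1;
    int in_group = 0;               /* digits emitted in the current group */

    while (n > 0) {
        if (size > 0 && in_group == size) {
            /* Current group is full and more digits remain. */
            if (r >= sizeof rev)
                return -1;
            rev[r++] = sep;
            in_group = 0;
            if (g + 1 < ngroups) {
                int next = groups[++g];

                if (next != 0)      /* assumption: 0 repeats the previous size */
                    size = next;    /* -1 disables further grouping */
            }
        }
        if (r >= sizeof rev)
            return -1;
        rev[r++] = digits[--n];
        in_group++;
    }

    if (r + 1 > outsz)
        return -1;
    for (i = 0; i < r; i++)
        out[i] = rev[r - 1 - i];
    out[r] = '\0';
    return 0;
}
```

With the value 123456789 and an apostrophe as separator, this sketch yields 123456'789 for 3;-1, 123'456'789 for 3, 1234'56'789 for 3;2;-1, 12'34'56'789 for 3;2, and 123456789 for -1.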

IEEE Std 1003.1-2001/Cor 1-2002, item XBD/TC1/D6/6 adds a correction that permits the Euro currency symbol and addresses extensibility. The correction is stated using the term "should" intentionally, in order to make this a recommendation rather than a restriction on implementations. This allows for flexibility in implementations on how they handle future currency symbol additions.

IEEE Std 1003.1-2001/Cor 1-2002, item XBD/TC1/D6/5 is applied, adding the int_[np]_* values to the POSIX locale definition of LC_MONETARY.

IEEE Std 1003.1-2001/Cor 2-2004, item XBD/TC2/D6/16 is applied, updating the descriptions of p_sep_by_space, n_sep_by_space, int_p_sep_by_space, and int_n_sep_by_space to match the description of these keywords in the ISO C standard and the System Interfaces volume of IEEE Std 1003.1-2001, localeconv().

LC_NUMERIC

See the rationale for LC_MONETARY for a description of the behavior of grouping.

LC_TIME

Although certain of the conversion specifications in the POSIX locale (such as the name of the month) are shown with initial capital letters, this need not be the case in other locales. Programs using these conversion specifications may need to adjust the capitalization if the output is going to be used at the beginning of a sentence.

The LC_TIME descriptions of abday, day, mon, and abmon imply a Gregorian style calendar (7-day weeks, 12-month years, leap years, and so on). Formatting time strings for other types of calendars is outside the scope of IEEE Std 1003.1-2001.

While the ISO 8601:2000 standard numbers the weekdays starting with Monday, historical practice is to use Sunday as the first day. Rather than change the order and introduce potential confusion, the days must be specified beginning with Sunday; previous references to "first day" have been removed. Note also that the Shell and Utilities volume of IEEE Std 1003.1-2001 date utility supports numbering compliant with the ISO 8601:2000 standard.

As specified under date in the Shell and Utilities volume of IEEE Std 1003.1-2001 and strftime() in the System Interfaces volume of IEEE Std 1003.1-2001, the conversion specifications corresponding to the optional keywords consist of a modifier followed by a traditional conversion specification (for instance, %Ex). If the optional keywords are not supported by the implementation or are unspecified for the current locale, these modified conversion specifications are treated as the traditional conversion specifications. For example, assume the following keywords:

alt_digits   "0th";"1st";"2nd";"3rd";"4th";"5th";\

d_fmt "The %Od day of %B in %Y"

On July 4th 1776, the %x conversion specification would result in "The 4th day of July in 1776", while on July 14th 1789 it would result in "The 14 day of July in 1789". Note that the above example is for illustrative purposes only; the %O modifier is primarily intended to provide for Kanji or Hindi digits in date formats.
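
The fallback rule for modified conversion specifications can be checked with strftime(). In a locale that defines neither era information nor alternative digits, such as the C locale, %Ex is expected to format like %x and %Od like %d. The sketch below compares two formats for the same broken-down time; the helper name same_strftime_output is hypothetical.

```c
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: returns 1 if fmt_a and fmt_b produce the
 * same text for *tm in the current locale, 0 if they differ, and
 * -1 if either result is empty or does not fit in the buffer.
 */
int same_strftime_output(const char *fmt_a, const char *fmt_b,
                         const struct tm *tm)
{
    char a[128];
    char b[128];

    if (strftime(a, sizeof a, fmt_a, tm) == 0)
        return -1;
    if (strftime(b, sizeof b, fmt_b, tm) == 0)
        return -1;
    return strcmp(a, b) == 0;
}
```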

The following is an example for Japan that supports the current plus last three Emperors and reverts to Western style numbering for years prior to the Meiji era. The example also allows for the custom of using a special name for the first year of an era instead of using 1. (The examples substitute romaji where kanji should be used.)

era_d_fmt "%EY%mgatsu%dnichi (%a)"

era "+:2:1990/01/01:+*:Heisei:%EC%Eynen";\
    "+:1:1989/01/08:1989/12/31:Heisei:%ECgannen";\
    "+:2:1927/01/01:1989/01/07:Shouwa:%EC%Eynen";\
    "+:1:1926/12/25:1926/12/31:Shouwa:%ECgannen";\
    "+:2:1913/01/01:1926/12/24:Taishou:%EC%Eynen";\
    "+:1:1912/07/30:1912/12/31:Taishou:%ECgannen";\
    "+:2:1869/01/01:1912/07/29:Meiji:%EC%Eynen";\
    "+:1:1868/09/08:1868/12/31:Meiji:%ECgannen";\
    "-:1868:1868/09/07:-*::%Ey"

Assuming that the current date is September 21, 1991, a request to date or strftime() would yield the following results:

%Ec - Heisei3nen9gatsu21nichi (Sat) 14:39:26
%EC - Heisei
%Ex - Heisei3nen9gatsu21nichi (Sat)
%Ey - 3
%EY - Heisei3nen
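
The %Ey value follows from the direction, offset, and start date fields of the matching era segment. The following is a simplified sketch at year granularity only (a real implementation selects the segment by full date). It assumes the segment has already been chosen and that its end date is the open-ended +* or -* in the direction of the date being formatted; the function name era_year is hypothetical.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: compute the year within an era segment.
 * offset is the number assigned to start_year. With direction '+',
 * numbers grow with distance from the start date; with '-', they
 * shrink.
 */
int era_year(char direction, int offset, int start_year, int year)
{
    int distance = abs(year - start_year);

    return direction == '-' ? offset - distance : offset + distance;
}
```

For the segment "+:2:1990/01/01:+*:Heisei:%EC%Eynen" and the year 1991 this gives 2 + 1 = 3, matching the %Ey result shown above.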

Example era definitions for the Republic of China:

era    "+:2:1913/01/01:+*:ChungHwaMingGuo:%EC%EyNen";\

Example definitions for the Christian Era:

era    "+:1:0001/01/01:+*:AD:%EC %Ey";\
       "+:1:-0001/12/31:-*:BC:%Ey %EC"

LC_MESSAGES

The yesstr and nostr locale keywords and the YESSTR and NOSTR langinfo items were formerly used to match user affirmative and negative responses. In IEEE Std 1003.1-2001, the yesexpr, noexpr, YESEXPR, and NOEXPR extended regular expressions have replaced them. Applications should use the general locale-based messaging facilities to issue prompting messages which include sample desired responses.
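
An application can test a response against the locale's affirmative expression by compiling the YESEXPR item as an extended regular expression. The sketch below does this for the current locale; the helper name is_affirmative is hypothetical.

```c
#include <langinfo.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if response matches the current
 * locale's YESEXPR, 0 if it does not, and -1 if the expression
 * cannot be compiled.
 */
int is_affirmative(const char *response)
{
    regex_t re;
    int rc;

    if (regcomp(&re, nl_langinfo(YESEXPR), REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, response, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

A negative response can be tested the same way with NOEXPR; a response that matches neither expression should normally cause the prompt to be repeated.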

A.7.4 Locale Definition Grammar

There is no additional rationale provided for this section.

Locale Lexical Conventions

There is no additional rationale provided for this section.

Locale Grammar

There is no additional rationale provided for this section.

A.7.5 Locale Definition Example

The following is an example of a locale definition file that could be used as input to the localedef utility. It assumes that the utility is executed with the -f option, naming a charmap file with (at least) the following content:

<space>      \x20
<dollar>     \x24
<A>          \101
<a>          \141
<A-acute>    \346
<a-acute>    \365
<A-grave>    \300
<a-grave>    \366
<b>          \142
<C>          \103
<c>          \143
<c-cedilla>  \347
<d>          \x64
<H>          \110
<h>          \150
<eszet>      \xb7
<s>          \x73
<z>          \x7a

It should not be taken as complete or to represent any actual locale, but only to illustrate the syntax.

lower   <a>;<b>;<c>;<c-cedilla>;<d>;...;<z>
upper   A;B;C;Ç;...;Z
space   \x20;\x09;\x0a;\x0b;\x0c;\x0d
blank   \040;\011
toupper (<a>,<A>);(b,B);(c,C);(ç,Ç);(d,D);(z,Z)
# The following example of collation is based on
# Canadian standard Z243.4.1-1998, "Canadian Alphanumeric
# Ordering Standard for Character Sets of CSA Z243.4 Standard".
# (Other parts of this example locale definition file do not
# purport to relate to Canada, or to any other real culture.)
# The proposed standard defines a 4-weight collation, such that
# in the first pass, characters are compared without regard to
# case or accents; in the second pass, backwards-compare without
# regard to case; in the third pass, forwards-compare without
# regard to diacriticals. In the first three passes, non-alphabetic
# characters are ignored; in the fourth pass, only special
# characters are considered, such that "The string that has a
# special character in the lowest position comes first. If two
# strings have a special character in the same position, the
# collation value of the special character determines ordering".
# Only a subset of the character set is used here; mostly to
# illustrate the set-up.
collating-symbol <NULL>
collating-symbol <LOW_VALUE>
collating-symbol <LOWER-CASE>
collating-symbol <SUBSCRIPT-LOWER>
collating-symbol <SUPERSCRIPT-LOWER>
collating-symbol <UPPER-CASE>
collating-symbol <NO-ACCENT>
collating-symbol <PECULIAR>
collating-symbol <LIGATURE>
collating-symbol <ACUTE>
collating-symbol <GRAVE>
# Further collating-symbols follow.
# Properly, the standard does not include any multi-character
# collating elements; the one below is added for completeness.
collating-element <ch> from "<c><h>"
collating-element <CH> from "<C><H>"
collating-element <Ch> from "<C><h>"
order_start forward;backward;forward;forward,position
# Collating symbols are specified first in the sequence to allocate
# basic collation values to them, lower than that of any character.
# Further collating symbols are given a basic collating value here.
# Here follow special characters.
<space>        IGNORE;IGNORE;IGNORE;<space>
# Other special characters follow here.
# Here follow the regular characters.
<a>        <a>;<NO-ACCENT>;<LOWER-CASE>;IGNORE
<a-acute>  <a>;<ACUTE>;<LOWER-CASE>;IGNORE
<A-acute>  <a>;<ACUTE>;<UPPER-CASE>;IGNORE
<a-grave>  <a>;<GRAVE>;<LOWER-CASE>;IGNORE
<A-grave>  <a>;<GRAVE>;<UPPER-CASE>;IGNORE
<ae>      "<a><e>";"<LIGATURE><LIGATURE>";\
<AE>      "<a><e>";"<LIGATURE><LIGATURE>";\
<b>        <b>;<NO-ACCENT>;<LOWER-CASE>;IGNORE
<c>        <c>;<NO-ACCENT>;<LOWER-CASE>;IGNORE
<ch>       <ch>;<NO-ACCENT>;<LOWER-CASE>;IGNORE
# As an example, the strings "Bach" and "bach" could be encoded (for
# compare purposes) as:
# "Bach"  <b>;<a>;<ch>;<LOW_VALUE>;<NO_ACCENT>;<NO_ACCENT>;\
#         <LOWER-CASE>;<NULL>
# "bach"  <b>;<a>;<ch>;<LOW_VALUE>;<NO_ACCENT>;<NO_ACCENT>;\
#         <LOWER-CASE>;<NULL>
# The two strings are equal in pass 1 and 2, but differ in pass 3.
# Further characters follow.
int_curr_symbol    "USD "
currency_symbol    "$"
mon_decimal_point  "."
mon_grouping       3;0
positive_sign      ""
negative_sign      "-"
p_cs_precedes      1
n_sign_posn        0
copy "US_en.ASCII"
abday   "Sun";"Mon";"Tue";"Wed";"Thu";"Fri";"Sat"
day     "Sunday";"Monday";"Tuesday";"Wednesday";\
abmon   "Jan";"Feb";"Mar";"Apr";"May";"Jun";\
mon     "January";"February";"March";"April";\
d_t_fmt "%a %b %d %T %Z %Y\n"
yesexpr "^([yY][[:alpha:]]*)|(OK)"
noexpr  "^[nN][[:alpha:]]*"

UNIX ® is a registered Trademark of The Open Group.
POSIX ® is a registered Trademark of The IEEE.