B. Rationale for System Interfaces

B.1 Introduction

B.1.1 Change History

The change history is provided as an informative section, to track changes from earlier versions of this standard.

The following sections describe changes made to the System Interfaces volume of POSIX.1-2024 since Issue 7 of the base document. The CHANGE HISTORY section for each entry details the technical changes that have been made in Issue 5 and later. Changes made before Issue 5 are not included.

Changes from Issue 7 to Issue 8 (POSIX.1-2024)

The following list summarizes the major changes that were made in the System Interfaces volume of POSIX.1-2024 from Issue 7 to Issue 8:

New Features in Issue 8

The functions first introduced in Issue 8 (over the Issue 7 base document) are as follows:

New Functions in Issue 8


_Fork()
aligned_alloc()
at_quick_exit()
atomic_compare_exchange_strong()
atomic_compare_exchange_strong_explicit()
atomic_compare_exchange_weak()
atomic_compare_exchange_weak_explicit()
atomic_exchange()
atomic_exchange_explicit()
atomic_fetch_add()
atomic_fetch_add_explicit()
atomic_fetch_and()
atomic_fetch_and_explicit()
atomic_fetch_or()
atomic_fetch_or_explicit()
atomic_fetch_sub()
atomic_fetch_sub_explicit()
atomic_fetch_xor()
atomic_fetch_xor_explicit()
atomic_flag_clear()
atomic_flag_clear_explicit()
atomic_flag_test_and_set()
atomic_flag_test_and_set_explicit()
atomic_init()
atomic_is_lock_free()
atomic_load()
atomic_load_explicit()
atomic_signal_fence()
atomic_store()
atomic_store_explicit()
atomic_thread_fence()
bind_textdomain_codeset()
bindtextdomain()
c16rtomb()
c32rtomb()
call_once()
cnd_broadcast()
cnd_destroy()
cnd_init()
cnd_signal()
cnd_timedwait()
cnd_wait()
dcgettext()
dcgettext_l()
dcngettext()
dcngettext_l()
dgettext()
dgettext_l()
dladdr()
dngettext()
dngettext_l()
getentropy()
getlocalename_l()
getresgid()
getresuid()
gettext()
gettext_l()
mbrtoc16()
mbrtoc32()
memmem()
mtx_destroy()
mtx_init()
mtx_lock()
mtx_timedlock()
mtx_trylock()
mtx_unlock()
ngettext()
ngettext_l()
posix_close()
posix_devctl()
posix_getdents()
ppoll()
pthread_cond_clockwait()
pthread_mutex_clocklock()
pthread_rwlock_clockrdlock()
pthread_rwlock_clockwrlock()
qsort_r()
quick_exit()
reallocarray()
sem_clockwait()
setresgid()
setresuid()
sig2str()
str2sig()
strlcat()
strlcpy()
textdomain()
thrd_create()
thrd_current()
thrd_detach()
thrd_equal()
thrd_exit()
thrd_join()
thrd_sleep()
thrd_yield()
timespec_get()
tss_create()
tss_delete()
tss_get()
tss_set()
wcslcat()
wcslcpy()

The following new headers are introduced in Issue 8:

New Headers in Issue 8


<devctl.h>
<libintl.h>
<stdalign.h>
<stdatomic.h>
<stdnoreturn.h>
<threads.h>
<uchar.h>

Obsolescent Functions in Issue 8

The base functions moved to obsolescent status in Issue 8 (from the Issue 7 base document) are as follows:

Obsolescent Base Functions in Issue 8


inet_addr()
inet_ntoa()

The XSI functions moved to obsolescent status in Issue 8 (from the Issue 7 base document) are as follows:

Obsolescent XSI Functions in Issue 8


encrypt()
setkey()

Removed Functions in Issue 8

The functions removed in Issue 8 (from the Issue 7 base document) are as follows:

Removed Functions in Issue 8


_longjmp()
_setjmp()
_tolower()
_toupper()
fattach()
fdetach()
ftw()
getitimer()
getmsg()
getpmsg()
gets()
gettimeofday()
ioctl()
isascii()
isastream()
posix_trace_attr_destroy()
posix_trace_attr_getclockres()
posix_trace_attr_getcreatetime()
posix_trace_attr_getgenversion()
posix_trace_attr_getinherited()
posix_trace_attr_getlogfullpolicy()
posix_trace_attr_getlogsize()
posix_trace_attr_getmaxdatasize()
posix_trace_attr_getmaxsystemeventsize()
posix_trace_attr_getmaxusereventsize()
posix_trace_attr_getname()
posix_trace_attr_getstreamfullpolicy()
posix_trace_attr_getstreamsize()
posix_trace_attr_init()
posix_trace_attr_setinherited()
posix_trace_attr_setlogfullpolicy()
posix_trace_attr_setlogsize()
posix_trace_attr_setmaxdatasize()
posix_trace_attr_setname()
posix_trace_attr_setstreamfullpolicy()
posix_trace_attr_setstreamsize()
posix_trace_clear()
posix_trace_close()
posix_trace_create()
posix_trace_create_withlog()
posix_trace_event()
posix_trace_eventid_equal()
posix_trace_eventid_get_name()
posix_trace_eventid_open()
posix_trace_eventset_add()
posix_trace_eventset_del()
posix_trace_eventset_empty()
posix_trace_eventset_fill()
posix_trace_eventset_ismember()
posix_trace_eventtypelist_getnext_id()
posix_trace_eventtypelist_rewind()
posix_trace_flush()
posix_trace_get_attr()
posix_trace_get_filter()
posix_trace_get_status()
posix_trace_getnext_event()
posix_trace_open()
posix_trace_rewind()
posix_trace_set_filter()
posix_trace_shutdown()
posix_trace_start()
posix_trace_stop()
posix_trace_timedgetnext_event()
posix_trace_trid_eventid_open()
posix_trace_trygetnext_event()
pthread_getconcurrency()
pthread_setconcurrency()
putmsg()
putpmsg()
rand_r()
setitimer()
setpgrp()
sighold()
sigignore()
siginterrupt()
sigpause()
sigrelse()
sigset()
tempnam()
toascii()
ulimit()
utime()

B.1.2 Relationship to Other Formal Standards

There is no additional rationale provided for this section.

B.1.3 Format of Entries

Each system interface reference page has a common layout of sections describing the interface. This layout is similar to the manual page or "man" page format shipped with most UNIX systems, and each page has sections describing the SYNOPSIS, DESCRIPTION, RETURN VALUE, and ERRORS. These are the four sections that relate to conformance.

Additional sections are informative, and add considerable information for the application developer. EXAMPLES sections provide example usage. APPLICATION USAGE sections provide additional caveats, issues, and recommendations to the developer. RATIONALE sections give additional information on the decisions made in defining the interface.

FUTURE DIRECTIONS sections act as pointers to related work that may impact the interface in the future, and often caution the developer to architect the code to account for a change in this area. Note that a future directions statement should not be taken as a commitment to adopt a feature or interface in the future.

The CHANGE HISTORY section describes when the interface was introduced, and how it has changed.

Option labels and margin markings in the page can be useful in guiding the application developer.

B.2 General Information

B.2.1 Use and Implementation of Interfaces

B.2.1.1 Use and Implementation of Functions

The information concerning the use of functions was adapted from a description in the ISO C standard. Here is an example of how an application program can protect itself from functions that may or may not be macros, rather than true functions:

The atoi() function may be used in any of several ways: through a macro that the header is permitted to provide, as the true library function after a #undef of any such macro, via a call in which the name is enclosed in parentheses (which suppresses macro expansion), or by taking its address (which requires that a callable function exist).

Note that the ISO C standard reserves names starting with '_' for the compiler. Therefore, the compiler could, for example, implement an intrinsic, built-in function _asm_builtin_atoi(), which it recognized and expanded into inline assembly code. Then, in <stdlib.h>, there could be the following:

#define atoi(X) _asm_builtin_atoi(X)

The user's "normal" call to atoi() would then be expanded inline, but the implementor would also be required to provide a callable function named atoi() for use when the application requires it; for example, if its address is to be stored in a function pointer variable.
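
As an informative illustration (a sketch, not an extract from this standard; the names string_to_int and convert are illustrative only), the following fragment demonstrates the techniques described above in combination:

#include <stdlib.h>

/* Remove any macro definition so that subsequent uses refer to the function. */
#undef atoi

/* An explicit declaration of the function is also permitted. */
extern int atoi(const char *);

int
string_to_int(const char *s)
{
    /* Parenthesizing the name suppresses expansion of a function-like
       macro, so this always calls the true function. */
    return (atoi)(s);
}

/* Taking the address requires that a callable function exist. */
int (*convert)(const char *) = atoi;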

Implementors should note that, since applications can #undef a macro in order to ensure that the function is used, it is not safe for implementations to use the names of any standard functions in macro values, because the application could use #undef to ensure that no macro exists and then use the same name for an identifier with local scope. For example, historically it was common for a getchar() macro to be defined in <stdio.h> as:

#define getchar() getc(stdin)

This definition does not conform, because an application is allowed to use the identifier getc with local scope, and the expansion of the getchar() macro would then pick up the local getc. The following is conforming code, but would not compile with the above definition of getchar():

#include <stdio.h>
#undef getc

int main(void)
{
    int getc;

    getc = getchar();
    return getc;
}

This issue does not affect only function-like macros. For example, the following definition does not conform because there could be a local sysconf variable in scope when SIGRTMIN is expanded:

#define SIGRTMIN ((int)sysconf(_SC_SIGRT_MIN))

Implementors can avoid the problem by using aliases for standard functions instead of the actual function, with names that conforming applications cannot use for local variables. For example:

#define SIGRTMIN ((int)__sysconf(_SC_SIGRT_MIN))

Austin Group Defect 655 is applied, making the requirement relating to explicit function declarations apply only to functions from the ISO C standard.

Austin Group Defect 1302 is applied, aligning this section with the ISO/IEC 9899:2018 standard.

Austin Group Defect 1404 is applied, adding to the examples of invalid values for function arguments.

B.2.1.2 Use and Implementation of Macros

There is no additional rationale provided for this section.

B.2.2 The Compilation Environment

B.2.2.1 POSIX.1 Symbols

This and the following section address the issue of "name space pollution". The ISO C standard requires that the name space beyond what it reserves not be altered except by explicit action of the application developer. This section defines the actions to add the POSIX.1 symbols for those headers where both the ISO C standard and POSIX.1 need to define symbols, and also where the XSI option extends the base standard.

When headers are used to provide symbols, there is a potential for introducing symbols that the application developer cannot predict. Ideally, each header should only contain one set of symbols, but this is not practical for historical reasons. Thus, the concept of feature test macros is included. Two feature test macros are explicitly defined by POSIX.1-2024; it is expected that future versions may add to this.

Note:
Feature test macros allow an application to announce to the implementation its desire to have certain symbols and prototypes exposed. They should not be confused with the version test macros and constants for options in <unistd.h> which are the implementation's way of announcing functionality to the application.

It is further intended that these feature test macros apply only to the headers specified by POSIX.1-2024. Implementations are expressly permitted to make visible symbols not specified by POSIX.1-2024, within both POSIX.1 and other headers, under the control of feature test macros that are not defined by POSIX.1-2024.

The _POSIX_C_SOURCE Feature Test Macro

The POSIX.1-1990 standard specified a macro called _POSIX_SOURCE. This has been superseded by _POSIX_C_SOURCE. This symbol will allow implementations to support various versions of this standard simultaneously. For instance, when _POSIX_C_SOURCE is defined as 202405L, the system should make visible the same name space as permitted and required by the POSIX.1-2024 standard. A special case is the one where the implementation wishes to make available support for the 1990 version of the POSIX standard, in which instance when either _POSIX_SOURCE is defined or _POSIX_C_SOURCE is defined as 1, the system should make visible the same name space as permitted and required by the POSIX.1-1990 standard.

It is expected that C bindings to future POSIX standards will define new values for _POSIX_C_SOURCE, with each new value reserving the name space for that new standard.
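
As an informative sketch (not an extract from this standard), a conforming application requesting the POSIX.1-2024 name space might begin as follows; the macro definition must precede the inclusion of any header:

#define _POSIX_C_SOURCE 202405L

#include <unistd.h>
#include <stdio.h>

int
main(void)
{
    /* _SC_VERSION reports the version of the standard the implementation supports. */
    printf("POSIX version: %ld\n", sysconf(_SC_VERSION));
    return 0;
}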

The _XOPEN_SOURCE Feature Test Macro

The feature test macro _XOPEN_SOURCE is provided as the announcement mechanism for the application that it requires functionality from the Single UNIX Specification. _XOPEN_SOURCE must be defined to the value 800 before the inclusion of any header to enable the functionality in the Single UNIX Specification Version 5. Its definition subsumes the use of _POSIX_C_SOURCE.

An extract of code from a conforming application that appears before any #include statements is given below:

#define _XOPEN_SOURCE 800 /* Single UNIX Specification, Version 5 */

#include ...

Note that the definition of _XOPEN_SOURCE with the value 800 makes the definition of _POSIX_C_SOURCE redundant and it can safely be omitted.

The __STDC_WANT_LIB_EXT1__ Feature Test Macro

The ISO C standard specifies the feature test macro __STDC_WANT_LIB_EXT1__ as the announcement mechanism for the application that it requires functionality from Annex K. It specifies that the symbols specified in Annex K (if supported) are made visible when __STDC_WANT_LIB_EXT1__ is 1 and are not made visible when it is 0, but leaves it unspecified whether they are made visible when __STDC_WANT_LIB_EXT1__ is undefined. POSIX.1 requires that they are not made visible when the macro is undefined (except for those symbols that are already explicitly allowed to be visible through the definition of _POSIX_C_SOURCE or _XOPEN_SOURCE, or both).

POSIX.1 does not include the interfaces specified in Annex K of the ISO C standard, but allows the symbols to be made visible in headers when requested by the application in order that applications can use symbols from Annex K and symbols from POSIX.1 in the same translation unit.
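
As an informative sketch, the following fragment requests both the POSIX.1 name space and the Annex K symbols in the same translation unit; whether Annex K is available is indicated by the implementation's definition of __STDC_LIB_EXT1__, and the function copy_name is illustrative only:

#define _POSIX_C_SOURCE 202405L
#define __STDC_WANT_LIB_EXT1__ 1

#include <string.h>

#if defined(__STDC_LIB_EXT1__)
/* Annex K is supported; its bounds-checked interfaces are now visible. */
static void
copy_name(char *dst, rsize_t dstsz, const char *src)
{
    (void)strcpy_s(dst, dstsz, src);
}
#endif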

Austin Group Defect 1302 is applied, adding this subsection.

B.2.2.2 The Name Space

The reservation of identifiers is paraphrased from the ISO C standard. The text is included because it needs to be part of POSIX.1-2024, regardless of possible changes in future versions of the ISO C standard.

These identifiers may be used by implementations, particularly for feature test macros. Implementations should not use feature test macro names that might be reasonably used by a standard.

Including headers more than once is a reasonably common practice, and it should be carried forward from the ISO C standard. More significantly, having definitions in more than one header is explicitly permitted. Where the potential declaration is "benign" (the same definition twice) the declaration can be repeated, if that is permitted by the compiler. (This is usually true of macros, for example.) In those situations where a repetition is not benign (for example, typedefs), conditional compilation must be used. The situation actually occurs both within the ISO C standard and within POSIX.1: time_t should be in <sys/types.h>, and the ISO C standard mandates that it be in <time.h>.

The area of name space pollution versus additions to structures is difficult because of the macro structure of C. The following discussion summarizes all the various problems with and objections to the issue.

Note the phrase "user-defined macro". Users are not permitted to define macro names (or any other name) beginning with "_[A-Z_]". Thus, the conflict cannot occur for symbols reserved to the vendor's name space, and the permission to add fields automatically applies, without qualification, to those symbols.

  1. Data structures (and unions) need to be defined in headers by implementations to meet certain requirements of POSIX.1 and the ISO C standard.
  2. The structures defined by POSIX.1 are typically minimal, and any practical implementation would wish to add fields to these structures either to hold additional related information or for backwards-compatibility (or both). Future standards (and de facto standards) would also wish to add to these structures. Issues of field alignment make it impractical (at least in the general case) to simply omit fields when they are not defined by the particular standard involved.

    The dirent structure is an example of such a minimal structure (although one could argue about whether the other fields need visible names). The st_rdev field of most implementations' stat structure is a common example where extension is needed and where a conflict could occur.

  3. Fields in structures are in an independent name space, so the addition of such fields presents no problem to the C language itself in that such names cannot interact with identically named user symbols because access is qualified by the specific structure name.
  4. There is an exception to this: macro processing is done at a lexical level. Thus, symbols added to a structure might be recognized as user-provided macro names at the location where the structure is declared. This only can occur if the user-provided name is declared as a macro before the header declaring the structure is included. The user's use of the name after the declaration cannot interfere with the structure because the symbol is hidden and only accessible through access to the structure. Presumably, the user would not declare such a macro if there was an intention to use that field name.
  5. Macros from the same or a related header might use the additional fields in the structure, and those field names might also collide with user macros. Although this is a less frequent occurrence, since macros are expanded at the point of use, no constraint on the order of use of names can apply.
  6. An "obvious" solution of using names in the reserved name space and then redefining them as macros when they should be visible does not work because this has the effect of exporting the symbol into the general name space. For example, given a (hypothetical) system-provided header <h.h>, and two parts of a C program in a.c and b.c, in header <h.h>:
    struct foo {
        int __i;
    };

    #ifdef _FEATURE_TEST
    #define i __i
    #endif

    In file a.c:

    #include <h.h>
    extern int i;
    ...
    

    In file b.c:

    extern int i;
    ...
    

    The symbol that the user thinks of as i in both files has an external name of __i in a.c; the same symbol i in b.c has an external name i (ignoring any hidden manipulations the compiler might perform on the names). This would cause a mysterious name resolution problem when a.o and b.o are linked.

    Simply avoiding definition then causes alignment problems in the structure.

    A structure of the form:

    struct foo {
        union {
            int __i;
    #ifdef _FEATURE_TEST
            int i;
    #endif
        } __ii;
    };
    

    does not work because the name of the logical field i is __ii.i, and introduction of a macro to restore the logical name immediately reintroduces the problem discussed previously (although its manifestation might be more immediate because a syntax error would result if a recursive macro did not cause it to fail first).

  7. A more workable solution would be to declare the structure:
    struct foo {
    #ifdef _FEATURE_TEST
        int i;
    #else
        int __i;
    #endif
    };
    

    However, if a macro (particularly one required by a standard) is to be defined that uses this field, two must be defined: one that uses i, the other that uses __i. If more than one additional field is used in a macro and they are conditional on distinct combinations of features, the complexity goes up as 2^n.

All this leaves a difficult situation: vendors must provide very complex headers to deal with what is conceptually simple and safe—adding a field to a structure. It is the possibility of user-provided macros with the same name that makes this difficult.

Several alternatives were proposed that involved constraining the user's access to part of the name space available to the user (as specified by the ISO C standard). In some cases, this was only until all the headers had been included. There were two proposals discussed that failed to achieve consensus:

  1. Limiting it for the whole program.
  2. Restricting the use of identifiers containing only uppercase letters until after all system headers had been included. It was also pointed out that because macros might wish to access fields of a structure (and macro expansion occurs totally at point of use) restricting names in this way would not protect the macro expansion, and thus the solution was inadequate.

It was finally decided that reservation of symbols would occur, but only in the constrained form described by the current wording.

The current wording also allows the addition of fields to a structure, but requires that user macros of the same name not interfere. This allows vendors to do one of the following:

There are at least two ways that the compiler might be extended: add new preprocessor directives that turn off and on macro expansion for certain symbols (without changing the value of the macro) and a function or lexical operation that suppresses expansion of a word. The latter seems more flexible, particularly because it addresses the problem in macros as well as in declarations.

The following seems to be a possible implementation extension to the C language that will do this: any token that during macro expansion is found to be preceded by three '#' symbols shall not be further expanded in exactly the same way as described for macros that expand to their own name as in Section 6.10.3.4 of the ISO C standard. A vendor may also wish to implement this as an operation that is lexically a function, which might be implemented as:

#define __safe_name(x) ###x

Using a function notation would insulate vendors from changes in standards until such a functionality is standardized (if ever). Standardization of such a function would be valuable because it would then permit third parties to take advantage of it portably in software they may supply.

The symbols that are "explicitly permitted, but not required by POSIX.1-2024" include those classified below. (That is, the symbols classified below might, but are not required to, be present when _POSIX_C_SOURCE is defined to have the value 202405L.)

Since both implementations and future versions of this standard and other POSIX standards may use symbols in the reserved spaces described in these tables, there is a potential for name space clashes. To avoid future name space clashes when adding symbols, implementations should not use the posix_, POSIX_, or _POSIX_ prefixes.

IEEE Std 1003.1-2001/Cor 1-2002, item XSH/TC1/D6/2 is applied, deleting the entries POSIX_, _POSIX_, and posix_ from the column of allowed name space prefixes for use by an implementation in the first table. The presence of these prefixes was contradicting later text which states that: "The prefixes posix_, POSIX_, and _POSIX are reserved for use by XCU 2. Shell Command Language and other POSIX standards. Implementations may add symbols to the headers shown in the following table, provided the identifiers ... do not use the reserved prefixes posix_, POSIX_, or _POSIX.".

IEEE Std 1003.1-2001/Cor 1-2002, item XSH/TC1/D6/3 is applied, correcting the reserved macro prefix from: "PRI[a-z], SCN[a-z]" to: "PRI[Xa-z], SCN[Xa-z]" in the second table. The change was needed since the ISO C standard allows implementations to define macros of the form PRI or SCN followed by any lowercase letter or 'X' in <inttypes.h>. (The ISO/IEC 9899:1999 standard, Subclause 7.26.4.)

IEEE Std 1003.1-2001/Cor 1-2002, item XSH/TC1/D6/4 is applied, adding a new section listing reserved names for the <stdint.h> header. This change is for alignment with the ISO C standard.

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/2 is applied, making it clear that implementations are permitted to have symbols with the prefix _POSIX_ visible in any header.

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/3 is applied, updating the table of allowed macro prefixes to include the prefix FP_[A-Z] for <math.h>. This text is added for consistency with the <math.h> reference page in the Base Definitions volume of POSIX.1-2024 which permits additional implementation-defined floating-point classifications.

Austin Group Interpretation 1003.1-2001 #048 is applied, reserving SEEK_ in the name space.

POSIX.1-2008, Technical Corrigendum 2, XSH/TC2-2008/0001 [801], XSH/TC2-2008/0002 [780], XSH/TC2-2008/0003 [790], XSH/TC2-2008/0004 [780], XSH/TC2-2008/0005 [790], XSH/TC2-2008/0006 [782], XSH/TC2-2008/0007 [790], and XSH/TC2-2008/0008 [790] are applied.

Austin Group Defect 162 is applied, adding the <endian.h> header.

Austin Group Defect 697 is applied, reserving DT_ in the name space.

Austin Group Defect 845 is applied, reserving in6addr_ in the name space.

Austin Group Defect 993 is applied, reserving dli_ in the name space.

Austin Group Defect 1003 is applied, correcting a mismatch with the ISO C standard regarding reservation of each identifier with file scope described in the header section.

Austin Group Defect 1122 is applied, adding <libintl.h>.

Austin Group Defect 1151 is applied, adding ws_ as a reserved prefix for <termios.h>.

Austin Group Defect 1302 is applied, aligning this section with the ISO/IEC 9899:2018 standard.

Austin Group Defect 1456 is applied, clarifying the reservation of symbolic constants with the prefix _CS_, _PC_, and _SC_ for <unistd.h>.

B.2.3 Error Numbers

It was the consensus of the standard developers that allowing the conformance document to state that an error occurs, and under what conditions, while disallowing a statement that it never occurs, does not make sense. It could be implied by the current wording that this is allowed, but to reduce the possibility of future interpretation requests, it is better to make an explicit statement.

The original ISO C standard just required that errno be a modifiable lvalue. Since the introduction of threads in 2011, the ISO C standard has instead required that errno be a macro which expands to a modifiable lvalue that has thread local storage duration.

Checking the value of errno alone is not sufficient to determine the existence or type of an error, since it is not required that a successful function call clear errno. The variable errno should only be examined when the return value of a function indicates that the value of errno is meaningful. In that case, the function is required to set the variable to something other than zero.

The variable errno is never set to zero by any function call; to do so would contradict the ISO C standard.
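
The following informative sketch (the function open_config is illustrative only) shows the recommended pattern: errno is examined only after a return value has indicated failure.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int
open_config(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd == -1) {
        /* errno is meaningful only on this failure path; a successful
           call is not required to have cleared it. */
        fprintf(stderr, "open %s: %s\n", path, strerror(errno));
    }
    return fd;
}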

POSIX.1 requires (in the ERRORS sections of function descriptions) certain error values to be set in certain conditions because many existing applications depend on them. Some error numbers, such as [EFAULT], are entirely implementation-defined and are noted as such in their description in the ERRORS section. This section otherwise allows wide latitude to the implementation in handling error reporting.

Some of the ERRORS sections in POSIX.1-2024 have two subsections. The first:

"The function shall fail if:"

could be called the "mandatory" section.

The second:

"The function may fail if:"

could be informally known as the "optional" section.

Attempting to infer the quality of an implementation based on whether it detects optional error conditions is not useful.

Following each one-word symbolic name for an error, there is a description of the error. The rationale for some of the symbolic names follows:

[ECANCELED]
This spelling was chosen as being more common.
[EFAULT]
Most historical implementations do not catch an error and set errno when an invalid address is given to the functions wait(), time(), or times(). Some implementations cannot reliably detect an invalid address. And most systems that detect invalid addresses will do so only for a system call, not for a library routine.
[EFTYPE]
This error code was proposed in earlier proposals as "Inappropriate operation for file type", meaning that the operation requested is not appropriate for the file specified in the function call. This code was proposed, although the same idea was covered by [ENOTTY], because the connotations of the name would be misleading. It was pointed out that the fcntl() function uses the error code [EINVAL] for this notion, and hence all instances of [EFTYPE] were changed to this code.
[EINTR]
POSIX.1 prohibits conforming implementations from restarting interrupted system calls of conforming applications unless the SA_RESTART flag is in effect for the signal. However, it does not require that [EINTR] be returned when another legitimate value may be substituted; for example, a partial transfer count when read() or write() are interrupted. This is only given when the signal-catching function returns normally as opposed to returns by mechanisms like longjmp() or siglongjmp().
[ELOOP]
In specifying conditions under which implementations would generate this error, the following goals were considered:
[ENAMETOOLONG]

When a symbolic link is encountered during pathname resolution, the contents of that symbolic link are used to create a new pathname. The standard developers intended to allow, but not require, that implementations enforce the restriction of {PATH_MAX} on the result of this pathname substitution.

Implementations are allowed, but not required, to treat a pathname longer than {PATH_MAX} passed into the system as an error. Implementations are required to return a pathname (even if it is longer than {PATH_MAX}) when the user supplies a buffer with an interface that specifies the buffer size, as long as the user-supplied buffer is large enough to hold the entire pathname (see XSH getcwd for an example of this type of interface). Implementations are required to treat a request to pass a pathname longer than {PATH_MAX} from the system to a user-supplied buffer of an unspecified size (usually assumed to be of size {PATH_MAX}) as an error (see XSH realpath for an example of this type of interface).

[ENOMEM]
The term "main memory" is not used in POSIX.1 because it is implementation-defined.
[ENOTSUP]
This error code is to be used when an implementation chooses to implement the required functionality of POSIX.1-2024 but does not support optional facilities defined by POSIX.1-2024. In some earlier versions of this standard, the difference between [ENOTSUP] and [ENOSYS] was that [ENOSYS] indicated that the function was not supported at all. This is no longer the case as [ENOSYS] can also be used to indicate non-support of optional functionality for a function that has some required functionality. (See XSH encrypt .)
[ENOTTY]
The symbolic name for this error is derived from a time when device control was done by ioctl() and that operation was only permitted on a terminal interface. The term "TTY" is derived from "teletypewriter", the devices to which this error originally applied.
[EOVERFLOW]
Most of the uses of this error code are related to large file support. Typically, these cases occur on systems which support multiple programming environments with different sizes for off_t, but they may also occur in connection with remote file systems.

In addition, when different programming environments have different widths for types such as int and uid_t, several functions may encounter a condition where a value in a particular environment is too wide to be represented. In that case, this error should be raised. For example, suppose the currently running process has 64-bit int, and file descriptor 9223372036854775807 is open and does not have the close-on-exec flag set. If the process then uses execl() to exec a file compiled in a programming environment with 32-bit int, the call to execl() can fail with errno set to [EOVERFLOW]. A similar failure can occur with execl() if any of the user IDs or any of the group IDs to be assigned to the new process image are out of range for the executed file's programming environment.

Note, however, that this condition cannot occur for functions that are explicitly described as always being successful, such as getpid().

[EPIPE]
This condition normally generates the signal SIGPIPE; the error is returned if the generation of the signal is suppressed or the signal does not terminate the process.
[EROFS]
In historical implementations, attempting to unlink() or rmdir() a mount point would generate an [EBUSY] error. An implementation could be envisioned where such an operation could be performed without error. In this case, if either the directory entry or the actual data structures reside on a read-only file system, [EROFS] is the appropriate error to generate. (For example, changing the link count of a file on a read-only file system could not be done, as is required by unlink(), and thus an error should be reported.)

Three error numbers, [EDOM], [EILSEQ], and [ERANGE], were added to this section primarily for consistency with the ISO C standard.

POSIX.1-2008, Technical Corrigendum 2, XSH/TC2-2008/0009 [496] and XSH/TC2-2008/0010 [681] are applied.

Austin Group Defect 1067 is applied, adding [ESOCKTNOSUPPORT].

Austin Group Defect 1380 is applied, changing the descriptions of [EMLINK] and [EXDEV].

Austin Group Defect 1669 is applied, changing the description of [EFBIG].

Alternative Solutions for Per-Thread errno

The historical implementation of errno as a single global variable does not work in a multi-threaded environment. In such an environment, a thread may make a POSIX.1 call and get a -1 error return, but before that thread can check the value of errno, another thread might have made a second POSIX.1 call that also set errno. This behavior is unacceptable in robust programs. There were a number of alternatives that were considered for handling the errno problem:

The first option offers the highest level of compatibility with existing practice but requires special support in the linker, compiler, and/or virtual memory system to support the new concept of thread private variables. When compared with current practice, the third and fourth options are much cleaner, more efficient, and encourage a more robust programming style, but they require new versions of all of the POSIX.1 functions that might detect an error. The second option offers compatibility with existing code that uses the <errno.h> header to define the symbol errno. In this option, errno may be a macro defined:

#define errno  (*__errno())
extern int      *__errno();

This option may be implemented as a per-thread variable whereby an errno field is allocated in the user space object representing a thread, and whereby the function __errno() makes a system call to determine the location of its user space object and returns the address of the errno field of that object. Another implementation, one that avoids calling the kernel, involves allocating stacks in chunks. The stack allocator keeps a side table indexed by chunk number containing a pointer to the thread object that uses that chunk. The __errno() function then looks at the stack pointer, determines the chunk number, and uses that as an index into the chunk table to find its thread object and thus its private value of errno. On most architectures, this can be done in four to five instructions. Some compilers may wish to implement __errno() inline to improve performance.
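
As a purely hypothetical sketch of the second option (the names __errno and __errno_storage are illustrative, and a real implementation need not work this way), the per-thread value referenced by the macro shown above could today be provided with thread-local storage rather than with the stack-chunk technique just described:

/* Implementation-side sketch only; applications obtain errno via <errno.h>. */
static _Thread_local int __errno_storage;

int *
__errno(void)
{
    return &__errno_storage;
}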

Disallowing Return of the [EINTR] Error Code

Many blocking interfaces defined by POSIX.1-2024 may return [EINTR] if interrupted during their execution by a signal handler. Blocking interfaces introduced under the threads functionality do not have this property. Instead, they require that the interface appear to be atomic with respect to interruption. In particular, applications calling blocking interfaces need not handle any possible [EINTR] return as a special case since it will never occur. In the case of threads functions in <threads.h>, the requirement is stated in terms of the call not being affected if the calling thread executes a signal handler during the call, since these functions return errors in a different way and cannot distinguish an [EINTR] condition from other error conditions. If it is necessary to restart operations or complete incomplete operations following the execution of a signal handler, this is handled by the implementation, rather than by the application.

Requiring applications to handle [EINTR] errors on blocking interfaces has been shown to be a frequent source of often unreproducible bugs, and it adds no compelling value to the available functionality. Thus, blocking interfaces introduced for use by multi-threaded programs do not use this paradigm. In particular, in none of the functions flockfile(), pthread_cond_timedwait(), pthread_cond_wait(), pthread_join(), pthread_mutex_lock(), and sigwait() did providing [EINTR] returns add value, or even particularly make sense. Thus, these functions do not provide for an [EINTR] return, even when interrupted by a signal handler. The same arguments can be applied to sem_wait(), sem_trywait(), sigwaitinfo(), and sigtimedwait(), but implementations are permitted to return [EINTR] error codes for these functions for compatibility with earlier versions of this standard. Applications cannot rely on calls to these functions returning [EINTR] error codes when signals are delivered to the calling thread, but they should allow for the possibility.
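
The following informative sketch (wait_for_work is an illustrative name) shows an application allowing for, without depending on, an [EINTR] return from sem_wait():

#include <errno.h>
#include <semaphore.h>

int
wait_for_work(sem_t *sem)
{
    int r;

    do {
        r = sem_wait(sem);
    } while (r == -1 && errno == EINTR);    /* retry if interrupted by a handler */

    return r;    /* -1 with errno set for any other error */
}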

Austin Group Interpretation 1003.1-2001 #050 is applied, allowing [ENOTSUP] and [EOPNOTSUPP] to be the same values.

B.2.3.1 Additional Error Numbers

The ISO C standard defines the name space for implementations to add additional error numbers.

B.2.4 Signal Concepts

Historical implementations of signals, using the signal() function, have shortcomings that make them unreliable for many application uses. Because of this, a new signal mechanism, based very closely on that of 4.2 BSD and 4.3 BSD, was added to POSIX.1.

Signal Names

The restriction on the actual type used for sigset_t is intended to guarantee that these objects can always be assigned, have their address taken, and be passed as parameters by value. It is not intended that this type be a structure including pointers to other data structures, as that could impact the portability of applications performing such operations. A reasonable implementation could be a structure containing an array of some integer type.

The signals described in POSIX.1-2024 must have unique values so that they may be named as parameters of case statements in the body of a C-language switch clause. However, implementation-defined signals may have values that overlap with each other or with signals specified in POSIX.1-2024. An example of this is SIGABRT, which traditionally overlaps some other signal, such as SIGIOT.

SIGKILL, SIGTERM, SIGUSR1, and SIGUSR2 are ordinarily generated only through the explicit use of the kill() function, although some implementations generate SIGKILL under extraordinary circumstances. SIGTERM is traditionally the default signal sent by the kill command.

The signals SIGBUS, SIGEMT, SIGIOT, SIGTRAP, and SIGSYS were omitted from POSIX.1 because their behavior is implementation-defined and could not be adequately categorized. Conforming implementations may deliver these signals, but must document the circumstances under which they are delivered and note any restrictions concerning their delivery. The signals SIGFPE, SIGILL, and SIGSEGV are similar in that they also generally result only from programming errors. They were included in POSIX.1 because they do indicate three relatively well-categorized conditions. They are all defined by the ISO C standard and thus would have to be defined by any system with an ISO C standard binding, even if not explicitly included in POSIX.1.

There is very little that a Conforming POSIX.1 Application can do by catching, ignoring, or masking any of the signals SIGILL, SIGTRAP, SIGIOT, SIGEMT, SIGBUS, SIGSEGV, SIGSYS, or SIGFPE. They will generally be generated by the system only in cases of programming errors. While it may be desirable for some robust code (for example, a library routine) to be able to detect and recover from programming errors in other code, these signals are not nearly sufficient for that purpose. One portable use that does exist for these signals is that a command interpreter can recognize them as the cause of termination of a process (with wait()) and print an appropriate message. The mnemonic tags for these signals are derived from their PDP-11 origin.

The signals SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU, and SIGCONT are provided for job control and are unchanged from 4.2 BSD. The signal SIGCHLD is also typically used by job control shells to detect children that have terminated or, as in 4.2 BSD, stopped.

Some implementations, including System V, have a signal named SIGCLD, which is similar to SIGCHLD in 4.2 BSD. POSIX.1 permits implementations to have a single signal with both names. POSIX.1 carefully specifies ways in which conforming applications can avoid the semantic differences between the two different implementations. The name SIGCHLD was chosen for POSIX.1 because most current application usages of it can remain unchanged in conforming applications. SIGCLD in System V has more cases of semantics that POSIX.1 does not specify, and thus applications using it are more likely to require changes in addition to the name change.

The signals SIGUSR1 and SIGUSR2 are commonly used by applications for notification of exceptional behavior and are described as "reserved as application-defined" so that such use is not prohibited. Implementations should not generate SIGUSR1 or SIGUSR2, except when explicitly requested by kill(). It is recommended that libraries not use these two signals, as such use in libraries could interfere with their use by applications calling the libraries. If such use is unavoidable, it should be documented. It is prudent for non-portable libraries to use non-standard signals to avoid conflicts with use of standard signals by portable libraries.

There is no portable way for an application to catch or ignore non-standard signals. Some implementations define the range of signal numbers, so applications can install signal-catching functions for all of them. Unfortunately, implementation-defined signals often cause problems when caught or ignored by applications that do not understand the reason for the signal. While the desire exists for an application to be more robust by handling all possible signals (even those only generated by kill()), no existing mechanism was found to be sufficiently portable to include in POSIX.1. The value of such a mechanism, if included, would be diminished given that SIGKILL would still not be catchable.

A number of new signal numbers are reserved for applications because the two user signals defined by POSIX.1 are insufficient for many realtime applications. A range of signal numbers is specified, rather than an enumeration of additional reserved signal names, because different applications and application profiles will require a different number of application signals. It is not desirable to burden all application domains and therefore all implementations with the maximum number of signals required by all possible applications. Note that in this context, signal numbers are essentially different signal priorities.

The relatively small number of required additional signals, {_POSIX_RTSIG_MAX}, was chosen so as not to require an unreasonably large signal mask/set. While this number of signals defined in POSIX.1 will fit in a single 32-bit word signal mask, it is recognized that most existing implementations define many more signals than are specified in POSIX.1 and, in fact, many implementations have already exceeded 32 signals (including the "null signal"). Support of {_POSIX_RTSIG_MAX} additional signals may push some implementation over the single 32-bit word line, but is unlikely to push any implementations that are already over that line beyond the 64-signal line.

B.2.4.1 Signal Generation and Delivery

The terms defined in this section are not used consistently in documentation of historical systems. Each signal can be considered to have a lifetime beginning with generation and ending with delivery or acceptance. The POSIX.1 definition of "delivery" does not exclude ignored signals; this is considered a more consistent definition. This revised text in several parts of POSIX.1-2024 clarifies the distinct semantics of asynchronous signal delivery and synchronous signal acceptance. The previous wording attempted to categorize both under the term "delivery", which led to conflicts over whether the effects of asynchronous signal delivery applied to synchronous signal acceptance.

Signals generated for a process are delivered to only one thread. Thus, if more than one thread is eligible to receive a signal, one has to be chosen. The choice of threads is left entirely up to the implementation both to allow the widest possible range of conforming implementations and to give implementations the freedom to deliver the signal to the "easiest possible" thread should there be differences in ease of delivery between different threads.

Note that should multiple delivery among cooperating threads be required by an application, this can be trivially constructed out of the provided single-delivery semantics. The construction of a sigwait_multiple() function that accomplishes this goal is presented with the rationale for sigwaitinfo().

Implementations should deliver unblocked signals as soon after they are generated as possible. However, it is difficult for POSIX.1 to make specific requirements about this, beyond those in kill() and sigprocmask(). Even on systems with prompt delivery, scheduling of higher priority processes is always likely to cause delays.

In general, the interval between the generation and delivery of unblocked signals cannot be detected by an application. Thus, references to pending signals generally apply to blocked, pending signals. An implementation registers a signal as pending on the process when no thread has the signal unblocked and there are no threads blocked in a sigwait() function for that signal. Thereafter, the implementation delivers the signal to the first thread that unblocks the signal or calls a sigwait() function on a signal set containing this signal rather than choosing the recipient thread at the time the signal is sent.
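
As an informative sketch of the behavior described above (the names and the choice of SIGUSR1 are illustrative), an application can block a signal in all threads and accept it synchronously in one dedicated thread:

#include <pthread.h>
#include <signal.h>
#include <stdio.h>

static sigset_t set;

static void *
signal_thread(void *arg)
{
    int sig;

    (void)arg;
    for (;;) {
        if (sigwait(&set, &sig) == 0)
            printf("accepted signal %d\n", sig);
    }
    return NULL;
}

int
main(void)
{
    pthread_t tid;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &set, NULL);   /* block before any threads are created */

    pthread_create(&tid, NULL, signal_thread, NULL);
    /* ... application work; SIGUSR1 is accepted only by signal_thread ... */
    pthread_join(tid, NULL);
    return 0;
}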

In the 4.3 BSD system, signals that are blocked and set to SIG_IGN are discarded immediately upon generation. For a signal that is ignored as its default action, if the action is SIG_DFL and the signal is blocked, a generated signal remains pending. In the 4.1 BSD system and in System V Release 3 (two other implementations that support a somewhat similar signal mechanism), all ignored blocked signals remain pending if generated. Because it is not normally useful for an application to simultaneously ignore and block the same signal, it was unnecessary for POSIX.1 to specify behavior that would invalidate any of the historical implementations.

There is one case in some historical implementations where an unblocked, pending signal does not remain pending until it is delivered. In the System V implementation of signal(), pending signals are discarded when the action is set to SIG_DFL or a signal-catching routine (as well as to SIG_IGN). Except in the case of setting SIGCHLD to SIG_DFL, implementations that do this do not conform completely to POSIX.1. Some earlier proposals for POSIX.1 explicitly stated this, but these statements were redundant due to the requirement that functions defined by POSIX.1 not change attributes of processes defined by POSIX.1 except as explicitly stated.

POSIX.1 specifically states that the order in which multiple, simultaneously pending signals are delivered is unspecified. This order has not been explicitly specified in historical implementations, but has remained quite consistent and been known to those familiar with the implementations. Thus, there have been cases where applications (usually system utilities) have been written with explicit or implicit dependencies on this order. Implementors and others porting existing applications may need to be aware of such dependencies.

When there are multiple pending signals that are not blocked, implementations should arrange for the delivery of all signals at once, if possible. Some implementations stack calls to all pending signal-catching routines, making it appear that each signal-catcher was interrupted by the next signal. In this case, the implementation should ensure that this stacking of signals does not violate the semantics of the signal masks established by sigaction(). Other implementations process at most one signal when the operating system is entered, with remaining signals saved for later delivery. Although this practice is widespread, this behavior is neither standardized nor endorsed. In either case, implementations should attempt to deliver signals associated with the current state of the process (for example, SIGFPE) before other signals, if possible.

In 4.2 BSD and 4.3 BSD, it is not permissible to ignore or explicitly block SIGCONT, because if blocking or ignoring this signal prevented it from continuing a stopped process, such a process could never be continued (only killed by SIGKILL). However, 4.2 BSD and 4.3 BSD do block SIGCONT during execution of its signal-catching function when it is caught, creating exactly this problem. A proposal was considered to disallow catching SIGCONT in addition to ignoring and blocking it, but this limitation led to objections. The consensus was to require that SIGCONT always continue a stopped process when generated. This removed the need to disallow ignoring or explicit blocking of the signal; note that SIG_IGN and SIG_DFL are equivalent for SIGCONT.

B.2.4.2 Realtime Signal Generation and Delivery

The realtime signals functionality is required in this version of the standard for the following reasons:

Austin Group Defect 633 is applied, reducing to two the allowed behaviors for the signal mask of the thread that is created to handle a SIGEV_THREAD notification.

Austin Group Defect 1116 is applied, removing a reference to the Realtime Signals Extension option that existed in earlier versions of this standard.

B.2.4.3 Signal Actions

Early proposals mentioned SIGCONT as a second exception to the rule that signals are not delivered to stopped processes until continued. Because POSIX.1-2024 now specifies that SIGCONT causes the stopped process to continue when it is generated, delivery of SIGCONT is not prevented because a process is stopped, even without an explicit exception to this rule.

Ignoring a signal by setting the action to SIG_IGN (or SIG_DFL for signals whose default action is to ignore) is not the same as installing a signal-catching function that simply returns. Invoking such a function will interrupt certain system functions that block processes (for example, wait(), sigsuspend(), pause(), read(), write()), while ignoring a signal has no such effect on the process.

Historical implementations discard pending signals when the action is set to SIG_IGN. However, they do not always do the same when the action is set to SIG_DFL and the default action is to ignore the signal. POSIX.1-2024 requires this for the sake of consistency and also for completeness, since the only signal this applies to is SIGCHLD, and POSIX.1-2024 disallows setting its action to SIG_IGN.

Some implementations (System V, for example) assign different semantics for SIGCLD depending on whether the action is set to SIG_IGN or SIG_DFL. Since POSIX.1 requires that the default action for SIGCHLD be to ignore the signal, applications should always set the action to SIG_DFL in order to avoid SIGCHLD.

Whether or not an implementation allows SIG_IGN as a SIGCHLD disposition to be inherited across a call to one of the exec family of functions or posix_spawn() is explicitly left as unspecified. This change was made as a result of IEEE PASC Interpretation 1003.1 #132, and permits the implementation to decide between the following alternatives:

Some implementations (System V, for example) will deliver a SIGCLD signal immediately when a process establishes a signal-catching function for SIGCLD when that process has a child that has already terminated. Other implementations, such as 4.3 BSD, do not generate a new SIGCHLD signal in this way. In general, a process should not attempt to alter the signal action for the SIGCHLD signal while it has any outstanding children. However, it is not always possible for a process to avoid this; for example, shells sometimes start up processes in pipelines with other processes from the pipeline as children. Processes that cannot ensure that they have no children when altering the signal action for SIGCHLD thus need to be prepared for, but not depend on, generation of an immediate SIGCHLD signal.
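
The following informative sketch shows a handler prepared for, but not dependent on, an immediate SIGCHLD: it reaps every child that has already terminated (the function names are illustrative):

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/wait.h>

static void
chld_handler(int sig)
{
    int saved_errno = errno;    /* waitpid() may modify errno */

    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;                       /* reap all children that have terminated */
    errno = saved_errno;
}

static void
install_chld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = chld_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    (void)sigaction(SIGCHLD, &sa, NULL);
}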

The default action of the stop signals (SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU) is to stop a process that is executing. If a stop signal is delivered to a process that is already stopped, it has no effect. In fact, if a stop signal is generated for a stopped process whose signal mask blocks the signal, the signal will never be delivered to the process since the process must receive a SIGCONT, which discards all pending stop signals, in order to continue executing.

The SIGCONT signal continues a stopped process even if SIGCONT is blocked (or ignored). However, if a signal-catching routine has been established for SIGCONT, it will not be entered until SIGCONT is unblocked.

If a process in an orphaned process group stops, it is no longer under the control of a job control shell and hence would not normally ever be continued. Because of this, orphaned processes that receive terminal-related stop signals (SIGTSTP, SIGTTIN, SIGTTOU, but not SIGSTOP) must not be allowed to stop. The goal is to prevent stopped processes from languishing forever. (As SIGSTOP is sent only via kill(), it is assumed that the process or user sending a SIGSTOP can send a SIGCONT when desired.) Instead, the system must discard the stop signal. As an extension, it may also deliver another signal in its place. 4.3 BSD sends a SIGKILL, which is overly effective because SIGKILL is not catchable. Another possible choice is SIGHUP. 4.3 BSD also does this for orphaned processes (processes whose parent has terminated) rather than for members of orphaned process groups; this is less desirable because job control shells manage process groups. POSIX.1 also prevents SIGTTIN and SIGTTOU signals from being generated for processes in orphaned process groups as a direct result of activity on a terminal, preventing infinite loops when read() and write() calls generate signals that are discarded; see A.11.1.4 Terminal Access Control . A similar restriction on the generation of SIGTSTP was considered, but that would be unnecessary and more difficult to implement due to its asynchronous nature.

Although POSIX.1 requires that signal-catching functions be called with only one argument, there is nothing to prevent conforming implementations from extending POSIX.1 to pass additional arguments, as long as Strictly Conforming POSIX.1 Applications continue to compile and execute correctly. Most historical implementations do, in fact, pass additional, signal-specific arguments to certain signal-catching routines.

There was a proposal to change the declared type of the signal handler to:

void func (int sig, ...);

The usage of ellipses ("...") is ISO C standard syntax to indicate a variable number of arguments. Its use was intended to allow the implementation to pass additional information to the signal handler in a standard manner.

Unfortunately, this construct would require all signal handlers to be defined with this syntax because the ISO C standard allows implementations to use a different parameter passing mechanism for variable parameter lists than for non-variable parameter lists. Thus, all existing signal handlers in all existing applications would have to be changed to use the variable syntax in order to be standard and portable. This is in conflict with the goal of Minimal Changes to Existing Application Code.

When terminating a process from a signal-catching function, processes should be aware of any interpretation that their parent may make of the status returned by wait(), waitid(), or waitpid(). In particular, a signal-catching function should not call exit(0) or _exit(0) unless it wants to indicate successful termination. A non-zero argument to exit() or _exit() can be used to indicate unsuccessful termination. Alternatively, the process can use kill() to send itself a fatal signal (first ensuring that the signal is set to the default action and not blocked). See also the RATIONALE section of the _exit() function.
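
As an informative sketch of the alternative described above, a signal-catching function can terminate the process with the original signal by restoring the default action, unblocking the signal (it is blocked on entry to the handler), and re-sending it with kill(); only async-signal-safe functions are used:

#include <signal.h>
#include <unistd.h>

static void
term_handler(int sig)
{
    sigset_t unblock;

    /* ... async-signal-safe cleanup only ... */

    signal(sig, SIG_DFL);                        /* restore the default action */
    sigemptyset(&unblock);
    sigaddset(&unblock, sig);
    sigprocmask(SIG_UNBLOCK, &unblock, NULL);    /* the signal is blocked on entry */
    kill(getpid(), sig);                         /* terminate with the original signal */
}

In a multi-threaded process, pthread_sigmask() would be used in place of sigprocmask().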

The behavior of unsafe functions, as defined by this section, is undefined when they are called from (or after a longjmp() or siglongjmp() out of) signal-catching functions in certain circumstances. The behavior of async-signal-safe functions, as defined by this section, is as specified by POSIX.1, regardless of invocation from a signal-catching function. This is the only intended meaning of the statement that async-signal-safe functions may be used in signal-catching functions without restriction. Applications must still consider all effects of such functions on such things as data structures, files, and process state. In particular, application developers need to consider the restrictions on interactions when interrupting sleep() (see sleep()) and interactions among multiple handles for a file description. The fact that any specific function is listed as async-signal-safe does not necessarily mean that invocation of that function from a signal-catching function is recommended.

In order to prevent errors arising from interrupting non-async-signal-safe function calls, applications should protect calls to these functions either by blocking the appropriate signals or through the use of some programmatic semaphore. POSIX.1 does not address the more general problem of synchronizing access to shared data structures. Note in particular that even the "safe" functions may modify the global variable errno; the signal-catching function may want to save and restore its value. The same principles apply to the async-signal-safety of application routines and asynchronous data access.
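
A minimal sketch of the errno caution above, assuming a hypothetical handler that calls only async-signal-safe functions and records its work in a volatile sig_atomic_t flag:

#include <errno.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal;

static void handler(int sig)
{
    int saved_errno = errno;   /* write() below may overwrite errno */

    (void) sig;
    (void) write(STDERR_FILENO, "caught signal\n", 14);
    got_signal = 1;

    errno = saved_errno;       /* restore it for the interrupted code */
}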

Note that although longjmp() and siglongjmp() are in the list of async-signal-safe functions, there are restrictions on subsequent behavior after the function is called from a signal-catching function. This is because the code executing after longjmp() or siglongjmp() can call any unsafe functions with the same danger as calling those unsafe functions directly from the signal handler. Applications that use longjmp() or siglongjmp() out of signal handlers require rigorous protection in order to be portable. Many of the other functions that are excluded from the list are traditionally implemented using either the C language malloc() or free() functions or the ISO C standard I/O library, both of which traditionally use data structures in a non-async-signal-safe manner. Because any combination of different functions using a common data structure can cause async-signal-safety problems, POSIX.1 does not define the behavior when any unsafe function is called in (or after a longjmp() or siglongjmp() out of) a signal handler that interrupts any unsafe function or the non-async-signal-safe processing equivalent to exit() that is performed after return from the initial call to main().

The only realtime extension to signal actions is the addition of extra parameters to the signal-catching function. This extension has been explained and motivated in the previous section. In making this extension, though, developers of POSIX.1b ran into issues relating to function prototypes. In response to input from the POSIX.1 standard developers, members were added to the sigaction structure to specify function prototypes for the newer signal-catching function specified by POSIX.1b. These members follow changes that are being made to POSIX.1. Note that POSIX.1-2024 explicitly states that these fields may overlap so that a union can be defined. This enabled existing implementations of POSIX.1 to maintain binary-compatibility when these extensions were added.
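
As an illustration of those structure members (a sketch, not normative text), setting SA_SIGINFO selects the extended, three-argument signal-catching function through the sa_sigaction member:

#include <signal.h>

static void rt_handler(int sig, siginfo_t *info, void *context)
{
    (void) sig; (void) context;
    /* For queued signals, info->si_value carries the application-defined value. */
}

int install_rt_handler(int sig)
{
    struct sigaction sa;

    sa.sa_sigaction = rt_handler;   /* overlaps sa_handler, as noted above */
    sa.sa_flags = SA_SIGINFO;       /* request the three-argument form */
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}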

The siginfo_t structure was adopted for passing the application-defined value to match existing practice, but the existing practice has no provision for an application-defined value, so this was added. Note that POSIX normally reserves the "_t" type designation for opaque types. The siginfo_t structure breaks with this convention to follow existing practice and thus promote portability.

POSIX.1-2024 specifies several values for the si_code member of the siginfo_t structure. Some were introduced in POSIX.1b; others were XSI functionality in the Single UNIX Specification, Version 2 and Version 3, that has now become Base functionality. Historically, an si_code value of less than or equal to zero indicated that the signal was generated by a process via the kill() function, and values of si_code that provided additional information for implementation-generated signals, such as SIGFPE or SIGSEGV, were all positive. This functionality is partially specified for XSI systems in that if si_code is less than or equal to zero, the signal was generated by a process. However, since POSIX.1b did not specify that SI_USER (or SI_QUEUE) had a value less than or equal to zero, it is not true that when the signal is generated by a process, the value of si_code will always be less than or equal to zero. XSI applications should check whether si_code is SI_USER or SI_QUEUE in addition to checking whether it is less than or equal to zero. Applications on systems that do not support the XSI option should just check for SI_USER and SI_QUEUE.
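
The guidance above might be applied as in the following sketch (the helper name is hypothetical):

#include <signal.h>
#include <stdbool.h>

static bool sent_by_process(const siginfo_t *info)
{
    /* On XSI systems si_code <= 0 implies a process-generated signal, but
       SI_USER and SI_QUEUE are not required to be <= 0, so check them too. */
    return info->si_code <= 0 ||
           info->si_code == SI_USER ||
           info->si_code == SI_QUEUE;
}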

If an implementation chooses to define additional values for si_code, these values have to be different from the values of the non-signal-specific symbols specified by POSIX.1-2024. This will allow conforming applications to differentiate between signals generated by standard events and those generated by other implementation events in a manner compatible with existing practice.

The unique values of si_code for the POSIX.1b asynchronous events have implications for implementations of, for example, asynchronous I/O or message passing in user space library code. Such an implementation will be required to provide a hidden interface to the signal generation mechanism that allows the library to specify the standard values of si_code.

POSIX.1-2024 also specifies additional members of siginfo_t, beyond those that were in POSIX.1b. Like the si_code values mentioned above, these were XSI functionality in the Single UNIX Specification, Version 2 and Version 3, that has now become Base functionality. They provide additional information when si_code has one of the values that moved from XSI to Base.

Although it is not explicitly visible to applications, there are additional semantics for signal actions implied by queued signals and their interaction with other POSIX.1b realtime functions. Specifically:

IEEE Std 1003.1-2001/Cor 1-2002, item XSH/TC1/D6/5 is applied, reordering the RTS shaded text under the third and fourth paragraphs of the SIG_DFL description. This corrects an earlier editorial error in this section.

IEEE Std 1003.1-2001/Cor 1-2002, item XSH/TC1/D6/6 is applied, adding the abort() function to the list of async-signal-safe functions.

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/4 is applied, adding the sockatmark() function to the list of async-signal-safe functions.

POSIX.1-2008, Technical Corrigendum 2, XSH/TC2-2008/0011 [690], XSH/TC2-2008/0012 [516], XSH/TC2-2008/0013 [692], XSH/TC2-2008/0014 [615], XSH/TC2-2008/0015 [516], and XSH/TC2-2008/0016 [807] are applied.

Austin Group Defect 62 is applied, adding the _Fork() function to, and removing the fork() function from, the list of async-signal-safe functions.

Austin Group Defect 162 is applied, adding functions from the <endian.h> header to the list of async-signal-safe functions.

Austin Group Defect 411 is applied, adding accept4(), dup3(), and pipe2() to the list of async-signal-safe functions.

Austin Group Defect 614 is applied, adding posix_close() to the list of async-signal-safe functions.

Austin Group Defect 699 is applied, adding setegid(), seteuid(), setregid(), and setreuid() to the list of async-signal-safe functions.

Austin Group Defect 711 is applied, adding va_arg(), va_copy(), va_end(), and va_start() to the list of async-signal-safe functions and updating related text to apply to function-like macros.

Austin Group Defect 728 is applied, reducing the set of circumstances in which undefined behavior results when a signal handler refers to an object with static or thread storage duration.

Austin Group Defect 841 is applied, adding pthread_setcancelstate() to the list of async-signal-safe functions and making it implementation-defined which additional interfaces are also async-signal-safe.

Austin Group Defect 986 is applied, adding strlcat(), strlcpy(), wcslcat(), and wcslcpy() to the list of async-signal-safe functions.

Austin Group Defect 1138 is applied, adding the sig2str() function to the list of async-signal-safe functions.

Austin Group Defect 1141 is applied, changing "core file" to "core image".

Austin Group Defects 1142, 1455, and 1625 are applied, adding the pread(), pwrite(), readv(), waitid(), and writev() functions to the list of async-signal-safe functions.

Austin Group Defect 1151 is applied, adding the tcgetwinsize() and tcsetwinsize() functions to the list of async-signal-safe functions.

Austin Group Defect 1215 is applied, removing XSI shading from text relating to abnormal process termination with additional actions.

Austin Group Defect 1263 is applied, adding the ppoll() function to the list of async-signal-safe functions.

Austin Group Defect 1302 is applied, aligning this section with the ISO/IEC 9899:2018 standard.

Austin Group Defect 1667 is applied, adding getresgid(), getresuid(), setresgid(), and setresuid() to the list of async-signal-safe functions.

Austin Group Defect 1744 is applied, adding killpg() to the list of async-signal-safe functions.

B.2.4.4 Signal Effects on Other Functions

The most common behavior of an interrupted function after a signal-catching function returns is for the interrupted function to give an [EINTR] error unless the SA_RESTART flag is in effect for the signal. However, there are a number of specific exceptions, including sleep() and certain situations with read() and write().
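
For example, a common application idiom when SA_RESTART is not in effect is simply to retry after an [EINTR] error, as in this sketch (the wrapper name is illustrative):

#include <errno.h>
#include <unistd.h>

ssize_t read_retrying(int fd, void *buf, size_t nbytes)
{
    ssize_t n;

    do {
        n = read(fd, buf, nbytes);
    } while (n == -1 && errno == EINTR);   /* interrupted: try again */

    return n;
}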

The historical implementations of many functions defined by POSIX.1-2024 are not interruptible, but delay delivery of signals generated during their execution until after they complete. This is never a problem for functions that are guaranteed to complete in a short (imperceptible to a human) period of time. It is normally those functions that can suspend a process indefinitely or for long periods of time (for example, wait(), pause(), sigsuspend(), sleep(), or read()/write() on a slow device like a terminal) that are interruptible. This permits applications to respond to interactive signals or to set timeouts on calls to most such functions with alarm(). Therefore, implementations should generally make such functions (including ones defined as extensions) interruptible.

Functions not mentioned explicitly as interruptible may be so on some implementations, possibly as an extension where the function gives an [EINTR] error. There are several functions (for example, getpid(), getuid()) that are specified as never returning an error, which can thus never be extended in this way.

If a signal-catching function returns while the SA_RESTART flag is in effect, an interrupted function is restarted at the point it was interrupted. Conforming applications cannot make assumptions about the internal behavior of interrupted functions, even if the functions are async-signal-safe. For example, suppose the read() function is interrupted with SA_RESTART in effect, the signal-catching function closes the file descriptor being read from and returns, and the read() function is then restarted; in this case the application cannot assume that the read() function will give an [EBADF] error, since read() might have checked the file descriptor for validity before being interrupted.

POSIX.1-2008, Technical Corrigendum 2, XSH/TC2-2008/0017 [807] is applied.

B.2.5 Standard I/O Streams

Although the ISO C standard guarantees that, at program start-up, stdin is open for reading and stdout and stderr are open for writing, this guarantee is contingent (as are all guarantees made by the ISO C and POSIX standards) on the program being executed in a conforming environment. Programs executed with file descriptor 0 not open for reading or with file descriptor 1 or 2 not open for writing are executed in a non-conforming environment. Application writers are warned (in exec, posix_spawn, and C.2.7 Redirection) not to execute a standard utility or a conforming application with file descriptor 0 not open for reading or with file descriptor 1 or 2 not open for writing.

POSIX.1-2008, Technical Corrigendum 2, XSH/TC2-2008/0018 [608] is applied.

Austin Group Defect 689 is applied, clarifying the handling of deadlock situations when locking a stream.

Austin Group Defect 1144 is applied, clarifying the effect of setvbuf() on memory streams.

Austin Group Defect 1153 is applied, clarifying that the behavior is undefined if a memory buffer associated with a standard I/O stream overlaps with the destination buffer of a call that reads from the stream or with the source buffer of a call that writes to the stream.

Austin Group Defect 1302 is applied, aligning this section with the ISO/IEC 9899:2018 standard.

Austin Group Defect 1347 is applied, clarifying the requirements for how stderr, stdin, and stdout are opened at program start-up.

B.2.5.1 Interaction of File Descriptors and Standard I/O Streams

POSIX.1-2008, Technical Corrigendum 2, XSH/TC2-2008/0019 [480] is applied.

Austin Group Defect 1183 is applied, changing "non-full" to "non-null".

Austin Group Defect 1318 is applied, changing the list of functions that close file descriptors.

B.2.5.2 Stream Orientation and Encoding Rules

Austin Group Defect 1040 is applied, clarifying that conversion to or from (possibly multi-byte) characters is not performed by wide character I/O functions when the stream was opened using open_wmemstream().

B.2.6 File Descriptor Allocation

Functions such as pipe() and socketpair() which allocate two file descriptors are permitted to perform the two allocations independently. This means that other threads or signal handlers may perform operations on file descriptors in between the two allocations and this can result in the two file descriptors not having adjacent values or in the second allocation producing a lower value than the first.

POSIX.1-2008, Technical Corrigendum 2, XSH/TC2-2008/0032 [835] is applied.

B.2.7 XSI Interprocess Communication

There are two forms of IPC supported as options in POSIX.1-2024. The traditional System V IPC routines derived from the SVID—that is, the msg*(), sem*(), and shm*() interfaces—are mandatory on XSI-conformant systems. Thus, all XSI-conformant systems provide the same mechanisms for manipulating messages, shared memory, and semaphores.

In addition, the POSIX Realtime Extension provides an alternate set of routines for those systems supporting the appropriate options.

The application developer is presented with a choice: the System V interfaces or the POSIX interfaces (loosely derived from the Berkeley interfaces). The XSI profile prefers the System V interfaces, but the POSIX interfaces may be more suitable for realtime or other performance-sensitive applications.

B.2.7.1 IPC General Description

General information that is shared by all three mechanisms is described in this section. The common permissions mechanism is briefly introduced, describing the mode bits, and how they are used to determine whether or not a process has access to read or write/alter the appropriate instance of one of the IPC mechanisms. All other relevant information is contained in the reference pages themselves.

The semaphore type of IPC allows processes to communicate through the exchange of semaphore values. A semaphore is a non-negative integer. Since many applications require the use of more than one semaphore, XSI-conformant systems have the ability to create sets or arrays of semaphores.

Calls to support semaphores include:

semctl(), semget(), semop()

Semaphore sets are created by using the semget() function.
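
A minimal sketch (values and names are illustrative, not normative): create a private set containing one semaphore, give it an initial value with semctl(), and perform a "wait" operation with semop().

#include <sys/ipc.h>
#include <sys/sem.h>

union semun {                          /* per XSI, declared by the application */
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

int sysv_semaphore_demo(void)
{
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);   /* set of one */
    union semun arg;
    struct sembuf op = { 0, -1, 0 };   /* semaphore 0: "wait" (decrement by one) */

    if (id == -1)
        return -1;
    arg.val = 1;
    semctl(id, 0, SETVAL, arg);        /* initial value of 1 */
    return semop(id, &op, 1);          /* completes without blocking */
}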

The message type of IPC allows processes to communicate through the exchange of data stored in buffers. This data is transmitted between processes in discrete portions known as messages.

Calls to support message queues include:

msgctl(), msgget(), msgrcv(), msgsnd()

The shared memory type of IPC allows two or more processes to share memory and consequently the data contained therein. This is done by allowing processes to set up access to a common memory address space. This sharing of memory provides a fast means of exchange of data between processes.

Calls to support shared memory include:

shmctl(), shmdt(), shmget()

The ftok() interface is also provided.
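
For example (a sketch only; the pathname, project identifier, and size are illustrative), ftok() can derive an IPC key from an existing file, which shmget() then uses to create or locate a shared memory segment:

#include <sys/ipc.h>
#include <sys/shm.h>

void *attach_shared(const char *path, size_t len)
{
    key_t key = ftok(path, 'M');        /* project id 'M' is arbitrary */
    int id;

    if (key == (key_t)-1)
        return (void *)0;
    id = shmget(key, len, IPC_CREAT | 0600);
    if (id == -1)
        return (void *)0;
    return shmat(id, (void *)0, 0);     /* returns (void *)-1 on failure */
}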

Austin Group Defect 377 is applied, changing the table giving the values for the mode member of the ipc_perm structure.

B.2.8 Realtime

Advisory Information

POSIX.1b contains an Informative Annex with proposed interfaces for "realtime files". These interfaces could determine groups of the exact parameters required to do "direct I/O" or "extents". These interfaces were objected to by a significant portion of the balloting group as too complex. A conforming application had little chance of correctly navigating the large parameter space to match its desires to the system. In addition, they only applied to a new type of file (realtime files) and they told the implementation exactly what to do as opposed to advising the implementation on application behavior and letting it optimize for the system the (portable) application was running on. For example, it was not clear how a system that had a disk array should set its parameters.

There seemed to be several overall goals:

The advisory interfaces, posix_fadvise() and posix_madvise(), satisfy the first two goals. The POSIX_FADV_SEQUENTIAL and POSIX_MADV_SEQUENTIAL advice tells the implementation to expect serial access. Typically the system will prefetch the next several serial accesses in order to overlap I/O. It may also free previously accessed serial data if memory is tight. If the application is not doing serial access it can use POSIX_FADV_WILLNEED and POSIX_MADV_WILLNEED to accomplish I/O overlap, as required. When the application advises POSIX_FADV_RANDOM or POSIX_MADV_RANDOM behavior, the implementation usually tries to fetch a minimum amount of data with each request and it does not expect much locality. POSIX_FADV_DONTNEED and POSIX_MADV_DONTNEED allow the system to free up caching resources as the data will not be required in the near future.

POSIX_FADV_NOREUSE tells the system that caching the specified data is not optimal. For file I/O, the transfer should go directly to the user buffer instead of being cached internally by the implementation. To portably perform direct disk I/O on all systems, the application must perform its I/O transfers according to the following rules:

  1. The user buffer should be aligned according to the {POSIX_REC_XFER_ALIGN} pathconf() variable.
  2. The number of bytes transferred in an I/O operation should be a multiple of the {POSIX_ALLOC_SIZE_MIN} pathconf() variable.
  3. The offset into the file at the start of an I/O operation should be a multiple of the {POSIX_ALLOC_SIZE_MIN} pathconf() variable.
  4. The application should ensure that all threads which open a given file specify POSIX_FADV_NOREUSE to be sure that there is no unexpected interaction between threads using buffered I/O and threads using direct I/O to the same file.

In some cases, a user buffer must be properly aligned in order to be transferred directly to/from the device. The {POSIX_REC_XFER_ALIGN} pathconf() variable tells the application the proper alignment.

The preallocation goal is met by the space control function, posix_fallocate(). The application can use posix_fallocate() to guarantee no [ENOSPC] errors and to improve performance by prepaying any overhead required for block allocation.

Implementations may use information conveyed by a previous posix_fadvise() call to influence the manner in which allocation is performed. For example, if an application did the following calls:

fd = open("file", O_RDWR);
posix_fadvise(fd, offset, len, POSIX_FADV_SEQUENTIAL);
posix_fallocate(fd, offset, len);

an implementation might allocate the file contiguously on disk.

Finally, the pathconf() variables {POSIX_REC_MIN_XFER_SIZE}, {POSIX_REC_MAX_XFER_SIZE}, and {POSIX_REC_INCR_XFER_SIZE} tell the application a range of transfer sizes that are recommended for best I/O performance.

Where bounded response time is required, the vendor can supply the appropriate settings of the advisories to achieve a guaranteed performance level.

The interfaces meet the goals while allowing applications using regular files to take advantage of performance optimizations. The interfaces tell the implementation expected application behavior which the implementation can use to optimize performance on a particular system with a particular dynamic load.

The posix_memalign() function was added to allow for the allocation of specifically aligned buffers; for example, for {POSIX_REC_XFER_ALIGN}.
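
A hedged sketch of how an application might combine these variables (it assumes the Advisory Information option is supported and that {POSIX_REC_XFER_ALIGN} is a valid alignment for posix_memalign()):

#include <stdlib.h>
#include <unistd.h>

void *alloc_transfer_buffer(int fd, size_t nbytes)
{
    long align = fpathconf(fd, _PC_REC_XFER_ALIGN);
    long minsz = fpathconf(fd, _PC_ALLOC_SIZE_MIN);
    size_t len;
    void *buf;

    if (align <= 0 || minsz <= 0)
        return NULL;                   /* limits not available for this file */

    /* Round the transfer size up to a multiple of {POSIX_ALLOC_SIZE_MIN}. */
    len = ((nbytes + (size_t)minsz - 1) / (size_t)minsz) * (size_t)minsz;

    if (posix_memalign(&buf, (size_t)align, len) != 0)
        return NULL;
    return buf;
}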

The working group also considered the alternative of adding a function which would return an aligned pointer to memory within a user-supplied buffer. This was not considered to be the best method, because it potentially wastes large amounts of memory when buffers need to be aligned on large alignment boundaries.

Message Passing

This section provides the rationale for the definition of the message passing interface in POSIX.1-2024. This is presented in terms of the objectives, models, and requirements imposed upon this interface.

Semaphores

Semaphores are a high-performance process synchronization mechanism. Named semaphores are identified by null-terminated strings of characters.

A semaphore is created using the sem_init() function or the sem_open() function with the O_CREAT flag set in oflag.

To use a semaphore, a process has to first initialize the semaphore or inherit an open descriptor for the semaphore via fork().

A semaphore preserves its state when the last reference is closed. For example, if a semaphore has a value of 13 when the last reference is closed, it will have a value of 13 when it is next opened.

When a semaphore is created, an initial state for the semaphore has to be provided. This value is a non-negative integer. Negative values are not possible since they indicate the presence of blocked processes. The persistence of any of these objects across a system crash or a system reboot is undefined. Conforming applications must not depend on any sort of persistence across a system reboot or a system crash.
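
The two creation styles described above might look as follows (the semaphore name "/example_sem" and the mode are illustrative):

#include <fcntl.h>
#include <semaphore.h>

sem_t *create_named(void)
{
    /* Named semaphore with an initial value of 1; SEM_FAILED on error. */
    return sem_open("/example_sem", O_CREAT, 0600, 1);
}

int create_unnamed(sem_t *s)
{
    /* Unnamed, process-shared semaphore with an initial value of 1. */
    return sem_init(s, 1, 1);
}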

B.2.8.1 Realtime Signals
Realtime Signals Extension

This portion of the rationale presents models, requirements, and standardization issues relevant to the Realtime Signals Extension. This extension provides the capability required to support reliable, deterministic, asynchronous notification of events. While a new mechanism, unencumbered by the historical usage and semantics of POSIX.1 signals, might allow for a more efficient implementation, the application requirements for event notification can be met with a small number of extensions to signals. Therefore, a minimal set of extensions to signals to support the application requirements is specified.

The realtime signal extensions specified in this section are used by other realtime functions requiring asynchronous notification:

B.2.8.2 Asynchronous I/O

Many applications need to interact with the I/O subsystem in an asynchronous manner. The asynchronous I/O mechanism provides the ability to overlap application processing and I/O operations initiated by the application. The asynchronous I/O mechanism allows a single process to perform I/O simultaneously to a single file multiple times or to multiple files multiple times.

Overview

Asynchronous I/O operations proceed in logical parallel with the processing done by the application after the asynchronous I/O has been initiated. Other than this difference, asynchronous I/O behaves similarly to normal I/O using read(), write(), lseek(), and fsync(). The effect of issuing an asynchronous I/O request is as if a separate thread of execution were to perform atomically the implied lseek() operation, if any, and then the requested I/O operation (either read(), write(), or fsync()). There is no seek implied with a call to aio_fsync(). Concurrent asynchronous operations and synchronous operations applied to the same file update the file as if the I/O operations had proceeded serially.
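
As an illustration of the implied seek-then-read behavior (a sketch; the field values are assumptions), a request is described entirely by its control block, including the offset:

#include <aio.h>
#include <signal.h>
#include <string.h>

int submit_read(int fd, void *buf, size_t nbytes, off_t offset, struct aiocb *cb)
{
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_buf = buf;
    cb->aio_nbytes = nbytes;
    cb->aio_offset = offset;                      /* the implied lseek() position */
    cb->aio_sigevent.sigev_notify = SIGEV_NONE;   /* no completion signal */
    return aio_read(cb);                          /* 0 if successfully queued */
}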

When asynchronous I/O completes, a signal can be delivered to the application to indicate the completion of the I/O. This signal can be used to indicate that buffers and control blocks used for asynchronous I/O can be reused. Signal delivery is not required for an asynchronous operation and may be turned off on a per-operation basis by the application. Signals may also be synchronously polled using aio_suspend(), sigtimedwait(), or sigwaitinfo().

Normal I/O has a return value and an error status associated with it. Asynchronous I/O returns a value and an error status when the operation is first submitted, but that only relates to whether the operation was successfully queued up for servicing. The I/O operation itself also has a return status and an error value. To allow the application to retrieve the return status and the error value, functions are provided that, given the address of an asynchronous I/O control block, yield the return and error status associated with the operation. Until an asynchronous I/O operation is done, its error status is [EINPROGRESS]. Thus, an application can poll for completion of an asynchronous I/O operation by waiting for the error status to become equal to a value other than [EINPROGRESS]. The return status of an asynchronous I/O operation is undefined so long as the error status is equal to [EINPROGRESS].
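
A sketch of the completion-status protocol just described, using aio_suspend() rather than busy-waiting (the helper name is illustrative):

#include <aio.h>
#include <errno.h>
#include <stddef.h>

ssize_t await_aio(struct aiocb *cb)
{
    const struct aiocb *list[1];

    list[0] = cb;
    while (aio_error(cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);   /* block until the operation is done */

    return aio_return(cb);            /* valid exactly once per operation */
}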

Storage for asynchronous operation return and error status may be limited. Submission of asynchronous I/O operations may fail if this storage is exceeded. When an application retrieves the return status of a given asynchronous operation, therefore, any system-maintained storage used for this status and the error status may be reclaimed for use by other asynchronous operations.

Asynchronous I/O can be performed on file descriptors that have been enabled for POSIX.1b synchronized I/O. In this case, the I/O operation still occurs asynchronously, as defined herein; however, the asynchronous I/O operation in this case is not completed until the I/O has reached either the state of synchronized I/O data integrity completion or synchronized I/O file integrity completion, depending on the sort of synchronized I/O that is enabled on the file descriptor.

Models

Three models illustrate the use of asynchronous I/O: a journalization model, a data acquisition model, and a model of the use of asynchronous I/O in supercomputing applications.

Requirements

Asynchronous input and output for realtime implementations have these requirements:

Standardization Issues

The following issues are addressed by the standardization of asynchronous I/O:

B.2.8.3 Memory Management

All memory management and shared memory definitions are located in the <sys/mman.h> header. This is for alignment with historical practice.

IEEE Std 1003.1-2001/Cor 1-2002, item XSH/TC1/D6/7 is applied, correcting the shading and margin markers in the introduction to Section 2.8.3.1.

Memory Locking Functions

This portion of the rationale presents models, requirements, and standardization issues relevant to process memory locking.

Mapped Files Functions

The memory mapped files functionality provides a mechanism that allows a process to access files by directly incorporating file data into its address space. Once a file is "mapped" into a process address space, the data can be manipulated by instructions as memory. The use of mapped files can significantly reduce I/O data movement since file data does not have to be copied into process data buffers as in read() and write(). If more than one process maps a file, its contents are shared among them. This provides a low overhead mechanism by which processes can synchronize and communicate.
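
A sketch of the mechanism described above: map a whole file read-only so its contents can be examined directly as memory (error handling is minimal, and the helper name is illustrative).

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *map_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    void *p = MAP_FAILED;

    if (fd != -1 && fstat(fd, &st) == 0 && st.st_size > 0)
        p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);

    if (fd != -1)
        close(fd);                     /* the mapping remains valid after close */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}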

Shared Memory Functions

Implementations may support the Shared Memory Objects option independently of memory mapped files. Shared memory objects are named regions of storage that may be independent of the file system and can be mapped into the address space of one or more processes to allow them to share the associated memory.
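
For example (a sketch; the object name "/example_shm" is illustrative), a shared memory object is created, sized, and mapped much like a file:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

void *create_shared_region(size_t len)
{
    int fd = shm_open("/example_shm", O_CREAT | O_RDWR, 0600);
    void *p = MAP_FAILED;

    if (fd != -1 && ftruncate(fd, (off_t)len) == 0)
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (fd != -1)
        close(fd);                     /* the mapping remains valid after close */
    return p == MAP_FAILED ? NULL : p;
}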

Typed Memory Functions

Implementations may support the Typed Memory Objects option without supporting either the Shared Memory option or memory mapped files. Typed memory objects are pools of specialized storage, different from the main memory resource normally used by a processor to hold code and data, that can be mapped into the address space of one or more processes.

B.2.8.4 Process Scheduling

IEEE PASC Interpretation 1003.1 #96 has been applied, adding the pthread_setschedprio() function. This was added since previously there was no way for a thread to lower its own priority without going to the tail of the threads list for its new priority. This capability is necessary to bound the duration of priority inversion encountered by a thread.

The following portion of the rationale presents models, requirements, and standardization issues relevant to process and thread scheduling; see B.2.9.4 Thread Scheduling for additional rationale relevant to thread scheduling.

In an operating system supporting multiple concurrent processes or threads, the system determines the order in which processes or threads execute to meet implementation-defined goals. For time-sharing systems, the goal is to enhance system throughput and promote fairness; the application is provided with little or no control over this sequencing function. While this is acceptable and desirable behavior in a time-sharing system, it is inappropriate in a realtime system; realtime applications must specifically control the execution sequence of their concurrent processes or threads in order to meet externally defined response requirements.

In POSIX.1-2024, the control over process and thread sequencing is provided using a concept of scheduling policies. These policies, described in detail in this section, define the behavior of the system whenever processor resources are to be allocated to competing processes or threads. Only the behavior of the policy is defined; conforming implementations are free to use any mechanism desired to achieve the described behavior.

Austin Group Defect 1302 is applied, making requirements on sched_yield() also apply to thrd_yield().

Austin Group Defect 1610 is applied, clarifying the effects of PTHREAD_PRIO_INHERIT and PTHREAD_PRIO_PROTECT on scheduling queues.

Sporadic Server Scheduling Policy

The sporadic server is a mechanism defined for scheduling aperiodic activities in time-critical realtime systems. This mechanism reserves a certain bounded amount of execution capacity for processing aperiodic events at a high priority level. Any aperiodic events that cannot be processed within the bounded amount of execution capacity are executed in the background at a low priority level. Thus, a certain amount of execution capacity can be guaranteed to be available for processing periodic tasks, even under burst conditions in the arrival of aperiodic processing requests (that is, a large number of requests in a short time interval). The sporadic server also simplifies the schedulability analysis of the realtime system, because it allows aperiodic processes or threads to be treated as if they were periodic. The sporadic server was first described by Sprunt, et al.

The key concept of the sporadic server is to provide and limit a certain amount of computation capacity for processing aperiodic events at their assigned normal priority, during a time interval called the "replenishment period". Once the entity controlled by the sporadic server mechanism is initialized with its period and execution-time budget attributes, it preserves its execution capacity until an aperiodic request arrives. The request will be serviced (if there are no higher priority activities pending) as long as there is execution capacity left. If the request is completed, the actual execution time used to service it is subtracted from the capacity, and a replenishment of this amount of execution time is scheduled to happen one replenishment period after the arrival of the aperiodic request. If the request is not completed, because there is no execution capacity left, then the aperiodic process or thread is assigned a lower background priority. For each portion of consumed execution capacity the execution time used is replenished after one replenishment period. At the time of replenishment, if the sporadic server was executing at a background priority level, its priority is elevated to the normal level. Other similar replenishment policies have been defined, but the one presented here represents a compromise between efficiency and implementation complexity.

The interface that appears in this section defines a new scheduling policy for threads and processes that behaves according to the rules of the sporadic server mechanism. Scheduling attributes are defined and functions are provided to allow the user to set and get the parameters that control the scheduling behavior of this mechanism, namely the normal and low priority, the replenishment period, the maximum number of pending replenishment operations, and the initial execution-time budget.
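
A hedged sketch of setting those parameters for a thread attributes object (it assumes the Sporadic Server and Thread Execution Scheduling options are supported; all numeric values are illustrative):

#include <pthread.h>
#include <sched.h>
#include <string.h>

int make_sporadic(pthread_attr_t *attr)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof sp);
    sp.sched_priority        = 20;          /* normal priority */
    sp.sched_ss_low_priority = 5;           /* background priority */
    sp.sched_ss_repl_period.tv_sec  = 1;    /* replenishment period: 1 s */
    sp.sched_ss_repl_period.tv_nsec = 0;
    sp.sched_ss_init_budget.tv_sec  = 0;    /* execution-time budget: 100 ms */
    sp.sched_ss_init_budget.tv_nsec = 100000000;
    sp.sched_ss_max_repl = 4;               /* maximum pending replenishments */

    pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(attr, SCHED_SPORADIC);
    return pthread_attr_setschedparam(attr, &sp);
}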

B.2.8.5 Clocks and Timers
Rationale for the Monotonic Clock

For those applications that use time services to achieve realtime behavior, changing the value of the clock on which these services rely may cause erroneous timing behavior. For these applications, it is necessary to have a monotonic clock which cannot run backwards, and which has a maximum clock jump that is required to be documented by the implementation. Additionally, it is desirable (but not required by POSIX.1-2024) that the monotonic clock increases its value uniformly. This clock should not be affected by changes to the system time; for example, to synchronize the clock with an external source or to account for leap seconds. Such changes would cause errors in the measurement of time intervals for those time services that use the absolute value of the clock.

One could argue that by defining the behavior of time services when the value of a clock is changed, deterministic realtime behavior can be achieved. For example, one could specify that relative time services should be unaffected by changes in the value of a clock. However, there are time services that are based upon an absolute time, but that are essentially intended as relative time services. For example, pthread_cond_timedwait() uses an absolute time to allow it to wake up after the required interval despite spurious wakeups. Although sometimes the pthread_cond_timedwait() timeouts are absolute in nature, there are many occasions in which they are relative, and their absolute value is determined from the current time plus a relative time interval. In this latter case, if the clock changes while the thread is waiting, the wait interval will not be the expected length. If a pthread_cond_timedwait() function were created that would take a relative time, it would not solve the problem because to retain the intended "deadline" a thread would need to compensate for latency due to the spurious wakeup, and preemption between wakeup and the next wait.

The solution is to create a new monotonic clock, whose value does not change except for the regular ticking of the clock, and use this clock for implementing the various relative timeouts that appear in the different POSIX interfaces, as well as allow pthread_cond_timedwait() to choose this new clock for its timeout. A new clock_nanosleep() function is created to allow an application to take advantage of this newly defined clock. Notice that the monotonic clock may be implemented using the same hardware clock as the system clock.

Relative timeouts for sigtimedwait() and aio_suspend() have been redefined to use the monotonic clock, if present. The alarm() function has not been redefined, because the same effect but with better resolution can be achieved by creating a timer (for which the appropriate clock may be chosen).

The pthread_cond_timedwait() function has been treated in a different way, compared to other functions with absolute timeouts, because it is used to wait for an event, and thus it may have a deadline, while the other timeouts are generally used as an error recovery mechanism, and for them the use of the monotonic clock is not so important. Since the desired timeout for the pthread_cond_timedwait() function may either be a relative interval or an absolute time of day deadline, a new initialization attribute has been created for condition variables to specify the clock that is used for measuring the timeout in a call to pthread_cond_timedwait(). In this way, if a relative timeout is desired, the monotonic clock will be used; if an absolute deadline is required instead, the CLOCK_REALTIME or another appropriate clock may be used. For condition variables, this capability is also available by passing CLOCK_MONOTONIC to the pthread_cond_clockwait() function. Similarly, CLOCK_MONOTONIC can be specified when calling pthread_mutex_clocklock(), pthread_rwlock_clockrdlock(), pthread_rwlock_clockwrlock(), and sem_clockwait().

It was later found necessary to add variants of almost all interfaces that accept absolute timeouts that allow the clock to be specified. This is because, despite the claim in the previous paragraph, it is not possible to safely use a CLOCK_REALTIME absolute timeout even to prevent errors when the system clock is warped by a potentially large amount. A "safety timeout" of a minute on a call to pthread_mutex_timedlock() could actually mean that the call would return [ETIMEDOUT] early without acquiring the lock if the system clock is warped forwards immediately prior to or during the call. On the other hand, a short timeout could end up being arbitrarily long if the system clock is warped backwards immediately prior to or during the call. These problems are solved by the new clockwait and clocklock variants of the existing timedwait and timedlock functions. These variants accept an extra clockid_t parameter to indicate the clock to be used for the wait. The clock ID is passed rather than using attributes as previously for pthread_cond_timedwait() in order to allow the ISO/IEC 14882:2011 standard (C++11) and later to be implemented correctly. C++ allows the clock to be chosen at the time of the wait call, so it cannot be supplied during creation. The new functions are pthread_cond_clockwait(), pthread_mutex_clocklock(), pthread_rwlock_clockrdlock(), pthread_rwlock_clockwrlock(), and sem_clockwait(). It is expected that mq_clockreceive() and mq_clocksend() functions will be added in a future version of this standard.
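
The following sketch converts a relative timeout into an absolute CLOCK_MONOTONIC timeout and waits with pthread_cond_clockwait(), as described above (the wrapper name is illustrative):

#include <pthread.h>
#include <time.h>

int wait_for_relative(pthread_cond_t *cond, pthread_mutex_t *mutex, time_t seconds)
{
    struct timespec abstime;

    clock_gettime(CLOCK_MONOTONIC, &abstime);
    abstime.tv_sec += seconds;         /* unaffected by system clock warps */

    /* The mutex must already be locked by the caller. */
    return pthread_cond_clockwait(cond, mutex, CLOCK_MONOTONIC, &abstime);
}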

The nanosleep() function has not been modified with the introduction of the monotonic clock. Instead, a new clock_nanosleep() function has been created, in which the desired clock may be specified in the function call.
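
For example (a sketch), a relative sleep measured on the monotonic clock, resuming with the remaining time if a signal is caught:

#include <errno.h>
#include <time.h>

void sleep_monotonic_ms(long ms)
{
    struct timespec req;
    struct timespec rem;

    req.tv_sec  = ms / 1000;
    req.tv_nsec = (ms % 1000) * 1000000L;

    while (clock_nanosleep(CLOCK_MONOTONIC, 0, &req, &rem) == EINTR)
        req = rem;                     /* continue with the remaining time */
}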

Austin Group Defect 1346 is applied, requiring support for Monotonic Clock.

Execution Time Monitoring
Rationale Relating to Timeouts

B.2.9 Threads

Threads will normally be more expensive than subroutines (or functions, routines, and so on) if specialized hardware support is not provided. Nevertheless, threads should be sufficiently efficient to encourage their use as a medium to fine-grained structuring mechanism for parallelism in an application. Structuring an application using threads then allows it to take immediate advantage of any underlying parallelism available in the host environment. This means implementors are encouraged to optimize for fast execution at the possible expense of efficient utilization of storage. For example, a common thread creation technique is to cache appropriate thread data structures. That is, rather than releasing system resources, the implementation retains these resources and reuses them when the program next asks to create a new thread. If this reuse of thread resources is to be possible, there has to be very little unique state associated with each thread, because any such state has to be reset when the thread is reused.

Thread Creation Attributes

Attributes objects are provided for threads, mutexes, and condition variables as a mechanism to support probable future standardization in these areas without requiring that the interface itself be changed.

Attributes objects provide clean isolation of the configurable aspects of threads. For example, "stack size" is an important attribute of a thread, but it cannot be expressed portably. When porting a threaded program, stack sizes often need to be adjusted. The use of attributes objects can help by allowing the changes to be isolated in a single place, rather than being spread across every instance of thread creation.

Attributes objects can be used to set up classes of threads with similar attributes; for example, "threads with large stacks and high priority" or "threads with minimal stacks". These classes can be defined in a single place and then referenced wherever threads need to be created. Changes to "class" decisions become straightforward, and detailed analysis of each pthread_create() call is not required.
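
For instance (a sketch; the stack size and names are illustrative), a "big stack" class can be defined once and reused wherever such threads are created:

#include <pthread.h>

static pthread_attr_t big_stack_attr;

int init_big_stack_class(void)
{
    int err = pthread_attr_init(&big_stack_attr);

    if (err == 0)
        err = pthread_attr_setstacksize(&big_stack_attr, 1024 * 1024);
    return err;
}

int create_big_stack_thread(pthread_t *thr, void *(*start)(void *), void *arg)
{
    return pthread_create(thr, &big_stack_attr, start, arg);
}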

The attributes objects are defined as opaque types as an aid to extensibility. If these objects had been specified as structures, adding new attributes would force recompilation of all multi-threaded programs when the attributes objects are extended; this might not be possible if different program components were supplied by different vendors.

Additionally, opaque attributes objects present opportunities for improving performance. Argument validity can be checked once when attributes are set, rather than each time a thread is created. Implementations will often need to cache kernel objects that are expensive to create. Opaque attributes objects provide an efficient mechanism to detect when cached objects become invalid due to attribute changes.

Because assignment is not necessarily defined on a given opaque type, implementation-defined default values cannot be defined in a portable way. The solution to this problem is to allow attribute objects to be initialized dynamically by attributes object initialization functions, so that default values can be supplied automatically by the implementation.

The following proposal was provided as a suggested alternative to the supplied attributes:

  1. Maintain the style of passing a parameter formed by the bitwise-inclusive OR of flags to the initialization routines (pthread_create(), pthread_mutex_init(), pthread_cond_init()). The parameter containing the flags should be an opaque type for extensibility. If no flags are set in the parameter, then the objects are created with default characteristics. An implementation may specify implementation-defined flag values and associated behavior.
  2. If further specialization of mutexes and condition variables is necessary, implementations may specify additional procedures that operate on the pthread_mutex_t and pthread_cond_t objects (instead of on attributes objects).

The difficulties with this solution are:

  1. A bitmask is not opaque if bits have to be set into bit-vector attributes objects using explicitly-coded bitwise-inclusive OR operations. If the set of options exceeds an int, application programmers need to know the location of each bit. If bits are set or read by encapsulation (that is, get*() or set*() functions), then the bitmask is merely an implementation of attributes objects as currently defined and should not be exposed to the programmer.
  2. Many attributes are not Boolean or very small integral values. For example, scheduling policy may be placed in 3 bits or 4 bits, but priority requires 5 bits or more, thereby taking up at least 8 bits out of a possible 16 bits on machines with 16-bit integers. Because of this, the bitmask can only reasonably control whether particular attributes are set or not, and it cannot serve as the repository of the value itself. The value needs to be specified as a function parameter (which is non-extensible), or by setting a structure field (which is non-opaque), or by get*() and set*() functions (making the bitmask a redundant addition to the attributes objects).

Stack size is defined as an optional attribute because the very notion of a stack is inherently machine-dependent. Some implementations may not be able to change the size of the stack, for example, and others may not need to because stack pages may be discontiguous and can be allocated and released on demand.

The attribute mechanism has been designed in large measure for extensibility. Future extensions to the attribute mechanism or to any attributes object defined in POSIX.1-2024 have to be done with care so as not to affect binary-compatibility.

Attribute objects, even if allocated by means of dynamic allocation functions such as malloc(), may have their size fixed at compile time. This means, for example, a pthread_create() in an implementation with extensions to the pthread_attr_t cannot look beyond the area that the binary application assumes is valid. This suggests that implementations should maintain a size field in the attributes object, as well as possibly version information, if extensions in different directions (possibly by different vendors) are to be accommodated.

Thread Implementation Models

There are various thread implementation models. At one end of the spectrum is the "library-thread model". In such a model, the threads of a process are not visible to the operating system kernel, and the threads are not kernel-scheduled entities. The process is the only kernel-scheduled entity. The process is scheduled onto the processor by the kernel according to the scheduling attributes of the process. The threads are scheduled onto the single kernel-scheduled entity (the process) by the runtime library according to the scheduling attributes of the threads. A problem with this model is that it constrains concurrency. Since there is only one kernel-scheduled entity (namely, the process), only one thread per process can execute at a time. If the thread that is executing blocks on I/O, then the whole process blocks.

At the other end of the spectrum is the "kernel-thread model". In this model, all threads are visible to the operating system kernel. Thus, all threads are kernel-scheduled entities, and all threads can concurrently execute. The threads are scheduled onto processors by the kernel according to the scheduling attributes of the threads. The drawback to this model is that the creation and management of the threads entails operating system calls, as opposed to subroutine calls, which makes kernel threads heavier weight than library threads.

Hybrids of these two models are common. A hybrid model offers the speed of library threads and the concurrency of kernel threads. In hybrid models, a process has some (relatively small) number of kernel scheduled entities associated with it. It also has a potentially much larger number of library threads associated with it. Some library threads may be bound to kernel-scheduled entities, while the other library threads are multiplexed onto the remaining kernel-scheduled entities. There are two levels of thread scheduling:

  1. The runtime library manages the scheduling of (unbound) library threads onto kernel-scheduled entities.
  2. The kernel manages the scheduling of kernel-scheduled entities onto processors.

For this reason, a hybrid model is referred to as a two-level threads scheduling model. In this model, the process can have multiple concurrently executing threads; specifically, it can have as many concurrently executing threads as it has kernel-scheduled entities.

Thread-Specific Data

Many applications require that a certain amount of context be maintained on a per-thread basis across procedure calls. A common example is a multi-threaded library routine that allocates resources from a common pool and maintains an active resource list for each thread. The thread-specific data interface provided to meet these needs may be viewed as a two-dimensional array of values with keys serving as the row index and thread IDs as the column index (although the implementation need not work this way).
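
The per-thread resource list mentioned above might be kept as in this sketch (names are illustrative):

#include <pthread.h>
#include <stdlib.h>

static pthread_key_t list_key;
static pthread_once_t list_once = PTHREAD_ONCE_INIT;

static void make_key(void)
{
    pthread_key_create(&list_key, free);   /* destructor runs at thread exit */
}

void *thread_resource_list(size_t size)
{
    void *list;

    pthread_once(&list_once, make_key);
    list = pthread_getspecific(list_key);
    if (list == NULL) {
        list = calloc(1, size);            /* one list per thread */
        pthread_setspecific(list_key, list);
    }
    return list;
}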

Barriers
Spin Locks
Robust Mutexes

Robust mutexes are intended to protect applications that use mutexes to protect data shared between different processes. If a process is terminated by a signal while a thread is holding a mutex, there is no chance for the process to clean up after it. Waiters for the locked mutex might wait indefinitely.

With robust mutexes the problem can be solved: whenever a fatal signal terminates a process, current or future waiters of the mutex are notified about this fact. The locking function provides notification of this condition through the error condition [EOWNERDEAD]. A thread then has the chance to clean up the state protected by the mutex and mark the state as consistent again by a call to pthread_mutex_consistent().
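
A sketch of that recovery sequence (recover_shared_state() is a hypothetical application routine):

#include <errno.h>
#include <pthread.h>

extern void recover_shared_state(void);   /* hypothetical application clean-up */

int lock_robust(pthread_mutex_t *mutex)
{
    int err = pthread_mutex_lock(mutex);

    if (err == EOWNERDEAD) {
        recover_shared_state();                /* repair the protected data */
        err = pthread_mutex_consistent(mutex); /* mark the state consistent */
    }
    return err;
}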

Pre-existing implementations have used the semantics of robust mutexes for a variety of situations, some of them not defined in the standard. One such situation is where a normally terminated process (i.e., when one thread calls exit()) causes notification of other waiters of robust mutexes if a mutex is locked by any thread in the process. This behavior is defined in the standard and makes sense because no thread other than the thread calling exit() has the chance to clean up its data.

If a thread is terminated by cancellation or if it calls pthread_exit(), the situation is different. In both these situations the thread has the chance to clean up after itself by registering appropriate cleanup handlers. There is no real reason to demand that other waiters for a robust mutex the terminating thread owns are notified. The committee felt that this is actively encouraging bad practice because programmers are tempted to rely on the robust mutex semantics instead of correctly cleaning up after themselves.

Therefore, the standard does not require notification of other waiters at the time a thread is terminated while the process continues to run. The mutex is still recognized as being locked by the process (with the thread gone it makes no sense to refer to the thread owning the mutex). Therefore, a terminating process will cause notifications about the dead owner to be sent to all waiters. This delay in the notification is not required, but programmers cannot rely on prompt notification after a thread is terminated.

For the same reason it is not required that an implementation support robust mutexes that are not shared between processes. If a robust mutex is used only within one process, all the cleanup can be performed by the threads themselves by registering appropriate cleanup handlers. Fatal signals are of no importance in this case because after the signal is delivered there is no thread remaining to use the mutex.

Some implementations might choose to support intra-process robust mutexes and they might also send notification of a dead owner right after the previous owner died. But applications must not rely on this. Applications should only use robust mutexes for the purpose of handling fatal signals in situations where inter-process mutexes are in use.

Supported Threads Functions

On POSIX-conforming systems, the following symbolic constants are always defined:

_POSIX_READER_WRITER_LOCKS
_POSIX_THREADS

Therefore, the following threads functions are always supported:


pthread_atfork()
pthread_attr_destroy()
pthread_attr_getdetachstate()
pthread_attr_getguardsize()
pthread_attr_getschedparam()
pthread_attr_init()
pthread_attr_setdetachstate()
pthread_attr_setguardsize()
pthread_attr_setschedparam()
pthread_cancel()
pthread_cleanup_pop()
pthread_cleanup_push()
pthread_cond_broadcast()
pthread_cond_clockwait()
pthread_cond_destroy()
pthread_cond_init()
pthread_cond_signal()
pthread_cond_timedwait()
pthread_cond_wait()
pthread_condattr_destroy()
pthread_condattr_getpshared()
pthread_condattr_init()
pthread_condattr_setpshared()
pthread_create()
pthread_detach()
pthread_equal()
pthread_exit()
pthread_getspecific()
pthread_join()
pthread_key_create()
pthread_key_delete()
 


pthread_kill()
pthread_mutex_destroy()
pthread_mutex_init()
pthread_mutex_lock()
pthread_mutex_trylock()
pthread_mutex_unlock()
pthread_mutexattr_destroy()
pthread_mutexattr_getpshared()
pthread_mutexattr_gettype()
pthread_mutexattr_init()
pthread_mutexattr_setpshared()
pthread_mutexattr_settype()
pthread_once()
pthread_rwlock_destroy()
pthread_rwlock_init()
pthread_rwlock_rdlock()
pthread_rwlock_tryrdlock()
pthread_rwlock_trywrlock()
pthread_rwlock_unlock()
pthread_rwlock_wrlock()
pthread_rwlockattr_destroy()
pthread_rwlockattr_getpshared()
pthread_rwlockattr_init()
pthread_rwlockattr_setpshared()
pthread_self()
pthread_setcancelstate()
pthread_setcanceltype()
pthread_setspecific()
pthread_sigmask()
pthread_testcancel()
sigwait()
 


On POSIX-conforming systems, the symbolic constant _POSIX_THREAD_SAFE_FUNCTIONS is always defined. Therefore, the following functions are always supported:


flockfile()
ftrylockfile()
funlockfile()
getc_unlocked()
getchar_unlocked()
getgrgid_r()
getgrnam_r()
getpwnam_r()
 


getpwuid_r()
gmtime_r()
localtime_r()
putc_unlocked()
putchar_unlocked()
readdir_r()
strerror_r()
strtok_r()
 

Threads Extensions

The following extensions to the IEEE P1003.1c draft standard are now supported in POSIX.1-2024 as part of the alignment with the Single UNIX Specification:

These extensions carefully follow the threads programming model specified in POSIX.1c. As with POSIX.1c, all the new functions return zero if successful; otherwise, an error number is returned to indicate the error.

The concept of attribute objects was introduced in POSIX.1c to allow implementations to extend POSIX.1-2024 without changing the existing interfaces. Attribute objects were defined for threads, mutexes, and condition variables. Attributes objects are defined as implementation-defined opaque types to aid extensibility, and functions are defined to allow attributes to be set or retrieved. This model has been followed when adding the new type attribute of pthread_mutexattr_t or the new read-write lock attributes object pthread_rwlockattr_t.

B.2.9.1 Thread-Safety

All functions required by POSIX.1-2024 need to be thread-safe. Implementations have to provide internal synchronization when necessary in order to achieve this goal. In certain cases—for example, most floating-point implementations—context switch code may have to manage the writable shared state.

While a read from a pipe of {PIPE_BUF}*2 bytes may not generate a single atomic and thread-safe stream of bytes, it should generate "several" (individually atomic) thread-safe streams of bytes. Similarly, while reading from a terminal device may not generate a single atomic and thread-safe stream of bytes, it should generate some finite number of (individually atomic) and thread-safe streams of bytes. That is, concurrent calls to read for a pipe, FIFO, or terminal device are not allowed to result in corrupting the stream of bytes or other internal data. However, read(), in these cases, is not required to return a single contiguous and atomic stream of bytes.

It is not required that all functions provided by POSIX.1-2024 be either async-cancel-safe or async-signal-safe.

As it turns out, some functions are inherently not thread-safe; that is, their interface specifications preclude thread-safety. For example, some functions (such as asctime()) return a pointer to a result stored in memory space allocated by the function on a per-process basis. Such a function is not thread-safe, because its result can be overwritten by successive invocations. Other functions, while not inherently non-thread-safe, may be implemented in ways that lead to them not being thread-safe. For example, some functions (such as rand()) store state information (such as a seed value, which survives multiple function invocations) in memory space allocated by the function on a per-process basis. The implementation of such a function is not thread-safe if the implementation fails to synchronize invocations of the function and thus fails to protect the state information. The problem is that when the state information is not protected, concurrent invocations can interfere with one another (for example, applications using rand() may see the same seed value).

Thread-Safety and Locking of Existing Functions

Originally, POSIX.1 was not designed to work in a multi-threaded environment, and some implementations of some existing functions will not work properly when executed concurrently. To provide routines that will work correctly in an environment with threads ("thread-safe"), two problems need to be solved:

  1. Routines that maintain or return pointers to static areas internal to the routine (which may now be shared) need to be modified. The routines ttyname() and localtime() are examples.
  2. Routines that access data space shared by more than one thread need to be modified. The malloc() function and the stdio family routines are examples.

There are a variety of constraints on these changes. The first is compatibility with the existing versions of these functions—non-thread-safe functions will continue to be in use for some time, as the original interfaces are used by existing code. Another is that the new thread-safe versions of these functions represent as small a change as possible over the familiar interfaces provided by the existing non-thread-safe versions. The new interfaces should be independent of any particular threads implementation. In particular, they should be thread-safe without depending on explicit thread-specific memory. Finally, there should be minimal performance penalty due to the changes made to the functions.
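
For example, the non-thread-safe and thread-safe forms of localtime() differ only in where the result is stored, keeping the change to the familiar interface small:

#include <time.h>

void convert_unsafe(const time_t *clock, struct tm **result)
{
    *result = localtime(clock);    /* points to shared static storage */
}

void convert_safe(const time_t *clock, struct tm *result)
{
    localtime_r(clock, result);    /* result goes to caller-supplied storage */
}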

It is intended that the list of functions from POSIX.1 that cannot be made thread-safe and for which corrected versions are provided be complete.

Thread-Safety and Locking Solutions

Many of the POSIX.1 functions were thread-safe and did not change at all. However, some functions (for example, the math functions typically found in libm) are not thread-safe because of writable shared global state. For instance, in IEEE Std 754-1985 floating-point implementations, the computation modes and flags are global and shared.

Some functions are not thread-safe because a particular implementation is not reentrant, typically because of a non-essential use of static storage. These require only a new implementation.

Thread-safe libraries are useful in a wide range of parallel (and asynchronous) programming environments, not just within pthreads. In order to be used outside the context of pthreads, however, such libraries still have to use some synchronization method. These could either be independent of the pthread synchronization operations, or they could be a subset of the pthread interfaces. Either method results in thread-safe library implementations that can be used without the rest of pthreads.

Some functions, such as the stdio family interface and dynamic memory allocation functions such as malloc(), are inter-dependent routines that share resources (for example, buffers) across related calls. These require synchronization to work correctly, but they do not require any change to their external (user-visible) interfaces.

In some cases, such as getc() and putc(), adding synchronization is likely to create an unacceptable performance impact. In this case, slower thread-safe synchronized functions are to be provided, but the original, faster (but unsafe) functions (which may be implemented as macros) are retained under new names. Some additional special-purpose synchronization facilities are necessary for these macros to be usable in multi-threaded programs. This also requires changes in <stdio.h>.
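
The facilities referred to here are flockfile(), funlockfile(), and the *_unlocked() forms. A minimal sketch of their intended use follows; the read_line() function is illustrative only:

#include <stdio.h>

/* Read one line using the faster, unsynchronized getc_unlocked(),
   while holding the stream lock so the sequence as a whole remains
   thread-safe. */
ssize_t read_line(FILE *fp, char *buf, size_t len)
{
    size_t i = 0;
    int c = EOF;

    flockfile(fp);
    while (i + 1 < len && (c = getc_unlocked(fp)) != EOF) {
        buf[i++] = (char)c;
        if (c == '\n')
            break;
    }
    funlockfile(fp);
    buf[i] = '\0';
    return (i == 0 && c == EOF) ? -1 : (ssize_t)i;
}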

The other common reason that functions are unsafe is that they return a pointer to static storage, making the functions non-thread-safe. This has to be changed, and there are three natural choices:

  1. Return a pointer to thread-specific storage

    This could incur a severe performance penalty on those architectures with a costly implementation of the thread-specific data interface.

    A variation on this technique is to use malloc() to allocate storage for the function output and return a pointer to this storage. This technique may also have an undesirable performance impact, however, and a simplistic implementation requires that the user program explicitly free the storage object when it is no longer needed. This technique is used by some existing POSIX.1 functions. With careful implementation for infrequently used functions, there may be little or no performance or storage penalty, and the maintenance of already-standardized interfaces is a significant benefit.

  2. Return the actual value computed by the function

    This technique can only be used with functions that return pointers to structures—routines that return character strings would have to wrap their output in an enclosing structure in order to return the output on the stack. There is also a negative performance impact inherent in this solution in that the output value has to be copied twice before it can be used by the calling function: once from the called routine's local buffers to the top of the stack, then from the top of the stack to the assignment target. Finally, many older compilers cannot support this technique due to a historical tendency to use internal static buffers to deliver the results of structure-valued functions.

  3. Have the caller pass the address of a buffer to contain the computed value

    The only disadvantage of this approach is that extra arguments have to be provided by the calling program. It represents the most efficient solution to the problem, however, and, unlike the malloc() technique, it is semantically clear.
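
The third choice is the one most often taken by the "_r" interfaces in this standard; for example, localtime() has the caller-supplied-buffer counterpart localtime_r(). The wrapper below is an illustrative sketch only:

#include <time.h>

/* The caller supplies the struct tm and the output buffer, so no
   static storage is involved and the call is thread-safe. */
void format_timestamp(time_t when, char *out, size_t outlen)
{
    struct tm tmbuf;

    if (localtime_r(&when, &tmbuf) != NULL)
        strftime(out, outlen, "%Y-%m-%dT%H:%M:%S", &tmbuf);
    else if (outlen > 0)
        out[0] = '\0';
}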

There are some routines (often groups of related routines) whose interfaces are inherently non-thread-safe because they communicate across multiple function invocations by means of static memory locations. The solution is to redesign the calls so that they are thread-safe, typically by passing the needed data as extra parameters. Unfortunately, this may require major changes to the interface as well.

A floating-point implementation using IEEE Std 754-1985 is a case in point. A less problematic example is the rand48 family of pseudo-random number generators. The functions getgrgid(), getgrnam(), getpwnam(), and getpwuid() are another such case.

The problems with errno are discussed in Alternative Solutions for Per-Thread errno .

Some functions can be thread-safe or not, depending on their arguments. These include the tmpnam() and ctermid() functions. These functions have pointers to character strings as arguments. If the pointers are not NULL, the functions store their results in the character string; however, if the pointers are NULL, the functions store their results in an area that may be static and thus subject to overwriting by successive calls. These should only be called by multi-thread applications when their arguments are non-NULL.
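
For example, a multi-threaded application can call ctermid() safely by always supplying its own buffer (sketch only):

#include <stdio.h>

void print_controlling_terminal(void)
{
    char term[L_ctermid];

    /* A non-null argument keeps the result in caller-owned storage;
       a null argument may return a pointer to a static area. */
    if (ctermid(term)[0] != '\0')
        printf("%s\n", term);
}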

Asynchronous Safety and Thread-Safety

A floating-point implementation has many modes that affect rounding and other aspects of computation. Functions in some math library implementations may change the computation modes for the duration of a function call. If such a function call is interrupted by a signal or cancellation, the floating-point state is not required to be protected.

There is a significant cost to make floating-point operations async-cancel-safe or async-signal-safe; accordingly, neither form of async safety is required.

Functions Returning Pointers to Static Storage

For those functions that are not thread-safe because they return values in fixed size statically allocated structures, alternate "_r" forms are provided that pass a pointer to an explicit result structure. Those that return pointers into library-allocated buffers have forms provided with explicit buffer and length parameters.

For functions that return pointers to library-allocated buffers, it makes sense to provide "_r" versions that allow the application control over allocation of the storage in which results are returned. This allows the state used by these functions to be managed on an application-specific basis, supporting per-thread, per-process, or other application-specific sharing relationships.

Early proposals had provided "_r" versions for functions that returned pointers to variable-size buffers without providing a means for determining the required buffer size. This would have made using such functions exceedingly clumsy, potentially requiring iteratively calling them with increasingly larger guesses for the amount of storage required. Hence, sysconf() variables have been provided for such functions that return the maximum required buffer size.

Thus, the rule that has been followed by POSIX.1-2024 when adapting single-threaded non-thread-safe functions is as follows: all functions returning pointers to library-allocated storage should have "_r" versions provided, allowing the application control over the storage allocation. Those with variable-sized return values accept both a buffer address and a length parameter. The sysconf() variables are provided to supply the appropriate buffer sizes when required. Implementors are encouraged to apply the same rule when adapting their own existing functions to a pthreads environment.
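
As an illustration of this rule (sketch only; lookup_uid() is not a standard interface), an application using getpwnam_r() obtains the buffer size from sysconf() and supplies both the result structure and the buffer:

#include <sys/types.h>
#include <pwd.h>
#include <stdlib.h>
#include <unistd.h>

int lookup_uid(const char *name, uid_t *uid)
{
    long bufsize = sysconf(_SC_GETPW_R_SIZE_MAX);
    struct passwd pwd, *result;
    char *buf;
    int err;

    if (bufsize == -1)
        bufsize = 16384;   /* no limit reported; use a generous guess */
    if ((buf = malloc((size_t)bufsize)) == NULL)
        return -1;

    err = getpwnam_r(name, &pwd, buf, (size_t)bufsize, &result);
    if (err == 0 && result != NULL)
        *uid = pwd.pw_uid;
    free(buf);
    return (err == 0 && result != NULL) ? 0 : -1;
}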

POSIX.1-2008, Technical Corrigendum 2, XSH/TC2-2008/0020 [631], XSH/TC2-2008/0021 [826], and XSH/TC2-2008/0022 [631] are applied.

Austin Group Defect 188 is applied, removing getenv() from the list of functions that need not be thread-safe.

Austin Group Defect 696 is applied, requiring readdir() to be thread-safe except when concurrent calls are made for the same directory stream.

Austin Group Defect 922 is applied, adding the secure_getenv() function.

Austin Group Defect 1064 is applied, removing basename() and dirname() from the list of functions that need not be thread-safe.

Austin Group Defect 1302 is applied, aligning this section with the ISO/IEC 9899:2018 standard.

B.2.9.2 Thread IDs

Separate applications should communicate through well-defined interfaces and should not depend on each other's implementation. For example, if a programmer decides to rewrite the sort utility using multiple threads, it should be easy to do this so that the interface to the sort utility does not change. Consider that if the user causes SIGINT to be generated while the sort utility is running, keeping the same interface means that the entire sort utility is killed, not just one of its threads. As another example, consider a realtime application that manages a reactor. Such an application may wish to allow other applications to control the priority at which it watches the control rods. One technique to accomplish this is to write the ID of the thread watching the control rods into a file and allow other programs to change the priority of that thread as they see fit. A simpler technique is to have the reactor process accept IPCs (Interprocess Communication messages) from other processes, telling it at a semantic level what priority the program should assign to watching the control rods. This allows the programmer greater flexibility in the implementation. For example, the programmer can change the implementation from having one thread per rod to having one thread watching all of the rods without changing the interface. Having threads live inside the process means that the implementation of a process is invisible to outside processes (excepting debuggers and system management tools).

Threads do not provide a protection boundary. Every thread model allows threads to share memory with other threads and encourages this sharing to be widespread. This means that one thread can wipe out memory that is needed for the correct functioning of other threads that are sharing its memory. Consequently, providing each thread with its own user and/or group IDs would not provide a protection boundary between threads sharing memory.

Some applications make the assumption that the implementation can always detect invalid uses of thread IDs of type pthread_t. This is an invalid assumption. Specifically, if pthread_t is defined as a pointer type, no access check needs to be performed before using the ID.

As with other interfaces that take pointer parameters, passing an invalid parameter can result in an invalid memory reference or an attempt to access an undefined portion of a memory object, which can cause signals (SIGSEGV or SIGBUS) to be sent and possibly terminate the process. This is a similar case to passing an invalid buffer pointer to read(). Some implementations might implement read() as a system call and set an [EFAULT] error condition. Other implementations might contain parts of read() at user level, and the first attempt to access data at an invalid reference will cause a signal to be sent instead.

If an implementation detects use of a thread ID after the end of its lifetime, it is recommended that the function should fail and report an [ESRCH] error. This does not imply that implementations are required to return in this case. It is legitimate behavior to send an "invalid memory reference" signal (SIGSEGV or SIGBUS). It is the application's responsibility to use only valid thread IDs and to keep track of the lifetime of the underlying threads.

Austin Group Defect 792 is applied, clarifying thread lifetime.

Austin Group Defect 1302 is applied, aligning this section with the ISO/IEC 9899:2018 standard.

B.2.9.3 Thread Mutexes

Austin Group Defect 1216 is applied, adding pthread_cond_clockwait() and pthread_mutex_clocklock().

Austin Group Defect 1302 is applied, aligning this section with the ISO/IEC 9899:2018 standard.

B.2.9.4 Thread Scheduling
Scheduling Contention Scope

In order to accommodate the requirement for realtime response, each thread has a scheduling contention scope attribute. Threads with a system scheduling contention scope have to be scheduled with respect to all other threads in the system. These threads are usually bound to a single kernel entity that reflects their scheduling attributes and are directly scheduled by the kernel.

Threads with a process scheduling contention scope need be scheduled only with respect to the other threads in the process. These threads may be scheduled within the process onto a pool of kernel entities. The implementation is also free to bind these threads directly to kernel entities and let them be scheduled by the kernel. Process scheduling contention scope allows the implementation the most flexibility and is the default if both contention scopes are supported and none is specified.

Thus, the choice by implementors to provide one or the other (or both) of these scheduling models is driven by the need of their supported application domains for worst-case (that is, realtime) response, or average-case (non-realtime) response.

Scheduling Allocation Domain

The SCHED_FIFO and SCHED_RR scheduling policies take on different characteristics on a multi-processor. Other scheduling policies are also subject to changed behavior when executed on a multi-processor. The concept of scheduling allocation domain determines the set of processors on which the threads of an application may run. By considering the application's processor scheduling allocation domain for its threads, scheduling policies can be defined in terms of their behavior for varying processor scheduling allocation domain values. It is conceivable that not all scheduling allocation domain sizes make sense for all scheduling policies on all implementations. The concept of scheduling allocation domain, however, is a useful tool for the description of multi-processor scheduling policies.

The "process control" approach to scheduling obtains significant performance advantages from dynamic scheduling allocation domain sizes when it is applicable.

Non-Uniform Memory Access (NUMA) multi-processors may use a system scheduling structure that involves reassignment of threads among scheduling allocation domains. In NUMA machines, a natural model of scheduling is to match scheduling allocation domains to clusters of processors. Load balancing in such an environment requires changing the scheduling allocation domain to which a thread is assigned.

Scheduling Documentation

Implementation-provided scheduling policies need to be completely documented in order to be useful. This documentation includes a description of the attributes required for the policy, the scheduling interaction of threads running under this policy and all other supported policies, and the effects of all possible values for processor scheduling allocation domain. Note that for the implementor wishing to be minimally-compliant, it is (minimally) acceptable to define the behavior as undefined.

Scheduling Contention Scope Attribute

The scheduling contention scope defines how threads compete for resources. Within POSIX.1-2024, scheduling contention scope is used to describe only how threads are scheduled in relation to one another in the system. That is, either they are scheduled against all other threads in the system ("system scope") or only against those threads in the process ("process scope"). In fact, scheduling contention scope may apply to additional resources, including virtual timers and profiling, which are not currently considered by POSIX.1-2024.

Mixed Scopes

If only one scheduling contention scope is supported, the scheduling decision is straightforward. To perform the processor scheduling decision in a mixed scope environment, it is necessary to map the scheduling attributes of the thread with process-wide contention scope to the same attribute space as the thread with system-wide contention scope.

Since a conforming implementation has to support one and may support both scopes, it is useful to discuss the effects of such choices with respect to example applications. If an implementation supports both scopes, mixing scopes provides a means of better managing system-level (that is, kernel-level) and library-level resources. In general, threads with system scope will require the resources of a separate kernel entity in order to guarantee the scheduling semantics. On the other hand, threads with process scope can share the resources of a kernel entity while maintaining the scheduling semantics.

The application is free to create threads with dedicated kernel resources, and other threads that multiplex kernel resources. Consider the example of a window server. The server allocates two threads per widget: one thread manages the widget user interface (including drawing), while the other thread takes any required application action. This allows the widget to be "active" while the application is computing. A screen image may be built from thousands of widgets. If each of these threads had been created with system scope, then most of the kernel-level resources might be wasted, since only a few widgets are active at any one time. In addition, mixed scope is particularly useful in a window server where one thread with high priority and system scope handles the mouse so that it tracks well. As another example, consider a database server. For each of the hundreds or thousands of clients supported by a large server, an equivalent number of threads will have to be created. If each of these threads were system scope, the consequences would be the same as for the window server example above. However, the server could be constructed so that actual retrieval of data is done by several dedicated threads. Dedicated threads that do work for all clients frequently justify the added expense of system scope. If it were not permissible to mix system and process threads in the same process, this type of solution would not be possible.

Dynamic Thread Scheduling Parameters Access

In many time-constrained applications, there is no need to change the scheduling attributes dynamically during thread or process execution, since the general use of these attributes is to reflect directly the time constraints of the application. Since these time constraints are generally imposed to meet higher-level system requirements, such as accuracy or availability, they frequently should remain unchanged during application execution.

However, there are important situations in which the scheduling attributes should be changed. Generally, this will occur when external environmental conditions exist in which the time constraints change. Consider, for example, a space vehicle major mode change, such as the change from ascent to descent mode, or the change from the space environment to the atmospheric environment. In such cases, the frequency with which many of the sensors or actuators need to be read or written will change, which will necessitate a priority change. In other cases, even the existence of a time constraint might be temporary, necessitating not just a priority change, but also a policy change for ongoing threads or processes. For this reason, it is critical that the interface should provide functions to change the scheduling parameters dynamically, but, as with many of the other realtime functions, it is important that applications use them properly to avoid the possibility of unnecessarily degrading performance.

In providing functions for dynamically changing the scheduling behavior of threads, there were two options: provide functions to get and set the individual scheduling parameters of threads, or provide a single interface to get and set all the scheduling parameters for a given thread simultaneously. Both approaches have merit. Access functions for individual parameters allow simpler control of thread scheduling for simple thread scheduling parameters. However, a single function for setting all the parameters for a given scheduling policy is required when first setting that scheduling policy. Since the single all-encompassing functions are required, it was decided to leave the interface as minimal as possible. Note that simpler functions (such as pthread_setprio() for threads running under the priority-based schedulers) can be easily defined in terms of the all-encompassing functions.
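
For example, a pthread_setprio() function (not itself part of POSIX.1-2024; shown only to illustrate the layering described above) could be written as:

#include <pthread.h>
#include <sched.h>

/* Change only the priority of a thread, leaving its policy intact. */
int pthread_setprio(pthread_t thread, int prio)
{
    int policy;
    struct sched_param param;
    int err = pthread_getschedparam(thread, &policy, &param);

    if (err != 0)
        return err;
    param.sched_priority = prio;
    return pthread_setschedparam(thread, policy, &param);
}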

If the pthread_setschedparam() function executes successfully, it will have set all of the scheduling parameter values indicated in param; otherwise, none of the scheduling parameters will have been modified. This is necessary to ensure that the scheduling of this and all other threads continues to be consistent in the presence of an erroneous scheduling parameter.

The [EPERM] error value is included in the list of possible pthread_setschedparam() error returns as a reflection of the fact that the ability to change scheduling parameters increases risks to the implementation and application performance if the scheduling parameters are changed improperly. For this reason, and based on some existing practice, it was felt that some implementations would probably choose to define specific permissions for changing either a thread's own or another thread's scheduling parameters. POSIX.1-2024 does not include portable methods for setting or retrieving permissions, so any such use of permissions is completely unspecified.

Mutex Initialization Scheduling Attributes

In a priority-driven environment, a direct use of traditional primitives like mutexes and condition variables can lead to unbounded priority inversion, where a higher priority thread can be blocked by a lower priority thread, or set of threads, for an unbounded duration of time. As a result, it becomes impossible to guarantee thread deadlines. Priority inversion can be bounded and minimized by the use of priority inheritance protocols. This allows thread deadlines to be guaranteed even in the presence of synchronization requirements.

Two useful but simple members of the family of priority inheritance protocols are the basic priority inheritance protocol and the priority ceiling protocol emulation. Under the Basic Priority Inheritance protocol (governed by the Non-Robust Mutex Priority Inheritance option), a thread that is blocking higher priority threads executes at the priority of the highest priority thread that it blocks. This simple mechanism allows priority inversion to be bounded by the duration of critical sections and makes timing analysis possible.

Under the Priority Ceiling Protocol Emulation protocol (governed by the Thread Priority Protection option), each mutex has a priority ceiling, usually defined as the priority of the highest priority thread that can lock the mutex. When a thread is executing inside critical sections, its priority is unconditionally increased to the highest of the priority ceilings of all the mutexes owned by the thread. This protocol has two very desirable properties in uni-processor systems. First, a thread can be blocked by a lower priority thread for at most the duration of one single critical section. Furthermore, when the protocol is correctly used in a single processor, and if threads do not become blocked while owning mutexes, mutual deadlocks are prevented.

The priority ceiling emulation can be extended to multiple processor environments, in which case the values of the priority ceilings will be assigned depending on the kind of mutex that is being used: local to only one processor, or global, shared by several processors. Local priority ceilings will be assigned the usual way, equal to the priority of the highest priority thread that may lock that mutex. Global priority ceilings will usually be assigned a priority level higher than all the priorities assigned to any of the threads that reside in the involved processors to avoid the effect called remote blocking.
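
On implementations supporting the Thread Priority Protection option, a mutex using the priority ceiling emulation might be initialized as follows (sketch only; init_protected_mutex() is illustrative, and the ceiling value is chosen by the application as described above):

#include <pthread.h>

int init_protected_mutex(pthread_mutex_t *mtx, int ceiling)
{
    pthread_mutexattr_t attr;
    int err;

    if ((err = pthread_mutexattr_init(&attr)) != 0)
        return err;
    if ((err = pthread_mutexattr_setprotocol(&attr,
            PTHREAD_PRIO_PROTECT)) == 0 &&
        (err = pthread_mutexattr_setprioceiling(&attr, ceiling)) == 0)
        err = pthread_mutex_init(mtx, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}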

Change the Priority Ceiling of a Mutex

In order for the priority protect protocol to exhibit its desired properties of bounding priority inversion and avoidance of deadlock, it is critical that the ceiling priority of a mutex be the same as the priority of the highest thread that can ever hold it, or higher. Thus, if the priorities of the threads using such mutexes never change dynamically, there is no need ever to change the priority ceiling of a mutex.

However, if a major system mode change results in an altered response time requirement for one or more application threads, their priority has to change to reflect it. It will occasionally be the case that the priority ceilings of mutexes held also need to change. While changing priority ceilings should generally be avoided, it is important that POSIX.1-2024 provide these interfaces for those cases in which it is necessary.
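
The interface provided for this purpose is pthread_mutex_setprioceiling(); for example (sketch only):

#include <pthread.h>

/* After a mode change raises the priorities of the threads using the
   mutex, raise the ceiling of the mutex to match. */
int raise_ceiling(pthread_mutex_t *mtx, int new_ceiling)
{
    int old_ceiling;

    return pthread_mutex_setprioceiling(mtx, new_ceiling, &old_ceiling);
}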

B.2.9.5 Thread Cancellation

Many existing threads packages have facilities for canceling an operation or canceling a thread. These facilities are used for implementing user requests (such as the CANCEL button in a window-based application), for implementing OR parallelism (for example, telling the other threads to stop working once one thread has found a forced mate in a parallel chess program), or for implementing the ABORT mechanism in Ada.

POSIX programs traditionally have used the signal mechanism combined with either longjmp() or polling to cancel operations. Many POSIX programmers have trouble using these facilities to solve their problems efficiently in a single-threaded process. With the introduction of threads, these solutions become even more difficult to use.

The main issues with implementing a cancellation facility are specifying the operation to be canceled, cleanly releasing any resources allocated to that operation, controlling when the target notices that it has been canceled, and defining the interaction between asynchronous signals and cancellation.

Specifying the Operation to Cancel

Consider a thread that calls through five distinct levels of program abstraction and then, inside the lowest-level abstraction, calls a function that suspends the thread. (An abstraction boundary is a layer at which the client of the abstraction sees only the service being provided and can remain ignorant of the implementation. Abstractions are often layered, each level of abstraction being a client of the lower-level abstraction and implementing a higher-level abstraction.) Depending on the semantics of each abstraction, one could imagine wanting to cancel only the call that causes suspension, only the bottom two levels, or the operation being done by the entire thread. Canceling operations at a finer grain than the entire thread is difficult because threads are active and they may be run in parallel on a multi-processor. By the time one thread can make a request to cancel an operation, the thread performing the operation may have completed that operation and gone on to start another operation whose cancellation is not desired. Thread IDs are not reused until the thread has exited, and either it was created with the Attr detachstate attribute set to PTHREAD_CREATE_DETACHED or the pthread_join() or pthread_detach() function has been called for that thread. Consequently, a thread cancellation will never be misdirected when the thread terminates. For these reasons, the canceling of operations is done at the granularity of the thread. Threads are designed to be inexpensive enough so that a separate thread may be created to perform each separately cancelable operation; for example, each possibly long running user request.

For cancellation to be used in existing code, cancellation scopes and handlers will have to be established for code that needs to release resources upon cancellation, so that it follows the programming discipline described in the text.

A Special Signal Versus a Special Interface

Two different mechanisms were considered for providing the cancellation interfaces. The first was to provide an interface to direct signals at a thread and then to define a special signal that had the required semantics. The other alternative was to use a special interface that delivered the correct semantics to the target thread.

The solution using signals produced a number of problems. It required the implementation to provide cancellation in terms of signals whereas a perfectly valid (and possibly more efficient) implementation could have both layered on a low-level set of primitives. There were so many exceptions to the special signal (it cannot be used with kill(), no POSIX.1 interfaces can be used with it) that it was clearly not a valid signal. Its semantics on delivery were also completely different from any existing POSIX.1 signal. As such, a special interface that did not mandate the implementation and did not confuse the semantics of signals and cancellation was felt to be the better solution.

Races Between Cancellation and Resuming Execution

Due to the nature of cancellation, there is generally no synchronization between the thread requesting the cancellation of a blocked thread and events that may cause that thread to resume execution. For this reason, and because excess serialization hurts performance, when both an event that a thread is waiting for has occurred and a cancellation request has been made and cancellation is enabled, POSIX.1-2024 explicitly allows the implementation to choose between returning from the blocking call or acting on the cancellation request.

Interaction of Cancellation with Asynchronous Signals

A typical use of cancellation is to acquire a lock on some resource and to establish a cancellation cleanup handler for releasing the resource when and if the thread is canceled.
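
This pattern is typically written with pthread_cleanup_push() and pthread_cleanup_pop(), as in the following sketch (worker() and its lock are illustrative only):

#include <pthread.h>

static pthread_mutex_t resource_lock = PTHREAD_MUTEX_INITIALIZER;

/* Cleanup handler: releases the lock if the thread is canceled while
   it holds the resource. */
static void unlock_resource(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&resource_lock);
    pthread_cleanup_push(unlock_resource, &resource_lock);

    /* ... work containing cancellation points ... */

    pthread_cleanup_pop(1);   /* non-zero: also unlock on normal exit */
    return NULL;
}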

A correct and complete implementation of cancellation in the presence of asynchronous signals requires considerable care. An implementation has to push a cancellation cleanup handler on the cancellation cleanup stack while maintaining the integrity of the stack data structure. If an asynchronously-generated signal is posted to the thread during a stack operation, the signal handler cannot manipulate the cancellation cleanup stack. As a consequence, asynchronous signal handlers may not cancel threads or otherwise manipulate the cancellation state of a thread. Threads may, of course, be canceled by another thread that used a sigwait() function to wait synchronously for an asynchronous signal.

In order for cancellation to function correctly, it is required that asynchronous signal handlers not change the cancellation state. This requires that some elements of existing practice, such as using longjmp() to exit from an asynchronous signal handler implicitly, be prohibited in cases where the integrity of the cancellation state of the interrupted thread cannot be ensured.

Thread Cancellation Overview

IEEE Std 1003.1-2001/Cor 1-2002, item XSH/TC1/D6/8 is applied, adding the pselect() function to the list of functions with cancellation points.

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/5 is applied, adding the fdatasync() function into the table of functions that shall have cancellation points.

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/6 is applied, adding numerous functions to the table of functions that may have cancellation points.

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/7 is applied, clarifying the requirements in Thread Cancellation Cleanup Handlers.

POSIX.1-2008, Technical Corrigendum 2, XSH/TC2-2008/0023 [627], XSH/TC2-2008/0024 [627,632], XSH/TC2-2008/0025 [627], XSH/TC2-2008/0026 [632], and XSH/TC2-2008/0027 [622] are applied.

Austin Group Defect 411 is applied, adding accept4() to the table of functions that shall have cancellation points.

Austin Group Defect 508 is applied, adding ptsname() and ptsname_r() to the table of functions that may have cancellation points.

Austin Group Defect 614 is applied, adding posix_close() to the table of functions that shall have cancellation points.

Austin Group Defect 697 is applied, adding posix_getdents() to the table of functions that may have cancellation points.

Austin Group Defect 729 is applied, adding posix_devctl() to the table of functions that may have cancellation points.

Austin Group Defect 841 is applied, allowing pthread_setcancelstate() to be used to disable cancellation in a signal catching function in order to avoid undefined behavior when the signal is delivered during execution of a function that is not async-cancel-safe.

Austin Group Defect 1076 is applied, moving sem_wait() and sem_timedwait() from the table of functions that are required to have cancellation points to the table of functions that may have cancellation points.

Austin Group Defect 1122 is applied, adding bindtextdomain() and the gettext family of functions to the table of functions that may have cancellation points.

Austin Group Defect 1143 is applied, clarifying the conditions under which it is unspecified whether the cancellation request is acted upon or whether the cancellation request remains pending.

Austin Group Defect 1216 is applied, adding pthread_cond_clockwait() to the table of functions that are required to have cancellation points, and adding pthread_rwlock_clockwrlock(), pthread_rwlock_clockrdlock(), and sem_clockwait() to the table of functions that may have cancellation points.

Austin Group Defect 1263 is applied, adding ppoll() to the table of functions that are required to have cancellation points.

Austin Group Defect 1302 is applied, aligning this section with the ISO/IEC 9899:2018 standard.

Austin Group Defect 1410 is applied, removing the asctime_r() and ctime_r() functions.

B.2.9.6 Thread Read-Write Locks
Background

Read-write locks are often used to allow parallel access to data on multi-processors, to avoid context switches on uni-processors when multiple threads access the same data, and to protect data structures that are frequently accessed (that is, read) but rarely updated (that is, written). The in-core representation of a file system directory is a good example of such a data structure. One would like to achieve as much concurrency as possible when searching directories, but limit concurrent access when adding or deleting files.

Although read-write locks can be implemented with mutexes and condition variables, such implementations are significantly less efficient than is possible. Therefore, this synchronization primitive is included in POSIX.1-2024 for the purpose of allowing more efficient implementations in multi-processor systems.
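
For the directory example above, typical use of the primitive looks like the following sketch (the dir_lock object and the two functions are illustrative only):

#include <pthread.h>

static pthread_rwlock_t dir_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Many threads may search concurrently ... */
void search_directory(void)
{
    pthread_rwlock_rdlock(&dir_lock);
    /* ... read-only lookup of the in-core directory ... */
    pthread_rwlock_unlock(&dir_lock);
}

/* ... but adding or deleting an entry requires exclusive access. */
void add_directory_entry(void)
{
    pthread_rwlock_wrlock(&dir_lock);
    /* ... modify the in-core directory ... */
    pthread_rwlock_unlock(&dir_lock);
}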

Queuing of Waiting Threads

The pthread_rwlock_unlock() function description states that one writer or one or more readers must acquire the lock if it is no longer held by any thread as a result of the call. However, the function does not specify which thread(s) acquire the lock, unless the Thread Execution Scheduling option is supported.

The standard developers considered the issue of scheduling with respect to the queuing of threads blocked on a read-write lock. The question turned out to be whether POSIX.1-2024 should require priority scheduling of read-write locks for threads whose execution scheduling policy is priority-based (for example, SCHED_FIFO or SCHED_RR). There are tradeoffs between priority scheduling, the amount of concurrency achievable among readers, and the prevention of writer and/or reader starvation.

For example, suppose one or more readers hold a read-write lock and the following threads request the lock in the listed order:

pthread_rwlock_wrlock() - Low priority thread writer_a
pthread_rwlock_rdlock() - High priority thread reader_a
pthread_rwlock_rdlock() - High priority thread reader_b
pthread_rwlock_rdlock() - High priority thread reader_c

When the lock becomes available, should writer_a block the high priority readers? Or, suppose a read-write lock becomes available and the following are queued:

pthread_rwlock_rdlock() - Low priority thread reader_a
pthread_rwlock_rdlock() - Low priority thread reader_b
pthread_rwlock_rdlock() - Low priority thread reader_c
pthread_rwlock_wrlock() - Medium priority thread writer_a
pthread_rwlock_rdlock() - High priority thread reader_d

If priority scheduling is applied then reader_d would acquire the lock and writer_a would block the remaining readers. But should the remaining readers also acquire the lock to increase concurrency? The solution adopted takes into account that when the Thread Execution Scheduling option is supported, high priority threads may in fact starve low priority threads (the application developer is responsible in this case for designing the system in such a way that this starvation is avoided). Therefore, POSIX.1-2024 specifies that high priority readers take precedence over lower priority writers. However, to prevent writer starvation from threads of the same or lower priority, writers take precedence over readers of the same or lower priority.

Priority inheritance mechanisms are non-trivial in the context of read-write locks. When a high priority writer is forced to wait for multiple readers, for example, it is not clear which subset of the readers should inherit the writer's priority. Furthermore, the internal data structures that record the inheritance must be accessible to all readers, and this implies some sort of serialization that could negate any gain in parallelism achieved through the use of multiple readers in the first place. Finally, existing practice does not support the use of priority inheritance for read-write locks. Therefore, no specification of priority inheritance or priority ceiling is attempted. If reliable priority-scheduled synchronization is absolutely required, it can always be obtained through the use of mutexes.

Comparison to fcntl() Locks

The read-write locks and the fcntl() locks in POSIX.1-2024 share a common goal: increasing concurrency among readers, thus increasing throughput and decreasing delay.

However, the read-write locks have two features not present in the fcntl() locks. First, under priority scheduling, read-write locks are granted in priority order. Second, also under priority scheduling, writer starvation is prevented by giving writers preference over readers of equal or lower priority.

Also, read-write locks can be used in systems lacking a file system, such as those conforming to the minimal realtime system profile of IEEE Std 1003.13-1998.

History of Resolution Issues

Based upon some balloting objections, early drafts specified the behavior of threads waiting on a read-write lock during the execution of a signal handler as if the thread had not called the lock operation. However, this specified behavior would require implementations to establish internal signal handlers even though this situation would be rare, or would never happen, for many programs. This would introduce an unacceptable performance hit in comparison to the little additional functionality gained. Therefore, the behavior of read-write locks and signals was reverted to its previous mutex-like specification.

B.2.9.7 Thread Interactions with File Operations

POSIX.1-2008, Technical Corrigendum 2, XSH/TC2-2008/0028 [498] is applied.

Austin Group Defect 411 is applied, adding dup3().

Austin Group Defect 695 is applied, extending the requirements in this section to non-regular files.

B.2.9.8 Use of Application-Managed Thread Stacks

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/8 is applied, adding this new section. It was added to make it clear that the current standard does not allow an application to determine when a stack can be reclaimed. This may be addressed in a future version.

B.2.9.9 Synchronization Object Copies and Alternative Mappings

POSIX.1-2008, Technical Corrigendum 2, XSH/TC2-2008/0029 [972] is applied.

Austin Group Defect 1302 is applied, aligning this section with the ISO/IEC 9899:2018 standard.

B.2.10 Sockets

The base document for the sockets interfaces in POSIX.1-2024 is the XNS, Issue 5.2 specification. This was primarily chosen as it aligns with IPv6. Additional material has been added from IEEE Std 1003.1g-2000, notably socket concepts, raw sockets, the pselect() function, the sockatmark() function, and the <sys/select.h> header.

B.2.10.1 Address Families

There is no additional rationale provided for this section.

B.2.10.2 Addressing

There is no additional rationale provided for this section.

B.2.10.3 Protocols

There is no additional rationale provided for this section.

B.2.10.4 Routing

There is no additional rationale provided for this section.

B.2.10.5 Interfaces

There is no additional rationale provided for this section.

B.2.10.6 Socket Types

The type socklen_t was invented to cover the range of implementations seen in the field. The intent of socklen_t is to be the type for all lengths that are naturally bounded in size; that is, that they are the length of a buffer which cannot sensibly become of massive size: network addresses, host names, string representations of these, ancillary data, control messages, and socket options are examples. Truly boundless sizes are represented by size_t as in read(), write(), and so on.

All socklen_t types were originally (in BSD UNIX) of type int. During the development of POSIX.1-2024, it was decided to change all buffer lengths to size_t, which appears at face value to make sense. When dual-mode 32/64-bit systems came along, this choice unnecessarily complicated system interfaces because size_t (along with long) was a different size under the ILP32 and LP64 models. Reverting to int would have happened except that some implementations had already shipped 64-bit-only interfaces. The compromise was a type which could be defined to be any size by the implementation: socklen_t.

B.2.10.7 Socket I/O Mode

There is no additional rationale provided for this section.

B.2.10.8 Socket Owner

There is no additional rationale provided for this section.

B.2.10.9 Socket Queue Limits

There is no additional rationale provided for this section.

B.2.10.10 Pending Error

There is no additional rationale provided for this section.

B.2.10.11 Socket Receive Queue

There is no additional rationale provided for this section.

B.2.10.12 Socket Out-of-Band Data State

There is no additional rationale provided for this section.

B.2.10.13 Connection Indication Queue

There is no additional rationale provided for this section.

B.2.10.14 Signals

There is no additional rationale provided for this section.

B.2.10.15 Asynchronous Errors

Austin Group Defect 1010 is applied, removing [EHOSTDOWN] from the list of asynchronous errors.

B.2.10.16 Use of Options

Austin Group Defect 840 is applied, adding SO_DOMAIN and SO_PROTOCOL.

Austin Group Defect 1337 is applied, clarifying socket option default values.

B.2.10.17 Use of Sockets for Local UNIX Connections

There is no additional rationale provided for this section.

B.2.10.18 Use of Sockets over Internet Protocols

A raw socket allows privileged users direct access to a protocol; for example, raw access to the IP and ICMP protocols is possible through raw sockets. Raw sockets are intended for knowledgeable applications that wish to take advantage of some protocol feature not directly accessible through the other sockets interfaces.

B.2.10.19 Use of Sockets over Internet Protocols Based on IPv4

There is no additional rationale provided for this section.

B.2.10.20 Use of Sockets over Internet Protocols Based on IPv6

The Open Group Base Resolution bwg2001-012 is applied, clarifying that IPv6 implementations are required to support use of AF_INET6 sockets over IPv4.

Austin Group Defect 411 is applied, adding accept4().

B.2.11 Data Types

B.2.11.1 Defined Types

The requirement that additional types defined in this section end in "_t" was prompted by the problem of name space pollution. It is difficult to define a type (where that type is not one defined by POSIX.1-2024) in one header file and use it in another without adding symbols to the name space of the program. To allow implementors to provide their own types, all conforming applications are required to avoid symbols ending in "_t", which permits the implementor to provide additional types. Because a major use of types is in the definition of structure members, which can (and in many cases must) be added to the structures defined in POSIX.1-2024, the need for additional types is compelling.

The types, such as ushort and ulong, which are in common usage, are not defined in POSIX.1-2024 (although ushort_t would be permitted as an extension). They can be added to <sys/types.h> using a feature test macro (see B.2.2.1 POSIX.1 Symbols ). A suggested symbol for these is _SYSIII. Similarly, the types like u_short would probably be best controlled by _BSD.

Some of these symbols may appear in other headers; see B.2.2.2 The Name Space .

dev_t
This type may be made large enough to accommodate host-locality considerations of networked systems.

This type must be arithmetic. Earlier proposals allowed this to be non-arithmetic (such as a structure) and provided a samefile() function for comparison.

gid_t
Some implementations had separated gid_t from uid_t before POSIX.1 was completed. It would be difficult for them to coalesce them when it was unnecessary. Additionally, it is quite possible that user IDs might be different from group IDs because the user ID might wish to span a heterogeneous network, where the group ID might not.

For current implementations, the cost of having a separate gid_t will be only lexical.

mode_t
This type was chosen so that implementations could choose the appropriate integer type, and for compatibility with the ISO C standard. 4.3 BSD uses unsigned short and the SVID uses ushort, which is the same. Historically, only the low-order sixteen bits are significant.
nlink_t
This type was introduced in place of short for st_nlink (see the <sys/stat.h> header) in response to an objection that short was too small.
off_t
This type is used to represent a file offset or file size. On systems supporting large files, off_t is larger than 32 bits in at least one programming environment. Other programming environments may use different sizes for off_t, for compatibility or other reasons.
pid_t
The inclusion of this symbol was controversial because it is tied to the issue of the representation of a process ID as a number. From the point of view of a conforming application, process IDs should be "magic cookies" that are produced by calls such as fork(), used by calls such as waitpid() or kill(), and not otherwise analyzed (except that the sign is used as a flag for certain operations).

The concept of a {PID_MAX} value interacted with this in early proposals. Treating process IDs as an opaque type both removes the requirement for {PID_MAX} and allows systems to be more flexible in providing process IDs that span a large range of values, or a small one.

Since the values in uid_t, gid_t, and pid_t will be numbers generally, and potentially both large in magnitude and sparse, applications that are based on arrays of objects of this type are unlikely to be fully portable in any case. Solutions that treat them as magic cookies will be portable.

{CHILD_MAX} precludes the possibility of a "toy implementation", where there would only be one process.

ssize_t
This is intended to be a signed analog of size_t. The wording is such that an implementation may either choose to use a longer type or simply to use the signed version of the type that underlies size_t. All functions that return ssize_t (read() and write()) describe as "implementation-defined" the result of an input exceeding {SSIZE_MAX}. It is recognized that some implementations might have ints that are smaller than size_t. A conforming application would be constrained not to perform I/O in pieces larger than {SSIZE_MAX}, but a conforming application using extensions would be able to use the full range if the implementation provided an extended range, while still having a single type-compatible interface.

The symbols size_t and ssize_t are also required in <unistd.h> to minimize the changes needed for calls to read() and write(). Implementors are reminded that it must be possible to include both <sys/types.h> and <unistd.h> in the same program (in either order) without error.

uid_t
Before the addition of this type, the data types used to represent these values varied throughout early proposals. The <sys/stat.h> header defined these values as type short, the <passwd.h> file (now <pwd.h> and <grp.h>) used an int, and getuid() returned an int. In response to a strong objection to the inconsistent definitions, all the types were switched to uid_t.

In practice, those historical implementations that use varying types of this sort can typedef uid_t to short with no serious consequences.

The problem associated with this change concerns object compatibility after structure size changes. Since most implementations will define uid_t as a short, the only substantive change will be a reduction in the size of the passwd structure. Consequently, implementations with an overriding concern for object compatibility can pad the structure back to its current size. For that reason, this problem was not considered critical enough to warrant the addition of a separate type to POSIX.1.

The types uid_t and gid_t are magic cookies. There is no {UID_MAX} defined by POSIX.1, and no structure imposed on uid_t and gid_t other than that they be positive arithmetic types. (In fact, they could be unsigned char.) There is no maximum or minimum specified for the number of distinct user or group IDs.

POSIX.1-2008, Technical Corrigendum 2, XSH/TC2-2008/0030 [733] is applied.

Austin Group Defect 697 is applied, adding reclen_t.

Austin Group Defect 1302 is applied, aligning this section with the ISO/IEC 9899:2018 standard.

B.2.11.2 The char Type

POSIX.1-2024 explicitly requires that a char type is exactly one byte (8 bits).

B.2.12 Status Information

POSIX.1-2024 does not require all matching WNOWAIT threads (threads in a matching call to waitid() with the WNOWAIT flag set) to obtain a child's status information because the status information might be discarded (consumed or replaced) before one of the matching WNOWAIT threads is scheduled. If the status information is not discarded, it will remain available, so all of the matching WNOWAIT threads will (eventually) obtain the status information.

POSIX.1-2008, Technical Corrigendum 2, XSH/TC2-2008/0031 [690] is applied.

B.3 System Interfaces

See the RATIONALE sections on the individual reference pages.

B.3.1 System Interfaces Removed in this Version

This section contains a list of options and interfaces removed in POSIX.1-2024, together with advice for application developers on the alternative interfaces that should be used.

B.3.1.1 STREAMS Option

Applications are recommended to use UNIX domain sockets as an alternative for much of the functionality provided by this option. For example, file descriptor passing can be performed using sendmsg() and recvmsg() with SCM_RIGHTS on a UNIX domain socket instead of using ioctl() with I_SENDFD and I_RECVFD on a STREAM.
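
The following sketch shows the descriptor-passing replacement (send_fd() and recv_fd() are illustrative only; the CMSG_SPACE() and CMSG_LEN() macros used for sizing are widely available in practice, though historically not all of them have been required by this standard):

#include <sys/socket.h>
#include <string.h>

/* Send the descriptor fd over the connected UNIX domain socket sock
   as ancillary data (replacing ioctl() with I_SENDFD). */
int send_fd(int sock, int fd)
{
    char dummy = 'x';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctrl;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl.buf,
                          .msg_controllen = sizeof ctrl.buf };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd() (replacing I_RECVFD);
   returns the new descriptor or -1. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctrl;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl.buf,
                          .msg_controllen = sizeof ctrl.buf };
    struct cmsghdr *cmsg;
    int fd = -1;

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;
    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
            memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
            break;
        }
    }
    return fd;
}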

B.3.1.2 Tracing Option

Applications are recommended to use implementation-provided extension interfaces instead of the functionality provided by this option. (Such interfaces were in widespread use before the Tracing option was added to POSIX.1 and continued to be used in preference to the Tracing option interfaces.)

B.3.1.3 _longjmp() and _setjmp()

Applications are recommended to use siglongjmp() and sigsetjmp() instead of these functions.

B.3.1.4 _tolower() and _toupper()

Applications are recommended to use tolower() and toupper() instead of these functions.

B.3.1.5 ftw()

Applications are recommended to use nftw() instead of this function.

B.3.1.6 getitimer() and setitimer()

Applications are recommended to use timer_gettime() and timer_settime() instead of these functions.
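
For example, the common use of setitimer() to arm a periodic timer can be replaced as follows (start_periodic_timer() is illustrative only; with a null sigevent pointer the timer expiration is notified in the default manner):

#include <signal.h>
#include <time.h>

int start_periodic_timer(timer_t *tid)
{
    struct itimerspec its = {
        .it_value    = { .tv_sec = 0, .tv_nsec = 100000000 },  /* first expiry */
        .it_interval = { .tv_sec = 0, .tv_nsec = 100000000 }   /* then every 100 ms */
    };

    if (timer_create(CLOCK_REALTIME, NULL, tid) == -1)
        return -1;
    return timer_settime(*tid, 0, &its, NULL);
}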

B.3.1.7 gets()

Applications are recommended to use fgets() instead of this function.

B.3.1.8 gettimeofday()

Applications are recommended to use clock_gettime() instead of this function.
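
For example (sketch only), the time value formerly obtained with gettimeofday() can be obtained with:

#include <time.h>

int current_time(time_t *sec, long *usec)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_REALTIME, &ts) == -1)
        return -1;
    *sec = ts.tv_sec;
    *usec = ts.tv_nsec / 1000;   /* the microseconds value gettimeofday() supplied */
    return 0;
}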

B.3.1.9 isascii() and toascii()

Applications are recommended to use macros equivalent to the following instead of these functions:

#define isascii(c) (((c) & ~0177) == 0)
#define toascii(c) ((c) & 0177)

An alternative replacement for isascii(), depending on the intended outcome if the code is ported to implementations with different character encodings, might be:

#define isascii(c) (isprint((c)) || iscntrl((c)))

(In the C or POSIX locale, this determines whether c is a character in the portable character set.)

B.3.1.10 pthread_getconcurrency() and pthread_setconcurrency()

Applications are recommended to use thread scheduling (on implementations that support the Thread Execution Scheduling option) instead of these functions; see XSH 2.9.4 Thread Scheduling .

B.3.1.11 rand_r()

Applications are recommended to use nrand48() or random() instead of this function.

B.3.1.12 setpgrp()

Applications are recommended to use setpgid() or setsid() instead of this function.

B.3.1.13 sighold(), sigpause(), and sigrelse()

Applications are recommended to use pthread_sigmask() or sigprocmask() instead of these functions.

B.3.1.14 sigignore(), siginterrupt(), and sigset()

Applications are recommended to use sigaction() instead of these functions.

B.3.1.15 tempnam()

Applications are recommended to use mkdtemp(), mkstemp(), or tmpfile() instead of this function.
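
For example (sketch only; the directory prefix is arbitrary), mkstemp() both creates the file and returns an open descriptor, avoiding the race between name generation and file creation:

#include <stdlib.h>

int make_temp_file(void)
{
    char tmpl[] = "/tmp/exampleXXXXXX";   /* trailing XXXXXX is replaced in place */

    return mkstemp(tmpl);
}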

B.3.1.16 ulimit()

Applications are recommended to use getrlimit() or setrlimit() instead of this function.

B.3.1.17 utime()

Applications are recommended to use futimens() if a file descriptor for the file is open, otherwise utimensat(), instead of this function.
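
For example (sketch only), the equivalent of utime(path, NULL), which sets both timestamps to the current time, is:

#include <sys/stat.h>
#include <fcntl.h>

int touch_now(int fd, const char *path)
{
    if (fd != -1)
        return futimens(fd, NULL);               /* descriptor available */
    return utimensat(AT_FDCWD, path, NULL, 0);   /* otherwise use the path */
}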

B.3.2 System Interfaces Removed in the Previous Version

The functions and symbols removed in Issue 7 (from the Issue 6 base document) were as follows:

Removed Functions and Symbols in Issue 7


bcmp()
bcopy()
bsd_signal()
bzero()
ecvt()
fcvt()
ftime()
gcvt()
getcontext()
 


gethostbyaddr()
gethostbyname()
getwd()
h_errno
index()
makecontext()
mktemp()
pthread_attr_getstackaddr()
pthread_attr_setstackaddr()
 


rindex()
scalb()
setcontext()
swapcontext()
ualarm()
usleep()
vfork()
wcswcs()
 

B.3.3 Examples for Spawn

The following long examples are provided in the Rationale (Informative) volume of POSIX.1-2024 as a supplement to the reference page for posix_spawn().

Example Library Implementation of Spawn

The posix_spawn() or posix_spawnp() functions provide the following:

The posix_spawn() or posix_spawnp() functions do not cover every possible use of the fork() function, but they do span the common applications: typical use by a shell and a login utility.

The price for an application is that before it calls posix_spawn() or posix_spawnp(), the parent must adjust to a state that posix_spawn() or posix_spawnp() can map to the desired state for the child. Environment changes require the parent to save some of its state and restore it afterwards. The example below demonstrates an initial approach to implementing posix_spawn() using other POSIX operations, although an actual implementation will need to be more robust at handling all possible filenames.

#include <sys/types.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <fcntl.h>
#include <signal.h>
#include <errno.h>
#include <string.h>

/* #include <spawn.h> */

/*******************************************/
/* Things that could be defined in spawn.h */
/*******************************************/

typedef struct
{
    short posix_attr_flags;
#define POSIX_SPAWN_SETPGROUP      0x1
#define POSIX_SPAWN_SETSIGMASK     0x2
#define POSIX_SPAWN_SETSIGDEF      0x4
#define POSIX_SPAWN_SETSCHEDULER   0x8
#define POSIX_SPAWN_SETSCHEDPARAM  0x10
#define POSIX_SPAWN_RESETIDS       0x20
#define POSIX_SPAWN_SETSID         0x40
    pid_t posix_attr_pgroup;
    sigset_t posix_attr_sigmask;
    sigset_t posix_attr_sigdefault;
    int posix_attr_schedpolicy;
    struct sched_param posix_attr_schedparam;
} posix_spawnattr_t;

typedef char *posix_spawn_file_actions_t;

int posix_spawn_file_actions_init(
        posix_spawn_file_actions_t *file_actions);
int posix_spawn_file_actions_destroy(
        posix_spawn_file_actions_t *file_actions);
int posix_spawn_file_actions_addchdir(
        posix_spawn_file_actions_t *restrict file_actions,
        const char *restrict path);
int posix_spawn_file_actions_addclose(
        posix_spawn_file_actions_t *file_actions, int fildes);
int posix_spawn_file_actions_adddup2(
        posix_spawn_file_actions_t *file_actions, int fildes, int newfildes);
int posix_spawn_file_actions_addfchdir(
        posix_spawn_file_actions_t *file_actions, int fildes);
int posix_spawn_file_actions_addopen(
        posix_spawn_file_actions_t *file_actions, int fildes,
        const char *path, int oflag, mode_t mode);
int posix_spawnattr_init(posix_spawnattr_t *attr);
int posix_spawnattr_destroy(posix_spawnattr_t *attr);
int posix_spawnattr_getflags(const posix_spawnattr_t *attr, short *flags);
int posix_spawnattr_setflags(posix_spawnattr_t *attr, short flags);
int posix_spawnattr_getpgroup(const posix_spawnattr_t *attr, pid_t *pgroup);
int posix_spawnattr_setpgroup(posix_spawnattr_t *attr, pid_t pgroup);
int posix_spawnattr_getschedpolicy(const posix_spawnattr_t *attr,
        int *schedpolicy);
int posix_spawnattr_setschedpolicy(posix_spawnattr_t *attr, int schedpolicy);
int posix_spawnattr_getschedparam(const posix_spawnattr_t *attr,
        struct sched_param *schedparam);
int posix_spawnattr_setschedparam(posix_spawnattr_t *attr,
        const struct sched_param *schedparam);
int posix_spawnattr_getsigmask(const posix_spawnattr_t *attr,
        sigset_t *sigmask);
int posix_spawnattr_setsigmask(posix_spawnattr_t *attr,
        const sigset_t *sigmask);
int posix_spawnattr_getsigdefault(const posix_spawnattr_t *attr,
        sigset_t *sigdefault);
int posix_spawnattr_setsigdefault(posix_spawnattr_t *attr,
        const sigset_t *sigdefault);
int posix_spawn(pid_t *pid, const char *path,
        const posix_spawn_file_actions_t *file_actions,
        const posix_spawnattr_t *attrp,
        char *const argv[], char *const envp[]);
int posix_spawnp(pid_t *pid, const char *file,
        const posix_spawn_file_actions_t *file_actions,
        const posix_spawnattr_t *attrp,
        char *const argv[], char *const envp[]);

/*****************************************/
/* Example posix_spawn() library routine */
/*****************************************/

int posix_spawn(pid_t *pid, const char *path,
        const posix_spawn_file_actions_t *file_actions,
        const posix_spawnattr_t *attrp,
        char *const argv[], char *const envp[])
{
    /* Create process */
    if ((*pid = fork()) == (pid_t)0)
    {
        /* This is the child process */

        /* Handle creating a new session */
        if (attrp->posix_attr_flags & POSIX_SPAWN_SETSID)
        {
            /* Create a new session */
            if (setsid() == -1)
            {
                /* Failed */
                _exit(127);
            }
        }

        /* Handle process group */
        if (attrp->posix_attr_flags & POSIX_SPAWN_SETPGROUP)
        {
            /* Override inherited process group */
            if (setpgid(0, attrp->posix_attr_pgroup) != 0)
            {
                /* Failed */
                _exit(127);
            }
        }

        /* Handle thread signal mask */
        if (attrp->posix_attr_flags & POSIX_SPAWN_SETSIGMASK)
        {
            /* Set the signal mask (cannot fail) */
            sigprocmask(SIG_SETMASK, &attrp->posix_attr_sigmask, NULL);
        }

        /* Handle resetting effective user and group IDs */
        if (attrp->posix_attr_flags & POSIX_SPAWN_RESETIDS)
        {
            /* None of these can fail for this case. */
            setuid(getuid());
            setgid(getgid());
        }

        /* Handle defaulted signals */
        if (attrp->posix_attr_flags & POSIX_SPAWN_SETSIGDEF)
        {
            struct sigaction deflt;
            sigset_t all_signals;
            int s;

            /* Construct default signal action */
            deflt.sa_handler = SIG_DFL;
            deflt.sa_flags = 0;

            /* Construct the set of all signals */
            sigfillset(&all_signals);

            /* Loop for all signals */
            for (s = 1; sigismember(&all_signals, s); s++)
            {
                /* Signal to be defaulted? */
                if (sigismember(&attrp->posix_attr_sigdefault, s))
                {
                    /* Yes; default this signal */
                    if (sigaction(s, &deflt, NULL) == -1)
                    {
                        /* Failed */
                        _exit(127);
                    }
                }
            }
        }

        /* Handle the fds if they are to be mapped */
        if (file_actions != NULL)
        {
            /* Loop for all actions in object file_actions */
            /* (implementation dives beneath abstraction)  */
            char *p = *file_actions;

            while (*p != '\0')
            {
                if (strncmp(p, "close(", 6) == 0)
                {
                    int fd;

                    if (sscanf(p + 6, "%d)", &fd) != 1)
                    {
                        _exit(127);
                    }
                    if (close(fd) == -1 && errno != EBADF)
                        _exit(127);
                }
                else if (strncmp(p, "dup2(", 5) == 0)
                {
                    int fd, newfd;

                    if (sscanf(p + 5, "%d,%d)", &fd, &newfd) != 2)
                    {
                        _exit(127);
                    }
                    if (fd == newfd)
                    {
                        int flags = fcntl(fd, F_GETFD);

                        if (flags == -1)
                            _exit(127);
                        flags &= ~FD_CLOEXEC;
                        if (fcntl(fd, F_SETFD, flags) == -1)
                            _exit(127);
                    }
                    else if (dup2(fd, newfd) == -1)
                        _exit(127);
                }
                else if (strncmp(p, "open(", 5) == 0)
                {
                    int fd, oflag;
                    mode_t mode;
                    int tempfd;
                    char path[1000]; /* Should be dynamic */
                    char *q;

                    if (sscanf(p + 5, "%d,", &fd) != 1)
                    {
                        _exit(127);
                    }
                    p = strchr(p, ',') + 1;
                    q = strchr(p, '*');
                    if (q == NULL)
                        _exit(127);
                    strncpy(path, p, q - p);
                    path[q - p] = '\0';
                    if (sscanf(q + 1, "%o,%o)", &oflag, &mode) != 2)
                    {
                        _exit(127);
                    }
                    if (close(fd) == -1)
                    {
                        if (errno != EBADF)
                            _exit(127);
                    }
                    tempfd = open(path, oflag, mode);
                    if (tempfd == -1)
                        _exit(127);
                    if (tempfd != fd)
                    {
                        if (dup2(tempfd, fd) == -1)
                        {
                            _exit(127);
                        }
                        if (close(tempfd) == -1)
                        {
                            _exit(127);
                        }
                    }
                }
                else if (strncmp(p, "chdir(", 6) == 0)
                {
                    char path[1000]; /* Should be dynamic */
                    char *q;

                    p += 6;
                    q = strchr(p, '*');
                    if (q == NULL)
                        _exit(127);
                    strncpy(path, p, q - p);
                    path[q - p] = '\0';
                    if (chdir(path) == -1)
                        _exit(127);
                }
                else if (strncmp(p, "fchdir(", 7) == 0)
                {
                    int fd;

                    if (sscanf(p + 7, "%d)", &fd) != 1)
                        _exit(127);
                    if (fchdir(fd) == -1)
                        _exit(127);
                }
                else
                {
                    _exit(127);
                }
                p = strchr(p, ')') + 1;
            }
        }

        /* Handle setting new scheduling policy and parameters */
        if (attrp->posix_attr_flags & POSIX_SPAWN_SETSCHEDULER)
        {
            if (sched_setscheduler(0, attrp->posix_attr_schedpolicy,
                    &attrp->posix_attr_schedparam) == -1)
            {
                _exit(127);
            }
        }

        /* Handle setting only new scheduling parameters */
        if (attrp->posix_attr_flags & POSIX_SPAWN_SETSCHEDPARAM)
        {
            if (sched_setparam(0, &attrp->posix_attr_schedparam) == -1)
            {
                _exit(127);
            }
        }

        /* Now execute the program at path */
        /* Any fd that still has FD_CLOEXEC set will be closed */
        execve(path, argv, envp);
        _exit(127); /* exec failed */
    }
    else
    {
        /* This is the parent (calling) process */
        if (*pid == (pid_t)-1)
            return errno;
        return 0;
    }
}

/*******************************************************/
/* Here is a crude but effective implementation of the */
/* file action object operators which store actions as */
/* concatenated token-separated strings.               */
/*******************************************************/

/* Create object with no actions. */
int posix_spawn_file_actions_init(
        posix_spawn_file_actions_t *file_actions)
{
    *file_actions = malloc(sizeof(char));
    if (*file_actions == NULL)
        return ENOMEM;
    strcpy(*file_actions, "");
    return 0;
}

/* Free object storage and make invalid. */
int posix_spawn_file_actions_destroy(
        posix_spawn_file_actions_t *file_actions)
{
    free(*file_actions);
    *file_actions = NULL;
    return 0;
}

/* Add a new action string to object. */
static int add_to_file_actions(
        posix_spawn_file_actions_t *file_actions, char *new_action)
{
    *file_actions = realloc(*file_actions,
            strlen(*file_actions) + strlen(new_action) + 1);
    if (*file_actions == NULL)
        return ENOMEM;
    strcat(*file_actions, new_action);
    return 0;
}

/* Add a chdir action to object. */
int posix_spawn_file_actions_addchdir(
        posix_spawn_file_actions_t *restrict file_actions,
        const char *restrict path)
{
    char temp[100];

    sprintf(temp, "chdir(%s*)", path);
    return add_to_file_actions(file_actions, temp);
}

/* Add a close action to object. */
int posix_spawn_file_actions_addclose(
        posix_spawn_file_actions_t *file_actions, int fildes)
{
    char temp[100];

    sprintf(temp, "close(%d)", fildes);
    return add_to_file_actions(file_actions, temp);
}

/* Add a dup2 action to object. */
int posix_spawn_file_actions_adddup2(
        posix_spawn_file_actions_t *file_actions,
        int fildes, int newfildes)
{
    char temp[100];

    sprintf(temp, "dup2(%d,%d)", fildes, newfildes);
    return add_to_file_actions(file_actions, temp);
}

/* Add a fchdir action to object. */
int posix_spawn_file_actions_addfchdir(
        posix_spawn_file_actions_t *file_actions, int fildes)
{
    char temp[100];

    sprintf(temp, "fchdir(%d)", fildes);
    return add_to_file_actions(file_actions, temp);
}

/* Add an open action to object. */
int posix_spawn_file_actions_addopen(
        posix_spawn_file_actions_t *file_actions,
        int fildes, const char *path, int oflag, mode_t mode)
{
    char temp[100];

    sprintf(temp, "open(%d,%s*%o,%o)", fildes, path, oflag, mode);
    return add_to_file_actions(file_actions, temp);
}

/*******************************************************/
/* Here is a crude but effective implementation of the */
/* spawn attributes object functions which manipulate  */
/* the individual attributes.                          */
/*******************************************************/

/* Initialize object with default values. */
int posix_spawnattr_init(posix_spawnattr_t *attr)
{
    attr->posix_attr_flags = 0;
    attr->posix_attr_pgroup = 0;
    /* Default value of signal mask is the parent's signal mask; */
    /* other values are also allowed                             */
    sigprocmask(0, NULL, &attr->posix_attr_sigmask);
    sigemptyset(&attr->posix_attr_sigdefault);
    /* Default values of scheduling attr inherited from the parent; */
    /* other values are also allowed                                */
    attr->posix_attr_schedpolicy = sched_getscheduler(0);
    sched_getparam(0, &attr->posix_attr_schedparam);
    return 0;
}

int posix_spawnattr_destroy(posix_spawnattr_t *attr)
{
    /* No action needed */
    return 0;
}

int posix_spawnattr_getflags(const posix_spawnattr_t *attr,
        short *flags)
{
    *flags = attr->posix_attr_flags;
    return 0;
}

int posix_spawnattr_setflags(posix_spawnattr_t *attr, short flags)
{
    attr->posix_attr_flags = flags;
    return 0;
}

int posix_spawnattr_getpgroup(const posix_spawnattr_t *attr,
        pid_t *pgroup)
{
    *pgroup = attr->posix_attr_pgroup;
    return 0;
}

int posix_spawnattr_setpgroup(posix_spawnattr_t *attr, pid_t pgroup)
{
    attr->posix_attr_pgroup = pgroup;
    return 0;
}

int posix_spawnattr_getschedpolicy(const posix_spawnattr_t *attr,
        int *schedpolicy)
{
    *schedpolicy = attr->posix_attr_schedpolicy;
    return 0;
}

int posix_spawnattr_setschedpolicy(posix_spawnattr_t *attr,
        int schedpolicy)
{
    attr->posix_attr_schedpolicy = schedpolicy;
    return 0;
}

int posix_spawnattr_getschedparam(const posix_spawnattr_t *attr,
        struct sched_param *schedparam)
{
    *schedparam = attr->posix_attr_schedparam;
    return 0;
}

int posix_spawnattr_setschedparam(posix_spawnattr_t *attr,
        const struct sched_param *schedparam)
{
    attr->posix_attr_schedparam = *schedparam;
    return 0;
}

int posix_spawnattr_getsigmask(const posix_spawnattr_t *attr,
        sigset_t *sigmask)
{
    *sigmask = attr->posix_attr_sigmask;
    return 0;
}

int posix_spawnattr_setsigmask(posix_spawnattr_t *attr,
        const sigset_t *sigmask)
{
    attr->posix_attr_sigmask = *sigmask;
    return 0;
}

int posix_spawnattr_getsigdefault(const posix_spawnattr_t *attr,
        sigset_t *sigdefault)
{
    *sigdefault = attr->posix_attr_sigdefault;
    return 0;
}

int posix_spawnattr_setsigdefault(posix_spawnattr_t *attr,
        const sigset_t *sigdefault)
{
    attr->posix_attr_sigdefault = *sigdefault;
    return 0;
}
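
As a usage illustration (separate from the example implementation above), a shell-like caller that catches SIGINT and SIGQUIT itself, and therefore wants them reset to their default dispositions in a child that starts with no signals blocked, might use the attributes object as in the following sketch; the helper name and the use of the caller's environment are illustrative assumptions rather than requirements of this standard:

#include <spawn.h>
#include <signal.h>
#include <sys/types.h>

extern char **environ;

/* Hypothetical helper: spawn a command roughly the way a simple     */
/* shell might, with SIGINT and SIGQUIT (which the shell catches)    */
/* reset to their default dispositions and no signals blocked.       */
int shell_like_spawn(pid_t *child, char *const cmd_argv[])
{
    posix_spawnattr_t attr;
    sigset_t def, mask;
    int err;

    posix_spawnattr_init(&attr);
    sigemptyset(&def);
    sigaddset(&def, SIGINT);           /* signals the shell catches ... */
    sigaddset(&def, SIGQUIT);          /* ... are defaulted in the child */
    sigemptyset(&mask);                /* block nothing in the child */
    posix_spawnattr_setsigdefault(&attr, &def);
    posix_spawnattr_setsigmask(&attr, &mask);
    posix_spawnattr_setflags(&attr,
            POSIX_SPAWN_SETSIGDEF | POSIX_SPAWN_SETSIGMASK);

    /* Search PATH for cmd_argv[0], passing the caller's environment */
    err = posix_spawnp(child, cmd_argv[0], NULL, &attr,
            cmd_argv, environ);

    posix_spawnattr_destroy(&attr);
    return err;      /* 0 on success, an error number on failure */
}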
I/O Redirection with Spawn

I/O redirection with posix_spawn() or posix_spawnp() is accomplished by crafting a file_actions argument that effects the desired redirection, following the general outline of this example:

/* To redirect new standard output (fd 1) to a file, */
/* and redirect new standard input (fd 0) from my fd socket_pair[1], */
/* and close my fd socket_pair[0] in the new process. */
posix_spawn_file_actions_t file_actions;
posix_spawn_file_actions_init(&file_actions);
posix_spawn_file_actions_addopen(&file_actions, 1, "newout", ...);
posix_spawn_file_actions_adddup2(&file_actions, socket_pair[1], 0);
posix_spawn_file_actions_addclose(&file_actions, socket_pair[0]);
posix_spawn_file_actions_addclose(&file_actions, socket_pair[1]);
posix_spawn(..., &file_actions, ...);
posix_spawn_file_actions_destroy(&file_actions);
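
A fuller version of this outline, with the elided arguments replaced by placeholder values and with error checking added, might look like the following sketch; the helper name, output file name, open flags, and file mode shown here are illustrative assumptions rather than requirements:

#include <spawn.h>
#include <fcntl.h>
#include <sys/types.h>

extern char **environ;

/* Hypothetical helper: run "prog_argv" with its standard output     */
/* redirected to the file "newout", its standard input taken from    */
/* socket_pair[1], and both socket descriptors closed in the child.  */
int spawn_redirected(pid_t *child, char *const prog_argv[],
        const int socket_pair[2])
{
    posix_spawn_file_actions_t file_actions;
    int err;

    if ((err = posix_spawn_file_actions_init(&file_actions)) != 0)
        return err;

    if ((err = posix_spawn_file_actions_addopen(&file_actions, 1,
                "newout", O_WRONLY | O_CREAT | O_TRUNC, 0644)) == 0 &&
        (err = posix_spawn_file_actions_adddup2(&file_actions,
                socket_pair[1], 0)) == 0 &&
        (err = posix_spawn_file_actions_addclose(&file_actions,
                socket_pair[0])) == 0 &&
        (err = posix_spawn_file_actions_addclose(&file_actions,
                socket_pair[1])) == 0)
    {
        err = posix_spawnp(child, prog_argv[0], &file_actions, NULL,
                prog_argv, environ);
    }

    posix_spawn_file_actions_destroy(&file_actions);
    return err;      /* 0 on success, an error number on failure */
}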
Spawning a Process Under a New User ID

Spawning a process under a new user ID follows the outline shown in the following example:

Save = getuid();
setuid(newid);
posix_spawn(...);
setuid(Save);
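
A minimal sketch of this outline with error checking added is shown below. The helper name, user ID, program path, and argument vector are placeholders; the caller needs appropriate privileges for the setuid() calls, and a process that changes all three of its user IDs with setuid() may be unable to switch back, in which case changing only the effective user ID (for example with seteuid()) may be more suitable:

#include <spawn.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>

extern char **environ;

/* Hypothetical helper following the outline above: spawn "path"     */
/* under user ID "newid", then attempt to restore the saved user ID. */
int spawn_as_user(pid_t *child, uid_t newid, const char *path,
        char *const argv[])
{
    uid_t save = getuid();
    int err;

    if (setuid(newid) == -1)
        return errno;

    err = posix_spawn(child, path, NULL, NULL, argv, environ);

    if (setuid(save) == -1 && err == 0)
        err = errno;               /* report a failure to switch back */
    return err;                    /* 0 on success, an error number on failure */
}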

Footnotes

1. An historical term meaning: "An opaque object, or token, of determinate size, whose significance is known only to the entity which created it. An entity receiving such a token from the generating entity may only make such use of the 'cookie' as is defined and permitted by the supplying entity."