Rationale

The Open Group Base Specifications Issue 6
IEEE Std 1003.1, 2004 Edition
Copyright © 2001-2004 The IEEE and The Open Group

B.2 General Information

B.2.1 Use and Implementation of Functions

The information concerning the use of functions was adapted from a description in the ISO C standard. Here is an example of how an application program can protect itself from functions that may or may not be macros, rather than true functions:

The atoi() function may be used in any of several ways:

By use of its associated header (possibly generating a macro expansion):
```
#include <stdlib.h>
/* ... */
i = atoi(str);
```

By use of its associated header (assuredly generating a true function call):

```
#include <stdlib.h>
#undef atoi
/* ... */
i = atoi(str);
```

or:

```
#include <stdlib.h>
/* ... */
i = (atoi) (str);
```

By explicit declaration:

```
extern int atoi (const char *);
/* ... */
i = atoi(str);
```

By implicit declaration:
```
/* ... */
i = atoi(str);
```
(Assuming no function prototype is in scope. This is not allowed by the ISO C standard for functions with variable arguments; furthermore, parameter type conversion "widening" is subject to different rules in this case.)

Note that the ISO C standard reserves names starting with '_' for the compiler. Therefore, the compiler could, for example, implement an intrinsic, built-in function _asm_builtin_atoi(), which it recognized and expanded into inline assembly code. Then, in <stdlib.h>, there could be the following:

```
#define atoi(X) _asm_builtin_atoi(X)
```

The user's "normal" call to atoi() would then be expanded inline, but the implementor would also be required to provide a callable function named atoi() for use when the application requires it; for example, if its address is to be stored in a function pointer variable.
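The calling forms described above can be compared side by side. The following is a minimal sketch rather than text from the standard: the helper name all_forms_agree() is hypothetical, and whether <stdlib.h> actually defines atoi as a macro is up to the implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: converts str through each calling form and
 * reports whether every form produced the same result. */
static int all_forms_agree(const char *str)
{
    /* A function-like macro is expanded only when its name is followed
     * by '(', so this initializer always names the true function. */
    int (*fn)(const char *) = atoi;

    int via_header  = atoi(str);    /* may be a macro expansion */
    int via_parens  = (atoi)(str);  /* parentheses suppress any macro */
    int via_pointer = fn(str);      /* needs a callable atoi() to exist */

    assert(via_header == via_parens);
    return via_parens == via_pointer;
}
```

A call such as all_forms_agree("42") is expected to return non-zero on any conforming implementation, because the macro and the true function must behave identically.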

B.2.2 The Compilation Environment

POSIX.1 Symbols

This and the following section address the issue of "name space pollution". The ISO C standard requires that the name space beyond what it reserves not be altered except by explicit action of the application writer. This section defines the actions to add the POSIX.1 symbols for those headers where both the ISO C standard and POSIX.1 need to define symbols, and also where the XSI extension extends the base standard.

When headers are used to provide symbols, there is a potential for introducing symbols that the application writer cannot predict. Ideally, each header should only contain one set of symbols, but this is not practical for historical reasons. Thus, the concept of feature test macros is included. Two feature test macros are explicitly defined by IEEE Std 1003.1-2001; it is expected that future revisions may add to this.

Note: Feature test macros allow an application to announce to the implementation its desire to have certain symbols and prototypes exposed. They should not be confused with the version test macros and constants for options in <unistd.h>, which are the implementation's way of announcing functionality to the application.
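The two directions of announcement can be seen in one translation unit. This is an illustrative sketch only; the function name implementation_supports() is hypothetical, and the value reported by _POSIX_VERSION depends on the implementation.

```c
/* Feature test macro: the application tells the implementation which
 * revision's name space it wants. It must precede every #include. */
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: reports whether the implementation announces
 * support for at least the given POSIX.1 revision. */
static int implementation_supports(long revision)
{
    assert(revision > 0);
#ifdef _POSIX_VERSION
    /* Version test macro: the implementation tells the application
     * which revision it provides. */
    return _POSIX_VERSION >= revision;
#else
    return 0;
#endif
}
```

On an implementation conforming to IEEE Std 1003.1-2001 or later, implementation_supports(200112L) would be expected to return non-zero.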

It is further intended that these feature test macros apply only to the headers specified by IEEE Std 1003.1-2001. Implementations are expressly permitted to make visible symbols not specified by IEEE Std 1003.1-2001, within both POSIX.1 and other headers, under the control of feature test macros that are not defined by IEEE Std 1003.1-2001.

The _POSIX_C_SOURCE Feature Test Macro

Since _POSIX_SOURCE specified by the POSIX.1-1990 standard did not have a value associated with it, the _POSIX_C_SOURCE macro replaces it, allowing an application to inform the system of the revision of the standard to which it conforms. This symbol will allow implementations to support various revisions of IEEE Std 1003.1-2001 simultaneously. For instance, when either _POSIX_SOURCE is defined or _POSIX_C_SOURCE is defined as 1, the system should make visible the same name space as permitted and required by the POSIX.1-1990 standard. When _POSIX_C_SOURCE is defined, the state of _POSIX_SOURCE is completely irrelevant.

It is expected that C bindings to future POSIX standards will define new values for _POSIX_C_SOURCE, with each new value reserving the name space for that new standard, plus all earlier POSIX standards.

The _XOPEN_SOURCE Feature Test Macro

The feature test macro _XOPEN_SOURCE is provided as the announcement mechanism for the application that it requires functionality from the Single UNIX Specification. _XOPEN_SOURCE must be defined to the value 600 before the inclusion of any header to enable the functionality in the Single UNIX Specification. Its definition subsumes the use of _POSIX_SOURCE and _POSIX_C_SOURCE.

An extract of code from a conforming application, that appears before any #include statements, is given below:

```
#define _XOPEN_SOURCE 600 /* Single UNIX Specification, Version 3 */

#include ...
```

Note that the definition of _XOPEN_SOURCE with the value 600 makes the definition of _POSIX_C_SOURCE redundant and it can safely be omitted.
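As a hedged illustration of the effect, the sketch below requests the Single UNIX Specification name space and then uses strdup(), which Issue 6 marks as an XSI extension. The helper name xsi_copy_matches() is hypothetical.

```c
#define _XOPEN_SOURCE 600  /* must precede every #include */

#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicates s with strdup() and reports whether
 * the copy compares equal. Returns -1 if the allocation fails. */
static int xsi_copy_matches(const char *s)
{
    char *copy;
    int same;

    assert(s != NULL);
    copy = strdup(s);          /* visible because _XOPEN_SOURCE is 600 */
    if (copy == NULL)
        return -1;
    same = (strcmp(copy, s) == 0);
    free(copy);
    return same;
}
```

No definition of _POSIX_C_SOURCE appears in the sketch, consistent with the statement that _XOPEN_SOURCE subsumes it.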

The Name Space

The reservation of identifiers is paraphrased from the ISO C standard. The text is included because it needs to be part of IEEE Std 1003.1-2001, regardless of possible changes in future versions of the ISO C standard.

These identifiers may be used by implementations, particularly for feature test macros. Implementations should not use feature test macro names that might be reasonably used by a standard.

Including headers more than once is a reasonably common practice, and it should be carried forward from the ISO C standard. More significantly, having definitions in more than one header is explicitly permitted. Where the potential declaration is "benign" (the same definition twice) the declaration can be repeated, if that is permitted by the compiler. (This is usually true of macros, for example.) In those situations where a repetition is not benign (for example, typedefs), conditional compilation must be used. The situation actually occurs both within the ISO C standard and within POSIX.1: time_t should be in <sys/types.h>, and the ISO C standard mandates that it be in <time.h>.
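The difference between benign and non-benign repetition can be sketched as follows. All names here (EXAMPLE_LIMIT, EXAMPLE_TIME_T_DEFINED, example_time_t) are hypothetical stand-ins, and the two guarded regions represent the same text appearing in two different headers. Later revisions of the ISO C standard relax the rule for identical typedefs, so the guard matters mainly for the compilers contemporary with this text.

```c
#include <assert.h>

/* Benign repetition: ISO C permits redefining a macro when the
 * replacement list is identical. */
#define EXAMPLE_LIMIT 64
#define EXAMPLE_LIMIT 64

/* Non-benign repetition: guard the typedef with conditional
 * compilation so that only the first copy is seen. */
#ifndef EXAMPLE_TIME_T_DEFINED
#define EXAMPLE_TIME_T_DEFINED
typedef long example_time_t;
#endif

/* The same guarded text, as it would appear in a second header. */
#ifndef EXAMPLE_TIME_T_DEFINED
#define EXAMPLE_TIME_T_DEFINED
typedef long example_time_t;
#endif

/* Uses both definitions to show that they remain consistent. */
static example_time_t example_limit(void)
{
    example_time_t t = EXAMPLE_LIMIT;
    assert(t == 64);
    return t;
}
```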

The area of name space pollution versus additions to structures is difficult because of the macro structure of C. The following discussion summarizes all the various problems with and objections to the issue.

Note the phrase "user-defined macro". Users are not permitted to define macro names (or any other name) beginning with "_[A-Z_]". Thus, the conflict cannot occur for symbols reserved to the vendor's name space, and the permission to add fields automatically applies, without qualification, to those symbols.

Data structures (and unions) need to be defined in headers by implementations to meet certain requirements of POSIX.1 and the ISO C standard.
The structures defined by POSIX.1 are typically minimal, and any practical implementation would wish to add fields to these structures either to hold additional related information or for backwards-compatibility (or both). Future standards (and de facto standards) would also wish to add to these structures. Issues of field alignment make it impractical (at least in the general case) to simply omit fields when they are not defined by the particular standard involved.

The dirent structure is an example of such a minimal structure (although one could argue about whether the other fields need visible names). The st_rdev field of most implementations' stat structure is a common example where extension is needed and where a conflict could occur.
Fields in structures are in an independent name space, so the addition of such fields presents no problem to the C language itself in that such names cannot interact with identically named user symbols because access is qualified by the specific structure name.
There is an exception to this: macro processing is done at a lexical level. Thus, symbols added to a structure might be recognized as user-provided macro names at the location where the structure is declared. This only can occur if the user-provided name is declared as a macro before the header declaring the structure is included. The user's use of the name after the declaration cannot interfere with the structure because the symbol is hidden and only accessible through access to the structure. Presumably, the user would not declare such a macro if there was an intention to use that field name.
Macros from the same or a related header might use the additional fields in the structure, and those field names might also collide with user macros. Although this is a less frequent occurrence, since macros are expanded at the point of use, no constraint on the order of use of names can apply.
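The lexical nature of the collision can be demonstrated directly. In this sketch the structure stands in for one declared by a system header, and the names example_rec, extra, and hidden_name are all hypothetical.

```c
#include <assert.h>

/* A user macro defined before the "header" is included. */
#define extra hidden_name

/* Stands in for a structure declared by a system header. The
 * preprocessor rewrites the second member before the compiler sees it. */
struct example_rec {
    int id;
    int extra;   /* actually declares a member named hidden_name */
};

#undef extra

/* Reads the member under the name the macro gave it. */
static int renamed_field_value(void)
{
    struct example_rec r = { 1, 2 };

    assert(r.id == 1);
    return r.hidden_name;   /* compiles only because of the rewrite */
}
```

After the #undef, the member can no longer be reached as extra at all, which illustrates why a user macro seen before the structure declaration is the problematic case.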
An "obvious" solution of using names in the reserved name space and then redefining them as macros when they should be visible does not work because this has the effect of exporting the symbol into the general name space. For example, given a (hypothetical) system-provided header <h.h>, and two parts of a C program in a.c and b.c, in header <h.h>:
```
struct foo {
    int __i;
};

#ifdef _FEATURE_TEST
#define i __i
#endif
```
In file a.c:
```
#include <h.h>
extern int i;
...
```
In file b.c:
```
extern int i;
...
```
The symbol that the user thinks of as i in both files has an external name of __i in a.c; the same symbol i in b.c has an external name i (ignoring any hidden manipulations the compiler might perform on the names). This would cause a mysterious name resolution problem when a.o and b.o are linked.

Simply avoiding definition then causes alignment problems in the structure.

A structure of the form:
```
struct foo {
    union {
        int __i;
#ifdef _FEATURE_TEST
        int i;
#endif
    } __ii;
};
```
does not work because the name of the logical field i is __ii.i, and introduction of a macro to restore the logical name immediately reintroduces the problem discussed previously (although its manifestation might be more immediate because a syntax error would result if a recursive macro did not cause it to fail first).
A more workable solution would be to declare the structure:
```
struct foo {
#ifdef _FEATURE_TEST
    int i;
#else
    int __i;
#endif
};
```
However, if a macro (particularly one required by a standard) is to be defined that uses this field, two must be defined: one that uses i, the other that uses __i. If more than one additional field is used in a macro and they are conditional on distinct combinations of features, the complexity goes up as 2ⁿ.
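The duplication described above can be sketched for a single field. Every name here is hypothetical: EX_FEATURE_TEST plays the role of the feature test macro, and ex_hidden_i stands in for the reserved-name spelling, since an application example should not itself declare identifiers in the implementation's name space.

```c
#include <assert.h>

/* #define EX_FEATURE_TEST */   /* uncomment to expose the name i */

struct ex_foo {
#ifdef EX_FEATURE_TEST
    int i;
#else
    int ex_hidden_i;   /* stands in for the reserved spelling __i */
#endif
};

/* The accessor macro has to be written once per spelling. With n
 * independently conditional fields there would be 2^n variants. */
#ifdef EX_FEATURE_TEST
#define EX_FOO_FIELD(p) ((p)->i)
#else
#define EX_FOO_FIELD(p) ((p)->ex_hidden_i)
#endif

static void ex_foo_set(struct ex_foo *p, int value)
{
    assert(p != 0);
    EX_FOO_FIELD(p) = value;
}

static int ex_foo_get(const struct ex_foo *p)
{
    return EX_FOO_FIELD(p);
}
```

The structure has the same size and layout under either setting, which is the property that simply omitting the field would lose.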

All this leaves a difficult situation: vendors must provide very complex headers to deal with something that is conceptually simple and safe, namely adding a field to a structure. It is the possibility of user-provided macros with the same name that makes this difficult.

Several alternatives were proposed that involved constraining the user's access to part of the name space available to the user (as specified by the ISO C standard). In some cases, this was only until all the headers had been included. There were two proposals discussed that failed to achieve consensus:

Limiting it for the whole program.
Restricting the use of identifiers containing only uppercase letters until after all system headers had been included. It was also pointed out that because macros might wish to access fields of a structure (and macro expansion occurs totally at point of use) restricting names in this way would not protect the macro expansion, and thus the solution was inadequate.

It was finally decided that reservation of symbols would occur, but as constrained.

The current wording also allows the addition of fields to a structure, but requires that user macros of the same name not interfere. This allows vendors to do one of the following:

Not create the situation (do not extend the structures with user-accessible names, or use the conditional-declaration solution described above as "more workable")
Extend their compilers to allow some way of adding names to structures and macros safely

There are at least two ways that the compiler might be extended: add new preprocessor directives that turn off and on macro expansion for certain symbols (without changing the value of the macro) and a function or lexical operation that suppresses expansion of a word. The latter seems more flexible, particularly because it addresses the problem in macros as well as in declarations.

The following seems to be a possible implementation extension to the C language that will do this: any token that during macro expansion is found to be preceded by three '#' symbols shall not be further expanded in exactly the same way as described for macros that expand to their own name as in Section 3.8.3.4 of the ISO C standard. A vendor may also wish to implement this as an operation that is lexically a function, which might be implemented as:

```
#define __safe_name(x) ###x
```

Using a function notation would insulate vendors from changes in standards until such a functionality is standardized (if ever). Standardization of such a function would be valuable because it would then permit third parties to take advantage of it portably in software they may supply.

The symbols that are "explicitly permitted, but not required by IEEE Std 1003.1-2001" include those classified below. (That is, the symbols classified below might, but are not required to, be present when _POSIX_C_SOURCE is defined to have the value 200112L.)

Symbols in <limits.h> and <unistd.h> that are defined to indicate support for options or limits that are constant at compile-time
Symbols in the name space reserved for the implementation by the ISO C standard
Symbols in a name space reserved for a particular type of extension (for example, type names ending with _t in <sys/types.h>)
Additional members of structures or unions whose names do not reduce the name space reserved for applications

Since both implementations and future revisions of IEEE Std 1003.1 and other POSIX standards may use symbols in the reserved spaces described in these tables, there is a potential for name space clashes. To avoid future name space clashes when adding symbols, implementations should not use the posix_, POSIX_, or _POSIX_ prefixes.

IEEE Std 1003.1-2001/Cor 1-2002, item XSH/TC1/D6/2 is applied, deleting the entries POSIX_, _POSIX_, and posix_ from the column of allowed name space prefixes for use by an implementation in the first table. The presence of these prefixes was contradicting later text which states that: "The prefixes posix_, POSIX_, and _POSIX are reserved for use by Shell and Utilities volume of IEEE Std 1003.1-2001, Chapter 2, Shell Command Language and other POSIX standards. Implementations may add symbols to the headers shown in the following table, provided the identifiers ... do not use the reserved prefixes posix_, POSIX_, or _POSIX."

IEEE Std 1003.1-2001/Cor 1-2002, item XSH/TC1/D6/3 is applied, correcting the reserved macro prefix from: "PRI[a-z], SCN[a-z]" to: "PRI[Xa-z], SCN[Xa-z]" in the second table. The change was needed since the ISO C standard allows implementations to define macros of the form PRI or SCN followed by any lowercase letter or 'X' in <inttypes.h>. (The ISO/IEC 9899:1999 standard, Subclause 7.26.4.)

IEEE Std 1003.1-2001/Cor 1-2002, item XSH/TC1/D6/4 is applied, adding a new section listing reserved names for the <stdint.h> header. This change is for alignment with the ISO C standard.

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/2 is applied, making it clear that implementations are permitted to have symbols with the prefix _POSIX_ visible in any header.

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/3 is applied, updating the table of allowed macro prefixes to include the prefix FP_[A-Z] for <math.h>. This text is added for consistency with the <math.h> reference page in the Base Definitions volume of IEEE Std 1003.1-2001 which permits additional implementation-defined floating-point classifications.

B.2.3 Error Numbers

It was the consensus of the standard developers that to allow the conformance document to state that an error occurs and under what conditions, but to disallow a statement that it never occurs, does not make sense. It could be implied by the current wording that this is allowed, but to reduce the possibility of future interpretation requests, it is better to make an explicit statement.

The ISO C standard requires that errno be an assignable lvalue. Originally, the definition in POSIX.1 was stricter than that in the ISO C standard, extern int errno, in order to support historical usage. In a multi-threaded environment, implementing errno as a global variable produces non-deterministic results when it is accessed. It is required, however, that errno work as a per-thread error reporting mechanism. In order to do this, a separate errno value has to be maintained for each thread. The following section discusses the various alternative solutions that were considered.

In order to avoid this problem altogether for new functions, these functions avoid using errno and, instead, return the error number directly as the function return value; a return value of zero indicates that no error was detected.

For any function that can return errors, the function return value is not used for any purpose other than for reporting errors. Even when the output of the function is scalar, it is passed through a function argument. While it might have been possible to allow some scalar outputs to be coded as negative function return values and mixed in with positive error status returns, this was rejected: using the return value for a mixed purpose was judged to be of limited use and error prone.
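posix_memalign() is one interface that follows this convention: the error number is the return value, and the result is delivered through an argument. The sketch below assumes the implementation supports the option that provides this function; the helper name aligned_alloc_status() is hypothetical.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: attempts an aligned allocation and returns the
 * error number reported by posix_memalign() (zero means success). */
static int aligned_alloc_status(size_t alignment, size_t size)
{
    void *p = NULL;

    /* The output (the pointer) travels through the first argument;
     * the return value carries only the error number. */
    int err = posix_memalign(&p, alignment, size);

    if (err == 0) {
        assert(p != NULL || size == 0);
        free(p);
    }
    return err;
}
```

An alignment that is not a power-of-two multiple of sizeof(void *) is reported as [EINVAL] directly, without involving errno.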

Checking the value of errno alone is not sufficient to determine the existence or type of an error, since it is not required that a successful function call clear errno. The variable errno should only be examined when the return value of a function indicates that the value of errno is meaningful. In that case, the function is required to set the variable to something other than zero.

The variable errno is never set to zero by any function call; to do so would contradict the ISO C standard.
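The rule that errno is examined only after a failure indication can be sketched with open(). The helper name open_error() and the paths used to exercise it are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if path can be opened for reading,
 * otherwise the error number that open() stored in errno. */
static int open_error(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd == -1)
        return errno;   /* meaningful only because open() returned -1 */

    /* On success errno is not consulted: it may still hold a stale
     * value from an earlier call, since no function clears it. */
    close(fd);
    return 0;
}
```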

POSIX.1 requires (in the ERRORS sections of function descriptions) certain error values to be set in certain conditions because many existing applications depend on them. Some error numbers, such as [EFAULT], are entirely implementation-defined and are noted as such in their description in the ERRORS section. This section otherwise allows wide latitude to the implementation in handling error reporting.

Some of the ERRORS sections in IEEE Std 1003.1-2001 have two subsections. The first:

"The function shall fail if:"

could be called the "mandatory" section.

The second:

"The function may fail if:"

could be informally known as the "optional" section.

Attempting to infer the quality of an implementation based on whether it detects optional error conditions is not useful.

Following each one-word symbolic name for an error, there is a description of the error. The rationale for some of the symbolic names follows:

[ECANCELED]

This spelling was chosen as being more common.

[EFAULT]

Most historical implementations do not catch an error and set errno when an invalid address is given to the functions wait(), time(), or times(). Some implementations cannot reliably detect an invalid address. And most systems that detect invalid addresses will do so only for a system call, not for a library routine.

[EFTYPE]

This error code was proposed in earlier proposals as "Inappropriate operation for file type", meaning that the operation requested is not appropriate for the file specified in the function call. This code was proposed, although the same idea was covered by [ENOTTY], because the connotations of the name would be misleading. It was pointed out that the fcntl() function uses the error code [EINVAL] for this notion, and hence all instances of [EFTYPE] were changed to this code.

[EINTR]

POSIX.1 prohibits conforming implementations from restarting interrupted system calls of conforming applications unless the SA_RESTART flag is in effect for the signal. However, it does not require that [EINTR] be returned when another legitimate value may be substituted; for example, a partial transfer count when read() or write() is interrupted. [EINTR] is only given when the signal-catching function returns normally, as opposed to returning by mechanisms like longjmp() or siglongjmp().
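Because the implementation does not restart the call on the application's behalf, an application that wants to continue after an interruption has to retry explicitly. A minimal sketch follows; the helper name read_retry() is hypothetical.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: like read(), but retries when the call was
 * interrupted by a signal before any data was transferred. */
static ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL);
    do {
        n = read(fd, buf, len);
    } while (n == -1 && errno == EINTR);

    /* A partial transfer count is a normal return, not [EINTR]. */
    return n;
}
```

The loop handles only the [EINTR] case; a short count still has to be dealt with by the caller if a full transfer is required.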

[ELOOP]

In specifying conditions under which implementations would generate this error, the following goals were considered:

To ensure that actual loops are detected, including loops that result from symbolic links across distributed file systems.
To ensure that during pathname resolution an application can rely on the ability to follow at least {SYMLOOP_MAX} symbolic links in the absence of a loop.
To allow implementations to provide the capability of traversing more than {SYMLOOP_MAX} symbolic links in the absence of a loop.
To allow implementations to detect loops and generate the error prior to encountering {SYMLOOP_MAX} symbolic links.

[ENAMETOOLONG]

When a symbolic link is encountered during pathname resolution, the contents of that symbolic link are used to create a new pathname. The standard developers intended to allow, but not require, that implementations enforce the restriction of {PATH_MAX} on the result of this pathname substitution.

[ENOMEM]

The term "main memory" is not used in POSIX.1 because it is implementation-defined.

[ENOTSUP]

This error code is to be used when an implementation chooses to implement the required functionality of IEEE Std 1003.1-2001 but does not support optional facilities defined by IEEE Std 1003.1-2001. The return of [ENOSYS] is to be taken to indicate that the function of the interface is not supported at all; the function will always fail with this error code.

[ENOTTY]

The symbolic name for this error is derived from a time when device control was done by ioctl() and that operation was only permitted on a terminal interface. The term "TTY" is derived from "teletypewriter", the devices to which this error originally applied.

[EOVERFLOW]

Most of the uses of this error code are related to large file support. Typically, these cases occur on systems which support multiple programming environments with different sizes for off_t, but they may also occur in connection with remote file systems.

In addition, when different programming environments have different widths for types such as int and uid_t, several functions may encounter a condition where a value in a particular environment is too wide to be represented. In that case, this error should be raised. For example, suppose the currently running process has 64-bit int, and file descriptor 9223372036854775807 is open and does not have the close-on-exec flag set. If the process then uses execl() to exec a file compiled in a programming environment with 32-bit int, the call to execl() can fail with errno set to [EOVERFLOW]. A similar failure can occur with execl() if any of the user IDs or any of the group IDs to be assigned to the new process image are out of range for the executed file's programming environment.

Note, however, that this condition cannot occur for functions that are explicitly described as always being successful, such as getpid().

[EPIPE]

This condition normally generates the signal SIGPIPE; the error is returned if the signal does not terminate the process.
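The interaction between the signal and the error can be observed by ignoring SIGPIPE around a write to a pipe that has no reader. This is a sketch under that assumption; the helper name write_to_readerless_pipe() is hypothetical, and the previous signal disposition is restored before returning.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: writes to a pipe whose read end is closed and
 * returns the resulting error number (0 if the write succeeded,
 * -1 if the setup itself failed). */
static int write_to_readerless_pipe(void)
{
    struct sigaction ignore, saved;
    int fds[2];
    int err;

    if (pipe(fds) == -1)
        return -1;

    /* Ignore SIGPIPE so the signal does not terminate the process. */
    memset(&ignore, 0, sizeof ignore);
    ignore.sa_handler = SIG_IGN;
    sigemptyset(&ignore.sa_mask);
    if (sigaction(SIGPIPE, &ignore, &saved) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    close(fds[0]);                                   /* no reader remains */
    err = (write(fds[1], "x", 1) == -1) ? errno : 0;
    close(fds[1]);

    sigaction(SIGPIPE, &saved, NULL);                /* restore disposition */
    return err;
}
```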

[EROFS]

In historical implementations, attempting to unlink() or rmdir() a mount point would generate an [EBUSY] error. An implementation could be envisioned where such an operation could be performed without error. In this case, if either the directory entry or the actual data structures reside on a read-only file system, [EROFS] is the appropriate error to generate. (For example, changing the link count of a file on a read-only file system could not be done, as is required by unlink(), and thus an error should be reported.)

Three error numbers, [EDOM], [EILSEQ], and [ERANGE], were added to this section primarily for consistency with the ISO C standard.

Alternative Solutions for Per-Thread errno

The usual implementation of errno as a single global variable does not work in a multi-threaded environment. In such an environment, a thread may make a POSIX.1 call and get a -1 error return, but before that thread can check the value of errno, another thread might have made a second POSIX.1 call that also set errno. This behavior is unacceptable in robust programs. There were a number of alternatives that were considered for handling the errno problem:

Implement errno as a per-thread integer variable.
Implement errno as a service that can access the per-thread error number.
Change all POSIX.1 calls to accept an extra status argument and avoid setting errno.
Change all POSIX.1 calls to raise a language exception.

The first option offers the highest level of compatibility with existing practice but requires special support in the linker, compiler, and/or virtual memory system to support the new concept of thread private variables. When compared with current practice, the third and fourth options are much cleaner, more efficient, and encourage a more robust programming style, but they require new versions of all of the POSIX.1 functions that might detect an error. The second option offers compatibility with existing code that uses the <errno.h> header to define the symbol errno. In this option, errno may be a macro defined:

```
#define errno  (*__errno())
extern int      *__errno();
```

This option may be implemented as a per-thread variable whereby an errno field is allocated in the user space object representing a thread, and whereby the function __errno() makes a system call to determine the location of its user space object and returns the address of the errno field of that object. Another implementation, one that avoids calling the kernel, involves allocating stacks in chunks. The stack allocator keeps a side table indexed by chunk number containing a pointer to the thread object that uses that chunk. The __errno() function then looks at the stack pointer, determines the chunk number, and uses that as an index into the chunk table to find its thread object and thus its private value of errno. On most architectures, this can be done in four to five instructions. Some compilers may wish to implement __errno() inline to improve performance.
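The shape of the second option can be sketched in application-level C. This is not how any particular implementation does it: the names ex_errno, ex_errno_location(), and ex_fail_with() are hypothetical, and the C11 _Thread_local storage class is assumed here in place of the thread-object lookup described above.

```c
#include <assert.h>

/* One slot per thread; stands in for the errno field of the
 * user space object that represents a thread. */
static _Thread_local int ex_errno_slot;

/* Plays the role of __errno(): returns the calling thread's slot. */
static int *ex_errno_location(void)
{
    return &ex_errno_slot;
}

/* Dereferencing the returned pointer yields an assignable lvalue,
 * which is what the ISO C standard requires of errno. */
#define ex_errno (*ex_errno_location())

/* Hypothetical failing call: records code and returns -1. */
static int ex_fail_with(int code)
{
    assert(code != 0);
    ex_errno = code;
    return -1;
}
```

Existing source that reads or assigns the symbol keeps compiling unchanged, which is the compatibility property attributed to this option.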

Disallowing Return of the [EINTR] Error Code

Many blocking interfaces defined by IEEE Std 1003.1-2001 may return [EINTR] if interrupted during their execution by a signal handler. Blocking interfaces introduced under the Threads option do not have this property. Instead, they require that the interface appear to be atomic with respect to interruption. In particular, clients of blocking interfaces need not handle any possible [EINTR] return as a special case since it will never occur. If it is necessary to restart operations or complete incomplete operations following the execution of a signal handler, this is handled by the implementation, rather than by the application.

Requiring applications to handle [EINTR] errors on blocking interfaces has been shown to be a frequent source of often unreproducible bugs, and it adds no compelling value to the available functionality. Thus, blocking interfaces introduced for use by multi-threaded programs do not use this paradigm. In particular, in none of the functions flockfile(), pthread_cond_timedwait(), pthread_cond_wait(), pthread_join(), pthread_mutex_lock(), and sigwait() did providing [EINTR] returns add value, or even particularly make sense. Thus, these functions do not provide for an [EINTR] return, even when interrupted by a signal handler. The same arguments can be applied to sem_wait(), sem_trywait(), sigwaitinfo(), and sigtimedwait(), but implementations are permitted to return [EINTR] error codes for these functions for compatibility with earlier versions of IEEE Std 1003.1. Applications cannot rely on calls to these functions returning [EINTR] error codes when signals are delivered to the calling thread, but they should allow for the possibility.

Additional Error Numbers

The ISO C standard defines the name space for implementations to add additional error numbers.

B.2.4 Signal Concepts

Historical implementations of signals, using the signal() function, have shortcomings that make them unreliable for many application uses. Because of this, a new signal mechanism, based very closely on the one of 4.2 BSD and 4.3 BSD, was added to POSIX.1.

Signal Names

The restriction on the actual type used for sigset_t is intended to guarantee that these objects can always be assigned, have their address taken, and be passed as parameters by value. It is not intended that this type be a structure including pointers to other data structures, as that could impact the portability of applications performing such operations. A reasonable implementation could be a structure containing an array of some integer type.
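The three guaranteed operations on sigset_t objects can be sketched as follows. The helper names set_contains() and copy_keeps_members() are hypothetical.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <signal.h>

/* The set is passed by value; sigismember() is given the address
 * of the local copy. */
static int set_contains(sigset_t set, int signo)
{
    return sigismember(&set, signo) == 1;
}

/* Builds a set, copies it by plain assignment, and checks that the
 * copy has the same members as the original. */
static int copy_keeps_members(void)
{
    sigset_t original, copy;

    sigemptyset(&original);
    sigaddset(&original, SIGINT);

    copy = original;   /* assignment */

    assert(set_contains(original, SIGINT));
    return set_contains(copy, SIGINT) && !set_contains(copy, SIGTERM);
}
```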

The signals described in IEEE Std 1003.1-2001 must have unique values so that they may be named as parameters of case statements in the body of a C-language switch clause. However, implementation-defined signals may have values that overlap with each other or with signals specified in IEEE Std 1003.1-2001. An example of this is SIGABRT, which traditionally overlaps some other signal, such as SIGIOT.

SIGKILL, SIGTERM, SIGUSR1, and SIGUSR2 are ordinarily generated only through the explicit use of the kill() function, although some implementations generate SIGKILL under extraordinary circumstances. SIGTERM is traditionally the default signal sent by the kill command.

The signals SIGBUS, SIGEMT, SIGIOT, SIGTRAP, and SIGSYS were omitted from POSIX.1 because their behavior is implementation-defined and could not be adequately categorized. Conforming implementations may deliver these signals, but must document the circumstances under which they are delivered and note any restrictions concerning their delivery. The signals SIGFPE, SIGILL, and SIGSEGV are similar in that they also generally result only from programming errors. They were included in POSIX.1 because they do indicate three relatively well-categorized conditions. They are all defined by the ISO C standard and thus would have to be defined by any system with an ISO C standard binding, even if not explicitly included in POSIX.1.

There is very little that a Conforming POSIX.1 Application can do by catching, ignoring, or masking any of the signals SIGILL, SIGTRAP, SIGIOT, SIGEMT, SIGBUS, SIGSEGV, SIGSYS, or SIGFPE. They will generally be generated by the system only in cases of programming errors. While it may be desirable for some robust code (for example, a library routine) to be able to detect and recover from programming errors in other code, these signals are not nearly sufficient for that purpose. One portable use that does exist for these signals is that a command interpreter can recognize them as the cause of a process' termination (with wait()) and print an appropriate message. The mnemonic tags for these signals are derived from their PDP-11 origin.

The signals SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU, and SIGCONT are provided for job control and are unchanged from 4.2 BSD. The signal SIGCHLD is also typically used by job control shells to detect children that have terminated or, as in 4.2 BSD, stopped.

Some implementations, including System V, have a signal named SIGCLD, which is similar to SIGCHLD in 4.2 BSD. POSIX.1 permits implementations to have a single signal with both names. POSIX.1 carefully specifies ways in which conforming applications can avoid the semantic differences between the two different implementations. The name SIGCHLD was chosen for POSIX.1 because most current application usages of it can remain unchanged in conforming applications. SIGCLD in System V has more cases of semantics that POSIX.1 does not specify, and thus applications using it are more likely to require changes in addition to the name change.

The signals SIGUSR1 and SIGUSR2 are commonly used by applications for notification of exceptional behavior and are described as "reserved as application-defined" so that such use is not prohibited. Implementations should not generate SIGUSR1 or SIGUSR2, except when explicitly requested by kill(). It is recommended that libraries not use these two signals, as such use in libraries could interfere with their use by applications calling the libraries. If such use is unavoidable, it should be documented. It is prudent for non-portable libraries to use non-standard signals to avoid conflicts with use of standard signals by portable libraries.

There is no portable way for an application to catch or ignore non-standard signals. Some implementations define the range of signal numbers, so applications can install signal-catching functions for all of them. Unfortunately, implementation-defined signals often cause problems when caught or ignored by applications that do not understand the reason for the signal. While the desire exists for an application to be more robust by handling all possible signals (even those only generated by kill()), no existing mechanism was found to be sufficiently portable to include in POSIX.1. The value of such a mechanism, if included, would be diminished given that SIGKILL would still not be catchable.

A number of new signal numbers are reserved for applications because the two user signals defined by POSIX.1 are insufficient for many realtime applications. A range of signal numbers is specified, rather than an enumeration of additional reserved signal names, because different applications and application profiles will require a different number of application signals. It is not desirable to burden all application domains and therefore all implementations with the maximum number of signals required by all possible applications. Note that in this context, signal numbers are essentially different signal priorities.

The relatively small number of required additional signals, {_POSIX_RTSIG_MAX}, was chosen so as not to require an unreasonably large signal mask/set. While this number of signals defined in POSIX.1 will fit in a single 32-bit word signal mask, it is recognized that most existing implementations define many more signals than are specified in POSIX.1 and, in fact, many implementations have already exceeded 32 signals (including the "null signal"). Support of {_POSIX_RTSIG_MAX} additional signals may push some implementations over the single 32-bit word line, but is unlikely to push any implementations that are already over that line beyond the 64-signal line.

Signal Generation and Delivery

The terms defined in this section are not used consistently in documentation of historical systems. Each signal can be considered to have a lifetime beginning with generation and ending with delivery or acceptance. The POSIX.1 definition of "delivery" does not exclude ignored signals; this is considered a more consistent definition. The revised text in several parts of IEEE Std 1003.1-2001 clarifies the distinct semantics of asynchronous signal delivery and synchronous signal acceptance. The previous wording attempted to categorize both under the term "delivery", which led to conflicts over whether the effects of asynchronous signal delivery applied to synchronous signal acceptance.

Signals generated for a process are delivered to only one thread. Thus, if more than one thread is eligible to receive a signal, one has to be chosen. The choice of threads is left entirely up to the implementation both to allow the widest possible range of conforming implementations and to give implementations the freedom to deliver the signal to the "easiest possible" thread should there be differences in ease of delivery between different threads.

Note that should multiple delivery among cooperating threads be required by an application, this can be trivially constructed out of the provided single-delivery semantics. The construction of a sigwait_multiple() function that accomplishes this goal is presented with the rationale for sigwaitinfo().

Implementations should deliver unblocked signals as soon after they are generated as possible. However, it is difficult for POSIX.1 to make specific requirements about this, beyond those in kill() and sigprocmask(). Even on systems with prompt delivery, scheduling of higher priority processes is always likely to cause delays.

In general, the interval between the generation and delivery of unblocked signals cannot be detected by an application. Thus, references to pending signals generally apply to blocked, pending signals. An implementation registers a signal as pending on the process when no thread has the signal unblocked and there are no threads blocked in a sigwait() function for that signal. Thereafter, the implementation delivers the signal to the first thread that unblocks the signal or calls a sigwait() function on a signal set containing this signal rather than choosing the recipient thread at the time the signal is sent.

In the 4.3 BSD system, signals that are blocked and set to SIG_IGN are discarded immediately upon generation. For a signal that is ignored as its default action, if the action is SIG_DFL and the signal is blocked, a generated signal remains pending. In the 4.1 BSD system and in System V Release 3 (two other implementations that support a somewhat similar signal mechanism), all ignored blocked signals remain pending if generated. Because it is not normally useful for an application to simultaneously ignore and block the same signal, it was unnecessary for POSIX.1 to specify behavior that would invalidate any of the historical implementations.

There is one case in some historical implementations where an unblocked, pending signal does not remain pending until it is delivered. In the System V implementation of signal(), pending signals are discarded when the action is set to SIG_DFL or a signal-catching routine (as well as to SIG_IGN). Except in the case of setting SIGCHLD to SIG_DFL, implementations that do this do not conform completely to POSIX.1. Some earlier proposals for POSIX.1 explicitly stated this, but these statements were redundant due to the requirement that functions defined by POSIX.1 not change attributes of processes defined by POSIX.1 except as explicitly stated.

POSIX.1 specifically states that the order in which multiple, simultaneously pending signals are delivered is unspecified. This order has not been explicitly specified in historical implementations, but has remained quite consistent and been known to those familiar with the implementations. Thus, there have been cases where applications (usually system utilities) have been written with explicit or implicit dependencies on this order. Implementors and others porting existing applications may need to be aware of such dependencies.

When there are multiple pending signals that are not blocked, implementations should arrange for the delivery of all signals at once, if possible. Some implementations stack calls to all pending signal-catching routines, making it appear that each signal-catcher was interrupted by the next signal. In this case, the implementation should ensure that this stacking of signals does not violate the semantics of the signal masks established by sigaction(). Other implementations process at most one signal when the operating system is entered, with remaining signals saved for later delivery. Although this practice is widespread, this behavior is neither standardized nor endorsed. In either case, implementations should attempt to deliver signals associated with the current state of the process (for example, SIGFPE) before other signals, if possible.

In 4.2 BSD and 4.3 BSD, it is not permissible to ignore or explicitly block SIGCONT, because if blocking or ignoring this signal prevented it from continuing a stopped process, such a process could never be continued (only killed by SIGKILL). However, 4.2 BSD and 4.3 BSD do block SIGCONT during execution of its signal-catching function when it is caught, creating exactly this problem. A proposal was considered to disallow catching SIGCONT in addition to ignoring and blocking it, but this limitation led to objections. The consensus was to require that SIGCONT always continue a stopped process when generated. This removed the need to disallow ignoring or explicit blocking of the signal; note that SIG_IGN and SIG_DFL are equivalent for SIGCONT.

Realtime Signal Generation and Delivery

The Realtime Signals Extension option to POSIX.1 signal generation and delivery behavior is required for the following reasons:

The sigevent structure is used by other POSIX.1 functions that result in asynchronous event notifications to specify the notification mechanism to use and other information needed by the notification mechanism. IEEE Std 1003.1-2001 defines only three symbolic values for the notification mechanism:
- SIGEV_NONE is used to indicate that no notification is required when the event occurs. This is useful for applications that use asynchronous I/O with polling for completion.
- SIGEV_SIGNAL indicates that a signal is generated when the event occurs.
- SIGEV_THREAD provides for "callback functions" for asynchronous notifications done by a function call within the context of a new thread. This provides a multi-threaded process with a more natural means of notification than signals.
The primary difficulty with previous notification approaches has been to specify the environment of the notification routine.
- One approach is to limit the notification routine to call only functions permitted in a signal handler. While the list of permissible functions is clearly stated, this is overly restrictive.
- A second approach is to define a new list of functions or classes of functions that are explicitly permitted or not permitted. This would give a programmer more lists to deal with, which would be awkward.
- The third approach is to define completely the environment for execution of the notification function. A clear definition of an execution environment for notification is provided by executing the notification function in the environment of a newly created thread.
Implementations may support additional notification mechanisms by defining new values for sigev_notify.

For a notification type of SIGEV_SIGNAL, the other members of the sigevent structure defined by IEEE Std 1003.1-2001 specify the realtime signal (that is, the signal number and the application-defined value that differentiates between occurrences of signals with the same number) that will be generated when the event occurs. The structure is defined in <signal.h>, even though the structure is not directly used by any of the signal functions, because it is part of the signals interface used by the POSIX.1b "client functions". When the client functions include <signal.h> to define the signal names, the sigevent structure will also be defined.

An application-defined value passed to the signal handler is used to differentiate between different "events" instead of requiring that the application use different signal numbers for several reasons:
- Realtime applications potentially handle a very large number of different events. Requiring that implementations support a correspondingly large number of distinct signal numbers will adversely impact the performance of signal delivery because the signal masks to be manipulated on entry and exit to the handlers will become large.
- Event notifications are prioritized by signal number (the rationale for this is explained in the following paragraphs) and the use of different signal numbers to differentiate between the different event notifications overloads the signal number more than has already been done. It also requires that the application writer make arbitrary assignments of priority to events that are logically of equal priority.
A union is defined for the application-defined value so that either an integer constant or a pointer can be portably passed to the signal-catching function. On some architectures a pointer cannot be cast to an int and vice versa.

Use of a structure here with an explicit notification type discriminant rather than explicit parameters to realtime functions, or embedded in other realtime structures, provides for future extensions to IEEE Std 1003.1-2001. Additional, perhaps more efficient, notification mechanisms can be supported for existing realtime function interfaces, such as timers and asynchronous I/O, by extending the sigevent structure appropriately. The existing realtime function interfaces will not have to be modified to use any such new notification mechanism. The revised text concerning the SIGEV_SIGNAL value makes consistent the semantics of the members of the sigevent structure, particularly in the definitions of lio_listio() and aio_fsync(). For uniformity, other revisions cause this specification to be referred to rather than inaccurately duplicated in the descriptions of functions and structures using the sigevent structure. The revised wording does not relax the requirement that the signal number be in the range SIGRTMIN to SIGRTMAX to guarantee queuing and passing of the application value, since that requirement is still implied by the signal names.
IEEE Std 1003.1-2001 is intentionally vague on whether "non-realtime" signal-generating mechanisms can result in a siginfo_t being supplied to the handler on delivery. In one existing implementation, a siginfo_t is posted on signal generation, even though the implementation does not support queuing of multiple occurrences of a signal. It is not the intent of IEEE Std 1003.1-2001 to preclude this, independent of the mandate to define signals that do support queuing. Any interpretation that appears to preclude this is a mistake in the reading or writing of the standard.
Signals handled by realtime signal handlers might be generated by functions or conditions that do not allow the specification of an application-defined value and do not queue. IEEE Std 1003.1-2001 specifies the si_code member of the siginfo_t structure used in existing practice and defines additional codes so that applications can detect whether an application-defined value is present or not. The code SI_USER for kill()-generated signals is adopted from existing practice.
The sigaction() sa_flags value SA_SIGINFO tells the implementation that the signal-catching function expects two additional arguments. When the flag is not set, a single argument, the signal number, is passed as specified by IEEE Std 1003.1-2001. Although IEEE Std 1003.1-2001 does not explicitly allow the info argument to the handler function to be NULL, this is existing practice. This provides for compatibility with programs whose signal-catching functions are not prepared to accept the additional arguments. IEEE Std 1003.1-2001 is explicitly unspecified as to whether signals actually queue when SA_SIGINFO is not set for a signal, as there appear to be no benefits to applications in specifying one behavior or another. One existing implementation queues a siginfo_t on each signal generation, unless the signal is already pending, in which case the implementation discards the new siginfo_t; that is, the queue length is never greater than one. This implementation only examines SA_SIGINFO on signal delivery, discarding the queued siginfo_t if its delivery was not requested.

IEEE Std 1003.1-2001 specifies several new values for the si_code member of the siginfo_t structure. In existing practice, a si_code value of less than or equal to zero indicates that the signal was generated by a process via the kill() function. In existing practice, values of si_code that provide additional information for implementation-generated signals, such as SIGFPE or SIGSEGV, are all positive. Thus, if implementations define the new constants specified in IEEE Std 1003.1-2001 to be negative numbers, programs written to use existing practice will not break. IEEE Std 1003.1-2001 chose not to attempt to specify existing practice values of si_code other than SI_USER both because it was deemed beyond the scope of IEEE Std 1003.1-2001 and because many of the values in existing practice appear to be platform and implementation-defined. However, IEEE Std 1003.1-2001 does specify that if an implementation (for example, one that does not have existing practice in this area) chooses to define additional values for si_code, these values have to be different from the values of the symbols specified by IEEE Std 1003.1-2001. This will allow conforming applications to differentiate between signals generated by one of the POSIX.1b asynchronous events and those generated by other implementation events in a manner compatible with existing practice.

The unique values of si_code for the POSIX.1b asynchronous events have implications for implementations of, for example, asynchronous I/O or message passing in user space library code. Such an implementation will be required to provide a hidden interface to the signal generation mechanism that allows the library to specify the standard values of si_code.

Existing practice also defines additional members of siginfo_t, such as the process ID and user ID of the sending process for kill()-generated signals. These members were deemed not necessary to meet the requirements of realtime applications and are not specified by IEEE Std 1003.1-2001. Neither are they precluded.

The third argument to the signal-catching function, context, is left undefined by IEEE Std 1003.1-2001, but is specified in the interface because it matches existing practice for the SA_SIGINFO flag. It was considered undesirable to require a separate implementation for SA_SIGINFO for POSIX conformance on implementations that already support the two additional parameters.
The requirement to deliver lower numbered signals in the range SIGRTMIN to SIGRTMAX first, when multiple unblocked signals are pending, results from several considerations:
- A method is required to prioritize event notifications. The signal number was chosen instead of, for instance, associating a separate priority with each request, because an implementation has to check pending signals at various points and select one for delivery when more than one is pending. Specifying a selection order is the minimal additional semantic that will achieve prioritized delivery. If a separate priority were to be associated with queued signals, it would be necessary for an implementation to search all non-empty, non-blocked signal queues and select from among them the pending signal with the highest priority. This would significantly increase the cost of and decrease the determinism of signal delivery.
- Given the specified selection of the lowest numeric unblocked pending signal, preemptive priority signal delivery can be achieved using signal numbers and signal masks by ensuring that the sa_mask for each signal number blocks all signals with a higher numeric value.
  
  For realtime applications that want to use only the newly defined realtime signal numbers without interference from the standard signals, this can be achieved by blocking all of the standard signals in the thread signal mask and in the sa_mask installed by the signal action for the realtime signal handlers.
IEEE Std 1003.1-2001 explicitly leaves unspecified the ordering of signals outside of the range of realtime signals and the ordering of signals within this range with respect to those outside the range. It was believed that this would unduly constrain implementations or standards in the future definition of new signals.

Signal Actions

Early proposals mentioned SIGCONT as a second exception to the rule that signals are not delivered to stopped processes until continued. Because IEEE Std 1003.1-2001 now specifies that SIGCONT causes the stopped process to continue when it is generated, delivery of SIGCONT is not prevented because a process is stopped, even without an explicit exception to this rule.

Ignoring a signal by setting the action to SIG_IGN (or SIG_DFL for signals whose default action is to ignore) is not the same as installing a signal-catching function that simply returns. Invoking such a function will interrupt certain system functions that block processes (for example, wait(), sigsuspend(), pause(), read(), write()) while ignoring a signal has no such effect on the process.

Historical implementations discard pending signals when the action is set to SIG_IGN. However, they do not always do the same when the action is set to SIG_DFL and the default action is to ignore the signal. IEEE Std 1003.1-2001 requires this for the sake of consistency and also for completeness, since the only signal this applies to is SIGCHLD, and IEEE Std 1003.1-2001 disallows setting its action to SIG_IGN.

Some implementations (System V, for example) assign different semantics for SIGCLD depending on whether the action is set to SIG_IGN or SIG_DFL. Since POSIX.1 requires that the default action for SIGCHLD be to ignore the signal, applications should always set the action to SIG_DFL in order to avoid SIGCHLD.

Whether or not an implementation allows SIG_IGN as a SIGCHLD disposition to be inherited across a call to one of the exec family of functions or posix_spawn() is explicitly left as unspecified. This change was made as a result of IEEE PASC Interpretation 1003.1 #132, and permits the implementation to decide between the following alternatives:

- Unconditionally leave SIGCHLD set to SIG_IGN, in which case the implementation would not allow applications that assume inheritance of SIG_DFL to conform to IEEE Std 1003.1-2001 without change. The implementation would, however, retain an ability to control applications that create child processes but never call on the wait family of functions, potentially filling up the process table.
- Unconditionally reset SIGCHLD to SIG_DFL, in which case the implementation would allow applications that assume inheritance of SIG_DFL to conform. The implementation would, however, lose an ability to control applications that spawn child processes but never reap them.
- Provide some mechanism, not specified in IEEE Std 1003.1-2001, to control inherited SIGCHLD dispositions.

Some implementations (System V, for example) will deliver a SIGCLD signal immediately when a process establishes a signal-catching function for SIGCLD when that process has a child that has already terminated. Other implementations, such as 4.3 BSD, do not generate a new SIGCHLD signal in this way. In general, a process should not attempt to alter the signal action for the SIGCHLD signal while it has any outstanding children. However, it is not always possible for a process to avoid this; for example, shells sometimes start up processes in pipelines with other processes from the pipeline as children. Processes that cannot ensure that they have no children when altering the signal action for SIGCHLD thus need to be prepared for, but not depend on, generation of an immediate SIGCHLD signal.

The default action of the stop signals (SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU) is to stop a process that is executing. If a stop signal is delivered to a process that is already stopped, it has no effect. In fact, if a stop signal is generated for a stopped process whose signal mask blocks the signal, the signal will never be delivered to the process since the process must receive a SIGCONT, which discards all pending stop signals, in order to continue executing.

The SIGCONT signal continues a stopped process even if SIGCONT is blocked (or ignored). However, if a signal-catching routine has been established for SIGCONT, it will not be entered until SIGCONT is unblocked.

If a process in an orphaned process group stops, it is no longer under the control of a job control shell and hence would not normally ever be continued. Because of this, orphaned processes that receive terminal-related stop signals (SIGTSTP, SIGTTIN, SIGTTOU, but not SIGSTOP) must not be allowed to stop. The goal is to prevent stopped processes from languishing forever. (As SIGSTOP is sent only via kill(), it is assumed that the process or user sending a SIGSTOP can send a SIGCONT when desired.) Instead, the system must discard the stop signal. As an extension, it may also deliver another signal in its place. 4.3 BSD sends a SIGKILL, which is overly effective because SIGKILL is not catchable. Another possible choice is SIGHUP. 4.3 BSD also does this for orphaned processes (processes whose parent has terminated) rather than for members of orphaned process groups; this is less desirable because job control shells manage process groups. POSIX.1 also prevents SIGTTIN and SIGTTOU signals from being generated for processes in orphaned process groups as a direct result of activity on a terminal, preventing infinite loops when read() and write() calls generate signals that are discarded; see Terminal Access Control. A similar restriction on the generation of SIGTSTP was considered, but that would be unnecessary and more difficult to implement due to its asynchronous nature.

Although POSIX.1 requires that signal-catching functions be called with only one argument, there is nothing to prevent conforming implementations from extending POSIX.1 to pass additional arguments, as long as Strictly Conforming POSIX.1 Applications continue to compile and execute correctly. Most historical implementations do, in fact, pass additional, signal-specific arguments to certain signal-catching routines.

There was a proposal to change the declared type of the signal handler to:

void func (int sig, ...);

The usage of ellipses ("...") is ISO C standard syntax to indicate a variable number of arguments. Its use was intended to allow the implementation to pass additional information to the signal handler in a standard manner.

Unfortunately, this construct would require all signal handlers to be defined with this syntax because the ISO C standard allows implementations to use a different parameter passing mechanism for variable parameter lists than for non-variable parameter lists. Thus, all existing signal handlers in all existing applications would have to be changed to use the variable syntax in order to be standard and portable. This is in conflict with the goal of Minimal Changes to Existing Application Code.

When terminating a process from a signal-catching function, processes should be aware of any interpretation that their parent may make of the status returned by wait() or waitpid(). In particular, a signal-catching function should not call exit(0) or _exit(0) unless it wants to indicate successful termination. A non-zero argument to exit() or _exit() can be used to indicate unsuccessful termination. Alternatively, the process can use kill() to send itself a fatal signal (first ensuring that the signal is set to the default action and not blocked). See also the RATIONALE section of the _exit() function.

The behavior of unsafe functions, as defined by this section, is undefined when they are invoked from signal-catching functions in certain circumstances. The behavior of reentrant functions, as defined by this section, is as specified by POSIX.1, regardless of invocation from a signal-catching function. This is the only intended meaning of the statement that reentrant functions may be used in signal-catching functions without restriction. Applications must still consider all effects of such functions on such things as data structures, files, and process state. In particular, application writers need to consider the restrictions on interactions when interrupting sleep() (see sleep()) and interactions among multiple handles for a file description. The fact that any specific function is listed as reentrant does not necessarily mean that invocation of that function from a signal-catching function is recommended.

In order to prevent errors arising from interrupting non-reentrant function calls, applications should protect calls to these functions either by blocking the appropriate signals or through the use of some programmatic semaphore. POSIX.1 does not address the more general problem of synchronizing access to shared data structures. Note in particular that even the "safe" functions may modify the global variable errno; the signal-catching function may want to save and restore its value. The same principles apply to the reentrancy of application routines and asynchronous data access.

Note that longjmp() and siglongjmp() are not in the list of reentrant functions. This is because the code executing after longjmp() or siglongjmp() can call any unsafe functions with the same danger as calling those unsafe functions directly from the signal handler. Applications that use longjmp() or siglongjmp() out of signal handlers require rigorous protection in order to be portable. Many of the other functions that are excluded from the list are traditionally implemented using either the C language malloc() or free() functions or the ISO C standard I/O library, both of which traditionally use data structures in a non-reentrant manner. Because any combination of different functions using a common data structure can cause reentrancy problems, POSIX.1 does not define the behavior when any unsafe function is called in a signal handler that interrupts any unsafe function.

The only realtime extension to signal actions is the addition of the additional parameters to the signal-catching function. This extension has been explained and motivated in the previous section. In making this extension, though, developers of POSIX.1b ran into issues relating to function prototypes. In response to input from the POSIX.1 standard developers, members were added to the sigaction structure to specify function prototypes for the newer signal-catching function specified by POSIX.1b. These members follow changes that are being made to POSIX.1. Note that IEEE Std 1003.1-2001 explicitly states that these fields may overlap so that a union can be defined. This enabled existing implementations of POSIX.1 to maintain binary-compatibility when these extensions were added.

The siginfo_t structure was adopted for passing the application-defined value to match existing practice, but the existing practice has no provision for an application-defined value, so this was added. Note that POSIX normally reserves the "_t" type designation for opaque types. The siginfo_t structure breaks with this convention to follow existing practice and thus promote portability. Standardization of the existing practice for the other members of this structure may be addressed in the future.

Although it is not explicitly visible to applications, there are additional semantics for signal actions implied by queued signals and their interaction with other POSIX.1b realtime functions. Specifically:

- It is not necessary to queue signals whose action is SIG_IGN.
- For implementations that support POSIX.1b timers, some interaction with the timer functions at signal delivery is implied to manage the timer overrun count.

IEEE Std 1003.1-2001/Cor 1-2002, item XSH/TC1/D6/5 is applied, reordering the RTS shaded text under the third and fourth paragraphs of the SIG_DFL description. This corrects an earlier editorial error in this section.

IEEE Std 1003.1-2001/Cor 1-2002, item XSH/TC1/D6/6 is applied, adding the abort() function to the list of async-cancel-safe functions.

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/4 is applied, adding the sockatmark() function to the list of functions that shall be either reentrant or non-interruptible by signals and shall be async-signal-safe.

Signal Effects on Other Functions

The most common behavior of an interrupted function after a signal-catching function returns is for the interrupted function to give an [EINTR] error unless the SA_RESTART flag is in effect for the signal. However, there are a number of specific exceptions, including sleep() and certain situations with read() and write().

The historical implementations of many functions defined by IEEE Std 1003.1-2001 are not interruptible, but delay delivery of signals generated during their execution until after they complete. This is never a problem for functions that are guaranteed to complete in a short (imperceptible to a human) period of time. It is normally those functions that can suspend a process indefinitely or for long periods of time (for example, wait(), pause(), sigsuspend(), sleep(), or read()/write() on a slow device like a terminal) that are interruptible. This permits applications to respond to interactive signals or to set timeouts on calls to most such functions with alarm(). Therefore, implementations should generally make such functions (including ones defined as extensions) interruptible.
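The timeout idiom mentioned above can be sketched as follows. This is an illustration under stated assumptions, not a robust implementation: read_with_timeout() and on_alarm() are hypothetical names, and the sketch has the well-known race in which the alarm fires before read() is entered.

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: exposes sigaction() */
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: its only purpose is to interrupt the blocked call. */
static void on_alarm(int signo)
{
    (void)signo;
}

/* Like read(), but fails with EINTR if no data arrives within `seconds`. */
ssize_t read_with_timeout(int fd, void *buf, size_t count, unsigned int seconds)
{
    struct sigaction sa, old;
    ssize_t n;
    int saved_errno;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sa.sa_flags = 0;                 /* no SA_RESTART, so read() is not restarted */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old) == -1)
        return -1;

    alarm(seconds);
    n = read(fd, buf, count);
    saved_errno = errno;

    alarm(0);                        /* cancel any alarm still pending */
    sigaction(SIGALRM, &old, NULL);  /* restore the previous disposition */
    errno = saved_errno;
    return n;
}
```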

Functions not mentioned explicitly as interruptible may be so on some implementations, possibly as an extension where the function gives an [EINTR] error. There are several functions (for example, getpid(), getuid()) that are specified as never returning an error, which can thus never be extended in this way.

If a signal-catching function returns while the SA_RESTART flag is in effect, an interrupted function is restarted at the point it was interrupted. Conforming applications cannot make assumptions about the internal behavior of interrupted functions, even if the functions are async-signal-safe. For example, suppose the read() function is interrupted with SA_RESTART in effect, the signal-catching function closes the file descriptor being read from and returns, and the read() function is then restarted; in this case the application cannot assume that the read() function will give an [EBADF] error, since read() might have checked the file descriptor for validity before being interrupted.

B.2.5 Standard I/O Streams

Interaction of File Descriptors and Standard I/O Streams

There is no additional rationale provided for this section.

Stream Orientation and Encoding Rules

There is no additional rationale provided for this section.

B.2.6 STREAMS

STREAMS are introduced into IEEE Std 1003.1-2001 as part of the alignment with the Single UNIX Specification, but marked as an option in recognition that not all systems may wish to implement the facility. The option within IEEE Std 1003.1-2001 is denoted by the XSR margin marker. The standard developers made this option independent of the XSI option.

STREAMS are a method of implementing network services and other character-based input/output mechanisms, with the STREAM being a full-duplex connection between a process and a device. STREAMS provides direct access to protocol modules, and optional protocol modules can be interposed between the process-end of the STREAM and the device-driver at the device-end of the STREAM. Pipes can be implemented using the STREAMS mechanism, so they can provide process-to-process as well as process-to-device communications.

This section introduces STREAMS I/O, the message types used to control them, an overview of the priority mechanism, and the interfaces used to access them.

Accessing STREAMS

There is no additional rationale provided for this section.

B.2.7 XSI Interprocess Communication

There are two forms of IPC supported as options in IEEE Std 1003.1-2001. The traditional System V IPC routines derived from the SVID (that is, the msg*(), sem*(), and shm*() interfaces) are mandatory on XSI-conformant systems. Thus, all XSI-conformant systems provide the same mechanisms for manipulating messages, shared memory, and semaphores.

In addition, the POSIX Realtime Extension provides an alternate set of routines for those systems supporting the appropriate options.

The application writer is presented with a choice: the System V interfaces or the POSIX interfaces (loosely derived from the Berkeley interfaces). The XSI profile prefers the System V interfaces, but the POSIX interfaces may be more suitable for realtime or other performance-sensitive applications.

IPC General Information

General information that is shared by all three mechanisms is described in this section. The common permissions mechanism is briefly introduced, describing the mode bits, and how they are used to determine whether or not a process has access to read or write/alter the appropriate instance of one of the IPC mechanisms. All other relevant information is contained in the reference pages themselves.

The semaphore type of IPC allows processes to communicate through the exchange of semaphore values. A semaphore is a non-negative integer. Since many applications require the use of more than one semaphore, XSI-conformant systems have the ability to create sets or arrays of semaphores.

Calls to support semaphores include:

semctl(), semget(), semop()

Semaphore sets are created by using the semget() function.
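A minimal sketch of creating and operating on a semaphore set follows. The function name sem_set_demo() is hypothetical, IPC_PRIVATE is used only to keep the example self-contained, and the union semun definition is supplied by the application, as the semctl() interface requires.

```c
#define _XOPEN_SOURCE 700  /* assumption: exposes the XSI IPC interfaces */
#include <sys/ipc.h>
#include <sys/sem.h>

/* The application must define this union itself for semctl(). */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Create a private set of two semaphores, set the first to `initial`,
   decrement it once without blocking, and return the resulting value.
   Returns -1 on error, including when the decrement would block. */
int sem_set_demo(int initial)
{
    union semun arg;
    struct sembuf take;
    int semid, value;

    semid = semget(IPC_PRIVATE, 2, IPC_CREAT | 0600);
    if (semid == -1)
        return -1;

    take.sem_num = 0;           /* first semaphore in the set */
    take.sem_op = -1;           /* decrement by one */
    take.sem_flg = IPC_NOWAIT;  /* fail with EAGAIN instead of blocking */

    arg.val = initial;
    if (semctl(semid, 0, SETVAL, arg) == -1 || semop(semid, &take, 1) == -1)
        value = -1;
    else
        value = semctl(semid, 0, GETVAL);

    semctl(semid, 0, IPC_RMID);  /* the set would otherwise outlive the process */
    return value;
}
```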

The message type of IPC allows processes to communicate through the exchange of data stored in buffers. This data is transmitted between processes in discrete portions known as messages.

Calls to support message queues include:

msgctl(), msgget(), msgrcv(), msgsnd()
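The exchange of a discrete message can be sketched as below. The structure text_message, the function msg_round_trip(), and the 32-byte payload limit are all assumptions made for illustration; sender and receiver are the same process only to keep the example short.

```c
#define _XOPEN_SOURCE 700  /* assumption: exposes the XSI IPC interfaces */
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Hypothetical message layout: a long type followed by the payload. */
struct text_message {
    long mtype;
    char mtext[32];
};

/* Send `text` through a private queue and receive it back into `out`.
   Returns the number of payload bytes received, or -1 on error. */
long msg_round_trip(const char *text, char *out, size_t out_size)
{
    struct text_message msg;
    size_t len = strlen(text) + 1;  /* include the terminating null byte */
    ssize_t received = -1;
    int qid;

    if (len > sizeof msg.mtext || len > out_size)
        return -1;

    qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid == -1)
        return -1;

    msg.mtype = 1;                  /* message types must be greater than zero */
    memcpy(msg.mtext, text, len);
    if (msgsnd(qid, &msg, len, IPC_NOWAIT) == 0) {
        /* A type argument of 0 accepts the first message of any type. */
        received = msgrcv(qid, &msg, sizeof msg.mtext, 0, IPC_NOWAIT);
        if (received >= 0)
            memcpy(out, msg.mtext, (size_t)received);
    }

    msgctl(qid, IPC_RMID, NULL);    /* remove the queue */
    return (long)received;
}
```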

The shared memory type of IPC allows two or more processes to share memory and consequently the data contained therein. This is done by allowing processes to set up access to a common memory address space. This sharing of memory provides a fast means of exchange of data between processes.

Calls to support shared memory include:

shmat(), shmctl(), shmdt(), shmget()

The ftok() interface is also provided.
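The following sketch illustrates the shared memory calls by attaching one segment at two addresses and checking that data written through one attachment is visible through the other. The function name shm_shared_view_demo() and the 4096-byte size are assumptions. The sketch uses IPC_PRIVATE for simplicity; unrelated processes would more typically agree on a key obtained from ftok().

```c
#define _XOPEN_SOURCE 700  /* assumption: exposes the XSI IPC interfaces */
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Returns 1 if a write through one attachment is visible through the
   other, 0 if it is not, and -1 on error. */
int shm_shared_view_demo(void)
{
    int shmid, visible = -1;
    char *writer, *reader;

    shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid == -1)
        return -1;

    writer = shmat(shmid, NULL, 0);           /* read/write attachment */
    reader = shmat(shmid, NULL, SHM_RDONLY);  /* second, read-only attachment */
    if (writer != (void *)-1 && reader != (void *)-1) {
        strcpy(writer, "shared");
        visible = (strcmp(reader, "shared") == 0);
    }

    if (writer != (void *)-1)
        shmdt(writer);
    if (reader != (void *)-1)
        shmdt(reader);
    shmctl(shmid, IPC_RMID, NULL);            /* remove the segment */
    return visible;
}
```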

B.2.8 Realtime

Advisory Information

POSIX.1b contains an Informative Annex with proposed interfaces for "realtime files". These interfaces could determine groups of the exact parameters required to do "direct I/O" or "extents". These interfaces were objected to by a significant portion of the balloting group as too complex. A conforming application had little chance of correctly navigating the large parameter space to match its desires to the system. In addition, they only applied to a new type of file (realtime files) and they told the implementation exactly what to do as opposed to advising the implementation on application behavior and letting it optimize for the system the (portable) application was running on. For example, it was not clear how a system that had a disk array should set its parameters.

There seemed to be several overall goals:

Optimizing sequential access
Optimizing caching behavior
Optimizing I/O data transfer
Preallocation

The advisory interfaces, posix_fadvise() and posix_madvise(), satisfy the first two goals. The POSIX_FADV_SEQUENTIAL and POSIX_MADV_SEQUENTIAL advice tells the implementation to expect serial access. Typically the system will prefetch the next several serial accesses in order to overlap I/O. It may also free previously accessed serial data if memory is tight. If the application is not doing serial access it can use POSIX_FADV_WILLNEED and POSIX_MADV_WILLNEED to accomplish I/O overlap, as required. When the application advises POSIX_FADV_RANDOM or POSIX_MADV_RANDOM behavior, the implementation usually tries to fetch a minimum amount of data with each request and it does not expect much locality. POSIX_FADV_DONTNEED and POSIX_MADV_DONTNEED allow the system to free up caching resources as the data will not be required in the near future.
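A minimal sketch of advising sequential access is shown below. The function name read_sequentially() and the buffer size are assumptions. Because the calls are purely advisory, the sketch ignores their return values.

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: exposes posix_fadvise() */
#include <fcntl.h>
#include <unistd.h>

/* Read a whole file front to back after advising the implementation.
   Returns the number of bytes read, or -1 on error. */
long read_sequentially(const char *path)
{
    char buf[4096];
    long total = 0;
    ssize_t n;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    /* Offset 0 with length 0 covers the file through to its end. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;

    /* The data will not be needed again soon; caches may be released. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return n == -1 ? -1 : total;
}
```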

POSIX_FADV_NOREUSE tells the system that caching the specified data is not optimal. For file I/O, the transfer should go directly to the user buffer instead of being cached internally by the implementation. To portably perform direct disk I/O on all systems, the application must perform its I/O transfers according to the following rules:

The user buffer should be aligned according to the {POSIX_REC_XFER_ALIGN} pathconf() variable.
The number of bytes transferred in an I/O operation should be a multiple of the {POSIX_ALLOC_SIZE_MIN} pathconf() variable.
The offset into the file at the start of an I/O operation should be a multiple of the {POSIX_ALLOC_SIZE_MIN} pathconf() variable.
The application should ensure that all threads which open a given file specify POSIX_FADV_NOREUSE to be sure that there is no unexpected interaction between threads using buffered I/O and threads using direct I/O to the same file.

In some cases, a user buffer must be properly aligned in order to be transferred directly to/from the device. The {POSIX_REC_XFER_ALIGN} pathconf() variable tells the application the proper alignment.

The preallocation goal is met by the space control function, posix_fallocate(). The application can use posix_fallocate() to guarantee no [ENOSPC] errors and to improve performance by prepaying any overhead required for block allocation.

Implementations may use information conveyed by a previous posix_fadvise() call to influence the manner in which allocation is performed. For example, if an application made the following calls:

fd = open("file", O_RDWR);  /* open() needs an access mode; allocation needs write access */
posix_fadvise(fd, offset, len, POSIX_FADV_SEQUENTIAL);
posix_fallocate(fd, offset, len);  /* arguments are (fd, offset, len) */

an implementation might allocate the file contiguously on disk.

Finally, the pathconf() variables {POSIX_REC_MIN_XFER_SIZE}, {POSIX_REC_MAX_XFER_SIZE}, and {POSIX_REC_INCR_XFER_SIZE} tell the application a range of transfer sizes that are recommended for best I/O performance.

Where bounded response time is required, the vendor can supply the appropriate settings of the advisories to achieve a guaranteed performance level.

The interfaces meet the goals while allowing applications using regular files to take advantage of performance optimizations. The interfaces tell the implementation expected application behavior which the implementation can use to optimize performance on a particular system with a particular dynamic load.

The posix_memalign() function was added to allow for the allocation of specifically aligned buffers; for example, for {POSIX_REC_XFER_ALIGN}.
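As a sketch of how the two facilities might be combined, the following function asks pathconf() for the recommended alignment and passes it to posix_memalign(). The function name alloc_transfer_buffer() and the 4096-byte fallback are assumptions, used when the implementation reports no limit or reports a value posix_memalign() cannot accept.

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: exposes posix_memalign() */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate `size` bytes aligned as recommended for transfers to files
   under `dir`. The caller releases the buffer with free().
   Returns NULL on failure; stores the alignment used if requested. */
void *alloc_transfer_buffer(const char *dir, size_t size, size_t *alignment_out)
{
    void *buf = NULL;
    long recommended = pathconf(dir, _PC_REC_XFER_ALIGN);
    size_t alignment = recommended > 0 ? (size_t)recommended : 4096;

    /* posix_memalign() needs a power of two that is a multiple of
       sizeof(void *); fall back to the assumed default otherwise. */
    if (alignment < sizeof(void *) || (alignment & (alignment - 1)) != 0)
        alignment = 4096;

    if (posix_memalign(&buf, alignment, size) != 0)
        return NULL;

    if (alignment_out != NULL)
        *alignment_out = alignment;
    return buf;
}
```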

The working group also considered the alternative of adding a function which would return an aligned pointer to memory within a user-supplied buffer. This was not considered to be the best method, because it potentially wastes large amounts of memory when buffers need to be aligned on large alignment boundaries.

Message Passing

This section provides the rationale for the definition of the message passing interface in IEEE Std 1003.1-2001. This is presented in terms of the objectives, models, and requirements imposed upon this interface.

Objectives

Many applications, including both realtime and database applications, require a means of passing arbitrary amounts of data between cooperating processes comprising the overall application on one or more processors. Many conventional interfaces for interprocess communication are insufficient for realtime applications in that efficient and deterministic data passing methods cannot be implemented. This has prompted the definition of message passing interfaces providing these facilities:
- Open a message queue.
- Send a message to a message queue.
- Receive a message from a queue, either synchronously or asynchronously.
- Alter message queue attributes for flow and resource control.

It is assumed that an application may consist of multiple cooperating processes and that these processes may wish to communicate and coordinate their activities. The message passing facility described in IEEE Std 1003.1-2001 allows processes to communicate through system-wide queues. These message queues are accessed through names that may be pathnames. A message queue can be opened for use by multiple sending and/or multiple receiving processes.

Background on Embedded Applications

Interprocess communication utilizing message passing is a key facility for the construction of deterministic, high-performance realtime applications. The facility is present in all realtime systems and is the framework upon which the application is constructed. The performance of the facility is usually a direct indication of the performance of the resulting application.

Realtime applications, especially for embedded systems, are typically designed around the performance constraints imposed by the message passing mechanisms. Applications for embedded systems are typically very tightly constrained. Application writers expect to design and control the entire system. In order to minimize system costs, the writer will attempt to use all resources to their utmost and minimize the requirement to add additional memory or processors.

The embedded applications usually share address spaces and only a simple message passing mechanism is required. The application can readily access common data incurring only mutual-exclusion overheads. The models desired are the simplest possible with the application building higher-level facilities only when needed.

Requirements

The following requirements determined the features of the message passing facilities defined in IEEE Std 1003.1-2001:
- Naming of Message Queues
  
  The mechanism for gaining access to a message queue is a pathname evaluated in a context that is allowed to be a file system name space, or it can be independent of any file system. This is a specific attempt to allow implementations based on either method, in order to address embedded systems and also to allow implementation in larger systems.
  
  The interface of mq_open() is defined to allow but not require the access control and name conflicts resulting from utilizing a file system for name resolution. All required behavior is specified for the access control case. Yet a conforming implementation, such as an embedded system kernel, may define that there are no distinctions between users and may define that all processes have all access privileges.
- Embedded System Naming
  
  Embedded systems need to be able to utilize independent name spaces for accessing the various system objects. They typically do not have a file system, precluding its utilization as a common name resolution mechanism. The modularity of an embedded system limits the connections between separate mechanisms that can be allowed.
  
  Embedded systems typically do not have any access protection. Since the system does not support the mixing of applications from different areas, and usually does not even have the concept of an authorization entity, access control is not useful.
- Large System Naming
  
  On systems with more functionality, the name resolution must support the ability to use the file system as the name resolution mechanism/object storage medium and to have control over access to the objects. Utilizing the pathname space can result in further errors when the names conflict with other objects.
- Fixed Size of Messages
  
  The interfaces impose a fixed upper bound on the size of messages that can be sent to a specific message queue. The size is set on an individual queue basis and cannot be changed dynamically.
  
  The purpose of the fixed size is to increase the ability of the system to optimize the implementation of mq_send() and mq_receive(). With fixed sizes of messages and fixed numbers of messages, specific message blocks can be pre-allocated. This eliminates a significant amount of checking for errors and boundary conditions. Additionally, an implementation can optimize data copying to maximize performance. Finally, with a restricted range of message sizes, an implementation is better able to provide deterministic operations.
- Prioritization of Messages
  
  Message prioritization allows the application to determine the order in which messages are received. Prioritization of messages is a key facility that is provided by most realtime kernels and is heavily utilized by the applications. The major purpose of having priorities in message queues is to avoid priority inversions in the message system, where a high-priority message is delayed behind one or more lower-priority messages. This allows the applications to be designed so that they do not need to be interrupted in order to change the flow of control when exceptional conditions occur. The prioritization does add additional overhead to the message operations in those cases it is actually used but a clever implementation can optimize for the FIFO case to make that more efficient.
- Asynchronous Notification
  
  The interface supports the ability to have a task asynchronously notified of the availability of a message on the queue. The purpose of this facility is to allow the task to perform other functions and yet still be notified that a message has become available on the queue.
  
  To understand the requirement for this function, it is useful to understand two models of application design: a single task performing multiple functions and multiple tasks performing a single function. Each of these models has advantages.
  
  Asynchronous notification is required to build the model of a single task performing multiple operations. This model typically results from either the expectation that interruption is less expensive than utilizing a separate task or from the growth of the application to include additional functions.
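The fixed-size and prioritization requirements above can be illustrated with a short sketch. The queue name, the limits of 4 messages of 32 bytes, and the function name first_received_priority() are assumptions chosen for the example. It sends a low-priority message followed by a high-priority one and reports which is received first.

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: exposes the mq_*() interfaces */
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Returns the priority of the first message received, or -1 on error. */
int first_received_priority(void)
{
    char name[64];
    char buf[32];                    /* must hold at least mq_msgsize bytes */
    struct mq_attr attr;
    unsigned int prio = 0;
    int result = -1;
    mqd_t q;

    /* Hypothetical per-process name so concurrent runs do not collide. */
    snprintf(name, sizeof name, "/mq_prio_demo_%ld", (long)getpid());

    memset(&attr, 0, sizeof attr);
    attr.mq_maxmsg = 4;              /* fixed number of messages */
    attr.mq_msgsize = sizeof buf;    /* fixed upper bound on message size */

    q = mq_open(name, O_CREAT | O_RDWR | O_NONBLOCK, 0600, &attr);
    if (q == (mqd_t)-1)
        return -1;

    /* Lengths include the terminating null byte of each string. */
    if (mq_send(q, "routine", 8, 1) == 0 &&
        mq_send(q, "urgent", 7, 5) == 0 &&
        mq_receive(q, buf, sizeof buf, &prio) >= 0)
        result = (int)prio;

    mq_close(q);
    mq_unlink(name);
    return result;
}
```

On some systems the message queue functions reside in a separate realtime library, so the program may need to be linked with -lrt.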

Semaphores

Semaphores are a high-performance process synchronization mechanism. Semaphores are named by null-terminated strings of characters.

A semaphore is created using the sem_init() function or the sem_open() function with the O_CREAT flag set in oflag.

To use a semaphore, a process has to first initialize the semaphore or inherit an open descriptor for the semaphore via fork().

A semaphore preserves its state when the last reference is closed. For example, if a semaphore has a value of 13 when the last reference is closed, it will have a value of 13 when it is next opened.

When a semaphore is created, an initial state for the semaphore has to be provided. This value is a non-negative integer. Negative values are not possible since they indicate the presence of blocked processes. The persistence of any of these objects across a system crash or a system reboot is undefined. Conforming applications must not depend on any sort of persistence across a system reboot or a system crash.

Models and Requirements

A realtime system requires synchronization and communication between the processes comprising the overall application. An efficient and reliable synchronization mechanism has to be provided in a realtime system that will allow more than one schedulable process mutually-exclusive access to the same resource. This synchronization mechanism has to allow for the optimal implementation of synchronization or systems implementors will define other, more cost-effective methods.

At issue are the methods whereby multiple processes (tasks) can be designed and implemented to work together in order to perform a single function. This requires interprocess communication and synchronization. A semaphore mechanism is the lowest level of synchronization that can be provided by an operating system.

A semaphore is defined as an object that has an integral value and a set of blocked processes associated with it. If the value is positive or zero, then the set of blocked processes is empty; otherwise, the size of the set is equal to the absolute value of the semaphore value. The value of the semaphore can be incremented or decremented by any process with access to the semaphore and must be done as an indivisible operation. When a semaphore value is less than or equal to zero, any process that attempts to lock it again will block or be informed that it is not possible to perform the operation.

A semaphore may be used to guard access to any resource accessible by more than one schedulable task in the system. It is a global entity and not associated with any particular process. As such, a method of obtaining access to the semaphore has to be provided by the operating system. A process that wants access to a critical resource (section) has to wait on the semaphore that guards that resource. When the semaphore is locked on behalf of a process, it knows that it can utilize the resource without interference by any other cooperating process in the system. When the process finishes its operation on the resource, leaving it in a well-defined state, it posts the semaphore, indicating that some other process may now obtain the resource associated with that semaphore.
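The wait and post protocol described above can be sketched with an unnamed semaphore initialized to 1. The function name binary_semaphore_demo() is hypothetical, and a single thread plays both roles only to keep the example deterministic.

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: a POSIX.1 compilation environment */
#include <errno.h>
#include <semaphore.h>

/* Returns 1 if a second lock attempt is refused while the semaphore is
   held and granted after it is posted, 0 otherwise, and -1 on error. */
int binary_semaphore_demo(void)
{
    sem_t guard;
    int refused_while_held, granted_after_post;

    if (sem_init(&guard, 0, 1) == -1)  /* not shared between processes, value 1 */
        return -1;

    sem_wait(&guard);                  /* enter the critical section */
    refused_while_held = (sem_trywait(&guard) == -1 && errno == EAGAIN);
    sem_post(&guard);                  /* leave the critical section */
    granted_after_post = (sem_trywait(&guard) == 0);

    sem_destroy(&guard);
    return refused_while_held && granted_after_post;
}
```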

In this section, mutexes and condition variables are specified as the synchronization mechanisms between threads.

These primitives are typically used for synchronizing threads that share memory in a single process. However, this section provides an option allowing the use of these synchronization interfaces and objects between processes that share memory, regardless of the method for sharing memory.

Much experience with semaphores shows that there are two distinct uses of synchronization: locking, which is typically of short duration; and waiting, which is typically of long or unbounded duration. These distinct usages map directly onto mutexes and condition variables, respectively.

Semaphores are provided in IEEE Std 1003.1-2001 primarily to provide a means of synchronization for processes; these processes may or may not share memory. Mutexes and condition variables are specified as synchronization mechanisms between threads; these threads always share (some) memory. Both are synchronization paradigms that have been in widespread use for a number of years. Each set of primitives is particularly well matched to certain problems.

With respect to binary semaphores, experience has shown that condition variables and mutexes are easier to use for many synchronization problems than binary semaphores. The primary reason for this is the explicit appearance of a Boolean predicate that specifies when the condition wait is satisfied. This Boolean predicate terminates a loop, including the call to pthread_cond_wait(). As a result, extra wakeups are benign since the predicate governs whether the thread will actually proceed past the condition wait. With stateful primitives, such as binary semaphores, the wakeup in itself typically means that the wait is satisfied. The burden of ensuring correctness for such waits is thus placed on all signalers of the semaphore rather than on an explicitly coded Boolean predicate located at the condition wait. Experience has shown that the latter creates a major improvement in safety and ease-of-use.
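The predicate loop described above might look like the following sketch. The structure gate and the functions opener() and wait_for_gate() are hypothetical names. The waiting thread re-tests the Boolean predicate after every wakeup, so a spurious wakeup simply causes another wait.

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: a POSIX.1 compilation environment */
#include <pthread.h>

struct gate {
    pthread_mutex_t lock;
    pthread_cond_t changed;
    int ready;                    /* the Boolean predicate */
};

/* Signaler: make the predicate true under the mutex, then signal. */
static void *opener(void *arg)
{
    struct gate *g = arg;

    pthread_mutex_lock(&g->lock);
    g->ready = 1;
    pthread_cond_signal(&g->changed);
    pthread_mutex_unlock(&g->lock);
    return NULL;
}

/* Waiter: returns 1 once the predicate holds, or -1 on error. */
int wait_for_gate(void)
{
    struct gate g;
    pthread_t tid;
    int result;

    pthread_mutex_init(&g.lock, NULL);
    pthread_cond_init(&g.changed, NULL);
    g.ready = 0;

    if (pthread_create(&tid, NULL, opener, &g) != 0)
        return -1;

    pthread_mutex_lock(&g.lock);
    while (!g.ready)              /* extra wakeups are benign */
        pthread_cond_wait(&g.changed, &g.lock);
    result = g.ready;
    pthread_mutex_unlock(&g.lock);

    pthread_join(tid, NULL);
    pthread_cond_destroy(&g.changed);
    pthread_mutex_destroy(&g.lock);
    return result;
}
```

The sketch is correct whether the signaler runs before or after the waiter acquires the mutex, because the decision to wait depends on the predicate and not on the wakeup itself.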

Counting semaphores are well matched to dealing with producer/consumer problems, including those that might exist between threads of different processes, or between a signal handler and a thread. In the former case, there may be little or no memory shared by the processes; in the latter case, one is not communicating between co-equal threads, but between a thread and an interrupt-like entity. It is for these reasons that IEEE Std 1003.1-2001 allows semaphores to be used by threads.

Mutexes and condition variables have been effectively used with and without priority inheritance, priority ceiling, and other attributes to synchronize threads that share memory. The efficiency of their implementation is comparable to or better than that of other synchronization primitives that are sometimes harder to use (for example, binary semaphores). Furthermore, there is at least one known implementation of Ada tasking that uses these primitives. Mutexes and condition variables together constitute an appropriate, sufficient, and complete set of inter-thread synchronization primitives.

Efficient multi-threaded applications require high-performance synchronization primitives. Considerations of efficiency and generality require a small set of primitives upon which more sophisticated synchronization functions can be built.

Standardization Issues

It is possible to implement very high-performance semaphores using test-and-set instructions on shared memory locations. The library routines that implement such a high-performance interface have to properly ensure that a sem_wait() or sem_trywait() operation that cannot be performed will issue a blocking semaphore system call or properly report the condition to the application. The same interface to the application program would be provided by a high-performance implementation.

Realtime Signals

Realtime Signals Extension

This portion of the rationale presents models, requirements, and standardization issues relevant to the Realtime Signals Extension. This extension provides the capability required to support reliable, deterministic, asynchronous notification of events. While a new mechanism, unencumbered by the historical usage and semantics of POSIX.1 signals, might allow for a more efficient implementation, the application requirements for event notification can be met with a small number of extensions to signals. Therefore, a minimal set of extensions to signals to support the application requirements is specified.

The realtime signal extensions specified in this section are used by other realtime functions requiring asynchronous notification.

Models

The model supported is one of multiple cooperating processes, each of which handles multiple asynchronous external events. Events represent occurrences that are generated as the result of some activity in the system. Examples of occurrences that can constitute an event include:
- Completion of an asynchronous I/O request
- Expiration of a POSIX.1b timer
- Arrival of an interprocess message
- Generation of a user-defined event

Processing of these events may occur synchronously via polling for event notifications or asynchronously via a software interrupt mechanism. Existing practice for this model is well established for traditional proprietary realtime operating systems, realtime executives, and realtime extended POSIX-like systems.

A contrasting model is that of "cooperating sequential processes" where each process handles a single priority of events via polling. Each process blocks while waiting for events, and each process depends on the preemptive, priority-based process scheduling mechanism to arbitrate between events of different priority that need to be processed concurrently. Existing practice for this model is also well established for small realtime executives that typically execute in an unprotected physical address space, but it is just emerging in the context of a fuller function operating system with multiple virtual address spaces.

It could be argued that the cooperating sequential process model and the facilities supported by the POSIX Threads Extension obviate a software interrupt model. But, even with the cooperating sequential process model, the need has been recognized for a software interrupt model to handle exceptional conditions and process aborting, so the mechanism must be supported in any case. Furthermore, it is not the purview of IEEE Std 1003.1-2001 to attempt to convince realtime practitioners that their current application models based on software interrupts are "broken" and should be replaced by the cooperating sequential process model. Rather, it is the charter of IEEE Std 1003.1-2001 to provide standard extensions to mechanisms that support existing realtime practice.

Requirements

This section discusses the following realtime application requirements for asynchronous event notification:
- Reliable delivery of asynchronous event notification
  
  The events notification mechanism guarantees delivery of an event notification. Asynchronous operations (such as asynchronous I/O and timers) that complete significantly after they are invoked have to guarantee that delivery of the event notification can occur at the time of completion.
- Prioritized handling of asynchronous event notifications
  
  The events notification mechanism supports the assigning of a user function as an event notification handler. Furthermore, the mechanism supports the preemption of an event handler function by a higher priority event notification and supports the selection of the highest priority pending event notification when multiple notifications (of different priority) are pending simultaneously.
  
  The model here is based on hardware interrupts. Asynchronous event handling allows the application to ensure that time-critical events are immediately processed when delivered, without the indeterminism of being at a random location within a polling loop. Use of handler priority allows the specification of how handlers are interrupted by other higher priority handlers.
- Differentiation between multiple occurrences of event notifications of the same type
  
  The events notification mechanism passes an application-defined value to the event handler function. This value can be used for a variety of purposes, such as enabling the application to identify which of several possible events of the same type (for example, timer expirations) has occurred.
- Polled reception of asynchronous event notifications
  
  The events notification mechanism supports blocking and non-blocking polls for asynchronous event notification.
  
  The polled mode of operation is often preferred over the interrupt mode by those practitioners accustomed to this model. Providing support for this model facilitates the porting of applications based on this model to POSIX.1b conforming systems.
- Deterministic response to asynchronous event notifications
  
  The events notification mechanism does not preclude implementations that provide deterministic event dispatch latency and minimizes the number of system calls needed to use the event facilities during realtime processing.

Rationale for Extension

POSIX.1 signals have many of the characteristics necessary to support the asynchronous handling of event notifications, and the Realtime Signals Extension addresses the following deficiencies in the POSIX.1 signal mechanism:
- Signals do not support reliable delivery of event notification. Subsequent occurrences of a pending signal are not guaranteed to be delivered.
- Signals do not support prioritized delivery of event notifications. The order of signal delivery when multiple unblocked signals are pending is undefined.
- Signals do not support the differentiation between multiple signals of the same type.
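The way the extension addresses these three deficiencies can be sketched as follows. The function names are hypothetical. The sketch queues one occurrence of SIGRTMIN+1 followed by two occurrences of SIGRTMIN, each carrying a different value, and then accepts them with non-blocking sigtimedwait() polls. A no-op handler is installed with SA_SIGINFO because queuing is only guaranteed for signals that have that flag set.

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: exposes sigqueue(), sigtimedwait() */
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Never expected to run on success: the signals stay blocked and are
   accepted by polling. It exists so that SA_SIGINFO can be set. */
static void unused_handler(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)info;
    (void)context;
}

/* Records, in acceptance order, each signal number (as an offset from
   SIGRTMIN) and its queued value. Returns 0 on success, -1 on error. */
int drain_order(int offsets_out[3], int values_out[3])
{
    /* {offset from SIGRTMIN, value} in the order they are queued. */
    static const int sends[3][2] = { {1, 30}, {0, 10}, {0, 20} };
    struct sigaction sa;
    struct timespec no_wait;
    sigset_t set, previous;
    siginfo_t info;
    union sigval sv;
    int i, status = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = unused_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigemptyset(&set);
    for (i = 0; i < 2; i++) {
        if (sigaction(SIGRTMIN + i, &sa, NULL) == -1)
            return -1;
        sigaddset(&set, SIGRTMIN + i);
    }
    if (sigprocmask(SIG_BLOCK, &set, &previous) == -1)
        return -1;

    for (i = 0; i < 3 && status == 0; i++) {
        sv.sival_int = sends[i][1];
        status = sigqueue(getpid(), SIGRTMIN + sends[i][0], sv);
    }

    no_wait.tv_sec = 0;              /* zero timeout: poll without blocking */
    no_wait.tv_nsec = 0;
    for (i = 0; i < 3 && status == 0; i++) {
        if (sigtimedwait(&set, &info, &no_wait) == -1) {
            status = -1;
        } else {
            offsets_out[i] = info.si_signo - SIGRTMIN;
            values_out[i] = info.si_value.sival_int;
        }
    }

    sigprocmask(SIG_SETMASK, &previous, NULL);
    return status;
}
```

Both occurrences of the lower-numbered signal are accepted first, in the order they were queued, and each carries its own value.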

Asynchronous I/O

Many applications need to interact with the I/O subsystem in an asynchronous manner. The asynchronous I/O mechanism provides the ability to overlap application processing and I/O operations initiated by the application. The asynchronous I/O mechanism allows a single process to perform I/O simultaneously to a single file multiple times or to multiple files multiple times.

Overview

Asynchronous I/O operations proceed in logical parallel with the processing done by the application after the asynchronous I/O has been initiated. Other than this difference, asynchronous I/O behaves similarly to normal I/O using read(), write(), lseek(), and fsync(). The effect of issuing an asynchronous I/O request is as if a separate thread of execution were to perform atomically the implied lseek() operation, if any, and then the requested I/O operation (either read(), write(), or fsync()). There is no seek implied with a call to aio_fsync(). Concurrent asynchronous operations and synchronous operations applied to the same file update the file as if the I/O operations had proceeded serially.

When asynchronous I/O completes, a signal can be delivered to the application to indicate the completion of the I/O. This signal can be used to indicate that buffers and control blocks used for asynchronous I/O can be reused. Signal delivery is not required for an asynchronous operation and may be turned off on a per-operation basis by the application. Signals may also be synchronously polled using aio_suspend(), sigtimedwait(), or sigwaitinfo().

Normal I/O has a return value and an error status associated with it. Asynchronous I/O returns a value and an error status when the operation is first submitted, but that only relates to whether the operation was successfully queued up for servicing. The I/O operation itself also has a return status and an error value. To allow the application to retrieve the return status and the error value, functions are provided that, given the address of an asynchronous I/O control block, yield the return and error status associated with the operation. Until an asynchronous I/O operation is done, its error status is [EINPROGRESS]. Thus, an application can poll for completion of an asynchronous I/O operation by waiting for the error status to become equal to a value other than [EINPROGRESS]. The return status of an asynchronous I/O operation is undefined so long as the error status is equal to [EINPROGRESS].
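The distinction between the submission status and the status of the operation itself can be sketched as follows. The function name async_read_file() is hypothetical, and completion signals are disabled so that the example relies only on aio_error(), aio_suspend(), and aio_return().

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: exposes the aio_*() interfaces */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Read up to `count` bytes from the start of `path` asynchronously.
   Returns the number of bytes read, or -1 with errno set. */
ssize_t async_read_file(const char *path, void *buf, size_t count)
{
    struct aiocb cb;
    const struct aiocb *list[1];
    ssize_t result;
    int fd, status;

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = count;
    cb.aio_offset = 0;                          /* target of the implied seek */
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* no completion signal */

    /* A failure here only means the request could not be queued. */
    if (aio_read(&cb) == -1) {
        close(fd);
        return -1;
    }

    /* The application could perform other work at this point. */

    list[0] = &cb;
    while ((status = aio_error(&cb)) == EINPROGRESS)
        aio_suspend(list, 1, NULL);             /* wait rather than spin */

    /* Retrieve the return status exactly once; this also lets the
       system reclaim any storage it kept for the operation. */
    result = aio_return(&cb);
    if (status != 0) {
        errno = status;
        result = -1;
    }

    close(fd);
    return result;
}
```

On some systems the asynchronous I/O functions reside in a separate realtime library, so the program may need to be linked with -lrt.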

Storage for asynchronous operation return and error status may be limited. Submission of asynchronous I/O operations may fail if this storage is exceeded. When an application retrieves the return status of a given asynchronous operation, therefore, any system-maintained storage used for this status and the error status may be reclaimed for use by other asynchronous operations.

Asynchronous I/O can be performed on file descriptors that have been enabled for POSIX.1b synchronized I/O. In this case, the I/O operation still occurs asynchronously, as defined herein; however, the asynchronous I/O operation is not completed until the I/O has reached either the state of synchronized I/O data integrity completion or synchronized I/O file integrity completion, depending on the sort of synchronized I/O that is enabled on the file descriptor.

Models

Three models illustrate the use of asynchronous I/O: a journalization model, a data acquisition model, and a model of the use of asynchronous I/O in supercomputing applications.

Journalization Model

Many realtime applications perform low-priority journalizing functions. Journalizing requires that logging records be queued for output without blocking the initiating process.

Data Acquisition Model

A data acquisition process may also serve as a model. The process has two or more channels delivering intermittent data that must be read within a certain time. The process issues one asynchronous read on each channel. When one of the channels needs data collection, the process reads the data and posts it through an asynchronous write to secondary memory for future processing.

Supercomputing Model

The supercomputing community has used asynchronous I/O much like that specified in POSIX.1 for many years. This community requires the ability to perform multiple I/O operations to multiple devices with a minimal number of entries to "the system"; each entry to "the system" provokes a major delay in operations when compared to the normal progress made by the application. This existing practice motivated the use of combined lseek() and read() or write() calls, as well as the lio_listio() call. Another common practice is to disable signal notification for I/O completion, and simply poll for I/O completion at some interval by which the I/O should be completed. Likewise, interfaces like aio_cancel() have been in successful commercial use for many years. Note also that an underlying implementation of asynchronous I/O will require the ability, at least internally, to cancel outstanding asynchronous I/O, at least when the process exits. (Consider an asynchronous read from a terminal, when the process intends to exit immediately.)
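
The idea of submitting several positioned transfers in one entry to the system can be sketched with lio_listio() as follows. This is an illustrative fragment under stated assumptions: the helper name write_pair() is hypothetical, signal notification is disabled for each request, and error handling is reduced to a single success or failure result.

```c
#include <aio.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: write two buffers at two offsets of fd with a single
 * lio_listio() call, waiting for both to complete (LIO_WAIT). */
int write_pair(int fd, const void *a, size_t alen, off_t aoff,
               const void *b, size_t blen, off_t boff)
{
    struct aiocb cbs[2];
    memset(cbs, 0, sizeof cbs);

    cbs[0].aio_fildes = fd;
    cbs[0].aio_buf = (void *)a;          /* aio_buf is not const-qualified */
    cbs[0].aio_nbytes = alen;
    cbs[0].aio_offset = aoff;            /* combined seek and write */
    cbs[0].aio_lio_opcode = LIO_WRITE;
    cbs[0].aio_sigevent.sigev_notify = SIGEV_NONE;

    cbs[1].aio_fildes = fd;
    cbs[1].aio_buf = (void *)b;
    cbs[1].aio_nbytes = blen;
    cbs[1].aio_offset = boff;
    cbs[1].aio_lio_opcode = LIO_WRITE;
    cbs[1].aio_sigevent.sigev_notify = SIGEV_NONE;

    struct aiocb *list[2] = { &cbs[0], &cbs[1] };

    /* Simplification: any failure here is reported as-is. A fuller version
     * would inspect each control block before giving up. */
    if (lio_listio(LIO_WAIT, list, 2, NULL) == -1)
        return -1;

    int ok = 1;
    for (int i = 0; i < 2; i++) {
        int err = aio_error(&cbs[i]);
        ssize_t n = aio_return(&cbs[i]);   /* reap every request once */
        if (err != 0 || n != (ssize_t)cbs[i].aio_nbytes)
            ok = 0;
    }
    return ok ? 0 : -1;
}
```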

Requirements

Asynchronous input and output for realtime implementations have these requirements:

The ability to queue multiple asynchronous read and write operations to a single open instance. Both sequential and random access should be supported.
The ability to queue asynchronous read and write operations to multiple open instances.
The ability to obtain completion status information by polling and/or asynchronous event notification.
Asynchronous event notification on asynchronous I/O completion is optional.
It has to be possible for the application to associate the event with the aiocbp for the operation that generated the event.
The ability to cancel queued requests.
The ability to wait upon asynchronous I/O completion in conjunction with other types of events.
The ability to accept an aio_read() and an aio_cancel() for a device that accepts a read(), and the ability to accept an aio_write() and an aio_cancel() for a device that accepts a write(). This does not imply that the operation is asynchronous.

Standardization Issues

The following issues are addressed by the standardization of asynchronous I/O:

Rationale for New Interface

Non-blocking I/O does not satisfy the needs of either realtime or high-performance computing models; these models require that a process overlap program execution and I/O processing. Realtime applications will often make use of direct I/O to or from the address space of the process, or require synchronized (unbuffered) I/O; they also require the ability to overlap this I/O with other computation. In addition, asynchronous I/O allows an application to keep a device busy at all times, possibly achieving greater throughput. Supercomputing and database architectures will often have specialized hardware that can provide true asynchrony underlying the logical asynchrony provided by this interface. In addition, asynchronous I/O should be supported by all types of files and devices in the same manner.

Effect of Buffering

If asynchronous I/O is performed on a file that is buffered prior to being actually written to the device, it is possible that asynchronous I/O will offer no performance advantage over normal I/O; the cycles stolen to perform the asynchronous I/O will be taken away from the running process and the I/O will occur at interrupt time. This potential lack of gain in performance in no way obviates the need for asynchronous I/O by realtime applications, which very often will use specialized hardware support, multiple processors, and/or unbuffered, synchronized I/O.

Memory Management

All memory management and shared memory definitions are located in the <sys/mman.h> header. This is for alignment with historical practice.

IEEE Std 1003.1-2001/Cor 1-2002, item XSH/TC1/D6/7 is applied, correcting the shading and margin markers in the introduction to Section 2.8.3.1.

Memory Locking Functions

This portion of the rationale presents models, requirements, and standardization issues relevant to process memory locking.

Models

Realtime systems that conform to IEEE Std 1003.1-2001 are expected (and desired) to be supported on systems with demand-paged virtual memory management, non-paged swapping memory management, and physical memory systems with no memory management hardware. The general case, however, is the demand-paged, virtual memory system with each POSIX process running in a virtual address space. Note that this includes architectures where each process resides in its own virtual address space and architectures where the address space of each process is only a portion of a larger global virtual address space.

The concept of memory locking is introduced to eliminate the indeterminacy introduced by paging and swapping, and to support an upper bound on the time required to access the memory mapped into the address space of a process. Ideally, this upper bound will be the same as the time required for the processor to access "main memory", including any address translation and cache miss overheads. But some implementations, primarily on mainframes, will not actually force locked pages to be loaded and held resident in main memory. Rather, they will handle locked pages so that accesses to these pages will meet the performance metrics for locked process memory in the implementation. Also, although it is not, for example, the intention that this interface, as specified, be used to lock process memory into "cache", it is conceivable that an implementation could support a large static RAM memory and define this as "main memory" and use a large[r] dynamic RAM as "backing store". These interfaces could then be interpreted as supporting the locking of process memory into the static RAM. Support for multiple levels of backing store would require extensions to these interfaces.

Implementations may also use memory locking to guarantee a fixed translation between virtual and physical addresses where such is beneficial to improving determinacy for direct-to/from-process input/output. IEEE Std 1003.1-2001 does not guarantee to the application that the virtual-to-physical address translations, if such exist, are fixed, because such behavior would not be implementable on all architectures on which implementations of IEEE Std 1003.1-2001 are expected. But IEEE Std 1003.1-2001 does mandate that an implementation define, for the benefit of potential users, whether or not locking guarantees fixed translations.

Memory locking is defined with respect to the address space of a process. Only the pages mapped into the address space of a process may be locked by the process, and when the pages are no longer mapped into the address space, for whatever reason, the locks established with respect to that address space are removed. Shared memory areas warrant special mention, as they may be mapped into more than one address space or mapped more than once into the address space of a process; locks may be established on pages within these areas with respect to several of these mappings. In such a case, the lock state of the underlying physical pages is the logical OR of the lock state with respect to each of the mappings. Only when all such locks have been removed are the shared pages considered unlocked.

In recognition of the page granularity of Memory Management Units (MMU), and in order to support locking of ranges of address space, memory locking is defined in terms of "page" granularity. That is, for the interfaces that support an address and size specification for the region to be locked, the address must be on a page boundary, and all pages mapped by the specified range are locked, if valid. This means that the length is implicitly rounded up to a multiple of the page size. The page size is implementation-defined and is available to applications as a compile-time symbolic constant or at runtime via sysconf().
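
The page-granularity rule can be illustrated with a small sketch that aligns an arbitrary buffer to page boundaries before locking it with mlock(). The helper names page_span() and lock_buffer() are hypothetical. Rounding the start address down is an application-side choice made here so that the address passed to mlock() lies on a page boundary; the explicit rounding of the length mirrors what the text describes as implicit.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: compute the page-aligned start and length of the
 * smallest run of whole pages covering [addr, addr + len). */
void page_span(uintptr_t addr, size_t len, size_t page,
               uintptr_t *start, size_t *span)
{
    uintptr_t first = addr - (addr % page);             /* round down */
    uintptr_t end = addr + len;
    uintptr_t last = end + (page - end % page) % page;  /* round up */
    *start = first;
    *span = (size_t)(last - first);
}

/* Hypothetical helper: lock every page touched by buf[0 .. len-1]. */
int lock_buffer(const void *buf, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);   /* page size queried at runtime */
    if (page <= 0)
        return -1;

    uintptr_t start;
    size_t span;
    page_span((uintptr_t)buf, len, (size_t)page, &start, &span);
    return mlock((const void *)start, span);
}
```

For example, with a 4096-byte page, a 10-byte buffer starting at address 4090 straddles two pages, so both pages (8192 bytes starting at address 0) fall inside the locked span.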

A "real memory" POSIX.1b implementation that has no MMU could elect not to support these interfaces, returning [ENOSYS]. But an application could easily interpret this as meaning that the implementation would unconditionally page or swap the application when such is not the case. It is the intention of IEEE Std 1003.1-2001 that such a system could define these interfaces as "NO-OPs", returning success without actually performing any function except for mandated argument checking.

Requirements

For realtime applications, memory locking is generally considered to be required as part of application initialization. This locking is performed after an application has been loaded (that is, exec'd) and the program remains locked for its entire lifetime. But to support applications that undergo major mode changes where, in one mode, locking is required, but in another it is not, the specified interfaces allow repeated locking and unlocking of memory within the lifetime of a process.

When a realtime application locks its address space, it should not be necessary for the application to then "touch" all of the pages in the address space to guarantee that they are resident or else suffer potential paging delays the first time the page is referenced. Thus, IEEE Std 1003.1-2001 requires that the pages locked by the specified interfaces be resident when the locking functions return successfully.

Many architectures support system-managed stacks that grow automatically when the current extent of the stack is exceeded. A realtime application has a requirement to be able to "preallocate" sufficient stack space and lock it down so that it will not suffer page faults to grow the stack during critical realtime operation. There was no consensus on a portable way to specify how much stack space is needed, so IEEE Std 1003.1-2001 supports no specific interface for preallocating stack space. But an application can portably lock down a specific amount of stack space by specifying MCL_FUTURE in a call to mlockall() and then calling a dummy function that declares an automatic array of the desired size.
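
The stack preallocation technique just described might look like the following sketch. The names reserve_stack() and lock_process_with_stack() and the 64 KiB reserve size are hypothetical. Two details are assumptions added for illustration: MCL_CURRENT is combined with MCL_FUTURE so that already-mapped pages are locked too, and the array is declared volatile and written to so that a compiler does not optimize it away.

```c
#include <stddef.h>
#include <sys/mman.h>

#define STACK_RESERVE (64 * 1024)   /* hypothetical amount of stack to lock */

/* Dummy function: declaring and touching a large automatic array forces the
 * stack to grow to the desired extent. */
void reserve_stack(void)
{
    volatile unsigned char pad[STACK_RESERVE];

    /* A 256-byte stride is assumed to be smaller than the page size, so
     * every page of the array is written at least once. */
    for (size_t i = 0; i < sizeof pad; i += 256)
        pad[i] = 0;
    pad[sizeof pad - 1] = 0;
}

/* Lock the process now and in the future, then grow the stack so that the
 * newly mapped stack pages are locked as they appear. */
int lock_process_with_stack(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
        return -1;          /* for example, insufficient privilege or limit */
    reserve_stack();
    return 0;
}
```

Whether the call succeeds depends on the privileges and resource limits of the process, so a realtime application would normally treat a failure here as an initialization error.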

Memory locking for realtime applications is also generally considered to be an "all or nothing" proposition. That is, the entire process, or none, is locked down. But, for applications that have well-defined sections that need to be locked and others that do not, IEEE Std 1003.1-2001 supports an optional set of interfaces to lock or unlock a range of process addresses. Reasons for locking down a specific range include:
- An asynchronous event handler function that must respond to external events in a deterministic manner such that page faults cannot be tolerated
- An input/output "buffer" area that is the target for direct-to-process I/O, and the overhead of implicit locking and unlocking for each I/O call cannot be tolerated

Finally, locking is generally viewed as an "application-wide" function. That is, the application is globally aware of which regions are locked and which are not over time. This is in contrast to a function that is used temporarily within a "third party" library routine whose function is unknown to the application, and therefore must have no "side effects". The specified interfaces, therefore, do not support "lock stacking" or "lock nesting" within a process. But, for pages that are shared between processes or mapped more than once into a process address space, "lock stacking" is essentially mandated by the requirement that unlocking of pages that are mapped by more than one process or more than once by the same process does not affect locks established on the other mappings.

There was some support for "lock stacking" so that locking could be transparently used in functions or opaque modules. But the consensus was not to burden all implementations with lock stacking (and reference counting), and an implementation option was proposed. There were strong objections to the option because applications would have to support both options in order to remain portable. The consensus was to eliminate lock stacking altogether, primarily through overwhelming support for the System V "m[un]lock[all]" interface on which IEEE Std 1003.1-2001 is now based.

Locks are not inherited across fork()s because some implementations implement fork() by creating new address spaces for the child. In such an implementation, requiring locks to be inherited would lead to new situations in which a fork would fail due to the inability of the system to lock sufficient memory to lock both the parent and the child. The consensus was that there was no benefit to such inheritance. Note that this does not mean that locks are removed when, for instance, a thread is created in the same address space.

Similarly, locks are not inherited across exec because some implementations implement exec by unmapping all of the pages in the address space (which, by definition, removes the locks on these pages), and then mapping in the pages of the exec'd image. In such an implementation, requiring locks to be inherited would lead to new situations in which exec would fail. Such a failure would be very cumbersome to detect in time to report to the calling process, and no appropriate mechanism exists for informing the exec'd process of its status.

It was determined that, if the newly loaded application required locking, it was the responsibility of that application to establish the locks. This is also in keeping with the general view that it is the responsibility of the application to be aware of all locks that are established.

There was one request to allow (not mandate) locks to be inherited across fork(), and a request for a flag, MCL_INHERIT, that would specify inheritance of memory locks across execs. Given the difficulties raised by this and the general lack of support for the feature in IEEE Std 1003.1-2001, it was not added. IEEE Std 1003.1-2001 does not preclude an implementation from providing this feature for administrative purposes, such as a "run" command that will lock down and execute a specified application. Additionally, the rationale for the objection equated fork() with creating a thread in the address space. IEEE Std 1003.1-2001 does not mandate releasing locks when creating additional threads in an existing process.

Standardization Issues

One goal of IEEE Std 1003.1-2001 is to define a set of primitives that provide the necessary functionality for realtime applications, with consideration for the needs of other application domains where such were identified, which is based to the extent possible on existing industry practice.

The Memory Locking option is required by many realtime applications to tune performance. Such a facility is accomplished by placing constraints on the virtual memory system to limit paging of the process or of critical sections of the process. This facility should not be used by most non-realtime applications.

Optional features provided in IEEE Std 1003.1-2001 allow applications to lock selected address ranges with the caveat that the process is responsible for being aware of the page granularity of locking and the unnested nature of the locks.

Mapped Files Functions

The Memory Mapped Files option provides a mechanism that allows a process to access files by directly incorporating file data into its address space. Once a file is "mapped" into a process address space, the data can be manipulated by instructions as memory. The use of mapped files can significantly reduce I/O data movement since file data does not have to be copied into process data buffers as in read() and write(). If more than one process maps a file, its contents are shared among them. This provides a low overhead mechanism by which processes can synchronize and communicate.

Historical Perspective

Realtime applications have historically been implemented using a collection of cooperating processes or tasks. In early systems, these processes ran on bare hardware (that is, without an operating system) with no memory relocation or protection. The application paradigms that arose from this environment involve the sharing of data between the processes.

When realtime systems were implemented on top of vendor-supplied operating systems, the paradigm or performance benefits of direct access to data by multiple processes was still deemed necessary. As a result, operating systems that claim to support realtime applications must support the shared memory paradigm.

Additionally, a number of realtime systems provide the ability to map specific sections of the physical address space into the address space of a process. This ability is required if an application is to obtain direct access to memory locations that have specific properties (for example, refresh buffers or display devices, dual ported memory locations, DMA target locations). The use of this ability is common enough to warrant some degree of standardization of its interface. This ability overlaps the general paradigm of shared memory in that, in both instances, common global objects are made addressable by individual processes or tasks.

Finally, a number of systems also provide the ability to map process addresses to files. This provides both a general means of sharing persistent objects, and using files in a manner that optimizes memory and swapping space usage.

Simple shared memory is clearly a special case of the more general file mapping capability. In addition, there is relatively widespread agreement and implementation of the file mapping interface. In these systems, many different types of objects can be mapped (for example, files, memory, devices, and so on) using the same mapping interfaces. This approach both minimizes interface proliferation and maximizes the generality of programs using the mapping interfaces.

Memory Mapped Files Usage

A memory object can be concurrently mapped into the address space of one or more processes. The mmap() and munmap() functions allow a process to manipulate its address space by mapping portions of memory objects into it and removing them from it. When multiple processes map the same memory object, they can share access to the underlying data. Implementations may restrict the size and alignment of mappings to be on page-size boundaries. The page size, in bytes, is the value of the system-configurable variable {PAGESIZE}, typically accessed by calling sysconf() with a name argument of _SC_PAGESIZE. If an implementation has no restrictions on size or alignment, it may specify a 1-byte page size.

To map memory, a process first opens a memory object. The ftruncate() function can be used to contract or extend the size of the memory object even when the object is currently mapped. If the memory object is extended, the contents of the extended areas are zeros.

After opening a memory object, the application maps the object into its address space using the mmap() function call. Once a mapping has been established, it remains mapped until unmapped with munmap(), even if the memory object is closed. The mprotect() function can be used to change the memory protections initially established by mmap().

A close() of the file descriptor, while invalidating the file descriptor itself, does not unmap any mappings established for the memory object. The address space, including all mapped regions, is inherited on fork(). The entire address space is unmapped on process termination or by successful calls to any of the exec family of functions.
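
The open, map, and close sequence can be sketched as follows. The helper name map_file_readonly() is hypothetical, and mapping the whole file read-only with MAP_PRIVATE is just one possible choice of arguments.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map an entire file read-only and return the mapping.
 * The file descriptor is closed before returning; the mapping stays valid
 * until the caller passes it to munmap(). */
void *map_file_readonly(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size == 0) {
        close(fd);              /* an empty file cannot be mapped */
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* does not unmap the region */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}
```

A caller reads the file contents through the returned pointer and releases the region with munmap(p, len) when finished.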

The msync() function is used to force mapped file data to permanent storage.
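
A minimal sketch of that use of msync() is shown below. The helper name patch_file_start() is hypothetical; it assumes the file already exists and is at least len bytes long, and it uses MS_SYNC so that the call returns only after the write to storage has completed.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: overwrite the first len bytes of an existing file
 * through a shared mapping, then force the change to permanent storage.
 * Assumes the file is at least len bytes long and len is nonzero. */
int patch_file_start(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd == -1)
        return -1;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;

    memcpy(p, data, len);             /* modify the file through memory */
    int rc = msync(p, len, MS_SYNC);  /* wait until the data is written */
    munmap(p, len);
    return rc;
}
```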

Effects on Other Functions

When the Memory Mapped Files option is supported, the operation of the open(), creat(), and unlink() functions is a natural result of using the file system name space to map the global names for memory objects.

The ftruncate() function can be used to set the length of a sharable memory object.

The meaning of stat() fields other than the size and protection information is undefined on implementations where memory objects are not implemented using regular files. When regular files are used, the times reflect when the implementation updated the file image of the data, not when a process updated the data in memory.

The operations of fdopen(), write(), read(), and lseek() were made unspecified for objects opened with shm_open(), so that implementations that did not implement memory objects as regular files would not have to support the operation of these functions on shared memory objects.

The behavior of memory objects with respect to close(), dup(), dup2(), open(), fork(), _exit(), and the exec family of functions is the same as the behavior of the existing practice of the mmap() function.

A memory object can still be referenced after a close. That is, any mappings made to the file are still in effect, and reads and writes that are made to those mappings are still valid and are shared with other processes that have the same mapping. Likewise, the memory object can still be used if any references remain after its name(s) have been deleted. Any references that remain after a close must not appear to the application as file descriptors.

This is existing practice for mmap() and close(). In addition, there are already mappings present (text, data, stack) that do not have open file descriptors. The text mapping in particular is considered a reference to the file containing the text. The desire was to treat all mappings by the process uniformly. Also, many modern implementations use mmap() to implement shared libraries, and it would not be desirable to keep file descriptors for each of the many libraries an application can use. It was felt there were many other existing programs that used this behavior to free a file descriptor, and thus IEEE Std 1003.1-2001 could not forbid it and still claim to be using existing practice.

For implementations that implement memory objects using memory only, memory objects will retain the memory allocated to the file after the last close and will use that same memory on the next open. Note that closing the memory object is not the same as deleting the name, since the memory object is still defined in the memory object name space.

The locks of fcntl() do not block any read or write operation, including read or write access to shared memory or mapped files. In addition, implementations that only support shared memory objects should not be required to implement record locks. The reference to fcntl() is added to make this point explicitly. The other fcntl() commands are useful with shared memory objects.

The size of pages that mapping hardware may be able to support may be a configurable value, or it may change based on hardware implementations. The addition of the _SC_PAGESIZE parameter to the sysconf() function is provided for determining the mapping page size at runtime.

Shared Memory Functions

Implementations may support the Shared Memory Objects option without supporting a general Memory Mapped Files option. Shared memory objects are named regions of storage that may be independent of the file system and can be mapped into the address space of one or more processes to allow them to share the associated memory.

Requirements

Shared memory is used to share data among several processes, each potentially running at different priority levels, responding to different inputs, or performing separate tasks. Shared memory does not simply provide common access to data; it provides the fastest possible communication between the processes. With one memory write operation, a process can pass information to as many processes as have the memory region mapped.

As a result, shared memory provides a mechanism that can be used for all other interprocess communication facilities. It may also be used by an application for implementing more sophisticated mechanisms than semaphores and message queues.

The need for a shared memory interface is obvious for virtual memory systems, where the operating system is directly preventing processes from accessing each other's data. However, in unprotected systems, such as those found in some embedded controllers, a shared memory interface is needed to provide a portable mechanism to allocate a region of memory to be shared and then to communicate the address of that region to other processes.

This, then, provides the minimum functionality that a shared memory interface must have in order to support realtime applications: to allocate and name an object to be mapped into memory for potential sharing (open() or shm_open()), and to make the memory object available within the address space of a process (mmap()). To complete the interface, a mechanism to release the claim of a process on a shared memory object (munmap()) is also needed, as well as a mechanism for deleting the name of a sharable object that was previously created (unlink() or shm_unlink()).

After a mapping has been established, an implementation should not have to provide services to maintain that mapping. All memory writes into that area will appear immediately in the memory mapping of that region by any other processes.

Thus, requirements include:
- Support creation of sharable memory objects and the mapping of these objects into the address space of a process.
- Sharable memory objects should be accessed by global names accessible from all processes.
- Support the mapping of specific sections of physical address space (such as a memory mapped device) into the address space of a process. This should not be done by the process specifying the actual address, but again by an implementation-defined global name (such as a special device name) dedicated to this purpose.
- Support the mapping of discrete portions of these memory objects.
- Support for minimum hardware configurations that contain no physical media on which to store shared memory contents permanently.
- The ability to preallocate the entire shared memory region so that minimum hardware configurations without virtual memory support can guarantee contiguous space.
- The maximizing of performance by not requiring functionality that would require implementation interaction above creating the shared memory area and returning the mapping.

Note that the above requirements do not preclude:
- The sharable memory object from being implemented using actual files on an actual file system.
- The global name that is accessible from all processes being restricted to a file system area that is dedicated to handling shared memory.
- An implementation not providing implementation-defined global names for the purpose of physical address mapping.

Shared Memory Objects Usage

If the Shared Memory Objects option is supported, a shared memory object may be created, or opened if it already exists, with the shm_open() function. If the shared memory object is created, it has a length of zero. The ftruncate() function can be used to set the size of the shared memory object after creation. The shm_unlink() function removes the name for a shared memory object created by shm_open().
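
The sequence of calls described above can be sketched as follows. The helper name shm_create_map() is hypothetical, and the use of O_EXCL (so that only a newly created object is sized) and of mode 0600 are choices made for this illustration.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a new shared memory object called name, give
 * it a size of len bytes, and map it read/write into the caller's address
 * space. Returns NULL on failure. */
void *shm_create_map(const char *name, size_t len)
{
    /* A newly created shared memory object has a length of zero. */
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd == -1)
        return NULL;

    /* Set the size; the extended area reads as zeros. */
    if (ftruncate(fd, (off_t)len) == -1) {
        close(fd);
        shm_unlink(name);
        return NULL;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping remains after the close */
    if (p == MAP_FAILED) {
        shm_unlink(name);
        return NULL;
    }
    return p;
}
```

Another process that calls shm_open() with the same name and maps the object with MAP_SHARED sees the same memory. The creator removes the name with shm_unlink() once it is no longer needed.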

Shared Memory Overview

The shared memory facility defined by IEEE Std 1003.1-2001 usually results in memory locations being added to the address space of the process. The implementation returns the address of the new space to the application by means of a pointer. This works well in languages like C. However, in languages without pointer types it will not work. In the bindings for such a language, either a special COMMON section will need to be defined (which is unlikely), or the binding will have to allow existing structures to be mapped. The implementation will likely have to place restrictions on the size and alignment of such structures or will have to map a suitable region of the address space of the process into the memory object, and thus into other processes. These are issues for that particular language binding. For IEEE Std 1003.1-2001, however, the practice will not be forbidden, merely undefined.

Two potentially different name spaces are used for naming objects that may be mapped into process address spaces. When the Memory Mapped Files option is supported, files may be accessed via open(). When the Shared Memory Objects option is supported, sharable memory objects that might not be files may be accessed via the shm_open() function. These options are not mutually exclusive.

Some implementations supporting the Shared Memory Objects option may choose to implement the shared memory object name space as part of the file system name space. There are several reasons for this:
- It allows applications to prevent name conflicts by use of the directory structure.
- It uses an existing mechanism for accessing global objects and prevents the creation of a new mechanism for naming global objects.

In such implementations, memory objects can be implemented using regular files, if that is what the implementation chooses. The shm_open() function can be implemented as an open() call in a fixed directory followed by a call to fcntl() to set FD_CLOEXEC. The shm_unlink() function can be implemented as an unlink() call.
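
The file-based approach can be illustrated with the following sketch. It is not how any particular implementation does it: the names demo_shm_open() and demo_shm_unlink(), the directory /tmp, and the 256-byte path limit are all hypothetical, and the sketch does not validate the object name beyond its length.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_SHM_DIR "/tmp"   /* hypothetical fixed directory */

/* Build DEMO_SHM_DIR + name, where name is expected to begin with '/'. */
static int demo_shm_path(char *path, size_t size, const char *name)
{
    int n = snprintf(path, size, "%s%s", DEMO_SHM_DIR, name);
    if (n < 0 || (size_t)n >= size) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return 0;
}

/* Sketch of shm_open() as open() in a fixed directory plus FD_CLOEXEC. */
int demo_shm_open(const char *name, int oflag, mode_t mode)
{
    char path[256];
    if (demo_shm_path(path, sizeof path, name) == -1)
        return -1;

    int fd = open(path, oflag, mode);
    if (fd == -1)
        return -1;

    /* Set the close-on-exec flag, as described in the text. */
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1 || fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Sketch of shm_unlink() as a plain unlink() of the backing file. */
int demo_shm_unlink(const char *name)
{
    char path[256];
    if (demo_shm_path(path, sizeof path, name) == -1)
        return -1;
    return unlink(path);
}
```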

On the other hand, it is also expected that small embedded systems that support the Shared Memory Objects option may wish to implement shared memory without having any file systems present. In this case, the implementations may choose to use a simple string valued name space for shared memory regions. The shm_open() function permits either type of implementation.

Some implementations have hardware that supports protection of mapped data from certain classes of access and some do not. Systems that supply this functionality can support the Memory Protection option.

Some implementations restrict size, alignment, and protections to be on page-size boundaries. If an implementation has no restrictions on size or alignment, it may specify a 1-byte page size. Applications on implementations that do support larger pages must be cognizant of the page size since this is the alignment and protection boundary.

Simple embedded implementations may have a 1-byte page size and only support the Shared Memory Objects option. This provides simple shared memory between processes without requiring mapping hardware.

IEEE Std 1003.1-2001 specifically allows a memory object to remain referenced after a close because that is existing practice for the mmap() function.

Typed Memory Functions

Implementations may support the Typed Memory Objects option without supporting either the Shared Memory Objects option or the Memory Mapped Files option. Typed memory objects are pools of specialized storage, different from the main memory resource normally used by a processor to hold code and data, that can be mapped into the address space of one or more processes.

Model

Realtime systems conforming to one of the POSIX.13 realtime profiles are expected (and desired) to be supported on systems with more than one type or pool of memory (for example, SRAM, DRAM, ROM, EPROM, EEPROM), where each type or pool of memory may be accessible by one or more processors via one or more busses (ports). Memory mapped files, shared memory objects, and the language-specific storage allocation operators (malloc() for the ISO C standard, new for ISO Ada) fail to provide application program interfaces versatile enough to allow applications to control their utilization of such diverse memory resources. The typed memory interfaces posix_typed_mem_open(), posix_mem_offset(), posix_typed_mem_get_info(), mmap(), and munmap() defined herein support the model of typed memory described below.

For purposes of this model, a system comprises several processors (for example, P₁ and P₂), several physical memory pools (for example, M₁, M₂, M_2a, M_2b, M₃, M₄, and M₅), and several busses or "ports" (for example, B₁, B₂, B₃, and B₄) interconnecting the various processors and memory pools in some system-specific way. Notice that some memory pools may be contained in others (for example, M_2a and M_2b are contained in M₂).

The figure Example of a System with Typed Memory shows one such system. In a system like this, an application should be able to perform the following operations:

Figure: Example of a System with Typed Memory
- Typed Memory Allocation
  
  An application should be able to allocate memory dynamically from the desired pool using the desired bus, and map it into a process' address space. For example, processor P₁ can allocate some portion of memory pool M₁ through port B₁, treating all unmapped subareas of M₁ as a heap-storage resource from which memory may be allocated. This portion of memory is mapped into the process' address space, and subsequently deallocated when unmapped from all processes.
- Using the Same Storage Region from Different Busses
  
  An application process with a mapped region of storage that is accessed from one bus should be able to map that same storage area at another address (subject to page size restrictions detailed in mmap()), to allow it to be accessed from another bus. For example, processor P₁ may wish to access the same region of memory pool M_2b both through ports B₁ and B₂.
- Sharing Typed Memory Regions
  
  Several application processes running on the same or different processors may wish to share a particular region of a typed memory pool. Each process or processor may wish to access this region through different busses. For example, processor P₁ may want to share a region of memory pool M₄ with processor P₂, and they may be required to use busses B₂ and B₃, respectively, to minimize bus contention. A problem arises here when a process allocates and maps a portion of fragmented memory and then wants to share this region of memory with another process, either in the same processor or different processors. The solution adopted is to allow the first process to find out the memory map (offsets and lengths) of all the different fragments of memory that were mapped into its address space, by repeatedly calling posix_mem_offset(). Then, this process can pass the offsets and lengths obtained to the second process, which can then map the same memory fragments into its address space.
- Contiguous Allocation
  
  The problem of finding the memory map of the different fragments of the memory pool that were mapped into logically contiguous addresses of a given process can be solved by requesting contiguous allocation. For example, a process in P₁ can allocate 10 Kbytes of physically contiguous memory from M₃-B₁, and obtain the offset (within pool M₃) of this block of memory. Then, it can pass this offset (and the length) to a process in P₂ using some interprocess communication mechanism. The second process can map the same block of memory by using the offset transferred and specifying M₃-B₂.
- Unallocated Mapping
  
  Any subarea of a memory pool that is mapped to a process, either as the result of an allocation request or an explicit mapping, is normally unavailable for allocation. Special processes such as debuggers, however, may need to map large areas of a typed memory pool, yet leave those areas available for allocation.
Typed memory allocation and mapping has to coexist with storage allocation operators like malloc(), but systems are free to choose how to implement this coexistence. For example, it may be system configuration-dependent if all available system memory is made part of one of the typed memory pools or if some part will be restricted to conventional allocation operators. Equally system configuration-dependent may be the availability of operators like malloc() to allocate storage from certain typed memory pools. It is not excluded to configure a system such that a given named pool, P₁, is in turn split into non-overlapping named subpools. For example, M₁-B₁, M₂-B₁, and M₃-B₁ could also be accessed as one common pool M₁₂₃-B₁. A call to malloc() on P₁ could work on such a larger pool while full optimization of memory usage by P₁ would require typed memory allocation at the subpool level.
Existing Practice

OS-9 provides for the naming (numbering) and prioritization of memory types by a system administrator. It then provides APIs to request memory allocation of typed (colored) memory by number, and to generate a bus address from a mapped memory address (translate). When requesting colored memory, the user can specify type 0 to signify allocation from the first available type in priority order.

HP-RT presents interfaces to map different kinds of storage regions that are visible through a VME bus, although it does not provide allocation operations. It also provides functions to perform address translation between VME addresses and virtual addresses. It represents a VME-bus unique solution to the general problem.

The PSOS approach is similar (that is, based on a pre-established mapping of bus address ranges to specific memories) with a concept of segments and regions (regions dynamically allocated from a heap which is a special segment). Therefore, PSOS does not fully address the general allocation problem either. PSOS does not have a "process"-based model, but more of a "thread"-only-based model of multi-tasking. So mapping to a process address space is not an issue.

QNX uses the System V approach of opening specially named devices (shared memory segments) and using mmap() to then gain access from the process. QNX does not address allocation directly, but once typed shared memory can be mapped, an "allocation manager" process could be written to handle requests for allocation.

The System V approach also included allocation, implemented by opening yet other special "devices" which allocate, rather than appearing as a whole memory object.

The Orkid realtime kernel interface definition has operations to manage memory "regions" and "pools", which are areas of memory that may reflect the differing physical nature of the memory. Operations to allocate memory from these regions and pools are also provided.
Requirements

Existing practice in SVID-derived UNIX systems relies on functionality similar to mmap() and its related interfaces to achieve mapping and allocation of typed memory. However, the issue of sharing typed memory (allocated or mapped) and the complication of multiple ports are not addressed in any consistent way by existing UNIX system practice. Part of this functionality is existing practice in specialized realtime operating systems. In order to solidify the capabilities implied by the model above, the following requirements are imposed on the interface:
- Identification of Typed Memory Pools and Ports
  
  All processes (running in all processors) in the system are able to identify a particular (system configured) typed memory pool accessed through a particular (system configured) port by a name. That name is a member of a name space common to all these processes, but need not be the same name space as that containing ordinary filenames. The association between memory pools/ports and corresponding names is typically established when the system is configured. The "open" operation for typed memory objects should be distinct from the open() function, for consistency with other similar services, but implementable on top of open(). This implies that the handle for a typed memory object will be a file descriptor.
- Allocation and Mapping of Typed Memory
  
  Once a typed memory object has been identified by a process, it is possible to both map user-selected subareas of that object into process address space and to map system-selected (that is, dynamically allocated) subareas of that object, with user-specified length, into process address space. It is also possible to determine the maximum length of memory allocation that may be requested from a given typed memory object.
- Sharing Typed Memory
  
  Two or more processes are able to share portions of typed memory, either user-selected or dynamically allocated. This requirement applies also to dynamically allocated regions of memory that are composed of several non-contiguous pieces.
- Contiguous Allocation
  
  For dynamic allocation, it is the user's option whether the system is required to allocate a contiguous subarea within the typed memory object, or whether it is permitted to allocate discontiguous fragments which appear contiguous in the process mapping. Contiguous allocation simplifies the process of sharing allocated typed memory, while discontiguous allocation allows for potentially better recovery of deallocated typed memory.
- Accessing Typed Memory Through Different Ports
  
  Once a subarea of a typed memory object has been mapped, it is possible to determine the location and length corresponding to a user-selected portion of that object within the memory pool. This location and length can then be used to remap that portion of memory for access from another port. If the referenced portion of typed memory was allocated discontiguously, the length thus determined may be shorter than anticipated, and the user code must adapt to the value returned.
- Deallocation
  
  When a previously mapped subarea of typed memory is no longer mapped by any process in the system, as a result of a call or calls to munmap(), that subarea becomes potentially reusable for dynamic allocation; actual reuse of the subarea is a function of the dynamic typed memory allocation policy.
- Unallocated Mapping
  
  It must be possible to map user-selected subareas of a typed memory object without marking that subarea as unavailable for allocation. This option is not the default behavior, and requires appropriate privilege.
Scenario

The following scenario will serve to clarify the use of the typed memory interfaces.

Process A running on P₁ (see Example of a System with Typed Memory) wants to allocate some memory from memory pool M₂, and it wants to share this portion of memory with process B running on P₂. Since P₂ only has access to the lower part of M₂, both processes will use the memory pool named M_2b which is the part of M₂ that is accessible both from P₁ and P₂. The operations that both processes need to perform are shown below:
- Allocating Typed Memory
  
  Process A calls posix_typed_mem_open() with the name /typed.m2b-b1 and a tflag of POSIX_TYPED_MEM_ALLOCATE to get a file descriptor usable for allocating from pool M_2b accessed through port B₁. It then calls mmap() with this file descriptor requesting a length of 4096 bytes. The system allocates two discontiguous blocks of sizes 1024 and 3072 bytes within M_2b. The mmap() function returns a pointer to a 4096-byte array in process A's logical address space, mapping the allocated blocks contiguously. Process A can then utilize the array, and store data in it.
- Determining the Location of the Allocated Blocks
  
  Process A can determine the lengths and offsets (relative to M_2b) of the two blocks allocated, by using the following procedure: First, process A calls posix_mem_offset() with the address of the first element of the array and length 4096. Upon return, the offset and length (1024 bytes) of the first block are returned. A second call to posix_mem_offset() is then made using the address of the first element of the array plus 1024 (the length of the first block), and a new length of 4096-1024. If there were more fragments allocated, this procedure could have been continued within a loop until the offsets and lengths of all the blocks were obtained. Notice that this relatively complex procedure can be avoided if contiguous allocation is requested (by opening the typed memory object with the tflag POSIX_TYPED_MEM_ALLOCATE_CONTIG).
- Sharing Data Across Processes
  
  Process A passes the two offset values and lengths obtained from the posix_mem_offset() calls to process B running on P₂, via some form of interprocess communication. Process B can gain access to process A's data by calling posix_typed_mem_open() with the name /typed.m2b-b2 and a tflag of zero, then using two mmap() calls on the resulting file descriptor to map the two subareas of that typed memory object to its own address space.
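
The fragment-discovery loop that process A performs in this scenario might be sketched as follows. Because few systems implement the Typed Memory Objects option, the sketch takes the query function as a parameter with the same shape as posix_mem_offset() and exercises it with a hypothetical stand-in, fake_mem_offset(), that models the two blocks of 1024 and 3072 bytes. The names collect_fragments(), struct fragment, fake_region, and the pool offsets 8192 and 16384 are assumptions made for illustration:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Same parameter list as posix_mem_offset(), so the real function could be passed. */
typedef int (*mem_offset_fn)(const void *addr, size_t len, off_t *off,
                             size_t *contig_len, int *fildes);

struct fragment {
    off_t  off;   /* offset of the block within the typed memory object */
    size_t len;   /* length of the contiguous block */
};

/*
 * Walk a mapping of `total` bytes starting at `base`, asking `query` for
 * the offset and contiguous length of each block in turn.
 * Returns the number of fragments stored in out[], or -1 on failure.
 */
int collect_fragments(mem_offset_fn query, const void *base, size_t total,
                      struct fragment *out, int max)
{
    const char *p = base;
    size_t remaining = total;
    int n = 0;

    while (remaining > 0) {
        off_t off;
        size_t contig;
        int fd;

        if (n == max)
            return -1;                      /* caller's table is too small */
        if (query(p, remaining, &off, &contig, &fd) != 0)
            return -1;
        if (contig == 0 || contig > remaining)
            return -1;                      /* defensive: no progress possible */
        out[n].off = off;
        out[n].len = contig;
        n++;
        p += contig;                        /* continue after this block */
        remaining -= contig;
    }
    return n;
}

/* Hypothetical stand-in: a 4096-byte mapping backed by blocks of 1024 and 3072 bytes. */
char fake_region[4096];

int fake_mem_offset(const void *addr, size_t len, off_t *off,
                    size_t *contig_len, int *fildes)
{
    uintptr_t pos = (uintptr_t)addr - (uintptr_t)fake_region;
    size_t block_end;

    if (pos >= sizeof fake_region)
        return -1;
    if (pos < 1024) {
        *off = (off_t)(8192 + pos);           /* first block, assumed offset */
        block_end = 1024;
    } else {
        *off = (off_t)(16384 + (pos - 1024)); /* second block, assumed offset */
        block_end = sizeof fake_region;
    }
    *contig_len = len < block_end - pos ? len : block_end - pos;
    *fildes = -1;                           /* no real descriptor in the model */
    return 0;
}
```

With the stand-in, collect_fragments(fake_mem_offset, fake_region, 4096, table, 4) yields two entries of lengths 1024 and 3072, which are the offset and length pairs that process A would pass to process B.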
Rationale for no mem_alloc() and mem_free()

The standard developers had originally proposed a pair of new flags to mmap() which, when applied to a typed memory object descriptor, would cause mmap() to allocate dynamically from an unallocated and unmapped area of the typed memory object. Deallocation was similarly accomplished through the use of munmap(). This was rejected by the ballot group because it excessively complicated the (already rather complex) mmap() interface and introduced semantics useful only for typed memory, to a function which must also map shared memory and files. They felt that a memory allocator should be built on top of mmap() instead of being incorporated within the same interface, much as the ISO C standard libraries build malloc() on top of the virtual memory mapping functions brk() and sbrk(). This would eliminate the complicated semantics involved with unmapping only part of an allocated block of typed memory.

To attempt to achieve ballot group consensus, typed memory allocation and deallocation was first migrated from mmap() and munmap() to a pair of complementary functions modeled on the ISO C standard malloc() and free(). The mem_alloc() function specified explicitly the typed memory object (typed memory pool/access port) from which allocation takes place, unlike malloc() where the memory pool and port are unspecified. The mem_free() function handled deallocation. These new semantics still met all of the requirements detailed above without modifying the behavior of mmap() except to allow it to map specified areas of typed memory objects. An implementation would have been free to implement mem_alloc() and mem_free() over mmap(), through mmap(), or independently but cooperating with mmap().

The ballot group was queried to see if this was an acceptable alternative, and while there was some agreement that it achieved the goal of removing the complicated semantics of allocation from the mmap() interface, several balloters realized that it just created two additional functions that behaved, in great part, like mmap(). These balloters proposed an alternative which has been implemented here in place of a separate mem_alloc() and mem_free(). This alternative is based on four specific suggestions:
1. The posix_typed_mem_open() function should provide a flag which specifies "allocate on mmap()" (otherwise, mmap() just maps the underlying object). This allows things roughly similar to /dev/zero versus /dev/swap. Two such flags have been implemented, one of which forces contiguous allocation.
2. The posix_mem_offset() function is acceptable because it can be applied usefully to mapped objects in general. It should return the file descriptor of the underlying object.
3. The mem_get_info() function in an earlier draft should be renamed posix_typed_mem_get_info() because it is not generally applicable to memory objects. It should probably return the file descriptor's allocation attribute. The renaming of the function has been implemented, but having it return a piece of information which is readily known by an application without this function has been rejected. Its whole purpose is to query the typed memory object for attributes that are not user-specified, but determined by the implementation.
4. There should be no separate mem_alloc() or mem_free() functions. Instead, using mmap() on a typed memory object opened with an "allocate on mmap()" flag should be used to force allocation. These are precisely the semantics defined in the current draft.
Rationale for no Typed Memory Access Management

The working group had originally defined an additional interface (and an additional kind of object: typed memory master) to establish and dissolve mappings to typed memory on behalf of devices or processors which were independent of the operating system and had no inherent capability to directly establish mappings on their own. This was to have provided functionality similar to device driver interfaces such as physio() and their underlying bus-specific interfaces (for example, mballoc()) which serve to set up and break down DMA pathways, and derive mapped addresses for use by hardware devices and processor cards.

The ballot group felt that this was beyond the scope of POSIX.1 and its amendments. Furthermore, the removal of interrupt handling interfaces from a preceding amendment (IEEE Std 1003.1d-1999) during its balloting process renders these typed memory access management interfaces an incomplete solution to portable device management from a user process; it would be possible to initiate a device transfer to/from typed memory, but impossible to handle the transfer-complete interrupt in a portable way.

To achieve ballot group consensus, all references to typed memory access management capabilities were removed. The concept of portable interfaces from a device driver to both operating system and hardware is being addressed by the Uniform Driver Interface (UDI) industry forum, with formal standardization deferred until proof of concept and industry-wide acceptance and implementation.

Process Scheduling

IEEE PASC Interpretation 1003.1 #96 has been applied, adding the pthread_setschedprio() function. This was added since previously there was no way for a thread to lower its own priority without going to the tail of the threads list for its new priority. This capability is necessary to bound the duration of priority inversion encountered by a thread.

The following portion of the rationale presents models, requirements, and standardization issues relevant to process scheduling; see also Thread Scheduling.

In an operating system supporting multiple concurrent processes, the system determines the order in which processes execute to meet implementation-defined goals. For time-sharing systems, the goal is to enhance system throughput and promote fairness; the application is provided with little or no control over this sequencing function. While this is acceptable and desirable behavior in a time-sharing system, it is inappropriate in a realtime system; realtime applications must specifically control the execution sequence of their concurrent processes in order to meet externally defined response requirements.

In IEEE Std 1003.1-2001, the control over process sequencing is provided using a concept of scheduling policies. These policies, described in detail in this section, define the behavior of the system whenever processor resources are to be allocated to competing processes. Only the behavior of the policy is defined; conforming implementations are free to use any mechanism desired to achieve the described behavior.

Models

In an operating system supporting multiple concurrent processes, the system determines the order in which processes execute and might force long-running processes to yield to other processes at certain intervals. Typically, the scheduling code is executed whenever an event occurs that might alter the process to be executed next.

The simplest scheduling strategy is a "first-in, first-out" (FIFO) dispatcher. Whenever a process becomes runnable, it is placed on the end of a ready list. The process at the front of the ready list is executed until it exits or becomes blocked, at which point it is removed from the list. This scheduling technique is also known as "run-to-completion" or "run-to-block".

A natural extension to this scheduling technique is the assignment of a "non-migrating priority" to each process. This policy differs from strict FIFO scheduling in only one respect: whenever a process becomes runnable, it is placed at the end of the list of processes runnable at that priority level. When selecting a process to run, the system always selects the first process from the highest priority queue with a runnable process. Thus, when a process becomes unblocked, it will preempt a running process of lower priority without otherwise altering the ready list. Further, if a process elects to alter its priority, it is removed from the ready list and reinserted, using its new priority, according to the policy above.
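
The ready-list behavior described above can be modeled with one FIFO queue per priority level. The following is a sketch of the model only, not of any implementation's scheduler; the names struct ready_list, rl_init(), rl_make_runnable(), rl_select(), and rl_remove(), and the fixed sizes, are hypothetical. Larger numbers denote higher priorities, and the running process is taken to be the one rl_select() returns:

```c
#include <string.h>

#define RL_NPRIO 4   /* number of priority levels in the model (assumption) */
#define RL_QCAP  8   /* capacity of each per-priority queue (assumption) */

struct ready_list {
    int pid[RL_NPRIO][RL_QCAP];  /* pid[p][0] is the head of queue p */
    int count[RL_NPRIO];
};

void rl_init(struct ready_list *rl)
{
    memset(rl, 0, sizeof *rl);
}

/* A process becomes runnable: place it at the end of its priority's queue. */
int rl_make_runnable(struct ready_list *rl, int pid, int prio)
{
    if (prio < 0 || prio >= RL_NPRIO || rl->count[prio] == RL_QCAP)
        return -1;
    rl->pid[prio][rl->count[prio]++] = pid;
    return 0;
}

/* Select the first process of the highest-priority non-empty queue. */
int rl_select(const struct ready_list *rl)
{
    for (int prio = RL_NPRIO - 1; prio >= 0; prio--)
        if (rl->count[prio] > 0)
            return rl->pid[prio][0];
    return -1;                   /* nothing is runnable */
}

/* A process blocks, exits, or is about to change priority: remove it. */
int rl_remove(struct ready_list *rl, int pid)
{
    for (int prio = 0; prio < RL_NPRIO; prio++) {
        for (int i = 0; i < rl->count[prio]; i++) {
            if (rl->pid[prio][i] == pid) {
                memmove(&rl->pid[prio][i], &rl->pid[prio][i + 1],
                        (size_t)(rl->count[prio] - i - 1) * sizeof(int));
                rl->count[prio]--;
                return 0;
            }
        }
    }
    return -1;
}
```

In this model a priority change is rl_remove() followed by rl_make_runnable() with the new priority, which reinserts the process at the end of the queue for that priority.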

While the above policy might be considered unfriendly in a time-sharing environment in which multiple users require more balanced resource allocation, it could be ideal in a realtime environment for several reasons. The most important of these is that it is deterministic: the highest-priority process is always run and, among processes of equal priority, the process that has been runnable for the longest time is executed first. Because of this determinism, cooperating processes can implement more complex scheduling simply by altering their priority. For instance, if processes at a single priority were to reschedule themselves at fixed time intervals, a time-slice policy would result.

In a dedicated operating system in which all processes are well-behaved realtime applications, non-migrating priority scheduling is sufficient. However, many existing implementations provide for more complex scheduling policies.

IEEE Std 1003.1-2001 specifies a linear scheduling model. In this model, every process in the system has a priority. The system scheduler always dispatches a process that has the highest (generally the most time-critical) priority among all runnable processes in the system. As long as there is only one such process, the dispatching policy is trivial. When multiple processes of equal priority are eligible to run, they are ordered according to a strict run-to-completion (FIFO) policy.

The priority is represented as a positive integer and is inherited from the parent process. For processes running under a fixed priority scheduling policy, the priority is never altered except by an explicit function call.

It was determined arbitrarily that larger integers correspond to "higher priorities".

Certain implementations might impose restrictions on the priority ranges to which processes can be assigned. There also can be restrictions on the set of policies to which processes can be set.
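
Because the permitted priority range depends on the implementation and the policy, an application would normally query it rather than assume it. A minimal sketch follows; the helper name clamp_priority() is hypothetical, while sched_get_priority_min() and sched_get_priority_max() are the standard query functions:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <errno.h>
#include <sched.h>

/*
 * Clamp a requested priority into the range that the implementation
 * allows for the given policy. Stores the result in *out and returns 0,
 * or returns -1 if the range cannot be queried.
 */
int clamp_priority(int policy, int requested, int *out)
{
    int lo, hi;

    errno = 0;
    lo = sched_get_priority_min(policy);
    if (lo == -1 && errno != 0)
        return -1;
    errno = 0;
    hi = sched_get_priority_max(policy);
    if (hi == -1 && errno != 0)
        return -1;

    if (requested < lo)
        requested = lo;
    else if (requested > hi)
        requested = hi;
    *out = requested;
    return 0;
}
```

The clamped value could then be stored in the sched_priority member of a struct sched_param and passed to sched_setscheduler() or sched_setparam(); those calls typically require appropriate privilege and are omitted from the sketch.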
Requirements

Realtime processes require that scheduling be fast and deterministic, and that it be guaranteed to preempt lower-priority processes.

Thus, given the linear scheduling model, realtime processes require that they be run at a priority that is higher than other processes. Within this framework, realtime processes are free to yield execution resources to each other in a completely portable and implementation-defined manner.

As there is a generally perceived requirement for processes at the same priority level to share processor resources more equitably, provisions are made by providing a scheduling policy (that is, SCHED_RR) intended to provide a timeslice-like facility.

Note:

The following topics assume that low numeric priority implies low scheduling criticality and vice versa.
Rationale for New Interface

Realtime applications need to be able to determine when processes will run in relation to each other. It must be possible to guarantee that a critical process will run whenever it is runnable; that is, whenever it wants to for as long as it needs. SCHED_FIFO satisfies this requirement. Additionally, SCHED_RR was defined to meet a realtime requirement for a well-defined time-sharing policy for processes at the same priority.

It would be possible to use the BSD setpriority() and getpriority() functions by redefining the meaning of the "nice" parameter according to the scheduling policy currently in use by the process. The System V nice() interface was felt to be undesirable for realtime because it specifies an adjustment to the "nice" value, rather than setting it to an explicit value. Realtime applications will usually want to set priority to an explicit value. Also, System V nice() does not allow for changing the priority of another process.

With the POSIX.1b interfaces, the traditional "nice" value does not affect the SCHED_FIFO or SCHED_RR scheduling policies. If a "nice" value is supported, it is implementation-defined whether it affects the SCHED_OTHER policy.

An important aspect of IEEE Std 1003.1-2001 is the explicit description of the queuing and preemption rules. It is critical, to achieve deterministic scheduling, that such rules be stated clearly in IEEE Std 1003.1-2001.

IEEE Std 1003.1-2001 does not address the interaction between priority and swapping. The issues involved with swapping and virtual memory paging are extremely implementation-defined and would be nearly impossible to standardize at this point. The proposed scheduling paradigm, however, fully describes the scheduling behavior of runnable processes, of which one criterion is that the working set be resident in memory. Assuming the existence of a portable interface for locking portions of a process in memory, paging behavior need not affect the scheduling of realtime processes.

IEEE Std 1003.1-2001 also does not address the priorities of "system" processes. In general, these processes should always execute in low-priority ranges to avoid conflict with other realtime processes. Implementations should document the priority ranges in which system processes run.

The default scheduling policy is not defined. The effect of I/O interrupts and other system processing activities is not defined. The temporary lending of priority from one process to another (such as for the purpose of effecting the freeing of resources) by the system is not addressed. Preemption of resources is not addressed. Restrictions on the ability of a process to affect other processes beyond a certain level (influence levels) are not addressed.

The rationale used to justify the simple time-quantum scheduler is that it is common practice to depend upon this type of scheduling to ensure "fair" distribution of processor resources among portions of the application that must interoperate in a serial fashion. Note that IEEE Std 1003.1-2001 is silent with respect to the setting of this time quantum, or whether it is a system-wide value or a per-process value, although it appears that the prevailing realtime practice is for it to be a system-wide value.

In a system with N processes at a given priority, all processor-bound, in which the time quantum is equal for all processes at a specific priority level, the following assumptions are made of such a scheduling policy:
1. A time quantum Q exists and the current process will own control of the processor for at least a duration of Q and will have the processor for a duration of Q.
2. The Nth process at that priority will control a processor within a duration of (N-1) × Q.
These assumptions are necessary to provide equal access to the processor and bounded response from the application.

The assumptions hold for the described scheduling policy only if no system overhead, such as interrupt servicing, is present. If the interrupt servicing load is non-zero, then one of the two assumptions becomes fallacious, based upon how Q is measured by the system.

If Q is measured by clock time, then the assumption that the process obtains a duration Q of processor time is false if interrupt overhead exists. Indeed, a scenario can be constructed with N processes in which a single process undergoes complete processor starvation if a peripheral device, such as an analog-to-digital converter, generates significant interrupt activity periodically with a period of N × Q.

If Q is measured as actual processor time, then the assumption that the Nth process runs within the duration (N-1) × Q is false.

It should be noted that SCHED_FIFO suffers from interrupt-based delay as well. However, for SCHED_FIFO, the implied response of the system is "as soon as possible", so that the interrupt load for this case is a vendor selection and not a compliance issue.

With this in mind, it is necessary either to complete the definition by including bounds on the interrupt load, or to modify the assumptions that can be made about the scheduling policy.

Since the motivation of inclusion of the policy is common usage, and since current applications do not enjoy the luxury of bounded interrupt load, item (2) above is sufficient to express existing application needs and is less restrictive in the standard definition. No difference in interface is necessary.

In an implementation in which the time quantum is equal for all processes at a specific priority, our assumptions can then be restated as:
- A time quantum Q exists, and a processor-bound process will be rescheduled after a duration of, at most, Q. Time quantum Q may be defined in either wall clock time or execution time.
- In general, the Nth process of a priority level should wait no longer than (N-1) × Q time to execute, assuming no processes exist at higher priority levels.
- No process should wait indefinitely.
For implementations supporting per-process time quanta, these assumptions can be readily extended.
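
The restated (N-1) × Q bound can be checked with a small model of a time-quantum scheduler. The sketch below assumes a single priority level, no system overhead such as interrupt servicing, and processes that are all runnable at time zero; the name rr_max_wait() and the limit RR_MAX are hypothetical:

```c
#define RR_MAX 16   /* largest N handled by the model (assumption) */

/*
 * Model N processor-bound processes sharing one processor with time
 * quantum q. work[i] is the total execution time process i needs.
 * Returns the longest time any process spent waiting between becoming
 * ready (at time 0, or at the end of its previous quantum) and being
 * dispatched, or -1 if the arguments are out of range.
 */
long rr_max_wait(int n, long q, const long *work)
{
    long remaining[RR_MAX], ready_since[RR_MAX];
    long now = 0, max_wait = 0;
    int left = 0;

    if (n < 1 || n > RR_MAX || q <= 0)
        return -1;
    for (int i = 0; i < n; i++) {
        remaining[i] = work[i];
        ready_since[i] = 0;
        if (work[i] > 0)
            left++;
    }
    /* Visiting processes in circular order models requeueing at the tail. */
    for (int i = 0; left > 0; i = (i + 1) % n) {
        long waited, slice;

        if (remaining[i] <= 0)
            continue;                        /* already finished */
        waited = now - ready_since[i];
        if (waited > max_wait)
            max_wait = waited;
        slice = remaining[i] < q ? remaining[i] : q;
        now += slice;                        /* process runs for its slice */
        remaining[i] -= slice;
        ready_since[i] = now;                /* back to the tail of the queue */
        if (remaining[i] == 0)
            left--;
    }
    return max_wait;
}
```

For four processes that each need 100 time units and a quantum of 10, the model reports a longest wait of 30 units, which equals (N-1) × Q. Once interrupt overhead is introduced, the model's assumptions no longer hold, as discussed above.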

Sporadic Server Scheduling Policy

The sporadic server is a mechanism defined for scheduling aperiodic activities in time-critical realtime systems. This mechanism reserves a certain bounded amount of execution capacity for processing aperiodic events at a high priority level. Any aperiodic events that cannot be processed within the bounded amount of execution capacity are executed in the background at a low priority level. Thus, a certain amount of execution capacity can be guaranteed to be available for processing periodic tasks, even under burst conditions in the arrival of aperiodic processing requests (that is, a large number of requests in a short time interval). The sporadic server also simplifies the schedulability analysis of the realtime system, because it allows aperiodic processes or threads to be treated as if they were periodic. The sporadic server was first described by Sprunt, et al.

The key concept of the sporadic server is to provide and limit a certain amount of computation capacity for processing aperiodic events at their assigned normal priority, during a time interval called the "replenishment period". Once the entity controlled by the sporadic server mechanism is initialized with its period and execution-time budget attributes, it preserves its execution capacity until an aperiodic request arrives. The request will be serviced (if there are no higher priority activities pending) as long as there is execution capacity left. If the request is completed, the actual execution time used to service it is subtracted from the capacity, and a replenishment of this amount of execution time is scheduled to happen one replenishment period after the arrival of the aperiodic request. If the request is not completed, because there is no execution capacity left, then the aperiodic process or thread is assigned a lower background priority. For each portion of consumed execution capacity the execution time used is replenished after one replenishment period. At the time of replenishment, if the sporadic server was executing at a background priority level, its priority is elevated to the normal level. Other similar replenishment policies have been defined, but the one presented here represents a compromise between efficiency and implementation complexity.
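
The capacity accounting described above might be modeled as follows. This is a simplified sketch of the mechanism, not the scheduling interface itself: the names struct sporadic_server, ss_init(), ss_priority(), ss_consume(), and ss_advance() are hypothetical, time is an abstract count of units, and a request that arrives when the table of pending replenishments is full is simply not served at normal priority:

```c
#define SS_MAX_REPL 8   /* maximum pending replenishments in the model (assumption) */

struct ss_repl {
    long when;     /* time at which the capacity is given back */
    long amount;   /* execution time to give back */
};

struct sporadic_server {
    long period;       /* replenishment period */
    long capacity;     /* execution capacity currently available */
    int normal_prio;   /* priority while capacity remains */
    int low_prio;      /* background priority when capacity is exhausted */
    struct ss_repl pending[SS_MAX_REPL];
    int npending;
};

void ss_init(struct sporadic_server *s, long period, long budget,
             int normal_prio, int low_prio)
{
    s->period = period;
    s->capacity = budget;
    s->normal_prio = normal_prio;
    s->low_prio = low_prio;
    s->npending = 0;
}

/* Priority at which the served entity currently executes. */
int ss_priority(const struct sporadic_server *s)
{
    return s->capacity > 0 ? s->normal_prio : s->low_prio;
}

/*
 * Serve a request that arrived at time `arrival` and needs `wanted`
 * units. Returns the units executed at normal priority; the consumed
 * amount is scheduled for replenishment one period after the arrival.
 */
long ss_consume(struct sporadic_server *s, long arrival, long wanted)
{
    long used = wanted < s->capacity ? wanted : s->capacity;

    if (used <= 0 || s->npending == SS_MAX_REPL)
        return 0;
    s->capacity -= used;
    s->pending[s->npending].when = arrival + s->period;
    s->pending[s->npending].amount = used;
    s->npending++;
    return used;
}

/* Apply every replenishment that is due at or before `now`. */
void ss_advance(struct sporadic_server *s, long now)
{
    int kept = 0;

    for (int i = 0; i < s->npending; i++) {
        if (s->pending[i].when <= now)
            s->capacity += s->pending[i].amount;
        else
            s->pending[kept++] = s->pending[i];
    }
    s->npending = kept;
}
```

With a period of 100 and a budget of 30, a request at time 0 that uses 10 units is replenished at time 100, and a request at time 50 that exhausts the remaining 20 units drops the server to its background priority until the first replenishment arrives.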

The interface that appears in this section defines a new scheduling policy for threads and processes that behaves according to the rules of the sporadic server mechanism. Scheduling attributes are defined and functions are provided to allow the user to set and get the parameters that control the scheduling behavior of this mechanism, namely the normal and low priority, the replenishment period, the maximum number of pending replenishment operations, and the initial execution-time budget.

Scheduling Aperiodic Activities

Virtually all realtime applications are required to process aperiodic activities. In many cases, there are tight timing constraints that the response to the aperiodic events must meet. Usual timing requirements imposed on the response to these events are:
- The effects of an aperiodic activity on the response time of lower priority activities must be controllable and predictable.
- The system must provide the fastest possible response time to aperiodic events.
- It must be possible to take advantage of all the available processing bandwidth not needed by time-critical activities to enhance average-case response times to aperiodic events.
Traditional methods for scheduling aperiodic activities are background processing, polling tasks, and direct event execution:
- Background processing consists of assigning a very low priority to the processing of aperiodic events. It utilizes all the available bandwidth in the system that has not been consumed by higher priority threads. However, it is very difficult, or impossible, to meet requirements on average-case response time, because the aperiodic entity has to wait for the execution of all other entities which have higher priority.
- Polling consists of creating a periodic process or thread for servicing aperiodic requests. At regular intervals, the polling entity is started and it services accumulated pending aperiodic requests. If no aperiodic requests are pending, the polling entity suspends itself until its next period. Polling allows the aperiodic requests to be processed at a higher priority level. However, worst and average-case response times of polling entities are a direct function of the polling period, and there is execution overhead for each polling period, even if no event has arrived. If the deadline of the aperiodic activity is short compared to the inter-arrival time, the polling frequency must be increased to guarantee meeting the deadline. For this case, the increase in frequency can dramatically reduce the efficiency of the system and, therefore, its capacity to meet all deadlines. Yet, polling represents a good way to handle a large class of practical problems because it preserves system predictability, and because the amortized overhead drops as load increases.
- Direct event execution consists of executing the aperiodic events at a high fixed-priority level. Typically, the aperiodic event is processed by an interrupt service routine as soon as it arrives. This technique provides predictable response times for aperiodic events, but makes the response times of all lower priority activities completely unpredictable under burst arrival conditions. Therefore, if the density of aperiodic event arrivals is unbounded, it may be a dangerous technique for time-critical systems. Yet, for those cases in which the physics of the system imposes a bound on the event arrival rate, it is probably the most efficient technique.
The sporadic server scheduling algorithm combines the predictability of the polling approach with the short response times of the direct event execution. Thus, it allows systems to meet an important class of application requirements that cannot be met by using the traditional approaches. Multiple sporadic servers with different attributes can be applied to the scheduling of multiple classes of aperiodic events, each with different kinds of timing requirements, such as individual deadlines, average response times, and so on. It also has many other interesting applications for realtime, such as scheduling producer/consumer tasks in time-critical systems, limiting the effects of faults on the estimation of task execution-time requirements, and so on.
Existing Practice

The sporadic server has been used in different kinds of applications, including military avionics, robot control systems, industrial automation systems, and so on. There are examples of many systems that cannot be successfully scheduled using the classic approaches, such as direct event execution, or polling, and are schedulable using a sporadic server scheduler. The sporadic server algorithm itself can successfully schedule all systems scheduled with direct event execution or polling.

The sporadic server scheduling policy has been implemented as a commercial product in the run-time system of the Verdix Ada compiler. There are also many applications that have used a much less efficient application-level sporadic server. These realtime applications would benefit from a sporadic server scheduler implemented at the scheduler level.
Library-Level versus Kernel-Level Implementation

The sporadic server interface described in this section requires the sporadic server policy to be implemented at the same level as the scheduler. This means that the process sporadic server must be implemented at the kernel level and the thread sporadic server policy implemented at the same level as the thread scheduler; that is, kernel or library level.

In an earlier interface for the sporadic server, this mechanism was implementable at a different level than the scheduler. This feature allowed the implementor to choose between an efficient scheduler-level implementation, or a simpler user or library-level implementation. However, the working group considered that this interface made the use of sporadic servers more complex, and that library-level implementations would lack some of the important functionality of the sporadic server, namely the limitation of the actual execution time of aperiodic activities. The working group also felt that the interface described in this chapter does not preclude library-level implementations of threads intended to provide efficient low-overhead scheduling for those threads that are not scheduled under the sporadic server policy.
Range of Scheduling Priorities

Each of the scheduling policies supported in IEEE Std 1003.1-2001 has an associated range of priorities. The priority ranges for each policy might or might not overlap with the priority ranges of other policies. For time-critical realtime applications it is usual for periodic and aperiodic activities to be scheduled together in the same processor. Periodic activities will usually be scheduled using the SCHED_FIFO scheduling policy, while aperiodic activities may be scheduled using SCHED_SPORADIC. Since the application developer will require complete control over the relative priorities of these activities in order to meet the application's timing requirements, it would be desirable for the priority ranges of SCHED_FIFO and SCHED_SPORADIC to overlap completely. Therefore, although IEEE Std 1003.1-2001 does not require any particular relationship between the different priority ranges, it is recommended that these two ranges should coincide.
Dynamically Setting the Sporadic Server Policy

Several members of the working group requested that implementations should not be required to support dynamically setting the sporadic server scheduling policy for a thread. The reason is that this policy may have a high overhead for library-level implementations of threads, and if threads are allowed to dynamically set this policy, this overhead can be experienced even if the thread does not use that policy. By disallowing the dynamic setting of the sporadic server scheduling policy, these implementations can accomplish efficient scheduling for threads using other policies. If a strictly conforming application needs to use the sporadic server policy, and is therefore willing to pay the overhead, it must set this policy at the time of thread creation.
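
Setting the policy at thread creation time might look like the following sketch. It assumes an implementation that supports the Thread Sporadic Server option and falls back to returning ENOTSUP elsewhere. The helper names create_sporadic_thread() and handle_events(), and the chosen priorities, the 100 ms period, and the 20 ms budget are illustrative assumptions, not values taken from the standard.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical thread body that would service aperiodic events. */
void *handle_events(void *arg)
{
    return arg;
}

/* Create a thread that runs under SCHED_SPORADIC from the start.
 * Returns 0, an error number, or ENOTSUP if the option is unavailable. */
int create_sporadic_thread(pthread_t *tid, void *(*fn)(void *), void *arg)
{
#if defined(_POSIX_THREAD_SPORADIC_SERVER) && \
    _POSIX_THREAD_SPORADIC_SERVER > 0 && defined(SCHED_SPORADIC)
    pthread_attr_t attr;
    struct sched_param sp;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    memset(&sp, 0, sizeof sp);
    sp.sched_priority = sched_get_priority_max(SCHED_SPORADIC);        /* normal */
    sp.sched_ss_low_priority = sched_get_priority_min(SCHED_SPORADIC); /* background */
    sp.sched_ss_repl_period.tv_nsec = 100000000L;  /* 100 ms, illustrative */
    sp.sched_ss_init_budget.tv_nsec = 20000000L;   /* 20 ms, illustrative */
    sp.sched_ss_max_repl = 4;                      /* pending replenishments */

    /* Use the attributes below instead of inheriting the creator's policy. */
    rc = pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    if (rc == 0)
        rc = pthread_attr_setschedpolicy(&attr, SCHED_SPORADIC);
    if (rc == 0)
        rc = pthread_attr_setschedparam(&attr, &sp);
    if (rc == 0)
        rc = pthread_create(tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc;
#else
    (void)tid; (void)fn; (void)arg;
    return ENOTSUP;   /* sporadic server policy not offered here */
#endif
}
```

On a supporting system, creating the thread may additionally require appropriate privileges, so a caller should be prepared for errors such as EPERM.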
Limitation of the Number of Pending Replenishments

The number of simultaneously pending replenishment operations must be limited for each sporadic server for two reasons. First, an unlimited number of replenishment operations would need an unlimited amount of system resources to store all the pending replenishment operations. Second, in some implementations each replenishment operation will represent a source of priority inversion (just for the duration of the replenishment operation), and thus the maximum number of replenishments must be bounded to guarantee bounded response times. The number of replenishments is bounded by lowering the priority of the sporadic server to sched_ss_low_priority when the number of pending replenishments has reached its limit. In this way, no new replenishments are scheduled until the number of pending replenishments decreases.

In the sporadic server scheduling policy defined in IEEE Std 1003.1-2001, the application can specify the maximum number of pending replenishment operations for a single sporadic server, by setting the value of the sched_ss_max_repl scheduling parameter. This value must be between one and {SS_REPL_MAX}, which is a maximum limit imposed by the implementation. The limit {SS_REPL_MAX} must be greater than or equal to {_POSIX_SS_REPL_MAX}, which is defined to be four in IEEE Std 1003.1-2001. The minimum limit of four was chosen so that an application can at least guarantee that four different aperiodic events can be processed during each interval of length equal to the replenishment period.

Clocks and Timers

Clocks

IEEE Std 1003.1-2001 and the ISO C standard both define functions for obtaining system time. Implicit behind these functions is a mechanism for measuring passage of time. This specification makes this mechanism explicit and calls it a clock. The CLOCK_REALTIME clock required by IEEE Std 1003.1-2001 is a higher resolution version of the clock that maintains POSIX.1 system time. This is a "system-wide" clock, in that it is visible to all processes and, were it possible for multiple processes to all read the clock at the same time, they would see the same value.

An extensible interface was defined, with the ability for implementations to define additional clocks. This was done because of the observation that many realtime platforms support multiple clocks, and it was desired to fit this model within the standard interface. But implementation-defined clocks need not represent actual hardware devices, nor are they necessarily system-wide.
Timers

Two timer types are required for a system to support realtime applications:
1. One-shot
  
  A one-shot timer is a timer that is armed with an initial expiration time, either relative to the current time or at an absolute time (based on some timing base, such as time in seconds and nanoseconds since the Epoch). The timer expires once and then is disarmed. With the specified facilities, this is accomplished by setting the it_value member of the value argument to the desired expiration time and the it_interval member to zero.
2. Periodic
  
  A periodic timer is a timer that is armed with an initial expiration time, again either relative or absolute, and a repetition interval. When the initial expiration occurs, the timer is reloaded with the repetition interval and continues counting. With the specified facilities, this is accomplished by setting the it_value member of the value argument to the desired initial expiration time and the it_interval member to the desired repetition interval.
For both of these types of timers, the time of the initial timer expiration can be specified in two ways:
1. Relative (to the current time)
2. Absolute
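
The two timer types differ only in how the value argument is filled in, as the following minimal sketch suggests. The helper name make_timer_value() is hypothetical, and whole seconds are assumed for brevity.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Build the value argument for timer_settime().
 * interval_sec == 0 yields a one-shot timer; a nonzero value yields a
 * periodic timer. Whether first_sec is relative or absolute is decided
 * separately, by the TIMER_ABSTIME flag passed to timer_settime(). */
struct itimerspec make_timer_value(time_t first_sec, time_t interval_sec)
{
    struct itimerspec v;
    v.it_value.tv_sec = first_sec;        /* initial expiration */
    v.it_value.tv_nsec = 0;
    v.it_interval.tv_sec = interval_sec;  /* reload value after each expiration */
    v.it_interval.tv_nsec = 0;
    return v;
}
```

Note that an it_value of zero disarms the timer, so a one-shot timer still needs a nonzero initial expiration.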
Examples of Using Realtime Timers

In the diagrams below, S indicates a program schedule, R shows a schedule method request, and E suggests an internal operating system event.
- Periodic Timer: Data Logging
  
  During an experiment, it might be necessary to log realtime data periodically to an internal buffer or to a mass storage device. With a periodic scheduling method, a logging module can be started automatically at fixed time intervals to log the data.
  
  Program schedule is requested every 10 seconds.
```
   R         S         S         S         S         S
----+----+----+----+----+----+----+----+----+----+----+--->
    5   10   15   20   25   30   35   40   45   50   55
```
  [Time (in Seconds)]
  
  To achieve this type of scheduling using the specified facilities, one would allocate a per-process timer based on clock ID CLOCK_REALTIME. Then the timer would be armed via a call to timer_settime() with the TIMER_ABSTIME flag reset, and with an initial expiration value and a repetition interval of 10 seconds.
- One-shot Timer (Relative Time): Device Initialization
  
  In an emission test environment, large sample bags are used to capture the exhaust from a vehicle. The exhaust is purged from these bags before each and every test. With a one-shot timer, a module could initiate the purge function and then suspend itself for a predetermined period of time while the sample bags are prepared.
  
  Program schedule requested 20 seconds after call is issued.
```
   R                   S
----+----+----+----+----+----+----+----+----+----+----+--->
    5   10   15   20   25   30   35   40   45   50   55
```
  [Time (in Seconds)]
  
  To achieve this type of scheduling using the specified facilities, one would allocate a per-process timer based on clock ID CLOCK_REALTIME. Then the timer would be armed via a call to timer_settime() with the TIMER_ABSTIME flag reset, and with an initial expiration value of 20 seconds and a repetition interval of zero.
  
  Note that if the program wishes merely to suspend itself for the specified interval, it could more easily use nanosleep().
- One-shot Timer (Absolute Time): Data Transmission
  
  The results from an experiment are often moved to a different system within a network for postprocessing or archiving. With an absolute one-shot timer, a module that moves data from a test-cell computer to a host computer can be automatically scheduled on a daily basis.
  
  Program schedule requested for 2:30 a.m.
```
        R                                     S
-----+-----+-----+-----+-----+-----+-----+-----+-----+----->
   23:00 23:30 24:00 00:30 01:00 01:30 02:00 02:30 03:00
```
  [Time of Day]
  
  To achieve this type of scheduling using the specified facilities, a per-process timer would be allocated based on clock ID CLOCK_REALTIME. Then the timer would be armed via a call to timer_settime() with the TIMER_ABSTIME flag set, and an initial expiration value equal to 2:30 a.m. of the next day.
- Periodic Timer (Relative Time): Signal Stabilization
  
  Some measurement devices, such as emission analyzers, do not respond instantaneously to an introduced sample. With a periodic timer with a relative initial expiration time, a module that introduces a sample and records the average response could suspend itself for a predetermined period of time while the signal is stabilized and then sample at a fixed rate.
  
  Program schedule requested 15 seconds after call is issued and every 2 seconds thereafter.
```
  R              S S S S S S S S S S S S S S S S S S S S
----+----+----+----+----+----+----+----+----+----+----+--->
    5   10   15   20   25   30   35   40   45   50   55
```
  [Time (in Seconds)]
  
  To achieve this type of scheduling using the specified facilities, one would allocate a per-process timer based on clock ID CLOCK_REALTIME. Then the timer would be armed via a call to timer_settime() with TIMER_ABSTIME flag reset, and with an initial expiration value of 15 seconds and a repetition interval of 2 seconds.
- Periodic Timer (Absolute Time): Work Shift-related Processing
  
  Resource utilization data is useful when time to perform experiments is being scheduled at a facility. With a periodic timer with an absolute initial expiration time, a module can be scheduled at the beginning of a work shift to gather resource utilization data throughout the shift. This data can be used to allocate resources effectively to minimize bottlenecks and delays and maximize facility throughput.
  
  Program schedule requested for 2:00 a.m. and every 15 minutes thereafter.
```
        R                               S  S  S  S  S  S
-----+-----+-----+-----+-----+-----+-----+-----+-----+----->
   23:00 23:30 24:00 00:30 01:00 01:30 02:00 02:30 03:00
```
  [Time of Day]
  
  To achieve this type of scheduling using the specified facilities, one would allocate a per-process timer based on clock ID CLOCK_REALTIME. Then the timer would be armed via a call to timer_settime() with TIMER_ABSTIME flag set, and with an initial expiration value equal to 2:00 a.m. and a repetition interval equal to 15 minutes.
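
The examples above could be realized along the following lines. This is a sketch under stated assumptions: the helper names arm_timer() and next_time_of_day() are hypothetical, notification by SIGALRM is an arbitrary choice, and the application is assumed to have installed a handler for that signal before the timer expires.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <time.h>

/* Next local-time occurrence of hour:min strictly after `now`. */
time_t next_time_of_day(time_t now, int hour, int min)
{
    struct tm tm;
    localtime_r(&now, &tm);
    tm.tm_hour = hour;
    tm.tm_min = min;
    tm.tm_sec = 0;
    tm.tm_isdst = -1;        /* let mktime() work out daylight saving */
    time_t t = mktime(&tm);
    if (t <= now) {          /* already passed today: use tomorrow */
        tm.tm_mday += 1;     /* mktime() normalizes the day overflow */
        tm.tm_isdst = -1;
        t = mktime(&tm);
    }
    return t;
}

/* Create a per-process timer on CLOCK_REALTIME and arm it.
 * flags is 0 for a relative first expiration or TIMER_ABSTIME for an
 * absolute one; interval_sec == 0 gives a one-shot timer. */
int arm_timer(timer_t *id, int flags, time_t first_sec, time_t interval_sec)
{
    struct sigevent sev;
    struct itimerspec v;

    memset(&sev, 0, sizeof sev);
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGALRM;           /* illustrative choice of signal */
    if (timer_create(CLOCK_REALTIME, &sev, id) == -1)
        return -1;

    memset(&v, 0, sizeof v);
    v.it_value.tv_sec = first_sec;
    v.it_interval.tv_sec = interval_sec;
    if (timer_settime(*id, flags, &v, NULL) == -1) {
        timer_delete(*id);
        return -1;
    }
    return 0;
}
```

Under these assumptions, the signal stabilization example corresponds to arm_timer(&id, 0, 15, 2), and the work shift example to arm_timer(&id, TIMER_ABSTIME, next_time_of_day(time(NULL), 2, 0), 15 * 60).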
Relationship of Timers to Clocks

The relationship between clocks and timers armed with an absolute time is straightforward: a timer expiration signal is requested when the associated clock reaches or exceeds the specified time. The relationship between clocks and timers armed with a relative time (an interval) is less obvious, but not unintuitive. In this case, a timer expiration signal is requested when the specified interval, as measured by the associated clock, has passed. For the required CLOCK_REALTIME clock, this allows timer expiration signals to be requested at specified "wall clock" times (absolute), or when a specified interval of "realtime" has passed (relative). For an implementation-defined clock (say, a process virtual time clock), timer expirations could be requested when the process has used a specified total amount of virtual time (absolute), or when it has used a specified additional amount of virtual time (relative).

The interfaces also allow flexibility in the implementation of the functions. For example, an implementation could convert all absolute times to intervals by subtracting the clock value at the time of the call from the requested expiration time and "counting down" at the supported resolution. Or it could convert all relative times to absolute expiration time by adding in the clock value at the time of the call and comparing the clock value to the expiration time at the supported resolution. Or it might even choose to maintain absolute times as absolute and compare them to the clock value at the supported resolution for absolute timers, and maintain relative times as intervals and count them down at the resolution supported for relative timers. The choice will be driven by efficiency considerations and the underlying hardware or software clock implementation.
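
The two conversions mentioned above amount to a subtraction and an addition on timespec values. The following sketch illustrates them; the function names interval_until() and expiry_after() are hypothetical, and both arguments are assumed to hold valid, normalized times.

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Interval left until the absolute time `expiry`, given the clock value
 * `now`. Returns a zero interval if the expiration time has been reached. */
struct timespec interval_until(struct timespec expiry, struct timespec now)
{
    struct timespec d = { .tv_sec = 0, .tv_nsec = 0 };
    if (expiry.tv_sec < now.tv_sec ||
        (expiry.tv_sec == now.tv_sec && expiry.tv_nsec <= now.tv_nsec))
        return d;
    d.tv_sec = expiry.tv_sec - now.tv_sec;
    d.tv_nsec = expiry.tv_nsec - now.tv_nsec;
    if (d.tv_nsec < 0) {            /* borrow one second */
        d.tv_sec -= 1;
        d.tv_nsec += NSEC_PER_SEC;
    }
    return d;
}

/* Absolute expiration time for the interval `rel` requested at `now`. */
struct timespec expiry_after(struct timespec now, struct timespec rel)
{
    struct timespec e;
    e.tv_sec = now.tv_sec + rel.tv_sec;
    e.tv_nsec = now.tv_nsec + rel.tv_nsec;
    if (e.tv_nsec >= NSEC_PER_SEC) {   /* carry one second */
        e.tv_sec += 1;
        e.tv_nsec -= NSEC_PER_SEC;
    }
    return e;
}
```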
Data Definitions for Clocks and Timers

IEEE Std 1003.1-2001 uses a time representation capable of supporting nanosecond resolution timers for the following reasons:
- To enable IEEE Std 1003.1-2001 to represent those computer systems already using nanosecond or submicrosecond resolution clocks.
- To accommodate those per-process timers that might need nanoseconds to specify an absolute value of system-wide clocks, even though the resolution of the per-process timer may only be milliseconds, or vice versa.
- Because the number of nanoseconds in a second can be represented in 32 bits.
Time values are represented in the timespec structure. The tv_sec member is of type time_t so that this member is compatible with time values used by POSIX.1 functions and the ISO C standard. The tv_nsec member is a signed long in order to simplify and clarify code that decrements or finds differences of time values. Note that because 1 billion (number of nanoseconds per second) is less than half of the value representable by a signed 32-bit value, it is always possible to add two valid fractional seconds represented as integral nanoseconds without overflowing the signed 32-bit value.
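
The overflow argument can be checked mechanically. The sketch below assumes a C11 compiler for _Static_assert; the name add_nsec32() is hypothetical and stands in for the fractional part of a timespec addition carried out with 32-bit arithmetic only.

```c
#include <stdint.h>

#define NSEC_PER_SEC INT32_C(1000000000)

/* The largest valid nanosecond field is NSEC_PER_SEC - 1, and twice that
 * value (1999999998) is still below INT32_MAX (2147483647). */
_Static_assert(2 * (int64_t)(NSEC_PER_SEC - 1) <= INT32_MAX,
               "sum of two valid nanosecond fields fits in 32 signed bits");

/* Add two valid nanosecond fields. Returns the normalized nanoseconds and
 * stores the carry into the seconds field (0 or 1) in *carry_sec. */
int32_t add_nsec32(int32_t a, int32_t b, int32_t *carry_sec)
{
    int32_t sum = a + b;   /* cannot overflow for 0 <= a, b < NSEC_PER_SEC */
    *carry_sec = (sum >= NSEC_PER_SEC) ? 1 : 0;
    return *carry_sec ? sum - NSEC_PER_SEC : sum;
}
```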

A maximum allowable resolution for the CLOCK_REALTIME clock of 20 ms (1/50 seconds) was chosen to allow line frequency clocks in European countries to be conforming. 60 Hz clocks in the U.S. will also be conforming, as will finer granularity clocks, although a Strictly Conforming Application cannot assume a granularity of less than 20 ms (1/50 seconds).

The smallest value that an implementation may impose as the maximum time for the CLOCK_REALTIME clock, for the nanosleep() function, and for timers created with clock_id=CLOCK_REALTIME is determined by the fact that the tv_sec member is of type time_t.

IEEE Std 1003.1-2001 specifies that timer expirations must not be delivered early, and nanosleep() must not return early due to quantization error. IEEE Std 1003.1-2001 discusses the various implementations of alarm() in the rationale and states that implementations that do not allow alarm signals to occur early are the most appropriate, but refrained from mandating this behavior. Because of the importance of predictability to realtime applications, IEEE Std 1003.1-2001 takes a stronger stance.

The developers of IEEE Std 1003.1-2001 considered using a time representation that differs from POSIX.1b in the second 32 bits of the 64-bit value. Whereas POSIX.1b defines this field as a fractional second in nanoseconds, the other methodology defines this as a binary fraction of one second, with the radix point assumed before the most significant bit.

POSIX.1b is a software, source-level standard and most of the benefits of the alternate representation are enjoyed by hardware implementations of clocks and algorithms. It was felt that mandating this format for POSIX.1b clocks and timers would unnecessarily burden the application writer with writing, possibly non-portable, multiple precision arithmetic packages to perform conversion between binary fractions and integral units such as nanoseconds, milliseconds, and so on.
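
For illustration, the conversion between the two representations can be sketched as follows. The sketch assumes that a 64-bit integer type (uint64_t) is available, which removes the need for a multiple precision package on such platforms; the function names are hypothetical, and both conversions truncate.

```c
#include <stdint.h>

/* 32-bit binary fraction of a second (radix point before the most
 * significant bit) to nanoseconds: frac / 2^32 * 10^9. */
uint32_t frac_to_nsec(uint32_t frac)
{
    return (uint32_t)(((uint64_t)frac * 1000000000u) >> 32);
}

/* Nanoseconds (0 to 999999999) to a 32-bit binary fraction of a second:
 * nsec / 10^9 * 2^32. */
uint32_t nsec_to_frac(uint32_t nsec)
{
    return (uint32_t)(((uint64_t)nsec << 32) / 1000000000u);
}
```

Because neither 10^9 nor 2^32 divides the other, a round trip through both functions is not exact in general.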

Rationale for the Monotonic Clock

For those applications that use time services to achieve realtime behavior, changing the value of the clock on which these services rely may cause erroneous timing behavior. For these applications, it is necessary to have a monotonic clock which cannot run backwards, and which has a maximum clock jump that is required to be documented by the implementation. Additionally, it is desirable (but not required by IEEE Std 1003.1-2001) that the monotonic clock increases its value uniformly. This clock should not be affected by changes to the system time; for example, to synchronize the clock with an external source or to account for leap seconds. Such changes would cause errors in the measurement of time intervals for those time services that use the absolute value of the clock.

One could argue that by defining the behavior of time services when the value of a clock is changed, deterministic realtime behavior can be achieved. For example, one could specify that relative time services should be unaffected by changes in the value of a clock. However, there are time services that are based upon an absolute time, but that are essentially intended as relative time services. For example, pthread_cond_timedwait() uses an absolute time to allow it to wake up after the required interval despite spurious wakeups. Although sometimes the pthread_cond_timedwait() timeouts are absolute in nature, there are many occasions in which they are relative, and their absolute value is determined from the current time plus a relative time interval. In this latter case, if the clock changes while the thread is waiting, the wait interval will not be the expected length. If a pthread_cond_timedwait() function were created that would take a relative time, it would not solve the problem because to retain the intended "deadline" a thread would need to compensate for latency due to the spurious wakeup, and preemption between wakeup and the next wait.

The solution is to create a new monotonic clock, whose value does not change except for the regular ticking of the clock, and use this clock for implementing the various relative timeouts that appear in the different POSIX interfaces, as well as allow pthread_cond_timedwait() to choose this new clock for its timeout. A new clock_nanosleep() function is created to allow an application to take advantage of this newly defined clock. Notice that the monotonic clock may be implemented using the same hardware clock as the system clock.

Relative timeouts for sigtimedwait() and aio_suspend() have been redefined to use the monotonic clock, if present. The alarm() function has not been redefined, because the same effect but with better resolution can be achieved by creating a timer (for which the appropriate clock may be chosen).

The pthread_cond_timedwait() function has been treated in a different way, compared to other functions with absolute timeouts, because it is used to wait for an event, and thus it may have a deadline, while the other timeouts are generally used as an error recovery mechanism, and for them the use of the monotonic clock is not so important. Since the desired timeout for the pthread_cond_timedwait() function may either be a relative interval or an absolute time of day deadline, a new initialization attribute has been created for condition variables to specify the clock that is used for measuring the timeout in a call to pthread_cond_timedwait(). In this way, if a relative timeout is desired, the monotonic clock will be used; if an absolute deadline is required instead, the CLOCK_REALTIME or another appropriate clock may be used. This capability has not been added to other functions with absolute timeouts because for those functions the expected use of the timeout is mostly to prevent errors, and not so often to meet precise deadlines. As a consequence, the complexity of adding this capability is not justified by its perceived application usage.
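
A relative timeout on a condition variable might then be obtained as in the following sketch, which assumes that the Monotonic Clock option is supported. The type and function names (struct event, event_init(), event_wait()) are hypothetical, and error handling is abbreviated.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

struct event {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool ready;
};

/* Initialize the condition variable so that its timeouts are measured
 * against CLOCK_MONOTONIC rather than CLOCK_REALTIME. */
int event_init(struct event *ev)
{
    pthread_condattr_t attr;
    int rc = pthread_condattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    if (rc == 0)
        rc = pthread_cond_init(&ev->cond, &attr);
    pthread_condattr_destroy(&attr);
    if (rc != 0)
        return rc;
    ev->ready = false;
    return pthread_mutex_init(&ev->lock, NULL);
}

/* Wait at most timeout_ms for ev->ready. Returns 0 or ETIMEDOUT. */
int event_wait(struct event *ev, long timeout_ms)
{
    struct timespec deadline;
    int rc = 0;

    /* Compute the deadline once, so that spurious wakeups do not
     * lengthen the overall wait. */
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&ev->lock);
    while (!ev->ready && rc == 0)
        rc = pthread_cond_timedwait(&ev->cond, &ev->lock, &deadline);
    if (ev->ready)
        rc = 0;
    pthread_mutex_unlock(&ev->lock);
    return rc;
}
```

Since the deadline is taken from the monotonic clock, setting the system time while a thread is blocked in event_wait() does not alter the length of the wait.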

The nanosleep() function has not been modified with the introduction of the monotonic clock. Instead, a new clock_nanosleep() function has been created, in which the desired clock may be specified in the function call.
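
One possible use of clock_nanosleep() is a periodic loop that does not accumulate drift. The sketch below assumes that the Monotonic Clock option is supported; the name wait_next_period() is hypothetical, and period_ns is assumed to be less than one second.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Advance *next by period_ns and sleep until that absolute time on
 * CLOCK_MONOTONIC. Returns 0 or an error number. */
int wait_next_period(struct timespec *next, long period_ns)
{
    int rc;

    next->tv_nsec += period_ns;
    while (next->tv_nsec >= 1000000000L) {
        next->tv_sec += 1;
        next->tv_nsec -= 1000000000L;
    }
    do {
        /* With an absolute time, an interrupted sleep is simply retried. */
        rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, next, NULL);
    } while (rc == EINTR);
    return rc;
}
```

A caller would read CLOCK_MONOTONIC once into next before the loop and then call wait_next_period() at the end of each iteration, so that each wakeup time is derived from the previous one rather than from the current time.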

History of Resolution Issues

Due to the shift from relative to absolute timeouts in IEEE Std 1003.1d-1999, the amendments to the sem_timedwait(), pthread_mutex_timedlock(), mq_timedreceive(), and mq_timedsend() functions of that standard have been removed. Those amendments specified that CLOCK_MONOTONIC would be used for the (relative) timeouts if the Monotonic Clock option was supported.

Having these functions continue to be tied solely to CLOCK_MONOTONIC would not work. Since the absolute value of a time value obtained from CLOCK_MONOTONIC is unspecified, under the absolute timeouts interface, applications would behave differently depending on whether the Monotonic Clock option was supported or not (because the absolute value of the clock would have different meanings in either case).

Two options were considered:
1. Leave the current behavior unchanged, which specifies the CLOCK_REALTIME clock for these (absolute) timeouts, to allow portability of applications between implementations that support the Monotonic Clock option and those that do not.
2. Modify these functions in the way that pthread_cond_timedwait() was modified to allow a choice of clock, so that an application could use CLOCK_REALTIME when it is trying to achieve an absolute timeout and CLOCK_MONOTONIC when it is trying to achieve a relative timeout.
It was decided that the features of CLOCK_MONOTONIC are not as critical to these functions as they are to pthread_cond_timedwait(). The timeout given to pthread_cond_timedwait() is often derived from a relative interval, and it may represent a deadline for an event. When these other functions are given timeouts derived from relative intervals, the timeouts are typically for error recovery purposes and need not be so precise.

Therefore, it was decided that these functions should be tied to CLOCK_REALTIME and not complicated by being given a choice of clock.

Execution Time Monitoring

Introduction

The main goals of the execution time monitoring facilities defined in this chapter are to measure the execution time of processes and threads and to allow an application to establish CPU time limits for these entities.

The analysis phase of time-critical realtime systems often relies on the measurement of execution times of individual threads or processes to determine whether the timing requirements will be met. Also, performance analysis techniques for soft deadline realtime systems rely heavily on the determination of these execution times. The execution time monitoring functions provide application developers with the ability to measure these execution times online and open the possibility of dynamic execution-time analysis and system reconfiguration, if required.

The second goal of allowing an application to establish execution time limits for individual processes or threads and detecting when they overrun allows program robustness to be increased by enabling online checking of the execution times.

If errors are detected (possibly because of erroneous program constructs, the existence of errors in the analysis phase, or a burst of event arrivals), online detection and recovery is possible in a portable way. This feature can be extremely important for many time-critical applications. Other applications require trapping CPU-time errors as a normal way to exit an algorithm; for instance, some realtime artificial intelligence applications trigger a number of independent inference processes of varying accuracy and speed, limit how long they can run, and pick the best answer available when time runs out. In many periodic systems, overrun processes are simply restarted in the next resource period, after necessary end-of-period actions have been taken. This allows algorithms that are inherently data-dependent to be made predictable.

The interface that appears in this chapter defines a new type of clock, the CPU-time clock, which measures execution time. Each process or thread can invoke the clock and timer functions defined in POSIX.1 to use them. Functions are also provided to access the CPU-time clock of other processes or threads to enable remote monitoring of these clocks. Monitoring of threads of other processes is not supported, since these threads are not visible from outside of their own process with the interfaces defined in POSIX.1.
Execution Time Monitoring Interface

The clock and timer interface defined in POSIX.1 historically only defined one clock, which measures wall-clock time. The requirements for measuring execution time of processes and threads, and setting limits to their execution time by detecting when they overrun, can be accomplished with that interface if a new kind of clock is defined. These new clocks measure execution time, and one is associated with each process and with each thread. The clock functions currently defined in POSIX.1 can be used to read and set these CPU-time clocks, and timers can be created using these clocks as their timing base. These timers can then be used to send a signal when some specified execution time has been exceeded. The CPU-time clocks of each process or thread can be accessed by using the symbols CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID.
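
An execution time limit of this kind might be set up as in the following sketch, which assumes that the Process CPU-Time Clocks option is supported. The names arm_cpu_budget(), on_overrun(), and budget_overrun are hypothetical, and the use of SIGUSR1 is an arbitrary choice for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <time.h>

static volatile sig_atomic_t budget_overrun;   /* set by the handler */

static void on_overrun(int signo)
{
    (void)signo;
    budget_overrun = 1;
}

/* Arm a one-shot timer that raises SIGUSR1 once the calling process has
 * consumed budget_ms further milliseconds of CPU time. Returns 0 or -1. */
int arm_cpu_budget(timer_t *id, long budget_ms)
{
    struct sigaction sa;
    struct sigevent sev;
    struct itimerspec v;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_overrun;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        return -1;

    memset(&sev, 0, sizeof sev);
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGUSR1;
    /* The timing base is the CPU-time clock, not wall-clock time. */
    if (timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, id) == -1)
        return -1;

    memset(&v, 0, sizeof v);                 /* it_interval zero: one-shot */
    v.it_value.tv_sec = budget_ms / 1000;
    v.it_value.tv_nsec = (budget_ms % 1000) * 1000000L;
    if (timer_settime(*id, 0, &v, NULL) == -1) {   /* relative to CPU time used so far */
        timer_delete(*id);
        return -1;
    }
    return 0;
}
```

Because the timer counts execution time, it does not advance while the process is blocked or preempted; the signal arrives only after the budget has actually been consumed.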

The clock and timer interface defined in POSIX.1 and extended with the new kind of CPU-time clock would only allow processes or threads to access their own CPU-time clocks. However, many realtime systems require the possibility of monitoring the execution time of processes or threads from independent monitoring entities. In order to allow applications to construct independent monitoring entities that do not require cooperation from or modification of the monitored entities, two functions have been added: clock_getcpuclockid(), for accessing CPU-time clocks of other processes, and pthread_getcpuclockid(), for accessing CPU-time clocks of other threads. These functions return the clock identifier associated with the process or thread specified in the call. These clock IDs can then be used in the rest of the clock function calls.
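
A monitoring entity could read the CPU time of another process along these lines. This is a minimal sketch that assumes the Process CPU-Time Clocks option is supported and that the caller is permitted to access the target's clock; the helper name process_cpu_time() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <time.h>

/* Read the CPU time consumed so far by process `pid` (0 means the calling
 * process). Returns 0 on success or an error number. */
int process_cpu_time(pid_t pid, struct timespec *out)
{
    clockid_t cid;

    /* clock_getcpuclockid() returns an error number directly. */
    int rc = clock_getcpuclockid(pid, &cid);
    if (rc != 0)
        return rc;
    /* The returned clock ID works with the ordinary clock functions. */
    if (clock_gettime(cid, out) == -1)
        return errno;
    return 0;
}
```

The thread case is analogous: pthread_getcpuclockid() yields a clock ID for a thread of the calling process, which can be passed to clock_gettime() in the same way.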

The clocks accessed through these functions could also be used as a timing base for the creation of timers, thereby allowing independent monitoring entities to limit the CPU time consumed by other entities. However, this possibility would imply additional complexity and overhead because of the need to maintain a timer queue for each process or thread, to store the different expiration times associated with timers created by different processes or threads. The working group decided this additional overhead was not justified by application requirements. Therefore, creation of timers attached to the CPU-time clocks of other processes or threads has been specified as implementation-defined.
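
The portable case, a timer attached to the calling process's own CPU-time clock, might look like the sketch below. The names arm_cpu_budget(), cpu_budget_exceeded(), and on_overrun(), and the choice of SIGUSR1, are assumptions made for this example; an application would select its own signal and recovery action.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <time.h>

static volatile sig_atomic_t budget_exceeded = 0;

/* Signal handler: only records that the budget was overrun. */
static void on_overrun(int signo)
{
    (void)signo;
    budget_exceeded = 1;
}

/* Hypothetical helper: arm a one-shot timer that delivers SIGUSR1 once
 * the calling process has consumed `budget_ms` more milliseconds of
 * CPU time. Returns 0 on success, -1 on failure (errno is set). */
int arm_cpu_budget(long budget_ms, timer_t *timer_out)
{
    struct sigaction sa;
    struct sigevent sev;
    struct itimerspec its;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_overrun;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    memset(&sev, 0, sizeof sev);
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGUSR1;
    if (timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, timer_out) != 0)
        return -1;

    memset(&its, 0, sizeof its);            /* zero it_interval: one-shot */
    its.it_value.tv_sec = budget_ms / 1000;
    its.it_value.tv_nsec = (budget_ms % 1000) * 1000000L;
    return timer_settime(*timer_out, 0, &its, NULL); /* relative to now */
}

/* Returns nonzero once the overrun signal has been delivered. */
int cpu_budget_exceeded(void)
{
    return budget_exceeded != 0;
}
```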
Overhead Considerations

The measurement of execution time may introduce additional overhead in thread scheduling, because of the need to keep track of the time consumed by each thread. In library-level implementations of threads, the efficiency of scheduling could be somewhat compromised because of the need to make a kernel call, at each context switch, to read the process CPU-time clock. Consequently, a thread creation attribute called cpu-clock-requirement was defined, to allow threads to disconnect their respective CPU-time clocks. However, the Ballot Group considered that this attribute itself introduced some overhead, and that in current implementations it was not worth the effort. Therefore, the attribute was deleted, and thus thread CPU-time clocks are required for all threads if the Thread CPU-Time Clocks option is supported.
Accuracy of CPU-Time Clocks

The mechanism used to measure the execution time of processes and threads is specified in IEEE Std 1003.1-2001 as implementation-defined. The reason for this is that both the underlying hardware and the implementation architecture have a very strong influence on the accuracy achievable for measuring CPU time. For some implementations, the specification of strict accuracy requirements would represent very large overheads, or even the impossibility of being implemented.

Since the mechanism for measuring execution time is implementation-defined, realtime applications will be able to take advantage of accurate implementations using a portable interface. Of course, strictly conforming applications cannot rely on any particular degree of accuracy, in the same way as they cannot rely on a very accurate measurement of wall clock time. There will always exist applications whose accuracy or efficiency requirements on the implementation are more rigid than the values defined in IEEE Std 1003.1-2001 or any other standard.

In any case, there is a minimum set of characteristics that realtime applications would expect from most implementations. One such characteristic is that the sum of all the execution times of all the threads in a process equals the process execution time, when no CPU-time clocks are disabled. This need not always be the case because implementations may differ in how they account for time during context switches. Another characteristic is that the sum of the execution times of all processes in a system equals the number of processors, multiplied by the elapsed time, assuming that no processor is idle during that elapsed time. However, in some implementations it might not be possible to relate CPU time to elapsed time. For example, in a heterogeneous multi-processor system in which each processor runs at a different speed, an implementation may choose to define each "second" of CPU time to be a certain number of "cycles" that a CPU has executed.
Existing Practice

Measuring and limiting the execution time of each concurrent activity are common features of most industrial implementations of realtime systems. Almost all critical realtime systems are currently built upon a cyclic executive. With this approach, a regular timer interrupt kicks off the next sequence of computations. It also checks that the current sequence has completed. If it has not, then some error recovery action can be undertaken (or at least an overrun is avoided). Current software engineering principles and the increasing complexity of software are driving application developers to implement these systems on multi-threaded or multi-process operating systems. Therefore, if a POSIX operating system is to be used for this type of application, then it must offer the same level of protection.

Execution time clocks are also common in most UNIX implementations, although these clocks usually have requirements different from those of realtime applications. The POSIX.1 times() function supports the measurement of the execution time of the calling process, and its terminated child processes. This execution time is measured in clock ticks and is supplied as two different values with the user and system execution times, respectively. BSD supports the function getrusage(), which allows the calling process to get information about the resources used by itself and/or all of its terminated child processes. The resource usage includes user and system CPU time. Some UNIX systems have options to specify high resolution (up to one microsecond) CPU-time clocks using the times() or the getrusage() functions.
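
For comparison, the two historical interfaces described above can be exercised as in the following sketch. The wrapper names cpu_seconds_times() and cpu_seconds_rusage() are assumptions made for the example; both report the combined user and system CPU time of the calling process only.

```c
#define _XOPEN_SOURCE 700
#include <sys/resource.h>
#include <sys/times.h>
#include <unistd.h>

/* Hypothetical helper: user plus system CPU time of the calling
 * process in seconds, as reported by times(). Returns -1.0 on failure. */
double cpu_seconds_times(void)
{
    struct tms t;
    long ticks_per_sec = sysconf(_SC_CLK_TCK);  /* clock ticks per second */

    if (ticks_per_sec <= 0 || times(&t) == (clock_t)-1)
        return -1.0;
    return (double)(t.tms_utime + t.tms_stime) / (double)ticks_per_sec;
}

/* Hypothetical helper: the same quantity as reported by getrusage(). */
double cpu_seconds_rusage(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6
         + (double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec / 1e6;
}
```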

The times() and getrusage() interfaces do not meet important realtime requirements, such as the possibility of monitoring execution time from a different process or thread, or the possibility of detecting an execution time overrun. The latter requirement is supported in some UNIX implementations that are able to send a signal when the execution time of a process has exceeded some specified value. For example, BSD defines the functions getitimer() and setitimer(), which can operate either on a realtime clock (wall-clock), or on virtual-time or profile-time clocks which measure CPU time in two different ways. These functions do not support access to the execution time of other processes.
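
The BSD-style overrun detection mentioned above can be sketched with setitimer() and the virtual-time clock. The names arm_virtual_timer(), virtual_timer_expired(), and on_sigvtalrm() are assumptions made for this example. Note that the timer applies only to the calling process, which is the limitation the paragraph describes.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t vt_expired = 0;

/* Signal handler: only records that the virtual timer expired. */
static void on_sigvtalrm(int signo)
{
    (void)signo;
    vt_expired = 1;
}

/* Hypothetical helper: request SIGVTALRM after the process has run for
 * `ms` milliseconds of user-mode CPU time.
 * Returns 0 on success, -1 on failure (errno is set). */
int arm_virtual_timer(long ms)
{
    struct sigaction sa;
    struct itimerval itv;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigvtalrm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGVTALRM, &sa, NULL) != 0)
        return -1;

    memset(&itv, 0, sizeof itv);            /* zero it_interval: one-shot */
    itv.it_value.tv_sec = ms / 1000;
    itv.it_value.tv_usec = (ms % 1000) * 1000L;
    return setitimer(ITIMER_VIRTUAL, &itv, NULL);
}

/* Returns nonzero once SIGVTALRM has been delivered. */
int virtual_timer_expired(void)
{
    return vt_expired != 0;
}
```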

IBM's MVS operating system supports per-process and per-thread execution time clocks. It also supports limiting the execution time of a given process.

Given all this existing practice, the working group considered that the POSIX.1 clocks and timers interface was appropriate to meet most of the requirements that realtime applications have for execution time clocks. Functions were added to get the CPU time clock IDs, and to allow/disallow the thread CPU-time clocks (in order to preserve the efficiency of some implementations of threads).
Clock Constants

The definition of the manifest constants CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID allows processes or threads, respectively, to access their own execution-time clocks. However, given a process or thread, access to its own execution-time clock is also possible if the clock ID of this clock is obtained through a call to clock_getcpuclockid() or pthread_getcpuclockid(). Therefore, these constants are not necessary and could be deleted to make the interface simpler. Their existence saves one system call in the first access to the CPU-time clock of each process or thread. The working group considered this issue and decided to leave the constants in IEEE Std 1003.1-2001 because they are closer to the POSIX.1b use of clock identifiers.
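
The two access paths discussed above can be compared in a short sketch. The function names thread_cpu_time_const() and thread_cpu_time_lookup() are assumptions made for the example; the second form needs one extra call to obtain the clock ID before the clock can be read.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/* Read the calling thread's CPU time via the manifest constant. */
int thread_cpu_time_const(struct timespec *out)
{
    return clock_gettime(CLOCK_THREAD_CPUTIME_ID, out) == 0 ? 0 : -1;
}

/* Read the same clock through a clock ID obtained at run time. */
int thread_cpu_time_lookup(struct timespec *out)
{
    clockid_t cid;

    if (pthread_getcpuclockid(pthread_self(), &cid) != 0)
        return -1;
    return clock_gettime(cid, out) == 0 ? 0 : -1;
}
```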
Library Implementations of Threads

In library implementations of threads, kernel entities and library threads can coexist. In this case, if the CPU-time clocks are supported, most of the clock and timer functions will need to have two implementations: one in the thread library, and one in the system calls library. The main difference between these two implementations is that the thread library implementation will have to deal with clocks and timers that reside in the thread space, while the kernel implementation will operate on timers and clocks that reside in kernel space. In the library implementation, if the clock ID refers to a clock that resides in the kernel, a kernel call will have to be made. The correct version of the function can be chosen by specifying the appropriate order for the libraries during the link process.
History of Resolution Issues: Deletion of the enable Attribute

In early proposals, consideration was given to inclusion of an attribute called enable for CPU-time clocks. This would allow implementations to avoid the overhead of measuring execution time for those processes or threads for which this measurement was not required. However, this is unnecessary since processes are already required to measure execution time by the POSIX.1 times() function. Consequently, the enable attribute is not present.

Rationale Relating to Timeouts

Requirements for Timeouts

Realtime systems which must operate reliably over extended periods without human intervention are characteristic in embedded applications such as avionics, machine control, and space exploration, as well as more mundane applications such as cable TV, security systems, and plant automation. A multi-tasking paradigm, in which many independent and/or cooperating software functions relinquish the processor(s) while waiting for a specific stimulus, resource, condition, or operation completion, is very useful in producing well engineered programs for such systems. For such systems to be robust and fault-tolerant, expected occurrences that are unduly delayed or that never occur must be detected so that appropriate recovery actions may be taken. This is difficult if there is no way for a task to regain control of a processor once it has relinquished control (blocked) awaiting an occurrence which, perhaps because of corrupted code, hardware malfunction, or latent software bugs, will not happen when expected. Therefore, the common practice in realtime operating systems is to provide a capability to time out such blocking services. Although there are several methods to achieve this already defined by POSIX, none are as reliable or efficient as initiating a timeout simultaneously with initiating a blocking service. This is especially critical in hard-realtime embedded systems because the processors typically have little time reserve, and allowed fault recovery times are measured in milliseconds rather than seconds.

The working group largely agreed that such timeouts were necessary and ought to become part of IEEE Std 1003.1-2001, particularly vendors of realtime operating systems whose customers had already expressed a strong need for timeouts. There was some resistance to inclusion of timeouts in IEEE Std 1003.1-2001 because the desired effect, fault tolerance, could, in theory, be achieved using existing facilities and alternative software designs, but there was no compelling evidence that realtime system designers would embrace such designs at the sacrifice of performance and/or simplicity.
Which Services should be Timed Out?

Originally, the working group considered the prospect of providing timeouts on all blocking services, including those currently existing in POSIX.1, POSIX.1b, and POSIX.1c, and future interfaces to be defined by other working groups, as sort of a general policy. This was rather quickly rejected because of the scope of such a change, and the fact that many of those services would not normally be used in a realtime context. More traditional timesharing solutions to timeout would suffice for most of the POSIX.1 interfaces, while others had asynchronous alternatives which, while more complex to utilize, would be adequate for some realtime and all non-realtime applications.

The list of potential candidates for timeouts was narrowed to the following for further consideration:
- POSIX.1b
  - sem_wait()
  - mq_receive()
  - mq_send()
  - lio_listio()
  - aio_suspend()
  - sigwait() (timeout already implemented by sigtimedwait())
- POSIX.1c
  - pthread_mutex_lock()
  - pthread_join()
  - pthread_cond_wait() (timeout already implemented by pthread_cond_timedwait())
- POSIX.1
  - read()
  - write()
After further review by the working group, the lio_listio(), read(), and write() functions (all forms of blocking synchronous I/O) were eliminated from the list because of the following:
- Asynchronous alternatives exist
- Timeouts can be implemented, albeit non-portably, in device drivers
- A strong desire not to introduce modifications to POSIX.1 interfaces
The working group ultimately rejected pthread_join() since both that interface and a timed variant of that interface are non-minimal and may be implemented as a function. See below for a library implementation of pthread_join().

Thus, there was a consensus among the working group members to add timeouts to 4 of the remaining 5 functions (the timeout for aio_suspend() was ultimately added directly to POSIX.1b, while the others were added by POSIX.1d). However, pthread_mutex_lock() remained contentious.
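
As an illustration of one of the timed services that resulted, the following sketch wraps sem_timedwait(). The wrapper name sem_wait_until() is an assumption made for this example; the deadline is an absolute time measured against CLOCK_REALTIME.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <semaphore.h>
#include <time.h>

/* Hypothetical helper: wait on `sem` until the absolute time `deadline`.
 * Returns 0 if the semaphore was locked, ETIMEDOUT if the deadline
 * passed first, or another error number on failure. */
int sem_wait_until(sem_t *sem, const struct timespec *deadline)
{
    while (sem_timedwait(sem, deadline) != 0) {
        if (errno != EINTR)     /* retry only if interrupted by a signal */
            return errno;
    }
    return 0;
}
```

A caller that receives ETIMEDOUT can start its recovery action immediately, without a separate watchdog thread or timer.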

Many feel that pthread_mutex_lock() falls into the same class as the other functions; that is, it is desirable to time out a mutex lock because a mutex may fail to be unlocked due to errant or corrupted code in a critical section (looping or branching outside of the unlock code), and therefore is equally in need of a reliable, simple, and efficient timeout. In fact, since mutexes are intended to guard small critical sections, most pthread_mutex_lock() calls would be expected to obtain the lock without blocking or utilizing any kernel service, even in implementations of threads with global contention scope; the timeout alternative need only be considered after it is determined that the thread must block.

Those opposed to timing out mutexes feel that the very simplicity of the mutex is compromised by adding a timeout semantic, and that to do so is senseless. They claim that if a timed mutex is really deemed useful by a particular application, then it can be constructed from the facilities already in POSIX.1b and POSIX.1c. The following two C-language library implementations of mutex locking with timeout represent the solutions offered (in both implementations, the timeout parameter is specified as absolute time, not relative time as in the proposed POSIX.1c interfaces).
Spinlock Implementation
```
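#include <pthread.h>
#include <sched.h>
#include <time.h>
#include <errno.h>

/*
 * Helper supplied so that the example is self-contained; it is not
 * defined by POSIX. Returns <0, 0, or >0 as a is earlier than,
 * equal to, or later than b.
 */
static int timespec_cmp(const struct timespec *a,
        const struct timespec *b)
    {
    if (a->tv_sec != b->tv_sec)
        return (a->tv_sec < b->tv_sec) ? -1 : 1;
    if (a->tv_nsec != b->tv_nsec)
        return (a->tv_nsec < b->tv_nsec) ? -1 : 1;
    return 0;
    }

int pthread_mutex_timedlock(pthread_mutex_t *mutex,
        const struct timespec *timeout)
    {
    struct timespec timenow;

    while (pthread_mutex_trylock(mutex) == EBUSY)
        {
        clock_gettime(CLOCK_REALTIME, &timenow);
        if (timespec_cmp(&timenow,timeout) >= 0)
            {
            return ETIMEDOUT;
            }
        sched_yield();    /* let other threads run before retrying */
        }
    return 0;
    }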
```
The Spinlock implementation is generally unsuitable for any application using priority-based thread scheduling policies such as SCHED_FIFO or SCHED_RR. The mutex could currently be held by a thread of lower priority within the same allocation domain; because the waiting thread never blocks, only threads of equal or higher priority will ever run, and the mutex cannot be unlocked. Setting priority inheritance or priority ceiling protocol on the mutex does not solve this problem, since the priority of a mutex-owning thread is only boosted if higher priority threads are blocked waiting for the mutex, which is clearly not the case for this spinlock.
Condition Wait Implementation
```
#include <pthread.h>
#include <time.h>
#include <errno.h>

#define FALSE 0
#define TRUE  1

struct timed_mutex
    {
    int locked;
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    };
typedef struct timed_mutex timed_mutex_t;

int timed_mutex_lock(timed_mutex_t *tm,
        const struct timespec *timeout)
    {
    int timedout=FALSE;
    int error_status;

    pthread_mutex_lock(&tm->mutex);

    while (tm->locked && !timedout)
        {
        if ((error_status=pthread_cond_timedwait(&tm->cond,
            &tm->mutex,
            timeout))!=0)
            {
            if (error_status==ETIMEDOUT) timedout = TRUE;
            }
        }

    if(timedout)
        {
        pthread_mutex_unlock(&tm->mutex);
        return ETIMEDOUT;
        }
    else
        {
        tm->locked = TRUE;
        pthread_mutex_unlock(&tm->mutex);
        return 0;
        }
    }

void timed_mutex_unlock(timed_mutex_t *tm)
    {
    pthread_mutex_lock(&tm->mutex); /* for case assignment not atomic */
    tm->locked = FALSE;
    pthread_mutex_unlock(&tm->mutex);
    pthread_cond_signal(&tm->cond);
    }
```
The Condition Wait implementation effectively substitutes the pthread_cond_timedwait() function (which is currently timed out) for the desired pthread_mutex_timedlock(). Since waits on condition variables currently do not include protocols which avoid priority inversion, this method is generally unsuitable for realtime applications because it does not provide the same priority inversion protection as the untimed pthread_mutex_lock(). Also, for any given implementations of the current mutex and condition variable primitives, this library implementation has a performance cost at least 2.5 times that of the untimed pthread_mutex_lock() even in the case where the timed mutex is readily locked without blocking (the interfaces required for this case are the pthread_mutex_lock(), pthread_mutex_unlock(), and pthread_cond_signal() calls in the code above). Even in uniprocessors or where assignment is atomic, at least an additional pthread_cond_signal() is required. pthread_mutex_timedlock() could be implemented at effectively no performance penalty in this case because the timeout parameters need only be considered after it is determined that the mutex cannot be locked immediately.

Thus it has not yet been shown that the full semantics of mutex locking with timeout can be efficiently and reliably achieved using existing interfaces. Even if the existence of an acceptable library implementation were proven, it is difficult to justify why the interface itself should not be made portable, especially considering approval for the other four timeouts.
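
For reference, a caller of the timed lock interface might be structured as in the sketch below. The function name guarded_increment() and the counter it protects are assumptions made for the example; the deadline is an absolute CLOCK_REALTIME time.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical helper: increment *counter under mutex `m`, giving up
 * at the absolute time `deadline`. Returns 0 on success, ETIMEDOUT if
 * the lock could not be obtained in time, or another error number. */
int guarded_increment(pthread_mutex_t *m,
    const struct timespec *deadline, long *counter)
{
    int err = pthread_mutex_timedlock(m, deadline);

    if (err != 0)
        return err;         /* e.g. ETIMEDOUT: caller starts recovery */
    ++*counter;             /* critical section */
    return pthread_mutex_unlock(m);
}
```

The timeout is only consulted when the mutex cannot be locked immediately, so the uncontended path costs the same as an untimed lock.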

Rationale for Library Implementation of pthread_timedjoin()

Library implementation of pthread_timedjoin():

```
/*
 * Construct a thread variety entirely from existing functions
 * with which a join can be done, allowing the join to time out.
 */
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

#define FALSE 0
#define TRUE  1

struct timed_thread {
    pthread_t t;
    pthread_mutex_t m;
    int exiting;
    pthread_cond_t exit_c;
    void *(*start_routine)(void *arg);
    void *arg;
    void *status;
};

typedef struct timed_thread *timed_thread_t;
static pthread_key_t timed_thread_key;
static pthread_once_t timed_thread_once = PTHREAD_ONCE_INIT;

void timed_thread_exit(void *status);

static void timed_thread_init(void)
{
    pthread_key_create(&timed_thread_key, NULL);
}

/*
 * Routine to establish thread-specific data value and run the actual
 * thread start routine which was supplied to timed_thread_create().
 */
static void *timed_thread_start_routine(void *args)
{
    timed_thread_t tt = (timed_thread_t) args;

    pthread_once(&timed_thread_once, timed_thread_init);
    pthread_setspecific(timed_thread_key, (void *)tt);
    timed_thread_exit((tt->start_routine)(tt->arg));
    return NULL;    /* not reached: timed_thread_exit() does not return */
}

/*
 * Allocate a thread which can be used with timed_thread_join().
 * The new descriptor is returned through ttp.
 */
int timed_thread_create(timed_thread_t *ttp, const pthread_attr_t *attr,
    void *(*start_routine)(void *), void *arg)
{
    timed_thread_t tt;
    int result;

    tt = (timed_thread_t) malloc(sizeof(struct timed_thread));
    if (tt == NULL)
        return ENOMEM;
    pthread_mutex_init(&tt->m,NULL);
    tt->exiting = FALSE;
    pthread_cond_init(&tt->exit_c,NULL);
    tt->start_routine = start_routine;
    tt->arg = arg;
    tt->status = NULL;

    if ((result = pthread_create(&tt->t, attr,
        timed_thread_start_routine, (void *)tt)) != 0) {
        free(tt);
        return result;
    }

    pthread_detach(tt->t);
    *ttp = tt;
    return 0;
}

int timed_thread_join(timed_thread_t tt,
    struct timespec *timeout,
    void **status)
{
    int result;

    pthread_mutex_lock(&tt->m);
    result = 0;
    /*
     * Wait until the thread announces that it is exiting,
     * or until timeout.
     */
    while (result == 0 && ! tt->exiting) {
        result = pthread_cond_timedwait(&tt->exit_c, &tt->m, timeout);
    }
    pthread_mutex_unlock(&tt->m);
    if (result == 0 && tt->exiting) {
        *status = tt->status;
        free((void *)tt);
        return result;
    }
    return result;
}

void timed_thread_exit(void *status)
{
    timed_thread_t tt;
    void *specific;

    if ((specific=pthread_getspecific(timed_thread_key)) == NULL){
        /*
         * Handle cases which won't happen with correct usage.
         */
        pthread_exit( NULL);
    }
    tt = (timed_thread_t) specific;
    pthread_mutex_lock(&tt->m);
    /*
     * Tell a joiner that we're exiting.
     */
    tt->status = status;
    tt->exiting = TRUE;
    pthread_cond_signal(&tt->exit_c);
    pthread_mutex_unlock(&tt->m);
    /*
     * Call pthread_exit() to call destructors and really
     * exit the thread.
     */
    pthread_exit(NULL);
}
```

The pthread_join() C-language example shown above demonstrates that it is possible, using existing pthread facilities, to construct a variety of thread which allows for joining such a thread, but which allows the join operation to time out. It does this by using a pthread_cond_timedwait() to wait for the thread to exit. A timed_thread_t descriptor structure is used to pass parameters from the creating thread to the created thread, and from the exiting thread to the joining thread. This implementation is roughly equivalent to what a normal pthread_join() implementation would do, with the single change being that pthread_cond_timedwait() is used in place of a simple pthread_cond_wait().

Since it is possible to implement such a facility entirely from existing pthread interfaces, and with roughly equal efficiency and complexity to an implementation which would be provided directly by a pthreads implementation, it was the consensus of the working group members that any pthread_timedjoin() facility would be unnecessary, and should not be provided.

Form of the Timeout Interfaces

The working group considered a number of alternative ways to add timeouts to blocking services. At first, a system interface which would specify a one-shot or persistent timeout to be applied to subsequent blocking services invoked by the calling process or thread was considered because it allowed all blocking services to be timed out in a uniform manner with a single additional interface; this was rather quickly rejected because it could easily result in the wrong services being timed out.

It was suggested that a timeout value might be specified as an attribute of the object (semaphore, mutex, message queue, and so on), but there was no consensus on this, either on a case-by-case basis or for all timeouts.

Looking at the two existing timeouts for blocking services indicates that the working group members favor a separate interface for the timed version of a function. However, pthread_cond_timedwait() utilizes an absolute timeout value while sigtimedwait() uses a relative timeout value. The working group members agreed that relative timeout values are appropriate where the timeout mechanism's primary use was to deal with an unexpected or error situation, but they are inappropriate when the timeout must expire at a particular time, or before a specific deadline. For the timeouts being introduced in IEEE Std 1003.1-2001, the working group considered allowing both relative and absolute timeouts as is done with POSIX.1b timers, but ultimately favored the simpler absolute timeout form.

An absolute time measure can be easily implemented on top of an interface that specifies relative time, by reading the clock, calculating the difference between the current time and the desired wake-up time, and issuing a relative timeout call. But there is a race condition with this approach because the thread could be preempted after reading the clock, but before making the timed-out call; in this case, the thread would be awakened later than it should and, thus, if the wake-up time represented a deadline, it would miss it.

There is also a race condition when trying to build a relative timeout on top of an interface that specifies absolute timeouts. In this case, the clock would have to be read to calculate the absolute wake-up time as the sum of the current time plus the relative timeout interval. In this case, if the thread is preempted after reading the clock but before making the timed-out call, the thread would be awakened earlier than desired.
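
The conversion from a relative interval to an absolute wake-up time described above can be sketched as follows. The names timespec_add_ms() and deadline_after_ms() are assumptions made for this example, and the interval is assumed to be non-negative.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical helper: return base + ms milliseconds, with tv_nsec
 * normalized to the range [0, NSEC_PER_SEC). Assumes ms >= 0. */
struct timespec timespec_add_ms(struct timespec base, long ms)
{
    base.tv_sec += ms / 1000;
    base.tv_nsec += (ms % 1000) * 1000000L;
    if (base.tv_nsec >= NSEC_PER_SEC) {
        base.tv_sec += 1;
        base.tv_nsec -= NSEC_PER_SEC;
    }
    return base;
}

/* Hypothetical helper: convert a relative interval into an absolute
 * CLOCK_REALTIME deadline. If the thread is preempted between this
 * call and the timed wait, the effective wait is shorter than `ms`.
 * Returns 0 on success, -1 on failure. */
int deadline_after_ms(long ms, struct timespec *deadline)
{
    struct timespec now;

    if (clock_gettime(CLOCK_REALTIME, &now) != 0)
        return -1;
    *deadline = timespec_add_ms(now, ms);
    return 0;
}
```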

But the race condition with the absolute timeouts interface is not as bad as the one that happens with the relative timeout interface, because there are simple workarounds. For the absolute timeouts interface, if the timing requirement is a deadline, the deadline can still be met because the thread woke up earlier than the deadline. If the timeout is just used as an error recovery mechanism, the precision of timing is not really important. If the timing requirement is that between actions A and B a minimum interval of time must elapse, the absolute timeout interface can be safely used by reading the clock after action A has been started. It could be argued that, since the call with the absolute timeout is atomic from the application point of view, it is not possible to read the clock after action A, if this action is part of the timed-out call. But looking at the nature of the calls for which timeouts are specified (locking a mutex, waiting for a semaphore, waiting for a message, or waiting until there is space in a message queue), the timeouts that an application would build on these actions would not be triggered by these actions themselves, but by some other external action. For example, if waiting for a message to arrive at a message queue, and waiting for at least 20 milliseconds, this time interval would start to be counted from some event that would trigger both the action that produces the message, as well as the action that waits for the message to arrive, and not by the wait-for-message operation itself. In this case, the workaround proposed above could be used.
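
The minimum-interval workaround can be illustrated with a wait whose deadline is computed from the time of the triggering event rather than from the time of the call. The sketch below uses a condition variable in place of a message queue; struct mailbox and wait_message_since() are assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical one-slot mailbox guarded by a mutex and condition. */
struct mailbox {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int             has_message;
};

/* Wait for a message until `interval_ms` after `event_time`, where
 * event_time was read from CLOCK_REALTIME when the triggering event
 * occurred. Returns 0 if a message was consumed, ETIMEDOUT if the
 * interval elapsed first, or another error number. */
int wait_message_since(struct mailbox *mb,
    const struct timespec *event_time, long interval_ms)
{
    struct timespec deadline = *event_time;
    int err = 0;

    /* The deadline is anchored at the event, not at this call. */
    deadline.tv_sec += interval_ms / 1000;
    deadline.tv_nsec += (interval_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&mb->lock);
    while (!mb->has_message && err == 0)
        err = pthread_cond_timedwait(&mb->ready, &mb->lock, &deadline);
    if (mb->has_message) {
        mb->has_message = 0;    /* consume the message */
        err = 0;
    }
    pthread_mutex_unlock(&mb->lock);
    return err;
}
```

Because the deadline does not depend on when the waiting thread happens to be scheduled, a preemption before the call cannot lengthen or shorten the interval measured from the event.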

For these reasons, the absolute timeout is preferred over the relative timeout interface.

B.2.9 Threads

Threads will normally be more expensive than subroutines (or functions, routines, and so on) if specialized hardware support is not provided. Nevertheless, threads should be sufficiently efficient to encourage their use as a medium to fine-grained structuring mechanism for parallelism in an application. Structuring an application using threads then allows it to take immediate advantage of any underlying parallelism available in the host environment. This means implementors are encouraged to optimize for fast execution at the possible expense of efficient utilization of storage. For example, a common thread creation technique is to cache appropriate thread data structures. That is, rather than releasing system resources, the implementation retains these resources and reuses them when the program next asks to create a new thread. If this reuse of thread resources is to be possible, there has to be very little unique state associated with each thread, because any such state has to be reset when the thread is reused.

Thread Creation Attributes

Attributes objects are provided for threads, mutexes, and condition variables as a mechanism to support probable future standardization in these areas without requiring that the interface itself be changed.

Attributes objects provide clean isolation of the configurable aspects of threads. For example, "stack size" is an important attribute of a thread, but it cannot be expressed portably. When porting a threaded program, stack sizes often need to be adjusted. The use of attributes objects can help by allowing the changes to be isolated in a single place, rather than being spread across every instance of thread creation.

Attributes objects can be used to set up classes of threads with similar attributes; for example, "threads with large stacks and high priority" or "threads with minimal stacks". These classes can be defined in a single place and then referenced wherever threads need to be created. Changes to "class" decisions become straightforward, and detailed analysis of each pthread_create() call is not required.
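
A "class" of threads in this sense could be set up as in the following sketch. The function name init_worker_attr() and the particular choice of attributes (an explicit stack size and the detached state) are assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

/* Hypothetical helper defining one application-level class of threads:
 * detached threads with a caller-chosen stack size.
 * Returns 0 on success or an error number on failure. */
int init_worker_attr(pthread_attr_t *attr, size_t stack_bytes)
{
    int err = pthread_attr_init(attr);  /* implementation defaults */

    if (err != 0)
        return err;
    err = pthread_attr_setstacksize(attr, stack_bytes);
    if (err == 0)
        err = pthread_attr_setdetachstate(attr, PTHREAD_CREATE_DETACHED);
    if (err != 0)
        pthread_attr_destroy(attr);     /* do not leak a half-built object */
    return err;
}
```

Every pthread_create() call for this class then passes the same attributes object, so a port that needs a different stack size changes a single argument to init_worker_attr().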

The attributes objects are defined as opaque types as an aid to extensibility. If these objects had been specified as structures, adding new attributes would force recompilation of all multi-threaded programs when the attributes objects are extended; this might not be possible if different program components were supplied by different vendors.

Additionally, opaque attributes objects present opportunities for improving performance. Argument validity can be checked once when attributes are set, rather than each time a thread is created. Implementations will often need to cache kernel objects that are expensive to create. Opaque attributes objects provide an efficient mechanism to detect when cached objects become invalid due to attribute changes.

Because assignment is not necessarily defined on a given opaque type, implementation-defined default values cannot be defined in a portable way. The solution to this problem is to allow attribute objects to be initialized dynamically by attributes object initialization functions, so that default values can be supplied automatically by the implementation.

The following proposal was provided as a suggested alternative to the supplied attributes:

- Maintain the style of passing a parameter formed by the bitwise-inclusive OR of flags to the initialization routines (pthread_create(), pthread_mutex_init(), pthread_cond_init()). The parameter containing the flags should be an opaque type for extensibility. If no flags are set in the parameter, then the objects are created with default characteristics. An implementation may specify implementation-defined flag values and associated behavior.
- If further specialization of mutexes and condition variables is necessary, implementations may specify additional procedures that operate on the pthread_mutex_t and pthread_cond_t objects (instead of on attributes objects).

The difficulties with this solution are:

- A bitmask is not opaque if bits have to be set into bit-vector attributes objects using explicitly-coded bitwise-inclusive OR operations. If the set of options exceeds an int, application programmers need to know the location of each bit. If bits are set or read by encapsulation (that is, get*() or set*() functions), then the bitmask is merely an implementation of attributes objects as currently defined and should not be exposed to the programmer.
- Many attributes are not Boolean or very small integral values. For example, scheduling policy may be placed in 3 bits or 4 bits, but priority requires 5 bits or more, thereby taking up at least 8 bits out of a possible 16 bits on machines with 16-bit integers. Because of this, the bitmask can only reasonably control whether particular attributes are set or not, and it cannot serve as the repository of the value itself. The value needs to be specified as a function parameter (which is non-extensible), or by setting a structure field (which is non-opaque), or by get*() and set*() functions (making the bitmask a redundant addition to the attributes objects).

Stack size is defined as an optional attribute because the very notion of a stack is inherently machine-dependent. Some implementations may not be able to change the size of the stack, for example, and others may not need to because stack pages may be discontiguous and can be allocated and released on demand.

The attribute mechanism has been designed in large measure for extensibility. Future extensions to the attribute mechanism or to any attributes object defined in IEEE Std 1003.1-2001 have to be done with care so as not to affect binary-compatibility.

Attribute objects, even if allocated by means of dynamic allocation functions such as malloc(), may have their size fixed at compile time. This means, for example, a pthread_create() in an implementation with extensions to the pthread_attr_t cannot look beyond the area that the binary application assumes is valid. This suggests that implementations should maintain a size field in the attributes object, as well as possibly version information, if extensions in different directions (possibly by different vendors) are to be accommodated.
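
The size-field technique suggested above might be realized as in the following sketch. All of the names here (struct attr_v1, struct attr_v2, guard_pages, DEFAULT_GUARD_PAGES, attr_guard_pages()) are hypothetical and do not describe the layout of any real pthread_attr_t.

```c
#include <stddef.h>

/* Hypothetical original layout known to older binaries. */
struct attr_v1 {
    size_t   size;          /* bytes the application was compiled against */
    unsigned version;
    size_t   stack_size;
};

/* Hypothetical extended layout: the original object plus one field. */
struct attr_v2 {
    struct attr_v1 base;
    int            guard_pages;
};

#define DEFAULT_GUARD_PAGES 1

/* Library side: read the extended field only if the caller's object is
 * large enough to contain it; otherwise fall back to a default, so the
 * library never looks beyond the area an old binary allocated. */
int attr_guard_pages(const struct attr_v1 *attr)
{
    if (attr->size >= sizeof(struct attr_v2))
        return ((const struct attr_v2 *)attr)->guard_pages;
    return DEFAULT_GUARD_PAGES;
}
```

An initialization function compiled into the application would record sizeof its own attributes type in the size field, which is what lets a newer library distinguish the two layouts.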

Thread Implementation Models

There are various thread implementation models. At one end of the spectrum is the "library-thread model". In such a model, the threads of a process are not visible to the operating system kernel, and the threads are not kernel-scheduled entities. The process is the only kernel-scheduled entity. The process is scheduled onto the processor by the kernel according to the scheduling attributes of the process. The threads are scheduled onto the single kernel-scheduled entity (the process) by the runtime library according to the scheduling attributes of the threads. A problem with this model is that it constrains concurrency. Since there is only one kernel-scheduled entity (namely, the process), only one thread per process can execute at a time. If the thread that is executing blocks on I/O, then the whole process blocks.

At the other end of the spectrum is the "kernel-thread model". In this model, all threads are visible to the operating system kernel. Thus, all threads are kernel-scheduled entities, and all threads can concurrently execute. The threads are scheduled onto processors by the kernel according to the scheduling attributes of the threads. The drawback to this model is that the creation and management of the threads entails operating system calls, as opposed to subroutine calls, which makes kernel threads heavier weight than library threads.

Hybrids of these two models are common. A hybrid model offers the speed of library threads and the concurrency of kernel threads. In hybrid models, a process has some (relatively small) number of kernel-scheduled entities associated with it. It also has a potentially much larger number of library threads associated with it. Some library threads may be bound to kernel-scheduled entities, while the other library threads are multiplexed onto the remaining kernel-scheduled entities. There are two levels of thread scheduling:

- The runtime library manages the scheduling of (unbound) library threads onto kernel-scheduled entities.
- The kernel manages the scheduling of kernel-scheduled entities onto processors.

For this reason, a hybrid model is referred to as a two-level threads scheduling model. In this model, the process can have multiple concurrently executing threads; specifically, it can have as many concurrently executing threads as it has kernel-scheduled entities.

Thread-Specific Data

Many applications require that a certain amount of context be maintained on a per-thread basis across procedure calls. A common example is a multi-threaded library routine that allocates resources from a common pool and maintains an active resource list for each thread. The thread-specific data interface provided to meet these needs may be viewed as a two-dimensional array of values with keys serving as the row index and thread IDs as the column index (although the implementation need not work this way).

Models

Three possible thread-specific data models were considered:
1. No Explicit Support
  
  A standard thread-specific data interface is not strictly necessary to support applications that require per-thread context. One could, for example, provide a hash function that converted a pthread_t into an integer value that could then be used to index into a global array of per-thread data pointers. This hash function, in conjunction with pthread_self(), would be all the interface required to support a mechanism of this sort. Unfortunately, this technique is cumbersome. It can lead to duplicated code as each set of cooperating modules implements its own per-thread data management scheme.
2. Single (void *) Pointer
  
  Another technique would be to provide a single word of per-thread storage and a pair of functions to fetch and store the value of this word. The word could then hold a pointer to a block of per-thread memory. The allocation, partitioning, and general use of this memory would be entirely up to the application. Although this method is not as problematic as technique 1, it suffers from interoperability problems. For example, all modules using the per-thread pointer would have to agree on a common usage protocol.
3. Key/Value Mechanism
  
  This method associates an opaque key (for example, stored in a variable of type pthread_key_t) with each per-thread datum. These keys play the role of identifiers for per-thread data. This technique is the most generic and avoids the problems noted above, albeit at the cost of some complexity.
The primary advantage of the third model is its information hiding properties. Modules using this model are free to create and use their own key(s) independent of all other such usage, whereas the other models require that all modules that use thread-specific context explicitly cooperate with all other such modules. The data-independence provided by the third model is worth the additional interface.
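
As a minimal sketch of the key/value model, assume a hypothetical module that keeps a private counter for each thread. The names counter_key, counter_bump(), and counter_demo() are illustrative only and are not part of IEEE Std 1003.1-2001; only the pthread_*() calls are. The module creates its key once with pthread_once() and needs no cooperation from any other module that uses thread-specific data:

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical module state: one key, created once, private to this module. */
static pthread_key_t counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void counter_key_init(void)
{
    /* free() is registered as the destructor for each thread's block. */
    int rc = pthread_key_create(&counter_key, free);
    assert(rc == 0);
    (void)rc;
}

/* Increment and return the calling thread's counter, or -1 on failure. */
int counter_bump(void)
{
    int *p;

    pthread_once(&counter_once, counter_key_init);
    p = pthread_getspecific(counter_key);
    if (p == NULL) {
        /* First call in this thread: allocate and bind its datum. */
        p = calloc(1, sizeof *p);
        if (p == NULL || pthread_setspecific(counter_key, p) != 0) {
            free(p);
            return -1;
        }
    }
    return ++*p;
}

/* Thread body: bump twice and store the final per-thread value. */
static void *counter_worker(void *arg)
{
    int *out = arg;

    counter_bump();
    *out = counter_bump();
    return NULL;
}

/* Run two threads; each has its own counter. Returns 0 on success. */
int counter_demo(int *a, int *b)
{
    pthread_t t1, t2;

    if (pthread_create(&t1, NULL, counter_worker, a) != 0)
        return -1;
    if (pthread_create(&t2, NULL, counter_worker, b) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Because each thread is bound to its own datum, neither thread's increments are visible to the other, although both use the same key.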
Requirements

It is important that it be possible to implement the thread-specific data interface without the use of thread private memory. To do otherwise would increase the weight of each thread, thereby limiting the range of applications for which the threads interfaces provided by IEEE Std 1003.1-2001 are appropriate.

The values that one binds to the key via pthread_setspecific() may, in fact, be pointers to shared storage locations available to all threads. It is only the key/value bindings that are maintained on a per-thread basis, and these can be kept in any portion of the address space that is reserved for use by the calling thread (for example, on the stack). Thus, no per-thread MMU state is required to implement the interface. On the other hand, there is nothing in the interface specification to preclude the use of a per-thread MMU state if it is available (for example, the key values returned by pthread_key_create() could be thread private memory addresses).
Standardization Issues

Thread-specific data is a requirement for a usable thread interface. The binding described in this section provides a portable thread-specific data mechanism for languages that do not directly support a thread-specific storage class. A binding to IEEE Std 1003.1-2001 for a language that does include such a storage class need not provide this specific interface.

If a language were to include the notion of thread-specific storage, it would be desirable (but not required) to provide an implementation of the pthreads thread-specific data interface based on the language feature. For example, assume that a compiler for a C-like language supports a private storage class that provides thread-specific storage. Something similar to the following macros might be used to effect a compatible implementation:
```
#define pthread_key_t                   private void *
#define pthread_key_create(key)         /* no-op */
#define pthread_setspecific(key,value)  (key)=(value)
#define pthread_getspecific(key)        (key)
```
Note:

For the sake of clarity, this example ignores destructor functions. A correct implementation would have to support them.
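
For comparison, the following sketch shows the destructor behavior that the macros above omit, using the standard interface. The names cleanup_key, cleanup_destructor(), and cleanup_demo() are hypothetical. It assumes only that a destructor registered with pthread_key_create() is called at thread exit for a thread whose value for the key is non-NULL:

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <pthread.h>

static pthread_key_t cleanup_key;
static int cleanup_calls;   /* incremented by the destructor */

/* Called at thread exit with the thread's non-NULL value for the key. */
static void cleanup_destructor(void *value)
{
    (void)value;
    cleanup_calls++;
}

static void *cleanup_worker(void *arg)
{
    /* Binding any non-NULL value arms the destructor for this thread. */
    pthread_setspecific(cleanup_key, arg);
    return NULL;
}

/* Returns the number of destructor calls observed after the thread exits. */
int cleanup_demo(void)
{
    pthread_t t;
    int token = 0;
    int rc;

    cleanup_calls = 0;
    rc = pthread_key_create(&cleanup_key, cleanup_destructor);
    assert(rc == 0);
    rc = pthread_create(&t, NULL, cleanup_worker, &token);
    assert(rc == 0);
    rc = pthread_join(t, NULL);   /* destructors have run by now */
    assert(rc == 0);
    (void)rc;
    pthread_key_delete(cleanup_key);
    return cleanup_calls;
}
```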

Barriers

Background

Barriers are typically used in parallel DO/FOR loops to ensure that all threads have reached a particular stage in a parallel computation before allowing any to proceed to the next stage. Highly efficient implementation is possible on machines which support a "Fetch and Add" operation as described in the referenced Almasi and Gottlieb (1989).

The use of return value PTHREAD_BARRIER_SERIAL_THREAD is shown in the following example:
```
if ((status = pthread_barrier_wait(&barrier)) ==
    PTHREAD_BARRIER_SERIAL_THREAD) {
    /* ...serial section */
} else if (status != 0) {
    /* ...error processing */
}
status = pthread_barrier_wait(&barrier);
/* ... */
```
This behavior allows a serial section of code to be executed by one thread as soon as all threads reach the first barrier. The second barrier prevents the other threads from proceeding until the serial section being executed by the one thread has completed.
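
The fragment above can be expanded into a complete sketch. The worker count NTHREADS, the arrays, and the function names are hypothetical; each thread fills in one partial result, a single thread sums them in the serial section, and every thread reads the total after the second barrier:

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <pthread.h>

#define NTHREADS 4              /* hypothetical worker count */

static pthread_barrier_t barrier;
static int partial[NTHREADS];   /* one slot per thread (parallel stage) */
static int total;               /* written only in the serial section */
static int serial_runs;         /* threads that ran the serial section */
static int seen[NTHREADS];      /* total as read after the second barrier */

static void *stage_worker(void *arg)
{
    int id = *(int *)arg;
    int status, i;

    partial[id] = id + 1;       /* parallel stage */

    status = pthread_barrier_wait(&barrier);
    if (status == PTHREAD_BARRIER_SERIAL_THREAD) {
        /* Serial section: exactly one thread gets this return value. */
        for (i = 0; i < NTHREADS; i++)
            total += partial[i];
        serial_runs++;
    } else if (status != 0) {
        /* Error processing would go here. */
    }

    /* No thread proceeds until the serial section has completed. */
    (void)pthread_barrier_wait(&barrier);
    seen[id] = total;
    return NULL;
}

/* Runs the two-stage computation and returns the total. */
int barrier_demo(void)
{
    pthread_t tid[NTHREADS];
    int id[NTHREADS];
    int i, rc;

    total = 0;
    serial_runs = 0;
    rc = pthread_barrier_init(&barrier, NULL, NTHREADS);
    assert(rc == 0);
    for (i = 0; i < NTHREADS; i++) {
        id[i] = i;
        rc = pthread_create(&tid[i], NULL, stage_worker, &id[i]);
        assert(rc == 0);
    }
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barrier);
    (void)rc;
    return total;
}
```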

Although barriers can be implemented with mutexes and condition variables, the referenced Almasi and Gottlieb (1989) provides ample illustration that such implementations are significantly less efficient than is possible. While the relative efficiency of barriers may well vary by implementation, it is important that they be recognized in IEEE Std 1003.1-2001 to facilitate application portability while providing the necessary freedom to implementors.
Lack of Timeout Feature

Alternate versions of most blocking routines have been provided to support watchdog timeouts. No alternate interface of this sort has been provided for barrier waits for the following reasons:
- Multiple threads may use different timeout values, some of which may be indefinite. It is not clear which threads should break through the barrier with a timeout error if and when these timeouts expire.
- The barrier may become unusable once a thread breaks out of a pthread_barrier_wait() with a timeout error. There is, in general, no way to guarantee the consistency of a barrier's internal data structures once a thread has timed out of a pthread_barrier_wait(). Even the inclusion of a special barrier reinitialization function would not help much since it is not clear how this function would affect the behavior of threads that reach the barrier between the original timeout and the call to the reinitialization function.

Spin Locks

Background

Spin locks represent an extremely low-level synchronization mechanism suitable primarily for use on shared memory multi-processors. A spin lock is typically an atomically modified Boolean value that is set to one when the lock is held and to zero when the lock is freed.

When a caller requests a spin lock that is already held, it typically spins in a loop testing whether the lock has become available. Such spinning wastes processor cycles so the lock should only be held for short durations and not across sleep/block operations. Callers should unlock spin locks before calling sleep operations.

Spin locks are available on a variety of systems. The functions included in IEEE Std 1003.1-2001 are an attempt to standardize that existing practice.
Lack of Timeout Feature

Alternate versions of most blocking routines have been provided to support watchdog timeouts. No alternate interface of this sort has been provided for spin locks for the following reasons:
- It is impossible to determine appropriate timeout intervals for spin locks in a portable manner. The amount of time one can expect to spend spin-waiting is inversely proportional to the degree of parallelism provided by the system.
  
  It can vary from a few cycles when each competing thread is running on its own processor, to an indefinite amount of time when all threads are multiplexed on a single processor (which is why spin locking is not advisable on uniprocessors).
- When used properly, the amount of time the calling thread spends waiting on a spin lock should be considerably less than the time required to set up a corresponding watchdog timer. Since the primary purpose of spin locks is to provide a low-overhead synchronization mechanism for multi-processors, the overhead of a timeout mechanism was deemed unacceptable.
It was also suggested that an additional count argument be provided (on the pthread_spin_lock() call) in lieu of a true timeout so that a spin lock call could fail gracefully if it was unable to apply the lock after count attempts. This idea was rejected because it is not existing practice. Furthermore, the same effect can be obtained with pthread_spin_trylock(), as illustrated below:
```
int n = MAX_SPIN;

while ( --n >= 0 )
{
    if ( !pthread_spin_trylock(...) )
        break;
}
if ( n >= 0 )
{
    /* Successfully acquired the lock */
}
else
{
    /* Unable to acquire the lock */
}
```
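
The same idea can be packaged as a helper function. This is a sketch only: spin_lock_bounded() and its max_spin parameter are hypothetical names, not part of IEEE Std 1003.1-2001. Unlike the loop above, it also stops early if pthread_spin_trylock() reports an error other than [EBUSY]:

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Try to acquire the lock at most max_spin times.
 * Returns 0 if the lock was acquired, EBUSY if it was still held after
 * max_spin attempts, or another error number from pthread_spin_trylock().
 */
int spin_lock_bounded(pthread_spinlock_t *lock, int max_spin)
{
    int rc = EBUSY;

    assert(lock != NULL);
    while (max_spin-- > 0) {
        rc = pthread_spin_trylock(lock);
        if (rc != EBUSY)
            break;      /* acquired (0) or a genuine error */
    }
    return rc;
}
```

A caller would treat a return value of zero as success and release the lock with pthread_spin_unlock() when finished.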
process-shared Attribute

The initialization functions associated with most POSIX synchronization objects (for example, mutexes, barriers, and read-write locks) take an attributes object with a process-shared attribute that specifies whether or not the object is to be shared across processes. In the draft corresponding to the first balloting round, however, two separate initialization functions were provided for spin locks: one for spin locks that were to be shared across processes (spin_init()), and one for locks that were only used by multiple threads within a single process (pthread_spin_init()). This was done so as to keep the overhead associated with spin waiting to an absolute minimum. However, the balloting group requested that, since the overhead associated with a bit check was small, spin locks should be consistent with the rest of the synchronization primitives, and thus the process-shared attribute was introduced for spin locks.
Spin Locks versus Mutexes

It has been suggested that mutexes are an adequate synchronization mechanism and spin locks are not necessary. Locking mechanisms typically must trade off the processor resources consumed while setting up to block the thread and the processor resources consumed by the thread while it is blocked. Spin locks require very few resources to set up the blocking of a thread. Existing practice is to simply loop, repeating the atomic locking operation until the lock is available. While the resources consumed to set up blocking of the thread are low, the thread continues to consume processor resources while it is waiting.

On the other hand, mutexes may be implemented such that the processor resources consumed to block the thread are large relative to a spin lock. After detecting that the mutex lock is not available, the thread must alter its scheduling state, add itself to a set of waiting threads, and, when the lock becomes available again, undo all of this before taking over ownership of the mutex. However, while a thread is blocked by a mutex, no processor resources are consumed.

Therefore, spin locks and mutexes may be implemented to have different characteristics. Spin locks may have lower overall overhead for very short-term blocking, and mutexes may have lower overall overhead when a thread will be blocked for longer periods of time. The presence of both interfaces allows implementations with these two different characteristics, both of which may be useful to a particular application.

It has also been suggested that applications can build their own spin locks from the pthread_mutex_trylock() function:
```
while (pthread_mutex_trylock(&mutex));
```
The apparent simplicity of this construct is somewhat deceiving, however. While the actual wait is quite efficient, various guarantees on the integrity of mutex objects (for example, priority inheritance rules) may add overhead to the successful path of the trylock operation that is not required of spin locks. One could, of course, add an attribute to the mutex to bypass such overhead, but the very act of finding and testing this attribute represents more overhead than is found in the typical spin lock.

The need to hold spin lock overhead to an absolute minimum also makes it impossible to provide guarantees against starvation similar to those provided for mutexes or read-write locks. The overhead required to implement such guarantees (for example, disabling preemption before spinning) may well exceed the overhead of the spin wait itself by many orders of magnitude. If a "safe" spin wait seems desirable, it can always be provided (albeit at some performance cost) via appropriate mutex attributes.

XSI Supported Functions

On XSI-conformant systems, the following symbolic constants are always defined:

_POSIX_READER_WRITER_LOCKS
_POSIX_THREAD_ATTR_STACKADDR
_POSIX_THREAD_ATTR_STACKSIZE
_POSIX_THREAD_PROCESS_SHARED
_POSIX_THREADS

Therefore, the following threads functions are always supported:

pthread_atfork()
pthread_attr_destroy()
pthread_attr_getdetachstate()
pthread_attr_getguardsize()
pthread_attr_getschedparam()
pthread_attr_getstack()
pthread_attr_getstackaddr()
pthread_attr_getstacksize()
pthread_attr_init()
pthread_attr_setdetachstate()
pthread_attr_setguardsize()
pthread_attr_setschedparam()
pthread_attr_setstack()
pthread_attr_setstackaddr()
pthread_attr_setstacksize()
pthread_cancel()
pthread_cleanup_pop()
pthread_cleanup_push()
pthread_cond_broadcast()
pthread_cond_destroy()
pthread_cond_init()
pthread_cond_signal()
pthread_cond_timedwait()
pthread_cond_wait()
pthread_condattr_destroy()
pthread_condattr_getpshared()
pthread_condattr_init()
pthread_condattr_setpshared()
pthread_create()
pthread_detach()
pthread_equal()
pthread_exit()
pthread_getconcurrency()
pthread_getspecific()
pthread_join()
pthread_key_create()
pthread_key_delete()
pthread_kill()
pthread_mutex_destroy()
pthread_mutex_init()
pthread_mutex_lock()
pthread_mutex_trylock()
pthread_mutex_unlock()
pthread_mutexattr_destroy()
pthread_mutexattr_getpshared()
pthread_mutexattr_gettype()
pthread_mutexattr_init()
pthread_mutexattr_setpshared()
pthread_mutexattr_settype()
pthread_once()
pthread_rwlock_destroy()
pthread_rwlock_init()
pthread_rwlock_rdlock()
pthread_rwlock_tryrdlock()
pthread_rwlock_trywrlock()
pthread_rwlock_unlock()
pthread_rwlock_wrlock()
pthread_rwlockattr_destroy()
pthread_rwlockattr_getpshared()
pthread_rwlockattr_init()
pthread_rwlockattr_setpshared()
pthread_self()
pthread_setcancelstate()
pthread_setcanceltype()
pthread_setconcurrency()
pthread_setspecific()
pthread_sigmask()
pthread_testcancel()
sigwait()

On XSI-conformant systems, the symbolic constant _POSIX_THREAD_SAFE_FUNCTIONS is always defined. Therefore, the following functions are always supported:

asctime_r()
ctime_r()
flockfile()
ftrylockfile()
funlockfile()
getc_unlocked()
getchar_unlocked()
getgrgid_r()
getgrnam_r()
getpwnam_r()
getpwuid_r()
gmtime_r()
localtime_r()
putc_unlocked()
putchar_unlocked()
rand_r()
readdir_r()
strerror_r()
strtok_r()
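
As an illustration of why the _r variants matter, the following sketch uses strtok_r() to run two tokenizations at once, which strtok() cannot do because it keeps a single hidden state. The function name count_fields() and the record format are hypothetical:

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count comma-separated fields across semicolon-separated records.
 * The buffer is modified in place. Each tokenization keeps its state in
 * a caller-supplied pointer, so the inner loop does not disturb the outer.
 */
int count_fields(char *buf)
{
    char *rec_state, *fld_state;
    char *rec, *fld;
    int n = 0;

    assert(buf != NULL);
    for (rec = strtok_r(buf, ";", &rec_state); rec != NULL;
         rec = strtok_r(NULL, ";", &rec_state)) {
        for (fld = strtok_r(rec, ",", &fld_state); fld != NULL;
             fld = strtok_r(NULL, ",", &fld_state)) {
            n++;
        }
    }
    return n;
}
```

The same property makes the function safe to call from several threads at once, provided each call is given its own buffer.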

The following threads functions are only supported on XSI-conformant systems if the Realtime Threads Option Group is supported:

pthread_attr_getinheritsched()
pthread_attr_getschedpolicy()
pthread_attr_getscope()
pthread_attr_setinheritsched()
pthread_attr_setschedpolicy()
pthread_attr_setscope()
pthread_getschedparam()
pthread_mutex_getprioceiling()
pthread_mutex_setprioceiling()
pthread_mutexattr_getprioceiling()
pthread_mutexattr_getprotocol()
pthread_mutexattr_setprioceiling()
pthread_mutexattr_setprotocol()
pthread_setschedparam()

XSI Threads Extensions

The following XSI extensions to POSIX.1c are now supported in IEEE Std 1003.1-2001 as part of the alignment with the Single UNIX Specification:

Extended mutex attribute types
Read-write locks and attributes (also introduced by the IEEE Std 1003.1j-2000 amendment)
Thread concurrency level
Thread stack guard size
Parallel I/O

A total of 19 new functions were added.

These extensions carefully follow the threads programming model specified in POSIX.1c. As with POSIX.1c, all the new functions return zero if successful; otherwise, an error number is returned to indicate the error.

The concept of attribute objects was introduced in POSIX.1c to allow implementations to extend IEEE Std 1003.1-2001 without changing the existing interfaces. Attribute objects were defined for threads, mutexes, and condition variables. Attributes objects are defined as implementation-defined opaque types to aid extensibility, and functions are defined to allow attributes to be set or retrieved. This model has been followed when adding the new type attribute of pthread_mutexattr_t or the new read-write lock attributes object pthread_rwlockattr_t.

Extended Mutex Attributes

POSIX.1c defines a mutex attributes object as an implementation-defined opaque object of type pthread_mutexattr_t, and specifies a number of attributes which this object must have and a number of functions which manipulate these attributes. These attributes include process-shared, protocol, and prioceiling.

The System Interfaces volume of IEEE Std 1003.1-2001 specifies another mutex attribute called type. The type attribute allows applications to specify the behavior of mutex locking operations in situations where POSIX.1c behavior is undefined. The OSF DCE threads implementation, based on Draft 4 of POSIX.1c, specified a similar attribute. Note that the names of the attributes have changed somewhat from the OSF DCE threads implementation.

The System Interfaces volume of IEEE Std 1003.1-2001 also extends the specification of the following POSIX.1c functions which manipulate mutexes:
```
pthread_mutex_lock()
pthread_mutex_trylock()
pthread_mutex_unlock()
```
to take account of the new mutex attribute type and to specify behavior which was declared as undefined in POSIX.1c. How a calling thread acquires or releases a mutex now depends upon the mutex type attribute.

The type attribute can have the following values:

PTHREAD_MUTEX_NORMAL

Basic mutex with no specific error checking built in. Does not report a deadlock error.

PTHREAD_MUTEX_RECURSIVE

Allows any thread to recursively lock a mutex. The mutex must be unlocked an equal number of times to release the mutex.

PTHREAD_MUTEX_ERRORCHECK

Detects and reports simple usage errors; that is, an attempt to unlock a mutex that is not locked by the calling thread or that is not locked at all, or an attempt to relock a mutex the thread already owns.

PTHREAD_MUTEX_DEFAULT

The default mutex type. May be mapped to any of the above mutex types or may be an implementation-defined type.

Normal mutexes do not detect deadlock conditions; for example, a thread will hang if it tries to relock a normal mutex that it already owns. Attempting to unlock a mutex locked by another thread, or unlocking an unlocked mutex, results in undefined behavior. Normal mutexes will usually be the fastest type of mutex available on a platform but provide the least error checking.

Recursive mutexes are useful for converting old code where it is difficult to establish clear boundaries of synchronization. A thread can relock a recursive mutex without first unlocking it. The relocking deadlock which can occur with normal mutexes cannot occur with this type of mutex. However, multiple locks of a recursive mutex require the same number of unlocks to release the mutex before another thread can acquire the mutex. Furthermore, this type of mutex maintains the concept of an owner. Thus, a thread attempting to unlock a recursive mutex which another thread has locked returns with an error. A thread attempting to unlock a recursive mutex that is not locked returns with an error. A recursive mutex should never be used with condition variables while it is locked more than once, because the implicit unlock performed by pthread_cond_wait() or pthread_cond_timedwait() will not actually release the mutex in that case.

Errorcheck mutexes provide error checking and are useful primarily as a debugging aid. A thread attempting to relock an errorcheck mutex without first unlocking it returns with an error. Again, this type of mutex maintains the concept of an owner. Thus, a thread attempting to unlock an errorcheck mutex which another thread has locked returns with an error. A thread attempting to unlock an errorcheck mutex that is not locked also returns with an error. It should be noted that errorcheck mutexes will almost always be much slower than normal mutexes due to the extra state checks performed.

The default mutex type provides implementation-defined error checking. The default mutex may be mapped to one of the other defined types or may be something entirely different. This enables each vendor to provide the mutex semantics which the vendor feels will be most useful to their target users. Most vendors will probably choose to make normal mutexes the default so as to give applications the benefit of the fastest type of mutexes available on their platform. Check your implementation's documentation.

An application developer can use any of the mutex types almost interchangeably as long as the application does not depend upon the implementation detecting (or failing to detect) any particular errors. Note that a recursive mutex can be used with condition variable waits as long as the application never recursively locks the mutex.

Two functions are provided for manipulating the type attribute of a mutex attributes object. This attribute is set or returned in the type parameter of these functions. The pthread_mutexattr_settype() function is used to set a specific type value while pthread_mutexattr_gettype() is used to return the type of the mutex. Setting the type attribute of a mutex attributes object affects only mutexes initialized using that mutex attributes object. Changing the type attribute does not affect mutexes previously initialized using that mutex attributes object.
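
The following sketch exercises the type attribute with an errorcheck mutex. The function name errorcheck_demo() is hypothetical, and the specific error numbers noted in the comments ([EDEADLK] and [EPERM]) are those commonly returned; an application should rely only on the result being non-zero unless its implementation documents otherwise. The sketch also changes the attributes object after the mutex has been initialized, to show that the existing mutex is unaffected:

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Initialize an errorcheck mutex, then provoke the two simple usage
 * errors it detects. Returns the type read back from the attributes
 * object before the mutex was initialized.
 */
int errorcheck_demo(int *relock_rc, int *unlock_rc)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int type = -1;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutexattr_gettype(&attr, &type);
    assert(rc == 0);
    rc = pthread_mutex_init(&m, &attr);
    assert(rc == 0);

    /* Changing the attributes object now does not affect m. */
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&m);
    assert(rc == 0);
    *relock_rc = pthread_mutex_lock(&m);    /* relock: typically EDEADLK */
    rc = pthread_mutex_unlock(&m);
    assert(rc == 0);
    *unlock_rc = pthread_mutex_unlock(&m);  /* not locked: typically EPERM */
    (void)rc;

    pthread_mutex_destroy(&m);
    return type;
}
```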
Read-Write Locks and Attributes

The read-write locks introduced have been harmonized with those in IEEE Std 1003.1j-2000; see also Thread Read-Write Locks.

Read-write locks (also known as reader-writer locks) allow a thread to exclusively lock some shared data while updating that data, or allow any number of threads to have simultaneous read-only access to the data.

Unlike a mutex, a read-write lock distinguishes between reading data and writing data. A mutex excludes all other threads. A read-write lock allows other threads access to the data, providing no thread is modifying the data. Thus, a read-write lock is less primitive than either a mutex-condition variable pair or a semaphore.

Application developers should consider using a read-write lock rather than a mutex to protect data that is frequently referenced but seldom modified. Most threads (readers) will be able to read the data without waiting and will only have to block when some other thread (a writer) is in the process of modifying the data. Conversely, a thread that wants to change the data is forced to wait until there are no readers. This type of lock is often used to facilitate parallel access to data on multi-processor platforms or to avoid context switches on single processor platforms where multiple threads access the same data.

If a read-write lock becomes unlocked and there are multiple threads waiting to acquire the write lock, the implementation's scheduling policy determines which thread acquires the read-write lock for writing. If there are multiple threads blocked on a read-write lock for both read locks and write locks, it is unspecified whether the readers or a writer acquire the lock first. However, for performance reasons, implementations often favor writers over readers to avoid potential writer starvation.

A read-write lock object is an implementation-defined opaque object of type pthread_rwlock_t as defined in <pthread.h>. There are two different sorts of locks associated with a read-write lock: a read lock and a write lock.

The pthread_rwlockattr_init() function initializes a read-write lock attributes object with the default value for all the attributes defined in the implementation. After a read-write lock attributes object has been used to initialize one or more read-write locks, changes to the read-write lock attributes object, including destruction, do not affect previously initialized read-write locks.

Implementations must provide at least the read-write lock attribute process-shared. This attribute can have the following values:

PTHREAD_PROCESS_SHARED

Any thread of any process that has access to the memory where the read-write lock resides can manipulate the read-write lock.

PTHREAD_PROCESS_PRIVATE

Only threads created within the same process as the thread that initialized the read-write lock can manipulate the read-write lock. This is the default value.

The pthread_rwlockattr_setpshared() function is used to set the process-shared attribute of an initialized read-write lock attributes object while the function pthread_rwlockattr_getpshared() obtains the current value of the process-shared attribute.

A read-write lock attributes object is destroyed using the pthread_rwlockattr_destroy() function. The effect of subsequent use of the read-write lock attributes object is undefined.
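
The life cycle of a read-write lock attributes object described above can be sketched as follows. The function name pshared_demo() is hypothetical; it reads the default process-shared value, changes it, reads it back, and destroys the object:

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/*
 * Stores the default process-shared value in *initial and the value
 * after pthread_rwlockattr_setpshared() in *updated.
 * Returns 0 on success or an error number.
 */
int pshared_demo(int *initial, int *updated)
{
    pthread_rwlockattr_t attr;
    int rc;

    assert(initial != NULL && updated != NULL);
    rc = pthread_rwlockattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_rwlockattr_getpshared(&attr, initial);
    if (rc == 0)
        rc = pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_rwlockattr_getpshared(&attr, updated);
    pthread_rwlockattr_destroy(&attr);  /* attr must not be used after this */
    return rc;
}
```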

A thread creates a read-write lock using the pthread_rwlock_init() function. The attributes of the read-write lock can be specified by the application developer; otherwise, the default implementation-defined read-write lock attributes are used if the pointer to the read-write lock attributes object is NULL. In cases where the default attributes are appropriate, the PTHREAD_RWLOCK_INITIALIZER macro can be used to initialize statically allocated read-write locks.

A thread which wants to apply a read lock to the read-write lock can use either pthread_rwlock_rdlock() or pthread_rwlock_tryrdlock(). If pthread_rwlock_rdlock() is used, the thread acquires a read lock if a writer does not hold the write lock and there are no writers blocked on the write lock. If a read lock is not acquired, the calling thread blocks until it can acquire a lock. However, if pthread_rwlock_tryrdlock() is used, the function returns immediately with the error [EBUSY] if any thread holds a write lock or there are blocked writers waiting for the write lock.

A thread which wants to apply a write lock to the read-write lock can use either of two functions: pthread_rwlock_wrlock() or pthread_rwlock_trywrlock(). If pthread_rwlock_wrlock() is used, the thread acquires the write lock if no other reader or writer threads hold the read-write lock. If the write lock is not acquired, the thread blocks until it can acquire the write lock. However, if pthread_rwlock_trywrlock() is used, the function returns immediately with the error [EBUSY] if any thread is holding either a read or a write lock.

The pthread_rwlock_unlock() function is used to unlock a read-write lock object held by the calling thread. Results are undefined if the read-write lock is not held by the calling thread. If there are other read locks currently held on the read-write lock object, the read-write lock object remains in the read locked state but without the current thread as one of its owners. If this function releases the last read lock for this read-write lock object, the read-write lock object is put in the unlocked state. If this function is called to release a write lock for this read-write lock object, the read-write lock object is put in the unlocked state.
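
The non-blocking lock functions described above can be observed with a small sketch. The names struct probe, probe_thread(), and rwlock_demo() are hypothetical. A helper thread attempts pthread_rwlock_tryrdlock() and pthread_rwlock_trywrlock() while the main thread holds first a read lock and then the write lock:

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Results of one probe: return values of the two try functions. */
struct probe {
    pthread_rwlock_t *lock;
    int rd_rc;
    int wr_rc;
};

/* Attempt each kind of lock without blocking; release it if acquired. */
static void *probe_thread(void *arg)
{
    struct probe *p = arg;

    p->rd_rc = pthread_rwlock_tryrdlock(p->lock);
    if (p->rd_rc == 0)
        pthread_rwlock_unlock(p->lock);
    p->wr_rc = pthread_rwlock_trywrlock(p->lock);
    if (p->wr_rc == 0)
        pthread_rwlock_unlock(p->lock);
    return NULL;
}

static void run_probe(struct probe *p)
{
    pthread_t t;
    int rc = pthread_create(&t, NULL, probe_thread, p);

    assert(rc == 0);
    rc = pthread_join(t, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Probe the lock while it is read locked, then while it is write locked. */
int rwlock_demo(struct probe *while_read, struct probe *while_write)
{
    pthread_rwlock_t lock;
    int rc = pthread_rwlock_init(&lock, NULL);  /* default attributes */

    if (rc != 0)
        return rc;
    while_read->lock = &lock;
    while_write->lock = &lock;

    pthread_rwlock_rdlock(&lock);
    run_probe(while_read);      /* another reader may enter; a writer may not */
    pthread_rwlock_unlock(&lock);

    pthread_rwlock_wrlock(&lock);
    run_probe(while_write);     /* neither a reader nor a writer may enter */
    pthread_rwlock_unlock(&lock);

    return pthread_rwlock_destroy(&lock);
}
```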
Thread Concurrency Level

On threads implementations that multiplex user threads onto a smaller set of kernel execution entities, the system attempts to create a reasonable number of kernel execution entities for the application upon application startup.

On some implementations, these kernel entities are retained by user threads that block in the kernel. Other implementations do not timeslice user threads so that multiple compute-bound user threads can share a kernel thread. On such implementations, some applications may use up all the available kernel execution entities before their user-space threads are used up. The process may be left with user threads capable of doing work for the application but with no way to schedule them.

The pthread_setconcurrency() function enables an application to request more kernel entities; that is, specify a desired concurrency level. However, this function merely provides a hint to the implementation. The implementation is free to ignore this request or to provide some other number of kernel entities. If an implementation does not multiplex user threads onto a smaller number of kernel execution entities, the pthread_setconcurrency() function has no effect.

The pthread_setconcurrency() function may also have an effect on implementations where the kernel mode and user mode schedulers cooperate to ensure that ready user threads are not prevented from running by other threads blocked in the kernel.

The pthread_getconcurrency() function always returns the value set by a previous call to pthread_setconcurrency(). However, if pthread_setconcurrency() was not previously called, this function returns zero to indicate that the threads implementation is maintaining the concurrency level.
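
A minimal sketch of the hint interface follows. The wrapper name request_concurrency() is hypothetical. Because the request is only a hint, the wrapper reports failure rather than assuming that the implementation accepted the level:

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <pthread.h>

/*
 * Ask for the given concurrency level. Returns the level reported by
 * pthread_getconcurrency() afterwards, or -1 if the implementation
 * refused the request.
 */
int request_concurrency(int level)
{
    assert(level >= 0);
    if (pthread_setconcurrency(level) != 0)
        return -1;
    return pthread_getconcurrency();
}
```

On an implementation that does not multiplex user threads, the returned value records only what was requested; it does not indicate that any additional kernel entities were created.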
Thread Stack Guard Size

DCE threads introduced the concept of a "thread stack guard size". Most thread implementations add a region of protected memory to a thread's stack, commonly known as a "guard region", as a safety measure to prevent stack pointer overflow in one thread from corrupting the contents of another thread's stack. The default value of the guardsize attribute is {PAGESIZE} bytes; the actual value of {PAGESIZE} is implementation-defined.

Some application developers may wish to change the stack guard size. When an application creates a large number of threads, the extra page allocated for each stack may strain system resources. In addition to the extra page of memory, the kernel's memory manager has to keep track of the different protections on adjoining pages. When this is a problem, the application developer may request a guard size of 0 bytes to conserve system resources by eliminating stack overflow protection.

Conversely, an application that allocates large data structures such as arrays on the stack may wish to increase the default guard size in order to detect stack overflow. If a thread allocates two pages for a data array, a single guard page provides little protection against thread stack overflows since the thread can corrupt adjoining memory beyond the guard page.

The System Interfaces volume of IEEE Std 1003.1-2001 defines a new attribute of a thread attributes object; that is, the guardsize attribute which allows applications to specify the size of the guard region of a thread's stack.

Two functions are provided for manipulating a thread's stack guard size. The pthread_attr_setguardsize() function sets the thread guardsize attribute, and the pthread_attr_getguardsize() function retrieves the current value.

An implementation may round up the requested guard size to a multiple of the configurable system variable {PAGESIZE}. In this case, pthread_attr_getguardsize() returns the guard size specified by the previous pthread_attr_setguardsize() function call and not the rounded up value.

If an application is managing its own thread stacks using the stackaddr attribute, the guardsize attribute is ignored and no stack overflow protection is provided. In this case, it is the responsibility of the application to manage stack overflow along with stack allocation.
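
A minimal sketch of the two functions follows. The helper name guard_size_roundtrip() is hypothetical. It stores a guard size in a fresh attributes object and reads it back, so the caller can see that the value returned is the one requested rather than a rounded one.

```c
#define _XOPEN_SOURCE 600
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper: set the guardsize attribute to `requested`
   bytes and return what pthread_attr_getguardsize() reports, or
   (size_t)-1 on any failure. */
size_t guard_size_roundtrip(size_t requested)
{
    pthread_attr_t attr;
    size_t stored = (size_t)-1;

    if (pthread_attr_init(&attr) != 0)
        return (size_t)-1;

    /* A request of 0 asks for no guard region at all. */
    if (pthread_attr_setguardsize(&attr, requested) == 0 &&
        pthread_attr_getguardsize(&attr, &stored) != 0)
        stored = (size_t)-1;

    pthread_attr_destroy(&attr);
    return stored;
}
```

The attributes object would normally be passed to pthread_create() before being destroyed; any rounding to a multiple of {PAGESIZE} happens inside the implementation and is not visible through the attribute.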
Parallel I/O

Suppose two or more threads independently issue read requests on the same file. To read specific data from a file, a thread must first call lseek() to seek to the proper offset in the file, and then call read() to retrieve the required data. If more than one thread does this at the same time, the first thread may complete its seek call, but before it gets a chance to issue its read call a second thread may complete its seek call, resulting in the first thread accessing incorrect data when it issues its read call. One workaround is to lock the file descriptor while seeking and reading or writing, but this reduces parallelism and adds overhead.

Instead, the System Interfaces volume of IEEE Std 1003.1-2001 provides two functions to make seek/read and seek/write operations atomic. The file descriptor's current offset is unchanged, thus allowing multiple read and write operations to proceed in parallel. This improves the I/O performance of threaded applications. The pread() function is used to do an atomic read of data from a file into a buffer. Conversely, the pwrite() function does an atomic write of data from a buffer to a file.
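
The following sketch shows the property that matters for threads: pread() takes the offset as an argument and leaves the file descriptor's own offset alone. The function name pread_demo(), the scratch file, and the sample data are all illustrative assumptions.

```c
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 if pread() fetched the expected bytes
   and left the file offset where it was, 1 if not, and -1 if the
   scratch file could not be prepared. */
int pread_demo(void)
{
    static const char data[] = "0123456789";
    const size_t len = sizeof data - 1;
    char buf[4];
    off_t before, after;
    ssize_t n;
    int fd, ok;
    FILE *fp = tmpfile();               /* scratch file, removed on close */

    if (fp == NULL)
        return -1;
    fd = fileno(fp);
    if (write(fd, data, len) != (ssize_t)len) {
        fclose(fp);
        return -1;
    }

    before = lseek(fd, 0, SEEK_CUR);    /* offset left by the write */
    n = pread(fd, buf, sizeof buf, 3);  /* 4 bytes from offset 3, no lseek() */
    after = lseek(fd, 0, SEEK_CUR);     /* should equal `before` */

    ok = n == (ssize_t)sizeof buf &&
         memcmp(buf, "3456", sizeof buf) == 0 &&
         before == after;
    fclose(fp);
    return ok ? 0 : 1;
}
```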

Thread-Safety

All functions required by IEEE Std 1003.1-2001 need to be thread-safe. Implementations have to provide internal synchronization when necessary in order to achieve this goal. In certain cases (for example, most floating-point implementations), context switch code may have to manage the writable shared state.

While a read from a pipe of {PIPE_BUF}*2 bytes may not generate a single atomic and thread-safe stream of bytes, it should generate "several" (individually atomic) thread-safe streams of bytes. Similarly, while reading from a terminal device may not generate a single atomic and thread-safe stream of bytes, it should generate some finite number of (individually atomic) and thread-safe streams of bytes. That is, concurrent calls to read for a pipe, FIFO, or terminal device are not allowed to result in corrupting the stream of bytes or other internal data. However, read(), in these cases, is not required to return a single contiguous and atomic stream of bytes.

It is not required that all functions provided by IEEE Std 1003.1-2001 be either async-cancel-safe or async-signal-safe.

As it turns out, some functions are inherently not thread-safe; that is, their interface specifications preclude reentrancy. For example, some functions (such as asctime()) return a pointer to a result stored in memory space allocated by the function on a per-process basis. Such a function is not thread-safe, because its result can be overwritten by successive invocations. Other functions, while not inherently non-thread-safe, may be implemented in ways that lead to them not being thread-safe. For example, some functions (such as rand()) store state information (such as a seed value, which survives multiple function invocations) in memory space allocated by the function on a per-process basis. The implementation of such a function is not thread-safe if the implementation fails to synchronize invocations of the function and thus fails to protect the state information. The problem is that when the state information is not protected, concurrent invocations can interfere with one another (for example, applications using rand() may see the same seed value).

Thread-Safety and Locking of Existing Functions

Originally, POSIX.1 was not designed to work in a multi-threaded environment, and some implementations of some existing functions will not work properly when executed concurrently. To provide routines that will work correctly in an environment with threads ("thread-safe"), two problems need to be solved:

- Routines that maintain or return pointers to static areas internal to the routine (which may now be shared) need to be modified. The routines ttyname() and localtime() are examples.
- Routines that access data space shared by more than one thread need to be modified. The malloc() function and the stdio family routines are examples.

There are a variety of constraints on these changes. The first is compatibility with the existing versions of these functions: non-thread-safe functions will continue to be in use for some time, as the original interfaces are used by existing code. Another is that the new thread-safe versions of these functions represent as small a change as possible over the familiar interfaces provided by the existing non-thread-safe versions. The new interfaces should be independent of any particular threads implementation. In particular, they should be thread-safe without depending on explicit thread-specific memory. Finally, there should be minimal performance penalty due to the changes made to the functions.

It is intended that the list of functions from POSIX.1 that cannot be made thread-safe and for which corrected versions are provided be complete.

Thread-Safety and Locking Solutions

Many of the POSIX.1 functions were thread-safe and did not change at all. However, some functions (for example, the math functions typically found in libm) are not thread-safe because of writable shared global state. For instance, in IEEE Std 754-1985 floating-point implementations, the computation modes and flags are global and shared.

Some functions are not thread-safe because a particular implementation is not reentrant, typically because of a non-essential use of static storage. These require only a new implementation.

Thread-safe libraries are useful in a wide range of parallel (and asynchronous) programming environments, not just within pthreads. In order to be used outside the context of pthreads, however, such libraries still have to use some synchronization method. These could either be independent of the pthread synchronization operations, or they could be a subset of the pthread interfaces. Either method results in thread-safe library implementations that can be used without the rest of pthreads.

Some functions, such as the stdio family interface and dynamic memory allocation functions such as malloc(), are inter-dependent routines that share resources (for example, buffers) across related calls. These require synchronization to work correctly, but they do not require any change to their external (user-visible) interfaces.

In some cases, such as getc() and putc(), adding synchronization is likely to create an unacceptable performance impact. In this case, slower thread-safe synchronized functions are to be provided, but the original, faster (but unsafe) functions (which may be implemented as macros) are retained under new names. Some additional special-purpose synchronization facilities are necessary for these macros to be usable in multi-threaded programs. This also requires changes in <stdio.h>.
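
In the System Interfaces volume, the retained unsynchronized forms are named getc_unlocked(), getchar_unlocked(), putc_unlocked(), and putchar_unlocked(), and the special-purpose synchronization facilities are flockfile(), ftrylockfile(), and funlockfile(). The sketch below shows one way they might be combined. The helper name count_lines_locked() is hypothetical.

```c
#define _XOPEN_SOURCE 600
#include <stdio.h>

/* Hypothetical helper: count the newline characters remaining in
   `stream`, taking the stream lock once instead of once per
   character. */
long count_lines_locked(FILE *stream)
{
    long lines = 0;
    int c;

    flockfile(stream);                  /* this thread now owns the stream */
    while ((c = getc_unlocked(stream)) != EOF)
        if (c == '\n')
            lines++;
    funlockfile(stream);                /* let other threads use it again */
    return lines;
}
```

Taking the lock once around the loop lets the loop body use the fast form while other threads using the locking stdio functions on the same stream are kept out.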

The other common reason that functions are unsafe is that they return a pointer to static storage, making the functions non-thread-safe. This has to be changed, and there are three natural choices:

- Return a pointer to thread-specific storage
  
  This could incur a severe performance penalty on those architectures with a costly implementation of the thread-specific data interface.
  
  A variation on this technique is to use malloc() to allocate storage for the function output and return a pointer to this storage. This technique may also have an undesirable performance impact, however, and a simplistic implementation requires that the user program explicitly free the storage object when it is no longer needed. This technique is used by some existing POSIX.1 functions. With careful implementation for infrequently used functions, there may be little or no performance or storage penalty, and the maintenance of already-standardized interfaces is a significant benefit.
- Return the actual value computed by the function
  
  This technique can only be used with functions that return pointers to structures; routines that return character strings would have to wrap their output in an enclosing structure in order to return the output on the stack. There is also a negative performance impact inherent in this solution in that the output value has to be copied twice before it can be used by the calling function: once from the called routine's local buffers to the top of the stack, then from the top of the stack to the assignment target. Finally, many older compilers cannot support this technique due to a historical tendency to use internal static buffers to deliver the results of structure-valued functions.
- Have the caller pass the address of a buffer to contain the computed value
  
  The only disadvantage of this approach is that extra arguments have to be provided by the calling program. It represents the most efficient solution to the problem, however, and, unlike the malloc() technique, it is semantically clear.

There are some routines (often groups of related routines) whose interfaces are inherently non-thread-safe because they communicate across multiple function invocations by means of static memory locations. The solution is to redesign the calls so that they are thread-safe, typically by passing the needed data as extra parameters. Unfortunately, this may require major changes to the interface as well.

A floating-point implementation using IEEE Std 754-1985 is a case in point. A less problematic example is the rand48 family of pseudo-random number generators. The functions getgrgid(), getgrnam(), getpwnam(), and getpwuid() are another such case.

The problems with errno are discussed in Alternative Solutions for Per-Thread errno.

Some functions can be thread-safe or not, depending on their arguments. These include the tmpnam() and ctermid() functions. These functions have pointers to character strings as arguments. If the pointers are not NULL, the functions store their results in the character string; however, if the pointers are NULL, the functions store their results in an area that may be static and thus subject to overwriting by successive calls. These should only be called by multi-thread applications when their arguments are non-NULL.
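
For example, a multi-threaded caller might use ctermid() as sketched below. The function name ctermid_uses_caller_buffer() is hypothetical and exists only to show that the result is stored in the buffer supplied by the caller.

```c
#define _XOPEN_SOURCE 600
#include <stdio.h>

/* Hypothetical check: returns 1 if ctermid() stored its result in the
   caller's buffer and returned that buffer's address. */
int ctermid_uses_caller_buffer(void)
{
    char name[L_ctermid];   /* caller-owned storage, private to this thread */

    /* With a non-NULL argument the result goes into `name`, not into
       an area that successive calls could overwrite. */
    return ctermid(name) == name;
}
```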

Asynchronous Safety and Thread-Safety

A floating-point implementation has many modes that affect rounding and other aspects of computation. Functions in some math library implementations may change the computation modes for the duration of a function call. If such a function call is interrupted by a signal or cancellation, the floating-point state is not required to be protected.

There is a significant cost to make floating-point operations async-cancel-safe or async-signal-safe; accordingly, neither form of async safety is required.

Functions Returning Pointers to Static Storage

For those functions that are not thread-safe because they return values in fixed size statically allocated structures, alternate "_r" forms are provided that pass a pointer to an explicit result structure. Those that return pointers into library-allocated buffers have forms provided with explicit buffer and length parameters.
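
As a sketch of both forms, gmtime_r() takes a pointer to an explicit result structure and asctime_r() takes a caller-supplied buffer, which has to hold at least 26 bytes. The wrapper name format_utc() is hypothetical.

```c
#define _XOPEN_SOURCE 600
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: format `t` as UTC text in the caller's buffer.
   Returns `buf` on success or NULL on failure. */
char *format_utc(time_t t, char buf[26])
{
    struct tm tm;                       /* result structure owned by the caller */

    if (gmtime_r(&t, &tm) == NULL)      /* no shared static struct tm */
        return NULL;
    return asctime_r(&tm, buf);         /* no shared static string */
}
```

Because every result lives in storage supplied by the caller, two threads calling format_utc() at the same time cannot overwrite each other's output.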

For functions that return pointers to library-allocated buffers, it makes sense to provide "_r" versions that allow the application control over allocation of the storage in which results are returned. This allows the state used by these functions to be managed on an application-specific basis, supporting per-thread, per-process, or other application-specific sharing relationships.

Early proposals had provided "_r" versions for functions that returned pointers to variable-size buffers without providing a means for determining the required buffer size. This would have made using such functions exceedingly clumsy, potentially requiring iteratively calling them with increasingly larger guesses for the amount of storage required. Hence, sysconf() variables have been provided for such functions that return the maximum required buffer size.

Thus, the rule that has been followed by IEEE Std 1003.1-2001 when adapting single-threaded non-thread-safe functions is as follows: all functions returning pointers to library-allocated storage should have "_r" versions provided, allowing the application control over the storage allocation. Those with variable-sized return values accept both a buffer address and a length parameter. The sysconf() variables are provided to supply the appropriate buffer sizes when required. Implementors are encouraged to apply the same rule when adapting their own existing functions to a pthreads environment.
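
The rule can be illustrated with getpwuid_r(), which accepts a result structure, a buffer, and a length, with the buffer size taken from sysconf(). This is a sketch under stated assumptions: the helper name lookup_user_name() and its return codes are hypothetical, and the 1024-byte fallback used when sysconf() gives no answer is an arbitrary choice.

```c
#define _XOPEN_SOURCE 600
#include <pwd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy the login name for `uid` into `name`.
   Returns 0 if found, 1 if there is no such entry, -1 on error. */
int lookup_user_name(uid_t uid, char *name, size_t namelen)
{
    struct passwd pwd, *result = NULL;
    long hint = sysconf(_SC_GETPW_R_SIZE_MAX);
    /* sysconf() may return -1; the fallback size is an assumption. */
    size_t buflen = hint > 0 ? (size_t)hint : 1024;
    char *buf = malloc(buflen);
    int status;

    if (buf == NULL)
        return -1;

    if (getpwuid_r(uid, &pwd, buf, buflen, &result) != 0)
        status = -1;                    /* lookup failed */
    else if (result == NULL)
        status = 1;                     /* no matching entry */
    else if (strlen(pwd.pw_name) >= namelen)
        status = -1;                    /* caller's buffer too small */
    else {
        strcpy(name, pwd.pw_name);      /* copy out before freeing buf */
        status = 0;
    }
    free(buf);
    return status;
}
```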

Thread IDs

Separate applications should communicate through well-defined interfaces and should not depend on each other's implementation. For example, if a programmer decides to rewrite the sort utility using multiple threads, it should be easy to do this so that the interface to the sort utility does not change. Consider that if the user causes SIGINT to be generated while the sort utility is running, keeping the same interface means that the entire sort utility is killed, not just one of its threads. As another example, consider a realtime application that manages a reactor. Such an application may wish to allow other applications to control the priority at which it watches the control rods. One technique to accomplish this is to write the ID of the thread watching the control rods into a file and allow other programs to change the priority of that thread as they see fit. A simpler technique is to have the reactor process accept IPCs (Interprocess Communication messages) from other processes, telling it at a semantic level what priority the program should assign to watching the control rods. This allows the programmer greater flexibility in the implementation. For example, the programmer can change the implementation from having one thread per rod to having one thread watching all of the rods without changing the interface. Having threads live inside the process means that the implementation of a process is invisible to outside processes (excepting debuggers and system management tools).

Threads do not provide a protection boundary. Every thread model allows threads to share memory with other threads and encourages this sharing to be widespread. This means that one thread can wipe out memory that is needed for the correct functioning of other threads that are sharing its memory. Consequently, providing each thread with its own user and/or group IDs would not provide a protection boundary between threads sharing memory.

Thread Mutexes

There is no additional rationale provided for this section.

Thread Scheduling

Scheduling Implementation Models

The following scheduling implementation models are presented in terms of threads and "kernel entities". This is to simplify exposition of the models, and it does not imply that an implementation actually has an identifiable "kernel entity".

A kernel entity is not defined beyond the fact that it has scheduling attributes that are used to resolve contention with other kernel entities for execution resources. A kernel entity may be thought of as an envelope that holds a thread or a separate kernel thread. It is not a conventional process, although it shares with the process the attribute that it has a single thread of control; it does not necessarily imply an address space, open files, and so on. It is better thought of as a primitive facility upon which conventional processes and threads may be constructed.
- System Thread Scheduling Model
  
  This model consists of one thread per kernel entity. The kernel entity is solely responsible for scheduling thread execution on one or more processors. This model schedules all threads against all other threads in the system using the scheduling attributes of the thread.
- Process Scheduling Model
  
  A generalized process scheduling model consists of two levels of scheduling. A threads library creates a pool of kernel entities, as required, and schedules threads to run on them using the scheduling attributes of the threads. Typically, the size of the pool is a function of the simultaneously runnable threads, not the total number of threads. The kernel then schedules the kernel entities onto processors according to their scheduling attributes, which are managed by the threads library. This model potentially allows a wide range of mappings between threads and kernel entities.
System and Process Scheduling Model Performance

There are a number of important implications on the performance of applications using these scheduling models. The process scheduling model potentially provides lower overhead for making scheduling decisions, since there is no need to access kernel-level information or functions and the set of schedulable entities is smaller (only the threads within the process).

On the other hand, since the kernel is also making scheduling decisions regarding the system resources under its control (for example, CPU(s), I/O devices, memory), decisions that do not take thread scheduling parameters into account can result in unspecified delays for realtime application threads, causing them to miss maximum response time limits.
Rate Monotonic Scheduling

Rate monotonic scheduling was considered, but rejected for standardization in the context of pthreads. A sporadic server policy is included.
Scheduling Options

In IEEE Std 1003.1-2001, the basic thread scheduling functions are defined under the Threads option, so that they are required of all threads implementations. However, there are no specific scheduling policies required by this option to allow for conforming thread implementations that are not targeted to realtime applications.

Specific standard scheduling policies are defined to be under the Thread Execution Scheduling option, and they are specifically designed to support realtime applications by providing predictable resource-sharing sequences. The name of this option was chosen to emphasize that this functionality is defined as appropriate for realtime applications that require simple priority-based scheduling.

It is recognized that these policies are not necessarily satisfactory for some multi-processor implementations, and work is ongoing to address a wider range of scheduling behaviors. The interfaces have been chosen to create abundant opportunity for future scheduling policies to be implemented and standardized based on this interface. In order to standardize a new scheduling policy, all that is required (from the standpoint of thread scheduling attributes) is to define a new policy name, new members of the thread attributes object, and functions to set these members when the scheduling policy is equal to the new value.

Scheduling Contention Scope

In order to accommodate the requirement for realtime response, each thread has a scheduling contention scope attribute. Threads with a system scheduling contention scope have to be scheduled with respect to all other threads in the system. These threads are usually bound to a single kernel entity that reflects their scheduling attributes and are directly scheduled by the kernel.

Threads with a process scheduling contention scope need be scheduled only with respect to the other threads in the process. These threads may be scheduled within the process onto a pool of kernel entities. The implementation is also free to bind these threads directly to kernel entities and let them be scheduled by the kernel. Process scheduling contention scope allows the implementation the most flexibility and is the default if both contention scopes are supported and none is specified.
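
Since an implementation need support only one of the two scopes, a portable application cannot assume that a given scope is accepted. The sketch below tries process scope first and falls back to system scope. The helper name choose_contention_scope() and the order of preference are illustrative assumptions.

```c
#define _XOPEN_SOURCE 600
#include <pthread.h>

/* Hypothetical helper: prefer process scope, fall back to system
   scope. Returns the scope now stored in `attr`, or -1 if neither
   could be set. */
int choose_contention_scope(pthread_attr_t *attr)
{
    int scope;

    if (pthread_attr_setscope(attr, PTHREAD_SCOPE_PROCESS) != 0 &&
        pthread_attr_setscope(attr, PTHREAD_SCOPE_SYSTEM) != 0)
        return -1;                      /* neither scope was accepted */
    if (pthread_attr_getscope(attr, &scope) != 0)
        return -1;
    return scope;
}
```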

Thus, the choice by implementors to provide one or the other (or both) of these scheduling models is driven by the need of their supported application domains for worst-case (that is, realtime) response, or average-case (non-realtime) response.

Scheduling Allocation Domain

The SCHED_FIFO and SCHED_RR scheduling policies take on different characteristics on a multi-processor. Other scheduling policies are also subject to changed behavior when executed on a multi-processor. The concept of scheduling allocation domain determines the set of processors on which the threads of an application may run. By considering the application's processor scheduling allocation domain for its threads, scheduling policies can be defined in terms of their behavior for varying processor scheduling allocation domain values. It is conceivable that not all scheduling allocation domain sizes make sense for all scheduling policies on all implementations. The concept of scheduling allocation domain, however, is a useful tool for the description of multi-processor scheduling policies.

The "process control" approach to scheduling obtains significant performance advantages from dynamic scheduling allocation domain sizes when it is applicable.

Non-Uniform Memory Access (NUMA) multi-processors may use a system scheduling structure that involves reassignment of threads among scheduling allocation domains. In NUMA machines, a natural model of scheduling is to match scheduling allocation domains to clusters of processors. Load balancing in such an environment requires changing the scheduling allocation domain to which a thread is assigned.

Scheduling Documentation

Implementation-provided scheduling policies need to be completely documented in order to be useful. This documentation includes a description of the attributes required for the policy, the scheduling interaction of threads running under this policy and all other supported policies, and the effects of all possible values for processor scheduling allocation domain. Note that for the implementor wishing to be minimally-compliant, it is (minimally) acceptable to define the behavior as undefined.

Scheduling Contention Scope Attribute

The scheduling contention scope defines how threads compete for resources. Within IEEE Std 1003.1-2001, scheduling contention scope is used to describe only how threads are scheduled in relation to one another in the system. That is, either they are scheduled against all other threads in the system ("system scope") or only against those threads in the process ("process scope"). In fact, scheduling contention scope may apply to additional resources, including virtual timers and profiling, which are not currently considered by IEEE Std 1003.1-2001.

Mixed Scopes

If only one scheduling contention scope is supported, the scheduling decision is straightforward. To perform the processor scheduling decision in a mixed scope environment, it is necessary to map the scheduling attributes of the thread with process-wide contention scope to the same attribute space as the thread with system-wide contention scope.

Since a conforming implementation has to support one and may support both scopes, it is useful to discuss the effects of such choices with respect to example applications. If an implementation supports both scopes, mixing scopes provides a means of better managing system-level (that is, kernel-level) and library-level resources. In general, threads with system scope will require the resources of a separate kernel entity in order to guarantee the scheduling semantics. On the other hand, threads with process scope can share the resources of a kernel entity while maintaining the scheduling semantics.

The application is free to create threads with dedicated kernel resources, and other threads that multiplex kernel resources. Consider the example of a window server. The server allocates two threads per widget: one thread manages the widget user interface (including drawing), while the other thread takes any required application action. This allows the widget to be "active" while the application is computing. A screen image may be built from thousands of widgets. If each of these threads had been created with system scope, then most of the kernel-level resources might be wasted, since only a few widgets are active at any one time. In addition, mixed scope is particularly useful in a window server where one thread with high priority and system scope handles the mouse so that it tracks well. As another example, consider a database server. For each of the hundreds or thousands of clients supported by a large server, an equivalent number of threads will have to be created. If each of these threads were system scope, the consequences would be the same as for the window server example above. However, the server could be constructed so that actual retrieval of data is done by several dedicated threads. Dedicated threads that do work for all clients frequently justify the added expense of system scope. If it were not permissible to mix system and process threads in the same process, this type of solution would not be possible.

Dynamic Thread Scheduling Parameters Access

In many time-constrained applications, there is no need to change the scheduling attributes dynamically during thread or process execution, since the general use of these attributes is to reflect directly the time constraints of the application. Since these time constraints are generally imposed to meet higher-level system requirements, such as accuracy or availability, they frequently should remain unchanged during application execution.

However, there are important situations in which the scheduling attributes should be changed. Generally, this will occur when external environmental conditions exist in which the time constraints change. Consider, for example, a space vehicle major mode change, such as the change from ascent to descent mode, or the change from the space environment to the atmospheric environment. In such cases, the frequency with which many of the sensors or actuators need to be read or written will change, which will necessitate a priority change. In other cases, even the existence of a time constraint might be temporary, necessitating not just a priority change, but also a policy change for ongoing threads or processes. For this reason, it is critical that the interface should provide functions to change the scheduling parameters dynamically, but, as with many of the other realtime functions, it is important that applications use them properly to avoid the possibility of unnecessarily degrading performance.

In providing functions for dynamically changing the scheduling behavior of threads, there were two options: provide functions to get and set the individual scheduling parameters of threads, or provide a single interface to get and set all the scheduling parameters for a given thread simultaneously. Both approaches have merit. Access functions for individual parameters allow simpler control of thread scheduling for simple thread scheduling parameters. However, a single function for setting all the parameters for a given scheduling policy is required when first setting that scheduling policy. Since the single all-encompassing functions are required, it was decided to leave the interface as minimal as possible. Note that simpler functions (such as pthread_setprio() for threads running under the priority-based schedulers) can be easily defined in terms of the all-encompassing functions.
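
As a sketch of how such a simpler function might be layered on the all-encompassing ones, the wrapper below changes only the priority and leaves the policy as it finds it. It is named my_setprio() to make clear that it is a hypothetical helper, not a standard interface.

```c
#define _XOPEN_SOURCE 600
#include <pthread.h>
#include <sched.h>

/* Hypothetical wrapper: set only the priority of `thread`, keeping
   its current scheduling policy. Returns 0 or an error number. */
int my_setprio(pthread_t thread, int priority)
{
    struct sched_param param;
    int policy;
    int rc;

    /* Fetch the complete current setting first. */
    rc = pthread_getschedparam(thread, &policy, &param);
    if (rc != 0)
        return rc;

    /* Change one member and write the whole set back. */
    param.sched_priority = priority;
    return pthread_setschedparam(thread, policy, &param);
}
```

The read-modify-write sequence is not atomic with respect to other threads that change the same thread's parameters, which is one cost of building the simpler call from the general ones.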

If the pthread_setschedparam() function executes successfully, it will have set all of the scheduling parameter values indicated in param; otherwise, none of the scheduling parameters will have been modified. This is necessary to ensure that the scheduling of this and all other threads continues to be consistent in the presence of an erroneous scheduling parameter.

The [EPERM] error value is included in the list of possible pthread_setschedparam() error returns as a reflection of the fact that the ability to change scheduling parameters increases risks to the implementation and application performance if the scheduling parameters are changed improperly. For this reason, and based on some existing practice, it was felt that some implementations would probably choose to define specific permissions for changing either a thread's own or another thread's scheduling parameters. IEEE Std 1003.1-2001 does not include portable methods for setting or retrieving permissions, so any such use of permissions is completely unspecified.

Mutex Initialization Scheduling Attributes

In a priority-driven environment, a direct use of traditional primitives like mutexes and condition variables can lead to unbounded priority inversion, where a higher priority thread can be blocked by a lower priority thread, or set of threads, for an unbounded duration of time. As a result, it becomes impossible to guarantee thread deadlines. Priority inversion can be bounded and minimized by the use of priority inheritance protocols. This allows thread deadlines to be guaranteed even in the presence of synchronization requirements.

Two useful but simple members of the family of priority inheritance protocols are the basic priority inheritance protocol and the priority ceiling protocol emulation. Under the Basic Priority Inheritance protocol (governed by the Thread Priority Inheritance option), a thread that is blocking higher priority threads executes at the priority of the highest priority thread that it blocks. This simple mechanism allows priority inversion to be bounded by the duration of critical sections and makes timing analysis possible.
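
A mutex is placed under the basic priority inheritance protocol through its attributes object. The following is a minimal sketch. The helper name init_inherit_mutex() and its return codes are hypothetical, and an implementation without the Thread Priority Inheritance option may reject the request.

```c
#define _XOPEN_SOURCE 600
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: initialize `mutex` with the priority
   inheritance protocol. Returns 0 on success, 1 if the protocol is
   not supported, -1 on any other error. */
int init_inherit_mutex(pthread_mutex_t *mutex)
{
    pthread_mutexattr_t attr;
    int rc;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;

    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(mutex, &attr);

    pthread_mutexattr_destroy(&attr);   /* the mutex keeps its own copy */
    if (rc == 0)
        return 0;
    return rc == ENOTSUP ? 1 : -1;
}
```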

Under the Priority Ceiling Protocol Emulation protocol (governed by the Thread Priority Protection option), each mutex has a priority ceiling, usually defined as the priority of the highest priority thread that can lock the mutex. When a thread is executing inside critical sections, its priority is unconditionally increased to the highest of the priority ceilings of all the mutexes owned by the thread. This protocol has two very desirable properties in uni-processor systems. First, a thread can be blocked by a lower priority thread for at most the duration of one single critical section. Furthermore, when the protocol is correctly used in a single processor, and if threads do not become blocked while owning mutexes, mutual deadlocks are prevented.
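
The protocol and ceiling are likewise selected through the mutex attributes object. In the sketch below, the helper names highest_fifo_priority() and set_ceiling_attr() are hypothetical, and using the maximum SCHED_FIFO priority as the ceiling is only an illustrative choice; an application would normally use the priority of the highest priority thread that can lock the mutex.

```c
#define _XOPEN_SOURCE 600
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: a ceiling that no SCHED_FIFO thread can exceed. */
int highest_fifo_priority(void)
{
    return sched_get_priority_max(SCHED_FIFO);
}

/* Hypothetical helper: select the priority protect protocol in `attr`
   and record `ceiling` as the priority ceiling. Returns 0 or an
   error number. */
int set_ceiling_attr(pthread_mutexattr_t *attr, int ceiling)
{
    int rc = pthread_mutexattr_setprotocol(attr, PTHREAD_PRIO_PROTECT);

    if (rc != 0)
        return rc;
    return pthread_mutexattr_setprioceiling(attr, ceiling);
}
```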

The priority ceiling emulation can be extended to multiple processor environments, in which case the values of the priority ceilings will be assigned depending on the kind of mutex that is being used: local to only one processor, or global, shared by several processors. Local priority ceilings will be assigned the usual way, equal to the priority of the highest priority thread that may lock that mutex. Global priority ceilings will usually be assigned a priority level higher than all the priorities assigned to any of the threads that reside in the involved processors to avoid the effect called remote blocking.
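
As a minimal sketch, assuming an implementation that supports the Thread Priority Inheritance and Thread Priority Protection options, the protocol of a mutex could be selected through its attributes object as follows. The helper name init_rt_mutex() is hypothetical; an implementation without the relevant option may fail with an error such as [ENOTSUP].

```c
#define _POSIX_C_SOURCE 200112L
#include <errno.h>
#include <pthread.h>

/*
 * Hypothetical helper: initialize *m with the given protocol
 * (PTHREAD_PRIO_NONE, PTHREAD_PRIO_INHERIT, or PTHREAD_PRIO_PROTECT).
 * The ceiling argument is used only for PTHREAD_PRIO_PROTECT.
 * Returns 0 on success or an error number (for example ENOTSUP).
 */
int init_rt_mutex(pthread_mutex_t *m, int protocol, int ceiling)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
        return rc;

    rc = pthread_mutexattr_setprotocol(&attr, protocol);
    if (rc == 0 && protocol == PTHREAD_PRIO_PROTECT)
        rc = pthread_mutexattr_setprioceiling(&attr, ceiling);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);

    /* The attributes object is no longer needed once the mutex exists. */
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

For a mutex using PTHREAD_PRIO_PROTECT, the ceiling would normally be at least the priority of the highest priority thread that may lock it, as described above.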

Change the Priority Ceiling of a Mutex

In order for the priority protect protocol to exhibit its desired properties of bounding priority inversion and avoidance of deadlock, it is critical that the ceiling priority of a mutex be the same as the priority of the highest thread that can ever hold it, or higher. Thus, if the priorities of the threads using such mutexes never change dynamically, there is no need ever to change the priority ceiling of a mutex.

However, if a major system mode change results in an altered response time requirement for one or more application threads, their priority has to change to reflect it. It will occasionally be the case that the priority ceilings of mutexes held also need to change. While changing priority ceilings should generally be avoided, it is important that IEEE Std 1003.1-2001 provide these interfaces for those cases in which it is necessary.

Thread Cancellation

Many existing threads packages have facilities for canceling an operation or canceling a thread. These facilities are used for implementing user requests (such as the CANCEL button in a window-based application), for implementing OR parallelism (for example, telling the other threads to stop working once one thread has found a forced mate in a parallel chess program), or for implementing the ABORT mechanism in Ada.

POSIX programs traditionally have used the signal mechanism combined with either longjmp() or polling to cancel operations. Many POSIX programmers have trouble using these facilities to solve their problems efficiently in a single-threaded process. With the introduction of threads, these solutions become even more difficult to use.

The main issues with implementing a cancellation facility are specifying the operation to be canceled, cleanly releasing any resources allocated to that operation, controlling when the target notices that it has been canceled, and defining the interaction between asynchronous signals and cancellation.

Specifying the Operation to Cancel

Consider a thread that calls through five distinct levels of program abstraction and then, inside the lowest-level abstraction, calls a function that suspends the thread. (An abstraction boundary is a layer at which the client of the abstraction sees only the service being provided and can remain ignorant of the implementation. Abstractions are often layered, each level of abstraction being a client of the lower-level abstraction and implementing a higher-level abstraction.) Depending on the semantics of each abstraction, one could imagine wanting to cancel only the call that causes suspension, only the bottom two levels, or the operation being done by the entire thread. Canceling operations at a finer grain than the entire thread is difficult because threads are active and they may be run in parallel on a multi-processor. By the time one thread can make a request to cancel an operation, the thread performing the operation may have completed that operation and gone on to start another operation whose cancellation is not desired.

Thread IDs are not reused until the thread has exited, and either it was created with the detachstate attribute set to PTHREAD_CREATE_DETACHED or the pthread_join() or pthread_detach() function has been called for that thread. Consequently, a thread cancellation will never be misdirected when the thread terminates. For these reasons, the canceling of operations is done at the granularity of the thread. Threads are designed to be inexpensive enough so that a separate thread may be created to perform each separately cancelable operation; for example, each possibly long running user request.

For cancellation to be used in existing code, cancellation scopes and handlers will have to be established for code that needs to release resources upon cancellation, so that it follows the programming discipline described in the text.

A Special Signal Versus a Special Interface

Two different mechanisms were considered for providing the cancellation interfaces. The first was to provide an interface to direct signals at a thread and then to define a special signal that had the required semantics. The other alternative was to use a special interface that delivered the correct semantics to the target thread.

The solution using signals produced a number of problems. It required the implementation to provide cancellation in terms of signals, whereas a perfectly valid (and possibly more efficient) implementation could layer both signals and cancellation on a lower-level set of primitives. There were so many exceptions to the special signal (it cannot be used with kill(), no POSIX.1 interfaces can be used with it) that it was clearly not a valid signal. Its semantics on delivery were also completely different from any existing POSIX.1 signal. As such, a special interface that did not mandate the implementation and did not confuse the semantics of signals and cancellation was felt to be the better solution.

Races Between Cancellation and Resuming Execution

Due to the nature of cancellation, there is generally no synchronization between the thread requesting the cancellation of a blocked thread and events that may cause that thread to resume execution. For this reason, and because excess serialization hurts performance, when both an event that a thread is waiting for has occurred and a cancellation request has been made and cancellation is enabled, IEEE Std 1003.1-2001 explicitly allows the implementation to choose between returning from the blocking call or acting on the cancellation request.

Interaction of Cancellation with Asynchronous Signals

A typical use of cancellation is to acquire a lock on some resource and to establish a cancellation cleanup handler for releasing the resource when and if the thread is canceled.
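
The typical use just described might look like the following sketch. All names (res_lock, holder(), cancel_lock_holder(), and so on) are hypothetical; the thread blocks in pthread_cond_wait(), which is a cancellation point, and the cleanup handler releases the mutex that the implementation reacquires before running handlers.

```c
#define _POSIX_C_SOURCE 200112L
#include <pthread.h>

static pthread_mutex_t res_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  started  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  never    = PTHREAD_COND_INITIALIZER; /* never signaled */
static int is_waiting  = 0;
static int cleanup_ran = 0;

/* Cancellation cleanup handler: release the resource lock. */
static void release_lock(void *arg)
{
    cleanup_ran = 1;
    pthread_mutex_unlock(arg);
}

static void *holder(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&res_lock);
    pthread_cleanup_push(release_lock, &res_lock);
    is_waiting = 1;
    pthread_cond_signal(&started);
    for (;;)
        pthread_cond_wait(&never, &res_lock);  /* cancellation point */
    pthread_cleanup_pop(1);                    /* not reached; pairs with push */
    return NULL;
}

/* Returns 1 if the thread was canceled and its handler released the lock. */
int cancel_lock_holder(void)
{
    pthread_t t;
    void *res = NULL;
    int relock;

    if (pthread_create(&t, NULL, holder, NULL) != 0)
        return -1;

    /* Wait until the thread holds the resource and is about to block. */
    pthread_mutex_lock(&res_lock);
    while (!is_waiting)
        pthread_cond_wait(&started, &res_lock);
    pthread_mutex_unlock(&res_lock);

    pthread_cancel(t);
    pthread_join(t, &res);

    /* If the handler ran, the mutex is free again. */
    relock = pthread_mutex_trylock(&res_lock);
    if (relock == 0)
        pthread_mutex_unlock(&res_lock);
    return res == PTHREAD_CANCELED && cleanup_ran && relock == 0;
}
```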

A correct and complete implementation of cancellation in the presence of asynchronous signals requires considerable care. An implementation has to push a cancellation cleanup handler on the cancellation cleanup stack while maintaining the integrity of the stack data structure. If an asynchronously-generated signal is posted to the thread during a stack operation, the signal handler cannot manipulate the cancellation cleanup stack. As a consequence, asynchronous signal handlers may not cancel threads or otherwise manipulate the cancellation state of a thread. Threads may, of course, be canceled by another thread that used a sigwait() function to wait synchronously for an asynchronous signal.
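
One way to follow this rule is sketched below, under the assumption that SIGUSR1 is the signal of interest. The signal is blocked in every thread, a dedicated thread collects it synchronously with sigwait(), and only that thread calls pthread_cancel(). The function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

static void *worker(void *unused)
{
    (void)unused;
    for (;;)
        sleep(1);               /* sleep() is a cancellation point */
    return NULL;
}

/* Waits synchronously for SIGUSR1, then cancels the target thread. */
static void *canceler(void *arg)
{
    pthread_t target = *(pthread_t *)arg;
    sigset_t set;
    int signo;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    if (sigwait(&set, &signo) == 0)
        pthread_cancel(target); /* ordinary thread context, not a handler */
    return NULL;
}

/* Returns 1 if the worker ended by cancellation, -1 on setup failure. */
int cancel_worker_on_sigusr1(void)
{
    sigset_t set, old;
    pthread_t w, c;
    void *result = NULL;

    /* Block the signal before creating threads so they inherit the mask. */
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    if (pthread_sigmask(SIG_BLOCK, &set, &old) != 0)
        return -1;
    if (pthread_create(&w, NULL, worker, NULL) != 0)
        return -1;
    if (pthread_create(&c, NULL, canceler, &w) != 0) {
        pthread_cancel(w);
        pthread_join(w, NULL);
        return -1;
    }

    kill(getpid(), SIGUSR1);    /* stays pending until sigwait() accepts it */
    pthread_join(c, NULL);
    pthread_join(w, &result);
    pthread_sigmask(SIG_SETMASK, &old, NULL);
    return result == PTHREAD_CANCELED;
}
```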

In order for cancellation to function correctly, it is required that asynchronous signal handlers not change the cancellation state. This requires that some elements of existing practice, such as using longjmp() to exit from an asynchronous signal handler implicitly, be prohibited in cases where the integrity of the cancellation state of the interrupted thread cannot be ensured.

Thread Cancellation Overview

Cancelability States

The three possible cancelability states (disabled, deferred, and asynchronous) are encoded into two separate bits ((disable, enable) and (deferred, asynchronous)) to allow them to be changed and restored independently. For instance, short code sequences that will not block sometimes disable cancelability on entry and restore the previous state upon exit. Likewise, long or unbounded code sequences containing no convenient explicit cancellation points will sometimes set the cancelability type to asynchronous on entry and restore the previous value upon exit.
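
The two save-and-restore patterns mentioned above can be sketched as follows. The function names are hypothetical, and square_in_place() merely stands in for a computation that contains no cancellation points.

```c
#define _POSIX_C_SOURCE 200112L
#include <pthread.h>

/* Short, non-blocking sequence: disable cancelability, then restore it. */
int bump_counter_uncancelable(int *counter)
{
    int saved, ignored, rc;

    rc = pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &saved);
    if (rc != 0)
        return rc;
    ++*counter;                 /* stands in for the short critical work */
    return pthread_setcancelstate(saved, &ignored);
}

/* Long computation: make the type asynchronous, then restore it. */
int run_async_cancelable(void (*compute)(void *), void *arg)
{
    int saved, ignored, rc;

    rc = pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &saved);
    if (rc != 0)
        return rc;
    compute(arg);               /* must acquire no resources while async */
    return pthread_setcanceltype(saved, &ignored);
}

/* Example computation that acquires no resources. */
static void square_in_place(void *arg)
{
    int *p = arg;
    *p *= *p;
}
```

Because the state and the type are saved separately, either function can be called from code that has already changed the other setting without disturbing it.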

Cancellation Points

Cancellation points are points inside of certain functions where a thread has to act on any pending cancellation request when cancelability is enabled, if the function would block. As with checking for signals, operations need only check for pending cancellation requests when the operation is about to block indefinitely.

The idea was considered of allowing implementations to define whether blocking calls such as read() should be cancellation points. It was decided that it would adversely affect the design of conforming applications if blocking calls were not cancellation points because threads could be left blocked in an uncancelable state.

There are several important blocking routines that are specifically not made cancellation points:
- pthread_mutex_lock()
  
  If pthread_mutex_lock() were a cancellation point, every routine that called it would also become a cancellation point (that is, any routine that touched shared state would automatically become a cancellation point). For example, malloc(), free(), and rand() would become cancellation points under this scheme. Having too many cancellation points makes programming very difficult, leading to either much disabling and restoring of cancelability or much difficulty in trying to arrange for reliable cleanup at every possible place.
  
  Since pthread_mutex_lock() is not a cancellation point, threads could end up blocked uninterruptibly for long periods of time if mutexes were used as a general synchronization mechanism. As this is normally not acceptable, mutexes should only be used to protect resources that are held for small fixed lengths of time where not being able to be canceled will not be a problem. Resources that need to be held exclusively for long periods of time should be protected with condition variables.
- pthread_barrier_wait()
  
  Canceling a barrier wait will render a barrier unusable. Similar to a barrier timeout (which the standard developers rejected), there is no way to guarantee the consistency of a barrier's internal data structures if a barrier wait is canceled.
- pthread_spin_lock()
  
  As with mutexes, spin locks should only be used to protect resources that are held for small fixed lengths of time where not being cancelable will not be a problem.

Every library routine should specify whether or not it includes any cancellation points. Typically, only those routines that may block or compute indefinitely need to include cancellation points.
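
For a routine that computes indefinitely without blocking, a cancellation point can be added explicitly with pthread_testcancel(). The sketch below is illustrative only; spin() and cancel_compute_loop() are hypothetical names.

```c
#define _POSIX_C_SOURCE 200112L
#include <pthread.h>

/* Unbounded computation with an explicit cancellation point per iteration. */
static void *spin(void *arg)
{
    volatile unsigned long *iterations = arg;

    for (;;) {
        ++*iterations;          /* stands in for one unit of real work */
        pthread_testcancel();   /* acts on any pending cancellation request */
    }
    return NULL;                /* not reached */
}

/* Returns 1 if the computing thread terminated by cancellation. */
int cancel_compute_loop(void)
{
    pthread_t t;
    void *res = NULL;
    unsigned long iterations = 0;

    if (pthread_create(&t, NULL, spin, &iterations) != 0)
        return -1;
    if (pthread_cancel(t) != 0)
        return -1;
    if (pthread_join(t, &res) != 0)
        return -1;
    return res == PTHREAD_CANCELED;
}
```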

Correctly coded routines only reach cancellation points after having set up a cancellation cleanup handler to restore invariants if the thread is canceled at that point. Being cancelable only at specified cancellation points allows programmers to keep track of actions needed in a cancellation cleanup handler more easily. A thread should only be made asynchronously cancelable when it is not in the process of acquiring or releasing resources or otherwise in a state from which it would be difficult or impossible to recover.

Thread Cancellation Cleanup Handlers

The cancellation cleanup handlers provide a portable mechanism, easy to implement, for releasing resources and restoring invariants. They are easier to use than signal handlers because they provide a stack of cancellation cleanup handlers rather than a single handler, and because they have an argument that can be used to pass context information to the handler.

The alternative to providing these simple cancellation cleanup handlers (whose only use is for cleaning up when a thread is canceled) is to define a general exception package that could be used for handling and cleaning up after hardware traps and software-detected errors. This was too far removed from the charter of providing threads to handle asynchrony. However, it is an explicit goal of IEEE Std 1003.1-2001 to be compatible with existing exception facilities and languages having exceptions.

The interaction of this facility and other procedure-based or language-level exception facilities is unspecified in this version of IEEE Std 1003.1-2001. However, it is intended that it be possible for an implementation to define the relationship between these cancellation cleanup handlers and Ada, C++, or other language-level exception handling facilities.

It was suggested that the cancellation cleanup handlers should also be called when the process exits or calls the exec function. This was rejected partly due to the performance problem caused by having to call the cancellation cleanup handlers of every thread before the operation could continue. The other reason was that the only state expected to be cleaned up by the cancellation cleanup handlers would be the intraprocess state. Any handlers that are to clean up the interprocess state would be registered with atexit(). There is the orthogonal problem that the exec functions do not honor the atexit() handlers, but resolving this is beyond the scope of IEEE Std 1003.1-2001.

Async-Cancel Safety

A function is said to be async-cancel-safe if it is written in such a way that entering the function with asynchronous cancelability enabled will not cause any invariants to be violated, even if a cancellation request is delivered at any arbitrary instruction. Functions that are async-cancel-safe are often written in such a way that they need to acquire no resources for their operation and the visible variables that they may write are strictly limited.

Any routine that gets a resource as a side effect cannot be made async-cancel-safe (for example, malloc()). If such a routine were called with asynchronous cancelability enabled, it might acquire the resource successfully, but as it was returning to the client, it could act on a cancellation request. In such a case, the application would have no way of knowing whether the resource was acquired or not.

Indeed, because many interesting routines cannot be made async-cancel-safe, most library routines in general are not async-cancel-safe. Every library routine should specify whether or not it is async-cancel safe so that programmers know which routines can be called from code that is asynchronously cancelable.

IEEE Std 1003.1-2001/Cor 1-2002, item XSH/TC1/D6/8 is applied, adding the pselect() function to the list of functions with cancellation points.

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/5 is applied, adding the fdatasync() function into the table of functions that shall have cancellation points.

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/6 is applied, adding numerous functions to the table of functions that may have cancellation points.

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/7 is applied, clarifying the requirements in Thread Cancellation Cleanup Handlers.

Thread Read-Write Locks

Background

Read-write locks are often used to allow parallel access to data on multi-processors, to avoid context switches on uni-processors when multiple threads access the same data, and to protect data structures that are frequently accessed (that is, read) but rarely updated (that is, written). The in-core representation of a file system directory is a good example of such a data structure. One would like to achieve as much concurrency as possible when searching directories, but limit concurrent access when adding or deleting files.

Although read-write locks can be implemented with mutexes and condition variables, such implementations are significantly less efficient than is possible. Therefore, this synchronization primitive is included in IEEE Std 1003.1-2001 for the purpose of allowing more efficient implementations in multi-processor systems.
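
As a small illustration of the directory example above, the following sketch protects a hypothetical in-core name table with a read-write lock. The struct dir_table type, its limits, and the dir_*() functions are assumptions made for this example only.

```c
#define _POSIX_C_SOURCE 200112L
#include <pthread.h>
#include <string.h>

#define DIR_SLOTS    8      /* hypothetical table capacity */
#define DIR_NAME_LEN 32     /* hypothetical maximum name length, with NUL */

struct dir_table {
    pthread_rwlock_t lock;
    int              count;
    char             names[DIR_SLOTS][DIR_NAME_LEN];
};

int dir_init(struct dir_table *d)
{
    d->count = 0;
    return pthread_rwlock_init(&d->lock, NULL);
}

/* Readers: any number of threads may search concurrently. */
int dir_lookup(struct dir_table *d, const char *name)
{
    int i, found = -1;

    if (pthread_rwlock_rdlock(&d->lock) != 0)
        return -1;
    for (i = 0; i < d->count; i++) {
        if (strcmp(d->names[i], name) == 0) {
            found = i;
            break;
        }
    }
    pthread_rwlock_unlock(&d->lock);
    return found;               /* slot index, or -1 if absent */
}

/* Writer: holds the lock exclusively while the table is modified. */
int dir_add(struct dir_table *d, const char *name)
{
    int rc = -1;

    if (strlen(name) >= DIR_NAME_LEN)
        return -1;
    if (pthread_rwlock_wrlock(&d->lock) != 0)
        return -1;
    if (d->count < DIR_SLOTS) {
        strcpy(d->names[d->count++], name);
        rc = 0;
    }
    pthread_rwlock_unlock(&d->lock);
    return rc;                  /* 0 on success, -1 if the table is full */
}
```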

Queuing of Waiting Threads

The pthread_rwlock_unlock() function description states that one writer or one or more readers must acquire the lock if it is no longer held by any thread as a result of the call. However, the function does not specify which thread(s) acquire the lock, unless the Thread Execution Scheduling option is supported.

The standard developers considered the issue of scheduling with respect to the queuing of threads blocked on a read-write lock. The question turned out to be whether IEEE Std 1003.1-2001 should require priority scheduling of read-write locks for threads whose execution scheduling policy is priority-based (for example, SCHED_FIFO or SCHED_RR). There are tradeoffs between priority scheduling, the amount of concurrency achievable among readers, and the prevention of writer and/or reader starvation.

For example, suppose one or more readers hold a read-write lock and the following threads request the lock in the listed order:

pthread_rwlock_wrlock() - Low priority thread writer_a
pthread_rwlock_rdlock() - High priority thread reader_a
pthread_rwlock_rdlock() - High priority thread reader_b
pthread_rwlock_rdlock() - High priority thread reader_c

When the lock becomes available, should writer_a block the high priority readers? Or, suppose a read-write lock becomes available and the following are queued:

pthread_rwlock_rdlock() - Low priority thread reader_a
pthread_rwlock_rdlock() - Low priority thread reader_b
pthread_rwlock_rdlock() - Low priority thread reader_c
pthread_rwlock_wrlock() - Medium priority thread writer_a
pthread_rwlock_rdlock() - High priority thread reader_d

If priority scheduling is applied then reader_d would acquire the lock and writer_a would block the remaining readers. But should the remaining readers also acquire the lock to increase concurrency? The solution adopted takes into account that when the Thread Execution Scheduling option is supported, high priority threads may in fact starve low priority threads (the application developer is responsible in this case for designing the system in such a way that this starvation is avoided). Therefore, IEEE Std 1003.1-2001 specifies that high priority readers take precedence over lower priority writers. However, to prevent writer starvation from threads of the same or lower priority, writers take precedence over readers of the same or lower priority.

Priority inheritance mechanisms are non-trivial in the context of read-write locks. When a high priority writer is forced to wait for multiple readers, for example, it is not clear which subset of the readers should inherit the writer's priority. Furthermore, the internal data structures that record the inheritance must be accessible to all readers, and this implies some sort of serialization that could negate any gain in parallelism achieved through the use of multiple readers in the first place. Finally, existing practice does not support the use of priority inheritance for read-write locks. Therefore, no specification of priority inheritance or priority ceiling is attempted. If reliable priority-scheduled synchronization is absolutely required, it can always be obtained through the use of mutexes.

Comparison to fcntl() Locks

The read-write locks and the fcntl() locks in IEEE Std 1003.1-2001 share a common goal: increasing concurrency among readers, thus increasing throughput and decreasing delay.

However, the read-write locks have two features not present in the fcntl() locks. First, under priority scheduling, read-write locks are granted in priority order. Second, also under priority scheduling, writer starvation is prevented by giving writers preference over readers of equal or lower priority.

Also, read-write locks can be used in systems lacking a file system, such as those conforming to the minimal realtime system profile of IEEE Std 1003.13-1998.

History of Resolution Issues

Based upon some balloting objections, early drafts specified the behavior of threads waiting on a read-write lock during the execution of a signal handler, as if the thread had not called the lock operation. However, this specified behavior would require implementations to establish internal signal handlers even though this situation would be rare, or never happen for many programs. This would introduce an unacceptable performance hit in comparison to the little additional functionality gained. Therefore, the behavior of read-write locks and signals was reverted to its previous mutex-like specification.

Thread Interactions with Regular File Operations

There is no additional rationale provided for this section.

Use of Application-Managed Thread Stacks

IEEE Std 1003.1-2001/Cor 2-2004, item XSH/TC2/D6/8 is applied, adding this new section. It was added to make it clear that the current standard does not allow an application to determine when a stack can be reclaimed. This may be addressed in a future revision.

B.2.10 Sockets

The base document for the sockets interfaces in IEEE Std 1003.1-2001 is the XNS, Issue 5.2 specification. This was primarily chosen as it aligns with IPv6. Additional material has been added from IEEE Std 1003.1g-2000, notably socket concepts, raw sockets, the pselect() function, the sockatmark() function, and the <sys/select.h> header.

Address Families

There is no additional rationale provided for this section.

Addressing

There is no additional rationale provided for this section.

Protocols

There is no additional rationale provided for this section.

Routing

There is no additional rationale provided for this section.

Interfaces

There is no additional rationale provided for this section.

Socket Types

The type socklen_t was invented to cover the range of implementations seen in the field. The intent of socklen_t is to be the type for all lengths that are naturally bounded in size; that is, that they are the length of a buffer which cannot sensibly become of massive size: network addresses, host names, string representations of these, ancillary data, control messages, and socket options are examples. Truly boundless sizes are represented by size_t as in read(), write(), and so on.

All lengths that are now of type socklen_t were originally (in BSD UNIX) of type int. During the development of IEEE Std 1003.1-2001, it was decided to change all buffer lengths to size_t, which appears at face value to make sense. When dual mode 32/64-bit systems came along, this choice unnecessarily complicated system interfaces because size_t (like long) has a different size under the ILP32 and LP64 models. Reverting to int would have happened except that some implementations had already shipped 64-bit-only interfaces. The compromise was a type which could be defined to be any size by the implementation: socklen_t.
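
The division of labor between the two types can be seen in a short sketch: a naturally bounded length (here, the size of a socket option value) is passed as socklen_t, while I/O byte counts use size_t and ssize_t. The helper names socket_type_of() and transfer_ok() are hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Bounded length: the option value size is a socklen_t (in and out). */
int socket_type_of(int fd)
{
    int type = -1;
    socklen_t len = sizeof type;

    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) != 0
        || len != sizeof type)
        return -1;
    return type;
}

/* Unbounded length: byte counts for write() and read() use size_t/ssize_t. */
int transfer_ok(int from, int to, const char *msg)
{
    char buf[64];
    size_t n = strlen(msg);
    ssize_t got;

    if (n > sizeof buf || write(from, msg, n) != (ssize_t)n)
        return 0;
    got = read(to, buf, sizeof buf);
    return got == (ssize_t)n && memcmp(buf, msg, n) == 0;
}
```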

Socket I/O Mode

There is no additional rationale provided for this section.

Socket Owner

There is no additional rationale provided for this section.

Socket Queue Limits

There is no additional rationale provided for this section.

Pending Error

There is no additional rationale provided for this section.

Socket Receive Queue

There is no additional rationale provided for this section.

Socket Out-of-Band Data State

There is no additional rationale provided for this section.

Connection Indication Queue

There is no additional rationale provided for this section.

Signals

There is no additional rationale provided for this section.

Asynchronous Errors

There is no additional rationale provided for this section.

Use of Options

There is no additional rationale provided for this section.

Use of Sockets for Local UNIX Connections

There is no additional rationale provided for this section.

Use of Sockets over Internet Protocols

A raw socket allows privileged users direct access to a protocol; for example, raw access to the IP and ICMP protocols is possible through raw sockets. Raw sockets are intended for knowledgeable applications that wish to take advantage of some protocol feature not directly accessible through the other sockets interfaces.

Use of Sockets over Internet Protocols Based on IPv4

There is no additional rationale provided for this section.

Use of Sockets over Internet Protocols Based on IPv6

The Open Group Base Resolution bwg2001-012 is applied, clarifying that IPv6 implementations are required to support use of AF_INET6 sockets over IPv4.
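
IPv4 destinations are presented to an AF_INET6 socket as IPv4-mapped IPv6 addresses of the form ::ffff:a.b.c.d. The following sketch builds such an address by hand; the helper name make_v4mapped() is hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/*
 * Fill *out with the IPv4-mapped IPv6 address for a dotted-quad string.
 * Returns 0 on success, -1 if the string is not a valid IPv4 address.
 */
int make_v4mapped(const char *dotted, in_port_t port, struct sockaddr_in6 *out)
{
    struct in_addr v4;

    if (inet_pton(AF_INET, dotted, &v4) != 1)
        return -1;

    memset(out, 0, sizeof *out);
    out->sin6_family = AF_INET6;
    out->sin6_port = htons(port);

    /* Bytes 0-9 are zero, bytes 10-11 are 0xff, bytes 12-15 are the IPv4 address. */
    out->sin6_addr.s6_addr[10] = 0xff;
    out->sin6_addr.s6_addr[11] = 0xff;
    memcpy(&out->sin6_addr.s6_addr[12], &v4, sizeof v4);
    return 0;
}
```

The resulting structure could then be passed to connect() or sendto() on an AF_INET6 socket, and IN6_IS_ADDR_V4MAPPED() can be used to recognize such addresses on receipt.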

B.2.11 Tracing

The organization of the tracing rationale differs from the traditional rationale in that this tracing rationale text is written against the trace interface as a whole, rather than against the individual components of the trace interface or the normative section in which those components are defined. Therefore the sections below do not parallel the sections of normative text in IEEE Std 1003.1-2001.

Objectives

The intended uses of tracing are application-system debugging during system development, as a "flight recorder" for maintenance of fielded systems, and as a performance measurement tool. In all of these intended uses, the vendor-supplied computer system and its software are, for this discussion, assumed error-free; the intent being to debug the user-written and/or third-party application code, and their interactions. Clearly, problems with the vendor-supplied system and its software will be uncovered from time to time, but this is a byproduct of the primary activity, debugging user code.

Another need for defining a trace interface in POSIX stems from the objective to provide an efficient portable way to perform benchmarks. Existing practice shows that such interfaces are commonly used in a variety of systems but with little commonality. As part of the benchmarking needs, two aspects within the trace interface must be considered.

The first, and perhaps more important one, is the qualitative aspect.

The second is the quantitative aspect.

Qualitative Aspect

To better understand this aspect, let us consider an example. Suppose that you want to organize a number of actions to be performed during the day. Some of these actions are known at the beginning of the day. Some others, which may be more or less important, will be triggered by reading your mail. During the day you will make some phone calls and synchronously receive some more information. Finally you will receive asynchronous phone calls that also will trigger actions. If you, or somebody else, examine your day at work, either of you may discover that you have not organized your work efficiently. For instance, relative to the phone calls you made, would it be preferable to make some of these early in the morning? Or to delay some others until the end of the day? Relative to the phone calls you have received, you might find that somebody you called in the morning has called you 10 times while you were performing some important work. To examine your day at work afterwards, you record in sequence all the trace events relating to your work. This should give you a chance of organizing your next day at work.

This is the qualitative aspect of the trace interface. The user of a system needs to keep a trace of particular points the application passes through, so that changes can eventually be made to the application and/or system configuration to give the application a chance of running more efficiently.

Quantitative Aspect

This aspect concerns primarily realtime applications, where missed deadlines can be undesirable. Although there are, in IEEE Std 1003.1-2001, some interfaces useful for such applications (timeouts, execution time monitoring, and so on), there are no APIs to aid in the tuning of a realtime application's behavior (timespec in timeouts, length of message queues, duration of driver interrupt service routine, and so on). The tuning of an application needs a means of recording timestamped important trace events during execution in order to analyze them offline and, eventually, to tune some realtime features (redesign the system with less functionality, readjust timeouts, redesign driver interrupts, and so on).

Detailed Objectives

Objectives were defined to build the trace interface and are kept for historical interest. Although some objectives are not fully respected in this trace interface, the concept of the POSIX trace interface assumes the following points:

1. It must be possible to trace both system and user trace events concurrently.
2. It must be possible to trace per-process trace events and also to trace system trace events which are unrelated to any particular process. A per-process trace event is either user-initiated or system-initiated.
3. It must be possible to control tracing on a per-process basis from either inside or outside the process.
4. It must be possible to control tracing on a per-thread basis from inside the enclosing process.
5. Trace points must be controllable by trace event type ID from inside and outside of the process. Multiple trace points can have the same trace event type ID, and will be controlled jointly.
6. Recording of trace events is dependent on both trace event type ID and the process/thread. Both must be enabled in order to record trace events. System trace events may or may not be handled differently.
7. The API must not mandate the ability to control tracing for more than one process at the same time.
8. There is no objective for trace control on anything bigger than a process; for example, group or session.
9. Trace propagation and control:
   a. Trace propagation across fork() is optional; the default is to not trace a child process.
   b. Trace control must span pthread_create() operations; that is, if a process is being traced, any thread will be traced as well if this thread allows tracing. The default is to allow tracing.
10. Trace control must not span exec or posix_spawn() operations.
11. A triggering API is not required. The triggering API is the ability to command or stop tracing based on the occurrence of a specific trace event other than a POSIX_TRACE_START trace event or a POSIX_TRACE_STOP trace event.
12. Trace log entries must have timestamps of implementation-defined resolution. Implementations are exhorted to support at least microsecond resolution. When a trace log entry is retrieved, it must have timestamp, PC address, PID, and TID of the entity that generated the trace event.
13. Independently developed code should be able to use trace facilities without coordination and without conflict.
14. Even if the trace points in the trace calls are not unique, the trace log entries (after any processing) must be uniquely identified as to trace point.
15. There must be a standard API to read the trace stream.
16. The format of the trace stream and the trace log is opaque and unspecified.
17. It must be possible to read a completed trace, if recorded on some suitable non-volatile storage, even subsequent to a power cycle or subsequent cold boot of the system.
18. Support of analysis of a trace log while it is being formed is implementation-defined.
19. The API must allow the application to write trace stream identification information into the trace stream and to be able to retrieve it, without it being overwritten by trace entries, even if the trace stream is full.
20. It must be possible to specify the destination of trace data produced by trace events.
21. It must be possible to have different trace streams, and for the tracing enabled by one trace stream to be completely independent of the tracing of another trace stream.
22. It must be possible to trace events from threads in different CPUs.
23. The API must support one or more trace streams per-system, and one or more trace streams per-process, up to an implementation-defined set of per-system and per-process maximums.
24. It must be possible to determine the order in which the trace events happened, without necessarily depending on the clock, up to an implementation-defined time resolution.
25. For performance reasons, the trace event point call(s) must be implementable as a macro (see the ISO POSIX-1:1996 standard, 1.3.4, Statement 2).
26. IEEE Std 1003.1-2001 must not define the trace points which a conforming system must implement, except for trace points used in the control of tracing.
27. The APIs must be thread-safe, and trace points should be lock-free (that is, not require a lock to gain exclusive access to some resource).
28. The user-provided information associated with a trace event is variable-sized, up to some maximum size.
29. Bounds on record and trace stream sizes:
   a. The API must permit the application to declare the upper bounds on the length of an application data record. The system must return the limit it used. The limit used may be smaller than requested.
   b. The API must permit the application to declare the upper bounds on the size of trace streams. The system must return the limit it used. The limit used may be different, either larger or smaller, than requested.
30. The API must be able to pass any fundamental data type, and a structured data type composed only of fundamental types. The API must be able to pass data by reference, given only as an address and a length. Fundamental types are the POSIX.1 types (see the <sys/types.h> header) plus those defined in the ISO C standard.
31. The API must apply the POSIX notions of ownership and permission to recorded trace data, corresponding to the sources of that data.

Comments on Objectives

Note: In the following comments, numbers in square brackets refer to the above objectives.

It is necessary to be able to obtain a trace stream for a complete activity. Thus there is a requirement to be able to trace both application and system trace events. A per-process trace event is either user-initiated, like the write() function, or system-initiated, like a timer expiration. There is also a need to be able to trace an entire process' activity even when it has threads in multiple CPUs. To avoid excess trace activity, it is necessary to be able to control tracing on a trace event type basis.
[Objectives 1,2,5,22]

There is a need to be able to control tracing on a per-process basis, both from inside and outside the process; that is, a process can start a trace activity on itself or any other process. There is also the perceived need to allow the definition of a maximum number of trace streams per system.
[Objectives 3,23]

From within a process, it is necessary to be able to control tracing on a per-thread basis. This provides an additional filtering capability to keep the amount of traced data to a minimum. It also allows for less ambiguity as to the origin of trace events. It is recognized that thread-level control is only valid from within the process itself. It is also desirable to know the maximum number of trace streams per process that can be started. The API should not require thread synchronization or mandate priority inversions that would cause the thread to block. However, the API must be thread-safe.
[Objectives 4,23,24,27]

There was no perceived objective to control tracing on anything larger than a process; for example, a group or session. Also, the ability to start or stop a trace activity on multiple processes atomically may be very difficult or cumbersome in some implementations.
[Objectives 6,8]

It is also necessary to be able to control tracing by trace event type identifier, sometimes called a trace hook ID. However, there is no mandated set of system trace events, since such trace points are implementation-defined. The API must not require from the operating system facilities that are not standard.
[Objectives 6,26]

Trace control must span fork() and pthread_create(). If not, there will be no way to ensure that an application's activity is entirely traced. The newly forked child would not be able to turn on its tracing until after it obtained control after the fork, and trace control externally would be even more problematic.
[Objective 9]

Since exec and posix_spawn() represent a complete change in the execution of a task (a new program), trace control need not persist over an exec or posix_spawn().
[Objective 10]

Where trace activities are started on multiple processes, these trace activities should not interfere with each other.
[Objective 21]

There is no need for a triggering objective, primarily for performance reasons; see also Rationale on Triggering.
[Objective 11]

It must be possible to determine the origin of each traced event. The process and thread identifiers for each trace event are needed. Also there was a perceived need for a user-specifiable origin, but it was felt that this would create too much overhead.
[Objectives 12,14]

An allowance must be made for trace points to come embedded in software components from several different sources and vendors without requiring coordination.
[Objective 13]

There is a requirement to be able to uniquely identify trace points that may have the same trace stream identifier. This is only necessary when a trace report is produced.
[Objectives 12,14]

Tracing is a very performance-sensitive activity, and will therefore likely be implemented at a low level within the system. Hence the interface must not mandate any particular buffering or storage method. Therefore, a standard API is needed to read a trace stream. Also the interface must not mandate the format of the trace data, and the interface must not assume a trace storage method. Due to the possibility of a monolithic kernel and the possible presence of multiple processes capable of running trace activities, the two kinds of trace events may be stored in two separate streams for performance reasons. A mandatory dump mechanism, common in some existing practice, has been avoided to allow the implementation of this set of functions on small realtime profiles for which the concept of a file system is not defined. The trace API calls should be implemented as macros.
[Objectives 15,16,25,30]

Since a trace facility is a valuable service tool, the output (or log) of a completed trace stream that is written to permanent storage must be readable on other systems of the type that produced the trace log. Note that there is no objective to be able to interpret a trace log that was not successfully completed.
[Objectives 17,18,19]

For trace streams written to permanent storage, a way to specify the destination of the trace stream is needed.
[Objective 20]

There is a requirement to be able to depend on the ordering of trace events up to some implementation-defined time interval. For example, there is a need to know the time period during which, if trace events are closer together, their ordering is unspecified. Events that occur within an interval smaller than this resolution may or may not be read back in the correct order.
[Objective 24]
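
The effect of the ordering resolution can be illustrated with a small sketch. It assumes that timestamps are the only ordering information available, which the interface does not require; the names order_events() and resolution_ns are hypothetical and not part of the trace interface.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Result of comparing two trace event timestamps. */
enum event_order { ORDER_BEFORE = -1, ORDER_UNSPECIFIED = 0, ORDER_AFTER = 1 };

/* Convert a timespec to nanoseconds (assumes non-negative, in-range values). */
static int64_t to_ns(struct timespec t)
{
    return (int64_t)t.tv_sec * 1000000000LL + t.tv_nsec;
}

/*
 * Hypothetical helper: decide the relative order of events a and b, given
 * an implementation-defined ordering resolution in nanoseconds. Events
 * closer together than the resolution have unspecified order.
 */
static enum event_order order_events(struct timespec a, struct timespec b,
                                     int64_t resolution_ns)
{
    assert(resolution_ns >= 0);
    int64_t delta = to_ns(b) - to_ns(a);
    if (delta > 0 && delta >= resolution_ns)
        return ORDER_BEFORE;      /* a happened before b */
    if (delta < 0 && -delta >= resolution_ns)
        return ORDER_AFTER;       /* a happened after b */
    return ORDER_UNSPECIFIED;     /* too close together to tell */
}
```

An analyzer built this way would treat ORDER_UNSPECIFIED pairs as concurrent rather than trusting the order in which they were read back.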

The application should be able to know how much data can be traced. When trace event types can be filtered, the application should be able to specify the approximate maximum amount of data that will be traced in a trace event so resources can be more efficiently allocated.
[Objectives 28,29]
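
The difference between the two bounds objectives (a record length limit that may only shrink, and a stream size that may grow or shrink) can be sketched as follows. This is only an illustration of one possible policy: the functions, the allocation unit, and the implementation maximum are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical negotiation of the maximum application data record length:
 * the granted limit is never larger than the request.
 */
static size_t grant_record_limit(size_t requested, size_t impl_max)
{
    return requested < impl_max ? requested : impl_max;
}

/*
 * Hypothetical negotiation of a trace stream size: the request is rounded
 * up to a whole number of allocation units and then clamped to the
 * implementation maximum, so the result may be larger or smaller.
 */
static size_t grant_stream_size(size_t requested, size_t unit, size_t impl_max)
{
    assert(unit > 0 && impl_max >= unit);
    size_t units = requested / unit + (requested % unit != 0);
    if (units == 0)
        units = 1;                      /* always grant at least one unit */
    if (units > impl_max / unit)
        return impl_max / unit * unit;  /* clamp to the largest whole-unit size */
    return units * unit;
}
```

In both cases the application is expected to read back the value actually used rather than assume that its request was honored.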

Users should not be able to trace data to which they would not normally have access. System trace events corresponding to a process/thread should be associated with the ownership of that process/thread.
[Objective 31]

Trace Model

Introduction

The model is based on two base entities: the "Trace Stream" and the "Trace Log", and a recorded unit called the "Trace Event". The possibility of using Trace Streams and Trace Logs separately gives two use dimensions and solves both the performance issue and the full-information system issue. In the case of a trace stream without log, specific information, although reduced in quantity, is required to be registered, in a possibly small realtime system, with as little overhead as possible. The Trace Log option has been added for small realtime systems. In the case of a trace stream with log, considerable complex application-specific information needs to be collected.

Trace Model Description

The trace model can be examined for three different subfunctions: Application Instrumentation, Trace Operation Control, and Trace Analysis.

Figure: Trace System Overview: for Offline Analysis

Each of these subfunctions requires specific characteristics of the trace mechanism API.

Application Instrumentation

When instrumenting an application, the programmer is not concerned about the future use of the trace events in the trace stream or the trace log, the full policy of the trace stream, or the eventual pre-filtering of trace events. The programmer is, however, concerned about the correct determination of the specific trace event type identifier, regardless of how many independent libraries are used in the same user application; see Trace System Overview: for Offline Analysis and Trace System Overview: for Online Analysis.

This trace API provides the necessary operations to accomplish this subfunction. This is done by providing functions to associate a programmer-defined name with an implementation-defined trace event type identifier (see the posix_trace_eventid_open() function), and to send this trace event into a potential trace stream (see the posix_trace_event() function).
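
The name-to-identifier association can be pictured with the following minimal sketch. It is not the implementation behind posix_trace_eventid_open(); the registry type, its capacity, and registry_open() are hypothetical and only show why independently written libraries that use the same name obtain the same identifier without coordination.

```c
#include <assert.h>
#include <string.h>

#define MAX_EVENT_TYPES 16   /* hypothetical capacity */
#define MAX_NAME_LEN    32   /* hypothetical name length limit, including NUL */

struct event_registry {
    char names[MAX_EVENT_TYPES][MAX_NAME_LEN];
    int  count;
};

/*
 * Return the identifier mapped to name, creating the mapping on first use.
 * Returns 0 on success, -1 if the name is too long or the table is full.
 */
static int registry_open(struct event_registry *reg, const char *name, int *id)
{
    assert(reg != NULL && name != NULL && id != NULL);
    if (strlen(name) >= MAX_NAME_LEN)
        return -1;
    for (int i = 0; i < reg->count; i++) {
        if (strcmp(reg->names[i], name) == 0) {
            *id = i;                        /* same name, same identifier */
            return 0;
        }
    }
    if (reg->count == MAX_EVENT_TYPES)
        return -1;
    strcpy(reg->names[reg->count], name);   /* length checked above */
    *id = reg->count++;
    return 0;
}
```

Two libraries that both open "my_first_event" against the same registry receive the same identifier, while a different name receives a new one.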

Trace Operation Control

When controlling the recording of trace events in a trace stream, the programmer is concerned with the correct initialization of the trace mechanism (that is, the sizing of the trace stream), the correct retention of trace events in a permanent storage, the correct dynamic recording of trace events, and so on.

This trace API provides the necessary material to permit this efficiently. This is done by providing functions to initialize a new trace stream, and optionally a trace log:
- Trace Stream Attributes Object Initialization (see posix_trace_attr_init())
- Functions to Retrieve or Set Information About a Trace Stream (see posix_trace_attr_getgenversion())
- Functions to Retrieve or Set the Behavior of a Trace Stream (see posix_trace_attr_getinherited())
- Functions to Retrieve or Set Trace Stream Size Attributes (see posix_trace_attr_getmaxusereventsize())
- Trace Stream Initialization, Flush, and Shutdown from a Process (see posix_trace_create())
- Clear Trace Stream and Trace Log (see posix_trace_clear())
To select the trace event types that are to be traced:
- Manipulate Trace Event Type Identifier (see posix_trace_trid_eventid_open())
- Iterate over a Mapping of Trace Event Type (see posix_trace_eventtypelist_getnext_id())
- Manipulate Trace Event Type Sets (see posix_trace_eventset_empty())
- Set Filter of an Initialized Trace Stream (see posix_trace_set_filter())
To control the execution of an active trace stream:
- Trace Start and Stop (see posix_trace_start())
- Functions to Retrieve the Trace Attributes or Trace Statuses (see posix_trace_get_attr())

Figure: Trace System Overview: for Online Analysis

Trace Analysis

Once correctly recorded, on permanent storage or not, an ultimate activity consists of the analysis of the recorded information. If the recorded data is on permanent storage, a specific open operation is required to associate a trace stream to a trace log.

The first intent of the group was to request the presence of a system identification structure in the trace stream attribute. This was, for the application, to allow some portable way to process the recorded information. However, there is no requirement that the utsname structure, on which this system identification was based, be portable from one machine to another, so the contents of the attribute cannot be interpreted correctly by an application conforming to IEEE Std 1003.1-2001.

This modification has been incorporated and requests that some unspecified information be recorded in the trace log in order to fail opening it if the analysis process and the controller process were running in different types of machine, but does not request that this information be accessible to the application. This modification has implied a modification in the posix_trace_open() function error code returns.

This trace API provides functions to:
- Extract trace stream identification attributes (see posix_trace_attr_getgenversion())
- Extract trace stream behavior attributes (see posix_trace_attr_getinherited())
- Extract trace event, stream, and log size attributes (see posix_trace_attr_getmaxusereventsize())
- Look up trace event type names (see posix_trace_eventid_get_name())
- Iterate over trace event type identifiers (see posix_trace_eventtypelist_getnext_id())
- Open, rewind, and close a trace log (see posix_trace_open())
- Read trace stream attributes and status (see posix_trace_get_attr())
- Read trace events (see posix_trace_getnext_event())

The trace system cannot return any internal error to the application, for the following two reasons:

- The trace system must not add unacceptable overhead to the traced process, so trace event point execution must be fast.
- The traced application does not care about tracing errors.

Internal error conditions can range from unrecoverable errors that will force the active trace stream to abort, to small errors that can affect the quality of tracing without aborting the trace stream. The group decided to define a system trace event to report such internal errors to the analysis process. It is not the intention of IEEE Std 1003.1-2001 to require an implementation to report an internal error that corrupts or terminates tracing operation. The implementor is free to decide which internal documented errors, if any, the trace system is able to report.

States of a Trace Stream

Figure: Trace System Overview: States of a Trace Stream

Trace System Overview: States of a Trace Stream shows the different states an active trace stream passes through. After the posix_trace_create() function call, a trace stream becomes CREATED and a trace stream is associated for the future collection of trace events. The status of the trace stream is POSIX_TRACE_SUSPENDED. The state becomes STARTED after a call to the posix_trace_start() function, and the status becomes POSIX_TRACE_RUNNING. In this state, all trace events that are not filtered out will be stored into the trace stream. After a call to posix_trace_stop(), the trace stream becomes STOPPED (and the status POSIX_TRACE_SUSPENDED). In this state, no new trace events will be recorded in the trace stream, but previously recorded trace events may continue to be read.

After a call to posix_trace_shutdown(), the trace stream is in the state COMPLETED. The trace stream no longer exists but, if the Trace Log option is supported, all the information contained in it has been logged. If a log object has not been associated with the trace stream at the creation, it is the responsibility of the trace controller process to not shut the trace stream down while trace events remain to be read in the stream.
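
The state transitions described above can be summarized as a small state machine. This is a sketch for illustration only: the enumerators are stand-ins for the states and for POSIX_TRACE_RUNNING and POSIX_TRACE_SUSPENDED, and treating a stop request on a stream that is not started as a no-op is an assumption of the sketch, not a statement from this rationale.

```c
#include <assert.h>
#include <stddef.h>

enum stream_state  { ST_CREATED, ST_STARTED, ST_STOPPED, ST_COMPLETED };
enum stream_status { STATUS_SUSPENDED, STATUS_RUNNING };  /* stand-ins for the POSIX_TRACE_* status values */
enum stream_op     { OP_START, OP_STOP, OP_SHUTDOWN };

struct stream_model { enum stream_state state; };

/* Only a started stream is running; every other state is reported as suspended. */
static enum stream_status status_of(const struct stream_model *s)
{
    assert(s != NULL);
    return s->state == ST_STARTED ? STATUS_RUNNING : STATUS_SUSPENDED;
}

/* Apply one control operation; returns 0 on success, -1 if the stream no longer exists. */
static int stream_apply(struct stream_model *s, enum stream_op op)
{
    assert(s != NULL);
    if (s->state == ST_COMPLETED)
        return -1;                      /* after shutdown the stream is gone */
    switch (op) {
    case OP_START:
        s->state = ST_STARTED;          /* from CREATED or STOPPED */
        return 0;
    case OP_STOP:
        if (s->state == ST_STARTED)
            s->state = ST_STOPPED;      /* assumption: otherwise a no-op */
        return 0;
    case OP_SHUTDOWN:
        s->state = ST_COMPLETED;
        return 0;
    }
    return -1;
}
```

The sequence create, start, stop, start, shutdown used in the programming examples maps onto CREATED, STARTED, STOPPED, STARTED, COMPLETED in this model.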

Tracing All Processes

Some implementations have a tracing subsystem with the ability to trace all processes. This is useful to debug some types of device drivers such as those for ATM or X.25 adapters. These types of adapters are used by several independent processes that are not descended from a common process.

The POSIX trace interface does not define any constant or option to create a trace stream tracing all processes. POSIX.1 does not prevent this type of implementation and an implementor is free to add this capability. Nevertheless, the trace interface allows tracing of all the system trace events and all the processes descended from the same process.

If such a tracing system capability has to be implemented, when a trace stream is created, it is recommended that a constant named POSIX_TRACE_ALLPROC be used instead of the process identifier in the argument of the posix_trace_create() or posix_trace_create_withlog() function. A possible value for POSIX_TRACE_ALLPROC may be -1 instead of a real process identifier.

The implementor has to be aware that there is some impact on the tracing behavior as defined in the POSIX trace interface. For example:

- If the default value for the inheritance attribute is set to POSIX_TRACE_CLOSE_FOR_CHILD, the implementation has to stop tracing for the child process.
- The trace controller which is creating this type of trace stream must have the appropriate privilege to trace all the processes.

Trace Storage

The model is based on two types of trace events: system trace events and user-defined trace events. The internal representation of trace events is implementation-defined, and so the implementor is free to choose the most suitable, practical, and efficient way to design the internal management of trace events. For the timestamping operation, the model does not impose the CLOCK_REALTIME or any other clock. The buffering allocation and operation follow the same principle. The implementor is free to use one or more buffers to record trace events; the interface assumes only a logical trace stream of sequentially recorded trace events. Regarding flushing of trace events, the interface allows the definition of a trace log object which typically can be a file. But the group was also aware of the need to define functions that permit the use of this interface in small realtime systems, which may not have general file system capabilities. For instance, the three functions posix_trace_getnext_event() (blocking), posix_trace_timedgetnext_event() (blocking with timeout), and posix_trace_trygetnext_event() (non-blocking) are proposed to read the recorded trace events.
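
For the read with timeout, the caller has to supply a deadline. The sketch below assumes the absolute-timeout convention, measured against CLOCK_REALTIME, that the posix_trace_timedgetnext_event() specification uses; the helper timespec_add_ms() is hypothetical and only shows how such a deadline might be computed.

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical helper: return base advanced by ms milliseconds, with
 * tv_nsec kept in the range [0, 1000000000).
 */
static struct timespec timespec_add_ms(struct timespec base, long ms)
{
    assert(ms >= 0 && base.tv_nsec >= 0 && base.tv_nsec < 1000000000L);
    base.tv_sec  += ms / 1000;
    base.tv_nsec += (ms % 1000) * 1000000L;
    if (base.tv_nsec >= 1000000000L) {  /* carry into the seconds field */
        base.tv_sec  += 1;
        base.tv_nsec -= 1000000000L;
    }
    return base;
}
```

A reader would typically obtain the current time with clock_gettime(CLOCK_REALTIME, &now), compute the deadline with timespec_add_ms(now, 250), and pass a pointer to the result as the timeout argument.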

The policy to be used when the trace stream becomes full also relies on common practice:

For an active trace stream, the POSIX_TRACE_LOOP trace stream policy permits automatic overrun (overwrite of oldest trace events) while waiting for some user-defined condition to cause tracing to stop. By contrast, the POSIX_TRACE_UNTIL_FULL trace stream policy requires the system to stop tracing when the trace stream is full. However, if the trace stream that is full is at least partially emptied by a call to the posix_trace_flush() function or by calls to the posix_trace_getnext_event() function, the trace system will automatically resume tracing.

If the Trace Log option is supported, the operation of the POSIX_TRACE_FLUSH policy is an extension of the POSIX_TRACE_UNTIL_FULL policy. The automatic free operation (by flushing to the associated trace log) is added.
If a log is associated with the trace stream and this log is a regular file, these policies also apply for the log. One more policy, POSIX_TRACE_APPEND, is defined to allow indefinite extension of the log. Since the log destination can be any device or pseudo-device, the implementation may not be able to manipulate the destination as required by IEEE Std 1003.1-2001. For this reason, the behavior of the log full policy may be unspecified depending on the trace log type.
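
The difference between the POSIX_TRACE_LOOP and POSIX_TRACE_UNTIL_FULL stream policies can be illustrated with a toy ring buffer. This is a sketch under simplifying assumptions (fixed-size integer events, a single reader, no log); the type ring_stream, the capacity, and the policy enumerators are hypothetical stand-ins and do not describe how any implementation stores trace events.

```c
#include <assert.h>
#include <stddef.h>

#define STREAM_CAPACITY 4            /* hypothetical stream size, in events */

enum full_policy { POLICY_LOOP, POLICY_UNTIL_FULL };

struct ring_stream {
    int    events[STREAM_CAPACITY];
    size_t head;                     /* index of the oldest event */
    size_t count;                    /* number of stored events */
    int    overrun;                  /* set once an old event is overwritten */
    enum full_policy policy;
};

/* Record one event; returns 1 if stored, 0 if dropped because the stream is full. */
static int stream_record(struct ring_stream *s, int event)
{
    assert(s != NULL);
    if (s->count == STREAM_CAPACITY) {
        if (s->policy == POLICY_UNTIL_FULL)
            return 0;                               /* tracing stops while full */
        s->head = (s->head + 1) % STREAM_CAPACITY;  /* LOOP: discard the oldest */
        s->count--;
        s->overrun = 1;
    }
    s->events[(s->head + s->count) % STREAM_CAPACITY] = event;
    s->count++;
    return 1;
}

/* Read and remove the oldest event; returns 1 on success, 0 if empty. */
static int stream_read(struct ring_stream *s, int *event)
{
    assert(s != NULL && event != NULL);
    if (s->count == 0)
        return 0;
    *event = s->events[s->head];
    s->head = (s->head + 1) % STREAM_CAPACITY;
    s->count--;
    return 1;
}
```

With the loop policy the newest events always survive at the cost of an overrun; with the until-full policy the oldest events survive, and recording resumes as soon as a read frees a slot.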

The current trace interface does not define a service to preallocate space for a trace log file, because this space can be preallocated by means of a call to the posix_fallocate() function. This function could be called after the file has been opened, but before the trace stream is created. The posix_fallocate() function ensures that any required storage for regular file data is allocated on the file system storage media. If posix_fallocate() returns successfully, subsequent writes to the specified file data will not fail due to the lack of free space on the file system storage media. Besides trace events, a trace stream also includes trace attributes and the mapping from trace event names to trace event type identifiers. The implementor is free to choose how to store the trace attributes and the trace event type map, but must ensure that this information is not lost when a trace stream overrun occurs.
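
Preallocating the log file before the trace stream is created might look like the following sketch. The function name open_preallocated_log() and its parameters are hypothetical; only open() and posix_fallocate() are standard interfaces here, and the descriptor returned is what a controller would subsequently hand to posix_trace_create_withlog().

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open (creating if necessary) a file intended to become a trace log and
 * reserve size bytes of storage for it. Returns the file descriptor, or
 * -1 on failure. Nothing here is mandated by the trace interface.
 */
static int open_preallocated_log(const char *path, off_t size)
{
    assert(path != NULL && size > 0);
    int fd = open(path, O_WRONLY | O_CREAT, S_IRUSR | S_IWUSR);
    if (fd == -1)
        return -1;
    /* posix_fallocate() returns an error number directly; it does not set errno. */
    if (posix_fallocate(fd, 0, size) != 0) {
        close(fd);
        return -1;
    }
    return fd;   /* ready to be associated with a new trace stream */
}
```

Because the storage is reserved up front, later writes to the reserved range of the log cannot fail for lack of free space.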

Trace Programming Examples

Several programming examples are presented to show the code of the different possible subfunctions using a trace subsystem. All these programs need to include the <trace.h> header. In the examples shown, error checking is omitted for simplicity.

Trace Operation Control

These examples show the creation of a trace stream for another process; one which is already trace instrumented. All the default trace stream attributes are used to simplify programming in the first example. The second example shows more possibilities.

First Example

/* Caution. Error checks omitted */
{
    trace_attr_t attr;
    pid_t pid = traced_process_pid;
    int fd;
    trace_id_t trid;


    - - - - - -
    /* Initialize trace stream attributes */
    posix_trace_attr_init(&attr);
    /* Open a trace log */
    fd=open("/tmp/mytracelog",...);
    /*
     * Create a new trace associated with a log
     * and with default attributes
     */


    posix_trace_create_withlog(pid, &attr, fd, &trid);


    /* Trace attribute structure can now be destroyed */
    posix_trace_attr_destroy(&attr);
    /* Start of trace event recording */
    posix_trace_start(trid);
    - - - - - -
    - - - - - -
    /* Duration of tracing */
    - - - - - -
    - - - - - -
    /* Stop and shutdown of trace activity */
    posix_trace_shutdown(trid);
    - - - - - -
}

Second Example

Between the initialization of the trace stream attributes and the creation of the trace stream, these trace stream attributes may be modified; see Trace Stream Attribute Manipulation for a specific programming example. Between the creation and the start of the trace stream, the event filter may be set; after the trace stream is started, the event filter may be changed. The setting of an event set and the change of a filter is shown in Create a Trace Event Type Set and Change the Trace Event Type Filter.

/* Caution. Error checks omitted */
{
    trace_attr_t attr;
    pid_t pid = traced_process_pid;
    int fd;
    trace_id_t trid;
    - - - - - -
    /* Initialize trace stream attributes */
    posix_trace_attr_init(&attr);
    /* Attr default may be changed at this place; see example */
    - - - - - -
    /* Create and open a trace log with R/W user access */
    fd=open("/tmp/mytracelog",O_WRONLY|O_CREAT,S_IRUSR|S_IWUSR);
    /* Create a new trace associated with a log */
    posix_trace_create_withlog(pid, &attr, fd, &trid);
    /*
     * If the Trace Event Filter option is supported
     * trace event type filter default may be changed at this place;
     * see example about changing the trace event type filter
     */
    posix_trace_start(trid);
    - - - - - -


    /*
     * If you have an uninteresting part of the application
     * you can stop temporarily.
     *
     * posix_trace_stop(trid);
     * - - - - - -
     * - - - - - -
     * posix_trace_start(trid);
     */
    - - - - - -
    /*
     * If the Trace Event Filter option is supported
     * the current trace event type filter can be changed
     * at any time (see example about how to set
     * a trace event type filter)
     */
    - - - - - -


    /* Stop the recording of trace events */
    posix_trace_stop(trid);
    /* Shutdown the trace stream */
    posix_trace_shutdown(trid);
    /*
     * Destroy trace stream attributes; attr structure may have
     * been used during tracing to fetch the attributes
     */
    posix_trace_attr_destroy(&attr);
    - - - - - -
}

Application Instrumentation

This example shows an instrumented application. The code is included in a block of instructions, perhaps a function from a library. Possibly in an initialization part of the instrumented application, two user trace event names are mapped to two trace event type identifiers (function posix_trace_eventid_open()). Then two trace points are programmed.

/* Caution. Error checks omitted */
{
    trace_event_id_t eventid1, eventid2;
    - - - - - -
    /* Initialization of two trace event type ids */
    posix_trace_eventid_open("my_first_event",&eventid1);
    posix_trace_eventid_open("my_second_event",&eventid2);
    - - - - - -
    - - - - - -
    - - - - - -
    /* Trace point */
    posix_trace_event(eventid1,NULL,0);
    - - - - - -
    /* Trace point */
    posix_trace_event(eventid2,NULL,0);
    - - - - - -
}

Trace Analyzer

This example shows the manipulation of a trace log resulting from the dumping of a completed trace stream. All the default attributes are used to simplify programming, and data associated with a trace event is not shown in the first example. The second example shows more possibilities.

First Example

/* Caution. Error checks omitted */
{
    int fd;
    trace_id_t trid;
    struct posix_trace_event_info trace_event;
    char trace_event_name[TRACE_EVENT_NAME_MAX];
    int return_value;
    size_t returndatasize;
    int lost_event_number;


    - - - - - -


    /* Open an existing trace log */
    fd=open("/tmp/tracelog", O_RDONLY);
    /* Open a trace stream on the open log */
    posix_trace_open(fd, &trid);
    /* Read a trace event */
    posix_trace_getnext_event(trid, &trace_event,
        NULL, 0, &returndatasize,&return_value);


    /* Read and print all trace event names out in a loop */
    while (return_value == 0)
    {
        /*
         * Get the name of the trace event associated
         * with trid trace ID
         */
        posix_trace_eventid_get_name(trid, trace_event.posix_event_id,
            trace_event_name);
        /* Print the trace event name out */
        printf("%s\n",trace_event_name);
        /* Read a trace event */
        posix_trace_getnext_event(trid, &trace_event,
            NULL, 0, &returndatasize,&return_value);
    }


    /* Close the trace stream */
    posix_trace_close(trid);
    /* Close the trace log */
    close(fd);
}

Second Example

The complete example includes the two other examples in Retrieve Information from a Trace Log and in Retrieve the List of Trace Event Types Used in a Trace Log. For example, the maxdatasize variable is set in Retrieve the List of Trace Event Types Used in a Trace Log.

/* Caution. Error checks omitted */
{
    int fd;
    trace_id_t trid;
    struct posix_trace_event_info trace_event;
    char trace_event_name[TRACE_EVENT_NAME_MAX];
    char * data;
    size_t maxdatasize=1024, returndatasize;
    int return_value;
    - - - - - -


    /* Open an existing trace log */
    fd=open("/tmp/tracelog", O_RDONLY);
    /* Open a trace stream on the open log */
    posix_trace_open( fd, &trid);
    /*
     * Retrieve information about the trace stream which
     * was dumped in this trace log (see example)
     */
    - - - - - -


    /* Allocate a buffer for trace event data */
    data=(char *)malloc(maxdatasize);
    /*
     * Retrieve the list of trace events used in this
     * trace log (see example)
     */
    - - - - - -


    /* Read and print all trace event names and data out in a loop */
    while (1)
    {
        posix_trace_getnext_event(trid, &trace_event,
            data, maxdatasize, &returndatasize,&return_value);
        if (return_value != 0) break;
        /*
         * Get the name of the trace event type associated
         * with trid trace ID
         */
        posix_trace_eventid_get_name(trid, trace_event.posix_event_id,
            trace_event_name);
        {
        int i;


        /* Print the trace event name out */
        printf("%s: ", trace_event_name);
        /* Print the trace event data out */
        for (i=0; i<returndatasize; i++) printf("%02.2X",
            (unsigned char)data[i]);
        printf("\n");
        }
    }


    /* Close the trace stream */
    posix_trace_close(trid);
    /* The buffer data is deallocated */
    free(data);
    /* Now the file can be closed */
    close(fd);
}

Several Programming Manipulations

The following examples show some typical sets of operations needed in some contexts.

Trace Stream Attribute Manipulation

This example shows the manipulation of a trace stream attribute object in order to change the default value provided by a previous posix_trace_attr_init() call.

/* Caution. Error checks omitted */
{
    trace_attr_t attr;
    size_t logsize=100000;
    - - - - - -
    /* Initialize trace stream attributes */
    posix_trace_attr_init(&attr);
    /* Set the trace name in the attributes structure */
    posix_trace_attr_setname(&attr, "my_trace");
    /* Set the trace full policy */
    posix_trace_attr_setstreamfullpolicy(&attr, POSIX_TRACE_LOOP);
    /* Set the trace log size */
    posix_trace_attr_setlogsize(&attr, logsize);
    - - - - - -
}

Create a Trace Event Type Set and Change the Trace Event Type Filter

This example is valid only if the Trace Event Filter option is supported. This example shows the manipulation of a trace event type set in order to change the trace event type filter for an existing active trace stream, which may be just-created, running, or suspended. Some sets of trace event types are well-known, such as the set of trace event types not associated with a process; some trace event types are just-built trace event types for this trace stream; and one trace event type is the predefined trace event error type, which is deleted from the trace event type set.

/* Caution. Error checks omitted */
{
    trace_id_t trid = existing_trace;
    trace_event_set_t set;
    trace_event_id_t trace_event1, trace_event2;
    - - - - - -
    /* Initialize to an empty set of trace event types */
    /* (not strictly required because posix_trace_eventset_fill() */
    /* will ignore the prior contents of the event set.) */
    posix_trace_eventset_empty(&set);
    /*
     * Fill the set with all system trace events
     * not associated with a process
     */
    posix_trace_eventset_fill(&set, POSIX_TRACE_WOPID_EVENTS);


    /*
     * Get the trace event type identifier of the known trace event name
     * my_first_event for the trid trace stream
     */
    posix_trace_trid_eventid_open(trid, "my_first_event", &trace_event1);
    /* Add this trace event type identifier to the set */
    posix_trace_eventset_add(trace_event1, &set);
    /*
     * Get the trace event type identifier of the known trace event name
     * my_second_event for the trid trace stream
     */


    posix_trace_trid_eventid_open(trid, "my_second_event", &trace_event2);
    /* Add this trace event type identifier to the set */
    posix_trace_eventset_add(trace_event2, &set);
    - - - - - -
    /* Delete the system trace event POSIX_TRACE_ERROR from the set */
    posix_trace_eventset_del(POSIX_TRACE_ERROR, &set);
    - - - - - -


    /* Modify the trace stream filter making it equal to the new set */
    posix_trace_set_filter(trid, &set, POSIX_TRACE_SET_EVENTSET);
    - - - - - -
    /*
     * Now trace_event1, trace_event2, and all system trace event types
     * not associated with a process, except for the POSIX_TRACE_ERROR
     * system trace event type, are filtered out of (not recorded in) the
     * existing trace stream.
     */
}
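
The relationship between an event type set and the stream filter can be modeled with a bitmask. The sketch below is illustrative only: event_set, the 32-identifier limit, and the HOW_* enumerators are hypothetical stand-ins for trace_event_set_t and for the POSIX_TRACE_SET_EVENTSET, POSIX_TRACE_ADD_EVENTSET, and POSIX_TRACE_SUB_EVENTSET values accepted by posix_trace_set_filter().

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t event_set;      /* hypothetical: one bit per event type, ids 0..31 */

enum filter_how { HOW_SET, HOW_ADD, HOW_SUB };

static event_set set_add(event_set s, unsigned id)
{
    assert(id < 32);
    return s | (UINT32_C(1) << id);
}

static event_set set_del(event_set s, unsigned id)
{
    assert(id < 32);
    return s & ~(UINT32_C(1) << id);
}

static int set_ismember(event_set s, unsigned id)
{
    assert(id < 32);
    return (s >> id) & 1u;
}

/* Compute the new filter from the current one according to the requested mode. */
static event_set apply_filter(event_set current, event_set set, enum filter_how how)
{
    switch (how) {
    case HOW_SET: return set;             /* replace the filter */
    case HOW_ADD: return current | set;   /* filter out these types as well */
    case HOW_SUB: return current & ~set;  /* stop filtering these types */
    }
    return current;
}

/* An event type that is a member of the filter is not recorded. */
static int is_recorded(event_set filter, unsigned id)
{
    return !set_ismember(filter, id);
}
```

As in the example above, membership in the filter means "filtered out": deleting the error event type from the set before installing it keeps error events in the stream.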

Retrieve Information from a Trace Log

This example shows how to extract information from a trace log, the dump of a trace stream. This code:

- Asks if the trace stream has lost trace events
- Extracts the information about the version of the trace subsystem which generated this trace log
- Retrieves the maximum size of trace event data; this may be used to dynamically allocate an array for extracting trace event data from the trace log without overflow

/* Caution. Error checks omitted */
{
    struct posix_trace_status_info statusinfo;
    trace_attr_t attr;
    trace_id_t trid = existing_trace;
    size_t maxdatasize;
    char genversion[TRACE_NAME_MAX];
    - - - - - -
    /* Get the trace stream status */
    posix_trace_get_status(trid, &statusinfo);
    /* Detect an overrun condition */
    if (statusinfo.posix_stream_overrun_status == POSIX_TRACE_OVERRUN)
        printf("trace events have been lost\n");


    /* Get attributes from the trid trace stream */
    posix_trace_get_attr(trid, &attr);
    /* Get the trace generation version from the attributes */
    posix_trace_attr_getgenversion(&attr, genversion);
    /* Print the trace generation version out */
    printf("Information about Trace Generator:%s\n",genversion);


    /* Get the trace event max data size from the attributes */
    posix_trace_attr_getmaxdatasize(&attr, &maxdatasize);
    /* Print the trace event max data size out */
    printf("Maximum size of associated data:%zu\n",maxdatasize);
    /* Destroy the trace stream attributes */
    posix_trace_attr_destroy(&attr);
}

Retrieve the List of Trace Event Types Used in a Trace Log

This example shows the retrieval of a trace stream's trace event type list. This operation may be very useful if you are interested only in tracking the type of trace events in a trace log.

/* Caution. Error checks omitted */
{
    trace_id_t trid = existing_trace;
    trace_event_id_t event_id;
    char event_name[TRACE_EVENT_NAME_MAX];
    int return_value;
    - - - - - -


    /*
     * In a loop print all existing trace event names out
     * for the trid trace stream
     */
    while (1)
    {
        posix_trace_eventtypelist_getnext_id(trid, &event_id,
            &return_value);
        /* A non-zero value means no more trace event types are available */
        if (return_value != 0) break;
        /*
         * Get the name of the trace event associated
         * with trid trace ID
         */
        posix_trace_eventid_get_name(trid, event_id, event_name);
        /* Print the name out */
        printf("%s\n", event_name);
    }
}

Rationale on Trace for Debugging

Figure: Trace Another Process

Among the different possibilities offered by the trace interface defined in IEEE Std 1003.1-2001, the debugging of an application is the most interesting one. Typical operations in the controlling debugger process are to filter trace event types, to get trace events from the trace stream, to stop the trace stream when the debugged process is executing uninteresting code, to start the trace stream when some interesting point is reached, and so on. The interface defined in IEEE Std 1003.1-2001 should define all the necessary base functions to allow this dynamic debug handling.

Trace Another Process shows an example in which the trace stream is created after the call to the fork() function. If the user does not want to lose trace events, some synchronization mechanism (represented in the figure) may be needed before calling the exec function, to give the parent a chance to create the trace stream before the child begins the execution of its trace points.
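
The synchronization mentioned above can be sketched with a pipe: the child blocks on a read until the parent has finished its setup and writes a single byte. This is only one possible mechanism, not one mandated by IEEE Std 1003.1-2001. The function name run_child_after_parent_setup() and the parent_setup callback are hypothetical, and the child exits where a real program would call an exec function.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that waits on a pipe until the
 * parent has finished its setup (for example, creating the trace
 * stream), and only then reaches the point where it would call exec.
 * Returns the child's exit status, or -1 on error.
 */
int run_child_after_parent_setup(void (*parent_setup)(pid_t child))
{
    int gate[2];
    int status;
    char token = 'g';
    pid_t pid;

    if (pipe(gate) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(gate[0]);
        close(gate[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: block until the parent writes the go-ahead byte. */
        close(gate[1]);
        if (read(gate[0], &token, 1) != 1)
            _exit(1);
        close(gate[0]);
        /* A real child would call an exec function here. */
        _exit(0);
    }

    /* Parent: prepare tracing for the child, then release it. */
    close(gate[0]);
    if (parent_setup != NULL)
        parent_setup(pid);
    if (write(gate[1], &token, 1) != 1) {
        close(gate[1]);
        waitpid(pid, &status, 0);
        return -1;
    }
    close(gate[1]);

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass a function that creates the trace stream for the given process ID; passing NULL simply releases the child immediately.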

Rationale on Trace Event Type Name Space

At first, the working group was in favor of the representation of a trace event type by an integer (event_name). It seems that existing practice shows the weakness of such a representation. The collision of trace event types is the main problem that cannot be simply resolved using this sort of representation. Suppose, for example, that a third party designs an instrumented library. The user does not have the source of this library and wants to trace an application that uses the third-party library in some part. The user has no means of knowing which trace event types the instrumented library uses, so there is some chance of duplicating some of them and thus obtaining a contaminated trace of the application.

Figure: Trace Name Space Overview: With Third-Party Library

There are requirements to allow program images containing pieces from various vendors to be traced without requiring those vendors to coordinate their uses of the trace facility, and especially the naming of their various trace event types and trace point IDs. The chosen solution is to provide a very large name space, large enough so that the individual vendors can give their trace event types and trace point IDs sufficiently long and descriptive names, making the occurrence of collisions quite unlikely. The probability of collision is thus made sufficiently low so that the problem may, as a practical matter, be ignored. By requirement, the consequence of collisions will be a slight ambiguity in the trace streams; tracing will continue in spite of collisions and ambiguities. "The show must go on". The posix_prog_address member of the posix_trace_event_info structure is used to allow trace streams to be unambiguously interpreted, despite the fact that trace event types and trace event names need not be unique.

The posix_trace_eventid_open() function is required to allow the instrumented third-party library to get a valid trace event type identifier for its trace event names. This operation is, in effect, an allocation, and the group considered proposing a deallocation mechanism that the instrumented application could use to recover the resources used by a trace event type identifier. This would have allowed the instrumented application to reuse a minimal set of trace event type identifiers, but at the cost of possibly having, in the same trace stream, one trace event type identifier identifying two different trace event types. After some discussion, the group decided not to define such a function, which would make this API thicker for little benefit, since the user always has the possibility of adding identification information in the data member of the trace event structure.

The set of the trace event type identifiers the controlling process wants to filter out is initialized in the trace mechanism using the function posix_trace_set_filter(), setting the arguments according to the definitions explained in posix_trace_set_filter(). This operation can be done statically (when the trace is in the STOPPED state) or dynamically (when the trace is in the STARTED state). The preparation of the filter is normally done using the function defined in posix_trace_eventtypelist_getnext_id() and, if needed, the function posix_trace_eventtypelist_rewind(), in order to know (before the recording) the list of the potential set of trace event types that can be recorded. In the case of an active trace stream, this list may not be exhaustive, because the target process may not yet have called the function posix_trace_eventid_open(). However, it is common practice for a controlling process to prepare the filtering of a future trace stream before its start. Therefore the user must have a way to get the trace event type identifier corresponding to a well-known trace event name before its future association by posix_trace_eventid_open(). This is done by calling the posix_trace_trid_eventid_open() function, given the trace stream identifier and the trace event name, as described hereafter. Because this trace event type identifier is associated with a trace stream identifier, where a unique process has initialized two or more traces, the implementation is expected to return the same trace event type identifier for successive calls to posix_trace_trid_eventid_open() with different trace stream identifiers. The posix_trace_eventid_get_name() function is used by the controller process to identify, by the name, the trace event type returned by a call to the posix_trace_eventtypelist_getnext_id() function.

Afterwards, the set of trace event types is constructed using the functions defined in posix_trace_eventset_empty(), posix_trace_eventset_fill(), posix_trace_eventset_add(), and posix_trace_eventset_del().

A set of functions is provided devoted to the manipulation of the trace event type identifier and names for an active trace stream. All these functions require the trace stream identifier argument as the first parameter. The opacity of the trace event type identifier implies that the user cannot associate directly its well-known trace event name with the system-associated trace event type identifier.

The posix_trace_trid_eventid_open() function allows the application to get the system trace event type identifier back from the system, given its well-known trace event name. This function is useful only when a controlling process needs to specify specific events to be filtered.

The posix_trace_eventid_get_name() function allows the application to obtain a trace event name given its trace event type identifier. One possible use of this function is to identify the type of a trace event retrieved from the trace stream, and print it. The easiest way to implement this requirement is to use a single trace event type map for all the processes whose maps are required to be identical. A more difficult way is to attempt to keep multiple maps identical at every call to posix_trace_eventid_open() and posix_trace_trid_eventid_open().
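
The single-map approach can be illustrated with a small table that hands out one identifier per distinct name and answers the reverse lookup. This is a minimal sketch, not the implementation of any real trace subsystem: the demo_ names, the use of int as the identifier type, and the fixed capacities are all assumptions made for illustration.

```c
#include <string.h>

#define DEMO_MAX_EVENTS 32   /* hypothetical table capacity */
#define DEMO_NAME_MAX   64   /* hypothetical name length limit */

/* One shared table maps each trace event name to one identifier. */
static char demo_names[DEMO_MAX_EVENTS][DEMO_NAME_MAX];
static int  demo_count;

/*
 * Return the identifier for name, allocating a new one on first use.
 * Returns 0 on success, or -1 if the name is too long or the table
 * is full.
 */
int demo_eventid_open(const char *name, int *id)
{
    int i;

    if (strlen(name) >= DEMO_NAME_MAX)
        return -1;
    for (i = 0; i < demo_count; i++) {
        if (strcmp(demo_names[i], name) == 0) {
            *id = i;        /* same name, same identifier */
            return 0;
        }
    }
    if (demo_count == DEMO_MAX_EVENTS)
        return -1;
    strcpy(demo_names[demo_count], name);
    *id = demo_count++;
    return 0;
}

/* Reverse lookup: identifier to name, or NULL if the id is unknown. */
const char *demo_eventid_get_name(int id)
{
    return (id >= 0 && id < demo_count) ? demo_names[id] : NULL;
}
```

Because every lookup goes through the same table, two components that open the same name receive the same identifier, which is the ambiguity the rationale accepts in exchange for a large name space.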

Rationale on Trace Events Type Filtering

The most basic rationale for runtime and pre-registration filtering (selection/rejection) of trace event types is to prevent choking of the trace collection facility, and/or overloading of the computer system. Any worthwhile trace facility can bring even the largest computer to its knees. Otherwise, everything would be recorded and filtered after the fact; it would be much simpler, but impractical.

To achieve debugging, measurement, or whatever the purpose of tracing, the filtering of trace event types is an important part of trace analysis. Because the trace events are put into a trace stream and probably logged afterwards into a file, different levels of filtering (that is, rejection of trace event types) are possible.

Filtering of Trace Event Types Before Tracing

This function, represented by the posix_trace_set_filter() function in IEEE Std 1003.1-2001 (see posix_trace_set_filter()), selects, before or during tracing, the set of trace event types to be filtered out. It should be possible also (as OSF suggested in their ETAP trace specifications) to select the kernel trace event types to be traced in a system-wide fashion. These two functionalities are called the pre-filtering of trace event types.

The restriction on the actual type used for the trace_event_set_t type is intended to guarantee that these objects can always be assigned, have their address taken, and be passed by value as parameters. It is not intended that this type be a structure including pointers to other data structures, as that could impact the portability of applications performing such operations. A reasonable implementation could be a structure containing an array of integer types.
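
A minimal sketch of such a structure follows, assuming a fixed upper bound on the number of trace event types. The demo_ names, the bound of 256, and the use of unsigned int for identifiers are illustrative assumptions; a real trace_event_set_t is opaque and implementation-defined.

```c
#include <limits.h>
#include <string.h>

#define DEMO_EVENT_MAX 256u  /* hypothetical number of event types */
#define DEMO_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS     ((DEMO_EVENT_MAX + DEMO_WORD_BITS - 1) / DEMO_WORD_BITS)

/* No pointers inside, so the set can be assigned and passed by value. */
typedef struct {
    unsigned long bits[DEMO_WORDS];
} demo_event_set_t;

void demo_set_empty(demo_event_set_t *set)
{
    memset(set->bits, 0, sizeof set->bits);
}

/* Add id to the set; returns 0, or -1 if id is out of range. */
int demo_set_add(unsigned int id, demo_event_set_t *set)
{
    if (id >= DEMO_EVENT_MAX)
        return -1;
    set->bits[id / DEMO_WORD_BITS] |= 1UL << (id % DEMO_WORD_BITS);
    return 0;
}

/* Remove id from the set; returns 0, or -1 if id is out of range. */
int demo_set_del(unsigned int id, demo_event_set_t *set)
{
    if (id >= DEMO_EVENT_MAX)
        return -1;
    set->bits[id / DEMO_WORD_BITS] &= ~(1UL << (id % DEMO_WORD_BITS));
    return 0;
}

/* Returns 1 if id is in the set, 0 otherwise. */
int demo_set_ismember(unsigned int id, const demo_event_set_t *set)
{
    if (id >= DEMO_EVENT_MAX)
        return 0;
    return ((set->bits[id / DEMO_WORD_BITS] >> (id % DEMO_WORD_BITS)) & 1UL) != 0;
}
```

Since the structure holds only an array of integers, plain assignment copies the whole set, and later changes to one copy do not affect the other.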

Filtering of Trace Event Types at Runtime

It is possible to build this functionality using the posix_trace_set_filter() function. A privileged process or a privileged thread can get trace events from the trace stream of another process or thread, and thus specify the type of trace events to record into a file, using implementation-defined methods and interfaces. This functionality, called inline filtering of trace event types, is used for runtime analysis of trace streams.

Post-Mortem Filtering of Trace Event Types

The word "post-mortem" is used here to indicate that some unanticipated situation occurs during execution that does not permit a pre or inline filtering of trace events and that it is necessary to record all trace event types to have a chance to discover the problem afterwards. When the program stops, all the trace events recorded previously can be analyzed in order to find the solution. This functionality could be named the post-filtering of trace event types.

Discussions about Trace Event Type-Filtering

After long discussions with the parties involved in the process of defining the trace interface, it seems that their sensitivity to the filtering problem differs, but everybody agrees that the level of the overhead introduced during the tracing operation depends on the filtering method selected. If the time that it takes the trace event to be recorded can be neglected, the overhead introduced by the filtering process can be classified as follows:

Pre-filtering: System and process/thread-level overhead
Inline-filtering: Process/thread-level overhead
Post-filtering: No overhead; done offline

The pre-filtering could be named "critical realtime" filtering in the sense that the filtering of trace event types is manageable at the user level, so the user can lower the filtering overhead to a minimum at some user-selected level of priority for the inline filtering, or delay the filtering until after execution for the post-filtering. The counterpart of this solution is that the size of the trace stream must be sufficient to record all the trace events. The advantage of the pre-filtering is that the utilization of the trace stream is optimized.

Only pre-filtering is defined by IEEE Std 1003.1-2001. However, great care must be taken in specifying pre-filtering, so that it does not impose unacceptable overhead. Moreover, it is necessary to isolate all the functionality relative to the pre-filtering.

The result of this rationale is to define a new option, the Trace Event Filter option, not necessarily implemented in small realtime systems, where system overhead is minimized to the extent possible.

Tracing, pthread API

The objective to be able to control tracing for individual threads may be in conflict with the efficiency expected in threads with a contentionscope attribute of PTHREAD_SCOPE_PROCESS. For these threads, context switches from one thread that has tracing enabled to another thread that has tracing disabled may require a kernel call to inform the kernel whether it has to trace system events executed by that thread or not. For this reason, it was proposed that the ability to enable or disable tracing for PTHREAD_SCOPE_PROCESS threads be made optional, through the introduction of a Trace Scope Process option. A trace implementation which did not implement the Trace Scope Process option would not honor the tracing-state attribute of a thread with PTHREAD_SCOPE_PROCESS; it would, however, honor the tracing-state attribute of a thread with PTHREAD_SCOPE_SYSTEM. This proposal was rejected as:

Removing desired functionality (per-thread trace control)
Introducing counter-intuitive behavior for the tracing-state attribute
Mixing logically orthogonal ideas (thread scheduling and thread tracing)
[Objective 4]

Finally, to solve this complex issue, this API does not provide pthread_gettracingstate(), pthread_settracingstate(), pthread_attr_gettracingstate(), and pthread_attr_settracingstate() interfaces. These interfaces force the thread implementation to add to the weight of the thread and cause a revision of the threads libraries, just to support tracing. Worse yet, posix_trace_event() must always test this per-thread variable even in the common case where it is not used at all. Per-thread tracing is easy to implement using existing interfaces where necessary; see the following example.

Example

/* Caution. Error checks omitted */
#include <pthread.h>
#include <trace.h>

static pthread_key_t my_key;
static trace_event_id_t my_event_id;
static pthread_once_t my_once = PTHREAD_ONCE_INIT;


void my_init(void)
{
    (void) pthread_key_create(&my_key, NULL);
    (void) posix_trace_eventid_open("my", &my_event_id);
}


int get_trace_flag(void)
{
    pthread_once(&my_once, my_init);
    return (pthread_getspecific(my_key) != NULL);
}


void set_trace_flag(int f)
{
    pthread_once(&my_once, my_init);
    pthread_setspecific(my_key, f ? &my_event_id : NULL);
}


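void fn(void)
{
    /* Generate the trace event only if this thread has tracing enabled */
    if (get_trace_flag())
        posix_trace_event(my_event_id, ...);    /* data arguments elided */
}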

The above example does not implement third-party state setting.

Lastly, per-thread tracing works poorly for threads with PTHREAD_SCOPE_PROCESS contention scope. These "library" threads have minimal interaction with the kernel and would have to explicitly set the attributes whenever they are context switched to a new kernel thread in order to trace system events. Such state was explicitly avoided in POSIX threads to keep PTHREAD_SCOPE_PROCESS threads lightweight.

The reason that keeping PTHREAD_SCOPE_PROCESS threads lightweight is important is that such threads can be used not just for simple multi-processors but also for co-routine style programming (such as discrete event simulation) without inventing a new threads paradigm. Adding extra runtime cost to thread context switches will make using POSIX threads less attractive in these situations.

Rationale on Triggering

The ability to start or stop tracing based on the occurrence of specific trace event types has been proposed as a parallel to similar functionality appearing in logic analyzers. Such triggering, in order to be very useful, should be based not only on the trace event type, but on trace event-specific data, including tests of user-specified fields for matching or threshold values.

Such a facility is unnecessary where the buffering of the stream is not a constraint, since such checks can be performed offline during post-mortem analysis.

For example, a large system could incorporate a daemon utility to collect the trace records from memory buffers and spool them to secondary storage for later analysis. In the instances where resources are truly limited, such as embedded applications, incorporating application code to test the circumstances of a trace event and call the trace point only if needed is usually straightforward.

For performance reasons, the posix_trace_event() function should be implemented using a macro, so if the trace is inactive, the trace event point calls are latent code and must cost no more than a scalar test.
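
The idea of a trace point that costs only a scalar test when tracing is inactive can be sketched as follows. This is an illustration under stated assumptions, not how any particular implementation defines posix_trace_event(): the flag demo_trace_active, the recorder demo_trace_record(), and the macro DEMO_TRACE_EVENT() are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical flag kept by the trace subsystem: non-zero while running. */
static volatile int demo_trace_active;

/* Counts recorded events; stands in for writing into a trace stream. */
static unsigned long demo_events_recorded;

/* Hypothetical out-of-line recorder doing the expensive work. */
static void demo_trace_record(int event_id, const void *data, size_t len)
{
    (void) event_id;
    (void) data;
    (void) len;
    demo_events_recorded++;
}

/* When tracing is inactive, the trace point costs one scalar test. */
#define DEMO_TRACE_EVENT(id, data, len)                 \
    do {                                                \
        if (demo_trace_active)                          \
            demo_trace_record((id), (data), (len));     \
    } while (0)

/* An instrumented function: the trace point stays latent until enabled. */
void demo_work(void)
{
    DEMO_TRACE_EVENT(1, NULL, 0);
}
```

The do/while wrapper lets the macro be used like a function call statement, including inside an unbraced if/else.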

The API proposed in IEEE Std 1003.1-2001 does not include any triggering functionality.

Rationale on Timestamp Clock

It has been suggested that the tracing mechanism should include the possibility of specifying the clock to be used in timestamping the trace events. When application trace events must be correlated to remote trace events, such a facility could provide a global time reference not available from a local clock. Further, the application may be driven by timers based on a clock different from that used for the timestamp, and the correlation of the trace to those untraced timer activities could be an important part of the analysis of the application.

However, the tracing mechanism needs to be fast and just the provision of such an option can materially affect its performance. Leaving aside the performance costs of reading some clocks, this notion is also ill-defined when kernel trace events are to be traced by two applications making use of different tracing clocks. This can even happen within a single application where different parts of the application are served by different clocks. Another complication can occur when a clock is maintained strictly at the user level and is unavailable at the kernel level.

It is felt that the benefits of a selectable trace clock do not match its costs. Applications that wish to correlate clocks other than the default tracing clock can include trace events with sample values of those other clocks, allowing correlation of timestamps from the various independent clocks. In any case, such a technique would be required when applications are sensitive to multiple clocks.
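
One way to produce such sample values is to read the clocks of interest back to back and carry the readings as the data of a user trace event. The sketch below only gathers the sample; the structure demo_clock_sample and the function demo_sample_clocks() are hypothetical, and CLOCK_MONOTONIC is assumed to be available (it belongs to the Monotonic Clock option).

```c
#define _POSIX_C_SOURCE 200112L
#include <time.h>

/* Hypothetical payload for a user trace event carrying clock samples. */
struct demo_clock_sample {
    struct timespec realtime;   /* CLOCK_REALTIME reading */
    struct timespec monotonic;  /* CLOCK_MONOTONIC reading */
};

/* Read both clocks as close together as possible; 0 on success, -1 on error. */
int demo_sample_clocks(struct demo_clock_sample *sample)
{
    if (clock_gettime(CLOCK_REALTIME, &sample->realtime) == -1)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &sample->monotonic) == -1)
        return -1;
    return 0;
}
```

An application would then pass the address and size of the filled structure as the data arguments of posix_trace_event(), so that an analyzer can relate the trace timestamp of that event to the sampled clocks.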

Rationale on Different Overrun Conditions

The analysis of the dynamic behavior of the trace mechanism shows that different overrun conditions may occur. The API must provide a means to manage such conditions in a portable way.

Overrun in Trace Streams Initialized with POSIX_TRACE_LOOP Policy

In this case, the user of the trace mechanism is interested in using the trace stream with POSIX_TRACE_LOOP policy to record trace events continuously, but ideally without losing any trace events. The online analyzer process must get the trace events at a mean speed equivalent to the recording speed. Should the trace stream become full, a trace stream overrun occurs. This condition is detected by getting the status of the active trace stream (function posix_trace_get_status()) and looking at the member posix_stream_overrun_status of the returned posix_trace_status_info structure. In addition, two predefined trace event types are defined:

The beginning of a trace overflow, to locate the beginning of an overflow when reading a trace stream
The end of a trace overflow, to locate the end of an overflow, when reading a trace stream

As a timestamp is associated with these predefined trace events, it is possible to know the duration of the overflow.
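
Assuming an analyzer has already read the timestamps of the two predefined trace events (for example, from the posix_timestamp member of each event's posix_trace_event_info structure), the duration is a plain timespec subtraction. The function name demo_overflow_duration() is hypothetical.

```c
#include <time.h>

/*
 * Return end - begin as a normalized timespec.
 * Assumes end is not earlier than begin.
 */
struct timespec demo_overflow_duration(struct timespec begin,
                                       struct timespec end)
{
    struct timespec d;

    d.tv_sec = end.tv_sec - begin.tv_sec;
    d.tv_nsec = end.tv_nsec - begin.tv_nsec;
    if (d.tv_nsec < 0) {
        /* Borrow one second so tv_nsec stays in [0, 1e9). */
        d.tv_sec -= 1;
        d.tv_nsec += 1000000000L;
    }
    return d;
}
```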

Overrun in Dumping Trace Streams into Trace Logs

The user lets the trace mechanism dump the trace stream initialized with POSIX_TRACE_FLUSH policy automatically into a trace log. If the dump operation is slower than the recording of trace events, the trace stream can overrun. This condition is detected by getting the status of the active trace stream (function posix_trace_get_status()) and looking at the member posix_log_overrun_status of the returned posix_trace_status_info structure. This overrun indicates that the trace mechanism is not able to operate in this mode at this speed. It is the responsibility of the user to modify one of the trace parameters (the stream size or the trace event type filter, for instance) to avoid such overrun conditions, if overruns are to be prevented. The same predefined trace event types (see Overrun in Trace Streams Initialized with POSIX_TRACE_LOOP Policy) are used to detect an overflow and to determine its duration.

Reading an Active Trace Stream

Although this trace API allows one to read an active trace stream with log while it is tracing, this feature can make the origin of an overflow ambiguous: it may be attributed to either the trace log or the reader of the trace stream. Reading from an active trace stream with log is thus non-portable, and has been left unspecified.

B.2.12 Data Types

The requirement that additional types defined in this section end in "_t" was prompted by the problem of name space pollution. It is difficult to define a type (where that type is not one defined by IEEE Std 1003.1-2001) in one header file and use it in another without adding symbols to the name space of the program. To allow implementors to provide their own types, all conforming applications are required to avoid symbols ending in "_t", which permits the implementor to provide additional types. Because a major use of types is in the definition of structure members, which can (and in many cases must) be added to the structures defined in IEEE Std 1003.1-2001, the need for additional types is compelling.

The types, such as ushort and ulong, which are in common usage, are not defined in IEEE Std 1003.1-2001 (although ushort_t would be permitted as an extension). They can be added to <sys/types.h> using a feature test macro (see POSIX.1 Symbols). A suggested symbol for these is _SYSIII. Similarly, the types like u_short would probably be best controlled by _BSD.

Some of these symbols may appear in other headers; see The Name Space.

dev_t

This type may be made large enough to accommodate host-locality considerations of networked systems.

This type must be arithmetic. Earlier proposals allowed this to be non-arithmetic (such as a structure) and provided a samefile() function for comparison.

gid_t

Some implementations had separated gid_t from uid_t before POSIX.1 was completed. It would be difficult for them to coalesce them when it was unnecessary. Additionally, it is quite possible that user IDs might be different from group IDs because the user ID might wish to span a heterogeneous network, where the group ID might not.

For current implementations, the cost of having a separate gid_t will be only lexical.

mode_t

This type was chosen so that implementations could choose the appropriate integer type, and for compatibility with the ISO C standard. 4.3 BSD uses unsigned short and the SVID uses ushort, which is the same. Historically, only the low-order sixteen bits are significant.

nlink_t

This type was introduced in place of short for st_nlink (see the <sys/stat.h> header) in response to an objection that short was too small.

off_t

This type is used only in lseek(), fcntl(), and <sys/stat.h>. Many implementations would have difficulties if it were defined as anything other than long. Requiring an integer type limits the capabilities of lseek() to four gigabytes. The ISO C standard supplies routines that use larger types; see fgetpos() and fsetpos(). XSI-conformant systems provide the fseeko() and ftello() functions that use larger types.
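
As an illustration of the XSI functions mentioned above, the following sketch measures the size of an open stream using off_t rather than long. The helper name demo_stream_size() is hypothetical.

```c
#define _XOPEN_SOURCE 600
#include <stdio.h>
#include <sys/types.h>

/*
 * Return the size in bytes of the file behind fp as an off_t, or
 * (off_t)-1 on error. The current position is restored on success.
 */
off_t demo_stream_size(FILE *fp)
{
    off_t saved, size;

    saved = ftello(fp);
    if (saved == (off_t) -1)
        return (off_t) -1;
    if (fseeko(fp, 0, SEEK_END) == -1)
        return (off_t) -1;
    size = ftello(fp);
    if (fseeko(fp, saved, SEEK_SET) == -1)
        return (off_t) -1;
    return size;
}
```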

pid_t

The inclusion of this symbol was controversial because it is tied to the issue of the representation of a process ID as a number. From the point of view of a conforming application, process IDs should be "magic cookies"¹ that are produced by calls such as fork(), used by calls such as waitpid() or kill(), and not otherwise analyzed (except that the sign is used as a flag for certain operations).

The concept of a {PID_MAX} value interacted with this in early proposals. Treating process IDs as an opaque type both removes the requirement for {PID_MAX} and allows systems to be more flexible in providing process IDs that span a large range of values, or a small one.

Since the values in uid_t, gid_t, and pid_t will be numbers generally, and potentially both large in magnitude and sparse, applications that are based on arrays of objects of this type are unlikely to be fully portable in any case. Solutions that treat them as magic cookies will be portable.
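
In practice, treating a process ID as a magic cookie means passing it around unchanged and, when it must be displayed, converting it without assuming a particular width. A minimal sketch, with the hypothetical helper demo_format_pid():

```c
#include <inttypes.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Format a process ID for display without assuming the width of pid_t.
 * Returns the value returned by snprintf().
 */
int demo_format_pid(char *buf, size_t size, pid_t pid)
{
    /* intmax_t can hold any signed integer type, including pid_t. */
    return snprintf(buf, size, "%jd", (intmax_t) pid);
}
```

The same cast works for printing uid_t and gid_t values, although for those an unsigned conversion may be more appropriate on a given implementation.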

{CHILD_MAX} precludes the possibility of a "toy implementation", where there would only be one process.

ssize_t

This is intended to be a signed analog of size_t. The wording is such that an implementation may either choose to use a longer type or simply to use the signed version of the type that underlies size_t. All functions that return ssize_t (read() and write()) describe as "implementation-defined" the result of an input exceeding {SSIZE_MAX}. It is recognized that some implementations might have ints that are smaller than size_t. A conforming application would be constrained not to perform I/O in pieces larger than {SSIZE_MAX}, but a conforming application using extensions would be able to use the full range if the implementation provided an extended range, while still having a single type-compatible interface.

The symbols size_t and ssize_t are also required in <unistd.h> to minimize the changes needed for calls to read() and write(). Implementors are reminded that it must be possible to include both <sys/types.h> and <unistd.h> in the same program (in either order) without error.
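
The {SSIZE_MAX} constraint on conforming applications can be met with a small wrapper that never requests more than {SSIZE_MAX} bytes in a single call. This is a sketch of one common idiom; the name demo_write_all() is hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <errno.h>
#include <limits.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write len bytes from buf to fd, asking for at most SSIZE_MAX bytes
 * per call and retrying after partial writes. Returns 0 on success,
 * or -1 on error with errno set by write().
 */
int demo_write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        size_t chunk = len > (size_t) SSIZE_MAX ? (size_t) SSIZE_MAX : len;
        ssize_t n = write(fd, p, chunk);

        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before writing; retry */
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }
    return 0;
}
```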

uid_t

Before the addition of this type, the data types used to represent these values varied throughout early proposals. The <sys/stat.h> header defined these values as type short, the <passwd.h> file (now <pwd.h> and <grp.h>) used an int, and getuid() returned an int. In response to a strong objection to the inconsistent definitions, all the types were switched to uid_t.

In practice, those historical implementations that use varying types of this sort can typedef uid_t to short with no serious consequences.

The problem associated with this change concerns object compatibility after structure size changes. Since most implementations will define uid_t as a short, the only substantive change will be a reduction in the size of the passwd structure. Consequently, implementations with an overriding concern for object compatibility can pad the structure back to its current size. For that reason, this problem was not considered critical enough to warrant the addition of a separate type to POSIX.1.

The types uid_t and gid_t are magic cookies. There is no {UID_MAX} defined by POSIX.1, and no structure imposed on uid_t and gid_t other than that they be positive arithmetic types. (In fact, they could be unsigned char.) There is no maximum or minimum specified for the number of distinct user or group IDs.

Footnotes

1. An historical term meaning: "An opaque object, or token, of determinate size, whose significance is known only to the entity which created it. An entity receiving such a token from the generating entity may only make such use of the `cookie' as is defined and permitted by the supplying entity."

UNIX ® is a registered Trademark of The Open Group.
POSIX ® is a registered Trademark of The IEEE.