
Protocols for Interworking: XNFS, Version 3W
Copyright © 1998 The Open Group

XNFS: Protocol Specification, Version 3

This chapter specifies an additional protocol for the Network File System, the Version 3 protocol, which must be supported in addition to the Version 2 protocol specified in Chapter 7. This chapter is written with the assumption that the reader is familiar with the introductory material in Chapter 7.

Summary of Version 3 Protocol Changes

This section provides an informative summary of changes to the NFS protocol from Version 2 to Version 3. All normative aspects of the protocol are described later in this document.

The ROOT and WRITECACHE procedures have been removed. A MKNOD procedure has been defined to allow the creation of special files, eliminating the overloading of CREATE. Caching on the client is neither defined nor dictated by the NFS Version 3 protocol, but additional information and hints have been added to the protocol to allow clients that implement caching to manage their caches more effectively. Procedures that affect the attributes of a file or directory may now return the new attributes after the operation has completed, to optimise out a subsequent GETATTR used in validating attribute caches. In addition, operations that modify the directory in which the target object resides return the old and new attributes of the directory, to allow clients to implement more intelligent cache invalidation procedures. The ACCESS procedure provides access permission checking on the server. The FSSTAT procedure returns dynamic information about a file system. The FSINFO procedure returns static information about a file system and server. The READDIRPLUS procedure returns file handles and attributes in addition to directory entries. The PATHCONF procedure returns XPG4 pathconf() information about a file.

The following is a list of the important changes between the NFS Version 2 protocol and the NFS Version 3 protocol.

File handle size

The file handle has been increased to a variable-length array of 64 bytes maximum from a fixed array of 32 bytes. This addresses some known requirements for a slightly larger file handle size. The file handle was converted from fixed length to variable length to reduce local storage and network bandwidth requirements for systems that do not utilise the full 64 bytes of length.

Maximum data sizes

The maximum size of a data transfer used in the READ and WRITE procedures is now set by values in the FSINFO return structure. In addition, preferred transfer sizes are returned by FSINFO. The protocol does not place any artificial limits on the maximum transfer sizes. Filenames and pathnames are now specified as strings of variable length. The actual length restrictions are determined by the client and server implementations as appropriate. The protocol does not place any artificial limits on the length. The NFS3ERR_NAMETOOLONG error is provided to allow the server to return an indication to the client that it received a pathname that was too long for it to handle.

Error return

Error returns in some instances now return data (for example, attributes). The nfsstat3 structure now defines the full set of errors that can be returned by a server. No other values are allowed.

File type

The file type now includes NF3CHR and NF3BLK for special files. Attributes for these types include subfields for major and minor device numbers traditionally found on UNIX systems. NF3SOCK and NF3FIFO are now defined for sockets and FIFOs in the file system.

File attributes

The blocksize (the size in bytes of a block in the file) field has been removed. The mode field no longer contains file type information. The size and fileid fields have been widened to 8 byte unsigned integers from 4 byte integers. Major and minor device information is now presented in a distinct structure. The blocks field name has been changed to used and now contains the total number of bytes used by the file. It is also an 8 byte unsigned integer.

Set file attributes

In the NFS Version 2 protocol, the attributes that can be set were represented by a subset of the file attributes structure; the client indicated those attributes that were not to be modified by setting the corresponding field to -1, overloading some unsigned fields. The set file attributes structure now uses a discriminated union for each field to tell whether or how to set that field. The atime and mtime fields can be set to either the server's current time or a time supplied by the client.

LOOKUP

The LOOKUP return structure now includes the attributes for the directory searched.

ACCESS

An ACCESS procedure has been added to allow an explicit over-the-wire permissions check. This addresses known problems with the super-user ID mapping feature in many server implementations (where, due to mapping of root user, unexpected permission denied errors could occur while reading from or writing to a file). This also removes the assumption that was made in the NFS Version 2 protocol that access to files was based solely on XPG-style mode bits.

READ

The reply structure includes a Boolean that is TRUE if the end-of-file was encountered during the READ. This allows the client to correctly detect end-of-file.

WRITE

The beginoffset and totalcount fields were removed from the WRITE arguments. The reply now includes a count so that the server can write less than the requested amount of data, if required. An indicator was added to the arguments to instruct the server as to the level of cache synchronisation that is required by the client.

CREATE

An exclusive flag and a create verifier were added for the exclusive creation of regular files.

MKNOD

This procedure was added to support the creation of special files. This avoids overloading fields of CREATE as was done in some NFS Version 2 protocol implementations.

READDIR

The READDIR arguments now include a verifier to allow the server to validate the cookie. The cookie is now a 64 bit unsigned integer instead of the 4 byte array that was used in the NFS Version 2 protocol. This will help to reduce interoperability problems.

READDIRPLUS

This procedure was added to return file handles and attributes in an extended directory list.

FSINFO

FSINFO was added to provide nonvolatile information about a file system. The reply includes preferred and maximum read transfer size, preferred and maximum write transfer size, and flags stating whether links or symbolic links are supported. Also returned are preferred transfer size for READDIR procedure replies, server time granularity and whether times can be set in a SETATTR request.

FSSTAT

FSSTAT was added to provide volatile information about a file system, for use by utilities such as df. The reply includes the total size and free space in the file system specified in bytes, the total number of files and number of free file slots in the file system, and an estimate of time between file system modifications (for use in cache consistency checking algorithms).

COMMIT

The COMMIT procedure provides the synchronisation mechanism to be used with asynchronous WRITE operations.

RPC Information

The NFS service uses AUTH_NONE in the NULL procedure. AUTH_UNIX, AUTH_DES or AUTH_KERB are used for all other procedures.
Transport Protocols

NFS implementations exist for both UDP/IP and TCP/IP protocols.

Port Number

The NFS Version 3 protocol uses the UDP port number 2049 decimal, the same port as the Version 2 protocol. Since this is not an officially assigned port, it is possible that it may change in the future. For maximum interoperability it is recommended (but not required) that NFS servers use UDP port 2049 if possible, and that NFS clients use the portmap mechanism (see Chapter 6) to locate the NFS program on a server.

WebNFS servers must use UDP and TCP port 2049.

Sizes of XDR Structures

The following table specifies the sizes, given in decimal bytes, of various XDR structures used in the protocol:

Structure Size Description
NFS3_FHSIZE 64 The maximum size in bytes of the opaque file handle
NFS3_COOKIEVERFSIZE 8 The size in bytes of the opaque cookie verifier passed by READDIR and READDIRPLUS
NFS3_CREATEVERFSIZE 8 The size in bytes of the opaque verifier used for exclusive CREATE
NFS3_WRITEVERFSIZE 8 The size in bytes of the opaque verifier used for asynchronous WRITE

Basic Data Types

The following XDR definitions are basic definitions that are used in other structures.
typedef unsigned hyper uint64;
typedef hyper int64;
typedef unsigned long uint32;
typedef long int32;
typedef string filename3<>;
typedef string nfspath3<>;
typedef uint64 fileid3;
typedef uint64 cookie3;
typedef opaque cookieverf3[NFS3_COOKIEVERFSIZE];
typedef opaque createverf3[NFS3_CREATEVERFSIZE];
typedef opaque writeverf3[NFS3_WRITEVERFSIZE];
typedef uint32 uid3;
typedef uint32 gid3;
typedef uint64 size3;
typedef uint64 offset3;
typedef uint32 mode3;
typedef uint32 count3;
enum nfsstat3 {
    NFS3_OK             = 0,
    NFS3ERR_PERM        = 1,
    NFS3ERR_NOENT       = 2,
    NFS3ERR_IO          = 5,
    NFS3ERR_NXIO        = 6,
    NFS3ERR_ACCES       = 13,
    NFS3ERR_EXIST       = 17,
    NFS3ERR_XDEV        = 18,
    NFS3ERR_NODEV       = 19,
    NFS3ERR_NOTDIR      = 20,
    NFS3ERR_ISDIR       = 21,
    NFS3ERR_INVAL       = 22,
    NFS3ERR_FBIG        = 27,
    NFS3ERR_NOSPC       = 28,
    NFS3ERR_ROFS        = 30,
    NFS3ERR_MLINK       = 31,
    NFS3ERR_NAMETOOLONG = 63,
    NFS3ERR_NOTEMPTY    = 66,
    NFS3ERR_DQUOT       = 69,
    NFS3ERR_STALE       = 70,
    NFS3ERR_REMOTE      = 71,
    NFS3ERR_BADHANDLE   = 10001,
    NFS3ERR_NOT_SYNC    = 10002,
    NFS3ERR_BAD_COOKIE  = 10003,
    NFS3ERR_NOTSUPP     = 10004,
    NFS3ERR_TOOSMALL    = 10005,
    NFS3ERR_SERVERFAULT = 10006,
    NFS3ERR_BADTYPE     = 10007,
    NFS3ERR_JUKEBOX     = 10008
};

The nfsstat3 type is returned with every procedure's results except for the NULL procedure. A value of NFS3_OK indicates that the call completed successfully. Any other value indicates that some error occurred on the call, as identified by the error code. No other values may be returned by a server. Servers are expected to make a best effort mapping of error conditions to the set of error codes defined. In addition, no error precedences are specified by this document. Error precedences determine the error value that should be returned when more than one error applies in a given situation. The error precedence will be determined by the individual server implementation. If the client requires specific error precedences, it should check for the specific errors for itself.

A description of each defined error follows.

NFS3_OK
Indicates the call completed successfully.

NFS3ERR_PERM
Not owner. The caller does not have the correct ownership to perform the requested operation.

NFS3ERR_NOENT
No such file or directory. The file or directory name specified does not exist.

NFS3ERR_IO
I/O error. Some sort of hard error occurred while the operation was in progress; a disk error, for example.

NFS3ERR_NXIO
No such device or address.

NFS3ERR_ACCES
Permission denied. The caller does not have the correct permission to perform the requested operation. Contrast this with NFS3ERR_PERM, which restricts itself to owner permission failures.

NFS3ERR_EXIST
File exists. The file specified already exists.

NFS3ERR_XDEV
The caller attempted to do a cross-device hard link.

NFS3ERR_NODEV
No such device.

NFS3ERR_NOTDIR
Not a directory. The caller specified a non-directory in a directory operation.

NFS3ERR_ISDIR
Is a directory. The caller specified a directory in a non-directory operation.

NFS3ERR_INVAL
Invalid argument or unsupported argument for an operation. Two examples are attempting a READLINK on an object other than a symbolic link, or attempting to SETATTR a time field on a server that does not support this operation.

NFS3ERR_FBIG
File too large. The operation would have caused a file to grow beyond the server's limit.

NFS3ERR_NOSPC
No space left on device. The operation would have caused the server's file system to exceed its limit.

NFS3ERR_ROFS
Read-only file system. A modifying operation was attempted on a read-only file system.

NFS3ERR_MLINK
Too many hard links.

NFS3ERR_NAMETOOLONG
The filename in an operation was too long.

NFS3ERR_NOTEMPTY
An attempt was made to remove a directory that was not empty.

NFS3ERR_DQUOT
Resource (quota) hard limit exceeded. The user's resource limit on the server has been exceeded.

NFS3ERR_STALE
Invalid file handle. The file handle given in the arguments was invalid. The file referred to by that file handle no longer exists, or access to it has been revoked.

NFS3ERR_REMOTE
Too many levels of remote in path. The file handle given in the arguments referred to a file on a non-local file system on the server.

NFS3ERR_BADHANDLE
Invalid NFS file handle. The file handle failed internal consistency checks.

NFS3ERR_NOT_SYNC
An update synchronisation mismatch was detected during a SETATTR operation.

NFS3ERR_BAD_COOKIE
A READDIR or READDIRPLUS cookie is stale.

NFS3ERR_NOTSUPP
The operation is not supported.

NFS3ERR_TOOSMALL
The buffer or request is too small.

NFS3ERR_SERVERFAULT
An error occurred on the server which does not map to any of the valid NFS Version 3 protocol error values. The client should translate this into an appropriate error. Clients based on an XPG system may choose to translate this to EIO.

NFS3ERR_BADTYPE
An attempt was made to create an object of a type not supported by the server.

NFS3ERR_JUKEBOX
The server initiated the request, but was not able to complete it in a timely fashion. The client should wait and then try the request with a new RPC transaction ID. For example, this error should be returned from a server that supports hierarchical storage and receives a request to process a file that has been migrated; in this case, the server should start the migration process and respond to the client with this error.

enum ftype3 { NF3REG = 1, NF3DIR = 2, NF3BLK = 3, NF3CHR = 4, NF3LNK = 5, NF3SOCK = 6, NF3FIFO = 7 };

The enumeration ftype3 gives the type of a file, as follows:

NF3REG
Regular file

NF3DIR
Directory

NF3BLK
Block special device file

NF3CHR
Character special device file

NF3LNK
Symbolic link

NF3SOCK
Socket

NF3FIFO
Named pipe

struct specdata3 { uint32 specdata1; uint32 specdata2; };

The interpretation of the two words depends on the type of file system object. For a block special (NF3BLK) or character special (NF3CHR) file, specdata1 and specdata2 are the major and minor device numbers, respectively. For all other file types, these two elements should either be set to zero or the values should be agreed upon by the client and server. If the client and server do not agree upon the values, the client should treat these fields as if they are set to zero. This data field is returned as part of the fattr3 structure and so is available from all replies returning attributes. Since these fields are otherwise unused for objects that are not devices, out of band information can be passed from the server to the client. However, both the server and the client must agree on the values passed.

struct nfs_fh3 { opaque data<NFS3_FHSIZE>; };

The nfs_fh3 structure is the variable-length opaque object returned by the server on LOOKUP, CREATE, SYMLINK, MKNOD, LINK or READDIRPLUS operations, which is used by the client on subsequent operations to reference the file. The file handle contains all the information the server needs to distinguish an individual file. To the client, the file handle is opaque. The client stores file handles for use in a later request and can compare two file handles from the same server for equality by doing a byte-by-byte comparison, but cannot otherwise interpret the contents of file handles. If two file handles from the same server are equal, they must refer to the same file, but if they are not equal, no conclusions can be drawn. Servers should try to maintain a one-to-one correspondence between file handles and files, but this is not required. Clients should use file handle comparisons only to improve performance, not for correct behaviour.
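The comparison rule above can be sketched in C. This is a minimal illustration, assuming a flattened in-memory form of the handle; the struct layout and the name fh_equal are local to this sketch (on the wire the handle is an XDR variable-length opaque):

```c
#include <assert.h>
#include <string.h>

#define NFS3_FHSIZE 64

/* Illustrative in-memory form of the variable-length nfs_fh3 opaque. */
struct nfs_fh3 {
    unsigned int len;                  /* actual length, <= NFS3_FHSIZE */
    unsigned char data[NFS3_FHSIZE];
};

/* Byte-by-byte equality, the only interpretation a client may apply.
 * Handles of different length are simply unequal. */
static int fh_equal(const struct nfs_fh3 *a, const struct nfs_fh3 *b)
{
    return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}
```

Note that equality implies the same file only for handles obtained from the same server; inequality permits no conclusion at all.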

Servers can revoke the access provided by a file handle at any time. If the file handle passed in a call refers to a file system object that no longer exists on the server or access for that file handle has been revoked, the NFS3ERR_STALE error should be returned.

A filehandle with a length of zero is called the public filehandle. It is used by WebNFS clients to identify an associated public directory on the server.

struct nfstime3 { uint32 seconds; uint32 nseconds; };

The nfstime3 structure gives the number of seconds and nanoseconds since midnight January 1, 1970 Greenwich Mean Time. It is used to pass time and date information. The times associated with files are all server times except in the case of a SETATTR operation where the client can explicitly set the file time. A server converts to and from local time when processing time values, preserving as much accuracy as possible. If the precision of timestamps stored for a file is less than that defined by the NFS Version 3 protocol, loss of precision can occur. An adjunct time maintenance protocol is recommended to reduce client and server time skew.
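The loss of precision mentioned above can be illustrated with a small sketch. The function and the gran_ns parameter are hypothetical, standing in for a server whose file system stores timestamps at a coarser granularity than nanoseconds:

```c
#include <assert.h>
#include <stdint.h>

struct nfstime3 { uint32_t seconds; uint32_t nseconds; };

/* Truncate a timestamp to a granularity of gran_ns nanoseconds
 * (gran_ns must be > 0 and <= 1000000000). A server storing whole
 * seconds would behave like gran_ns = 1000000000, discarding the
 * nanosecond part of any client-supplied time. */
static struct nfstime3 apply_granularity(struct nfstime3 t, uint32_t gran_ns)
{
    t.nseconds -= t.nseconds % gran_ns;
    return t;
}
```

A client that writes a time and later reads it back may therefore see a truncated value, which is one reason FSINFO reports the server's time granularity.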

struct fattr3 { ftype3 type; mode3 mode; uint32 nlink; uid3 uid; gid3 gid; size3 size; size3 used; specdata3 rdev; uint64 fsid; fileid3 fileid; nfstime3 atime; nfstime3 mtime; nfstime3 ctime; };

The fattr3 structure defines the attributes of a file system object. It is returned by most operations on an object; in the case of operations that affect two objects (for example, a MKDIR that modifies the target directory attributes and defines new attributes for the newly created directory), the attributes for both may be returned. In some cases, the attributes are returned in the structure, wcc_data, which is defined below; in other cases the attributes are returned alone.

The fattr3 structure contains the basic attributes of a file. All servers must support this set of attributes even if they have to simulate some of the fields.

type
The type of the file.

mode
The protection mode bits.

nlink
The number of hard links to the file; that is, the number of different names for the same file.

uid
The user ID of the owner of the file.

gid
The group ID of the group of the file.

size
The size of the file in bytes.

used
The number of bytes of disk space that the file actually uses (which can be smaller than the size because the file may have holes, or larger due to fragmentation).

rdev
The device number information, if the file type is NF3CHR or NF3BLK; see specdata3.

fsid
The file system identifier for the file system.

fileid
A number that uniquely identifies the file within its file system (on traditional UNIX systems, this would be the i-number).

atime
The time when the file data was last accessed.

mtime
The time when the file data was last modified.

ctime
The time when the attributes of the file were last changed. Writing to the file changes the ctime in addition to the mtime.

The mode bits are defined as follows:

Bit Description
0x00800 Set user ID on execution.
0x00400 Set group ID on execution.
0x00200 Save swapped text (not defined in XPG4).
0x00100 Read permission for owner.
0x00080 Write permission for owner.
0x00040 Execute permission for owner on a file. Or lookup (search) permission for owner in directory.
0x00020 Read permission for group.
0x00010 Write permission for group.
0x00008 Execute permission for group on a file. Or lookup (search) permission for group in directory.
0x00004 Read permission for others.
0x00002 Write permission for others.
0x00001 Execute permission for others on a file. Or lookup (search) permission for others in directory.
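As an illustration (not part of the protocol), the nine permission bits from the table can be rendered in the familiar ls(1) form. The macros and the helper mode_to_string are names invented for this sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Permission bit values from the mode bits table above. */
#define MODE3_RUSR 0x00100  /* read, owner */
#define MODE3_WUSR 0x00080  /* write, owner */
#define MODE3_XUSR 0x00040  /* execute/search, owner */
#define MODE3_RGRP 0x00020  /* read, group */
#define MODE3_WGRP 0x00010  /* write, group */
#define MODE3_XGRP 0x00008  /* execute/search, group */
#define MODE3_ROTH 0x00004  /* read, others */
#define MODE3_WOTH 0x00002  /* write, others */
#define MODE3_XOTH 0x00001  /* execute/search, others */

/* Render the nine low-order permission bits as an ls(1)-style string;
 * bit 0x00100 maps to the first character, bit 0x00001 to the last. */
static void mode_to_string(uint32_t mode, char out[10])
{
    static const char sym[9] = "rwxrwxrwx";
    for (int i = 0; i < 9; i++)
        out[i] = (mode & (1u << (8 - i))) ? sym[i] : '-';
    out[9] = '\0';
}
```

For example, a mode of 0x001ED renders as "rwxr-xr-x".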

union post_op_attr switch (bool attributes_follow){ case TRUE: fattr3 attributes; case FALSE: void; };

The post_op_attr structure is used for returning attributes in those operations that are not directly involved with manipulating attributes. One of the principles of this revision of the NFS protocol is to return the real value from the indicated operation and not an error from an incidental operation. The post_op_attr structure was designed to allow the server to recover from errors encountered while getting attributes.

This appears to make returning attributes optional. However, server implementors are strongly encouraged to make a best effort to return attributes whenever possible, even when returning an error.

struct wcc_attr { size3 size; nfstime3 mtime; nfstime3 ctime; };

The wcc_attr structure is the subset of pre-operation attributes needed to improve support for the weak cache consistency semantics. The size argument is the file size in bytes of the object before the operation. The mtime argument is the time of last modification of the object before the operation. The ctime argument is the time of last change to the attributes of the object before the operation.

The use of mtime by clients to detect changes to file system objects residing on a server is dependent on the granularity of the time base on the server.

union pre_op_attr switch (bool attributes_follow){ case TRUE: wcc_attr attributes; case FALSE: void; };
struct wcc_data { pre_op_attr before; post_op_attr after; };

When a client performs an operation that modifies the state of a file or directory on the server, it cannot immediately determine from the post-operation attributes whether the operation just performed was the only operation on the object since the last time the client received the attributes for the object. This is important, since if an intervening operation has changed the object, the client will need to invalidate any cached data for the object (except for the data that it just wrote).

To deal with this, the notion of weak cache consistency data (wcc_data) is introduced. A wcc_data structure consists of certain key fields from the object attributes before the operation, together with the object attributes after the operation. This information allows the client to manage its cache more accurately than in NFS Version 2 protocol implementations. The term weak cache consistency emphasizes the fact that this mechanism does not provide the strict server-client consistency that a cache consistency protocol would provide.

In order to support the weak cache consistency model, the server must be able to get the pre-operation attributes of the object, perform the intended modify operation, and then get the post-operation attributes atomically. If there is a window for the object to get modified between the operation and either of the get attributes operations, then the client will not be able to determine whether it was the only entity to modify the object. Some information will have been lost, thus weakening the weak cache consistency guarantees.
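A caching client's use of the pre-operation attributes can be sketched as follows. This is an assumption about how a client might apply wcc_data, not a mandated algorithm; the function name cache_still_valid is invented for this sketch:

```c
#include <assert.h>
#include <stdint.h>

struct nfstime3 { uint32_t seconds; uint32_t nseconds; };
struct wcc_attr { uint64_t size; struct nfstime3 mtime; struct nfstime3 ctime; };

static int time_eq(struct nfstime3 a, struct nfstime3 b)
{
    return a.seconds == b.seconds && a.nseconds == b.nseconds;
}

/* If the pre-operation attributes returned by the server match what
 * the client last cached, no other entity modified the object in the
 * interim, so cached data (other than what this client just wrote)
 * may be kept; otherwise the cache must be invalidated. */
static int cache_still_valid(const struct wcc_attr *pre_op,
                             const struct wcc_attr *cached)
{
    return pre_op->size == cached->size &&
           time_eq(pre_op->mtime, cached->mtime) &&
           time_eq(pre_op->ctime, cached->ctime);
}
```

On a match the client then adopts the post-operation attributes as its new cached values.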

union post_op_fh3 switch (bool handle_follows){ case TRUE: nfs_fh3 handle; case FALSE: void; };

One of the principles of this revision of the NFS protocol is to return the real value from the indicated operation and not an error from an incidental operation. The post_op_fh3 structure was designed to allow the server to recover from errors encountered while constructing a file handle.

This is the structure used to return a file handle from the CREATE, MKDIR, SYMLINK, MKNOD and READDIRPLUS requests. In each case, the client can get the file handle by issuing a LOOKUP request after a successful return from one of the listed operations. Returning the file handle is an optimisation so that the client is not forced to issue a LOOKUP request immediately to get the file handle.

enum time_how { DONT_CHANGE = 0, SET_TO_SERVER_TIME = 1, SET_TO_CLIENT_TIME = 2 };
union set_mode3 switch (bool set_it) { case TRUE: mode3 mode; default: void; };
union set_uid3 switch (bool set_it) { case TRUE: uid3 uid; default: void; };
union set_gid3 switch (bool set_it) { case TRUE: gid3 gid; default: void; };
union set_size3 switch (bool set_it) { case TRUE: size3 size; default: void; };
union set_atime switch (time_how set_it) { case SET_TO_CLIENT_TIME: nfstime3 atime; default: void; };
union set_mtime switch (time_how set_it) { case SET_TO_CLIENT_TIME: nfstime3 mtime; default: void; };
struct sattr3 { set_mode3 mode; set_uid3 uid; set_gid3 gid; set_size3 size; set_atime atime; set_mtime mtime; };

The sattr3 structure contains the file attributes that can be set from the client. The fields are the same as the similarly named fields in the fattr3 structure. In the NFS Version 3 protocol, the attributes that can be set are described by a structure containing a set of discriminated unions. Each union indicates whether the corresponding attribute is to be updated, and if so, how.

There are two forms of discriminated unions used. In setting the mode, uid, gid or size, the discriminated union is switched on a Boolean, set_it; if it is TRUE, a value of the appropriate type is then encoded.

In setting the atime or mtime, the union is switched on an enumeration type, set_it. If set_it has the value DONT_CHANGE, the corresponding attribute is unchanged. If it has the value SET_TO_SERVER_TIME, the corresponding attribute is set by the server to its local time; no data is provided by the client. Finally, if set_it has the value SET_TO_CLIENT_TIME, the attribute is set to the time passed by the client in an nfstime3 structure. (See the description of the FSINFO procedure, which addresses the issue of time granularity.)
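The discriminated unions can be illustrated by building a SETATTR argument that truncates a file to zero length, stamps mtime with the server's clock, and leaves everything else unchanged. The C struct layout below is illustrative (the wire format comes from the XDR encoder), uid/gid are omitted for brevity, and make_truncate_args is a name invented for this sketch:

```c
#include <assert.h>
#include <stdint.h>

enum time_how { DONT_CHANGE = 0, SET_TO_SERVER_TIME = 1, SET_TO_CLIENT_TIME = 2 };

/* C renderings of the XDR discriminated unions. */
struct set_mode3 { int set_it; uint32_t mode; };
struct set_size3 { int set_it; uint64_t size; };
struct set_time  { enum time_how set_it; struct { uint32_t s, ns; } t; };

struct sattr3 {
    struct set_mode3 mode;
    struct set_size3 size;
    struct set_time  atime, mtime;
    /* uid and gid omitted for brevity */
};

/* Build a sattr3 that truncates the file to zero length and stamps
 * mtime with the server's clock, leaving every other attribute alone. */
static struct sattr3 make_truncate_args(void)
{
    struct sattr3 sa = {0};       /* all set_it fields FALSE / DONT_CHANGE */
    sa.size.set_it = 1;           /* TRUE: a size3 value follows */
    sa.size.size   = 0;
    sa.mtime.set_it = SET_TO_SERVER_TIME;  /* no time value is sent */
    return sa;
}
```

Because each union carries its own discriminant, the server can tell "leave mode alone" apart from "set mode to 0", which the Version 2 convention of overloading -1 could not express for unsigned fields.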

struct diropargs3 { nfs_fh3 dir; filename3 name; };

The diropargs3 structure is used in directory operations. The file handle, dir, identifies the directory in which to manipulate or access the file, name. See additional comments in Filename Component Handling.

Attributes and Consistency Data on Failure

For those procedures that return either post_op_attr or wcc_data structures on failure, the discriminated union may contain the pre-operation attributes of the object or object parent directory. This depends on the error encountered and may also depend on the particular server implementation. Implementors are strongly encouraged to return as much attribute data as possible upon failure, but client implementors need to be aware that their implementation must correctly handle the variant return instance where no attributes or consistency data is returned.

General File Name Requirements

The following requirements apply to all NFS Version 3 protocol procedures in which the client provides one or more file names in the arguments: LOOKUP, CREATE, MKDIR, SYMLINK, MKNOD, REMOVE, RMDIR, RENAME and LINK.

  1. The file name must not be null nor may it be the null string. The server should return the NFS3ERR_ACCES error if it receives such a file name. On some clients, a null string used as a file name is assumed to be an alias for the current directory. Clients that require this functionality should implement it for themselves and not depend upon the server to support such semantics.

  2. A filename having the value of "." (dot) is assumed to be an alias for the current directory. Clients that require this functionality should implement it for themselves and not depend upon the server to support such semantics. However, the server should be able to handle such a filename correctly.

  3. A filename having the value of ".." (dot-dot) is assumed to be an alias for the parent of the current directory; in other words, the directory that contains the current directory. The server should be prepared to handle this semantic, if it supports directories, even if those directories do not contain XPG-style dot or dot-dot entries.

  4. If the filename is longer than the maximum for the file system (see the description of the PATHCONF procedure, specifically name_max), the result depends on the value of the PATHCONF flag no_trunc. If no_trunc is FALSE, the filename will be silently truncated to name_max bytes. If no_trunc is TRUE and the filename exceeds the server's file system maximum filename length, the operation will fail with the NFS3ERR_NAMETOOLONG error.

  5. In general, there will be characters that a server will not be able to handle as part of a filename. This set of characters will vary from server to server and from implementation to implementation. In most cases, it is the server that will control the client's view of the file system. If the server receives a filename containing characters that it cannot handle, the NFS3ERR_ACCES error should be returned. Client implementations should be prepared to handle this side effect of heterogeneity.

See additional comments in Filename Component Handling.
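A server-side validation routine applying rules 1, 4 and 5 above might look like the following sketch. The function check_filename, the fs_limits structure, and the choice of '/' as an unhandleable character are all assumptions for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NFS3_OK             0
#define NFS3ERR_ACCES       13
#define NFS3ERR_NAMETOOLONG 63

/* Hypothetical per-file-system limits, as a server might derive
 * from its PATHCONF information (name_max, no_trunc). */
struct fs_limits { size_t name_max; int no_trunc; };

/* Apply rules 1, 4 and 5 to a candidate filename. */
static int check_filename(const char *name, const struct fs_limits *lim)
{
    if (name == NULL || name[0] == '\0')
        return NFS3ERR_ACCES;              /* rule 1: null or empty name */
    if (strchr(name, '/') != NULL)
        return NFS3ERR_ACCES;              /* rule 5: character this server rejects */
    if (lim->no_trunc && strlen(name) > lim->name_max)
        return NFS3ERR_NAMETOOLONG;        /* rule 4: too long, no truncation */
    return NFS3_OK;
}
```

When no_trunc is FALSE the server would instead silently truncate the name to name_max bytes before proceeding.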

XNFS Implementation Issues

The NFS Version 3 protocol was designed to allow different operating systems to share files. However, since it was designed in a UNIX environment, many operations have semantics similar to the operations of the UNIX file system. This section discusses some of the general implementation-specific details and semantic issues. Procedure descriptions have implementation guidance specific to that procedure.

Server/Client Relationship

The NFS Version 3 protocol is designed to allow servers to be as simple and general as possible. Sometimes the simplicity of the server can be a problem, if the client implements complicated file system semantics.

For example, some operating systems allow removal of open files. A process can open a file and, while it is open, remove it from the directory. The file can be read and written as long as the process keeps it open, even though the file has no name in the file system. It is impossible for a stateless server to implement these semantics. The client can do some tricks such as renaming the file on remove (to a hidden name), and only physically deleting it on close. The NFS Version 3 protocol provides sufficient functionality to implement most file system semantics on a client.

Every NFS Version 3 protocol client can also potentially be a server, and remote and local mounted file systems can be freely mixed. This leads to some problems when a client travels down the directory tree of a remote file system and reaches the mount point on the server for another remote file system. Allowing the server to follow the second remote mount would require loop detection, server lookup, and user revalidation. Instead, both NFS Version 2 protocol and NFS Version 3 protocol implementations do not typically let clients cross a server's mount point. When a client does a LOOKUP on a directory on which the server has mounted a file system, the client sees the underlying directory instead of the mounted directory.

For example, if a server has a file system called /usr and mounts another file system on /usr/src, if a client mounts /usr, it does not see the mounted version of /usr/src. A client could do remote mounts that match the server's mount points to maintain the server's view. In this example, the client would also have to mount /usr/src in addition to /usr, even if they are from the same server.

Pathname Interpretation

There are a few complications to the rule that pathnames are always parsed on the client. For example, symbolic links could have different interpretations on different clients. There is no answer to this problem in this document.

Another common problem for non-XPG implementations is the special interpretation of the pathname ".." to mean the parent of a given directory. A future revision of the protocol may use an explicit flag to indicate the parent instead. In practice this has not proved to be a problem, as many working non-XPG implementations exist.

Permission Issues

The NFS Version 3 protocol, strictly speaking, does not define the permission checking used by servers. However, it is expected that a server will do normal operating system permission checking using AUTH_UNIX style authentication as the basis of its protection mechanism, or another stronger form of authentication such as AUTH_DES or AUTH_KERB. With AUTH_UNIX authentication, the server gets the client's effective user ID, effective group ID and groups on each call and uses them to check permission. These are the so-called UNIX credentials. AUTH_DES and AUTH_KERB use a network name, or netname, as the basis for identification (from which a UNIX server derives the necessary standard UNIX credentials). There are problems with this method; the ways in which they have been addressed are discussed below.

Using user ID and group ID implies that the client and server share the same user ID list. Every server and client pair must have the same mapping from user to user ID and from group to group ID. Since every client can also be a server, this tends to imply that the whole network shares the same user/group ID space. If this is not the case, then it usually falls upon the server to perform some custom mapping of credentials from one authentication domain into another. A discussion of techniques for managing a shared user space or for providing mechanisms for user ID mapping is beyond the scope of this document.

Another problem arises due to the usually stateful open operation. Most operating systems check permission at open time, and then check that the file is open on each read and write request. With stateless servers, the server cannot detect that the file is open and must do permission checking on each read and write call. UNIX client semantics of access permission checking on open can be provided with the ACCESS procedure call in this revision, which allows a client to explicitly check access permissions without resorting to trying the operation. On a local file system, a user can open a file and then change the permissions so that no one is allowed to touch it, but will still be able to write to the file because it is open. On a remote file system, by contrast, the write would fail. To get around this problem, the server's permission checking algorithm should allow the owner of a file to access it regardless of the permission setting. This is needed in a practical NFS Version 3 protocol server implementation, but it does depart from correct local file system semantics. This should not affect the return result of access permissions as returned by the ACCESS procedure, however.

A similar problem has to do with paging in an executable program over the network. The operating system usually checks for execute permission before opening a file for demand paging, and then reads blocks from the open file. In a local UNIX file system, an executable file does not need read permission to execute (page-in). An NFS Version 3 protocol server can not tell the difference between a normal file read (where the read permission bit is meaningful) and a demand page-in read (where the server should allow access to the executable file if the execute bit is set for that user or group or public). To make this work, the server allows reading of files if the user ID given in the call has either execute or read permission on the file, through ownership, group membership or public access. Again, this departs from correct local file system semantics.
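The two relaxations described above (the owner override for writes, and treating execute permission as sufficient for reads) can be sketched as follows. This is a minimal illustration, not part of the protocol; the function names are hypothetical and group membership is omitted for brevity.

```c
#include <assert.h>

/* Owner override: the owner may write regardless of the permission
 * bits, so that a client which opened a file and then removed write
 * permission can keep writing through its open file. */
static int may_write(unsigned mode, unsigned caller_uid, unsigned owner_uid)
{
    if (caller_uid == owner_uid)
        return 1;                    /* owner override */
    return (mode & 0002) != 0;       /* otherwise require "other" write */
}

/* Reads succeed with either read or execute permission, so that
 * demand paging of an execute-only binary works. */
static int may_read(unsigned mode, unsigned caller_uid, unsigned owner_uid)
{
    unsigned bits = (caller_uid == owner_uid) ? (mode >> 6) & 07 : mode & 07;
    return (bits & 04) != 0 || (bits & 01) != 0;   /* read or execute */
}
```

As the text notes, both checks deliberately depart from correct local file system semantics; the ACCESS procedure should still report the unrelaxed permissions.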

In some operating systems, a particular user (on UNIX systems, the user ID 0) has access to all files, no matter what permission and ownership they have. This super-user permission might not be allowed on the server, since anyone who can become super-user on their client could gain access to all remote files. A UNIX server by default maps user ID 0 to a distinguished value (UID_NOBODY), as well as mapping the groups list, before doing its access checking. A server implementation may provide a mechanism to change this mapping. This works except for NFS Version 3 protocol root file systems (required for diskless NFS Version 3 protocol client support), where super-user access cannot be avoided. Export options are used, on the server, to restrict the set of clients allowed super-user access.
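The default super-user mapping can be sketched as follows. This is an illustration only: the UID_NOBODY value shown is a common convention, the mapping of the groups list is omitted, and real servers drive the `export_allows_root` decision from per-export options.

```c
#include <assert.h>

#define UID_NOBODY 65534u   /* conventional "nobody" value; implementation-defined */

/* Sketch of the default super-user mapping: uid 0 arriving in the
 * AUTH_UNIX credentials is replaced by a distinguished unprivileged
 * uid before access checking, unless the export permits root access. */
static unsigned squash_uid(unsigned wire_uid, int export_allows_root)
{
    if (wire_uid == 0 && !export_allows_root)
        return UID_NOBODY;
    return wire_uid;
}
```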

Duplicate Request Cache

The typical NFS Version 3 protocol failure recovery model uses client time-out and retry to handle server crashes, network partitions and lost server replies. A retried request is called a duplicate of the original.

When used in a file server context, the term idempotent can be used to distinguish between operation types. An idempotent request is one that a server can perform more than once with equivalent results (though it may in fact change, as a side effect, the access time on a file, say for READ). Some NFS operations are obviously non-idempotent. They cannot be reprocessed without special attention simply because they may fail if tried a second time. The CREATE request, for example, can be used to create a file for which the owner does not have write permission. A duplicate of this request cannot succeed if the original succeeded. Likewise, a file can be removed only once.

The side effects caused by performing a duplicate non-idempotent request can be destructive (for example, a truncate operation causing lost writes). The combination of a stateless design with the common choice of an unreliable network transport (UDP) implies the possibility of destructive replays of non-idempotent requests. More precisely, it is the inherently stateless design of the NFS Version 3 protocol on top of an unreliable RPC mechanism that yields the possibility of destructive replays of non-idempotent requests: even in an implementation of the NFS Version 3 protocol over a reliable connection-oriented transport, a connection break with automatic re-establishment requires duplicate request processing (the client will retransmit the request, and the server must deal with a potential duplicate non-idempotent request).

Most NFS Version 3 protocol server implementations use a cache of recent requests (called the duplicate request cache) for the processing of duplicate non-idempotent requests. The duplicate request cache provides a short-term memory mechanism in which the original completion status of a request is remembered and the operation attempted only once. If a duplicate copy of this request is received, then the original completion status is returned.

The duplicate-request cache mechanism has been useful in reducing destructive side effects caused by duplicate NFS Version 3 protocol requests. This mechanism, however, does not guarantee against these destructive side effects in all failure modes. Most servers store the duplicate request cache in RAM, so the contents are lost if the server crashes. The exception to this may possibly occur in a redundant server approach to high availability, where the file system itself may be used to share the duplicate request cache state. Even if the cache survives server reboots (or failovers in the high availability case), its effectiveness is a function of its size. A network partition can cause a cache entry to be reused before a client receives a reply for the corresponding request. If this happens, the duplicate request will be processed as a new one, possibly with destructive side effects.
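A duplicate request cache of the kind described above can be sketched as follows. This is a deliberately minimal illustration: real servers also key entries on the client's address, program and procedure number, and manage a much larger table with an LRU replacement policy.

```c
/* Minimal sketch of a duplicate request cache keyed by the RPC
 * transaction id (xid). Held in RAM, so its contents are lost on a
 * server crash, as the text above notes. */
#define DRC_SIZE 128

struct drc_entry {
    unsigned xid;
    int      valid;
    int      status;     /* remembered completion status of the original */
};

static struct drc_entry drc[DRC_SIZE];

/* Returns 1 and fills *status if the xid is a remembered duplicate;
 * the server then replies with the original status instead of
 * re-executing the non-idempotent operation. */
static int drc_lookup(unsigned xid, int *status)
{
    struct drc_entry *e = &drc[xid % DRC_SIZE];
    if (e->valid && e->xid == xid) {
        *status = e->status;
        return 1;
    }
    return 0;
}

/* Remember the completion status after performing the operation once. */
static void drc_insert(unsigned xid, int status)
{
    struct drc_entry *e = &drc[xid % DRC_SIZE];
    e->xid = xid;
    e->status = status;
    e->valid = 1;
}
```

Note how the fixed table size illustrates the limitation discussed above: a colliding entry evicts an older one, so a sufficiently delayed retransmission is processed as a new request.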

Filename Component Handling

Server implementations of the NFS Version 3 protocol will frequently impose restrictions on the names that can be created. Many servers will also forbid the use of names that contain certain characters, such as the path component separator used by the server operating system. For example, an XPG file system will reject a name that contains "/", while "." and ".." are distinguished names in an XPG file system and must not be specified as the name when creating a file system object. The exact error status values returned for these errors are specified in the description of each procedure. The values (which conform to NFS Version 2 protocol server practice) are not necessarily obvious, nor are they consistent from one procedure to the next.
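The kind of name screening an XPG-style server applies before creating an object can be sketched as follows; the function name is illustrative, and a real server would return the specific NFS error status prescribed for each procedure rather than a boolean.

```c
#include <string.h>

/* Reject names containing the path component separator, the
 * distinguished names "." and "..", and the empty name. */
static int name_ok(const char *name)
{
    if (name[0] == '\0')
        return 0;
    if (strchr(name, '/') != NULL)
        return 0;
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return 0;
    return 1;
}
```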

Synchronous Modifying Operations

Data-modifying operations in the NFS Version 3 protocol are synchronous. When a procedure returns to the client, the client can assume that the operation has completed and any data associated with the request is now on stable storage.

Stable Storage

NFS Version 3 protocol servers must be able to recover without data loss from multiple power failures (including cascading power failures; that is, several power failures in quick succession), operating system failures and hardware failure of components other than the storage medium itself (for example, disk or non-volatile RAM).

Some examples of stable storage that are allowable for an NFS server include:
 *  media commit of data; that is, the modified data has been successfully written to the disk media (for example, the disk platter)
 *  an immediate reply disk drive with battery-backed on-drive intermediate storage or an uninterruptible power system (UPS)
 *  server commit of data with battery-backed intermediate storage and recovery software
 *  cache commit with an uninterruptible power system (UPS) and recovery software

Conversely, the following are not examples of stable storage:
 *  an immediate reply disk drive without battery-backed on-drive intermediate storage or an uninterruptible power system (UPS)
 *  cache commit without both an uninterruptible power system (UPS) and recovery software
 *  server commit of data without battery-backed intermediate storage and memory

The only exception to this (introduced in the Version 3 protocol) is as described under the WRITE procedure on the handling of the stable bit, and the use of the COMMIT procedure. It is the use of the synchronous COMMIT procedure that provides the necessary semantic support in the NFS Version 3 protocol.

Lookups and Name Resolution

A common objection to the NFS Version 3 protocol is the philosophy of component-by-component LOOKUP by the client in resolving a name. The objection is that this is inefficient, as latencies for component-by-component LOOKUP would be unbearable.

Implementation practice solves this issue. A name cache, providing component to file-handle mapping, is kept on the client to short-circuit actual LOOKUP invocations over the wire. The cache is subject to the same timeout parameters that bound the validity of cached attributes.

Note that multi-component lookup is allowed relative to the public filehandle (see WebNFS Extensions) for use by WebNFS clients as an alternative to the MNTPROC_MNT procedure of the MOUNT protocol. Clients may also cache the results of these multi-component lookups, subject to the same timeout parameters that bound attributes.
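A client-side name cache entry of the kind described above can be sketched as follows; the structure layout and field sizes are illustrative, not mandated by the protocol.

```c
#include <time.h>

/* Sketch of a client name cache entry: a component-to-file-handle
 * mapping trusted only until its timeout expires, like cached
 * attributes. A miss or an expired entry forces a real LOOKUP. */
struct ncache_entry {
    char   component[256];  /* name within the parent directory */
    char   fh[64];          /* opaque file handle bytes from the server */
    time_t expires;         /* absolute expiry time */
};

static int ncache_valid(const struct ncache_entry *e, time_t now)
{
    return now < e->expires;
}
```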

Adaptive Retransmission

Most client implementations use either an exponential back-off strategy to some maximum retransmission value, or a more adaptive strategy that attempts congestion avoidance.
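The exponential back-off strategy mentioned above can be sketched in a few lines; the doubling factor and the existence of a fixed ceiling are the essential points, while the particular values a client uses are implementation choices.

```c
/* Sketch of exponential back-off: double the retransmission timeout
 * on each retry, up to a fixed maximum. */
static unsigned next_timeout_ms(unsigned cur_ms, unsigned max_ms)
{
    unsigned next = cur_ms * 2;
    return next > max_ms ? max_ms : next;
}
```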

Caching Policies

The NFS Version 3 protocol does not define a policy for caching on the client or server. In particular, there is no support for strict cache consistency between a client and server, nor between different clients.

Stable Versus Unstable Writes

The setting of the stable field in the WRITE arguments (that is, whether or not to do asynchronous WRITE requests) is straightforward on a UNIX client. If the NFS Version 3 protocol client receives a write request that is not marked as being asynchronous, it should generate the RPC with stable set to DATA_SYNC or FILE_SYNC. If the request is marked as being asynchronous, the RPC should be generated with stable set to UNSTABLE. If the response comes back with the committed field set to DATA_SYNC or FILE_SYNC, the client should just mark the write request as done and no further action is required. If committed is set to UNSTABLE, indicating that the buffer was not synchronised with the server's disk, the client will need to mark the buffer in some way that indicates that a copy of the buffer lives on the server and that a new copy does not need to be sent to the server, but that a commit is required.

Note that this algorithm introduces a third state for buffers: in addition to dirty and done, a buffer may now be done but in need of a commit. This extra state on the client will likely require modifications to the system outside of the NFS Version 3 protocol client.
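The buffer-state transition driven by the committed field of the WRITE reply can be sketched as follows; the enum names for the buffer states are illustrative, while the stable_how values correspond to those defined by the protocol.

```c
/* The three client buffer states described above. */
enum buf_state { BUF_DIRTY, BUF_NEEDS_COMMIT, BUF_DONE };

/* Values of the stable/committed field as defined by the protocol. */
enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

/* A synchronous commitment (DATA_SYNC or FILE_SYNC) finishes the
 * buffer; an UNSTABLE reply means a copy lives on the server but a
 * COMMIT is still required before the buffer can be discarded. */
static enum buf_state after_write_reply(enum stable_how committed)
{
    return committed == UNSTABLE ? BUF_NEEDS_COMMIT : BUF_DONE;
}
```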

The asynchronous write opens up the window of problems associated with write sharing. For example: client A writes some data asynchronously. Client A is still holding the buffers cached, waiting to commit them later. Client B reads the modified data and writes it back to the server. The server then crashes. When it comes back up, client A issues a COMMIT operation, which returns with a different cookie as well as changed attributes. In this case, the correct action may or may not be to retransmit the cached buffers. Unfortunately, client A cannot tell for sure, so it will need to retransmit the buffers, thus overwriting the changes from client B. Fortunately, write sharing is rare and the solution matches the current write sharing situation. Without using locking for synchronisation, the behaviour will be indeterminate.

In a high availability (redundant system) server implementation, two cases exist that relate to the verf changing. If the high availability server implementation does not use a shared-memory scheme, then the verf must change on failover, since the unsynchronised data is not available to the second processor and there is no guarantee that the system that had the data cached was able to flush it to stable storage before going down. The client will need to retransmit the data to be safe. In a shared-memory high availability server implementation, the verf would not need to change because the server would still have the cached data available to it to be flushed. The exact policy regarding the verf in a shared memory high availability implementation, however, is up to the server implementor.
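The client-side check that follows from the discussion above can be sketched as follows: the write verifier (verf) returned by COMMIT is compared with the one saved at WRITE time, and any difference forces retransmission of the uncommitted buffers. The 8-octet verifier size matches the protocol's write verifier; the function name is illustrative.

```c
#include <string.h>

#define NFS3_WRITEVERFSIZE 8   /* size of the write verifier, in octets */

/* If the verifier returned by COMMIT differs from the one seen at
 * WRITE time, the server may have rebooted (or failed over without
 * shared memory), so the client must retransmit its uncommitted
 * buffers to be safe. */
static int must_retransmit(const unsigned char old_verf[NFS3_WRITEVERFSIZE],
                           const unsigned char new_verf[NFS3_WRITEVERFSIZE])
{
    return memcmp(old_verf, new_verf, NFS3_WRITEVERFSIZE) != 0;
}
```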

32-bit Clients/Servers and 64-bit Clients/Servers

The 64-bit nature of the NFS Version 3 protocol introduces several compatibility problems. The two most notable are mismatched clients and servers; that is, a 32-bit client and a 64-bit server, or a 64-bit client and a 32-bit server.

The problems of a 64-bit client and a 32-bit server are easy to handle. The client will never encounter a file that it cannot handle. If it sends a request that the server cannot handle, the server should reject the request with an appropriate error.

The problems of a 32-bit client and a 64-bit server are much harder to handle. In this situation, the server does not have a problem, because it can handle anything that the client can generate. The client, however, may encounter a file that it cannot handle: a file whose size cannot be expressed in 32 bits. Such a client cannot properly decode the size of the file into its local attributes structure. A file can also grow beyond the client's limit while the client is accessing it.

The solutions to these problems are left to the individual implementor. However, there are two common approaches used to resolve this situation; the implementor can choose between them, or may invent a new solution altogether.

The most common solution is for the client to deny access to any file whose size cannot be expressed in 32 bits. This is probably the safest approach, but it introduces some strange semantics when the file grows beyond the client's limit while that client is accessing it: the file becomes inaccessible even while it is being accessed.

The second solution is for the client to map any size greater than it can handle to the maximum size that it can handle. This allows the application to access as much of the file as possible given the 32-bit offset restriction, and eliminates the strange semantics of the file effectively disappearing while being accessed, but it introduces another problem: the client will not be able to access the entire file.

Currently, the first solution is the recommended one. However, client implementors are encouraged to do the best that they can to reduce the effects of this situation.
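The second, size-clamping approach can be sketched as follows. The 2^31-1 ceiling shown is illustrative; the actual limit depends on the client's local file size representation.

```c
#include <stdint.h>

/* Sketch of the second approach: a 32-bit client maps any server file
 * size it cannot represent down to its own maximum representable size. */
static uint32_t clamp_size(uint64_t server_size)
{
    const uint64_t max32 = 0x7fffffffULL;   /* illustrative 2^31-1 ceiling */
    return (uint32_t)(server_size > max32 ? max32 : server_size);
}
```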

Server Procedures

The following reference pages define the additional set of procedures, with arguments and results defined using the RPC language, for the Version 3 protocol.
 *  Remote file service routines
program NFS_PROGRAM {
    version NFS_V3 {
        void NFSPROC3_NULL(void) = 0;
        GETATTR3res NFSPROC3_GETATTR(GETATTR3args) = 1;
        SETATTR3res NFSPROC3_SETATTR(SETATTR3args) = 2;
        LOOKUP3res NFSPROC3_LOOKUP(LOOKUP3args) = 3;
        ACCESS3res NFSPROC3_ACCESS(ACCESS3args) = 4;
        READLINK3res NFSPROC3_READLINK(READLINK3args) = 5;
        READ3res NFSPROC3_READ(READ3args) = 6;
        WRITE3res NFSPROC3_WRITE(WRITE3args) = 7;
        CREATE3res NFSPROC3_CREATE(CREATE3args) = 8;
        MKDIR3res NFSPROC3_MKDIR(MKDIR3args) = 9;
        SYMLINK3res NFSPROC3_SYMLINK(SYMLINK3args) = 10;
        MKNOD3res NFSPROC3_MKNOD(MKNOD3args) = 11;
        REMOVE3res NFSPROC3_REMOVE(REMOVE3args) = 12;
        RMDIR3res NFSPROC3_RMDIR(RMDIR3args) = 13;
        RENAME3res NFSPROC3_RENAME(RENAME3args) = 14;
        LINK3res NFSPROC3_LINK(LINK3args) = 15;
        READDIR3res NFSPROC3_READDIR(READDIR3args) = 16;
        READDIRPLUS3res NFSPROC3_READDIRPLUS(READDIRPLUS3args) = 17;
        FSSTAT3res NFSPROC3_FSSTAT(FSSTAT3args) = 18;
        FSINFO3res NFSPROC3_FSINFO(FSINFO3args) = 19;
        PATHCONF3res NFSPROC3_PATHCONF(PATHCONF3args) = 20;
        COMMIT3res NFSPROC3_COMMIT(COMMIT3args) = 21;
    } = 3;
} = 100003;
