The following is a list of the important changes between the NFS Version 2 protocol and the NFS Version 3 protocol.
The ROOT and WRITECACHE procedures have been removed. A MKNOD procedure has been defined to allow the creation of special files, eliminating the overloading of CREATE.
Caching on the client is not defined nor dictated by the
NFS Version 3 protocol, but additional information and
hints have been added to the protocol to allow clients that
implement caching to manage their caches more effectively.
Procedures that affect the attributes of a file or directory may now return the new attributes after the operation has completed, to optimise out a subsequent GETATTR used in validating attribute caches. In addition, operations that modify the directory in which the target object resides return the old and new attributes of the directory, to allow clients to implement more intelligent cache invalidation procedures.
The ACCESS procedure provides access permission checking on the server, the FSSTAT procedure returns dynamic information about a file system, the FSINFO procedure returns static information about a file system and server, the READDIRPLUS procedure returns file handles and attributes in addition to directory entries, and the PATHCONF procedure returns XPG4 pathconf information about a file.
WebNFS servers must use UDP and TCP port 2049.
The following sizes, given in decimal bytes, are used in the protocol:
Constant | Value | Description |
---|---|---|
NFS3_FHSIZE | 64 | The maximum size in bytes of the opaque file handle |
NFS3_COOKIEVERFSIZE | 8 | The size in bytes of the opaque cookie verifier passed by READDIR and READDIRPLUS |
NFS3_CREATEVERFSIZE | 8 | The size in bytes of the opaque verifier used for exclusive CREATE |
NFS3_WRITEVERFSIZE | 8 | The size in bytes of the opaque verifier used for asynchronous WRITE |
typedef unsigned hyper uint64;
typedef hyper int64;
typedef unsigned long uint32;
typedef long int32;
typedef string filename3<>;
typedef string nfspath3<>;
typedef uint64 fileid3;
typedef uint64 cookie3;
typedef opaque cookieverf3[NFS3_COOKIEVERFSIZE];
typedef opaque createverf3[NFS3_CREATEVERFSIZE];
typedef opaque writeverf3[NFS3_WRITEVERFSIZE];
typedef uint32 uid3;
typedef uint32 gid3;
typedef uint64 size3;
typedef uint64 offset3;
typedef uint32 mode3;
typedef uint32 count3;
enum nfsstat3 {
NFS3_OK = 0,
NFS3ERR_PERM = 1,
NFS3ERR_NOENT = 2,
NFS3ERR_IO = 5,
NFS3ERR_NXIO = 6,
NFS3ERR_ACCES = 13,
NFS3ERR_EXIST = 17,
NFS3ERR_XDEV = 18,
NFS3ERR_NODEV = 19,
NFS3ERR_NOTDIR = 20,
NFS3ERR_ISDIR = 21,
NFS3ERR_INVAL = 22,
NFS3ERR_FBIG = 27,
NFS3ERR_NOSPC = 28,
NFS3ERR_ROFS = 30,
NFS3ERR_MLINK = 31,
NFS3ERR_NAMETOOLONG = 63,
NFS3ERR_NOTEMPTY = 66,
NFS3ERR_DQUOT = 69,
NFS3ERR_STALE = 70,
NFS3ERR_REMOTE = 71,
NFS3ERR_BADHANDLE = 10001,
NFS3ERR_NOT_SYNC = 10002,
NFS3ERR_BAD_COOKIE = 10003,
NFS3ERR_NOTSUPP = 10004,
NFS3ERR_TOOSMALL = 10005,
NFS3ERR_SERVERFAULT = 10006,
NFS3ERR_BADTYPE = 10007,
NFS3ERR_JUKEBOX = 10008
};
The nfsstat3 type is returned with every procedure's results except for the NULL procedure. A value of NFS3_OK indicates that the call completed successfully. Any other value indicates that some error occurred on the call, as identified by the error code. No other values may be returned by a server. Servers are expected to make a best effort mapping of error conditions to the set of error codes defined. In addition, no error precedences are specified by this document. Error precedences determine the error value that should be returned when more than one error applies in a given situation. The error precedence will be determined by the individual server implementation. If the client requires specific error precedences, it should check for the specific errors for itself. A description of each defined error follows.
enum ftype3 {
NF3REG = 1,
NF3DIR = 2,
NF3BLK = 3,
NF3CHR = 4,
NF3LNK = 5,
NF3SOCK = 6,
NF3FIFO = 7
};
The enumeration ftype3 gives the type of a file, as follows:
Type | Description |
---|---|
NF3REG | Regular file |
NF3DIR | Directory |
NF3BLK | Block special device file |
NF3CHR | Character special device file |
NF3LNK | Symbolic link |
NF3SOCK | Socket |
NF3FIFO | Named pipe |
struct specdata3 {
uint32 specdata1;
uint32 specdata2;
};
The interpretation of the two words depends on the type of file system object. For a block special (NF3BLK) or character special (NF3CHR) file, specdata1 and specdata2 are the major and minor device numbers, respectively. For all other file types, these two elements should either be set to zero or the values should be agreed upon by the client and server. If the client and server do not agree upon the values, the client should treat these fields as if they are set to zero. This data field is returned as part of the fattr3 structure and so is available from all replies returning attributes. Since these fields are otherwise unused for objects that are not devices, out-of-band information can be passed from the server to the client. However, both the server and the client must agree on the values passed.
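To illustrate, a client could interpret the two words along the following lines (a minimal sketch in C; the type declarations mirror the XDR above, and the helper name is hypothetical):
#include <stdint.h>

typedef uint32_t uint32;
enum ftype3 { NF3REG = 1, NF3DIR, NF3BLK, NF3CHR, NF3LNK, NF3SOCK, NF3FIFO };
struct specdata3 { uint32 specdata1; uint32 specdata2; };

/* Hypothetical helper: extract device numbers, treating the fields
 * as zero for non-device objects, as the text above recommends. */
static void interpret_specdata(enum ftype3 type, struct specdata3 sd,
                               uint32 *major, uint32 *minor)
{
    if (type == NF3BLK || type == NF3CHR) {
        *major = sd.specdata1;   /* major device number */
        *minor = sd.specdata2;   /* minor device number */
    } else {
        *major = 0;              /* not a device: treat as zero */
        *minor = 0;
    }
}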
struct nfs_fh3 {
opaque data<NFS3_FHSIZE>;
};
The nfs_fh3 structure is the variable-length opaque object returned by the server on LOOKUP, CREATE, SYMLINK, MKNOD, LINK or READDIRPLUS operations, which is used by the client on subsequent operations to reference the file. The file handle contains all the information the server needs to distinguish an individual file. To the client, the file handle is opaque. The client stores file handles for use in a later request and can compare two file handles from the same server for equality by doing a byte-by-byte comparison, but cannot otherwise interpret the contents of file handles. If two file handles from the same server are equal, they must refer to the same file, but if they are not equal, no conclusions can be drawn. Servers should try to maintain a one-to-one correspondence between file handles and files, but this is not required. Clients should use file handle comparisons only to improve performance, not for correct behaviour.
Servers can revoke the access provided by a file handle at any time. If the file handle passed in a call refers to a file system object that no longer exists on the server or access for that file handle has been revoked, the NFS3ERR_STALE error should be returned.
A filehandle with a length of zero is called the public filehandle. It is used by WebNFS clients to identify an associated public directory on the server.
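As an illustration of the comparison rule above, a client-side equality check might be sketched in C as follows (the struct is a hand-written mirror of the XDR definition, not a generated one):
#include <string.h>
#include <stdint.h>

#define NFS3_FHSIZE 64

/* Variable-length opaque handle, mirroring the XDR definition. */
struct nfs_fh3 {
    uint32_t len;                    /* actual length, <= NFS3_FHSIZE */
    unsigned char data[NFS3_FHSIZE];
};

/* Two handles from the same server are equal only if their lengths
 * match and every byte matches; no other interpretation is allowed. */
static int fh_equal(const struct nfs_fh3 *a, const struct nfs_fh3 *b)
{
    return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}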
struct nfstime3 {
uint32 seconds;
uint32 nseconds;
};
The nfstime3 structure gives the number of seconds and nanoseconds since midnight January 1, 1970 Greenwich Mean Time. It is used to pass time and date information. The times associated with files are all server times except in the case of a SETATTR operation, where the client can explicitly set the file time. A server converts to and from local time when processing time values, preserving as much accuracy as possible. If the precision of timestamps stored for a file is less than that defined by the NFS Version 3 protocol, loss of precision can occur. An adjunct time maintenance protocol is recommended to reduce client and server time skew.
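As an example, a UNIX implementation might convert between nfstime3 and struct timespec roughly as sketched below (C; both helpers are hypothetical, and truncation to the server's stored precision is where the loss of precision noted above occurs):
#include <stdint.h>
#include <time.h>

struct nfstime3 { uint32_t seconds; uint32_t nseconds; };

/* Protocol time -> local struct timespec (both count from the epoch). */
static struct timespec nfstime3_to_timespec(struct nfstime3 t)
{
    struct timespec ts;
    ts.tv_sec  = (time_t)t.seconds;
    ts.tv_nsec = (long)t.nseconds;
    return ts;
}

/* Local time -> protocol time; nanoseconds beyond the server's stored
 * precision are simply lost, as the text above notes. */
static struct nfstime3 timespec_to_nfstime3(struct timespec ts)
{
    struct nfstime3 t;
    t.seconds  = (uint32_t)ts.tv_sec;
    t.nseconds = (uint32_t)ts.tv_nsec;
    return t;
}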
struct fattr3 {
ftype3 type;
mode3 mode;
uint32 nlink;
uid3 uid;
gid3 gid;
size3 size;
size3 used;
specdata3 rdev;
uint64 fsid;
fileid3 fileid;
nfstime3 atime;
nfstime3 mtime;
nfstime3 ctime;
};
The fattr3 structure defines the attributes of a file system object. It is returned by most operations on an object; in the case of operations that affect two objects (for example, a MKDIR that modifies the target directory attributes and defines new attributes for the newly created directory), the attributes for both may be returned. In some cases, the attributes are returned in the structure, wcc_data, which is defined below; in other cases the attributes are returned alone.
The fattr3 structure contains the basic attributes of a file. All servers must support this set of attributes even if they have to simulate some of the fields.
The mode bits are defined as follows:
Bit | Description |
---|---|
0x00800 | Set user ID on execution. |
0x00400 | Set group ID on execution. |
0x00200 | Save swapped text (not defined in XPG4). |
0x00100 | Read permission for owner. |
0x00080 | Write permission for owner. |
0x00040 | Execute permission for owner on a file, or lookup (search) permission for owner in a directory. |
0x00020 | Read permission for group. |
0x00010 | Write permission for group. |
0x00008 | Execute permission for group on a file, or lookup (search) permission for group in a directory. |
0x00004 | Read permission for others. |
0x00002 | Write permission for others. |
0x00001 | Execute permission for others on a file, or lookup (search) permission for others in a directory. |
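A permission check against these bits might be sketched as follows (C; the constant names are hypothetical, chosen to match the table above):
#include <stdint.h>

#define MODE3_RUSR 0x00100   /* read permission for owner  */
#define MODE3_RGRP 0x00020   /* read permission for group  */
#define MODE3_ROTH 0x00004   /* read permission for others */

/* Select the read bit relevant to the caller: owner first, then
 * group, then others, mirroring the usual UNIX evaluation order. */
static int can_read(uint32_t mode, int is_owner, int is_group_member)
{
    if (is_owner)
        return (mode & MODE3_RUSR) != 0;
    if (is_group_member)
        return (mode & MODE3_RGRP) != 0;
    return (mode & MODE3_ROTH) != 0;
}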
union post_op_attr switch (bool attributes_follow) {
case TRUE:
fattr3 attributes;
case FALSE:
void;
};
The post_op_attr structure is used for returning attributes in those operations that are not directly involved with manipulating attributes. One of the principles of this revision of the NFS protocol is to return the real value from the indicated operation and not an error from an incidental operation. The post_op_attr structure was designed to allow the server to recover from errors encountered while getting attributes.
This appears to make returning attributes optional. However, server implementors are strongly encouraged to make a best effort to return attributes whenever possible, even when returning an error.
struct wcc_attr {
size3 size;
nfstime3 mtime;
nfstime3 ctime;
};
The wcc_attr structure is the subset of pre-operation attributes needed to improve support for the weak cache consistency semantics. The size argument is the file size in bytes of the object before the operation. The mtime argument is the time of last modification of the object before the operation. The ctime argument is the time of last change to the attributes of the object before the operation.
The use of mtime by clients to detect changes to file system objects residing on a server is dependent on the granularity of the time base on the server.
union pre_op_attr switch (bool attributes_follow) {
case TRUE:
wcc_attr attributes;
case FALSE:
void;
};
struct wcc_data {
pre_op_attr before;
post_op_attr after;
};
When a client performs an operation that modifies the state of a file or directory on the server, it cannot immediately determine from the post-operation attributes whether the operation just performed was the only operation on the object since the last time the client received the attributes for the object. This is important, since if an intervening operation has changed the object, the client will need to invalidate any cached data for the object (except for the data that it just wrote).
To deal with this, the notion of weak cache consistency data (wcc_data) is introduced. A wcc_data structure consists of certain key fields from the object attributes before the operation, together with the object attributes after the operation. This information allows the client to manage its cache more accurately than in NFS Version 2 protocol implementations. The term weak cache consistency emphasizes the fact that this mechanism does not provide the strict server-client consistency that a cache consistency protocol would provide.
In order to support the weak cache consistency model, the server must be able to get the pre-operation attributes of the object, perform the intended modify operation, and then get the post-operation attributes atomically. If there is a window for the object to get modified between the operation and either of the get attributes operations, then the client will not be able to determine whether it was the only entity to modify the object. Some information will have been lost, thus weakening the weak cache consistency guarantees.
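A client-side validity check built on these pre-operation attributes might look like the following sketch (C; the cache_entry structure is hypothetical and stands in for whatever per-object cache the client keeps):
#include <stdint.h>
#include <stdbool.h>

struct nfstime3 { uint32_t seconds; uint32_t nseconds; };
struct wcc_attr { uint64_t size; struct nfstime3 mtime; struct nfstime3 ctime; };

/* Hypothetical per-object cache entry kept by the client. */
struct cache_entry {
    bool     valid;
    uint64_t size;
    struct nfstime3 mtime;
    struct nfstime3 ctime;
};

static bool time_eq(struct nfstime3 a, struct nfstime3 b)
{
    return a.seconds == b.seconds && a.nseconds == b.nseconds;
}

/* If the pre-operation attributes match what we cached, ours was the
 * only modification since we last looked: keep cached data and adopt
 * the post-operation attributes. Otherwise another entity intervened
 * and the cached data (other than what we just wrote) must go. */
static bool cache_still_valid(const struct cache_entry *c,
                              const struct wcc_attr *before)
{
    return c->valid &&
           c->size == before->size &&
           time_eq(c->mtime, before->mtime) &&
           time_eq(c->ctime, before->ctime);
}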
union post_op_fh3 switch (bool handle_follows) {
case TRUE:
nfs_fh3 handle;
case FALSE:
void;
};
One of the principles of this revision of the NFS protocol is to return the real value from the indicated operation and not an error from an incidental operation. The post_op_fh3 structure was designed to allow the server to recover from errors encountered while constructing a file handle.
This is the structure used to return a file handle from the CREATE, MKDIR, SYMLINK, MKNOD and READDIRPLUS requests. In each case, the client can get the file handle by issuing a LOOKUP request after a successful return from one of the listed operations. Returning the file handle is an optimisation so that the client is not forced to issue a LOOKUP request immediately to get the file handle.
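Client handling of the optional handle might therefore be sketched as follows (C; nfs3_lookup is a hypothetical wrapper for the client's LOOKUP call path):
#include <stdint.h>
#include <stdbool.h>

struct nfs_fh3 { uint32_t len; unsigned char data[64]; };
struct post_op_fh3 { bool handle_follows; struct nfs_fh3 handle; };

/* Hypothetical LOOKUP wrapper; returns 0 on success. */
extern int nfs3_lookup(const struct nfs_fh3 *dir, const char *name,
                       struct nfs_fh3 *out);

/* Use the optional handle when present, else fetch it with LOOKUP. */
static int resolve_new_handle(const struct post_op_fh3 *res,
                              const struct nfs_fh3 *dir, const char *name,
                              struct nfs_fh3 *out)
{
    if (res->handle_follows) {
        *out = res->handle;    /* server returned the handle: no LOOKUP */
        return 0;
    }
    return nfs3_lookup(dir, name, out);
}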
enum time_how {
DONT_CHANGE = 0,
SET_TO_SERVER_TIME = 1,
SET_TO_CLIENT_TIME = 2
};
union set_mode3 switch (bool set_it) {
case TRUE:
mode3 mode;
default:
void;
};
union set_uid3 switch (bool set_it) {
case TRUE:
uid3 uid;
default:
void;
};
union set_gid3 switch (bool set_it) {
case TRUE:
gid3 gid;
default:
void;
};
union set_size3 switch (bool set_it) {
case TRUE:
size3 size;
default:
void;
};
union set_atime switch (time_how set_it) {
case SET_TO_CLIENT_TIME:
nfstime3 atime;
default:
void;
};
union set_mtime switch (time_how set_it) {
case SET_TO_CLIENT_TIME:
nfstime3 mtime;
default:
void;
};
struct sattr3 {
set_mode3 mode;
set_uid3 uid;
set_gid3 gid;
set_size3 size;
set_atime atime;
set_mtime mtime;
};
The sattr3 structure contains the file attributes that can be set from the client. The fields are the same as the similarly named fields in the fattr3 structure. In the NFS Version 3 protocol, the attributes that can be set are described by a structure containing a set of discriminated unions. Each union indicates whether the corresponding attribute is to be updated, and if so, how.
There are two forms of discriminated unions used. In setting the mode, uid, gid or size, the discriminated union is switched on a Boolean, set_it; if it is TRUE, a value of the appropriate type is then encoded.
In setting the atime or mtime, the union is switched on an enumeration type, set_it. If set_it has the value DONT_CHANGE, the corresponding attribute is unchanged. If it has the value SET_TO_SERVER_TIME, the corresponding attribute is set by the server to its local time; no data is provided by the client. Finally, if set_it has the value SET_TO_CLIENT_TIME, the attribute is set to the time passed by the client in an nfstime3 structure. (See FSINFO, which addresses the issue of time granularity.)
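For example, a client truncating a file and stamping mtime with its own clock might fill in the relevant unions as sketched below (C; the flattened struct layouts are illustrative, not the XDR wire format):
#include <stdint.h>
#include <stdbool.h>

struct nfstime3 { uint32_t seconds; uint32_t nseconds; };
enum time_how { DONT_CHANGE = 0, SET_TO_SERVER_TIME = 1, SET_TO_CLIENT_TIME = 2 };

/* Flattened forms of the discriminated unions above. */
struct set_size3 { bool set_it; uint64_t size; };
struct set_mtime { enum time_how set_it; struct nfstime3 mtime; };

struct sattr3_partial {            /* only the fields used here */
    struct set_size3 size;
    struct set_mtime mtime;
};

static struct sattr3_partial truncate_and_stamp(uint64_t new_size,
                                                struct nfstime3 now)
{
    struct sattr3_partial s;
    s.size.set_it  = true;                /* set_size3: Boolean switch */
    s.size.size    = new_size;
    s.mtime.set_it = SET_TO_CLIENT_TIME;  /* set_mtime: enum switch    */
    s.mtime.mtime  = now;                 /* time supplied by client   */
    return s;
}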
struct diropargs3 {
nfs_fh3 dir;
filename3 name;
};
The diropargs3 structure is used in directory operations. The file handle, dir, identifies the directory in which to manipulate or access the file, name.
For example, some operating systems allow removal of open files. A process can open a file and, while it is open, remove it from the directory. The file can be read and written as long as the process keeps it open, even though the file has no name in the file system. It is impossible for a stateless server to implement these semantics. The client can do some tricks such as renaming the file on remove (to a hidden name), and only physically deleting it on close. The NFS Version 3 protocol provides sufficient functionality to implement most file system semantics on a client.
Every NFS Version 3 protocol client can also potentially be a server, and remote and local mounted file systems can be freely mixed. This leads to some problems when a client travels down the directory tree of a remote file system and reaches the mount point on the server for another remote file system. Allowing the server to follow the second remote mount would require loop detection, server lookup, and user revalidation. Instead, NFS Version 2 protocol and NFS Version 3 protocol implementations typically do not let clients cross a server's mount point. When a client does a LOOKUP on a directory on which the server has mounted a file system, the client sees the underlying directory instead of the mounted directory.
For example, if a server has a file system called /usr and mounts another file system on /usr/src, if a client mounts /usr, it does not see the mounted version of /usr/src. A client could do remote mounts that match the server's mount points to maintain the server's view. In this example, the client would also have to mount /usr/src in addition to /usr, even if they are from the same server.
Another common problem for non-XPG implementations is the special interpretation of the pathname ".." to mean the parent of a given directory. A future revision of the protocol may use an explicit flag to indicate the parent instead; in practice, however, this has not been a problem, as many working non-XPG implementations exist.
Using user ID and group ID implies that the client and server share the same user ID list. Every server and client pair must have the same mapping from user to user ID and from group to group ID. Since every client can also be a server, this tends to imply that the whole network shares the same user/group ID space. If this is not the case, then it usually falls upon the server to perform some custom mapping of credentials from one authentication domain into another. A discussion of techniques for managing a shared user space or for providing mechanisms for user ID mapping is beyond the scope of this document.
Another problem arises due to the usually stateful open operation. Most operating systems check permission at open time, and then check that the file is open on each read and write request. With stateless servers, the server cannot detect that the file is open and must do permission checking on each read and write call. UNIX client semantics of access permission checking on open can be provided with the ACCESS procedure call in this revision, which allows a client to explicitly check access permissions without resorting to trying the operation. On a local file system, a user can open a file and then change the permissions so that no one is allowed to touch it, but will still be able to write to the file because it is open. On a remote file system, by contrast, the write would fail. To get around this problem, the server's permission checking algorithm should allow the owner of a file to access it regardless of the permission setting. This is needed in a practical NFS Version 3 protocol server implementation, but it does depart from correct local file system semantics. This should not affect the return result of access permissions as returned by the ACCESS procedure, however.
A similar problem has to do with paging in an executable program over the network. The operating system usually checks for execute permission before opening a file for demand paging, and then reads blocks from the open file. In a local UNIX file system, an executable file does not need read permission to execute (page-in). An NFS Version 3 protocol server can not tell the difference between a normal file read (where the read permission bit is meaningful) and a demand page-in read (where the server should allow access to the executable file if the execute bit is set for that user or group or public). To make this work, the server allows reading of files if the user ID given in the call has either execute or read permission on the file, through ownership, group membership or public access. Again, this departs from correct local file system semantics.
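One way a server might express that relaxed read check is sketched below (C; all names are hypothetical, and a real implementation would also apply the owner override and export options discussed above):
#include <stdint.h>
#include <stdbool.h>

#define MODE3_RUSR 0x00100   /* read for owner     */
#define MODE3_XUSR 0x00040   /* execute for owner  */
#define MODE3_RGRP 0x00020   /* read for group     */
#define MODE3_XGRP 0x00008   /* execute for group  */
#define MODE3_ROTH 0x00004   /* read for others    */
#define MODE3_XOTH 0x00001   /* execute for others */

/* Allow a READ if the caller has read OR execute permission through
 * ownership, group membership or public access, so demand page-in of
 * executables works; this departs from strict local semantics. */
static bool read_allowed(uint32_t mode, bool is_owner, bool in_group)
{
    if (is_owner)
        return (mode & (MODE3_RUSR | MODE3_XUSR)) != 0;
    if (in_group)
        return (mode & (MODE3_RGRP | MODE3_XGRP)) != 0;
    return (mode & (MODE3_ROTH | MODE3_XOTH)) != 0;
}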
In some operating systems, a particular user (on UNIX systems, the user ID 0) has access to all files, no matter what permission and ownership they have. This super-user permission might not be allowed on the server, since anyone who can become super-user on their client could gain access to all remote files. A UNIX server by default maps user ID 0 to a distinguished value (UID_NOBODY), as well as mapping the groups list, before doing its access checking. A server implementation may provide a mechanism to change this mapping. This works except for NFS Version 3 protocol root file systems (required for diskless NFS Version 3 protocol client support), where super-user access cannot be avoided. Export options are used, on the server, to restrict the set of clients allowed super-user access.
When used in a file server context, the term idempotent can be used to distinguish between operation types. An idempotent request is one that a server can perform more than once with equivalent results (though it may in fact change, as a side effect, the access time on a file, say for READ). Some NFS operations are obviously non-idempotent. They cannot be reprocessed without special attention simply because they may fail if tried a second time. The CREATE request, for example, can be used to create a file for which the owner does not have write permission. A duplicate of this request cannot succeed if the original succeeded. Likewise, a file can be removed only once.
The side effects caused by performing a duplicate non-idempotent request can be destructive (for example, a truncate operation causing lost writes). The combination of a stateless design with the common choice of an unreliable network transport (UDP) implies the possibility of destructive replays of non-idempotent requests. More accurately, it is the inherent stateless design of the NFS Version 3 protocol on top of an unreliable RPC mechanism that yields this possibility: even in an implementation of the NFS Version 3 protocol over a reliable connection-oriented transport, a connection break with automatic re-establishment requires duplicate request processing, since the client will retransmit the request and the server needs to deal with a potential duplicate non-idempotent request.
Most NFS Version 3 protocol server implementations use a cache of recent requests (called the duplicate request cache) for the processing of duplicate non-idempotent requests. The duplicate request cache provides a short-term memory mechanism in which the original completion status of a request is remembered and the operation attempted only once. If a duplicate copy of this request is received, then the original completion status is returned.
The duplicate-request cache mechanism has been useful in reducing destructive side effects caused by duplicate NFS Version 3 protocol requests. This mechanism, however, does not guarantee against these destructive side effects in all failure modes. Most servers store the duplicate request cache in RAM, so the contents are lost if the server crashes. The exception to this may possibly occur in a redundant server approach to high availability, where the file system itself may be used to share the duplicate request cache state. Even if the cache survives server reboots (or failovers in the high availability case), its effectiveness is a function of its size. A network partition can cause a cache entry to be reused before a client receives a reply for the corresponding request. If this happens, the duplicate request will be processed as a new one, possibly with destructive side effects.
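A duplicate request cache might be sketched as a small table keyed by the RPC transaction ID (C; the structure is illustrative only, and a real server would also key on the client's address, program and procedure):
#include <stdint.h>
#include <stdbool.h>

#define DRC_SLOTS 128        /* illustrative size; real caches are tuned */

struct drc_entry {
    bool     valid;          /* slot holds a completed request      */
    uint32_t xid;            /* RPC transaction ID of that request  */
    uint32_t status;         /* completion status originally sent   */
};

static struct drc_entry drc[DRC_SLOTS];

/* Replay path: if the xid is remembered, return the old status. */
static bool drc_find(uint32_t xid, uint32_t *status)
{
    struct drc_entry *e = &drc[xid % DRC_SLOTS];
    if (e->valid && e->xid == xid) {
        *status = e->status;
        return true;
    }
    return false;
}

/* Record path: after performing the request exactly once, remember
 * its completion status; an old entry in the slot is simply evicted,
 * which is the reuse hazard described in the text above. */
static void drc_store(uint32_t xid, uint32_t status)
{
    struct drc_entry *e = &drc[xid % DRC_SLOTS];
    e->valid  = true;
    e->xid    = xid;
    e->status = status;
}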
Some examples of stable storage that are allowable for an NFS server include:
1. Media commit of data; that is, the modified data has been successfully written to the disk media (for example, the disk platter).
2. An immediate reply disk drive with battery-backed on-drive intermediate storage or uninterruptible power system (UPS).
3. Server commit of data with battery-backed intermediate storage and recovery software.
4. Cache commit with uninterruptible power system (UPS) and recovery software.
Conversely, the following are not examples of stable storage:
1. An immediate reply disk drive without battery-backed on-drive intermediate storage or uninterruptible power system (UPS).
2. Cache commit without both uninterruptible power system (UPS) and recovery software.
3. Server commit of data without battery-backed intermediate storage and memory.
The only exception to this (introduced in the NFS Version 3 protocol) is described under the WRITE procedure, in the handling of the stable bit and the use of the COMMIT procedure. It is the use of the synchronous COMMIT procedure that provides the necessary semantic support in the NFS Version 3 protocol.
Implementation practice solves this issue. A name cache, providing component-to-file-handle mapping, is kept on the client to short-circuit actual LOOKUP invocations over the wire. The cache is subject to cache timeout parameters that bound the validity of the cached attributes.
Note that multi-component lookup is allowed relative to the public filehandle (see the description of the public filehandle under the nfs_fh3 structure above).
Note that this algorithm introduces a new state for buffers, so there are now three states: dirty, done but needing to be committed, and done. This extra state on the client will likely require modifications to the system outside of the NFS Version 3 protocol client.
The asynchronous write opens up the window of problems associated with write sharing. For example: client A writes some data asynchronously. Client A is still holding the buffers cached, waiting to commit them later. Client B reads the modified data and writes it back to the server. The server then crashes. When it comes back up, client A issues a COMMIT operation, which returns with a different cookie as well as changed attributes. In this case, the correct action may or may not be to retransmit the cached buffers. Unfortunately, client A can't tell for sure, so it will need to retransmit the buffers, thus overwriting the changes from client B. Fortunately, write sharing is rare and the solution matches the current write sharing situation. Without using locking for synchronisation, the behaviour will be indeterminate.
In a high availability (redundant system) server implementation, two cases exist that relate to the verf changing. If the high availability server implementation does not use a shared-memory scheme, then the verf must change on failover, since the unsynchronised data is not available to the second processor and there is no guarantee that the system that had the data cached was able to flush it to stable storage before going down. The client will need to retransmit the data to be safe. In a shared-memory high availability server implementation, the verf would not need to change because the server would still have the cached data available to it to be flushed. The exact policy regarding the verf in a shared memory high availability implementation, however, is up to the server implementor.
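The client-side verifier comparison that drives retransmission might be sketched as follows (C; writeverf3 mirrors the 8-byte verifier defined earlier):
#include <string.h>
#include <stdbool.h>

#define NFS3_WRITEVERFSIZE 8
typedef unsigned char writeverf3[NFS3_WRITEVERFSIZE];

/* Compare the verifier returned by COMMIT (or a later WRITE) with the
 * one saved when the data was first written asynchronously. A change
 * means the server may have lost uncommitted data (crash, or failover
 * without shared memory), so every uncommitted buffer must be
 * retransmitted. */
static bool must_retransmit(const writeverf3 saved, const writeverf3 latest)
{
    return memcmp(saved, latest, NFS3_WRITEVERFSIZE) != 0;
}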
The problems of a 64 bit client and a 32 bit server are easy to handle. The client will never encounter a file that it can not handle. If it sends a request to the server that the server can not handle, the server should reject the request with an appropriate error.
The problems of a 32 bit client and a 64 bit server are much harder to handle. In this situation, the server does not have a problem because it can handle anything that the client can generate. However, the client may encounter a file that it can not handle. The client will not be able to handle a file whose size can not be expressed in 32 bits. Thus, the client will not be able to properly decode the size of the file into its local attributes structure. Also, a file can grow beyond the limit of the client while the client is accessing the file.
The solutions to these problems are left up to the individual implementor. However, there are two common approaches used to resolve this situation. The implementor can choose between them or even can invent a new solution altogether.
The most common solution is for the client to deny access to any file whose size can not be expressed in 32 bits. This is probably the safest, but does introduce some strange semantics when the file grows beyond the limit of the client while it is being accessed by that client: the file becomes inaccessible even while it is being accessed.
The second solution is for the client to map any size greater than it can handle to the maximum size that it can handle. This allows the application to access as much of the file as possible given the 32 bit offset restriction. It eliminates the strange semantics of the file effectively disappearing after it has been accessed, but does introduce other problems: the client will not be able to access the entire file.
Currently, the first solution is the recommended solution. However, client implementors are encouraged to do the best that they can to reduce the effects of this situation.
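The second approach amounts to a clamp along these lines (a C sketch; some clients would clamp at 2^31 - 1 instead if their local offsets are signed):
#include <stdint.h>

/* Map a 64-bit NFS size3 onto a 32-bit client's view: anything beyond
 * 2^32 - 1 is reported as the maximum representable size, so the
 * client can reach at least the first 4 GB of the file. */
static uint32_t clamp_size_for_32bit_client(uint64_t server_size)
{
    return server_size > UINT32_MAX ? UINT32_MAX : (uint32_t)server_size;
}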
/*
* Remote file service routines
*/
program NFS_PROGRAM {
version NFS_V3 {
void NFSPROC3_NULL(void) = 0;
GETATTR3res NFSPROC3_GETATTR(GETATTR3args) = 1;
SETATTR3res NFSPROC3_SETATTR(SETATTR3args) = 2;
LOOKUP3res NFSPROC3_LOOKUP(LOOKUP3args) = 3;
ACCESS3res NFSPROC3_ACCESS(ACCESS3args) = 4;
READLINK3res NFSPROC3_READLINK(READLINK3args) = 5;
READ3res NFSPROC3_READ(READ3args) = 6;
WRITE3res NFSPROC3_WRITE(WRITE3args) = 7;
CREATE3res NFSPROC3_CREATE(CREATE3args) = 8;
MKDIR3res NFSPROC3_MKDIR(MKDIR3args) = 9;
SYMLINK3res NFSPROC3_SYMLINK(SYMLINK3args) = 10;
MKNOD3res NFSPROC3_MKNOD(MKNOD3args) = 11;
REMOVE3res NFSPROC3_REMOVE(REMOVE3args) = 12;
RMDIR3res NFSPROC3_RMDIR(RMDIR3args) = 13;
RENAME3res NFSPROC3_RENAME(RENAME3args) = 14;
LINK3res NFSPROC3_LINK(LINK3args) = 15;
READDIR3res NFSPROC3_READDIR(READDIR3args) = 16;
READDIRPLUS3res
NFSPROC3_READDIRPLUS(READDIRPLUS3args) = 17;
FSSTAT3res NFSPROC3_FSSTAT(FSSTAT3args) = 18;
FSINFO3res NFSPROC3_FSINFO(FSINFO3args) = 19;
PATHCONF3res NFSPROC3_PATHCONF(PATHCONF3args) = 20;
COMMIT3res NFSPROC3_COMMIT(COMMIT3args) = 21;
} = 3;
} = 100003;