Protocols for Interworking: XNFS, Version 3W
Copyright © 1998 The Open Group
XNFS: Protocol Specification, Version 2
This chapter specifies a protocol that Sun Microsystems, Inc. and
others are using.
It is derived from a document designated RFC 1094
by the ARPA Network Information Center (see References to RFCs).
Introduction
The Network File System (NFS) protocol provides transparent remote
access to shared file systems over local area networks.
The NFS protocol is designed to be machine, operating system,
network architecture and transport protocol-independent.
This independence is
achieved through the use of Remote Procedure Call (RPC) primitives
built on top of an External Data Representation (XDR).
Implementations exist for a variety of machines, from personal computers to
supercomputers.
The supporting mount protocol allows the server to hand out remote
access privileges to a restricted set of clients.
It performs the operating system-specific functions that allow a client to
attach remote directory trees to a local file system.
The supporting mount protocol (see Mount Protocol)
is used by a client to obtain access to a particular file system,
or a subset thereof.
The server will provide a "handle" which
the client can use to identify the file system in subsequent NFS
operations.
Typically, the client will use the handle to arrange for the
remote file system to appear to the user as part of the local file system.
Remote Procedure Call
The remote procedure call specification provides a procedure-oriented
interface to remote services.
Each server supplies a program that is a set of procedures.
NFS is one such "program".
The combination of host address, program number and procedure
number specifies one remote service procedure.
RPC does not depend
on services provided by specific protocols, so it can be used with
any underlying transport protocol
(see Remote Procedure Calls: Protocol Specification).
The RPC protocol is described in
Remote Procedure Calls: Protocol Specification.
External Data Representation
The External Data Representation (XDR) standard provides a common
way of representing a set of data types over a network.
The NFS Protocol Specification is written using the RPC data description
language.
For more information, see XDR Protocol Specification.
Implementations of XDR and RPC are available in the public domain,
but XNFS does not require their use.
Any software that
provides equivalent functionality can be used, and if the encoding
is exactly the same it can interoperate with other implementations
of XNFS.
Stateless Servers and Idempotency
The NFS protocol is stateless, in that a server need not
maintain any state about the clients which it serves.
It may in fact
store state to improve performance, but this state is not necessary
for correct operation.
This means that the protocol does not include any mechanisms for
managing server or client failure and restart.
However, NFS deals with
objects such as files and directories which inherently have state.
This apparent contradiction is resolved by introducing distributed
state and by making operations idempotent.
Distributed state arises when an NFS server passes information such
as a file handle or directory search cookie to a client.
The server promises, in effect, that when the client passes this
information back to the server at a later date, it will usually
still be valid and can be used to reconstruct the state needed
to perform the requested operation. If the server detects that
the state is invalid, it responds with an indication of the
problem. In some cases the client may pass the response to the calling
application. In other cases the client may take some
corrective action and retry the operation.
With a few exceptions, rebooting the server must not invalidate
distributed state information. One exception is that the state
associated with unstable writes (see NFSPROC3_WRITE)
may be invalidated when the server
reboots. Another exception is that the state associated with
temporary file systems, that is, those that are recreated from
scratch by the reboot, may be invalidated. This implies that
distributed state will usually refer to objects held on stable server
storage, though servers may employ caching techniques to accelerate the
interpretation of this state in the normal case when no reboot
has occurred.
An idempotent operation is one which can be repeated several times
without changing the results.
For example, a request to write 5 bytes
at offset 165 in a file is idempotent; a request to write 5 bytes
at the current end-of-file is not.
NFS employs idempotent operations wherever possible.
Certain operations are inherently not
idempotent, for example, deleting a file, so NFS server
implementations will normally include mechanisms to attempt to
detect duplicate requests and furnish the appropriate results.
Occasionally this strategy will fail and a client will receive an
unexpected error; NFS clients and their applications must be tolerant
of such occurrences.
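The difference between idempotent and non-idempotent requests can be illustrated with a minimal Python sketch. The `ToyServer` class, its in-memory file table, and the reply cache keyed on a client name and RPC transaction ID (`xid`) are assumptions made for illustration only; this specification does not prescribe how a server detects duplicate requests.

```python
class ToyServer:
    """In-memory stand-in for an NFS server (hypothetical, for illustration)."""

    def __init__(self):
        self.files = {"a.txt": bytearray(b"0123456789")}
        self.reply_cache = {}  # (client, xid) -> reply already sent

    def write_at(self, name, offset, data):
        # Idempotent: repeating the call leaves the file unchanged.
        buf = self.files[name]
        buf[offset:offset + len(data)] = data
        return "NFS_OK"

    def append(self, name, data):
        # Not idempotent: every repetition grows the file again.
        self.files[name].extend(data)
        return "NFS_OK"

    def remove(self, name):
        # Not idempotent: a repeated call finds nothing to remove.
        if name not in self.files:
            return "NFSERR_NOENT"
        del self.files[name]
        return "NFS_OK"

    def remove_once(self, client, xid, name):
        # A retransmission reuses the same xid, so replay the first reply.
        key = (client, xid)
        if key not in self.reply_cache:
            self.reply_cache[key] = self.remove(name)
        return self.reply_cache[key]


server = ToyServer()
server.write_at("a.txt", 2, b"XY")
server.write_at("a.txt", 2, b"XY")   # retransmitted request, same result
contents = bytes(server.files["a.txt"])

first_reply = server.remove_once("client1", 7, "a.txt")
retry_reply = server.remove_once("client1", 7, "a.txt")  # served from the cache
uncached_reply = server.remove("a.txt")  # what a retry would see without the cache
```

A cache of this kind is necessarily bounded in a real server, which is one way the strategy can occasionally fail and surface an unexpected error to the client.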
XNFS Protocol Definition
Servers can change over time, and so can the protocol that they use.
RPC therefore provides a version number with each RPC request.
This chapter describes version 2 of the NFS protocol.
It contains procedures and parameters which are unused (obsolete)
but which are retained for compatibility purposes.
NFS server
implementations should be prepared to handle these appropriately.
File System Model
NFS assumes a file system that is hierarchical, with directories as
all but the bottom-level files.
Each entry in a directory (file, directory, device, and so on)
has a string name.
Different operating
systems may have restrictions on the depth of the tree or the names
used, as well as using different syntax to represent
the "pathname", which is the concatenation of all the "components"
(directory and filenames) in the name.
A "file system" is a tree
on a single server (usually a single disk or physical partition)
with a specified "root".
Some operating systems provide a "mount"
operation to make all file systems appear as a single tree, while
others maintain a "forest" of file systems.
Ordinary files are unstructured streams of uninterpreted bytes.
NFS looks up one component of a pathname at a time.
It may not be obvious why it does not just take the whole pathname,
travel down the directories, and return a file handle when it is done.
There are several good reasons not to do this.
First, pathnames need
separators between the directory components, and different
operating systems use different separators.
A Network Standard Pathname Representation could be defined,
but then every pathname
would have to be parsed and converted at each end.
Other issues are discussed in XNFS Implementation Issues.
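The component-at-a-time policy can be sketched as follows. The `walk` helper, the `TREE` table and `toy_lookup` are hypothetical stand-ins for a client and for the server's NFSPROC_LOOKUP procedure; the point is only that the client splits the path using its own local syntax, so the server never sees a separator.

```python
def walk(lookup, root_fh, components):
    """Resolve a path one component at a time, as an NFS client would."""
    fh = root_fh
    for name in components:
        fh = lookup(fh, name)  # one NFSPROC_LOOKUP call per component
    return fh


# Toy server state: directory handle -> {component name: handle}
TREE = {"root": {"usr": "fh1"}, "fh1": {"lib": "fh2"}, "fh2": {}}


def toy_lookup(dir_fh, name):
    """Hypothetical stand-in for the remote NFSPROC_LOOKUP procedure."""
    return TREE[dir_fh][name]


# Two clients with different local pathname syntax reach the same file.
slash_client = "usr/lib".split("/")
backslash_client = "usr\\lib".split("\\")
fh_a = walk(toy_lookup, "root", slash_client)
fh_b = walk(toy_lookup, "root", backslash_client)
```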
An exception to the single component lookup policy
can be made in the case of a multi-component lookup
relative to a public filehandle (see WebNFS Extensions).
In this case the pathname is required to be slash (/)
separated and evaluated by the server. The server
must evaluate any symbolic links that occur in
intermediate components of the path, but not a link
that occurs as the final component.
Although files and directories are similar objects in many ways,
different procedures are used to read directories and files.
This enforces a common network representation of directory contents and
places the XDR encoding of this information directly in the NFS protocol,
rather than overloading the interpretation of file access operations.
It also enforces an access model in which it is impossible to retrieve
partial directory information or to start a directory search at
an invalid point.
The same argument as above could have been used to justify a
procedure that returns only one directory entry per call.
However, directories can contain many entries, and a
remote call to return each would lead to unacceptable performance.
Symbolic Links
The NFS file system model includes the concept of symbolic links,
in which a directory entry is associated with a
piece of text instead of a file or directory.
An NFS client which encounters a symbolic link while
processing a path will normally issue an NFSPROC_READLINK
to retrieve the text, and will then
treat this as a path and look up the components to locate the
actual file or directory.
An NFS server need not implement symbolic links;
if it does not, it must be prepared to return a
PROC_UNAVAIL error if a client invokes NFSPROC_READLINK or
NFSPROC_SYMLINK.
Similarly, an NFS client should only issue an
NFSPROC_READLINK if an NFSPROC_LOOKUP
returns an entry typed as NFLNK,
and should be prepared to handle failures of any symbolic link operations.
RPC Information
Authentication
The NFS service uses AUTH_UNIX
style authentication, except in the NULL procedure where
AUTH_NONE is also permitted.
Transport Protocols
Current implementations of NFS are supported over UDP/IP only.
Port Number
The NFS protocol uses UDP port number 2049 (decimal).
Since this is not
an officially assigned port, it is possible that it may change in
the future.
For maximum interoperability it is recommended (but not required)
that NFS servers use UDP port 2049 if possible, and that NFS clients
use the portmap mechanism to locate the NFS program on a server.
WebNFS servers must use UDP and TCP port 2049.
Sizes of XDR Structures
These are the sizes, given in decimal bytes, of various XDR
structures used in the protocol:
/*
 * The maximum number of bytes of data in a READ or
 * WRITE request.
 */
const NFS_MAXDATA = 8192;

/* The maximum number of bytes in a pathname argument. */
const NFS_MAXPATHLEN = 1024;

/* The maximum number of bytes in a filename argument. */
const NFS_MAXNAMLEN = 255;

/*
 * The size in bytes of the opaque "cookie" passed by
 * READDIR.
 */
const NFS_COOKIESIZE = 4;

/* The size in bytes of the opaque file handle. */
const NFS_FHSIZE = 32;
Basic Data Types
The following XDR definitions are basic structures and types used
in other structures described later.
stat
enum stat {
    NFS_OK = 0,
    NFSERR_PERM = 1,
    NFSERR_NOENT = 2,
    NFSERR_IO = 5,
    NFSERR_NXIO = 6,
    NFSERR_ACCES = 13,
    NFSERR_EXIST = 17,
    NFSERR_NODEV = 19,
    NFSERR_NOTDIR = 20,
    NFSERR_ISDIR = 21,
    NFSERR_FBIG = 27,
    NFSERR_NOSPC = 28,
    NFSERR_ROFS = 30,
    NFSERR_NAMETOOLONG = 63,
    NFSERR_NOTEMPTY = 66,
    NFSERR_DQUOT = 69,
    NFSERR_STALE = 70
};
The stat type is returned with every procedure's results.
A value of NFS_OK indicates that the call completed successfully and
the results are valid.
The other values indicate that some kind of
error occurred on the server side during the servicing of the procedure.
- NFSERR_PERM
- Not owner.
The caller does not have the correct ownership
to perform the requested operation.
- NFSERR_NOENT
- No such file or directory.
The file or directory specified does not exist.
- NFSERR_IO
- Some sort of hard error occurred when the operation was in progress.
This could be a disk error, for example.
- NFSERR_NXIO
- No such device or address.
- NFSERR_ACCES
- Permission denied.
The caller does not have the correct permission to perform the
requested operation.
- NFSERR_EXIST
- File exists.
The file specified already exists.
- NFSERR_NODEV
- No such device.
- NFSERR_NOTDIR
- Not a directory.
The caller specified a non-directory in a directory operation.
- NFSERR_ISDIR
- Is a directory.
The caller specified a directory in a non-directory operation.
- NFSERR_FBIG
- File too large.
The operation caused a file to grow beyond the server's limit.
- NFSERR_NOSPC
- No space left on device.
The operation caused the server's file system to reach its limit.
- NFSERR_ROFS
- Read-only file system.
Write attempted on a read-only file system.
- NFSERR_NAMETOOLONG
- File name too long.
The filename in an operation was too long.
- NFSERR_NOTEMPTY
- Directory not empty.
Attempted to remove a directory that was not empty.
- NFSERR_DQUOT
- Disk quota exceeded.
The client's disk quota on the server has been exceeded.
- NFSERR_STALE
- The fhandle given in the arguments was invalid.
That is, the file referred to by that file handle no longer exists,
or access to it has been revoked.
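A client typically needs to turn the numeric status into something it can report. The following minimal sketch mirrors the enumeration above as a Python `IntEnum`; the `Stat` class name and the `describe` helper are illustrative assumptions, not part of the protocol.

```python
from enum import IntEnum


class Stat(IntEnum):
    """Status codes from the XDR 'stat' enumeration."""
    NFS_OK = 0
    NFSERR_PERM = 1
    NFSERR_NOENT = 2
    NFSERR_IO = 5
    NFSERR_NXIO = 6
    NFSERR_ACCES = 13
    NFSERR_EXIST = 17
    NFSERR_NODEV = 19
    NFSERR_NOTDIR = 20
    NFSERR_ISDIR = 21
    NFSERR_FBIG = 27
    NFSERR_NOSPC = 28
    NFSERR_ROFS = 30
    NFSERR_NAMETOOLONG = 63
    NFSERR_NOTEMPTY = 66
    NFSERR_DQUOT = 69
    NFSERR_STALE = 70


def describe(code):
    """Return the symbolic name of a status, tolerating unknown values."""
    try:
        return Stat(code).name
    except ValueError:
        return f"unknown status {code}"
```

Tolerating unknown values is a defensive choice: a client may meet a server that returns a code outside this list.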
ftype
enum ftype {
    NFNON = 0,
    NFREG = 1,
    NFDIR = 2,
    NFBLK = 3,
    NFCHR = 4,
    NFLNK = 5
};
The enumeration ftype gives the type of a file.
The type NFNON indicates a non-file, NFREG is a regular file,
NFDIR is a directory, NFBLK is a block-special device,
NFCHR is a character-special device, and NFLNK
is a symbolic link.
nfscookie
typedef opaque nfscookie[NFS_COOKIESIZE];
The nfscookie
is an opaque value that identifies a particular piece of data, such as a
directory entry in the NFSPROC_READDIR call.
fhandle
typedef opaque fhandle[NFS_FHSIZE];
The fhandle
is the file handle passed between the server and the client.
All file operations are done using file handles to refer to a file or
directory.
The file handle can contain whatever information the server
needs to distinguish an individual file.
A filehandle that consists of 32 zero bytes is called the
public filehandle. It is used by WebNFS clients
to identify an associated public directory on the server. See
WebNFS Extensions for further information.
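Because fhandle is a fixed-length opaque array, XDR encodes it as exactly NFS_FHSIZE bytes with no length prefix, and since 32 is a multiple of four no padding is needed. A minimal sketch, in which the helper names `encode_fhandle` and `is_public` are assumptions for illustration:

```python
NFS_FHSIZE = 32

# The public filehandle is 32 zero bytes.
PUBLIC_FH = bytes(NFS_FHSIZE)


def encode_fhandle(fh):
    """XDR-encode a file handle: fixed-length opaque, no length prefix."""
    if len(fh) != NFS_FHSIZE:
        raise ValueError(f"file handle must be {NFS_FHSIZE} bytes")
    return bytes(fh)


def is_public(fh):
    """True if the handle is the WebNFS public filehandle."""
    return bytes(fh) == PUBLIC_FH
```

A client treats every other handle as opaque and only ever returns to the server the bytes it was given.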
timeval
struct timeval {
    unsigned int seconds;
    unsigned int useconds;
};
The timeval
structure is the number of seconds and microseconds
since midnight January 1, 1970, Greenwich Mean Time.
It is used to pass time and date information.
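On the wire, each XDR unsigned int is four bytes in big-endian order, so a timeval occupies eight bytes. A minimal sketch using Python's standard `struct` module; the function names are illustrative assumptions:

```python
import struct

# Two big-endian 32-bit unsigned integers: seconds, then microseconds.
_TIMEVAL = struct.Struct(">II")


def encode_timeval(seconds, useconds):
    """XDR-encode a timeval as 8 bytes."""
    return _TIMEVAL.pack(seconds, useconds)


def decode_timeval(data):
    """Decode the first 8 bytes of data into (seconds, useconds)."""
    return _TIMEVAL.unpack_from(data, 0)


wire = encode_timeval(1, 500000)
```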
diropok
struct diropok {
    fhandle file;
    fattr attributes;
};
The diropok structure is used by the server
to return the file handle and attributes of a file after a successful
NFSPROC_LOOKUP, NFSPROC_CREATE or NFSPROC_MKDIR operation.
fattr
struct fattr {
    ftype type;
    unsigned int mode;
    unsigned int nlink;
    unsigned int uid;
    unsigned int gid;
    unsigned int size;
    unsigned int blocksize;
    unsigned int rdev;
    unsigned int blocks;
    unsigned int fsid;
    unsigned int fileid;
    timeval atime;
    timeval mtime;
    timeval ctime;
};
The fattr
structure contains the attributes of a file:
type is the type of the file;
nlink is the number of hard links to the file (the number of different names for the same file);
uid is the user identification number of the owner of the file;
gid is the group identification number of the group of the file;
size is the size in bytes of the file;
blocksize is the preferred block size in bytes for the file;
rdev is the device number of the file if it is type NFCHR or NFBLK;
blocks is the number of 512-byte blocks the file takes up on the server;
fsid is the file system identifier for the file system containing the file;
fileid is a number that uniquely identifies the file within its file system;
atime is the time when the file was last accessed for either read or write;
mtime is the time when the file data was last modified (written); and
ctime is the time when the status of the file was last changed.
Writing to the file also changes
ctime if the size of the file changes.
mode
is the access mode encoded as a set of bits.
Notice that the file type
is specified both in the mode bits and in the file type; the server must
ensure they are consistent.
The descriptions given below specify the bit positions using octal
numbers.
Bit | Description
---|---
0040000 | This is a directory; type field must be NFDIR.
0020000 | This is a character special file; type field must be NFCHR.
0060000 | This is a block special file; type field must be NFBLK.
0100000 | This is a regular file; type field must be NFREG.
0120000 | This is a symbolic link file; type field must be NFLNK.
0140000 | This is a named socket; type field must be NFNON.
0004000 | Set user ID on execution.
0002000 | Set group ID on execution.
0001000 | Not used.
0000400 | Read permission for owner.
0000200 | Write permission for owner.
0000100 | Execute and search permission for owner.
0000040 | Read permission for group.
0000020 | Write permission for group.
0000010 | Execute and search permission for group.
0000004 | Read permission for others.
0000002 | Write permission for others.
0000001 | Execute and search permission for others.
Notes:
- The bits correspond to the mode bits returned by the
stat() XSI system call, with the addition of the socket and symbolic
link combinations
which are supported by NFS and some operating systems.
- The rdev
field in the attributes structure is an operating system-specific
device specifier.
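The consistency requirement between the mode bits and the type field can be checked mechanically. The sketch below is illustrative: the mask `0o170000` for the file-type portion of the mode is an assumption borrowed from the conventional stat() layout rather than a value stated in the table, and the helper names are hypothetical.

```python
# Assumed mask selecting the file-type bits of mode (conventional S_IFMT).
TYPE_MASK = 0o170000

# File-type bit patterns from the table, mapped to the required ftype name.
MODE_TO_FTYPE = {
    0o040000: "NFDIR",
    0o020000: "NFCHR",
    0o060000: "NFBLK",
    0o100000: "NFREG",
    0o120000: "NFLNK",
    0o140000: "NFNON",  # named socket
}


def ftype_from_mode(mode):
    """Return the ftype name implied by the mode bits, or None if unknown."""
    return MODE_TO_FTYPE.get(mode & TYPE_MASK)


def is_consistent(ftype_name, mode):
    """True if the type field agrees with the type encoded in mode."""
    return ftype_from_mode(mode) == ftype_name


def permission_string(mode):
    """Render the nine permission bits as 'rwxrwxrwx' style text."""
    out = []
    for shift in (6, 3, 0):  # owner, group, others
        triplet = (mode >> shift) & 0o7
        out.append("r" if triplet & 4 else "-")
        out.append("w" if triplet & 2 else "-")
        out.append("x" if triplet & 1 else "-")
    return "".join(out)
```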
sattr
struct sattr {
    unsigned int mode;
    unsigned int uid;
    unsigned int gid;
    unsigned int size;
    timeval atime;
    timeval mtime;
};
The sattr structure contains the file attributes which can be set
from the client.
The fields are the same as for fattr above.
A value of 0xffffffff indicates a field that must be ignored.
A size of zero means the file must be truncated to zero length.
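A client that wants to change only one attribute fills every other field with the "ignore" value. The following minimal sketch encodes a sattr this way; the function name is hypothetical, and marking an ignored timeval by setting both of its words to 0xffffffff is an assumption of this example.

```python
import struct

IGNORE = 0xFFFFFFFF  # field value the server must ignore


def encode_sattr(mode=IGNORE, uid=IGNORE, gid=IGNORE, size=IGNORE,
                 atime=(IGNORE, IGNORE), mtime=(IGNORE, IGNORE)):
    """XDR-encode a sattr: four unsigned ints followed by two timevals."""
    return struct.pack(">8I", mode, uid, gid, size, *atime, *mtime)


# Truncate a file to zero length and leave everything else untouched.
truncate_args = encode_sattr(size=0)

# Change only the permission bits.
chmod_args = encode_sattr(mode=0o644)
```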
filename
typedef string filename<NFS_MAXNAMLEN>;
The type filename is used for passing filenames or pathname components.
A string length of zero is invalid.
Implementations and applications must be able to handle file names
as 8-bit transparent data (allowing use of arbitrary character set
encodings).
For maximum portability and interworking,
it is recommended that applications and users define file
names containing only the characters of the Portable Filename
Character Set defined in ISO/IEC 9945-1:1990.
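In XDR, a variable-length string is sent as a four-byte big-endian length followed by the bytes themselves, zero-padded to a multiple of four. The sketch below applies that encoding together with the length limits stated above; `encode_filename` is an illustrative name, and names are handled as raw bytes so that any 8-bit encoding passes through unchanged.

```python
import struct

NFS_MAXNAMLEN = 255


def encode_filename(name):
    """XDR-encode a filename given as bytes (8-bit transparent data)."""
    if not 0 < len(name) <= NFS_MAXNAMLEN:
        raise ValueError("filename length must be between 1 and 255 bytes")
    padding = (4 - len(name) % 4) % 4
    return struct.pack(">I", len(name)) + name + b"\x00" * padding


encoded = encode_filename(b"hello")
```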
path
typedef string path<NFS_MAXPATHLEN>;
The type path
is a pathname to be used in the symbolic link operations
NFSPROC_SYMLINK and NFSPROC_READLINK.
The server must consider it as a string with no internal structure.
A string length of zero is invalid.
For maximum portability and interworking, it is
recommended that applications and users define path names containing
only the slash character (if required) plus the characters of the
Portable Filename Character Set defined in ISO/IEC 9945-1:1990.
attrstat
union attrstat switch (stat status) {
    case NFS_OK:
        fattr attributes;
    default:
        void;
};
The attrstat structure is a common procedure result.
It contains a status
and, if the call succeeded, it also contains the
attributes of the file on which the operation was performed.
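Decoding this discriminated union amounts to reading the four-byte status and, only when it is NFS_OK, the 68 bytes of fattr that follow (one ftype, ten unsigned ints and three timevals, all four-byte big-endian words). A minimal sketch with hypothetical helper names:

```python
import struct

# ftype + 10 unsigned ints + 3 timevals of 2 words each = 17 words (68 bytes).
FATTR = struct.Struct(">17I")
SCALAR_FIELDS = ("type", "mode", "nlink", "uid", "gid", "size",
                 "blocksize", "rdev", "blocks", "fsid", "fileid")
NFS_OK = 0


def decode_attrstat(data):
    """Return (status, attributes); attributes is None unless status is NFS_OK."""
    (status,) = struct.unpack_from(">I", data, 0)
    if status != NFS_OK:
        return status, None  # default arm of the union carries no data
    words = FATTR.unpack_from(data, 4)
    attrs = dict(zip(SCALAR_FIELDS, words[:11]))
    attrs["atime"] = words[11:13]
    attrs["mtime"] = words[13:15]
    attrs["ctime"] = words[15:17]
    return status, attrs
```

The same pattern applies to diropres, where the NFS_OK arm is a file handle followed by a fattr.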
diropargs
struct diropargs {
    fhandle dir;
    filename name;
};
The diropargs structure is used in directory operations.
The fhandle dir is the directory in which to find the file
name.
A directory operation is one in which the directory is affected.
diropres
union diropres switch (stat status) {
    case NFS_OK:
        struct diropok diropok;
    default:
        void;
};
The results of a directory operation are returned in a
diropres structure.
If the call succeeded, a new file handle file
and the attributes associated with that file are returned
along with the status.
XNFS Implementation Issues
The NFS protocol is designed to be operating system-independent,
but since this version was designed in a UNIX environment, many
operations have semantics similar to the operations of the UNIX
file system.
This section discusses some of the implementation-specific semantic issues.
Server/Client Relationship
Every NFS client can also potentially be a server, and remote and
local mounted file systems can be freely intermixed.
This leads to
some interesting problems when a client travels down the directory
tree of a remote file system and reaches the mount point on the
server for another remote file system.
Allowing the server to
follow the second remote mount would require loop detection, server
lookup and user revalidation.
Instead, it was decided not to let
clients cross a server's mount point.
When a client does an NFSPROC_LOOKUP
on a directory on which the server has mounted a file system, the
client sees the underlying directory instead of the mounted
directory.
A client can do remote mounts that match the server's
mount points to maintain the server's view.
Permission Issues
The NFS protocol, strictly speaking, does not define the permission
checking used by servers.
However, it is expected that a server
will do normal operating system permission checking using AUTH_UNIX
style authentication as the basis of its protection mechanism.
The server gets the client's effective UID, effective GID and
groups on each call, and uses them to check permission.
There are various problems with this method that can be
resolved in interesting ways.
Using UID and GID implies that the client and server share the
same UID list.
Every server and client pair must have the same
mapping from user to UID and from group to GID.
Since every
client can also be a server, this tends to imply that the whole
network shares the same UID/GID space.
Another problem arises due to the usually stateful open operation.
Most operating systems check permission at open time, and then
check that the file is open on each read and write request.
With stateless servers, the server has no idea that the file is open and
must do permission checking on each read and write call.
On a local file system, a user can open a file and then change the
permissions so that no one is allowed to touch it, but will still
be able to write to the file because it is open.
On a remote file system, by contrast, the write would fail.
To get around this
problem, the server's permission checking algorithm should allow
the owner of a file to access it regardless of the permission
setting.
A similar problem has to do with paging in from a file over the
network.
The operating system usually checks for execute permission
before opening a file for demand paging, and then reads blocks from
the open file.
The file may not have read permission, but after it
is opened it doesn't matter.
An NFS server cannot tell the
difference between a normal file read and a demand page-in read.
To make this work, the server allows reading of files if the UID
given in the call has execute or read permission on the file.
In most operating systems, a particular user
has access to all files no matter what permission and
ownership they have. An NFS client request on behalf of such a user will
be made with the user ID of zero.
This "super-user" permission might not be
allowed on the server, since anyone who can gain that privilege on their
client system could gain access to all remote files.
An XNFS
server, by default, maps user ID 0 to -2
(0xfffffffe) before doing its access checking.
A server implementation may provide a mechanism to change this mapping.
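The permission rules discussed in this section can be combined into a small read-access check. This is only a sketch of one possible policy: the function names, the `squash_root` switch and the simplified owner/group/other selection are assumptions, and a real server would apply its own operating system's rules.

```python
NOBODY = 0xFFFFFFFE  # -2 viewed as an unsigned 32-bit user ID


def map_uid(uid, squash_root=True):
    """Map user ID 0 to NOBODY unless the mapping has been disabled."""
    return NOBODY if squash_root and uid == 0 else uid


def may_read(file_uid, file_gid, mode, uid, gids, squash_root=True):
    """Decide whether a READ request should be allowed."""
    uid = map_uid(uid, squash_root)
    if uid == file_uid:
        # The owner is always allowed, so that a file opened before a
        # permission change remains usable, as it would locally.
        return True
    shift = 3 if file_gid in gids else 0  # group bits, else "others" bits
    bits = (mode >> shift) & 0o7
    # Read (4) or execute (1) permission is enough, because the server
    # cannot distinguish a normal read from a demand page-in.
    return bool(bits & 0o5)
```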
Server Procedures
The protocol definition is given as a set of procedures with
arguments and results defined using the RPC language.
A brief
description of the function of each procedure should provide enough
information to allow implementation.
All of the procedures in the NFS protocol are synchronous.
When a procedure returns to the client,
the operation has completed and any data associated
with the request is now on stable storage.
For example, a client NFSPROC_WRITE
request will cause the server to update some or all of the following:
data blocks, file system information blocks (such as indirect blocks),
and file attribute information (size and modify times).
When the NFSPROC_WRITE
returns to the client, it can assume that the write is safe, even
in case of a server crash, and it can discard the data written.
This is a very important part of the statelessness of the server.
If the server waited to flush data from remote requests, the client
would have to save those requests so that it could resend them in
case of a server crash.
/*
 * Remote file service routines
 */
program NFS_PROGRAM {
    version NFS_VERSION {
        void        NFSPROC_NULL(void)             = 0;
        attrstat    NFSPROC_GETATTR(fhandle)       = 1;
        attrstat    NFSPROC_SETATTR(sattrargs)     = 2;
        void        NFSPROC_ROOT(void)             = 3;
        diropres    NFSPROC_LOOKUP(diropargs)      = 4;
        readlinkres NFSPROC_READLINK(fhandle)      = 5;
        readres     NFSPROC_READ(readargs)         = 6;
        void        NFSPROC_WRITECACHE(void)       = 7;
        attrstat    NFSPROC_WRITE(writeargs)       = 8;
        diropres    NFSPROC_CREATE(createargs)     = 9;
        stat        NFSPROC_REMOVE(diropargs)      = 10;
        stat        NFSPROC_RENAME(renameargs)     = 11;
        stat        NFSPROC_LINK(linkargs)         = 12;
        stat        NFSPROC_SYMLINK(symlinkargs)   = 13;
        diropres    NFSPROC_MKDIR(createargs)      = 14;
        stat        NFSPROC_RMDIR(diropargs)       = 15;
        readdirres  NFSPROC_READDIR(readdirargs)   = 16;
        statfsres   NFSPROC_STATFS(fhandle)        = 17;
    } = 2;
} = 100003;
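The program, version and procedure numbers above are what a client places in the header of each RPC call. The sketch below builds the call header for NFSPROC_NULL using AUTH_NONE credentials, following the ONC RPC version 2 message layout (transaction ID, message type, RPC version, program, version, procedure, credential, verifier). The helper name `null_call` is an assumption, and only the NULL procedure is shown because the other procedures expect AUTH_UNIX credentials.

```python
import struct

NFS_PROGRAM = 100003
NFS_VERSION = 2
NFSPROC_NULL = 0

RPC_CALL = 0      # msg_type value for a call
RPC_VERSION = 2   # version of the RPC protocol itself
AUTH_NONE = 0     # authentication flavor with an empty body


def null_call(xid):
    """Build the 40-byte RPC call message for NFSPROC_NULL."""
    return struct.pack(
        ">10I",
        xid, RPC_CALL, RPC_VERSION,
        NFS_PROGRAM, NFS_VERSION, NFSPROC_NULL,
        AUTH_NONE, 0,   # credential: flavor, body length
        AUTH_NONE, 0,   # verifier: flavor, body length
    )


message = null_call(xid=1)
```

Since NFSPROC_NULL takes no arguments, the header is the whole request; for other procedures the XDR-encoded arguments would follow it.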
The following reference pages define each of the server procedures.