Implementation Notes
This chapter describes issues and hints that may be of use to the implementor of the DMAPI. They are
presented in a separate chapter because, while relevant to the DMAPI, they may not be applicable to
all DMAPI implementations.
Event Encoding
As discussed in the Non-opaque Data Management Attributes section, event bit masks may exist that
allow well-known combinations of events to be represented with a small number of fixed bit patterns. These bit
patterns are implementation defined, are not guaranteed to be the same from platform to platform, and in
fact are not visible through any of the DMAPI interfaces.
A single bit might allow the following event list encoding:
- 0: The file has no event lists.
- 1: The file has an event list but there is not enough persistent storage available to encode the events.
By increasing the number of bits, well known combinations of events can be stored persistently with the
file. For example, with 2 bits, the following event list encoding may exist:
- 00: The file has no event lists.
- 01: Read events will be generated for this file.
- 10: Write events will be generated for this file.
- 11: The file has an event list but there is not enough persistent storage available to encode the events.
For example, if the file has both read and write events set, that combination cannot be encoded in two bits.
In this case the debut event would need to be generated in order to load the event list.
The encoding of events is not defined by the DMAPI. It is expected that different DMAPI
implementations would encode events differently, depending on those data management applications
which would be supported.
- Note: Due to the requirements placed on the implementation of the DMAPI, support for persistent non-opaque attributes is a DMAPI implementation option.
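For illustration only, the following sketch shows one way an implementation might map a file's event
list onto the 2-bit encoding described above. The names (pers_code, file_events and so on) are invented
for this example and are not defined by the DMAPI.

#include <stdbool.h>

/* Hypothetical persistent codes corresponding to the 2-bit table above. */
enum pers_code {
    PERS_NONE     = 0,  /* 00: no event lists                          */
    PERS_READ     = 1,  /* 01: read events generated                   */
    PERS_WRITE    = 2,  /* 10: write events generated                  */
    PERS_OVERFLOW = 3   /* 11: list exists but cannot be encoded       */
};

/* Hypothetical in-core view of a file's event list. */
struct file_events {
    bool want_read;
    bool want_write;
};

static enum pers_code
encode_events(const struct file_events *ev)
{
    if (ev->want_read && ev->want_write)
        return PERS_OVERFLOW;   /* debut event needed later to reload list */
    if (ev->want_read)
        return PERS_READ;
    if (ev->want_write)
        return PERS_WRITE;
    return PERS_NONE;
}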
Event Ordering
Each implementation of the DMAPI may send a number of different events to a DM application as a
result of a single system call. It may be helpful, but is not required, for DM application writers to know in
advance the sequence of events for each system call. Unfortunately, determining this list is a non-trivial
exercise. It is suggested that the DM application writer should work with the implementor of the DMAPI in
order to determine the sequence and ordering of events that occurs for system calls of interest.
If the application is multi-threaded, then the DM application writer's job is even more difficult. Synchronization
primitives are necessary so that one process is not left hanging, waiting for an event that has already been
serviced by another process.
Lock Releasing
Some DMAPI implementations may have special lock restrictions. Some may be unable to upgrade an
access right from DM_RIGHT_SHARED to DM_RIGHT_EXCL without sleeping, while others may have
special primitives that allow them to grant whatever right the DM application
requires. Some DMAPI
implementations may also have special requirements with regard to releasing locks. During the servicing
of events, it may not always be possible to relinquish a right. DM application
writers should not assume that it
is always possible to release a right (via dm_release_right()).
The careful DM application writer will
always check return codes from all DMAPI functions.
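As a minimal sketch of this defensive style, assuming sid, hanp, hlen and token were obtained earlier
(for example from a received event message), a DM application might release a right as follows and
cope with the possibility that the release is refused:

#include <dmapi.h>      /* header name varies between implementations */

static void
drop_right_if_possible(dm_sessid_t sid, void *hanp, size_t hlen,
                       dm_token_t token)
{
    if (dm_release_right(sid, hanp, hlen, token) == -1) {
        /*
         * The implementation could not relinquish the right at this
         * point (for example, while an event is being serviced).
         * Continue holding it; do not assume the call succeeded.
         */
    }
}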
Tokens, Messages and Handles
Tokens reference access rights for handles, and must always be associated with a message. An
implementor may wish to view a token as simply a message ID. When viewed this way, the token is
similar to a file descriptor, in that it just references state maintained in the kernel. Given the one-to-one
correspondence of tokens and messages, one can think of them (within the kernel) as separate objects or
as a single object, independent of how they are actually implemented.
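Purely as an illustration of this view, kernel state might resemble the following; the structure and its
names are invented for this sketch and are not part of the DMAPI:

#include <dmapi.h>      /* header name varies between implementations */

/* One hypothetical realisation of the token/message correspondence. */
struct kmsg {
    dm_token_t   km_token;    /* token handed back to the DM application */
    void        *km_event;    /* the queued event message                */
    void        *km_rights;   /* access rights referenced by the token   */
};

/*
 * Like a file descriptor, the token itself carries no state; it merely
 * selects an entry in a kernel-maintained table of messages.
 */
static struct kmsg *msg_table;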
mmap
There are many operating systems that use mmap(2) as a mechanism for utilities to copy file data.
Unfortunately, this can make implementation of the DMAPI difficult, since a page fault often occurs at a
much later point in time than when the file is actually mapped. Usually, there are also very stringent
locking restrictions in place at the time the page fault occurs.
To get around these problems, some systems may have to adopt the paradigm that at the time the file is
mapped (that is, when the actual mmap(2) call is executed), a non-resident file must be made resident.
This is sub-optimal from a disk space utilization standpoint, and it also means that a large
file that is
mmap'd
cannot be partially resident. However, the alternative is to place DMAPI call-outs
deep in the VM subsystem, which is usually a very complex
exercise and is not recommended unless considerable expertise is available.
Invisible I/O
Invisible read and write can also place special burdens on the implementor of the DMAPI. Invisible I/O
is typically used by a DM application to reload data for a non-resident file. This operation should not
modify any of a file's time stamps, nor should it cause events to be generated.
In systems where pages are encached on the vnode, this can lead to
troublesome locking and coherency issues.
How is the actual write to disk performed? Is the page cache bypassed? How do you ensure that
encached pages are dealt with correctly?
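From the DM application side, restoring a non-resident region typically reduces to a single invisible
write. The sketch below assumes sid, hanp, hlen and token come from the event being serviced and that
buf already holds the data fetched from tertiary storage; the implementation-side caching questions
above are hidden behind the call.

#include <dmapi.h>      /* header name varies between implementations */

static ssize_t
restore_region(dm_sessid_t sid, void *hanp, size_t hlen, dm_token_t token,
               dm_off_t off, dm_size_t len, void *buf)
{
    /*
     * Flags are 0 (no DM_WRITE_SYNC): the implementation decides when the
     * data reaches disk, which is exactly where the page-cache and
     * coherency questions above must be answered.  No timestamps are
     * updated and no events are generated by this call.
     */
    return dm_write_invis(sid, hanp, hlen, token, 0, off, len, buf);
}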
Generation of Events
The placement of code that actually generates events will differ from platform to platform. The point at
which an event is actually generated will also differ between systems, and should not concern the DM
application writer. The DMAPI implementor will need to take into consideration lock state, the likelihood of
the operation succeeding, and so forth, to determine the best location for the actual callout to occur.
In the case of managed region events, the DMAPI implementor must ensure that a poorly-behaved DM
application does not cause the system to behave in an unexpected manner. For example, if a file has
multiple managed regions that represent non-resident data, and a DM application only restores the data
for one of those regions, the system must be sure that operations such as
read-ahead
do not cause
multiple events to occur. For further information, see the Managed Regions section.
Locking Across Operations
To ensure consistency, some DM applications may wish to enforce their own locking scheme across
operations. They may want to develop a wrapper around some operations in order to synchronize and/or
serialize accesses to files.
Tokens and Multiple Handles
Tokens may reference access rights for more than one
file handle. This makes certain operations easier, such as obtaining access rights to a list of file handles;
the same token can be reused without incurring the overhead of
dm_create_userevent()
to
construct a new token for each file handle. However, allowing a single token to reference multiple
handles can make recovery more difficult. How does a DM application determine which file handles
have access rights referenced by a single token?
During recovery, a DM application can always execute
dm_respond_event(),
give it the offending
token, and let the DMAPI release any and all access rights associated with the token. If the DM
application needs to be
selective about which file handles have their access rights released, then it (the DM
application) must provide some mechanism external to the DMAPI to log which file handles are
associated with a token. The DMAPI does not provide interfaces to identify multiple file handles from a
single token.
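A minimal recovery sketch, assuming sid and the orphaned token have been recovered from the session
(for example via dm_getall_tokens()), might simply abort the event and let the implementation release
every right the token references; the choice of error code returned to the blocked process is up to the
DM application.

#include <errno.h>
#include <stddef.h>
#include <dmapi.h>      /* header name varies between implementations */

static void
abandon_token(dm_sessid_t sid, dm_token_t token)
{
    /*
     * Abort the original operation; all access rights held under the
     * token are released by the DMAPI implementation, regardless of how
     * many file handles they cover.
     */
    (void)dm_respond_event(sid, token, DM_RESP_ABORT, EINTR, 0, NULL);
}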
Structure Lists
Several DMAPI functions return lists of structures. Some of these functions return lists of
variable-length structures.
Since the length of the structure is not known, DMAPI implementations must provide
a mechanism for the DM application to access the various members of the list.
The DMAPI specifies that DM applications should use the DM_STEP_TO_NEXT
macro to access
variable length structures that are in a list. However, the actual implementation of this macro is not
defined. One suggestion is that each variable-length structure should
have a field in a well-known position (say
offset zero) or use a special field name that is opaque to the DM application. For example, using the field name
approach, the definition of the
dm_eventmsg_t
would be:
struct dm_eventmsg {
    ssize_t        _link;
    dm_type_t      ev_type;
    dm_token_t     ev_token;
    dm_vardata_t   ev_data;
};
The definition of the DM_STEP_TO_NEXT macro would then become:
#define DM_STEP_TO_NEXT(p, type) \
((type)((p)->_link ? (char *)(p) + (p)->_link : NULL))
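With the definitions above, a DM application could walk a batch of event messages returned by
dm_get_events() as in the following sketch (buffer sizing and error handling are omitted):

#include <dmapi.h>      /* header name varies between implementations */

static void
drain_events(dm_sessid_t sid)
{
    char            buf[4096];
    size_t          rlen;
    dm_eventmsg_t  *msg;

    /* Fetch at most 16 messages, waiting if none are queued. */
    if (dm_get_events(sid, 16, DM_EV_WAIT, sizeof(buf), buf, &rlen) != 0)
        return;

    /* Step through the variable-length messages in the buffer. */
    for (msg = (dm_eventmsg_t *)buf; msg != NULL;
         msg = DM_STEP_TO_NEXT(msg, dm_eventmsg_t *)) {
        /* dispatch on msg->ev_type and msg->ev_token here */
    }
}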
Undeliverable Event Messages
The implementation of the DMAPI needs to specify the guidelines for delivery of a synchronous event
message when no session exists to receive it. There are three choices with regard to synchronous event
message delivery when no session exists:
- block the requesting process.
- return an error to the process that instigated the event.
- do not generate the event.
If an error is returned to the process, it may be specific to the operation that caused the event. This means
that depending on the operation, two different errors can be returned for the same event type.
The implementation must also define the behavior
for asynchronous events.
dm_vardata_t
One possible implementation of the
dm_vardata_t
structure is:
struct dm_vardata {
    ssize_t  vd_offset;
    size_t   vd_length;
};
typedef struct dm_vardata dm_vardata_t;
where the offset field (vd_offset) in
dm_vardata_t
records the distance in bytes from the
beginning of the structure at which the variable-length data begins.
In the case of an event message, it
records the beginning of the actual event-specific data.
For other structures, such as
dm_stat_t,
it
indicates where the handle data can be found. The definition of the two access macros would then be:
#define DM_GET_VALUE(p, field, type) \
    ((type)((char *)(p) + (p)->field.vd_offset))
#define DM_GET_LEN(p, field) ((p)->field.vd_length)
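Using these macros, a DM application servicing a data event (a read, write or truncate event obtained
from dm_get_events()) might extract the embedded file handle as in the following sketch:

#include <dmapi.h>      /* header name varies between implementations */

static void
handle_data_event(dm_eventmsg_t *msg)
{
    dm_data_event_t *de;
    void            *hanp;
    size_t           hlen;

    /* The event-specific data of a data event is a dm_data_event_t... */
    de   = DM_GET_VALUE(msg, ev_data, dm_data_event_t *);
    /* ...and its de_handle vardata locates the file handle. */
    hanp = DM_GET_VALUE(de, de_handle, void *);
    hlen = DM_GET_LEN(de, de_handle);

    /* hanp/hlen now identify the file to be restored or examined. */
    (void)hanp;
    (void)hlen;
}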
NFS Daemon Starvation
Special consideration needs to be taken when using DMAPI on a file system that is exported via NFS.
Because a migrate-in operation can potentially take several seconds or even minutes, a large number of
NFS client requests to files that are staged out could lead to NFS daemon starvation. Each NFS daemon
could be waiting for a DMAPI operation to complete, with no free daemon threads left to accept new
requests.
A possible solution to this problem is to devise a method for the file system to notify the NFS daemon
when it detects that an operation will take an unusually long time. The NFS daemon could then fork a
separate thread to wait for the migration to complete, and also send an EJUKEBOX notification to the
NFS client.
The dm_pending() interface allows the DM application to notify the DMAPI implementation that an
operation is expected to be slow. The implementation may then take appropriate steps to notify NFS.
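For example, a DM application servicing a read event that will require a lengthy stage-in might use
dm_pending() roughly as follows; the one-minute estimate and the dm_timestruct_t field names shown are
assumptions of this sketch.

#include <dmapi.h>      /* header name varies between implementations */

static void
note_slow_restore(dm_sessid_t sid, dm_token_t token)
{
    dm_timestruct_t delay;

    delay.dt_sec  = 60;     /* rough estimate of the stage-in time */
    delay.dt_nsec = 0;
    (void)dm_pending(sid, token, &delay);

    /* ... perform the long migrate-in, then dm_respond_event() ... */
}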
Unmount and Shutdown Deadlock
Unmount of a DMAPI-managed file system can take a long time. This means the shutdown script needs to
wait for the DM applications to finish with unmount activities before killing off processes. If shutdown
kills the DM application process threads without first giving the DM application a chance to clean up its
session IDs, then the unmount event will be posted to an orphan session and can never be responded to;
unmount will then hang. However, if shutdown does not kill the non-DM application process threads, the
file system may look busy forever, and unmount will likewise hang. shutdown therefore needs a way to
kill all processes except the DM application processes.
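One way a DM application can cooperate with such a scheme is to clean up its session when asked to
terminate, as in the sketch below. The signal choice and the global session ID are assumptions of this
example, and dm_destroy_session() will fail while events are still outstanding, so a real application
would respond to them first.

#include <signal.h>
#include <unistd.h>
#include <dmapi.h>      /* header name varies between implementations */

static dm_sessid_t sid;         /* established with dm_create_session() */

static void
on_term(int signo)
{
    (void)signo;
    if (dm_destroy_session(sid) == 0)
        _exit(0);
    /*
     * Outstanding events remain: a real application would respond to
     * them and retry, rather than leave the session to become an orphan.
     */
}

/* In main(): signal(SIGTERM, on_term); */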
The dt_change Field in dm_stat
The suggested implementation is to keep an in-kernel counter that is incremented every time it is read.
In this way, dt_change continues to advance properly even for in-core inodes that have been released.
Punching Holes
If a call to
dm_punch_hole()
frees media resources, the DMAPI implementation should indicate
these freed resources in subsequent calls to
dm_get_allocinfo(),
by describing the freed extent with the
DM_EXTENT_HOLE flag.
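As a usage sketch (buffer sizing and error handling abbreviated, and with sid, hanp, hlen and token
assumed valid), a DM application could verify this behaviour as follows:

#include <sys/types.h>
#include <dmapi.h>      /* header name varies between implementations */

static void
punch_and_verify(dm_sessid_t sid, void *hanp, size_t hlen, dm_token_t token,
                 dm_off_t off, dm_size_t len)
{
    dm_off_t    scan = 0;
    dm_extent_t ext[16];
    u_int       nret, i;

    if (dm_punch_hole(sid, hanp, hlen, token, off, len) != 0)
        return;
    if (dm_get_allocinfo(sid, hanp, hlen, token, &scan, 16, ext, &nret) < 0)
        return;
    for (i = 0; i < nret; i++) {
        if (ext[i].ex_type == DM_EXTENT_HOLE) {
            /* the freed space is reported here as a hole */
        }
    }
}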