Systems Management: Data Storage Management (XDSM) API
Copyright © 1997 The Open Group

Implementation Notes

This Chapter describes issues and hints that may be of use to the implementor of the DMAPI. These hints are presented in a separate Chapter because, although they are relevant to the DMAPI, they may not be applicable to all DMAPI implementations.

Event Encoding

As discussed in the Non-opaque Data Management Attributes section (see Non-opaque Data Management Attributes), event bit masks may exist that allow specific combinations of events to be represented with a small number of fixed bit patterns. These bit patterns are implementation defined, are not guaranteed to be the same from platform to platform, and in fact are not visible through any of the DMAPI interfaces.

A single bit might allow the following event list encoding:

0
The file has no event lists.

1
The file has an event list but there is not enough persistent storage available to encode the events.

By increasing the number of bits, well known combinations of events can be stored persistently with the file. For example, with 2 bits, the following event list encoding may exist:

00
The file has no event lists.

01
Read events will be generated for this file.

10
Write events will be generated for this file.

11
The file has an event list but there is not enough persistent storage available to encode the events.

If the file has both read and write events set, this combination cannot be encoded in two bits. In this case the debut event would need to be generated in order to load the event list.
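
To make the scheme concrete, the following sketch shows one way the 2-bit encoding above could be implemented. All names here (EV_ENC_NONE, encode_event_list(), and so on) are purely illustrative; the DMAPI defines no such symbols.

/* Illustrative 2-bit persistent encoding of a file's event list.
 * These names are hypothetical and not part of the DMAPI. */
#define EV_ENC_NONE      0x0  /* 00: the file has no event lists          */
#define EV_ENC_READ      0x1  /* 01: read events will be generated        */
#define EV_ENC_WRITE     0x2  /* 10: write events will be generated       */
#define EV_ENC_OVERFLOW  0x3  /* 11: event list present but not encodable */

static unsigned char
encode_event_list(int has_read, int has_write)
{
    if (has_read && has_write)
        return EV_ENC_OVERFLOW;  /* debut event must load the real list */
    if (has_read)
        return EV_ENC_READ;
    if (has_write)
        return EV_ENC_WRITE;
    return EV_ENC_NONE;
}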

The encoding of events is not defined by the DMAPI. It is expected that different DMAPI implementations will encode events differently, depending on the data management applications to be supported.

Note:
Because the encoding of events is implementation defined, support for persistent non-opaque attributes is a DMAPI implementation option.

Event Ordering

Each implementation of the DMAPI may send a number of different events to a DM application as a result of a single system call. It may be helpful, but is not required, for DM application writers to know in advance the sequence of events for each system call. Unfortunately, determining this list is a non-trivial exercise. DM application writers should work with the implementor of the DMAPI to determine the sequence and ordering of events that occur for the system calls of interest.

If the application is multi-threaded, then the DM application writer's job is even more difficult. Synchronization primitives are necessary so that one process is not left hanging, waiting for an event that has already been serviced by another process.

Lock Releasing

Some DMAPI implementations may have special lock restrictions. Some may be unable to upgrade an access right from DM_RIGHT_SHARED to DM_RIGHT_EXCL without sleeping, while others may have special primitives that allow them to grant whatever right the DM application requires. Some DMAPI implementations may also have special requirements with regard to releasing locks. During the servicing of events, it may not always be possible to relinquish a right. DM application writers should not assume that it is always possible to release a right (via dm_release_right()). The careful DM application writer will always check return codes from all DMAPI functions.
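
As a minimal sketch (assuming only the XDSM signature of dm_release_right(); the helper name is hypothetical), a careful caller might look like this:

#include <dmapi.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper: try to relinquish a right, tolerating failure,
 * since a right cannot always be released while an event is being
 * serviced. */
static int
give_up_right(dm_sessid_t sid, void *hanp, size_t hlen, dm_token_t token)
{
    if (dm_release_right(sid, hanp, hlen, token) == -1) {
        /* The right could not be released; the caller must proceed
         * on the assumption that it is still held. */
        return errno;
    }
    return 0;
}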

Tokens, Messages and Handles

Tokens reference access rights for handles, and must always be associated with a message. An implementor may wish to view a token as simply a message ID. When viewed this way, the token is similar to a file descriptor, in that it just references state maintained in the kernel. Given the one-to-one correspondence of tokens and messages, one can think of them (within the kernel) as separate objects or as a single object, independent of how they are actually implemented.
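
For illustration only, a kernel-side arrangement along the following lines is conceivable; none of these structure or field names are mandated by the DMAPI:

#include <dmapi.h>
#include <stddef.h>

/* Hypothetical kernel state: the token is simply the ID of a message,
 * and the message records the access rights the token references. */
struct dm_right_entry {
    void                  *re_hanp;   /* file handle                     */
    size_t                 re_hlen;
    dm_right_t             re_right;  /* e.g. DM_RIGHT_SHARED            */
    struct dm_right_entry *re_next;
};

struct dm_msg_state {
    dm_token_t             ms_token;  /* token doubles as the message ID */
    struct dm_right_entry *ms_rights; /* rights referenced by the token  */
    struct dm_msg_state   *ms_next;   /* per-session message list        */
};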

mmap

There are many operating systems that use mmap(2) as a mechanism for utilities to copy file data. Unfortunately, this can make implementation of the DMAPI difficult, since a page fault often occurs at a much later point in time than when the file is actually mapped. Usually, there are also very stringent locking restrictions in place at the time the page fault occurs.

To get around these problems, some systems may have to adopt the policy that, at the time the file is mapped (that is, when the actual mmap(2) call is executed), a non-resident file is made resident. This is sub-optimal from a disk space utilization standpoint, and it also means that a large file that is mmap'd cannot be partially resident. However, the alternative is to place DMAPI call-outs deep in the VM subsystem, which is usually a very complex exercise and is not recommended unless considerable expertise is available.

Invisible I/O

Invisible read and write can also place special burdens on the implementor of the DMAPI. Invisible I/O is typically used by a DM application to reload data for a non-resident file. This operation should not modify any of a file's time stamps, nor should it cause events to be generated.
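
As an example, a DM application might reload one region of a non-resident file roughly as follows. This is a sketch assuming the XDSM signature of dm_write_invis(); restore_region() and its error handling are hypothetical.

#include <dmapi.h>

/* Hypothetical helper: reload a region of a non-resident file.
 * dm_write_invis() writes the data without updating the file's time
 * stamps and without generating events. */
static int
restore_region(dm_sessid_t sid, void *hanp, size_t hlen,
               dm_token_t token, dm_off_t off, dm_size_t len, void *buf)
{
    dm_ssize_t written;

    written = dm_write_invis(sid, hanp, hlen, token,
                             DM_WRITE_SYNC, off, len, buf);
    return (written == (dm_ssize_t)len) ? 0 : -1;
}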

In systems where pages are encached on the vnode, this can lead to troublesome locking and coherency issues. How is the actual write to disk performed? Is the page cache bypassed? How do you ensure that encached pages are dealt with correctly?

Generation of Events

The placement of the code that actually generates events will differ from platform to platform. The point at which an event is actually generated will also differ from system to system, and should not concern the DM application writer. The DMAPI implementor will need to take into consideration lock state, the likelihood of the operation succeeding, and so forth, to determine the best location for the actual callout.

In the case of managed region events, the DMAPI implementor must ensure that a poorly-behaved DM application does not cause the system to behave in an unexpected manner. For example, if a file has multiple managed regions that represent non-resident data, and a DM application only restores the data for one of those regions, the system must be sure that operations such as read-ahead do not cause multiple events to occur. For further information, see the Managed Region description in Managed Regions.

Locking Across Operations

To ensure consistency, some DM applications may wish to enforce their own locking scheme across operations. They may want to develop a wrapper around some operations in order to synchronize and/or serialize access to files, as in the sketch below.
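
This sketch assumes the XDSM signatures of dm_request_right() and dm_release_right(); the wrapper itself and its callback are hypothetical:

#include <dmapi.h>
#include <stddef.h>

/* Hypothetical wrapper: serialize an operation on a file by holding
 * an exclusive right for its duration.  do_op is a placeholder for
 * the operation being protected. */
static int
with_excl_right(dm_sessid_t sid, void *hanp, size_t hlen,
                dm_token_t token, int (*do_op)(void *), void *arg)
{
    int ret;

    if (dm_request_right(sid, hanp, hlen, token,
                         DM_RR_WAIT, DM_RIGHT_EXCL) == -1)
        return -1;
    ret = do_op(arg);
    /* Releasing may legitimately fail; see Lock Releasing above. */
    (void)dm_release_right(sid, hanp, hlen, token);
    return ret;
}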

Tokens and Multiple Handles

Tokens may reference access rights for more than one file handle. This makes certain operations easier, such as obtaining access rights to a list of file handles; the same token can be reused without incurring the overhead of dm_create_userevent() to construct a new token for each file handle. However, allowing a single token to reference multiple handles can make recovery more difficult. How does a DM application determine which file handles have access rights referenced by a single token?

During recovery, a DM application can always execute dm_respond_event(), give it the offending token, and let the DMAPI release any and all access rights associated with the token. If the DM application needs to be selective about which file handles have their access rights released, then it (the DM application) must provide some mechanism external to the DMAPI to log which file handles are associated with a token. The DMAPI does not provide interfaces to identify multiple file handles from a single token.
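
For example, a recovering application that simply wants every right under a token released might respond to the outstanding event as follows (a sketch assuming the XDSM signature of dm_respond_event(); the choice of EIO is arbitrary):

#include <dmapi.h>
#include <errno.h>
#include <stddef.h>

/* Sketch: abort an event found during recovery, letting the DMAPI
 * release all access rights referenced by its token. */
static int
abandon_token(dm_sessid_t sid, dm_token_t token)
{
    /* DM_RESP_ABORT fails the original operation with the given
     * errno (EIO here, purely as an example). */
    return dm_respond_event(sid, token, DM_RESP_ABORT, EIO, 0, NULL);
}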

Structure Lists

Several DMAPI functions return lists of structures. Some of these functions return lists of variable-length structures. Since the length of each structure is not known in advance, DMAPI implementations must provide a mechanism for the DM application to access the various members of the list.

The DMAPI specifies that DM applications should use the DM_STEP_TO_NEXT macro to access variable-length structures in a list. However, the actual implementation of this macro is not defined. One suggestion is that each variable-length structure should have a field in a well-known position (say, offset zero) or use a special field name that is opaque to the DM application. For example, using the field name approach, the definition of dm_eventmsg_t would be:

struct dm_eventmsg {
    ssize_t         _link;
    dm_type_t       ev_type;
    dm_token_t      ev_token;
    dm_vardata_t    ev_data;
};

The definition of the DM_STEP_TO_NEXT macro would then become:

#define DM_STEP_TO_NEXT(p, type) \
    ((type)((p)->_link ? (char *)(p) + (p)->_link : NULL))
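
With such a definition, a DM application can walk a buffer of variable-length event messages, such as one filled in by dm_get_events(), like this (a sketch assuming the usual dm_eventmsg_t typedef; the processing inside the loop is elided):

#include <dmapi.h>
#include <stddef.h>

/* Sketch: visit each variable-length event message packed into a
 * buffer.  The final message's _link field is zero, so the macro
 * yields NULL and the loop terminates. */
static void
walk_events(void *bufp)
{
    dm_eventmsg_t *msg;

    for (msg = (dm_eventmsg_t *)bufp;
         msg != NULL;
         msg = DM_STEP_TO_NEXT(msg, dm_eventmsg_t *)) {
        /* dispatch on msg->ev_type, using msg->ev_token ... */
    }
}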

Undeliverable Event Messages

The implementation of the DMAPI needs to specify the guidelines for delivery of a synchronous event message when no session exists to receive it. There are three choices with regard to synchronous event message delivery when no session exists:

For example, if the implementation chooses to return an error to the process, the error may be specific to the operation that caused the event. This means that, depending on the operation, two different errors can be returned for the same event type.

The implementation must also define the behavior for asynchronous events.

dm_vardata_t

One possible implementation of the dm_vardata_t structure is:

struct dm_vardata {
    ssize_t   vd_offset;
    size_t    vd_length;
};
typedef struct dm_vardata     dm_vardata_t;

where the offset field (vd_offset) in dm_vardata_t records the distance in bytes from the beginning of the structure at which the variable-length data begins.

In the case of an event message, it records the beginning of the actual event-specific data. For other structures, such as dm_stat_t, it indicates where the handle data can be found. The definition of the two access macros would then be:

#define DM_GET_VALUE(p, field, type) \
            ((type)((char *)(p) + (p)->field.vd_offset))

#define DM_GET_LEN(p, field)   ((p)->field.vd_length)
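
Given those definitions, a DM application would extract the variable-length portions of a structure as follows (a sketch assuming the usual dm_data_event_t layout with its de_handle field):

#include <dmapi.h>
#include <stddef.h>

/* Sketch: pull the event-specific data out of an event message, then
 * the file handle out of that data. */
static void
examine_msg(dm_eventmsg_t *msg)
{
    dm_data_event_t *de;
    void            *hanp;
    size_t           hlen;

    de   = DM_GET_VALUE(msg, ev_data, dm_data_event_t *);
    hanp = DM_GET_VALUE(de, de_handle, void *);
    hlen = DM_GET_LEN(de, de_handle);
    /* ... use hanp/hlen to identify the file ... */
    (void)hanp;
    (void)hlen;
}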

NFS Daemon Starvation

Special consideration needs to be taken when using DMAPI on a file system that is exported via NFS. Because a migrate-in operation can potentially take several seconds or even minutes, a large number of NFS client requests to files that are staged out could lead to NFS daemon starvation. Each NFS daemon could be waiting for a DMAPI operation to complete, with no free daemon threads left to accept new requests.

A possible solution to this problem is to devise a method for the file system to notify the NFS daemon when it detects that an operation will take an unusually long time. The NFS daemon could then fork a separate thread to wait for the migration to complete, and also send an EJUKEBOX notification to the NFS client.

The dm_pending() interface allows the DM application to notify the DMAPI implementation that an operation is expected to be slow. The implementation may then take appropriate steps to notify NFS, as in the sketch below.
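
A sketch of that notification, assuming the XDSM signature of dm_pending() (dm_timestruct_t carries an estimate of the expected delay; a zeroed estimate is used here for simplicity):

#include <dmapi.h>
#include <string.h>

/* Sketch: tell the implementation that servicing this event will be
 * slow, so it can, for example, propagate EJUKEBOX-style "try again
 * later" indications to NFS clients. */
static void
note_slow_recall(dm_sessid_t sid, dm_token_t token)
{
    dm_timestruct_t delay;

    memset(&delay, 0, sizeof(delay));   /* no delay estimate supplied */
    (void)dm_pending(sid, token, &delay);
}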

Unmount and Shutdown Deadlock

Unmounting a DMAPI-managed file system can take a long time. This means the shutdown script needs to wait for the DM applications to finish their unmount activities before killing off processes. If shutdown kills the DM application process threads without first giving the DM application a chance to clean up its session IDs, the unmount event will be posted to an orphan session and can never be responded to; unmount will then hang. However, if shutdown does not kill the non-DM application process threads, the file system may look busy forever, and unmount will likewise hang. shutdown therefore needs a way to kill all processes except the DM application processes.

The dt_change Field in dm_stat

The suggested implementation is to keep an in-kernel counter that is incremented every time it is read. In this way, released in-core inodes are handled properly: because every read returns a fresh value, an inode that is released and later reloaded can never present a stale dt_change value.
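
A minimal sketch of such a counter (purely illustrative; in a real kernel the counter would be protected by a lock):

/* Hypothetical global counter backing dt_change.  Bumping it on every
 * read guarantees a fresh value, so an inode that is released and
 * later reloaded can never present a stale, reused dt_change. */
static unsigned long dt_change_counter;

static unsigned long
next_dt_change(void)
{
    return ++dt_change_counter;
}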

Punching Holes

If a call to dm_punch_hole() frees media resources, the DMAPI implementation should indicate these freed resources in subsequent calls to dm_get_allocinfo(), by describing the freed extent with the DM_EXTENT_HOLE flag.
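
For example (a sketch assuming the XDSM signatures of dm_punch_hole() and dm_get_allocinfo(); error handling is abbreviated):

#include <dmapi.h>
#include <stddef.h>

/* Sketch: punch a hole, then confirm that dm_get_allocinfo() reports
 * the freed extent with the DM_EXTENT_HOLE flag. */
static int
punch_and_check(dm_sessid_t sid, void *hanp, size_t hlen,
                dm_token_t token, dm_off_t off, dm_size_t len)
{
    dm_extent_t ext;
    dm_off_t    pos = off;
    u_int       nret;

    if (dm_punch_hole(sid, hanp, hlen, token, off, len) == -1)
        return -1;
    if (dm_get_allocinfo(sid, hanp, hlen, token,
                         &pos, 1, &ext, &nret) == -1)
        return -1;
    return (nret == 1 && ext.ex_type == DM_EXTENT_HOLE) ? 0 : -1;
}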

