Systems Management: Event Management Service (XEMS)
Copyright © 1997 The Open Group

Introduction

Purpose

Increasingly critical and complex systems will only be cost-controllable if many of today's systems management and administration activities can be automated. This is true for the user's service provision organizations, system vendors and ISVs.

A well designed Event Management Service (EMS) will be of increasing importance to organizations as they become increasingly dependent on information technology (IT) services; it is a fundamental component needed to maintain service availability by:

Organizations are (increasingly) prepared to pay vast sums to duplicate system components (processors, disks, network connections) to maintain systems reliability. As system complexity increases, a largely automated EMS is a necessity for maintaining service availability at a reasonable cost, given systems created by integrating diverse components from an increasing number of suppliers (systems vendors, ISVs, in-house IT developers).

A set of event standards with which these components can inter-communicate events of interest is a fundamental need (that is, inter-connection is not enough; inter-comprehension is required).

Background

Automated detection of, and response to, events is required in order to manage and monitor distributed systems effectively. Today's trouble ticketing systems are not sufficient, primarily because they are reactive systems, where a proactive system is required.

Examples of events include such things as program termination, node available, node down, administrator-defined traps, and so on. Provided independent vendors use the same EMS infrastructure, those same vendors can send and correlate events among each other's system management products.

The value of an EMS can only partially be measured by technical merit, and that value is leveraged many times over by the number of independent vendors that can and will utilize the same standard mechanism; therefore the value of this type of standard increases exponentially with the number of vendors in compliance with the standard.

Scope

This specification addresses event management services for systems administration purposes. However, like SNMP, this technology may well be applicable outside its original charter (for example, management of customer applications).

It is not the function of the EMS to generate events. The underlying managed objects (or some proxy) must raise events as appropriate. The EMS, however, must process all these events in a real-time fashion, administer their definition, enabling, and so on.

Likewise, it is not the function of EMS to provide high-resolution interval data. For example, EMS is not designed to be the feed for a performance meter for CPU utilization at 5-second intervals. The design space for a performance meter would preclude the use of a persistent cache (required for EMS reliability); for example, one would not expect a performance meter to provide both real-time and historical data.

Requirements

Event Notification API
An API is needed to specify how events are delivered to applications. This API must include the attributes described in Event Construction below.

Event Subscription API
An event subscription API is needed for applications to tell the system which events are of interest, which should be forwarded, and where. For example, a database expert might subscribe to all database-specific events, as well as some network-specific events.

When subscribing to events it must be possible to designate all instances of an event (for example, all table drop actions), all events of a sub-category (for example, all DDL actions) or all events of a category (for example, all user actions). It must be possible to qualify the events of interest (at any level) by Boolean expression of attribute (for example, events in a particular management domain and time stamps between 9am and 5pm, or all events in a geographic region of a certain priority).
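
As an informal illustration (not part of this specification), the following Python sketch shows one way a subscription could designate events at the category, sub-category or instance level and qualify them with a Boolean expression over attributes. The `Event` and `Subscription` classes, the tuple-based event path, and the attribute names `domain` and `hour` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple


@dataclass
class Event:
    # Hypothetical naming scheme: (category, sub-category, event type).
    path: Tuple[str, ...]
    attributes: Dict[str, Any] = field(default_factory=dict)


def _accept_all(attributes: Dict[str, Any]) -> bool:
    # Default qualifier: no restriction on attributes.
    return True


@dataclass
class Subscription:
    # A prefix of an event path: one element designates a whole category,
    # two a sub-category, three all instances of a single event type.
    prefix: Tuple[str, ...]
    # Boolean expression over the event attributes.
    qualifier: Callable[[Dict[str, Any]], bool] = _accept_all

    def matches(self, event: Event) -> bool:
        in_scope = event.path[: len(self.prefix)] == self.prefix
        return in_scope and self.qualifier(event.attributes)


# All DDL actions, unqualified.
all_ddl = Subscription(("user_action", "ddl"))

# All user actions in one management domain between 9am and 5pm.
office_hours = Subscription(
    ("user_action",),
    lambda a: a.get("domain") == "finance" and 9 <= a.get("hour", -1) < 17,
)

table_drop = Event(
    ("user_action", "ddl", "table_drop"), {"domain": "finance", "hour": 10}
)
```

Treating the designation as a path prefix keeps the three levels symmetrical: the same matching rule applies whether one, two or three elements are given.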

Event Construction
There is a set of attributes that are common to events. These include:

In addition, the above classifications must be extensible to support specific applications and logical extensions of this technology without breaking compatibility in the way the event service and applications handle the default attributes.

Global Name Service
It is expected that EMS will be most useful in a large networked environment where multiple-sourced applications may subscribe to events from each other in order to coordinate their activity. The event service must participate in a global naming standard such that separately developed applications do not have name conflicts when these applications meet in the marketplace at the user's desktop.

The demand here is for a standard naming or numbering system which can:

Centralized Event Management
Key to successful event management is the ability to control and view event services from a single centralized point. The concept of a centralized event management service is purely logical. It may make sense from the performance point of view to implement overlapping distributed event handling processes that minimize network traffic and maximize availability. However, to the user, the EMS must be able to look like a centralized service.

At the same time, there is a need for an event routing system that allows particular categories/classes of event to be ultimately routed to a particular (configurable) process, user or desk. For example, database events might be routed to database administrators, network events to a network specialist, process failures to a watchdog task, and so on.
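
A configurable routing table of this kind might be sketched as follows. This is an illustration only: the `Router` class, the category names and the destination strings are hypothetical, and the "most specific route wins" rule is a design choice of the sketch rather than a requirement stated here.

```python
from typing import Dict, List, Sequence, Tuple


class Router:
    """Maps event category paths to configurable destinations."""

    def __init__(self) -> None:
        self._routes: Dict[Tuple[str, ...], List[str]] = {}

    def add_route(self, prefix: Sequence[str], destination: str) -> None:
        self._routes.setdefault(tuple(prefix), []).append(destination)

    def destinations(self, path: Sequence[str]) -> List[str]:
        """Return the destinations of the most specific matching route."""
        path = tuple(path)
        # Try the full path first, then progressively shorter prefixes.
        for length in range(len(path), -1, -1):
            found = self._routes.get(path[:length])
            if found:
                return list(found)
        return []


router = Router()
router.add_route(["database"], "dba-desk")
router.add_route(["network"], "network-specialist")
router.add_route(["process", "failure"], "watchdog-task")
```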

It will probably be appropriate for there to be multiple routes to access event management information from a user interface perspective; the issue of centralization is that once you are in position to access some event management information, you will actually be able to access all such information (to which you have authorized access).

Defining and Designating Events
There is a need to allow for arbitrary user-defined events - the event services mechanism must be user-extensible.

Some events will be pre-defined in the management applications, and some events are arbitrary in nature, and defined according to specific needs of a customer. An example of a pre-defined event is notification of a server shutdown or a communications failure. Management applications will have default behaviors for responding to pre-defined events. Statistics-based events (as described in Managed Server Performance Events below) are examples of customer-defined events.

Categories of Events
All categories (types) of events should be handled symmetrically (that is, services available to handle any particular type of event should be equally applicable to all types of event).

Events can be grouped into categories and sub-categories. For example, one category of event may be User Action Events. Within that there may be a sub-category, Data Definition Language (DDL) Action Events (user changes a schema object).

One particular DDL Action Event might be "Column Added to Table". All such instances of this event have exactly the same attributes: server, database, table name, column name, and so on.

The following is an example of the category hierarchy that the event services must be able to handle:

  1. Managed Server Status Events
    Notification of change in status of managed database server (for example, starts, becomes suspect, dies normally, dies abnormally).

  2. Managed Server Performance Events

    Note that these statistics may be at any level of aggregation (for example, by server, by object, by user, by group of servers).

  3. User Action Events
    This refers to user actions that take place on managed servers:

    An attribute of all these events should indicate success or failure of the operation. For long running actions (or perhaps all actions) it should be possible to raise events both at commencement and completion of the action.

  4. Management Action Events

    An attribute of all these events should indicate success or failure of the operation. For long running actions (or perhaps all actions) it should be possible to raise events both at commencement and completion of the action.

  5. External Signals

  6. Self-management Events

  7. User-defined Events
    It should be possible to raise user-defined events that can have arbitrary data structures associated with them, so that a customer can use the event mechanism for asynchronous notifications of any sort. This would be particularly useful, for example, for standardized handling of error conditions discovered in the middle of command scripts without having to code the same error handling into multiple scripts.

  8. Composite Events
    A mechanism is required for creating a composite event that is raised when a series of other events (be they simple or composite) occur within a given period of time. This is described more fully below.

    A single event notification is often only one small part of a larger picture. Only when certain events occur in relative proximity can some sense be made of a situation and some appropriate response be made. Therefore, composite events are key to proper event management. Without a composite event mechanism, systems administrators would end up writing complex scripts to manage event combinations that mirror real situations. For example, a series of performance threshold breaches may indicate a serious problem where each individual breach is merely an interesting event to be watched for future use.
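
To make the idea concrete, the sketch below raises a composite event once every one of a set of constituent events has been seen within a time window. It is a simplified illustration: the `CompositeEvent` class and the event names are hypothetical, timestamps are assumed to arrive in non-decreasing order, and ordering between constituents is ignored.

```python
from typing import Dict, Iterable, Optional


class CompositeEvent:
    def __init__(self, name: str, required: Iterable[str], window: float) -> None:
        self.name = name
        self.required = frozenset(required)
        self.window = window  # seconds
        self._last_seen: Dict[str, float] = {}

    def observe(self, event_name: str, timestamp: float) -> Optional[str]:
        """Record one event; return the composite name if it is raised."""
        if event_name not in self.required:
            return None
        self._last_seen[event_name] = timestamp
        # Forget constituents that have fallen outside the window.
        self._last_seen = {
            n: t for n, t in self._last_seen.items()
            if timestamp - t <= self.window
        }
        if set(self._last_seen) == self.required:
            self._last_seen.clear()  # start afresh after raising
            return self.name
        return None


overload = CompositeEvent("server_overload", ["cpu_high", "io_high"], window=60)
```

Because `observe` returns an event name, its result can be fed to another `CompositeEvent`, which is one way composites could be built from other composites.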

Additional background to the requirements is given in the remainder of this subsection.

Architectural Niceties
Implementation of requirements in this section must not cause delay in providing the basic event management service; it is more important to get the basic EMS APIs (for subscription/notification) resolved so ISV and customer software can be rewritten to those APIs as soon as possible.

However, these features are required for the widespread deployment and use of EMS, so the general requirement for this section is that these features must be able to be layered on top of the basic EMS.

Binding Events to Actions
An event subscription invokes some action sequence when an event notification is received. For example, when a managed server shuts down unexpectedly you may want an on-screen notification and a beeper to be called. In this case, one would subscribe to the event by associating the action sequence to post the notification and call the beeper with the occurrence of this event.

It must be possible to associate any action to an event; indeed a generic execute script option is theoretically sufficient to meet all needs. However, there are certain actions for which our applications can provide more friendly support, and these include arbitrary combinations of:

In some cases, subscriptions involve actions on objects that are not active (for example, insert a row in a table where server is down, run a shell script on a node that is not on the network). A mechanism for storing such actions and executing them when possible is required. An event subscription failure event is needed so that such delays can be noticed.
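
One possible shape for such a mechanism is sketched below. The `DeferredActions` class, the `is_available` probe and the format of the failure event are assumptions of the example; a real implementation would also need persistent storage for the pending actions.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Action = Callable[[], None]


class DeferredActions:
    def __init__(self, is_available: Callable[[str], bool]) -> None:
        self._is_available = is_available
        self._pending: Deque[Tuple[str, Action]] = deque()
        # Stand-in for raising an "event subscription failure" event.
        self.failure_events: List[str] = []

    def run(self, target: str, action: Action) -> bool:
        """Run the action now, or store it if the target is not active."""
        if self._is_available(target):
            action()
            return True
        self._pending.append((target, action))
        self.failure_events.append(f"subscription_action_deferred:{target}")
        return False

    def retry(self) -> int:
        """Run stored actions whose target is now active; return how many ran."""
        still_pending: Deque[Tuple[str, Action]] = deque()
        ran = 0
        while self._pending:
            target, action = self._pending.popleft()
            if self._is_available(target):
                action()
                ran += 1
            else:
                still_pending.append((target, action))
        self._pending = still_pending
        return ran
```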

Navigating the EMS Superhighway
There need to be two basic ways of navigating through the user interface to define events. One is starting from a general event management selection that drills down to the particular event you wish to manage, and the second is from the dialog set for managing a particular object by allowing you to select a "manage events" option.

Convenience Features/Toggling Event Subscriptions
As well as creating a subscription, you may want to deactivate it temporarily, and then reactivate it at a later time without having to recall all the details of what action sequence was associated with what designated event(s).

By placing subscriptions in collections one could activate and deactivate sets of subscriptions together. For example, if one had the need to watch a particular group of resources for threshold events from time to time (say, during heavy business cycles), one could activate the subscriptions for the period of interest and deactivate them for the rest of the time.
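
A collection of this kind could be as simple as the following sketch, in which the `SubscriptionCollection` class and the subscription identifiers are hypothetical. The point is that the details of each subscription are retained while it is inactive, so reactivation needs no further input.

```python
from typing import Dict, List


class SubscriptionCollection:
    def __init__(self, name: str) -> None:
        self.name = name
        # Subscription id -> whether it is currently active.
        self._active: Dict[str, bool] = {}

    def add(self, subscription_id: str, active: bool = False) -> None:
        self._active[subscription_id] = active

    def set_active(self, active: bool) -> None:
        """Activate or deactivate every subscription in the collection."""
        for subscription_id in self._active:
            self._active[subscription_id] = active

    def active_ids(self) -> List[str]:
        return sorted(s for s, on in self._active.items() if on)


peak = SubscriptionCollection("heavy-business-cycle")
peak.add("disk-threshold")
peak.add("cpu-threshold")
```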

Programmable Event Filters
The event service should allow for the provision of programmable event filters at suitable nodes in a network to minimize "event storms" and to ensure adequate (end-user) service levels. That is, a programmable event filter will identify the most important of a number of events arising at that point within a given (configurable) time period, identify the relationship between them, and forward only a single consolidated event.

In this way, "policies" can be established for both event filters and for action systems built on top of an event system. It is understood that programmable filters satisfy a higher level need which may best be layered above the event service; so the requirement here is to ensure programmable event filters can be layered on top of the event service.
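
As a rough illustration of a filter layered on top of the event service, the sketch below groups events that arrive within a configurable period and forwards only the most important one, together with a count of those suppressed. The `Event` and `Consolidated` types, the numeric `priority` attribute (larger meaning more important) and the batching rule are assumptions of the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Event:
    name: str
    timestamp: float
    priority: int  # assumption: larger means more important


@dataclass
class Consolidated:
    event: Event      # the event chosen to be forwarded
    suppressed: int   # how many related events were held back


def _summarize(batch: List[Event]) -> Consolidated:
    top = max(batch, key=lambda e: e.priority)
    return Consolidated(top, len(batch) - 1)


def consolidate(events: Iterable[Event], period: float) -> List[Consolidated]:
    """Forward one event per batch; a batch spans at most `period` seconds."""
    forwarded: List[Consolidated] = []
    batch: List[Event] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if batch and event.timestamp - batch[0].timestamp > period:
            forwarded.append(_summarize(batch))
            batch = []
        batch.append(event)
    if batch:
        forwarded.append(_summarize(batch))
    return forwarded
```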

Event Definition Language
A common event definition language (EDL) should be specified. This language would allow the specification of all events which an application or managed object generates. This would allow ISVs to ship a list of events that their product could cause, resulting in minimum user effort in integration of new applications into the EMS.

Performance

In pursuit of efficiency, events must only be posted when active subscriptions are associated with the event. Whenever an event is activated or deactivated, the EMS must ensure that the posting process (probably a managed server or its proxy) is informed whether or not to send the particular event notification.
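
The interaction described above might look like the following sketch. The `Producer` and `SubscriptionRegistry` classes are hypothetical; the sketch assumes the EMS keeps a count of active subscriptions per event type and informs the posting process only when that count moves between zero and non-zero.

```python
from collections import Counter
from typing import Any, Callable, Set


class Producer:
    """Stands in for a managed server (or its proxy) that posts events."""

    def __init__(self, send: Callable[[str, Any], None]) -> None:
        self._send = send
        self._enabled: Set[str] = set()

    def set_enabled(self, event_type: str, enabled: bool) -> None:
        if enabled:
            self._enabled.add(event_type)
        else:
            self._enabled.discard(event_type)

    def post(self, event_type: str, payload: Any) -> bool:
        """Send the notification only if someone is subscribed."""
        if event_type not in self._enabled:
            return False
        self._send(event_type, payload)
        return True


class SubscriptionRegistry:
    def __init__(self, producer: Producer) -> None:
        self._producer = producer
        self._active: Counter = Counter()

    def activate(self, event_type: str) -> None:
        self._active[event_type] += 1
        if self._active[event_type] == 1:  # first active subscription
            self._producer.set_enabled(event_type, True)

    def deactivate(self, event_type: str) -> None:
        if self._active[event_type] == 0:
            return
        self._active[event_type] -= 1
        if self._active[event_type] == 0:  # last active subscription gone
            self._producer.set_enabled(event_type, False)
```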

A key requirement is that the event services have good performance characteristics, and that they do not degrade performance of other processes on the network. This general events mechanism must be efficient enough for real-time performance monitoring of operating system-level activity.

The infrastructure should provide the ability to monitor event notification traffic, and to create performance threshold events on such data. Unfortunately, it will be relatively easy to configure the event services to create event cascades and event storms. This would be through specifying an action to occur when an event is raised, that itself will cause other event notifications which cause more actions with more events, and so on.

In general, it is required that there be minimal propagation of events, to avoid performance degradation. If a store and forward mechanism for event notifications is used, then event aging must be monitored as a key performance indicator. Minimal performance degradation could be achieved through replicating subscription action scripts to the event handling agents near the corresponding managed nodes being watched. (Such replication also enhances fault tolerance.) This could be enhanced by self-load-balancing where event handling agents start up and die according to need on arbitrary managed nodes.
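
Event aging in a store and forward mechanism could be measured along the following lines. The `StoreAndForwardQueue` class is hypothetical, and the caller supplies the current time explicitly so that the sketch stays independent of any particular clock.

```python
from collections import deque
from typing import Any, Deque, Optional, Tuple


class StoreAndForwardQueue:
    def __init__(self) -> None:
        # Each entry is (time stored, event notification).
        self._held: Deque[Tuple[float, Any]] = deque()

    def store(self, event: Any, now: float) -> None:
        self._held.append((now, event))

    def forward(self, now: float) -> Optional[Tuple[Any, float]]:
        """Release the oldest notification with the time it spent waiting."""
        if not self._held:
            return None
        stored_at, event = self._held.popleft()
        return event, now - stored_at

    def oldest_age(self, now: float) -> float:
        """Age of the oldest held notification; 0.0 when nothing is held."""
        return now - self._held[0][0] if self._held else 0.0
```

A threshold on `oldest_age` is one candidate for the performance threshold events mentioned earlier in this section.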

It is important, for performance reasons, that the only events posted for distribution are those events with active subscriptions (that is, someone is interested in listening to them); otherwise event notifications may well clog the network.

From a performance standpoint, the transport protocol used by the event notification messages may be important. If the event notifications are not sent on a connectionless protocol (most likely to be performant) then appropriate performance characteristics must be validated.

Reliability

Since event management applications are responsible for sending notification of any system and network problems to responsible operators, the management application itself as well as the underlying EMS must be absolutely reliable.

Standardization and Portability

The EMS provides a generic (that is, implementation independent) API. This API permits source code compatibility across implementations. The API provides functions for consumers, producers, and administrators.

Extensibility

As the technology moves, new managed objects and associated events will be required. In addition, customers and vendors may supply events for their applications. All of this must occur in a seamless manner.

Security

Since the event subscription service API is the window for applications to see generic system-wide activity, applications must be prevented from unauthorized snooping of system behavior at this access point. Access to event subscription and composite event construction must be secured by the access permissions of the managed objects.

Internationalization

The EMS must be compliant with internationalization (I18n) requirements.

Interoperability

The event management services from different vendors must interoperate.