
Portable Layout Services: Context-dependent and Directional Text

Copyright © 1997 The Open Group

Header File <sys/layout.h>

This chapter describes the opaque LayoutObject type and the other data types and layout values used by the layout services APIs. These are all defined in <sys/layout.h>, which is implementation dependent.

LayoutObject

The LayoutObject is an opaque structure that includes values and methods corresponding to a specific locale.

Association with Attribute Objects and Locales

Taking into account emerging trends to facilitate internationalised functions that satisfy multi-locale, multi-threading and multi-node processing, the layout object is associated with a generalised AttrObject that might contain other objects beyond the locale object. For information on type AttrObject, see the Distributed Internationalisation Services snapshot.

In the absence of an AttrObject, the locale defaults to the locale supported by the setlocale() function (see the XSH, Issue 4 specification).

LayoutObject Content

LayoutObject contains or points to:

More detailed descriptions of the different components of LayoutObject follow in subsequent sections.

Layout Values

Descriptors

The different descriptors are described briefly in the following sections. The letters S and G indicate whether the value may be set or retrieved, as shown in Standard Layout Values.

For descriptors that do not have an S indicator, the way in which their initial value is set is implementation dependent.

Orientation (SG)
In bidirectional languages, some characters (such as the English letters) are considered to have a strong left-to-right orientation; other characters (such as the Arabic characters) are considered strong right-to-left characters; and other characters (such as punctuation marks, spaces, and so on) do not have a strong direction associated with them.

The descriptor Orientation specifies the global directional text orientation. Possible values are:

ORIENTATION_LTR

Left-to-right horizontal rows that progress from top to bottom.

ORIENTATION_RTL

Right-to-left horizontal rows that progress from top to bottom.

ORIENTATION_TTBRL

Top-to-bottom vertical columns that progress from right to left.

ORIENTATION_TTBLR

Top-to-bottom vertical columns that progress from left to right.

ORIENTATION_CONTEXTUAL

The global orientation is set according to the direction of the first significant (strong) character.

If there are no strong characters in the text and the descriptor is set to this value, the global orientation of the text is set according to the value of the descriptor Context. This option is meaningful only for bidirectional text.

The initial state for Orientation is dependent on the LayoutObject. If no value is present, the default is ORIENTATION_LTR.

Context (SG)
The descriptor Context is meaningful only if the descriptor Orientation is set to ORIENTATION_CONTEXTUAL. It defines what orientation is assumed when no strong character appears in the text. Possible values are:

CONTEXT_LTR

In the absence of characters with strong directionality in the text, orientation is assumed to be left-to-right rows progressing from top to bottom.

CONTEXT_RTL

In the absence of characters with strong directionality in the text, orientation is assumed to be right-to-left rows progressing from top to bottom.

If no value is specified, the default is CONTEXT_LTR.
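
The interaction between ORIENTATION_CONTEXTUAL and Context can be sketched as follows. This is only an illustration under stated assumptions: the Direction type, the strong_class() classifier and its character ranges (Basic Latin, Hebrew and Arabic letters) are hypothetical stand-ins, not definitions from <sys/layout.h>, and digits are simply treated as neutral.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in; the real constants come from <sys/layout.h>. */
typedef enum { DIR_LTR, DIR_RTL } Direction;

/* Toy classifier: 1 = strong LTR, 2 = strong RTL, 0 = no strong direction. */
static int strong_class(uint32_t cp)
{
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z'))
        return 1;                         /* Basic Latin letters */
    if (cp >= 0x05D0 && cp <= 0x05EA)
        return 2;                         /* Hebrew letters */
    if (cp >= 0x0621 && cp <= 0x064A)
        return 2;                         /* Arabic letters (approximate range) */
    return 0;
}

/* Resolve a contextual orientation: the first strong character decides;
 * if there is none, fall back to the Context value. */
Direction resolve_contextual(const uint32_t *text, size_t len, Direction context)
{
    assert(text != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        int cls = strong_class(text[i]);
        if (cls == 1)
            return DIR_LTR;
        if (cls == 2)
            return DIR_RTL;
    }
    return context;
}
```

For text consisting only of spaces and punctuation, the function returns whatever Context value the caller passes in, which mirrors the fallback described above.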

TypeOfText (SG)

The TypeOfText descriptor specifies the ordering of the directional text. Characters may have a natural orientation attached to them, as described under Orientation (SG). This characteristic could be defined, for example, by the keywords left_to_right and right_to_left in the layout category LO_LTYPE (see LO_LTYPE Locale Category). Possible values are:

TEXT_VISUAL

Code elements are stored in visually ordered segments, which can be rendered without any segment inversion. In practice, the whole text can be treated as if it had no subsegments.

TEXT_IMPLICIT

Code elements are stored in logically ordered segments. Logically ordered means that the order in which the characters are stored is the same as the order in which the characters are pronounced when reading the presented text or the order in which characters would be entered from a keyboard. Logical order (or logical sequence) of characters is necessary for processing purposes, for example, when there is a need to sort or index the data. Segments of reversed orientation are recognised and inverted by a content-sensitive algorithm based on the natural orientation of characters. Because there are several possible algorithms for implicit reordering of directional segments, the ImplicitAlg layout value is used when TypeOfText is set to TEXT_IMPLICIT, to indicate the actual algorithm used.

TEXT_EXPLICIT

Code elements are stored in logically ordered segments with a set of embedded controls. The explicit algorithm eliminates the ambiguities that might exist in some situations when using an implicit algorithm, but it introduces the need for additional control characters in the data stream. The set of embedded controls for TEXT_EXPLICIT is implementation defined.

Consider the following possible embedded controls:

The LayoutObject preserves a bidirectional state across calls to the m_transform_layout() function. The directional state is reset to the initial state each time TypeOfText is set to any value.

Each LayoutObject is expected to provide transformation from each of the above types to any of the other types. However, some transformations may cause layout (directional) information to be lost so that text is presented differently after a round trip transformation. This does not imply any data loss, but only possible loss in layout information.

If the TypeOfText value is not specifically stated, the default (for the C locale) is TEXT_IMPLICIT.
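
To make the difference between logical (TEXT_IMPLICIT) and visual (TEXT_VISUAL) storage concrete, the following toy sketch converts a logically ordered ASCII string to visual order for a left-to-right global orientation. It is not the Unicode algorithm nor any algorithm defined by this specification: upper-case letters merely stand in for strong right-to-left characters, spaces between them are treated as part of the right-to-left segment, and the function name is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Reverse buf[lo..hi] in place (inclusive bounds). */
static void reverse_span(char *buf, size_t lo, size_t hi)
{
    while (lo < hi) {
        char tmp = buf[lo];
        buf[lo++] = buf[hi];
        buf[hi--] = tmp;
    }
}

/* Toy implicit-to-visual conversion. Upper-case ASCII letters play the
 * role of strong right-to-left characters; each segment made of such
 * letters (and the spaces between them) is inverted. */
void toy_logical_to_visual(char *buf)
{
    assert(buf != NULL);
    size_t len = strlen(buf);
    size_t i = 0;
    while (i < len) {
        if (!isupper((unsigned char)buf[i])) {
            i++;
            continue;
        }
        size_t last = i;                  /* last RTL stand-in in the segment */
        size_t j = i + 1;
        while (j < len && (isupper((unsigned char)buf[j]) || buf[j] == ' ')) {
            if (buf[j] != ' ')
                last = j;
            j++;
        }
        reverse_span(buf, i, last);
        i = last + 1;
    }
}
```

With this convention, the logical string "abc DEF GHI xyz" becomes "abc IHG FED xyz": the right-to-left segment is inverted as a whole, while the left-to-right text keeps its place.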

ImplicitAlg (SG)

The ImplicitAlg descriptor specifies the type of bidirectional implicit algorithm used in reordering and shaping of directional or context-dependent text.

Possible values of ImplicitAlg are:

ALGOR_IMPLICIT

Directional code elements will be reordered using an implementation-defined implicit directional algorithm when converting to or from an implicit form.

Although the basic algorithm, used when ImplicitAlg is set to ALGOR_BASIC, is an implicit algorithm, it recognises some control characters, which allows it to be used even when the TypeOfText descriptor is set to TEXT_EXPLICIT.

Note that when TEXT_EXPLICIT is used in conjunction with ALGOR_BASIC, the controls may temporarily change the values of Swapping, Numerals and TextShaping. The ALGOR_IMPLICIT value may be equal to ALGOR_BASIC for a given implementation. Except in this case, it is not meaningful to have TypeOfText=TEXT_EXPLICIT at the same time as ImplicitAlg=ALGOR_IMPLICIT.


ALGOR_BASIC

Directional code elements should be reordered using the basic implicit directional algorithm when converting to and from an implicit form. The basic reordering algorithm is the Basic Display Algorithm published in the Unicode standard. It is inherently an implicit algorithm, but it may support certain explicit control characters. Among others, the following controls are recognised when reordering with ALGOR_BASIC:

LEFT-TO-RIGHT MARK (LRM)
RIGHT-TO-LEFT MARK (RLM)

All the controls can be found in the referenced Unicode standard.

If the ImplicitAlg value is not specifically stated, the default (for the C locale) is ALGOR_IMPLICIT.

Swapping (SG)

The Swapping descriptor specifies whether symmetric swapping is applied to the text. A list of symmetric swapping characters is given in the ISO/IEC 10646 standard. Possible values are:

SWAPPING_YES

The text conforms to symmetric swapping.

SWAPPING_NO

The text does not conform to symmetric swapping.

If no value is present, the default (for the C locale) is SWAPPING_NO.
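
Symmetric swapping can be pictured with a small sketch. The pair table below is an illustrative ASCII subset chosen for this example; the authoritative list is the one in the ISO/IEC 10646 standard, and the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Return the mirrored counterpart of c, or c itself if it has none.
 * Illustrative subset only; the full list is in ISO/IEC 10646. */
char mirror_char(char c)
{
    static const char pairs[] = "()[]{}<>";
    const char *p = (c != '\0') ? strchr(pairs, c) : NULL;
    if (p == NULL)
        return c;
    size_t idx = (size_t)(p - pairs);
    return pairs[idx ^ 1];                /* partner is the neighbouring entry */
}

/* Apply symmetric swapping to a NUL-terminated right-to-left segment. */
void swap_symmetric(char *segment)
{
    assert(segment != NULL);
    for (; *segment != '\0'; segment++)
        *segment = mirror_char(*segment);
}
```

An application would apply such a step only to segments presented right to left, so that an opening parenthesis still appears to open toward the text it encloses.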

Numerals (SG)

The Numerals descriptor specifies the shaping of numerals recognised by the LayoutObject. Possible values are:

NUMERALS_NOMINAL

Nominal shaping of numerals using the portable character set (Arabic numerals).

NUMERALS_NATIONAL

National shaping of numerals based on the script of the locale associated with the LayoutObject (such as the Thai, Farsi, Hindi, or Bengali script).

National numbers can be defined, for example, by using the keyword national_number in the layout category LO_LTYPE (see LO_LTYPE Locale Category).

NUMERALS_CONTEXTUAL

Contextual shaping of numerals depending on the context (script) of surrounding text (such as Hindi numbers in Arabic text and Arabic numbers otherwise).

If no value is specified the default value (for the C locale) is NUMERALS_NOMINAL.
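
The three Numerals values can be illustrated with the sketch below. It assumes that the national digits are the Arabic-Indic digits U+0660 to U+0669 and that "context" means the script of the most recent letter; the NumeralsMode type, the helper names and the character ranges are hypothetical and not taken from <sys/layout.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the Numerals values in <sys/layout.h>. */
typedef enum { NUM_NOMINAL, NUM_NATIONAL, NUM_CONTEXTUAL } NumeralsMode;

static int is_arabic_letter(uint32_t cp)
{
    return cp >= 0x0621 && cp <= 0x064A;  /* approximate range */
}

static int is_latin_letter(uint32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
}

/* Rewrite ASCII digits in place according to mode. National digits are
 * assumed to be ARABIC-INDIC DIGIT ZERO..NINE (U+0660..U+0669). */
void shape_numerals(uint32_t *text, size_t len, NumeralsMode mode)
{
    assert(text != NULL || len == 0);
    int arabic_context = 0;               /* script of the last letter seen */
    for (size_t i = 0; i < len; i++) {
        uint32_t cp = text[i];
        if (is_arabic_letter(cp)) {
            arabic_context = 1;
        } else if (is_latin_letter(cp)) {
            arabic_context = 0;
        } else if (cp >= '0' && cp <= '9') {
            if (mode == NUM_NATIONAL ||
                (mode == NUM_CONTEXTUAL && arabic_context))
                text[i] = 0x0660 + (cp - '0');
        }
    }
}
```

In contextual mode a digit that follows an Arabic letter is replaced, while a digit at the start of the text or after a Latin letter keeps its nominal form.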

TextShaping (SG)

The descriptor TextShaping specifies the shaping; that is, choosing (or composing) the correct shape of the input or output text.

Note:
This layout value is important, in particular for languages where the shapes of the characters, when presented, correspond to code points that may be different from the code points of the characters stored for processing:

Possible values of TextShaping are:

TEXT_SHAPED

The text has presentation form shapes.

TEXT_NOMINAL

The text is in basic form.

TEXT_SHFORM1

The text is in shape form 1.

TEXT_SHFORM2

The text is in shape form 2.

TEXT_SHFORM3

The text is in shape form 3.

TEXT_SHFORM4

The text is in shape form 4.

The set of shaping characters is limited to the codeset of the locale associated with the LayoutObject.

If no value is present, the default value (for the C locale) is TEXT_SHAPED.

In this document, the term shape form n is used to mean:

ActiveDirectional (G)

If the descriptor ActiveDirectional is set (True), then the LayoutObject includes knowledge of directional code elements, and proper rendering of text implies reordering of directional code elements. Otherwise the LayoutObject does not require any reordering of directional code elements. The way the value of this layout value is set is implementation dependent.

The ActiveDirectional value is guaranteed to remain unchanged for the life of the LayoutObject.

ActiveShapeEditing (G)

If the descriptor ActiveShapeEditing is set (True), the LayoutObject includes knowledge of context-dependent code elements (an automatic shape determination algorithm) that require shaping for presentation to the ShapeCharset.

The user of a LayoutObject is then required to initiate or perform some shaping transformation prior to rendering the text.

Otherwise, the application that uses the LayoutObject does not perform shaping, and all code elements may be presented independent of the surrounding characters.

The method used to set ActiveShapeEditing is implementation defined. The ActiveShapeEditing value is guaranteed to remain unchanged for the life of the LayoutObject.

ShapeCharset (SG)

The descriptor ShapeCharset specifies the charset of the output text when text is shaped; that is, when ActiveShapeEditing is true. If ActiveShapeEditing is not set (False), in other words, shape editing is a null operation, the ShapeCharset is guaranteed to match the codeset associated with the locale of the LayoutObject.

A charset is defined as "an encoding with a uniform, state-independent mapping from character to code points".

A ShapeCharset is a well known name associated with some type of presentation encoding usually used to identify the encoding of a font. Yet, a ShapeCharset need not be a font encoding but may be some intermediate encoding that can then be rendered to a specific font.

Note:
LayoutObject may be extended to provide an extended layout value, by which the individual glyph metrics may be passed into it.

Since the ShapeCharset is associated with a specific font or glyph encoding, when ActiveShapeEditing is True, the ShapeCharset may (but need not) be the same as the codeset of the locale associated with the LayoutObject.

Once chosen, the ShapeCharset is guaranteed to be of a uniform size and state independent, but the size of each ShapeCharset may vary (for example, 8, 16 or another number of bits) so applications should use the ShapeCharsetSize value when doing storage management.

ShapeCharsetSize (G)

The descriptor ShapeCharsetSize specifies the encoding size of the current ShapeCharset. This value may change when the ShapeCharset is changed. If ActiveShapeEditing is not set (False) the ShapeCharsetSize is set to the maximum code element size (in bytes) for the codeset of the locale for the LayoutObject.
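
As a sketch of the storage management mentioned above, the helper below computes the size of an output buffer from the number of code elements and the ShapeCharsetSize. It assumes the size is expressed in bytes per code element; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for n_elements output code elements, each
 * shape_charset_size bytes wide. Returns 0 if the product would
 * overflow size_t (and also, trivially, when n_elements is 0). */
size_t shaped_buffer_bytes(size_t n_elements, size_t shape_charset_size)
{
    assert(shape_charset_size > 0);
    if (n_elements > SIZE_MAX / shape_charset_size)
        return 0;
    return n_elements * shape_charset_size;
}
```

Because the ShapeCharsetSize may change when the ShapeCharset is changed, a caller would normally query it again before reusing a buffer sized this way.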

ShapeContextSize (G)

The ShapeContextSize specifies the size of the context (surrounding code elements) that must be accounted for when performing active shape editing. The ShapeContextSize is defined as a structure of type LayoutEditSize (see the discussion of LayoutEditSize in Type LayoutEditSize).

The ShapeContextSize value is guaranteed to remain unchanged for the life of the LayoutObject.

InOutTextDescrMask (SG)

This mask is set to tell the layout functions which text descriptors are initialised to valid values when either InOnlyTextDescr or OutOnlyTextDescr is set or queried. For example, if the InOutTextDescrMask is set to denote Orientation and TypeOfText, only these two descriptors are returned when the InOnlyTextDescr is queried. The values used in InOutTextDescrMask are a bitwise OR of one or more classification criteria.

The way in which these layout values are set is implementation dependent. By default, this descriptor is initialised to indicate that all the text descriptors are to be set and queried.
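
The masking idea can be sketched as follows. The bit assignments and names are hypothetical, chosen only for this example; the real values are implementation dependent and come from <sys/layout.h>.

```c
#include <assert.h>

/* Hypothetical bit assignments, one bit per descriptor group. */
enum {
    DESC_ORIENTATION = 1 << 0,
    DESC_CONTEXT     = 1 << 1,
    DESC_TYPEOFTEXT  = 1 << 2,
    DESC_IMPLICITALG = 1 << 3,
    DESC_SWAPPING    = 1 << 4,
    DESC_NUMERALS    = 1 << 5,
    DESC_TEXTSHAPING = 1 << 6,
    DESC_ALL         = (1 << 7) - 1     /* every group selected */
};

/* Nonzero if the given descriptor group is selected by the mask. */
int descriptor_selected(int mask, int group)
{
    assert(group != 0 && (group & ~DESC_ALL) == 0);
    return (mask & group) != 0;
}
```

A mask of DESC_ORIENTATION | DESC_TYPEOFTEXT corresponds to the example in the text, where only Orientation and TypeOfText are returned by a query; DESC_ALL corresponds to the default.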

InOnlyTextDescr (SG)

When this descriptor is set it indicates that the input values of the layout values denoted by the InOutTextDescrMask are set or retrieved when using m_setvalues_layout() or m_getvalues_layout() respectively. The way this value is set is implementation dependent.

OutOnlyTextDescr (SG)

When this layout value is set, it indicates that the output values of the layout values denoted by the InOutTextDescrMask are set or retrieved when using m_setvalues_layout() or m_getvalues_layout() respectively.

CheckMode (SG)

The CheckMode layout value indicates the level of checking of the elements in the InpBuf for shaping and reordering purposes. It also defines the behaviour of the implicit algorithm with respect to standalone neutral characters (until stabilised by a new strong character).

Possible values of CheckMode are:

MODE_STREAM

The string in the InpBuf is expected to have valid combinations of characters or character elements. No validation is needed before shaping and/or combined character cell determination. The only thing validated before the transformation is the current state of the layout object based on previous input data.

The reordering of bidirectional text will assign the nesting level of an unstabilised neutral character such that it follows the level of the previous strong character.

When MODE_STREAM is set, it is guaranteed that:

MODE_EDIT

The shaping of input text may vary depending on locale-specific validation or assumptions.

The reordering of bidirectional text will assign the nesting level of an unstabilised neutral character such that it follows the level of the global orientation.

When MODE_EDIT is set:

When no value is present, the default of CheckMode (for the C locale) is MODE_STREAM.
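
The difference between the two modes for unstabilised neutral characters reduces to a single choice, sketched below. The CheckModeDemo type and the function name are hypothetical, and nesting levels are represented as plain non-negative integers for the purpose of the example.

```c
#include <assert.h>

/* Hypothetical stand-in for the CheckMode values in <sys/layout.h>. */
typedef enum { CHECK_STREAM, CHECK_EDIT } CheckModeDemo;

/* Nesting level given to a neutral character that has not yet been
 * stabilised by a following strong character. */
int unstabilised_neutral_level(CheckModeDemo mode,
                               int prev_strong_level, int global_level)
{
    assert(prev_strong_level >= 0 && global_level >= 0);
    return (mode == CHECK_STREAM) ? prev_strong_level : global_level;
}
```

For example, a trailing space typed after a right-to-left word inside left-to-right text would stay with the right-to-left word in stream mode and follow the global orientation in edit mode.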

QueryValueSize (G)

The caller is responsible for allocating the memory for the layout values to be queried, and therefore needs to know the actual size of each layout value to be queried.

The name QueryValueSize is defined. This can be ORed with any other name. When m_getvalues_layout() detects that QueryValueSize is ORed with any name it returns the number of bytes needed to store the value, rather than the value itself. This is to avoid adding a parameter to m_getvalues_layout().

The following example illustrates the use of QueryValueSize:

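LayoutValueRec layout[2];
unsigned long Size;
int index;

/* hlo is a LayoutObject created earlier by m_create_layout() */
layout[0].name = QueryValueSize | ShapeCharset;
layout[0].value = &Size;
layout[1].name = 0;              /* a name of zero ends the array */
m_getvalues_layout(hlo, layout, &index);
/* Size should now contain the number of bytes needed */
/* to hold ShapeCharset */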


Layout Value Data Types

The following sections describe the data types used for some of the layout values. All layout values are defined in <sys/layout.h>, whose content is implementation dependent. In addition, the following layout values may be combined (logical OR) into a single type TextDescriptor:

Orientation
Context
TypeOfText
ImplicitAlg
Swapping
Numerals
TextShaping

The values of these layout values may also be combined (logical OR) into a single attribute. The layout value AllTextDescriptor can be used to indicate that all LayoutTextDescriptor types are set (see Type LayoutTextDescriptor).

Type LayoutValues
Layout values are defined using the LayoutValues data type, which is a pointer to the LayoutValueRec data structure:

#include <sys/layout.h>

typedef struct{
   LayoutId    name;   /* int - the id of the layout value */
   LayoutValue value;  /* void* - Data of layout value item */
   } LayoutValueRec, *LayoutValues;

The name element denotes the layout value to be set and the value element contains the data to be set. The LayoutValue data type is a C-language type large enough to contain the following: char*, long, int*, or a pointer to a function. The end of the array is indicated by a name of value zero (0).
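
The zero-terminated array convention can be sketched as follows. The typedefs are local stand-ins that mirror the declarations shown above so that the example compiles without <sys/layout.h>; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins mirroring the declarations in <sys/layout.h>. */
typedef int   LayoutId;
typedef void *LayoutValue;
typedef struct {
    LayoutId    name;
    LayoutValue value;
} LayoutValueRec, *LayoutValues;

/* Count the entries that precede the terminating record, which is
 * the first record whose name is zero. */
size_t count_layout_values(const LayoutValueRec *values)
{
    assert(values != NULL);
    size_t n = 0;
    while (values[n].name != 0)
        n++;
    return n;
}
```

A caller that forgets the terminating record makes such a loop run past the end of the array, which is why the array in the QueryValueSize example ends with a name of zero.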

The m_setvalues_layout() function is a convenient way to set the two members of the LayoutValueRec structure. This function is usually specified in a stylised manner to minimise the probability of making a mistake. For further information see .

Type LayoutTextDescriptor
The LayoutTextDescriptor type is used to identify the attributes of source and target text:
#include <sys/layout.h>
typedef int LayoutDesc;
typedef struct{
   LayoutDesc inp;    /* Input buffer description */
   LayoutDesc out;    /* Output buffer description */
   } LayoutTextDescriptorRec, *LayoutTextDescriptor;

The inp and out values are combinations of the appropriate descriptor items. Each of the descriptors is specified as a combination of one value from each of the following groups - a value for the input descriptor and a corresponding value for the output descriptor.
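
One way to picture "one value from each group" is to give every group its own bit field inside the descriptor, as in the sketch below. The field layout, the constants and the DemoTextDescriptor type are hypothetical and chosen for illustration only; actual encodings are implementation dependent.

```c
#include <assert.h>

typedef int LayoutDesc;

/* Local stand-in for the input/output descriptor pair. */
typedef struct {
    LayoutDesc inp;                     /* input buffer description */
    LayoutDesc out;                     /* output buffer description */
} DemoTextDescriptor;

/* Hypothetical encoding: each group occupies its own 4-bit field. */
enum {
    ORIENT_MASK = 0x000F, ORIENT_LTR = 0x0001, ORIENT_RTL = 0x0002,
    TYPE_MASK   = 0x00F0, TYPE_VISUAL = 0x0010, TYPE_IMPLICIT = 0x0020
};

/* Replace the value of one group in d, leaving other groups untouched. */
LayoutDesc set_group(LayoutDesc d, int group_mask, int value)
{
    assert((value & ~group_mask) == 0); /* value must belong to the group */
    return (d & ~group_mask) | value;
}
```

Describing a transformation from implicit right-to-left input to visual right-to-left output then amounts to setting TYPE_IMPLICIT in inp and TYPE_VISUAL in out while both carry ORIENT_RTL.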

Type LayoutEditSize
The LayoutEditSize structure defines the number of surrounding code elements that need to be considered when performing edit shaping:
typedef struct{
   int front; /* number of code elements in front of the */
              /* edit position in logical order */
   int back;  /* number of code elements following the */
              /* edit position in logical order */
   } LayoutEditSizeRec, *LayoutEditSize;

When a substring is inserted into a string, the front and back elements define the number of code elements in front of the substring and the number of code elements after the substring respectively that need to be considered when performing edit shaping. The total number of code elements needed to be viewed is:
total # of code elements =(front + # code elements in substring + back)

If both front and back elements are set to zero, no additional context needs to be considered for edit shaping. When ActiveShapeEditing is not set (False), the front and back are guaranteed to be zero.
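
The formula above can be turned into a small helper that finds which code elements to re-examine after an insertion. This is a sketch under stated assumptions: EditSizeDemo is a local stand-in for the record, the function name is hypothetical, and the window is clamped to the bounds of the string, a detail the text does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in mirroring the front/back members of LayoutEditSizeRec. */
typedef struct {
    int front;
    int back;
} EditSizeDemo;

/* Compute the half-open range [*start, *end) of code elements to examine
 * when sub_len elements were inserted at index pos of a string that now
 * holds total_len elements. The range is clamped to the string bounds. */
void edit_window(EditSizeDemo ctx, size_t pos, size_t sub_len,
                 size_t total_len, size_t *start, size_t *end)
{
    assert(ctx.front >= 0 && ctx.back >= 0);
    assert(start != NULL && end != NULL);
    assert(pos + sub_len <= total_len);
    size_t front = (size_t)ctx.front;
    size_t back = (size_t)ctx.back;
    *start = (pos > front) ? pos - front : 0;
    *end = (total_len - (pos + sub_len) > back) ? pos + sub_len + back
                                                : total_len;
}
```

With front = 2 and back = 1, inserting three elements at index 5 of a 20-element string gives the range [3, 9), that is 2 + 3 + 1 = 6 elements, in line with the formula.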

Layout Modifiers

Layout modifiers are layout values in string form.

Each LayoutObject consists of several different layout values that are specified in Layout Values and are initialised at the time the LayoutObject is created by the m_create_layout() function. Yet users may wish to announce an initial layout value that may be different from the default layout value associated with a locale. Thus, the m_create_layout() function supports a modifier argument that allows the user's default layout values to be passed in a string form. The m_create_layout() function supports a grammar for the specification of layout values in string form.

The following symbols are used in the proposed grammar for layout modifier strings:

Character   Description
,           Comma
-           Hyphen
/           Solidus (Slash)
;           Semi-colon
=           Equal sign
_           Low line (Underscore)

The following strings are used as prefixes within the grammar definition to mean:

inout_
means the value is to be used for both in and out layout values

in_
means the value is to be used as an in layout value

out_
means the value is to be used as an out layout value.

The proposed grammar is as follows:

LSmodifier_string  : '@ls' layout
                   ;

layout             : layout ',' layout_values
                   | layout_values
                   ;

layout_values      : orientation
                   | context
                   | typeoftext
                   | implicitalg
                   | swapping
                   | numerals
                   | shaping
                   | checkmode
                   | shapcharset
                   ;

orientation        : 'orientation=' inout_orient_value
                   | 'orientation=' in_orient_value ':' out_orient_value
                   ;

inout_orient_value : orient_value
                   ;

in_orient_value    : orient_value
                   ;

out_orient_value   : orient_value
                   ;

orient_value       : 'ltr' | 'rtl' | 'ttblr' | 'ttbrl' | 'contextual'
                   ;

context            : 'context=' inout_context_value
                   | 'context=' in_context_value ':' out_context_value
                   ;

inout_context_value: context_value
                   ;

in_context_value   : context_value
                   ;

out_context_value  : context_value
                   ;

context_value      : 'ltr' | 'rtl'
                   ;

typeoftext         : 'typeoftext=' inout_text_value
                   | 'typeoftext=' in_text_value ':' out_text_value
                   ;

inout_text_value   : text_value
                   ;

in_text_value      : text_value
                   ;

out_text_value     : text_value
                   ;

text_value         : 'visual' | 'implicit' | 'explicit'
                   ;

implicitalg        : 'implicitalg=' inout_algor_value
                   | 'implicitalg=' in_algor_value ':' out_algor_value
                   ;

inout_algor_value  : algor_value
                   ;

in_algor_value     : algor_value
                   ;

out_algor_value    : algor_value
                   ;

algor_value        : 'basic' | 'implicit'
                   ;

swapping           : 'swapping=' inout_swap_value
                   | 'swapping=' in_swap_value ':' out_swap_value
                   ;

inout_swap_value   : swap_value
                   ;

in_swap_value      : swap_value
                   ;

out_swap_value     : swap_value
                   ;

swap_value         : 'yes' | 'no'
                   ;

numerals           : 'numerals=' inout_num_value
                   | 'numerals=' in_num_value ':' out_num_value
                   ;

inout_num_value    : num_value
                   ;

in_num_value       : num_value
                   ;

out_num_value      : num_value
                   ;

num_value          : 'nominal' | 'national' | 'contextual'
                   ;

shaping            : 'shaping=' inout_shap_value
                   | 'shaping=' in_shap_value ':' out_shap_value
                   ;

inout_shap_value   : shap_value
                   ;

in_shap_value      : shap_value
                   ;

out_shap_value     : shap_value
                   ;

shap_value         : 'shaped' | 'nominal' | 'shform1' | 'shform2'
                   | 'shform3' | 'shform4'
                   ;

checkmode          : 'checkmode=' mode_value
                   ;

mode_value         : 'stream'| 'edit'
                   ;

shapcharset        : 'shapcharset=' charset_name
                   ;

charset_name       : char_list number
                   | number char_list
                   | char_list
                   | number
                   ;

char_list          : char_list char
                   | char
                   ;

char               : 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H'
                   | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P'
                   | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X'
                   | 'Y' | 'Z'
                   | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h'
                   | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p'
                   | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x'
                   | 'y' | 'z'
                   | '!' | '%' | '(' | ')' | '*' | '+' | '-' | '.'
                   | '_' | '?' |
                   ;

number             : number digit
                   | digit
                   ;

digit              : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7'
                   | '8' | '9'
                   ;

The grammar can be adapted by the implementations to suit their particular needs and their possible extensions to the layout values.

It is expected that higher-level services will use the grammar above to provide users with customisation when running applications.
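
A higher-level service might split a modifier string along the lines of the sketch below. It is an illustration, not the m_create_layout() parser: the ModifierItem type and function names are hypothetical, blanks around items are tolerated because the examples in this chapter contain them, and the values are not checked against the keywords allowed by the grammar.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define FIELD_MAX 32

typedef struct {
    char name[FIELD_MAX];   /* for example "orientation" */
    char in[FIELD_MAX];     /* input value */
    char out[FIELD_MAX];    /* output value (copy of in when no ':') */
} ModifierItem;

/* Copy [s, e) into dst, trimming surrounding blanks. Returns 0 on success. */
static int copy_field(char *dst, const char *s, const char *e)
{
    while (s < e && isspace((unsigned char)*s))
        s++;
    while (e > s && isspace((unsigned char)e[-1]))
        e--;
    size_t n = (size_t)(e - s);
    if (n == 0 || n >= FIELD_MAX)
        return -1;
    memcpy(dst, s, n);
    dst[n] = '\0';
    return 0;
}

/* Parse "@ls name=value[:value],..." into at most max items.
 * Returns the number of items parsed, or -1 on a syntax error. */
int parse_ls_modifier(const char *str, ModifierItem *items, int max)
{
    assert(str != NULL && items != NULL);
    if (strncmp(str, "@ls", 3) != 0)
        return -1;
    const char *p = str + 3;
    int count = 0;
    while (*p != '\0') {
        const char *end = strchr(p, ',');
        if (end == NULL)
            end = p + strlen(p);
        const char *eq = memchr(p, '=', (size_t)(end - p));
        if (eq == NULL || count >= max)
            return -1;
        const char *colon = memchr(eq + 1, ':', (size_t)(end - (eq + 1)));
        ModifierItem *it = &items[count];
        if (copy_field(it->name, p, eq) != 0)
            return -1;
        if (colon != NULL) {
            if (copy_field(it->in, eq + 1, colon) != 0 ||
                copy_field(it->out, colon + 1, end) != 0)
                return -1;
        } else {
            if (copy_field(it->in, eq + 1, end) != 0)
                return -1;
            strcpy(it->out, it->in);    /* inout form: same value both ways */
        }
        count++;
        if (*end == ',') {
            p = end + 1;
            if (*p == '\0')
                return -1;              /* trailing comma */
        } else {
            p = end;
        }
    }
    return (count > 0) ? count : -1;
}
```

Parsing "@ls orientation=rtl:ltr, swapping=yes" with this sketch yields two items: orientation with in "rtl" and out "ltr", and swapping with "yes" for both directions.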

Examples

  1. The Motif text widget may support layout modifiers by means of a resource. If the default for the Hebrew locale is LTR, the following changes it:
    layoutDirection : right_to_left @ls swapping=yes, numerals=national
    
    
    The key is that the Motif resource value says that right_to_left is the orientation but says nothing about swapping or numerals. The @ls modifier clarifies this.

  2. A string could be embedded in a Help repository (such as CDE's help volumes) that describes the layout values to be associated with the text in the help volume. When presented, the help layout values would be passed to the Text widget (see above) in the layoutDirection resource. The following is an example of such a string:
    @ls typeoftext=visual, orientation=rtl, swapping=no,
            numerals=nominal, shapcharset=iso8859-8
    
    

    Note that while this helps speed up presentation of the help text, searches of the help text cannot be made using logical-ordered text (which is the default when entered in an input field). This is because the text (type=visual) has been previously shaped and reordered and thus any text searches need some other processing to account for this.


Footnotes

1.
This example is based upon the control codes required in the ISO/IEC 6429 standard for handling bidirectional text as published in the ECMA TR/53 standard. This list is given for illustration only. There is no implied connection between the embedded controls and a specific encoding scheme. The encoding of the above controls depends on the codeset associated with the LayoutObject. Any ASCII-based encoding uses the ISO/IEC 6429 standard escape sequence definitions.

