NLA - Functions needed

NLA - Functions needed for Support of National Languages

Functions to support the NLA in the area of language are:

Compare strings
Search for a string
Sort a list of strings
Merge two lists
Set and query NLA attributes
Translate key
Reverse key translation

The functions «Translate Key» and «Reverse Key Translation» depend on the implementation of the NLA. These functions are needed, although they may not be manifest in distinct modules. The relation to the proposed «Data Type Key» is used only as an example to explain the functions.

Language is a global variable for Sort, Search and Compare operations. It need not be an attribute of every string (see Implementation considerations).

Comparison of Strings

In the comparison of text not only are exact matches relevant, but also «fuzzy» matches based on linguistic properties of the text. These functions are especially needed in data base applications for searching and sorting. The parameters indicate the type of compare function which is required.

string1

Input: Reference string

string2

Input: String to be compared to the reference string

precision

Input: Level of similarity required:

1: Equality only if exact match (string1= 'côté', string2= 'côté')
2: Quasi match if case and accents are right, but special characters are not (string1= 'vice-president', string2= 'vicepresident' or: string1= 'c'est ... dire', string2= 'c'est...dire' will match)
3: Quasi matches if mixed upper case and lower case (with accents) (string1= 'CÔTÉ', string2= 'Côté' will match)
4: Quasi matches if mixed accented and unaccented (string1= 'COTE', string2= 'Côté' will match)
5: Fuzzy matches needed (based on phonetic rules). Examples of fuzzy matches are:

string1= 'MÜLLER',  string2= 'Mueller'
string1= 'Århus',   string2= 'Aarhus'
string1= 'STRASSE', string2= 'Straße'

relation

Output: Result is one of the following:

match: according to the desired precision
less than: according to the desired precision (for example: côte < coté; cote < côté; COTE < côte; coté < coter)
greater than: according to the desired precision (for example: coter > coté)
undetermined: for special non-alphabetic characters
no match

Note:

If quasi matches or fuzzy matches are requested, a match at that precision has priority over a «less than» or «greater than» result.

There is no end to fuzziness - it is restricted only by the implementation, which relies on a precisely stated set of phonetic rules for each language / country combination.
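The five precision levels can be sketched as successive folding steps applied to both strings before they are compared. This is a minimal illustration, not a definitive implementation: the names fold, compare and FUZZY are hypothetical, the small expansion table stands in for a full set of phonetic rules, and only the «match» and «no match» relations are returned because a linguistically correct «less than» or «greater than» needs a sort rule.

```python
import unicodedata

# Hypothetical phonetic expansions for level 5; a real table would be
# defined per language / country combination.
FUZZY = {"ä": "ae", "ö": "oe", "ü": "ue", "å": "aa", "ß": "ss"}


def fold(text, precision):
    """Reduce text to the form in which two strings are compared."""
    text = unicodedata.normalize("NFC", text)
    if precision >= 2:  # ignore special characters (hyphens, blanks, ...)
        text = "".join(ch for ch in text if ch.isalnum())
    if precision >= 3:  # ignore case
        text = text.lower()
    if precision >= 5:  # phonetic expansions, applied before accents are dropped
        text = "".join(FUZZY.get(ch, ch) for ch in text)
    if precision >= 4:  # ignore accents
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return text


def compare(string1, string2, precision=1):
    """Return 'match' or 'no match' at the requested precision."""
    if fold(string1, precision) == fold(string2, precision):
        return "match"
    return "no match"
```

With this sketch, compare('COTE', 'Côté', 4) and compare('MÜLLER', 'Mueller', 5) both report a match, while the same pairs fail at precision 3 and 4 respectively.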

Search

A function similar to the comparison function is required for searching. The parameters indicate the type of search function which is required.

pattern

Input: Pattern to be located

string

Input: String which is to be searched for pattern.

precision

Input: same as precision in «Comparison of Strings»:

1: Equality only if exact match
2: Quasi match if case and accents are right, but special characters are not
3: Quasi matches if mixed upper case and lower case (with accents)
4: Quasi matches if mixed accented and unaccented
5: Fuzzy matches needed (based on phonetic rules)

numvar

Output: Returns the numeric offset of the pattern within the string.
If the pattern is not found, the function returns -1.
For example, if string="côtelette" and pattern="COTE":
if a quasi match is accepted, numvar=0 is returned;
otherwise, numvar=-1 is returned.

length

Output: Returns the length of the matched substring. For fuzzy matches this may differ from the length of the pattern.
For example, if pattern = Müller and string = Klaus Mueller, a fuzzy match is accepted: numvar=6 and length=7 are returned. In this example the matched substring is longer than the pattern.

Note:

The programming language specification must state the index used for the first character (0 or 1). The examples above count from 0.
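A search that returns both numvar and length has to remember which character of the original string each folded character came from, because folding can remove or expand characters. The following is a rough sketch under stated assumptions: fold_char, search and FUZZY are hypothetical names, offsets count from 0 in the NFC-normalized string, and a length of 0 accompanies the -1 result (the text above does not specify the length for a failed search).

```python
import unicodedata

# Hypothetical phonetic expansions for precision 5.
FUZZY = {"ä": "ae", "ö": "oe", "ü": "ue", "å": "aa", "ß": "ss"}


def fold_char(ch, precision):
    """Fold one character; the result may be empty or longer than one character."""
    if precision >= 2 and not ch.isalnum():
        return ""
    if precision >= 3:
        ch = ch.lower()
    if precision >= 5:
        ch = "".join(FUZZY.get(c, c) for c in ch)
    if precision >= 4:
        decomposed = unicodedata.normalize("NFD", ch)
        ch = "".join(c for c in decomposed if not unicodedata.combining(c))
    return ch


def search(pattern, string, precision=1):
    """Return (numvar, length) of the first match, or (-1, 0) if none."""
    string = unicodedata.normalize("NFC", string)
    pattern = unicodedata.normalize("NFC", pattern)
    folded, origin = [], []  # origin[i] = index in string that produced folded[i]
    for index, ch in enumerate(string):
        for c in fold_char(ch, precision):
            folded.append(c)
            origin.append(index)
    needle = "".join(fold_char(ch, precision) for ch in pattern)
    if not needle:
        return -1, 0
    pos = "".join(folded).find(needle)
    if pos < 0:
        return -1, 0
    start = origin[pos]
    end = origin[pos + len(needle) - 1] + 1
    return start, end - start
```

For the examples in this section, search("COTE", "côtelette", 4) yields (0, 4) and search("Müller", "Klaus Mueller", 5) yields (6, 7). A match that begins in the middle of an expanded character is not handled specially in this sketch.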

Sorting

A function similar to the comparison function is required for sorting. The function needed is explained here by the parameters.

list1

Input: List to be sorted.

list2

Output: Sorted list. Output must follow NLA rules. An example of such a sorted list in French:

cote
Cote 
COTE 
côte
coté 
côté 
coter

sort rule

Both predefined (e.g. Swedish, German, Swiss-German) and installation-specified order must be possible. See also Hierarchy of defaults.

Please note the subtle rules for French, where accents are ignored except to distinguish homographs, as in the example above. There, discrimination starts from the last character and moves left (toward the first character), with priority depending on the accents. Each language uses different rules [CS-Z243.4.1, IBM GG24-3516, Wingen-2].

Although this paper presents a 4-key sort method (see sort key), the NLA must not restrict the sort method to four keys. Other languages and writing systems may need more keys. A universal sort algorithm for languages based on the Latin alphabet must support:

Multiple sort passes (also called sort keys)
Substitution of digraphs by one «sort element» (for example 'ch' in Spanish, equal value of 'Å' and 'aa' in Danish).
Substitution of arbitrary character sequences by other ones (for example for sorting 'Saint' and 'St.' at the same place; or to remove arbitrary characters from the input string before sorting).
Direction of comparison (front to end, end to front of string) must be definable for each sort pass or key (for French the discrimination of accents starts from the last character and moves to the left).
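The digraph requirement in the list above can be illustrated with a small tokenizer that turns a word into a sequence of sort elements. This is only a sketch: ELEMENTS, WEIGHT and sort_elements are hypothetical names, and the table assumes that 'ch' is one element placed between 'c' and 'd'; accents, 'Å'/'aa' equivalence and sequence substitution such as 'Saint'/'St.' are left out.

```python
import string

# Hypothetical element table: the Latin letters with 'ch' inserted as a
# single sort element between 'c' and 'd'.
ELEMENTS = list("abc") + ["ch"] + list(string.ascii_lowercase[3:])
WEIGHT = {element: rank for rank, element in enumerate(ELEMENTS)}
UNKNOWN = len(ELEMENTS)  # characters outside the table sort last


def sort_elements(word):
    """Translate a word into a list of sort-element weights."""
    word = word.lower()
    weights, i = [], 0
    while i < len(word):
        if word.startswith("ch", i):  # the digraph counts as one element
            weights.append(WEIGHT["ch"])
            i += 2
        else:
            weights.append(WEIGHT.get(word[i], UNKNOWN))
            i += 1
    return weights


words = ["dama", "chico", "cura", "casa"]
print(sorted(words, key=sort_elements))  # ['casa', 'cura', 'chico', 'dama']
```

A plain code-point sort would place 'chico' before 'cura'; with the digraph treated as one element it follows every word that begins with a plain 'c'.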

It is necessary to specify these functions in the definition of the sort scheme. We encourage implementors of the NLA to follow the POSIX-model of setlocale [POSIX-2].

A key construction algorithm (like that of Alain LaBonté in [CS-Z243.4]) is not the same thing as a sort algorithm (like those found in [Knuth]). The merit of the key construction method is that it computes a single key (constructed from parts), which allows for single-pass sorting instead of the multiple passes used by other methods.
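To make the idea of a constructed key concrete, the sketch below builds one composite key per string from three parts: base letters, accents read from the end of the string, and case. It is not the algorithm of [CS-Z243.4]; the name french_sort_key is hypothetical, accent weights are simply the code points of the combining marks, and the fourth key (special characters) is omitted.

```python
import unicodedata


def french_sort_key(text):
    """Build one composite key: (base letters, accents from the end, case)."""
    base, accents, case = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and base:
            accents[-1] += (ord(ch),)  # attach the mark to the preceding letter
        else:
            base.append(ch.lower())
            accents.append(())
            case.append(ch.isupper())  # lower case sorts before upper case
    # Accents are compared from the last character toward the first.
    return "".join(base), tuple(reversed(accents)), tuple(case)


words = ["coter", "côté", "COTE", "coté", "Cote", "côte", "cote"]
print(sorted(words, key=french_sort_key))
# ['cote', 'Cote', 'COTE', 'côte', 'coté', 'côté', 'coter']
```

Because the whole ordering is encoded in the key, a single call to an ordinary sort routine reproduces the French example list given earlier in this section.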

Merge

The purpose of this function is to combine previously sorted lists into one sorted list. Hence the same rules must be applied here as in the sort function.

list1: Input: First list to be merged
list2: Input: Second list to be merged
list3: Output: Merged list (result of the merging of list1 and list2)
sort rule: Both predefined (e.g. German, Swiss-German) and installation-specified order must be possible.

Note:

Output is the same as in the previous function (sort). The algorithm must assume (or verify) that the specified sort rule was applied to both input lists. If either list1 or list2 is not sorted according to the sort rule (or default) for the output, list3, then the algorithm must copy and re-sort that list using the sort function defined in «Sorting» before performing the merge.
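The verify-then-merge behaviour described in the note might look as follows. This is a sketch under assumptions: merge and is_sorted are hypothetical names, and the sort rule is passed as a key function (str.lower is used below only as a stand-in for a real NLA sort rule).

```python
import heapq


def is_sorted(items, sort_rule):
    """Check whether items already follow the given sort rule."""
    keys = [sort_rule(item) for item in items]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def merge(list1, list2, sort_rule):
    """Merge two lists into list3; re-sort a copy of any unsorted input first."""
    inputs = []
    for items in (list1, list2):
        if is_sorted(items, sort_rule):
            inputs.append(items)
        else:
            inputs.append(sorted(items, key=sort_rule))  # copy, input untouched
    return list(heapq.merge(*inputs, key=sort_rule))


list3 = merge(["Cherry", "apple"], ["banana", "Date"], str.lower)
print(list3)  # ['apple', 'banana', 'Cherry', 'Date']
```

The first input list in the usage example is deliberately unsorted, so the function re-sorts a copy before merging and leaves the caller's list unchanged.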

Set and Query NLA Attributes

The architecture must provide mechanisms to set the NLA attributes used for processing, both locally and globally, and to query these attributes. IBM-supplied software must exploit these attributes.

Some files will require multiple NLA attributes. The NLA must have mechanisms for changing attributes in a file and for processing those changes. In addition, the NLA must allow the attributes to change during an interactive session.

An application must be able to query the current set of attributes and to define a new one for later processing. The identification of these attributes should not introduce new coding schemes. When needed, it must be possible to display the attributes in the language of the person using the computer. Therefore, attribute values must be independent of the natural language.

It may be desirable to identify a set of national language attributes with an ID number, because the direct association of attributes with the data may be imprecise in cases like sort schemes or conversion rules. These IDs must be published in a «registry».

Also, query functions must be able to «extract» specific attributes from the identification tag of the data. The following model illustrates this:

NLQUERY (what, result )

In this model, what is a keyword (depending on the programming language binding, possibly contained in a character string), for example CODEPAGE. result must be in a form suitable for use in applications, in most cases a character string using only the (see ... syntactic character set. This model of a function call avoids a proliferation of function names.
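One possible binding of this model is sketched below, with local settings shadowing global ones. Everything here is an assumption made for illustration: the function names nlquery and nlset, the attribute names other than CODEPAGE, the sample values, and the choice to return result as the function value instead of through a second parameter.

```python
from collections import ChainMap

# Hypothetical attribute names and values, for illustration only.
GLOBAL_ATTRIBUTES = {"CODEPAGE": "850", "LANGUAGE": "FR", "SORTRULE": "FRENCH"}
LOCAL_ATTRIBUTES = {}

# Lookups consult the local settings first, then the global ones.
_attributes = ChainMap(LOCAL_ATTRIBUTES, GLOBAL_ATTRIBUTES)


def nlset(what, value, local=True):
    """Set one attribute, either locally or globally."""
    target = LOCAL_ATTRIBUTES if local else GLOBAL_ATTRIBUTES
    target[what.upper()] = str(value)


def nlquery(what):
    """Return the current value of one attribute as a character string."""
    return _attributes[what.upper()]


nlset("CODEPAGE", 437)      # local override
print(nlquery("CODEPAGE"))  # 437
```

A single query function with a keyword argument keeps the interface small: supporting a new attribute means adding a keyword, not a new function name.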

Key Translation

Depending on the implementation of the NLA, this function is fundamental for all kinds of sorts and data base update and retrieval.

input: String to be converted (e.g. Côté). This is the representation used in presentation services.
output: Set of keys used for example for comparison in sorts. This can be seen as the «internal form» of strings.

This function describes a translation between an external form (the file) and an internal representation of the same information, which is intended for processing. Whether this translation is applied to data files directly (the sort keys are stored instead of the original text) or exists only while the text is being processed is left to the implementor.

For some applications, like data base applications, it may be desirable to have the data ready in processable form. For other applications it may be more economical to «convert on the fly».

«Data Type Key», as described in section Implementation Considerations, holds all the information of the original string, but in a form better suited to internal processing.

Reverse Key Translation

This function translates from internal (process dependent) format of textual data to the string format used in presentation services.

input: Text string using 'internal representation' (as an example see Data Type Key).
output: Rebuilt data.
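The pair of functions can be illustrated with a key that keeps base letters, accents and case in separate parts, so that nothing from the original string is lost. The layout below is a hypothetical stand-in, not the «Data Type Key» format itself, and the names translate_key and reverse_key are assumptions; the round trip is only expected to hold for ordinary Latin-script text.

```python
import unicodedata


def translate_key(text):
    """Split text into parallel parts: base letters, accent marks, case flags."""
    base, marks, upper = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and base:
            marks[-1] += ch  # keep the mark with the letter it modifies
        else:
            base.append(ch.lower())
            marks.append("")
            upper.append(ch.isupper())
    return tuple(base), tuple(marks), tuple(upper)


def reverse_key(key):
    """Rebuild the presentation string from the internal form."""
    base, marks, upper = key
    chars = [(b.upper() if u else b) + m for b, m, u in zip(base, marks, upper)]
    return unicodedata.normalize("NFC", "".join(chars))


key = translate_key("Côté")
print(key[0])            # ('c', 'o', 't', 'e')
print(reverse_key(key))  # Côté
```

Comparison and sorting can work on the parts of the key, while reverse_key restores the string for presentation services.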

URL:	Created: 1996-12-28	Updated:
© Docu+Design Daube, Zürich