
NLA - Functions needed for Support of National Languages

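Functions to support the NLA in the area of language are: Comparison of Strings, Search, Sorting, Merge, Set and Query NLA Attributes, Key Translation and Reverse Key Translation.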

The functions «Translate Key» and «Reverse Key Translation» depend on the implementation of the NLA. They are needed, although they may not be manifest as distinct modules. The relation to the proposed «Data Type Key» serves only as an example to explain the functions.

Language is a global variable for Sort, Search and Compare operations. It need not be an attribute of every string (see Implementation considerations).

Comparison of Strings

In the comparison of text not only are exact matches relevant, but also «fuzzy» matches based on linguistic properties of the text. These functions are especially needed in data base applications for searching and sorting. The parameters indicate the type of compare function which is required.

string1
Input: Reference string
string2
Input: String to be compared to the reference string
precision
Input: Level of similarity required:
  1. Equality only if exact match (string1= 'côté', string2= 'côté')
  2. Quasi match if case and accents are right, but special characters are not (string1= 'vice-president', string2= 'vicepresident' or: string1= 'c'est ... dire' , string2= 'c'est...dire' will match)
  3. Quasi matches if mixed upper case and lower case (with accents) (string1= 'CÔTÉ', string2= 'Côté' will match)
  4. Quasi matches if mixed accented and unaccented (string1= 'COTE', string2= 'Côté' will match)
  5. Fuzzy matches needed (based on phonetic rules). Examples for fuzzy matches are:
    string1= 'MÜLLER',  string2= 'Mueller'
    string1= 'Århus',   string2= 'Aarhus'
    string1= 'STRASSE', string2= 'Straße'
relation
Output: Result is one of the following:
match
according to the desired precision
less than
according to the desired precision (for example: côte < coté; cote < côté; COTE < côte; coté < coter)
greater than
according to the desired precision (for example: coter > coté)
undetermined
for special non-alphabetic characters
no match
Note: If quasi matches or fuzzy matches are used, such a match has priority over a «less than» or «greater than» result.

There is no end to fuzziness; it is restricted only by the implementation, which relies on a precisely stated set of phonetic rules for each language / country combination.
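The precision levels above can be sketched as successive folding steps applied to both strings before they are tested for equality. This is a minimal sketch, not the NLA's method: it assumes the levels are cumulative, the names `fold`, `matches` and `PHONETIC` are hypothetical, and the small phonetic table stands in for a real per-language rule set. It only decides match or no match; the «less than» and «greater than» results need language-specific ordering rules and are not covered.

```python
import unicodedata

# Hypothetical phonetic expansions for level 5; a real rule set would be
# stated per language / country combination.
PHONETIC = {"ü": "ue", "ö": "oe", "ä": "ae", "å": "aa", "ß": "ss"}


def fold(text, precision):
    """Reduce text to the form that is compared at the given precision."""
    text = unicodedata.normalize("NFC", text)
    if precision >= 2:  # ignore special (non-alphanumeric) characters
        text = "".join(ch for ch in text if ch.isalnum())
    if precision >= 3:  # ignore case
        text = text.lower()
    if precision >= 5:  # phonetic expansion, done before accents are dropped
        text = "".join(PHONETIC.get(ch, ch) for ch in text)
    if precision >= 4:  # ignore accents
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return text


def matches(string1, string2, precision):
    """Return True if the two strings match at the requested precision."""
    return fold(string1, precision) == fold(string2, precision)
```

With this sketch, `matches('COTE', 'Côté', 4)` holds while `matches('COTE', 'Côté', 3)` does not, because accents are only ignored from level 4 onward.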

Search

A function similar to the comparison function is required for searching. The parameters indicate the type of search function which is required.

pattern
Input: Pattern to be located
string
Input: String which is to be searched for pattern.
precision
Input: Same as precision in «Comparison of Strings»:
  1. Equality only if exact match
  2. Quasi match if case and accents are right, but special characters are not
  3. Quasi matches if mixed upper case and lower case (with accents)
  4. Quasi matches if mixed accented and unaccented
  5. Fuzzy matches needed (based on phonetic rules)
numvar
Output: Returns the numeric offset of the pattern within the string.
If not found, the function returns -1.
For example, if string="côtelette" and pattern="COTE":
if a quasi match is accepted, numvar=0 is returned;
else, numvar=-1 is returned.
length
Output: Returns length of substring delimited. In case of fuzzy matches this may be different from implied length of pattern.
For example, if pattern = Müller and string = Klaus Mueller, and a fuzzy match is accepted, numvar=6 and length=7 will be returned. In this example the substring found is longer than the pattern searched for.
Note: The programming language specification must describe the index used for the first character (zero or 1).
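The relation between numvar and length can be illustrated with a small sketch. It assumes zero-based offsets, cumulative precision levels and a tiny hypothetical phonetic table; the names `fold_char` and `search` are not part of the NLA. Each character of the string is folded separately and remembers its original position, so that a hit in the folded text can be mapped back to an offset and a length in the original string.

```python
import unicodedata

# Hypothetical phonetic expansions for level 5 (illustration only).
PHONETIC = {"ü": "ue", "ö": "oe", "ä": "ae", "å": "aa", "ß": "ss"}


def fold_char(ch, precision):
    """Fold one character; the result may be empty or longer than one character."""
    if precision >= 2 and not ch.isalnum():
        return ""  # special characters are ignored
    if precision >= 3:
        ch = ch.lower()
    if precision >= 5:
        ch = "".join(PHONETIC.get(c, c) for c in ch)
    if precision >= 4:
        decomposed = unicodedata.normalize("NFD", ch)
        ch = "".join(c for c in decomposed if not unicodedata.combining(c))
    return ch


def search(pattern, string, precision):
    """Return (numvar, length); numvar is a zero-based offset, or -1 if not found."""
    pattern = unicodedata.normalize("NFC", pattern)
    string = unicodedata.normalize("NFC", string)
    folded_pattern = "".join(fold_char(ch, precision) for ch in pattern)
    folded, origin = [], []  # origin[i] is the index in string of folded[i]
    for index, ch in enumerate(string):
        for c in fold_char(ch, precision):
            folded.append(c)
            origin.append(index)
    pos = "".join(folded).find(folded_pattern) if folded_pattern else -1
    if pos < 0:
        return -1, 0
    start = origin[pos]
    end = origin[pos + len(folded_pattern) - 1]
    return start, end - start + 1
```

Offsets refer to the NFC-normalized string. A production version would also have to reject hits that begin or end in the middle of an expanded character.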

Sorting

A function similar to the comparison function is required for sorting. The function needed is explained here by the parameters.

list1
Input: List to be sorted.
list2
Output: Sorted list. Output must follow NLA rules. An example of such a sorted list in French:
cote
Cote 
COTE 
côte
coté 
côté 
coter
sort rule
Both predefined (e.g. Swedish, German, Swiss-German) and installation-specified order must be possible. See also Hierarchy of defaults.

Please note the subtle rules for French: accents are not taken into account except for homographs, as in the above example, in which discrimination starts from the last character and moves left (to the first character), with priority depending on accents. Each language uses different rules [CS-Z243.4.1, IBM GG24-3516, Wingen-2].

Although this paper presents a 4-key sort method (see sort key), the NLA must not restrict the sort method to four keys. Other languages and writing systems may need more keys. A universal sort algorithm for languages based on the Latin alphabet must support:

It is necessary to specify these functions in the definition of the sort scheme. We encourage implementors of the NLA to follow the POSIX model of setlocale [POSIX-2].

A key construction algorithm (like that of Alain LaBonté in [CS-Z243.4]) is not the same thing as a sort algorithm (like those found in [Knuth]). The merit of the key construction method is that it computes a single key (constructed from parts), which allows single-pass sorting instead of the multiple passes used by other methods.
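The idea of a single key constructed from parts can be sketched as follows. This is a simplified illustration and not the algorithm of [CS-Z243.4]: the function name `french_sort_key` is hypothetical, only three parts are built (base letters, accents read from the last character to the first, case), special characters are not handled, and the relative weight of the accents is simply taken from their code points.

```python
import unicodedata


def french_sort_key(word):
    """Build one composite key: base letters, accents (right to left), case."""
    base, accents, case = [], [], []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        letter = decomposed[0]
        # Combining marks of this character; an empty tuple means "no accent".
        marks = tuple(ord(c) for c in decomposed[1:] if unicodedata.combining(c))
        base.append(letter.lower())
        accents.append(marks)
        case.append(letter.isupper())  # False (lower case) sorts first
    # Reversing the accent part makes the last character the most significant.
    return ("".join(base), tuple(reversed(accents)), tuple(case))


words = ["coter", "côté", "COTE", "coté", "cote", "côte", "Cote"]
ordered = sorted(words, key=french_sort_key)  # one pass over one key per word
```

Because each word is reduced to one comparable key, any ordinary sort algorithm can be used unchanged; the language-specific knowledge lives entirely in the key construction.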

Merge

The purpose of this function is to combine previously sorted lists into one sorted list. Hence the same rules must be applied here as in the sort function.

list1
Input: First list to be merged
list2
Input: Second list to be merged
list3
Output: Merged list (result of the merging of list1 and list2)
sort rule
Both predefined (e.g. German, Swiss-German) and installation-specified order must be possible.
Note: Output is the same as in the previous function (sort). The algorithm must assume (or verify) that the specified sort rule was applied to both input lists. If either list1 or list2 is not sorted according to the sort rule (or default) for the output, list3, then the algorithm must copy and re-sort that list using the sort function defined in «Sorting» before performing the merge.
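The verify, copy and re-sort behaviour described in the note can be sketched like this. It is a minimal sketch that assumes the sort rule is available as a key function; the names `is_sorted` and `merge` and the use of `str.casefold` as a stand-in sort rule are illustrative only.

```python
import heapq


def is_sorted(items, key):
    """Check whether items already follow the sort rule given by key."""
    keys = [key(item) for item in items]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def merge(list1, list2, key):
    """Merge two lists under one sort rule and return the merged list (list3)."""
    inputs = []
    for items in (list1, list2):
        if is_sorted(items, key):
            inputs.append(items)
        else:
            # Copy and re-sort; the caller's list is left untouched.
            inputs.append(sorted(items, key=key))
    return list(heapq.merge(*inputs, key=key))


# Example with case-insensitive ordering as a stand-in sort rule.
list3 = merge(["apple", "Cherry"], ["Date", "banana"], key=str.casefold)
```

Here the second input is not sorted under the rule, so a sorted copy of it is merged instead.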

Set and Query NLA Attributes

The architecture must provide mechanisms to set the NLA attributes used for processing, both locally and globally, and to query these attributes. IBM-supplied software must exploit these attributes.

Some files will require multiple NLA attributes. The NLA must have mechanisms for changing attributes in a file and for processing those changes. In addition, the NLA must allow the attributes to change during an interactive session.

An application must be able to query the current set of attributes and to define a new one for later processing. The identification of these attributes should not introduce new coding schemes. When needed, it must be possible to display the attributes in the language of the person using the computer. Therefore, attribute values must be independent of the natural language.

It may be desirable to identify a set of National Language attributes with an ID number, because the direct association of attributes with the data may be imprecise in cases like sort schemes or conversion rules. These IDs must be published in a «registry».

Also, query functions must be able to «extract» specific attributes from the identification tag of the data. The following model illustrates this:

NLQUERY (what, result )

In this model, what is a keyword (depending on the programming language binding, possibly contained in a character string), for example CODEPAGE. result must be in a form suitable for use in applications, in most cases a character string using only the syntactic character set (see ...). This model of a function call avoids proliferation of function names.
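A single keyword-driven entry point of this kind might look as follows in one possible language binding. This is a sketch under assumptions: the result is returned rather than passed as a second parameter, the companion function `nlset` is hypothetical, and apart from the keyword CODEPAGE all keywords and values are invented for illustration.

```python
# Hypothetical attribute store; only the keyword CODEPAGE appears in the text.
_ATTRIBUTES = {"CODEPAGE": "850", "LANGUAGE": "FR", "COUNTRY": "CH"}


def nlquery(what):
    """Return the current value of the attribute named by the keyword what."""
    key = what.upper()
    if key not in _ATTRIBUTES:
        raise KeyError(f"unknown NLA attribute: {what}")
    return _ATTRIBUTES[key]


def nlset(what, value):
    """Set an attribute for later processing; values are kept as strings."""
    _ATTRIBUTES[what.upper()] = str(value)
```

Adding a new attribute then only adds a keyword, not a new function name.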

Key Translation

Depending on the implementation of the NLA, this function is fundamental for all kinds of sorts and data base update and retrieval.

input
String to be converted (e.g. Côté). This is the representation used in presentation services.
output
Set of keys used for example for comparison in sorts. This can be seen as the «internal form» of strings.

This function describes a translation between an external form (the file) and an internal representation of the same information, which is intended for processing. Whether this translation is applied to data files directly (the sort keys are stored instead of the original text) or exists only during processing of the text is left to the implementor.

For some applications, like data base applications, it may be desirable to have the data ready in processable form. For other applications it may be more economical to «convert on the fly».

«Data Type Key» as described in section Implementation Considerations holds all the information of the original string - but in a form more adequate for internal processing.

Reverse Key Translation

This function translates from internal (process dependent) format of textual data to the string format used in presentation services.

input
Text string using 'internal representation' (as an example see Data Type Key).
output
Rebuilt data.
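The pair of functions can be illustrated with a round trip. The key layout below (base letters, accents, case) is a hypothetical stand-in and not the «Data Type Key» itself; the point of the sketch is only that the internal form keeps all the information needed to rebuild the presentation string. The names `translate_key` and `reverse_key_translation` are assumptions, and characters without a simple upper/lower case pair are not handled.

```python
import unicodedata


def translate_key(text):
    """Split a string into (base letters, accents, case flags)."""
    base, accents, case = [], [], []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base.append(decomposed[0].lower())
        accents.append(decomposed[1:])  # combining marks, possibly empty
        case.append(decomposed[0].isupper())
    return ("".join(base), tuple(accents), tuple(case))


def reverse_key_translation(key):
    """Rebuild the presentation string from the internal form."""
    base, accents, case = key
    chars = []
    for letter, marks, upper in zip(base, accents, case):
        chars.append((letter.upper() if upper else letter) + marks)
    return unicodedata.normalize("NFC", "".join(chars))


key = translate_key("Côté")
rebuilt = reverse_key_translation(key)
```

The first part of the key, here the string cote, is what a comparison on base letters would use, while the remaining parts carry the accents and case needed to restore the original.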

 

Created: 1996-12-28
© Docu+Design Daube, Zürich    