NLA - Functions needed for Support of National Languages
Functions to support the NLA in the area of language are:
- Compare strings
- Search for a string
- Sort a list of strings
- Merge two lists
- Set and query NLA attributes
- Translate key
- Reverse key translation
Functions «Translate Key» and «Reverse Key Translation» depend on the implementation of the NLA. The functions are needed, although they may not appear as distinct modules. The relation to the proposed «Data Type Key» is used only as an example to explain the functions.
Language is a global variable for Sort, Search and Compare operations. It need not be an attribute of every string (see Implementation considerations).
Comparison of Strings
In the comparison of text not only are exact matches relevant, but also «fuzzy» matches based on linguistic properties of the text. These functions are especially needed in data base applications for searching and sorting. The parameters indicate the type of compare function which is required.
- string1
- Input: Reference string
- string2
- Input: String to be compared to the reference string
- precision
- Input: Level of similarity required:
- Equality only if exact match (string1= 'côté', string2= 'côté')
- Quasi match if case and accents are right, but special characters are not (string1= 'vice-president', string2= 'vicepresident' or: string1= 'c'est ... dire', string2= 'c'est...dire' will match)
- Quasi matches if mixed upper case and lower case (with accents) (string1= 'CÔTÉ', string2= 'Côté' will match)
- Quasi matches if mixed accented and unaccented (string1= 'COTE', string2= 'Côté' will match)
- Fuzzy matches needed (based on phonetic rules).
Examples for fuzzy matches are:
string1= 'MÜLLER', string2= 'Mueller'; string1= 'Århus', string2= 'Aarhus'; string1= 'STRASSE', string2= 'Straße'
- relation
- Output: Result is one of the following:
- match
- according to the desired precision
- less than
- according to the desired precision (for example: côte < coté; cote < côté; COTE < côte; coté < coter)
- greater than
- according to the desired precision (for example: coter > coté)
- undetermined
- for special non-alphabetic characters
- no match
Note: | If quasi matches or fuzzy matches are used, a match has priority over a «less than» or «greater than» result.
There is no limit to fuzziness other than the implementation, which relies on a precisely stated set of phonetic rules for each language / country combination. |
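The precision levels can be sketched as successive folding steps applied to both strings before they are compared. The following Python sketch is illustrative only: the names `fold`, `matches` and `FUZZY_RULES` are hypothetical, the levels are assumed to be cumulative (each level also accepts everything the lower levels accept), every non-alphanumeric character is treated as a «special character», and the phonetic table covers only the three examples given above. It decides only between match and no match; ordering is left to the sort rules.

```python
import unicodedata

# Precision levels, numbered as in the Search section of this paper.
EXACT, IGNORE_SPECIALS, IGNORE_CASE, IGNORE_ACCENTS, FUZZY = 1, 2, 3, 4, 5

# Hypothetical, deliberately tiny phonetic table (lower-case keys only).
# A real NLA needs a precisely stated rule set per language/country.
FUZZY_RULES = {"ü": "ue", "ö": "oe", "ä": "ae", "ß": "ss", "å": "aa"}


def strip_accents(text):
    """Drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold(text, precision):
    """Reduce text to the form that is compared at the given precision."""
    text = unicodedata.normalize("NFC", text)
    if precision >= IGNORE_SPECIALS:
        # Assumption: "special characters" means anything non-alphanumeric.
        text = "".join(ch for ch in text if ch.isalnum())
    if precision >= IGNORE_CASE:
        text = text.lower()
    if precision >= FUZZY:
        # Phonetic rules run before accent stripping, so 'ü' becomes 'ue'.
        for source, target in FUZZY_RULES.items():
            text = text.replace(source, target)
    if precision >= IGNORE_ACCENTS:
        text = strip_accents(text)
    return text


def matches(string1, string2, precision):
    """True if the two strings match at the requested precision."""
    return fold(string1, precision) == fold(string2, precision)
```

With these assumptions, `matches('COTE', 'Côté', IGNORE_ACCENTS)` is true while `matches('COTE', 'Côté', IGNORE_CASE)` is false, and `matches('STRASSE', 'Straße', FUZZY)` is true.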
Search
A function similar to the comparison function is required for searching. The parameters indicate the type of search function which is required.
- pattern
- Input: Pattern to be located
- string
- Input: String which is to be searched for pattern.
- precision
- Input: Same as precision in «Comparison of Strings»:
- 1
- Equality only if exact match
- 2
- Quasi match if case and accents are right, but special characters are not
- 3
- Quasi matches if mixed upper case and lower case (with accents)
- 4
- Quasi matches if mixed accented and unaccented
- 5
- Fuzzy matches needed (based on phonetic rules)
- numvar
- Output: Returns the numeric offset of the pattern within the string. If the pattern is not found, the function returns -1. For example, if string='côtelette' and pattern='COTE': if a quasi match is accepted, numvar=0 is returned; otherwise numvar=-1 is returned.
- length
- Output: Returns the length of the substring delimited. In case of fuzzy matches this may differ from the implied length of the pattern. For example, if pattern='Müller' and string='Klaus Mueller', a fuzzy match is accepted: numvar=6 and length=7 will be returned. In this example the substring retrieved is longer than the pattern searched for.
Note: | The programming language specification must describe the index used for the first character (zero or 1) |
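One way to obtain both numvar and length is to fold the string character by character while remembering which original character produced each folded character. The Python sketch below is an assumption-laden illustration, not a prescribed algorithm: `nl_search`, `fold_with_offsets` and `FUZZY_RULES` are hypothetical names, only the accent-insensitive and fuzzy precisions are covered, offsets are zero-based (Python's convention), and they refer to the string in composed (NFC) form.

```python
import unicodedata

# Hypothetical phonetic expansions (lower-case keys), for illustration only.
FUZZY_RULES = {"ü": "ue", "ö": "oe", "ä": "ae", "ß": "ss", "å": "aa"}


def fold_char(ch, fuzzy):
    """Fold one character: lower case, optional expansion, no accents."""
    ch = ch.lower()
    if fuzzy and ch in FUZZY_RULES:
        return FUZZY_RULES[ch]
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fold_with_offsets(text, fuzzy):
    """Return the folded text and, per folded character, its source index."""
    folded, origins = [], []
    for index, ch in enumerate(unicodedata.normalize("NFC", text)):
        for out in fold_char(ch, fuzzy):
            folded.append(out)
            origins.append(index)
    return "".join(folded), origins


def nl_search(pattern, string, fuzzy=False):
    """Return (numvar, length); numvar is -1 if the pattern is not found."""
    folded_pattern, _ = fold_with_offsets(pattern, fuzzy)
    folded_string, origins = fold_with_offsets(string, fuzzy)
    if not folded_pattern:
        return -1, 0
    pos = folded_string.find(folded_pattern)
    while pos >= 0:
        end = pos + len(folded_pattern)
        # Reject hits that begin or end inside an expanded character.
        starts_clean = pos == 0 or origins[pos - 1] != origins[pos]
        ends_clean = end == len(origins) or origins[end] != origins[end - 1]
        if starts_clean and ends_clean:
            first, last = origins[pos], origins[end - 1]
            return first, last - first + 1
        pos = folded_string.find(folded_pattern, pos + 1)
    return -1, 0
```

Under these assumptions `nl_search('Müller', 'Klaus Mueller', fuzzy=True)` returns `(6, 7)`, as in the example above, and `nl_search('COTE', 'côtelette')` returns `(0, 4)`.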
Sorting
A function similar to the comparison function is required for sorting. The function needed is explained here by the parameters.
- list1
- Input: List to be sorted.
- list2
- Output: Sorted list. Output must follow NLA rules. An example of such a sorted list in French:
cote Cote COTE côte coté côté coter
- sort rule
- Both predefined (e.g. Swedish, German, Swiss-German) and installation-specified order must be possible. See also Hierarchy of defaults.
Please note the subtle rules for French: accents are not taken into account except to discriminate between homographs, as in the example above. In that case discrimination starts from the last character and moves left (to the first character), with priority depending on the accents. Each language uses different rules [CS-Z243.4.1, IBM GG24-3516, Wingen-2].
Although this paper presents a 4-key sort method (see sort key), the NLA must not restrict the sort method to four keys. Other languages and writing systems may need more keys. A universal sort algorithm for languages based on the Latin alphabet must support:
- Multiple sort passes (also called sort keys)
- Substitution of digraphs by one «sort element» (for example 'ch' in Spanish, equal value of 'Å' and 'aa' in Danish).
- Substitution of arbitrary character sequences by other ones (for example for sorting 'Saint' and 'St.' at the same place; or to remove arbitrary characters from the input string before sorting).
- Direction of comparison (front to end, end to front of string) must be definable for each sort pass or key (for French the discrimination of accents starts from the last character and moves to the left).
It is necessary to specify these functions in the definition of the sort scheme. We encourage implementors of the NLA to follow the POSIX-model of setlocale [POSIX-2].
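The substitution requirements can be illustrated with a table-driven key function. The Python sketch below is only one possible shape for such a sort scheme: `make_sort_key`, `LATIN` and `SPANISH` are hypothetical names, the Spanish table follows the traditional ordering in which 'ch', 'll' and 'ñ' are separate sort elements, and characters that are not in the table are simply removed.

```python
LATIN = list("abcdefghijklmnopqrstuvwxyz")

# Traditional Spanish ordering: 'ch', 'll' and 'ñ' are elements of their own,
# placed directly after 'c', 'l' and 'n'.
SPANISH = []
for letter in LATIN:
    SPANISH.append(letter)
    extra = {"c": "ch", "l": "ll", "n": "ñ"}.get(letter)
    if extra:
        SPANISH.append(extra)


def make_sort_key(elements, substitutions=()):
    """Build a key function from a list of sort elements, in sort order.

    substitutions is a sequence of (old, new) pairs applied to the
    lower-cased input before it is split into sort elements.
    """
    weight = {element: rank for rank, element in enumerate(elements)}
    longest = max(len(element) for element in elements)

    def sort_key(text):
        text = text.lower()
        for old, new in substitutions:
            text = text.replace(old, new)
        weights, i = [], 0
        while i < len(text):
            # Longest match first, so 'ch' wins over 'c' followed by 'h'.
            for size in range(longest, 0, -1):
                chunk = text[i:i + size]
                if chunk in weight:
                    weights.append(weight[chunk])
                    i += len(chunk)
                    break
            else:
                i += 1  # not in the table: removed before sorting
        return tuple(weights)

    return sort_key
```

For example, `sorted(["dama", "chico", "cuna", "casa"], key=make_sort_key(SPANISH))` places 'chico' after 'cuna', and a key built with `make_sort_key(LATIN, [("st.", "saint")])` gives 'St. Denis' and 'Saint Denis' the same value.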
A key construction algorithm (like that of Alain LaBonté in [CS-Z243.4]) is not the same thing as a sort algorithm (like those found in [Knuth]). The merit of the key construction method is that it computes a single key (constructed from parts), which allows single-pass sorting rather than the multiple passes used by other methods.
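To make the idea of a single constructed key concrete, the following Python sketch builds a three-part key that reproduces the French example list given under «Sorting». It is not the algorithm of [CS-Z243.4]: the function name `french_sort_key` is hypothetical, accents are ranked merely by the code point of their combining mark (a placeholder for a real accent table), and the fourth key of the 4-key method is omitted.

```python
import unicodedata


def french_sort_key(text):
    """Build one composite key: base letters, accents (reversed), case."""
    base, accents, case = [], [], []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        letter = decomposed[0]
        # Placeholder accent weight: code points of the combining marks.
        marks = tuple(ord(c) for c in decomposed[1:]
                      if unicodedata.combining(c))
        base.append(letter.lower())
        accents.append(marks)
        case.append(1 if letter.isupper() else 0)  # lower case sorts first
    # Key 1 is compared front to end, key 2 end to front, key 3 front to end.
    return ("".join(base), tuple(reversed(accents)), tuple(case))


words = ["coter", "côté", "COTE", "coté", "Cote", "côte", "cote"]
print(sorted(words, key=french_sort_key))
```

Because the three parts are packed into one tuple, an ordinary single-pass sort on that tuple gives the same result as three successive passes; the reversed accent part is what places 'côte' before 'coté'.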
Merge
The purpose of this function is to combine previously sorted lists into one sorted list. Hence the same rules must be applied here as in the sort function.
- list1
- Input: First list to be merged
- list2
- Input: Second list to be merged
- list3
- Output: Merged list (result of the merging of list1 and list2)
- sort rule
- Both predefined (e.g. German, Swiss-German) and installation-specified order must be possible.
Note: | Output is the same as in the previous function (sort). The algorithm must assume (or verify) that the specified sort rule was applied to both input lists. If either list1 or list2 is not sorted according to the sort rule (or default) for the output, list3, then the algorithm must copy and re-sort list1 or list2 using the sort function defined in «Sorting» before performing the merge. |
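The verify, copy, re-sort and merge sequence described in the note might look as follows in Python. This is a minimal sketch: `nl_merge`, `is_sorted` and `simple_rule` are hypothetical names, and `simple_rule` is only a stand-in for a real NLA sort rule (it compares unaccented lower-case letters first and the original text second).

```python
import heapq
import unicodedata


def simple_rule(text):
    """Stand-in sort rule: unaccented lower case first, then the original."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (stripped.lower(), text)


def is_sorted(items, sort_rule):
    """Check that items already follow the given sort rule."""
    keys = [sort_rule(item) for item in items]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def nl_merge(list1, list2, sort_rule):
    """Merge two lists into a new list ordered by sort_rule."""
    inputs = []
    for items in (list1, list2):
        if not is_sorted(items, sort_rule):
            # Copy and re-sort; the caller's list is left untouched.
            items = sorted(items, key=sort_rule)
        inputs.append(items)
    return list(heapq.merge(*inputs, key=sort_rule))
```

Verifying each input costs one linear pass, which is cheap next to a merge that silently produces a wrongly ordered list3.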
Set and Query NLA Attributes
The architecture must provide mechanisms to set the NLA attributes used for processing, both locally and globally, and to query these attributes. IBM-supplied software must exploit these attributes.
Some files will require multiple NLA attributes. The NLA must have mechanisms for changing attributes in a file and for processing those changes. In addition, the NLA must allow the attributes to change during an interactive session.
An application must be able to query the current set of attributes and to define a new one for later processing. The identification of these attributes should not introduce new coding schemes. When needed, it must be possible to display the attributes in the language of the person using the computer. Therefore, attribute values must be independent of the natural language.
It may be desirable to identify a set of National Language attributes with an ID number, because the direct association of attributes with the data may be imprecise in cases like sort schemes or conversion rules. These IDs must be published in a «registry».
Also, query functions must be able to «extract» specific attributes from the identification tag of the data. The following model illustrates this:
NLQUERY ( what, result )
In this model what will be a keyword (depending on the programming language binding, possibly contained in a character string), for example CODEPAGE. result must be in a form suitable for use in applications, in most cases a character string using only the (see ... syntactic character set. This model of a function call avoids proliferation of function names.
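A keyword-driven query with local and global attribute sets can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the names `nl_query`, `nl_set` and `new_local_scope`, the keyword LANGUAGE, and the attribute values shown. In a Python binding the result is naturally a return value rather than a second parameter.

```python
from collections import ChainMap

# Hypothetical global attribute set; values are language-independent codes.
GLOBAL_ATTRIBUTES = {"CODEPAGE": "850", "LANGUAGE": "FR"}


def new_local_scope():
    """Return a local attribute set that falls back to the global one."""
    return ChainMap({}, GLOBAL_ATTRIBUTES)


def nl_query(attributes, what):
    """Return the value of one attribute as a character string.

    A single function keyed by keyword avoids one function per attribute.
    Unknown keywords raise KeyError.
    """
    return attributes[what.upper()]


def nl_set(attributes, what, value):
    """Set an attribute in the given scope only."""
    attributes[what.upper()] = str(value)


session = new_local_scope()
nl_set(session, "LANGUAGE", "DE")    # local change during a session
print(nl_query(session, "LANGUAGE"), nl_query(session, "CODEPAGE"))
```

The local scope overrides LANGUAGE for the session while CODEPAGE is still resolved from the global set, and the global value of LANGUAGE is unchanged.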
Key Translation
Depending on the implementation of the NLA, this function is fundamental for all kinds of sorts and data base update and retrieval.
- input
- String to be converted (e.g. Côté). This is the representation used in presentation services.
- output
- Set of keys used for example for comparison in sorts. This can be seen as the «internal form» of strings.
This function describes a translation between an external form (the file) and an internal representation of the same information, which is intended for processing. Whether this translation is applied to data files directly (the sort keys are stored instead of the original text) or exists only while the text is processed is left to the implementor.
For some applications, like data base applications, it may be desirable to have the data ready in processable form. For other applications it may be more economical to «convert on the fly».
«Data Type Key» as described in section Implementation Considerations holds all the information of the original string, but in a form better suited to internal processing.
Reverse Key Translation
This function translates from internal (process dependent) format of textual data to the string format used in presentation services.
- input
- Text string using 'internal representation' (as an example see Data Type Key).
- output
- Rebuilt data.
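The pair of functions can be illustrated with a key that keeps base letters, accents and case in separate parts, so that the key is directly comparable and the original string can still be rebuilt. This Python sketch is not the «Data Type Key» format itself: `translate_key` and `reverse_key` are hypothetical names, the key layout is an assumption, and the round trip returns the composed (NFC) form of the input.

```python
import unicodedata


def translate_key(text):
    """External string -> internal key (base letters, accents, case)."""
    base, accents, case = [], [], []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        letter, marks = decomposed[0], decomposed[1:]
        lowered = letter.lower()
        # Record case only when it can be restored exactly.
        is_upper = lowered != letter and lowered.upper() == letter
        base.append(lowered if is_upper else letter)
        accents.append(marks)
        case.append(1 if is_upper else 0)
    # Accents are stored end to front, as French comparison requires.
    return (tuple(base), tuple(reversed(accents)), tuple(case))


def reverse_key(key):
    """Internal key -> string as used in presentation services."""
    base, reversed_accents, case = key
    chars = []
    for letter, marks, upper in zip(base, reversed(reversed_accents), case):
        if upper:
            letter = letter.upper()
        chars.append(unicodedata.normalize("NFC", letter + marks))
    return "".join(chars)


key = translate_key("Côté")
print(key[0], key[2], reverse_key(key))
```

Storing such keys in a data base makes the data ready for comparison at the cost of space; converting on the fly trades that space for processing time, which is the choice the text leaves to the implementor.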