Implementing an Alternative Collation Sequence for Unicode


By default, GT.M sorts string subscripts in the order of their numeric Unicode code-point ($ASCII()) values. Because this ordering may not be linguistically or culturally correct for a specific application, an implementation of an algorithm such as the Unicode Collation Algorithm (UCA) may be required. Implementing collation in GT.M requires two functions, f(x) and g(y). f(x) transforms each input sequence of bytes into an alternative sequence of bytes for storage; within the GT.M database engine, M nodes are retrieved according to the byte order in which they are stored. g(y) is the inverse function: for each y that f(x) can generate, it returns the original sequence of bytes. In other words, g(f(x)) must equal x for all x that the application processes. For example, for the People's Republic of China, it may be appropriate to convert from UTF-8 to Guojia Biaozhun (国家标准), the GB18030 standard, using a library such as libiconv. The following requirements are important:

Unambiguous transformation routines: The transform and its inverse must convert each input string to a unique sequence of bytes for storage, and convert each sequence of bytes stored back to the original string.
Collation sequence for all expected character sequences in subscripts: GT.M does not validate the subscript strings passed to/from the collation routines. If the application design allows illegal UTF-8 character sequences to be stored in the database, the collation functions must appropriately transform, and inverse transform, these as well.
Handle different string lengths before and after transformation: If the input string and the transformed string differ in length and, for local variables, the output buffer passed by GT.M is too small, follow the procedure described below:
- Global Collation Routines: The transformed key must not exceed 255 bytes, the maximum key size. GT.M allocates a temporary buffer of size 255 bytes in the output string descriptor (of type DSC_K_DTYPE_T) and passes it to the collation routine to return the transformed key.
- Local Collation Routines: GT.M allocates a temporary buffer in the output string descriptor based on the size of the input string. Both the transformation and the inverse transformation must check the buffer size. If it is not sufficient, the routine must allocate sufficient memory, set the output descriptor value (the val field of the descriptor) to point to the new memory, and return the transformed key successfully. Because GT.M copies the key from the output descriptor into its internal structures, the allocated memory must remain available even after the collation routine returns. Collation routines are typically called throughout the lifetime of the process, so GT.M expects the collation library to define a static buffer large enough to hold every key size the application uses. Alternatively, the collation transform can use a large heap buffer (allocated by the system malloc() or GT.M gtm_malloc()). Application developers must choose the method best suited to their needs.
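
The requirements above can be illustrated with a minimal sketch in C. It is not GT.M's actual collation interface: the descriptor type demo_desc, the routines demo_xform and demo_xback, and the bound DEMO_MAX_IN are hypothetical names chosen for this example, and only the val field is taken from the description above. The transform f(x) writes each input byte as a pair (case-folded byte, original byte), so the output is twice as long as the input, and g(y) recovers the input by keeping every second byte. When the caller's output buffer is too small, the routine points val at a static buffer that stays valid after the call returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a string descriptor; not GT.M's own header. */
typedef struct {
    int   len;   /* input: bytes in use; output: bytes available */
    void *val;   /* pointer to the bytes */
} demo_desc;

#define DEMO_MAX_IN 32768   /* assumed upper bound on an untransformed key */

/* Static storage outlives each call, so the caller can still copy the
 * result out of it after the routine has returned. */
static unsigned char xform_buf[2 * DEMO_MAX_IN];
static unsigned char xback_buf[DEMO_MAX_IN];

/* Fold ASCII lowercase to uppercase; every other byte, including bytes
 * of malformed UTF-8, passes through unchanged. */
static unsigned char fold(unsigned char b)
{
    return (b >= 'a' && b <= 'z') ? (unsigned char)(b - 32) : b;
}

/* f(x): emit (folded byte, original byte) for each input byte. */
int demo_xform(const demo_desc *in, demo_desc *out, int *outlen)
{
    assert(in != NULL && out != NULL && outlen != NULL);
    if (in->len < 0 || in->len > DEMO_MAX_IN)
        return -1;                      /* key too long for this sketch */
    int need = 2 * in->len;
    if (out->len < need)                /* caller's buffer is too small */
        out->val = xform_buf;           /* redirect val to static memory */
    const unsigned char *src = in->val;
    unsigned char *dst = out->val;
    for (int i = 0; i < in->len; i++) {
        dst[2 * i]     = fold(src[i]);
        dst[2 * i + 1] = src[i];
    }
    *outlen = need;
    return 0;
}

/* g(y): keep every second byte, so that g(f(x)) == x. */
int demo_xback(const demo_desc *in, demo_desc *out, int *outlen)
{
    assert(in != NULL && out != NULL && outlen != NULL);
    if (in->len < 0 || in->len % 2 != 0 || in->len > 2 * DEMO_MAX_IN)
        return -1;                      /* not something f(x) can produce */
    int need = in->len / 2;
    if (out->len < need)
        out->val = xback_buf;
    const unsigned char *src = in->val;
    unsigned char *dst = out->val;
    for (int i = 0; i < need; i++)
        dst[i] = src[2 * i + 1];
    *outlen = need;
    return 0;
}
```

With this transform, the stored key for "apple" begins with the byte 'A' and the stored key for "Banana" begins with 'B', so "apple" collates before "Banana", whereas plain code-point order places "Banana" first. Because the output doubles in length, a global collation routine built this way could only accept input keys of up to 127 bytes before exceeding the 255-byte limit on transformed keys.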
