LangBox International Codeset Overview

Codeset Overview

8-bit codesets: the ISO 8859 series

On UNIX environments, the ISO 8859 series is the de facto standard for all 8-bit national codesets. These codesets store the ASCII standard in code positions 0x00 to 0x7F, and their own national language characters in code positions 0xA0 to 0xFF. Because different languages use the same range of code positions and could therefore be mixed up, supporting a specific national language implies supporting one specific ISO codeset. In some cases, however, a single ISO codeset includes characters for several languages.
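As a rough illustration of this layout, the Python sketch below decodes the same byte values under several ISO 8859 codesets. It relies only on the codecs shipped with Python's standard library; the helper name `decode_in` is made up for this example and is not part of any LangBox product.

```python
# Codec names are the ones Python's standard library uses for the ISO 8859 series.
CODESETS = {
    "ISO 8859-1": "iso8859_1",
    "ISO 8859-5": "iso8859_5",
    "ISO 8859-6": "iso8859_6",
    "ISO 8859-7": "iso8859_7",
    "ISO 8859-8": "iso8859_8",
}

def decode_in(codeset, raw):
    """Decode raw bytes using the named ISO 8859 codeset (hypothetical helper)."""
    return raw.decode(CODESETS[codeset])

ascii_bytes = b"UNIX"        # code positions below 0x80
high_byte = bytes([0xE4])    # a code position in the 0xA0..0xFF range

for name in CODESETS:
    ch = decode_in(name, high_byte)
    # The ASCII half decodes identically everywhere; the high byte does not.
    print(f"{name}: 0xE4 -> U+{ord(ch):04X}, ASCII part -> {decode_in(name, ascii_bytes)}")
```

The ASCII bytes come out the same under every codeset, while the single byte 0xE4 becomes a different letter each time, which is why a document has to be read with the codeset it was written in.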

Initially, support for the common default ISO 8859-1 (and so for the West European languages) was already included in most UNIX systems. It is handled directly by the operating system's localization support, such as MNLS or the XPG standards, which provide the locale definition. However, support for non-European languages (i.e. for charsets other than ISO 8859-1) is not always present. This support must include at least a locale definition (which is available in most recent releases of vendors' UNIX systems), but also a large font set, dual keyboard management, printing support, etc., and should be embedded in the UNIX kernel or system libraries for transparency. To fill this gap, LangBox products add both the locale definitions for these languages and font sets for various devices such as printers, dumb terminals and X servers. Dual keyboard management is also supported via either a pseudo device driver in the kernel, a STREAMS module or a new set of system libraries.

The ISO 8859 series is basically as follows:

Codeset	Quick reference	National language(s)
ISO 8859-1	Latin 1	Default (English), West European
ISO 8859-2	Latin 2	East European
ISO 8859-3	Latin 3	Catalan, Esperanto, Galician...
ISO 8859-4	Latin 4	Lithuanian
ISO 8859-5		Cyrillic
ISO 8859-6		Arabic
ISO 8859-7		Greek
ISO 8859-8		Hebrew
ISO 8859-9	Latin 5	Turkish
TIS 620 (seems to become ISO 8859-11)		Thai

A very good and regularly updated information site on the ISO 8859 series is Roman Czyborra's ISO 8859 Alphabet Soup.

16-bit codesets: Unicode

The Unicode standard (compatible with ISO/IEC 10646-1) moves to storing characters on 16 bits, which raises the number of characters that can be represented to 65,536. In theory this might be enough to map all known writing systems and still leave room for future expansion. However, the Unicode allocation pipeline suggests that this first estimate was too low, and that a space of 20 bits or even 32 bits will be needed in the future.

The Unicode Consortium, formed in 1991, works out the details of the Unicode Worldwide Character Standard, which currently covers 49,194 characters from 35 scripts, each of which can be used for one or more languages. A large number of scripts are still unsupported by Unicode, but are either being processed or being considered for inclusion.

With the adoption of Unicode, applications should no longer need to support multiple codesets. All characters of the 8-bit national standards have been re-encoded in a 16-bit wide encoding, which simplifies the handling of multi-language documents and the sharing of a national document with a user in another national environment.
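A small Python sketch can make the point concrete. It builds one string mixing Greek, Arabic and Hebrew letters and checks which encodings can hold it. The helper `fits` is a name invented for this example, and Python's `utf-16-be` codec is used as a stand-in for the 16-bit wide encoding described above (an assumption that holds for BMP characters).

```python
# One Unicode string mixing three scripts.
text = "\u03b4\u0644\u05d0"  # Greek delta, Arabic lam, Hebrew alef

def fits(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# No single ISO 8859 codeset holds all three letters; the wide encoding does.
for codec in ("iso8859_6", "iso8859_7", "iso8859_8", "utf-16-be"):
    print(codec, fits(text, codec))

# Each of these BMP characters takes exactly one 16-bit unit.
wide = text.encode("utf-16-be")
print(len(wide), "bytes for", len(text), "characters")
```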

The main organization of the Unicode codeset (the area called the Basic Multilingual Plane, or BMP) is as follows:

[Figure: organization of the Basic Multilingual Plane]

The Unicode Standard is continuously being extended. Since Unicode 1.0, several additional languages have been adopted and added to the standard, up to the latest version today, Unicode 3.0. For more information, see: http://www.unicode.org and http://charts.unicode.org/charts

Codeset encoding

However, supporting this wide codeset in operating systems is not straightforward. Because of its native 16-bit format and some problems with direct backward compatibility (with the default 8-bit read/write functions and/or ASCII documents, for instance), several encoding formats (i.e. storage representations of a Unicode character) have been established. The main ones are the following:

Fixed length in bytes (no backward compatibility with ASCII):

UCS-2: This is the native 16-bit Unicode coding format. Each character is coded on two bytes (from U-0000 to U-FFFF).
UCS-4: This is the native ISO/IEC 10646-1 coding format. Each character is coded on four bytes (from U-00000000 to U-7FFFFFFF, a 31-bit space). However, this format consumes a lot of disk and memory space.
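The two fixed-length formats can be sketched in a few lines of Python. The functions `ucs2_be` and `ucs4_be` are hypothetical helpers written for this example, and big-endian byte order is an assumption (the text above does not specify one).

```python
import struct

def ucs2_be(text):
    """Pack BMP characters as big-endian 16-bit units (UCS-2 style)."""
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError("UCS-2 cannot represent characters outside the BMP")
    return b"".join(struct.pack(">H", ord(ch)) for ch in text)

def ucs4_be(text):
    """Pack characters as big-endian 32-bit units (UCS-4 style)."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)

sample = "A\u0644"  # Latin capital A and Arabic lam

# Every character costs the same number of bytes, even plain ASCII,
# and the zero padding bytes are what breaks ASCII compatibility.
print(ucs2_be(sample).hex())
print(ucs4_be(sample).hex())
```

For characters inside the BMP these byte sequences match what Python's own `utf-16-be` and `utf-32-be` codecs produce.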

Multi-byte length (backward compatibility with ASCII, except for UTF-16):

UTF-8: This is the encoding most used to store Unicode data. It is multi-byte based (from 1 to 6 bytes in its original definition, restricted to 4 bytes since RFC 3629), but it is ASCII compatible and reasonably compact (at most 3 bytes are needed for any character of the BMP).
UTF-7: Because UTF-8 is based on 8-bit bytes, and because some terminals or mail gateways still strip the 8th bit of each byte, this encoding is purely 7-bit and uses a modified form of MIME base64 encoding for non-ASCII characters.
EUC: The old AT&T Extended Unix Code. It can store 8,836 codes (94 * 94) in a multi-byte form (7 bit).
UTF-16: This encoding is close to UCS-2, but uses surrogate pairs (two 16-bit units taken from the range 0xD800 to 0xDFFF, 32 bits in total) to address characters coded outside the BMP area. UTF-16 covers the range from U-00000000 to U-0010FFFF, slightly more than a 20-bit space. Unlike UTF-8 and UTF-7, it is not byte-compatible with ASCII.
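The following Python sketch illustrates two of the points in this list: the variable length of UTF-8 and the arithmetic behind a UTF-16 surrogate pair. The function names `to_surrogates` and `from_surrogates` are invented for this example; the codecs are the ones built into Python.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points from 0x10000 to 0x10FFFF need a pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

def from_surrogates(high, low):
    """Rebuild the code point from a surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# Byte counts per character: ASCII, Arabic, Thai, and one character beyond the BMP.
for ch in ("A", "\u0644", "\u0e01", "\U00010000"):
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-be"))
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")

# UTF-7 output stays within 7-bit bytes.
seven_bit = "A\u0644".encode("utf-7")
print(seven_bit)

high, low = to_surrogates(0x10FFFF)
print(f"U+10FFFF -> 0x{high:04X} 0x{low:04X}")
```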

LangBox's role

Because Unicode only defines a unified storage standard for documents, displaying (or rendering) these documents in an application may still need some specific processing. This is typically the case for Complex Text Languages, such as the languages based on the Arabic alphabet, where a bidirectional ("Bi-Di") rendering process and a "Glyph Shaping" process must be applied. The Unicode Standard describes these processes but does not implement them. See the Unicode report on Bi-Di.
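The short Python sketch below only inspects the character properties that such rendering processes start from; it is not a Bi-Di or shaping implementation, and it says nothing about how LangBox products work internally. It uses the standard `unicodedata` module, and the choice of the letter LAM as the example is arbitrary.

```python
import unicodedata

# Latin, digit, Hebrew and Arabic characters carry different Bi-Di classes,
# which a renderer uses to decide the visual order of a stored text.
for ch in ("a", "1", "\u05d0", "\u0644"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.bidirectional(ch)}")

# The stored letter LAM (U+0644) has four contextual shapes. Unicode lists them
# in the Arabic Presentation Forms-B block as compatibility characters.
LAM_FORMS = ("\ufedd", "\ufede", "\ufedf", "\ufee0")
for form in LAM_FORMS:
    base = unicodedata.normalize("NFKC", form)
    print(f"U+{ord(form):04X} {unicodedata.name(form)} -> U+{ord(base):04X}")
```

A document stores only U+0644; picking the isolated, final, initial or medial shape according to the neighbouring letters is the shaping step left to the rendering software.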

It is at this specific point that LangBox International steps in, by providing Unicode rendering functions for Complex Languages such as Arabic, Farsi, Thai... Please contact us for more information.