Codeset
Overview
8-bit codesets: the ISO 8859 series
On UNIX environments, the ISO 8859 series is the de facto standard
for all 8-bit national codesets. These codesets store
the ASCII standard in code positions 0x00 to 0x7F, and
their own national language characters in code positions
0xA0 to 0xFF. Because different languages use the same range of
code positions and could therefore be mixed up, support for a
specific national language implies support for one specific ISO
codeset. In some cases, however, a single ISO codeset includes
characters for several languages.
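The overlap between codesets can be illustrated with a small sketch. It assumes Python's standard `iso8859_1`, `iso8859_5` and `iso8859_7` codecs as stand-ins for three parts of the series; the helper name `decode_with` and the sample bytes are chosen only for illustration.

```python
# The same two bytes interpreted under three ISO 8859 parts.
# 0x41 lies in the shared ASCII range, 0xE9 in the national range.
raw = bytes([0x41, 0xE9])

def decode_with(codeset: str, data: bytes) -> str:
    """Decode data using the named ISO 8859 codec (illustrative helper)."""
    return data.decode(codeset)

PARTS = ["iso8859_1", "iso8859_5", "iso8859_7"]  # Latin-1, Cyrillic, Greek
decoded = {part: decode_with(part, raw) for part in PARTS}

for part, text in decoded.items():
    # The first character is always 'A'; the second depends on the codeset.
    print(part, [f"U+{ord(ch):04X}" for ch in text])
```

Without knowing which codeset was intended, the byte 0xE9 cannot be interpreted reliably, which is why each language is tied to one specific codeset.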
Initially, support for the common default ISO 8859-1 codeset
(and so for the West European languages) was already
included in most UNIX systems. It is handled directly
by operating system localization support such as MNLS or
the XPG standards, which provide Locale definitions. However,
support for non-European languages (i.e. for codesets other than
ISO 8859-1) is not always present. This support must include
at least a Locale definition (which is the case
in most of the latest releases of vendors' UNIX), but also a large
font set, dual keyboard management, printing support,
etc., and should be embedded in the UNIX kernel or system
libraries for transparency. Because this support is lacking, LangBox
products add the Locale definitions for these
languages as well as font sets for various devices such as
printers, dumb terminals and X servers. Dual keyboard
management is also supported via either a pseudo device driver
in the kernel, a STREAMS module or a new set of system libraries.
Basically, the ISO series is as follows:
A very good information site on the ISO 8859 series, with
the latest updates, is Roman Czyborra's ISO 8859 Alphabet Soup.
16-bit codesets: UNICODE
The Unicode standard (compatible with ISO/IEC 10646-1) represents
a change to storing 16-bit characters, which increases the
number of characters that can be represented to over 65,000
(65,536 code positions). This might theoretically be enough to map
all known alphabet schemes and even leave room for future expansion.
However, according to the Unicode allocation pipeline, this first
estimate appears to be wrong, and a 20-bit or 32-bit code space
may be required in the future.
The Unicode Consortium, formed in 1991, works out the details of
the Unicode Worldwide Character Standard, which currently covers
49,194 characters from 35 scripts, each of which can be
used for one or more languages. A large number of scripts
are still unsupported by Unicode but are either in process or
being considered for processing.
With the adoption of UNICODE standard support, multiple codeset
support should no longer be needed on the application side. All
characters of the 8-bit national standards have been re-encoded
in a wide 16-bit encoding, which simplifies the handling of
multi-language documents and the sharing of a national document
with another national user environment.
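As a minimal sketch of this unification, the following assumes two byte strings that are known to be ISO 8859-7 (Greek) and ISO 8859-6 (Arabic) data, and uses Python's standard codecs to bring them into one Unicode string. The sample bytes and variable names are illustrative only.

```python
# Two fragments stored in two different 8-bit national codesets.
greek_bytes = bytes([0xC1, 0xE2])    # assumed to be ISO 8859-7 data
arabic_bytes = bytes([0xC7, 0xC8])   # assumed to be ISO 8859-6 data

# Once decoded, both fragments live in the same Unicode code space.
document = greek_bytes.decode("iso8859_7") + " " + arabic_bytes.decode("iso8859_6")
code_points = [f"U+{ord(ch):04X}" for ch in document]
print(code_points)

# Neither 8-bit codeset can hold the combined document.
try:
    document.encode("iso8859_7")
    fits_in_greek_codeset = True
except UnicodeEncodeError:
    fits_in_greek_codeset = False
print("fits in ISO 8859-7:", fits_in_greek_codeset)
```

The combined string can be stored or exchanged as a single document, whereas with 8-bit codesets each fragment would need to carry its own codeset label.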
The main organization of the Unicode codeset (the area called the
Basic Multilingual Plane, or BMP) is as follows:
The Unicode Standard is continuously being extended. Since UNICODE 1.0,
several additional languages have been adopted and added to
the standard, up to the latest version today: UNICODE 3.0.
For more information, see:
http://www.unicode.org and http://charts.unicode.org/charts
However, support for this wide codeset is not straightforward in
operating systems. Because of its native 16-bit format and some
problems with direct backward compatibility (with the default 8-bit
read/write functions and/or ASCII documents, for instance), several
encoding formats (i.e. storage representations of a Unicode character)
have been established. They are basically the following:
Fixed byte length (no backward compatibility with ASCII):
- UCS-2: This is the native 16-bit Unicode coding format. Each
character is coded on two bytes (from U+0000 to U+FFFF).
- UCS-4: This is the native ISO/IEC 10646-1 coding format. Each
character is coded on four bytes (from U-00000000 to U-7FFFFFFF,
a 31-bit code space). However, this format consumes a lot of disk
and memory space.
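The fixed-width formats can be sketched as follows. Python has no codecs named UCS-2 or UCS-4, so this example assumes that `utf-16-be` and `utf-32-be` are acceptable stand-ins, which holds as long as every character is in the BMP. The helper `char_at_ucs2` is hypothetical.

```python
text = "A\u00e9\u20ac"   # U+0041, U+00E9, U+20AC: all BMP characters

# Stand-ins for UCS-2 and UCS-4 (big-endian, no byte order mark).
ucs2 = text.encode("utf-16-be")
ucs4 = text.encode("utf-32-be")

print(len(ucs2), ucs2.hex())   # two bytes per character
print(len(ucs4), ucs4.hex())   # four bytes per character

def char_at_ucs2(data: bytes, index: int) -> str:
    """Fetch character number `index` directly from a 2-byte fixed-width buffer."""
    unit = data[2 * index: 2 * index + 2]
    return chr(int.from_bytes(unit, "big"))

# Fixed width gives constant-time indexing...
print(char_at_ucs2(ucs2, 2))
# ...but even plain ASCII text no longer matches its ASCII byte form.
print("A".encode("utf-16-be") == b"A")
```

The last line prints `False`: the letter A is stored as the two bytes 00 41, which is why these formats are not backward compatible with ASCII files or 8-bit read/write functions.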
Multi-byte length (backward compatibility with ASCII):
- UTF-8: This is the format most used to store Unicode data.
It is multi-byte based (from 1 to 6 bytes in its original
definition, and at most 4 bytes for code points up to U-0010FFFF)
but is ASCII compatible and reasonably compact (at most 3 bytes
are needed for any BMP character).
- UTF-7: Because UTF-8 is based on 8-bit bytes, and because some
terminals or mail gateways still strip the 8th bit of each byte,
this encoding is purely 7-bit and uses a modified MIME base64
encoding.
- EUC: The old AT&T Extended Unix Code. It allows 8,836
codes (94 * 94) to be stored as multi-byte data (7 bit).
- UTF-16: This encoding is close to UCS-2, but allows a
surrogate pair (two 16-bit units in the range 0xD800 to 0xDFFF)
to address characters coded outside the BMP area, using 32 bits
in total. UTF-16 covers the BMP plus a 20-bit supplementary space
(from U-00000000 to U-0010FFFF). Unlike UTF-8, it is not
byte-compatible with ASCII.
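The variable-length formats can be compared with a short sketch. It relies on Python's standard `utf-8`, `utf-7` and `utf-16-be` codecs; the helper names `utf8_length` and `to_surrogate_pair` and the sample characters are illustrative assumptions, not part of any standard API.

```python
from typing import Tuple

def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point outside the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a pair")
    offset = code_point - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low

# ASCII, Latin-1 range, rest of the BMP, and a character beyond the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]
lengths = [utf8_length(ch) for ch in samples]
print(lengths)

# ASCII text is byte-for-byte identical in UTF-8.
print("ASCII".encode("utf-8") == b"ASCII")

# UTF-7 output never uses the 8th bit.
utf7 = "\u00e9".encode("utf-7")
print(utf7, all(byte < 0x80 for byte in utf7))

# The computed pair matches what the UTF-16 codec produces.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
```

The surrogate arithmetic shows where the 20-bit figure comes from: each half of the pair carries 10 bits of the offset above U+FFFF.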
Because UNICODE only defines a unified storage standard for documents,
displaying (rendering) these documents in an application
may still need some specific processing. This is typically
the case for Complex Text Languages, such as languages based on
the Arabic alphabet (where a "Bi-Directionality"
rendering process and a "Glyph Shaping" process
must be performed). The UNICODE Standard describes these processes
but does not implement them. See the
Unicode report on Bi-Di.
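The character properties that such rendering relies on can be inspected with Python's standard `unicodedata` module. The sketch below is not a Bi-Di or shaping engine; it only shows the data a renderer would start from, and the helper name `bidi_classes` is hypothetical.

```python
import unicodedata

def bidi_classes(text: str) -> list:
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin letter, Arabic letter ALEF, Hebrew letter ALEF.
mixed = "a\u0627\u05d0"
classes = bidi_classes(mixed)
print(classes)   # left-to-right, Arabic letter, right-to-left

# Glyph shaping: a document stores ALEF once (U+0627), while the
# contextual glyphs also exist as compatibility "presentation forms".
isolated, final = "\ufe8d", "\ufe8e"
print(unicodedata.name(isolated))
print(unicodedata.name(final))

# Compatibility normalization maps both forms back to the stored letter.
base = {unicodedata.normalize("NFKC", form) for form in (isolated, final)}
print(base == {"\u0627"})
```

A real renderer must still reorder the runs according to these classes and choose the correct contextual glyph for each letter, which is the processing described above.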
It is at this specific point that LangBox International
steps in, by providing support for UNICODE rendering
functions for Complex Languages such as Arabic, Farsi, Thai...
Please contact us for more info.