+----------------------------------------------------------------------------+
|  LangBox International                        Tel: +33 (0)4 9371 1410      |
|                                               Fax: +33 (0)4 9371 1560      |
|  Imm. SPACE - Bat B                        Email : support@langbox.com     |
|  208/212 Rte de Grenoble                      or : langbox@spartacus.com   |
|  06200 Nice - France                              http://www.langbox.com   |
+----------------------------------------------------------------------------+
Copyright LangBox International


******************************************************************************
*                               Technical Memo                               *
*   Technical extension needed for BI-DI language support to a Web Browser   *
******************************************************************************
*                                                                            *
* From : Franck Portaneri - franck@langbox.com                               *
* Date : 22 Aug 96                                                           *
* To   : Netscape                                                            *
* Info : http://www.langbox.com/AraMosaic                                    *
******************************************************************************

This memo summarizes the points that must be addressed when adding BI-DI
language support, and more precisely Arabic language support, to a WEB
Browser. It draws on the Arabic support that LangBox added to X-Mosaic,
which produced AraMosaic in June 96.
It was written to define the Arabic language localization requirements
for a possible Netscape Localization toolkit, following our previous
contacts with Netscape.


1- Codeset/Font
---------------

The codeset used for Arabic is ISO 8859-6. This is the best codeset for
UNIX platforms, where the other ISO 8859 counterparts are already in use.
However, because of the market share of Windows 3 and of the Microsoft
Arabic version of Windows, the codeset CP 1256 is also widespread,
and most Arabic text Web servers provide both encodings
(ISO 8859-6 for MAC and UNIX and CP 1256 for Windows).

Other BI-DI languages (such as Hebrew with ISO 8859-8, Farsi with ISIRI 3342
...) are also based on 8 bit codesets.

UNICODE may become the global and definitive solution in the coming
years, but today the market still relies on 8 bit codesets.
In any case, the codeset choice is not the major problem in handling BI-DI
languages, and the following explanations are valid for both 8 bit and
16 bit codesets.

Since AraMosaic is built on UNIX, we chose ISO 8859-6, bearing in
mind that in many cases a dynamic codeset conversion can be applied to the
input/output character stream in order to support CP 1256.
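
As an illustration of such a dynamic conversion, here is a minimal sketch
of a byte-for-byte filter from CP 1256 to ISO 8859-6. It is not taken from
AraMosaic: the function names are hypothetical, only ASCII and the Arabic
letters are mapped, and the byte values are assumptions that should be
checked against the published code tables before any real use.

```c
#include <stddef.h>

/* Hypothetical helper: map one CP 1256 byte to ISO 8859-6.
 * Only ASCII and the Arabic letters are handled in this sketch;
 * every other byte (Latin accents, vowel marks, punctuation)
 * is replaced by '?'. Byte values are assumed, verify them. */
unsigned char cp1256_to_iso8859_6(unsigned char c)
{
    if (c < 0x80)
        return c;                          /* ASCII is common to both */
    if (c >= 0xC1 && c <= 0xD6)
        return c;                          /* hamza .. dad: same codes */
    if (c >= 0xD8 && c <= 0xDB)
        return (unsigned char)(c - 1);     /* tah .. ghain */
    if (c >= 0xDC && c <= 0xDF)
        return (unsigned char)(c + 4);     /* tatweel, feh, qaf, kaf */
    if (c == 0xE1)
        return 0xE4;                       /* lam */
    if (c >= 0xE3 && c <= 0xE6)
        return (unsigned char)(c + 2);     /* meem .. waw */
    if (c == 0xEC || c == 0xED)
        return (unsigned char)(c - 3);     /* alef maksura, yeh */
    return '?';
}

/* Convert a whole input buffer in place, e.g. right after it is read. */
void convert_cp1256_buffer(unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = cp1256_to_iso8859_6(buf[i]);
}
```

Because the conversion is one byte to one byte, it can be applied in place
to the data stream without changing buffer sizes.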

The fonts used for the various HTML styles must be replaced with Arabic
fonts. Several sizes, weights and styles must be available to cover all
predefined HTML styles. The solution is to install these fonts on the OS
font server and select them for drawing text.

For Arabic, the nicer fonts are proportional width fonts. However, fixed
width fonts exist and can be used for formatted sections.

Also, for the Arabic (or Farsi) language, the fontset includes more glyphs
than the codeset has characters. This is because several shapes of the
same letter are needed to correctly display an encoded string.

2- Localization of menus/messages
----------------------------------

The Mosaic menus and dialog boxes are handled by Motif. Localizing them
can be done by the following operations:

  - Edit/Translate X Resource files from English to Arabic
  - Run Mosaic with an Arabized version of Motif (i.e. XLANGBOX-ARA)

For other platforms such as MS Windows, the same approach should be
applied under the Arabic edition of Windows 3.x or Windows95.

This has not been done, since it presents no major problem and is really
needed only for a final commercial package. It was not the purpose of the
AraMosaic 1.0 project.

3- The HTML page widget
-----------------------

Here is the main issue: several points of the HTML page handling need to be
covered to support BI-DI languages and Arabic.

3.1 The BI-DI language support
------------------------------

BI-DI languages are native languages that are read from right to left
(mainly Arabic, Farsi, Hebrew, Urdu...) and into which English sub-strings,
read from left to right, can be inserted.
This mainly causes problems for text display and text selection.
The BI-DI language support must allow the main HTML page to be built
from right to left (and top to bottom), and must always provide
the ability to return to the original left to right layout in order to
support a bilingual environment.

This BI-DI support should be embedded into all elements of the HTML pages
and into the functions that refresh them. The management of the horizontal
scroll bar should also be enhanced (see 3.1.1).

The BI-DI support must be embedded for all element types inserted into
the HTML page :
        - Text                              (TextRefresh())
        - Horizontal rules                  (HRuleRefresh())
        - Bullet                            (BulletRefresh())
        - Images                            (ImageRefresh())
        - Table                             (TableRefresh())
        - Input Widgets or select widget    (WidgetRefresh())
        - ...
and managed by a new HTML page structure variable : the RTL Mode.
This flag could be set by the user through a preference dialog box or
dynamically on receipt of an HTML markup (encoding = ISO8859-6
or LANGUAGE=ar or align=right in the HTML Header). When the value of
this variable changes, the whole page should be rebuilt and redisplayed.

3.1.1 : Scroll bar management

If the RTL Mode flag is active, the whole management of the scroll bar
should be reversed. The slider must be initialized at the right end of the
scroll bar (XmNprocessingDirection=XmMAX_ON_LEFT), and a movement from
right to left should decrease the X offset.


3.1.2 : Initialize the X origin of all elements

Under RTL, the X origin of all elements must be initialized from the right
side of the page towards the left (i.e. from hw->html.view_width -
margin_width to margin_width) and all the related code should be aware of
this.
However, if the browser accepts a dynamic language orientation switch (through
a toggle button in a menu or a specific interpreted HTML markup), the
page can be totally rebuilt from left to right, as is normal in English.
Also in RTL mode, after a resize action, the X coordinates of all elements
should be updated according to the new view_width.
This is done in the functions CreateElement() and SetElement().

3.1.3 : Display/Refresh of each element

All functions in charge of the display (or refresh) of each type of
element need to be enhanced in order to manage the X coordinate change
in RTL mode.
The basic function that redisplays an element (text, image, applet...)
is always the same, is part of the OS library and draws from left to
right. Before calling this low level routine, the Browser functions must
therefore calculate the physical X coordinate, which is the RTL X
coordinate of the element less the element width.

Standard LTR mode                   RTL mode
+-----------------------------+     +-----------------------------+
|  ABCD       =======         |     |        =======        ABCD  |
|             =======         |     |        =======              |
|             =======         |     |        =======              |
|                             |     |                             |
   ^          ^                              ^     ^        ^  ^
   |          |                              |     |        |  |
  LXt=PXt    LXi=PXi                        PXi   RXi      PXt RXt

LXt = LTR Text X coordinate
LXi = LTR Image X coordinate
RXt = RTL Text X coordinate
RXi = RTL Image X coordinate
PXt = Physical Text X coordinate for Basic routine (XDrawString())
PXi = Physical Image X coordinate for Basic routine (XCopyArea())

In LTR Mode, PXt = LXt and PXi = LXi.
In RTL Mode, PXt = RXt - text_width = View_width - LXt - text_width
and          PXi = RXi - Img_width  = View_width - LXi - Img_width

This algorithm must be implemented in every element redrawing function.
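
The formulas above can be sketched as a single helper shared by all the
redrawing functions. The function name and its parameters are hypothetical;
only the arithmetic comes from the formulas in this section.

```c
/* Hypothetical helper: physical X coordinate to hand to the low level
 * routine (XDrawString(), XCopyArea()...).
 *   rtl_mode      : non-zero when the page RTL Mode flag is set
 *   ltr_x         : X coordinate of the element in LTR layout (LX)
 *   view_width    : width of the HTML view
 *   element_width : width of the text or image to draw */
int physical_x(int rtl_mode, int ltr_x, int view_width, int element_width)
{
    if (!rtl_mode)
        return ltr_x;                          /* PX = LX */

    /* RX = view_width - LX, then PX = RX - element width */
    return view_width - ltr_x - element_width;
}
```

For example, with a view 100 pixels wide, an element 20 pixels wide placed
at LX = 10 is drawn at PX = 10 in LTR mode and at PX = 70 in RTL mode.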

3.1.4  Text logical and text visual

In BI-DI languages, the visual text (displayed) does not follow the
same order as the logical text (stored).
English letters are still written from left to right, but all national
letters are written from right to left. Mixing both languages in
the same line creates complex results:

If you have the string "abc123def" where numbers represent RTL directional
letters, and others are just Latin letters, you can have the following 
situations :

        - The internal ISO 8859-6 string is            :   abc123def

        - In LTR main mode, the visual string is       :   abc321def

          +------------------------------+
          |abc321def                     |
          |                              |
          +------------------------------+

        - In RTL main mode, the visual string is       :   def321abc

          +------------------------------+
          |                     def321abc|
          |                              |
          +------------------------------+

So before any string-specific handling (width measurement, drawing...),
the browser should call a transformation routine:

        TransformLogicalToVisual(logical_str,visual_str)
        unsigned char *logical_str;
        unsigned char *visual_str;

In all cases, logical_str should be a pointer to the whole text line,
and the display function should refresh the whole line.

The transformation can be done using an Implicit algorithm
(this is the case with 8 bit codesets, where Arabic characters are
distinguished from Latin ones by the 8th bit) or an Explicit algorithm
that uses control characters to determine language, direction...
But this is done inside TransformLogicalToVisual() and is opaque to
the browser.
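
To make the Implicit case concrete, here is a deliberately simplified
sketch that reproduces the "abc123def" example above. It is not the
XLANGBOX-ARA routine: the function name and the rtl_mode argument are
hypothetical, every byte with the 8th bit set is assumed to be an RTL
letter, and neutral characters (spaces, punctuation, digits) get no
special treatment, which a real implementation must provide.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: in an 8 bit codeset such as ISO 8859-6, a byte with the
 * 8th bit set is a right-to-left (Arabic) character. */
static int is_rtl_char(unsigned char c)
{
    return (c & 0x80) != 0;
}

/* Reverse the bytes s[first..last] in place. */
static void reverse_range(unsigned char *s, size_t first, size_t last)
{
    while (first < last) {
        unsigned char tmp = s[first];
        s[first++] = s[last];
        s[last--] = tmp;
    }
}

/* Simplified implicit reordering of one whole line.
 * rtl_mode is a hypothetical extra argument standing for the RTL Mode
 * flag of the page. visual_str must have room for the whole line plus
 * the terminating zero. */
void sketch_logical_to_visual(const unsigned char *logical_str,
                              unsigned char *visual_str, int rtl_mode)
{
    size_t n = strlen((const char *)logical_str);
    size_t i = 0;

    memcpy(visual_str, logical_str, n + 1);
    if (n == 0)
        return;

    /* In RTL main mode the line as a whole runs right to left:
     * reverse everything first, the Latin runs are restored below. */
    if (rtl_mode)
        reverse_range(visual_str, 0, n - 1);

    /* Reverse each run whose direction differs from the main mode:
     * Arabic runs in LTR mode, Latin runs in RTL mode. */
    while (i < n) {
        size_t start = i;
        int rtl = is_rtl_char(visual_str[i]);

        while (i < n && is_rtl_char(visual_str[i]) == rtl)
            i++;
        if (rtl != (rtl_mode != 0))
            reverse_range(visual_str, start, i - 1);
    }
}
```

With three Arabic bytes standing where "123" appears in the example, the
sketch yields the order abc321def in LTR main mode and def321abc in RTL
main mode.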

3.1.5  Mouse pointing management

The mouse management function should take the above changes into account
when determining the pointed element from the mouse position on the screen.
For Mosaic, this function is LocateElement().

Also, since the visual string is different from the internal one,
the mouse pointing function also needs to follow a specific algorithm
when clicking on text.
The solution is to call a specific external function that calculates the
internal position for a visual position.

   CheckPositionVisualToLogical(logical_str,vpos,lpos)
   char *logical_str;
   int vpos, *lpos;

The visual position of the mouse (i.e. the index of the character
located at the mouse position) is given in vpos, and the internal
buffer is pointed to by logical_str. The function returns the corresponding
logical position of the character in lpos.

So during a selection action, the mouse pointing routine (LocateElement())
returns the logical character position of the mouse, and the RefreshText()
function uses this value as selection area to calculate and draw the
highlighted areas.
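
A minimal sketch of such a visual to logical lookup follows. It assumes
the same simplified implicit rule as in 3.1.4 (8th bit set = RTL letter,
no handling of neutral characters); the function name, the rtl_mode
argument, the return code and the MAX_LINE limit are all hypothetical.

```c
#include <string.h>

#define MAX_LINE 256    /* hypothetical limit on the line length */

static int is_rtl_char(unsigned char c)
{
    return (c & 0x80) != 0;    /* assumption: 8th bit marks RTL letters */
}

/* Reverse the entries a[first..last] in place. */
static void reverse_ints(int *a, int first, int last)
{
    while (first < last) {
        int tmp = a[first];
        a[first++] = a[last];
        a[last--] = tmp;
    }
}

/* Store in *lpos the logical index of the character displayed at visual
 * index vpos. Returns 0 on success, -1 if vpos is outside the line or
 * the line is too long for this sketch. */
int sketch_visual_to_logical(const unsigned char *logical_str,
                             int rtl_mode, int vpos, int *lpos)
{
    int map[MAX_LINE];    /* map[v] = logical index shown at visual v */
    int n = (int)strlen((const char *)logical_str);
    int i;

    if (n > MAX_LINE || vpos < 0 || vpos >= n)
        return -1;

    for (i = 0; i < n; i++)
        map[i] = i;
    if (rtl_mode)
        reverse_ints(map, 0, n - 1);

    /* Same run reversal as for the string itself, applied to indices. */
    i = 0;
    while (i < n) {
        int start = i;
        int rtl = is_rtl_char(logical_str[map[i]]);

        while (i < n && is_rtl_char(logical_str[map[i]]) == rtl)
            i++;
        if (rtl != (rtl_mode != 0))
            reverse_ints(map, start, i - 1);
    }

    *lpos = map[vpos];
    return 0;
}
```

Keeping an index map rather than a reordered copy of the text means the
same table can serve both the mouse lookup and the highlight drawing.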

Remark: If the system displays a mouse insertion position (like the I beam
    in editing areas), we need a counterpart function that performs the
    reverse operation: CheckPositionLogicalToVisual(logical_str,lpos,vpos).
    This is not the case with a WEB browser, since all input areas are
    handled by OS libraries and not directly managed by the HTML main widget.
    However, if the localization kit also covers the Netscape Gold editor
    product, this function should be implemented.


3.1.6  Text selection management

This is one of the most complex features to handle for a BI-DI language.
The problem is that the visual position of the characters of a string can be
totally different from their internal logical position in the input
buffer. As a result, a single contiguous selection range in the buffer can
appear as one, two or three separate highlighted areas on the screen.

This is the case for example with mixed text selection:

If you have the string "123abc456" where numbers represent RTL directional
letters and letters abc are just a Latin "abc" string, you can have the
following situations : ('H' attribute represents Highlight)

        - The internal string is     :   123abc456
        - The visual string is       :   321abc654

        - You select from '2' to '3' :   HH.......     (1 highlight area)
        - You select from '2' to 'b' :   HH.HH....     (2 highlight areas)
        - You select from '2' to '5' :   HH.HHH.HH     (3 highlight areas)

This process is done by the Arabic routine during TextRefresh(). The
Selection_start and Selection_end values are stored in new HTML structure
variables and must be kept up to date at all times.
In all cases, the TextRefresh() function should have a pointer to the whole
text in order to make the correct context analysis.
The usual Latin text display optimization, which redraws only the
last modified (or selected) character, should be disabled.
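
The "123abc456" example can be reproduced with the short sketch below. It
is an illustration only: it follows the notation of the example, where
digits stand for RTL letters, it handles LTR main mode only, and the
function name is hypothetical.

```c
#include <ctype.h>
#include <string.h>

/* Demonstration only: as in the example, digits stand for RTL letters. */
static int is_rtl_char(unsigned char c)
{
    return isdigit(c) != 0;
}

/* Fill mask with 'H' for each highlighted visual cell and '.' for the
 * others, given a logical selection from sel_start to sel_end
 * (inclusive indexes in logical_str). LTR main mode only.
 * mask must have room for the whole line plus the terminating zero. */
void sketch_highlight_mask(const char *logical_str,
                           int sel_start, int sel_end, char *mask)
{
    int n = (int)strlen(logical_str);
    int i = 0;

    while (i < n) {
        int start = i;
        int rtl = is_rtl_char((unsigned char)logical_str[i]);
        int k;

        /* Find the end of the current run of same-direction characters. */
        while (i < n && is_rtl_char((unsigned char)logical_str[i]) == rtl)
            i++;

        for (k = start; k < i; k++) {
            /* An RTL run is mirrored inside its own span on screen. */
            int v = rtl ? start + (i - 1 - k) : k;

            mask[v] = (k >= sel_start && k <= sel_end) ? 'H' : '.';
        }
    }
    mask[n] = '\0';
}
```

Selecting from '2' to '5' in "123abc456" corresponds to sel_start = 1 and
sel_end = 7, and gives the three highlight areas listed in the example.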

3.2 : Arabic Language specific support
--------------------------------------

In addition to the BI-DI support, the Arabic language needs contextual
text shaping. This process follows known rules and
is easy to implement using an Arabic language toolkit. The condition is that
the "context analysis process" must be done on the whole line before
each display through XDrawString() or string width calculation with
XTextExtents(). The whole line is needed in order to know the exact context
around every character.
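
The following sketch shows what the context analysis has to decide for
each letter. It is not the XLANGBOX-ARA algorithm: the names are
hypothetical, the ISO 8859-6 byte values are assumptions to be checked
against the code table, vowel marks are simply treated as non-joining,
and the choice of the actual glyph in the font is left out.

```c
enum join_type  { JOIN_NONE, JOIN_RIGHT, JOIN_DUAL };
enum glyph_form { FORM_ISOLATED, FORM_FINAL, FORM_INITIAL, FORM_MEDIAL };

/* Joining behaviour of one ISO 8859-6 byte (assumed values).
 * JOIN_RIGHT : joins only to the preceding letter (alef, dal, reh...)
 * JOIN_DUAL  : joins to both the preceding and the following letter */
static enum join_type join_type_of(unsigned char c)
{
    if (c < 0xC2 || c > 0xEA || (c > 0xDA && c < 0xE0))
        return JOIN_NONE;   /* Latin, hamza, unassigned, vowel marks */

    switch (c) {
    case 0xC2: case 0xC3: case 0xC4: case 0xC5:   /* alef/waw + hamza */
    case 0xC7: case 0xC9:                         /* alef, teh marbuta */
    case 0xCF: case 0xD0: case 0xD1: case 0xD2:   /* dal thal reh zain */
    case 0xE8: case 0xE9:                         /* waw, alef maksura */
        return JOIN_RIGHT;
    default:
        return JOIN_DUAL;
    }
}

/* Compute the contextual form of every character of a whole logical
 * line. forms must have one entry per character of logical_str. */
void sketch_glyph_forms(const unsigned char *logical_str,
                        enum glyph_form *forms)
{
    int i;

    for (i = 0; logical_str[i] != '\0'; i++) {
        enum join_type t = join_type_of(logical_str[i]);
        int joins_prev = i > 0 && t != JOIN_NONE &&
                         join_type_of(logical_str[i - 1]) == JOIN_DUAL;
        int joins_next = t == JOIN_DUAL &&
                         join_type_of(logical_str[i + 1]) != JOIN_NONE;

        if (joins_prev && joins_next)
            forms[i] = FORM_MEDIAL;
        else if (joins_prev)
            forms[i] = FORM_FINAL;
        else if (joins_next)
            forms[i] = FORM_INITIAL;
        else
            forms[i] = FORM_ISOLATED;
    }
}
```

For the word kaf, teh, alef, beh, the forms come out as initial, medial,
final and isolated: the last letter stays isolated only because of the
alef before it, which is why the analysis cannot work character by
character.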

For example, in case 3.1.3 above, the text_width value must be
calculated on the "visual" string and not directly on the ISO data
buffer.

This can easily be done by calling a specific external
function that transforms the ISO string into a CTX string. In English mode,
this function simply makes no modification (i.e. strcpy()); alternatively,
the transform function can be called only if a flag IsTextShapping is set.

The transformation process should also take into account the font used.
Since there is no standard for font encoding, several encodings exist
on the market. For AraMosaic, we used our own font encoding, but
our XLANGBOX-ARA routines are able to handle different font sets.


4- The printing process
-----------------------

The printing process must follow the same kind of algorithm (except
for the text selection) in order to build its page. Using PostScript output
is the simplest way to draw Arabic text.

        - We can add a font header to the file if the font is not resident,
        - The full line can be analyzed and each sub-part of the text can be
          'shown' in its respective font using macros.
          Example: where a standard full line display would generate:

                (blablabla...) S

          we must generate the PostScript commands :

                (english_text1) S (arabic_text) SA (english_text2) S

          where the macro SA selects the Arabic font and displays the text.
          Of course the whole line must be transformed before this operation.

          In RTL mode, the output should look like :

                (english_text1) AS (arabic_text) ASA (english_text2) AS

          At this point, the arabic_text string should already be transformed.
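
The LTR case above can be sketched as follows. The function names are
hypothetical, the S and SA macros are assumed to be defined in the
PostScript header as described, bytes with the 8th bit set are assumed to
be Arabic, and the input line is assumed to be already transformed.

```c
#include <stdio.h>
#include <string.h>

/* Append text to out if it fits; return -1 when the buffer is full. */
static int append(char *out, size_t out_size, size_t *used,
                  const char *text)
{
    size_t len = strlen(text);

    if (*used + len + 1 > out_size)
        return -1;
    memcpy(out + *used, text, len + 1);
    *used += len;
    return 0;
}

/* Build "(latin) S (arabic) SA ..." from an already transformed line.
 * Returns 0 on success, -1 if out is too small. */
int sketch_ps_line(const unsigned char *visual_str,
                   char *out, size_t out_size)
{
    size_t used = 0, i = 0;
    size_t n = strlen((const char *)visual_str);
    char piece[8];

    if (out_size == 0)
        return -1;
    out[0] = '\0';

    while (i < n) {
        int arabic = (visual_str[i] & 0x80) != 0;

        if (append(out, out_size, &used, used ? " (" : "(") < 0)
            return -1;
        while (i < n && ((visual_str[i] & 0x80) != 0) == arabic) {
            unsigned char c = visual_str[i++];

            if (c == '(' || c == ')' || c == '\\')
                sprintf(piece, "\\%c", c);         /* PostScript escape */
            else if (c & 0x80)
                sprintf(piece, "\\%03o", (unsigned)c);   /* octal code */
            else
                sprintf(piece, "%c", c);
            if (append(out, out_size, &used, piece) < 0)
                return -1;
        }
        if (append(out, out_size, &used, arabic ? ") SA" : ") S") < 0)
            return -1;
    }
    return 0;
}
```

Writing the Arabic bytes as octal escapes keeps the generated file 7 bit
clean; parentheses and backslashes must be escaped in any case, since
they delimit PostScript strings.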

5- Netscape localization tool kit Entry point
---------------------------------------------

We believe that in order to cover this issue, the Netscape localization
tool kit should cater to the following issues:

- Provide all menus/messages in Resource files.
  Companies that implement the localization will have to edit/translate
  this text and run Netscape under an already localized OS environment
  (i.e. Arabic Windows for the Windows version or XLANGBOX-ARA for UNIX
  versions). We presume that this is already done, since it is imperative
  even for ISO 8859-1 European language localization.

- Provide a full font configuration flexibility for the menus and
  the HTML documents. All HTML predefined styled fonts must be settable.

- The Netscape HTML page should manage the RTL alignment directly. This
  would be the most comprehensive solution and would avoid most of the
  recurring problems that arise if this handling is done externally.
  Otherwise, an entrypoint function should allow the X coordinate of
  each element to be changed dynamically using the algorithm :

  new_x = view_width - right_margin - left_margin - old_x - element_width \
          + X_offset;

  The entrypoint should look like :

        int                            /* New X coordinate in RTL, output */
        TransformXCoordonate(cur_x,view_w,right_m,left_m,element_w,Xoffset)
        int cur_x;                     /* Current X coordinate in LTR, input*/
        int view_w;                    /* HTML view page width, input */
        int right_m;                   /* right margin width, input */
        int left_m;                    /* left margin width, input */
        int element_w;                 /* current element width, input */
        int Xoffset;                   /* current HTML page X offset, input */

  In this case, the Horizontal scroll bar handling and the main X offset
  should also be managed through an external entry point. This is more 
  delicate, since we noticed that the X_offset is set within the Scroll_bar
  widget callback and its value is used in all functions.

- Before each string specific handling (drawing, width measurement...),
  Netscape should call a string transformation entry point.

        TransformLogicalToVisual(logical_str,visual_str)
        unsigned char *logical_str;       /* Logical data string, input */
        unsigned char *visual_str;        /* Visual data string, output */

  For non BI-DI languages, this entrypoint can simply be implemented as
  follows:

        #include <string.h>

        void
        TransformLogicalToVisual(logical_str,visual_str)
        unsigned char *logical_str;
        unsigned char *visual_str;
        {
                /* No reordering needed: copy the line unchanged */
                strcpy((char *)visual_str,(char *)logical_str);
        }

- The mouse pointing routine, when pointing to text area, should calculate the
  position in the text (like it already does in English, by adding the width
  of each character on the string and comparing it with the current X pointing
  value) by calling an entrypoint as follows:

        CheckPositionVisualToLogical(logical_str,vpos,lpos)
        unsigned char *logical_str;    /* Logical string, input */
        int vpos;                      /* visual position, input */
        int *lpos;                     /* logical position, output */

  For non BI-DI languages, this function should just assign *lpos = vpos

- If the Input widget must also be localized (i.e. Gold editor), the following
  entrypoint should also be called before displaying the insertion point I beam:

        CheckPositionLogicalToVisual(logical_str,lpos,vpos)
        unsigned char *logical_str;    /* Logical string, input */
        int lpos;                      /* logical position, input */
        int *vpos;                     /* visual position, output */
        
  For non BI-DI languages, this function should just assign *vpos = lpos.
  Warning: Input widget support also implies managing a dual virtual
  keyboard. The mapping and the toggle key should be definable through the
  localization kit.

- For the printing support, the browser should concatenate the header with
  a user defined header which could include specific fonts and specific
  Postscript macros. When generating the PS output file, the system
  should call the transformation routine:

      PostscriptTransformationLtoV(logical_str,ps_line,sfl,sfa,dltr,drtl)
      unsigned char * logical_str;  /* Logical text ISO , input*/
      unsigned char * ps_line;      /* Postscript command line, output */
      char * sfl ;                  /* Set Font Latin macro, input */
      char * sfa ;                  /* Set Font Arabic macro, input */
      char * dltr ;                 /* Display in LTR macro, input */
      char * drtl ;                 /* Display in RTL macro, input */

  If the browser does not directly generate the PS and uses the
  OS printing system, a transformation routine must be called anyway, unless
  the printing system is considered already arabized and performs this
  transformation itself.
  Also, a font setting option should be available to set printing fonts.


Conclusion
----------

We believe that there is currently a serious lack of Arabic language Web
support. With the growth of Intranet solutions, a Web browser that can
provide this support in a multi-platform environment can take the lead.
Eventually, a Localization tool kit should be able to cover all languages,
including complex BI-DI ones. The above list summarizes what we need as a
minimum from a tool kit, and we noticed that in many cases these points
cannot be covered by a classical toolkit designed only for European languages.
Also, if these points are taken into account at design time, it will
avoid costly re-design patches later.

For more details or for clarification, please contact Franck Portaneri at:
franck@langbox.com

******************************************************************************