+----------------------------------------------------------------------------+
| LangBox International                   Tel: +33 (0)4 9371 1410            |
|                                         Fax: +33 (0)4 9371 1560            |
| Imm. SPACE - Bat B                      Email : support@langbox.com        |
| 208/212 Rte de Grenoble                 or    : langbox@spartacus.com      |
| 06200 Nice - France                     http://www.langbox.com             |
+----------------------------------------------------------------------------+
Copyrights LangBox International

******************************************************************************
*                               Technical Memo                               *
*   Technical extension needed for BI-DI language support to a Web Browser   *
******************************************************************************
*                                                                            *
* From : Franck Portaneri - franck@langbox.com                               *
* Date : 22 Aug 96                                                           *
* To   : Netscape                                                            *
* Info : http://www.langbox.com/AraMosaic                                    *
******************************************************************************

This memo summarizes the points to address when adding BI-DI language
support, and more precisely Arabic language support, to a Web browser.
It results from the Arabic support that LangBox added to X-Mosaic, which
produced AraMosaic in June 96. It has been written to define the Arabic
language localization needs for an eventual Netscape Localization toolkit,
following our previous contacts with Netscape.

1- Codeset/Font
---------------

The codeset used for Arabic is ISO 8859-6. This is the best codeset for
UNIX platforms, where the other ISO 8859 counterparts have already been
selected. However, due to the market share of Windows 3.x systems and to
the Microsoft Arabic version of Windows, the codeset CP 1256 is also
widespread, and most Arabic text Web servers provide both encodings
(ISO 8859-6 for Mac and UNIX, and CP 1256 for Windows).

Other BI-DI languages (such as Hebrew with ISO 8859-8, Farsi with
ISIRI 3342 ...) are also based on 8 bit codesets.
The arrival of UNICODE might be the global and definitive solution in the
coming years, but today the market still relies on 8 bit codesets. In any
case, the codeset choice is not the major problem in handling BI-DI
languages, and the following explanations are valid for both 8 bit and
16 bit codesets.

Since AraMosaic is built on UNIX, we chose ISO 8859-6, bearing in mind
that a dynamic codeset conversion can be applied to the input/output
character flow in order to support CP 1256.

The fonts used for the various HTML styles must be changed to Arabic
fonts. Several sizes, weights and styles must be available to cover all
predefined HTML styles. The solution is to install these fonts on the OS
font server and select them for drawing text.

For Arabic, the nicer fonts are proportional width fonts. However, fixed
width fonts exist and can be used for formatted sections.

Also, for the Arabic (or Farsi) language, the fontset includes more
characters than the codeset encodes. This is due to the various shapes
needed to correctly display an encoded string.

2- Localization of menus/messages
----------------------------------

The Mosaic menus and dialog boxes are handled by Motif. They can be
localized by the following operations:

  - Edit/Translate the X Resource files from English to Arabic
  - Run Mosaic with an Arabized version of Motif (i.e. XLANGBOX-ARA)

For other platforms such as MS Windows, the same approach should be
applied under the Arabic edition of Windows 3.x or Windows 95.

This has not been done, since it does not present major problems and is
really needed only for a final commercial package. It was not the purpose
of the AraMosaic 1.0 project.

3- The HTML page widget
-----------------------

Several points of the HTML page handling need to be covered to support
BI-DI languages and Arabic.

3.1 The BI-DI language support
------------------------------

BI-DI languages are native languages that are read from right to left
(mainly Arabic, Farsi, Hebrew, Urdu...) and in which English sub-strings,
read from left to right, can be inserted. This mainly creates problems
for text display and text selection.

The BI-DI language support must allow the main HTML page to be built from
right to left (and top to bottom), and must always provide the ability to
return to the original left to right layout in order to support a
bilingual environment.

This BI-DI support should be embedded into all elements of the HTML page
and into the way they are refreshed. The management of the horizontal
scroll bar should also be enhanced (see 3.1.1).

The BI-DI support must be embedded for all element types inserted into
the HTML page:

  - Text (TextRefresh())
  - Horizontal rules (HRuleRefresh())
  - Bullet (BulletRefresh())
  - Images (ImageRefresh())
  - Table (TableRefresh())
  - Input widgets or select widget (WidgetRefresh())
  - ...

and managed by a new HTML page structure variable: the RTL Mode. This
flag could be set by the user through a preference dialog box, or
dynamically on receipt of an HTML markup (encoding = ISO8859-6 or
LANGUAGE=ar or align=right in the HTML Header). When the value of this
variable changes, the whole page should be rebuilt and redisplayed.

3.1.1 : Scroll bar management

If the RTL mode flag is active, all the management of the scroll bar
should be reversed. The slider must be initialized on the right side of
the scroll bar (XmNprocessingDirection=XmMAX_ON_LEFT), and a movement
from right to left should decrease the X offset.

3.1.2 : Initialize the X origin of all elements

Under RTL, the X origin of all elements must be laid out from the right
side of the page towards the left (i.e. from
hw->html.view_width - margin_width to margin_width), and all the related
code should be aware of this.
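
The right to left placement described in 3.1.2 can be sketched as a small
line cursor. This is only an illustration under stated assumptions: the
names RtlLineCursor, rtl_line_start() and rtl_line_place() are
hypothetical and are not Mosaic functions, and a single margin_width is
assumed on both sides of the page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical line cursor for RTL layout; not part of Mosaic. */
typedef struct {
    int margin_width;  /* page margin on each side, in pixels        */
    int cursor;        /* right edge available to the next element   */
} RtlLineCursor;

/* Start a new line: the cursor begins at view_width - margin_width. */
void rtl_line_start(RtlLineCursor *line, int view_width, int margin_width)
{
    assert(line != NULL);
    line->margin_width = margin_width;
    line->cursor = view_width - margin_width;
}

/* Reserve room for one element and return its RTL X origin (its right
 * edge), or -1 when the element would cross the left margin. */
int rtl_line_place(RtlLineCursor *line, int element_width)
{
    int x;

    assert(line != NULL && element_width >= 0);
    if (line->cursor - element_width < line->margin_width)
        return -1;               /* caller should start a new line */
    x = line->cursor;
    line->cursor -= element_width;
    return x;
}
```

With a view_width of 800 and a margin_width of 20, the first element is
placed at X origin 780 and the following ones move towards the left
margin. The value returned is the RTL coordinate (the right edge), not
yet the physical coordinate expected by the drawing routine.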
However, if the browser accepts a dynamic language orientation switch
(through a toggle button in a menu or a specific interpreted HTML
markup), the page can be totally rebuilt from left to right, as it
normally is in English.

Also in RTL mode, after a resize action, the X coordinates of all
elements should be updated according to the new view_width. This is done
in the functions CreateElement() and SetElement().

3.1.3 : Display/Refresh of each element

All functions in charge of displaying (or refreshing) each type of
element need to be enhanced in order to manage the X coordinate change in
RTL mode. The basic function that redisplays an element (text, image,
applet...) is always the same, is part of the OS library and operates
from left to right. Before calling this low level routine, the browser
functions must therefore calculate the physical X coordinate, which is
the element X coordinate less the element width.

       Standard LTR mode                      RTL mode
+-----------------------------+    +-----------------------------+
| ABCD   =======              |    |              =======   ABCD |
|        =======              |    |              =======        |
|        =======              |    |              =======        |
|                             |    |                             |
  ^      ^                                        ^      ^  ^   ^
  |      |                                        |      |  |   |
LXt=PXt LXi=PXi                                  PXi    RXi PXt RXt

   LXt = LTR Text X coordinate
   LXi = LTR Image X coordinate
   RXt = RTL Text X coordinate
   RXi = RTL Image X coordinate
   PXt = Physical Text X coordinate for Basic routine (XDrawString())
   PXi = Physical Image X coordinate for Basic routine (XCopyArea())

   In LTR Mode, PXt = LXt and PXi = LXi.
   In RTL Mode, PXt = RXt - text_width = View_width - LXt - text_width
           and  PXi = RXi - Img_width  = View_width - LXi - Img_width

This algorithm must be implemented in every element redrawing function.

3.1.4 Text logical and text visual

In BI-DI languages, the visual text (displayed) does not follow the same
order as the logical text (stored). English letters are still written
from left to right, but all national letters are written from right to
left.
The mixing of both languages in the same line creates complex results.
If you have the string "abc123def", where the digits represent RTL
directional letters and the others are plain Latin letters, you can have
the following situations:

  - The internal ISO 8859-6 string is      : abc123def

  - In LTR main mode, the visual string is : abc321def
      +------------------------------+
      |abc321def                     |
      |                              |
      +------------------------------+

  - In RTL main mode, the visual string is : def321abc
      +------------------------------+
      |                     def321abc|
      |                              |
      +------------------------------+

So before each string specific handling (width measurement, drawing...),
the browser should call a transformation routine:

      TransformLogicalToVisual(logical_str,visual_str)
      unsigned char *logical_str;
      unsigned char *visual_str;

In all cases, logical_str should point to the whole text line, and the
display function should refresh the whole line.

The transformation can be done using an Implicit algorithm (this is the
case with 8 bit codesets, where Arabic characters are distinguished from
Latin ones by the 8th bit) or an Explicit algorithm that uses control
characters to determine language, direction... But this is done inside
TransformLogicalToVisual() and is opaque to the browser.

3.1.5 Mouse pointing management

The mouse management function should take the above changes into account
when calculating the pointed element from the mouse position on the
screen. For Mosaic, this function is LocateElement().

Also, since the visual string is different from the internal one, the
mouse pointing function needs to follow a specific algorithm when
clicking on text. The solution is to call a specific external function
that calculates the internal position for a visual position.
      CheckPositionVisualToLogical(logical_str,vpos,lpos)
      char *logical_str;
      int vpos, *lpos;

The visual position of the mouse (i.e. the order number of the character
located at the mouse position) is given in vpos, and the internal buffer
is pointed to by logical_str. The function returns the corresponding
logical position of the character in lpos.

So during a selection action, the mouse pointing routine
(LocateElement()) returns the logical character position of the mouse,
and the TextRefresh() function uses this value as the selection area to
calculate and draw the highlighted areas.

Remark: If the system displays a mouse insertion position (like the
I beam in editing areas), we need a counterpart function that performs
the reverse operation: CheckPositionLogicalToVisual(logical_str,lpos,vpos).
This is not the case with a Web browser, since all input areas are
handled by OS libraries and not directly managed by the HTML main widget.
However, if the localization kit also covers the Netscape Gold editor
product, this function should be implemented.

3.1.6 Text selection management

This is one of the most complex features to handle for a BI-DI language.
The problem is that the visual position of the characters of a string can
be totally different from their internal logical position in the input
buffer. Also, a single selected block of text can appear as one, two or
three separate highlighted areas on the screen.

This is the case, for example, with a mixed text selection. If you have
the string "123abc456", where the digits represent RTL directional
letters and abc is a plain Latin "abc" string, you can have the following
situations ('H' represents a highlighted character):

  - The internal string is       : 123abc456
  - The visual string is         : 321abc654
  - You select from '2' to '3'   : HH.......   (1 highlight area)
  - You select from '2' to 'b'   : HH.HH....   (2 highlight areas)
  - You select from '2' to '5'   : HH.HHH.HH   (3 highlight areas)

This process is done by the Arabic routine during the TextRefresh. The
Selection_start and Selection_end values are stored in new HTML structure
variables and should be kept up to date permanently.

In all cases, the TextRefresh function should have a pointer to the whole
text in order to make the correct context analysis. The usual Latin text
display optimization, which redraws only the last modified (or selected)
character, should be disabled.

3.2 : Arabic Language specific support
--------------------------------------

In addition to the BI-DI support, the Arabic language needs context
shaping of the text. This process is made of known rules and is easy to
implement using an Arabic language toolkit. The condition is that the
"context analysis process" must be done on the whole line before each
display through XDrawString() or each string length calculation with
XTextExtents(). The whole line is needed in order to know the exact
context around every character.

For example, in case 3.1.3 above, the text_width value must be calculated
on the "visual" string and not directly on the ISO data buffer.

This can easily be done by calling a specific external function that
transforms the ISO string into a CTX string. For English mode, this
function simply makes no modification (i.e. strcpy()), or the transform
function can be called only if a flag IsTextShapping is set to on.

The transformation process should also take into account the font used.
Since there is no standard for font encoding, several encodings exist on
the market. For AraMosaic we used our own font encoding, but our
XLANGBOX-ARA routines are able to handle different font sets.

4- The printing process
-----------------------

The printing process must follow the same kind of algorithm (except for
the text selection) in order to build its page. Using Postscript output
is the simplest way to draw Arabic text.
  - We can add a font header to the file if the font is not resident,
  - The full line can be analyzed and each sub-part of the text can be
    'showed' in its respective font using macros.

    Example: For a full standard line display generation:

           (blablabla...) S

    we must generate the Postscript commands:

           (english_text1) S (arabic_text) SA (english_text2) S

    where the macro SA selects the Arabic font and displays the text.
    Of course, the whole line must be transformed before this operation.

    In RTL mode, the output should look like:

           (english_text1) AS (arabic_text) ASA (english_text2) AS

    At this point, the arabic_text string should already be transformed.

5- Netscape localization tool kit Entry point
---------------------------------------------

We believe that, in order to cover this issue, the Netscape localization
tool kit should address the following points:

  - Provide all menus/messages in Resource files. Companies that
    implement the localization will have to edit/translate this text and
    run Netscape under an already localized OS environment (i.e. Arabic
    Windows for the Windows version or XLANGBOX-ARA for UNIX versions).
    We presume that this is already done, since it is imperative even for
    the localization of ISO 8859-1 European languages.

  - Provide full font configuration flexibility for the menus and the
    HTML documents. All HTML predefined style fonts must be settable.

  - The Netscape HTML page should manage the RTL alignment directly.
    This would be the most comprehensive solution, and it would avoid
    most of the recurring problems that arise if this handling is done
    externally.
    Otherwise, an entrypoint function should allow the X coordinate of
    each element to be changed dynamically, using the algorithm:

        new_x = view_width - right_margin - left_margin - old_x - element_width \
                + X_offset;

    The entrypoint should look like:

        int                 /* New X coordinate in RTL, output      */
        TransformXCoordonate(cur_x,view_w,right_m,left_m,element_w,Xoffset)
        int cur_x;          /* Current X coordinate in LTR, input   */
        int view_w;         /* HTML view page width, input          */
        int right_m;        /* right margin width, input            */
        int left_m;         /* left margin width, input             */
        int element_w;      /* current element width, input         */
        int Xoffset;        /* current HTML page X offset, input    */

    In this case, the horizontal scroll bar handling and the main X
    offset should also be managed through an external entry point. This
    is more delicate, since we noticed that the X_offset is set within
    the scroll bar widget callback and its value is used in all
    functions.

  - Before each string specific handling (drawing, width measurement...),
    Netscape should call a string transformation entry point.
        TransformLogicalToVisual(logical_str,visual_str)
        unsigned char *logical_str; /* Logical data string, input */
        unsigned char *visual_str;  /* Visual data string, output */

    For non BI-DI languages, this entrypoint should simply be replaced
    as follows:

        TransformLogicalToVisual(logical_str,visual_str)
        unsigned char *logical_str;
        unsigned char *visual_str;
        {
            /* No reordering needed: copy the line unchanged */
            strcpy((char *)visual_str,(char *)logical_str);
        }

  - The mouse pointing routine, when pointing to a text area, should
    calculate the position in the text (as it already does in English,
    by adding the width of each character of the string and comparing it
    with the current pointer X value) and then call an entrypoint as
    follows:

        CheckPositionVisualToLogical(logical_str,vpos,lpos)
        unsigned char *logical_str; /* Logical string, input    */
        int vpos;                   /* visual position, input   */
        int *lpos;                  /* logical position, output */

    For non BI-DI languages, this function should just assign
    *lpos = vpos.

  - If the Input widget should also be localized (i.e. Gold editor), the
    following entrypoint should also be called before displaying the
    insertion point I beam:

        CheckPositionLogicalToVisual(logical_str,lpos,vpos)
        unsigned char *logical_str; /* Logical string, input    */
        int lpos;                   /* logical position, input  */
        int *vpos;                  /* visual position, output  */

    For non BI-DI languages, this function should just assign
    *vpos = lpos.

    Warning: The Input widget support also implies managing a dual
    virtual keyboard. The mapping and the toggle key should be definable
    through the localization kit.

  - For the printing support, the browser should concatenate its header
    with a user defined header, which could include specific fonts and
    specific Postscript macros.
    When generating the PS output file, the system should call the
    transformation routine:

        PostscriptTransformationLtoV(logical_str,ps_line,sfl,sfa,dltr,drtl)
        unsigned char * logical_str; /* Logical text ISO, input         */
        unsigned char * ps_line;     /* Postscript command line, output */
        char * sfl;                  /* Set Font Latin macro, input     */
        char * sfa;                  /* Set Font Arabic macro, input    */
        char * dltr;                 /* Display in LTR macro, input     */
        char * drtl;                 /* Display in RTL macro, input     */

    If the browser does not generate the PS directly and uses the OS
    printing system, a transformation routine must be called anyway,
    unless the printing system is already arabized and performs this
    transformation on its own.

    Also, a font setting option should be available to set the printing
    fonts.

Conclusion
----------

We believe that there is currently a serious lack of Arabic language Web
support. With the growth of Intranet solutions, a Web browser that
provides this support in a multi-platform environment can take the lead.

Eventually, a Localization tool kit should be able to cover all
languages, including complex BI-DI ones. The above list summarizes what
we need as a minimum for a tool kit, and we noticed that in many cases
these points cannot be covered by a classical toolkit designed only for
European languages.

Also, if these points are taken into account at design time, costly
re-design patches will be avoided later.

For more details or for clarification, please contact Franck Portaneri
at: franck@langbox.com
******************************************************************************
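
As an illustration of sections 3.1.3 and 3.1.4, the physical X
coordinate rule and a very reduced Implicit reordering can be sketched
in C as follows. This is a minimal sketch and not the XLANGBOX-ARA
implementation. The names physical_x() and logical_to_visual() are
hypothetical, the main direction is passed as an explicit rtl_mode
parameter (the memo's TransformLogicalToVisual() takes no such
argument), every byte with the 8th bit set is assumed to be an RTL
letter, and digits, spaces and punctuation are simply treated as LTR.
Arabic context shaping (section 3.2) is not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Physical X for the low level routine (XDrawString(), XCopyArea()...).
 * ltr_x is the LTR X coordinate of the element (LXt or LXi). */
int physical_x(int rtl_mode, int view_width, int ltr_x, int element_width)
{
    if (!rtl_mode)
        return ltr_x;                           /* PX = LX         */
    return view_width - ltr_x - element_width;  /* PX = RX - width */
}

/* Assumption: in an 8 bit codeset, the 8th bit marks an RTL letter. */
static int is_rtl_byte(unsigned char c)
{
    return (c & 0x80) != 0;
}

/* Copy len bytes from src to dst, reversed when reverse is non-zero. */
static void copy_run(unsigned char *dst, const unsigned char *src,
                     size_t len, int reverse)
{
    size_t i;

    for (i = 0; i < len; i++)
        dst[i] = reverse ? src[len - 1 - i] : src[i];
}

/* Reorder one whole logical line into visual order.
 * visual_str must have room for strlen(logical_str) + 1 bytes and must
 * not overlap logical_str. */
void logical_to_visual(const unsigned char *logical_str,
                       unsigned char *visual_str, int rtl_mode)
{
    size_t len, start, end, out;
    int rtl;

    assert(logical_str != NULL && visual_str != NULL);
    len = strlen((const char *)logical_str);
    start = 0;
    while (start < len) {
        /* Find the end of the current run of same-direction bytes. */
        rtl = is_rtl_byte(logical_str[start]);
        end = start + 1;
        while (end < len && is_rtl_byte(logical_str[end]) == rtl)
            end++;
        /* LTR main mode keeps the run in place; RTL main mode mirrors
         * the run position so that runs are laid out from the right. */
        out = rtl_mode ? len - end : start;
        /* Bytes inside an RTL run are always reversed. */
        copy_run(visual_str + out, logical_str + start, end - start, rtl);
        start = end;
    }
    visual_str[len] = '\0';
}
```

With the memo's notation, where digits stand for RTL letters, this
reduced algorithm maps "abc123def" to "abc321def" in LTR main mode and
to "def321abc" in RTL main mode. As stated in 3.2, the text_width given
to physical_x() should be measured on the visual string, after the
transformation.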