RTL and BiDi text handling in general

Basics

The very basic concept of RTL and BiDi is that documents contain the characters in their logical order, the order in which the author typically writes them and a reader would read them out loud.

The BiDi algorithm (UAX #9) is applied to this data and rearranges the characters into their visual order. Readers of this document are encouraged to read up a little on the basics to get a rough idea (a reader-friendly article is probably a better start than the spec).

The BiDi algorithm can decide to mirror certain characters. E.g. an opening parenthesis U+0028 ( keeps its opening parenthesis semantics rather than its exact look (despite Unicode misleadingly calling it left parenthesis), and hence becomes a visual ) when placed in RTL context. Rendering implementations usually prefer to take the counterpart glyph (this data is available in Unicode) instead of mirroring the glyph, which might look ugly (e.g. in italics, or if mirrored after subpixel antialiasing). For copy-pasting and similar purposes the character should always remain the original codepoint.
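As a small illustration, Python's standard unicodedata module exposes the Unicode Bidi_Mirrored property. This is only a sketch of how a renderer might ask whether a character is subject to mirroring; the helper name is_mirrored is made up for this example, and the module does not provide the counterpart glyph mapping itself (that lives in Unicode's BidiMirroring.txt data file).

```python
import unicodedata

def is_mirrored(ch):
    """True if Unicode marks this character as mirrored in RTL context."""
    return unicodedata.mirrored(ch) == 1

# Hypothetical helper used for illustration only.
for ch in "()[]<a":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: mirrored={is_mirrored(ch)}")
```

Note that the codepoint is only inspected here, never replaced, in line with the requirement that the original codepoint is what gets copy-pasted.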

There are situations where applying the BiDi algorithm on the letters isn’t sufficient to get the desired layout. That’s where special BiDi control characters come into play: marks (these are just like English or Arabic/Hebrew letters, except for being zero-width and hence invisible), as well as embeddings, overrides and isolates that have opening and closing characters, selecting a continuous logical section of the text for special treatment. I recommend readers of this document to become somewhat familiar with them (there’s a brief introduction here as well).
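For reference, here is a short sketch listing the control characters mentioned above together with the bidirectional class that Python's unicodedata module reports for them. The dictionary name CONTROLS is just a label chosen for this example. The marks report the class of an ordinary strong letter (L or R), which matches their description as invisible letters.

```python
import unicodedata

# Abbreviation -> codepoint of the BiDi control characters discussed here.
CONTROLS = {
    "LRM": "\u200e", "RLM": "\u200f",                    # marks
    "LRE": "\u202a", "RLE": "\u202b",                    # embeddings
    "LRO": "\u202d", "RLO": "\u202e",                    # overrides
    "PDF": "\u202c",                                     # closes LRE/RLE/LRO/RLO
    "LRI": "\u2066", "RLI": "\u2067", "FSI": "\u2068",   # isolates
    "PDI": "\u2069",                                     # closes LRI/RLI/FSI
}

for abbr, ch in CONTROLS.items():
    print(f"{abbr}  U+{ord(ch):04X}  class={unicodedata.bidirectional(ch)}")
```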

Each paragraph is rendered separately. Unbalanced BiDi control characters are implicitly terminated at the end of the paragraph and cannot have an effect on subsequent paragraphs.

The BiDi algorithm always fills up the first visual line with as many characters as it can from the beginning of the logical string, then continues with the second line and so on. One’s eyes never have to move upwards to read the text. Characters are reordered within each line, but not across lines. (How they are reordered within a line depends on the contents of the entire paragraph, not just the given line.)
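The order of operations (wrap first, then reorder each line) can be sketched as follows. This is a minimal illustration, not a BiDi implementation: the embedding levels are assumed to have been resolved already for the whole paragraph and are written by hand, lines are cut at a fixed character count, and the function names are made up for this example. Only the reordering step (rule L2 of UAX #9) is implemented.

```python
def reorder_line(text, levels):
    """Reorder one line according to rule L2, given per-character levels."""
    chars, levels = list(text), list(levels)
    # Reverse every maximal run of characters at `level` or higher, from the
    # highest level down to 1. (Going down to 1 rather than to the lowest odd
    # level is equivalent: the extra passes cancel out in pairs.)
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < len(chars):
            j = i
            while j < len(chars) and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            levels[i:j] = reversed(levels[i:j])
            i = max(j, i + 1)
    return "".join(chars)

def wrap_then_reorder(text, levels, width):
    """Cut the logical string into lines first, then reorder each line."""
    return [reorder_line(text[s:s + width], levels[s:s + width])
            for s in range(0, len(text), width)]

# Uppercase stands for RTL letters; levels are assigned by hand.
text = "abc DEFGHIJ"
levels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
lines = wrap_then_reorder(text, levels, 7)
print(lines)  # ['abc FED', 'JIHG']
```

The first line receives the logically first seven characters. Reordering the whole paragraph first and wrapping afterwards would instead put JIH on the first line, forcing the reader to jump to the second line and back.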

Paragraph direction (a.k.a. base direction)

With the transition from 8-bit character sets to UTF-8 (on Unix systems), one of the biggest advantages was that strings can just be carried on their own, without any external piece of meta-information (as the charset used to be, or at least should have been) and they will appear correctly whenever presented to the user. At least, as long as only LTR is in the game.

Is this still true with BiDi in mind? Unfortunately not.

Consider this very simple example. The text is just two words, an English word followed by an Arabic one. The logical order is: english ARABIC.

Within LTR context, the BiDi algorithm lays this out starting with the first (English) word at the left, followed by the second (Arabic) word, whose letters need to be rendered in reverse order. The visual result will be

english CIBARA

aligned to the left margin.

Within RTL context, the first (English) word is placed first, to the starting (that is: right) edge, requiring special treatment (reversing the order of letters from the overall RTL direction). This will be followed by (that is: to its left) the Arabic word. So the overall visual result will be

CIBARA english

aligned to the right.

The so-called paragraph direction or base direction (e.g. the dir attribute in HTML, the direction property in CSS, GtkTextDirection in GTK+ etc.) is an important piece of information that can affect the visual order (hence the correctness of the displayed text), and is something that needs to be carried externally.
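The two renderings above can be reproduced with a toy model. This is a deliberately simplified sketch, not UAX #9: it assumes lowercase letters are LTR, uppercase letters are RTL, everything else is neutral, and it only handles neutral resolution, implicit levels and reordering. The function name toy_bidi is made up for this example.

```python
def toy_bidi(text, rtl_paragraph):
    """Visual order (written left to right) of `text` under a toy BiDi model."""
    base_dir = "R" if rtl_paragraph else "L"
    types = ["R" if c.isupper() else "L" if c.islower() else "N" for c in text]

    # Neutrals take the direction of their neighbours if both sides agree,
    # otherwise the paragraph direction. Paragraph edges count as base_dir.
    resolved, i, n = types[:], 0, len(types)
    while i < n:
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < n and types[j] == "N":
            j += 1
        before = types[i - 1] if i > 0 else base_dir
        after = types[j] if j < n else base_dir
        resolved[i:j] = [before if before == after else base_dir] * (j - i)
        i = j

    # Implicit levels: LTR paragraph L=0, R=1; RTL paragraph R=1, L=2.
    level_of = {"L": 2, "R": 1} if rtl_paragraph else {"L": 0, "R": 1}
    levels = [level_of[t] for t in resolved]

    # Reorder: reverse runs at each level or higher, highest level first.
    chars = list(text)
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            j = i
            while j < n and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            levels[i:j] = reversed(levels[i:j])
            i = max(j, i + 1)
    return "".join(chars)

print(toy_bidi("english ARABIC", rtl_paragraph=False))  # english CIBARA
print(toy_bidi("english ARABIC", rtl_paragraph=True))   # CIBARA english
```

The same logical string yields two different visual strings, and nothing in the string itself says which one is intended. The alignment (left or right) is not modelled here.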

Since carrying this piece of information externally is inconvenient, the paragraph direction is often auto-detected based on the text.
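One common heuristic is to look at the first character with a strong direction, which is roughly what rules P2 and P3 of UAX #9 describe. The sketch below is a simplification: it ignores the special handling of isolates, and the function name and the default parameter are assumptions made for this example.

```python
import unicodedata

def detect_direction(text, default="ltr"):
    """Guess the paragraph direction from the first strong character."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):   # Hebrew-like and Arabic-like letters
            return "rtl"
    return default               # no strong character at all

print(detect_direction("english \u0639\u0631\u0628\u064a"))  # ltr
print(detect_direction("\u0639\u0631\u0628\u064a english"))  # rtl
```

Digits and punctuation are skipped, so a string like "123" falls back to the default.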

Alignment usually follows the paragraph direction by default, but can be overridden (e.g. in HTML/CSS or word processors). In standard legacy terminal emulation there’s no way to override left alignment (a terminal-based application might of course right-align the text using explicit indentation in each line, but the terminal itself cannot right-align). So this is irrelevant for us, and I don’t see the need to introduce the separate concept of alignment for terminal emulators. All paragraphs will still be aligned according to their paragraph direction.

BiDi control characters

This subsection is a quick overview of how the BiDi control characters are usually used, in order to embed potentially foreign direction text (typically when some placeholder is substituted by a value that’s not known at the time of writing the software). Feel free to skip this subsection.

The old standard approach is to use embeddings (LRE…PDF or RLE…PDF) or overrides (LRO…PDF or RLO…PDF). Their design is not as good as it could be: they have more effect outside of their character pairs than desired, which is often referred to as “spillover”.

Example with embeddings and overrides (spaces added for readability):

1. abc  [RLE] ARABIC [PDF]  [RLE] HEBREW [PDF]  jkl

shows up as

abc WERBEH CIBARA jkl

and

2. abc  [RLO] def [PDF]  [RLO] ghi [PDF]  jkl 

shows up as

abc ihg fed jkl

The two embedding or override blocks are joined together and laid out from right to left as one; probably not what you’d first expect. This is an unfortunate result of the old (pre-Unicode 6.3) BiDi algorithm just assigning resolved embedding levels to each character and then eliminating the embedding and override controls. Characters of both these RTL segments receive the same resolved embedding level with nothing between them, so later the algorithm cannot tell that they don’t belong together.

A common practice (workaround) is to add a mark (LRM or RLM) after each PDF to restore the paragraph direction, resulting in the “expected” rendering:

3. abc  [RLE] ARABIC [PDF] [LRM]  [RLE] HEBREW [PDF] [LRM]  jkl 

shows up as

abc CIBARA WERBEH jkl

and

4. abc  [RLO] def [PDF] [LRM]  [RLO] ghi [PDF] [LRM]  jkl 

shows up as

abc fed ihg jkl
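The difference between examples 1 and 3 can be traced in terms of resolved levels. In the sketch below the levels are written by hand following the explanation above (they are not computed by a BiDi implementation), the embedding controls are already removed, and only levels 0 and 1 occur, so reordering boils down to reversing each maximal run of level 1 characters. The function name visual is made up for this example.

```python
from itertools import groupby

def visual(text, levels):
    """Reverse each maximal run of level-1 characters (levels 0 and 1 only)."""
    out = []
    for level, run in groupby(zip(text, levels), key=lambda pair: pair[1]):
        chars = [ch for ch, _ in run]
        out.extend(reversed(chars) if level == 1 else chars)
    return "".join(out)

LRM = "\u200e"

# Example 1: the space between the two RTL words also resolves to level 1,
# so the two words form a single run.
ex1 = "abc ARABIC HEBREW jkl"
ex1_levels = [0] * 4 + [1] * 13 + [0] * 4

# Example 3: the LRM is a strong LTR character at level 0; it and the
# following space keep the two RTL runs apart.
ex3 = "abc ARABIC" + LRM + " HEBREW" + LRM + " jkl"
ex3_levels = [0] * 4 + [1] * 6 + [0] * 2 + [1] * 6 + [0] * 5

print(visual(ex1, ex1_levels))                   # abc WERBEH CIBARA jkl
print(visual(ex3, ex3_levels).replace(LRM, ""))  # abc CIBARA WERBEH jkl
```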

Unicode 6.3 and UAX #9 (BiDi algorithm) revision 29, released in 2013, fix this by introducing isolates. (Obviously the unfortunate behavior of the existing control characters had to remain unchanged.) Isolates make the workaround with a following mark no longer necessary. Unlike embeddings and overrides, but like marks, isolate characters remain in the text throughout the BiDi algorithm and don’t get eliminated until the very end. I think they are pretty much a drop-in replacement for embeddings with the mark workaround (example 5), but they don’t override the direction, so if you really want to get reverse visual order (like the fed and ihg reverse English words in examples 2 and 4) then you should have an override inside (example 6):

5. abc  [RLI] ARABIC [PDI]  [RLI] HEBREW [PDI]  jkl 

showing up identically to 3.

6. abc  [RLI] [RLO] def [PDF] [PDI]  [RLI] [RLO] ghi [PDF] [PDI]  jkl 

showing up identically to 4.

Isolates also introduce a variant called FSI which autodetects the embedded string’s directionality, providing the most convenient way of embedding text: enclosing in FSI…PDI.
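For the typical placeholder substitution case this could look like the following minimal sketch. The helper names isolate and greet, and the message text, are made up for this example; only the two codepoints come from Unicode.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(value):
    """Wrap a value of unknown direction so it cannot affect its surroundings."""
    return f"{FSI}{value}{PDI}"

def greet(name):
    # The template is LTR English; `name` may be in any script.
    return f"Hello {isolate(name)}, you have 3 new messages"

print(repr(greet("\u05d3\u05e0\u05d9")))
```

Whether this renders as intended of course depends on the rendering stack supporting isolates.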

As of 2018, isolates are unfortunately still not widely supported. E.g. on Ubuntu, Firefox and LibreOffice Writer support them, the GTK+ stack began supporting them in Ubuntu 18.10 (with the update to FriBidi 1.0), the Qt/KDE stack doesn’t seem to support them and neither does Chromium in Ubuntu 18.10.

Shaping

BiDi comes hand in hand with shaping, the beautiful connected rendering of cursive Arabic letters. (Strictly speaking, shaping is unrelated to BiDi. It’s a mere coincidence that the script requiring shaping happens to be an RTL script and not an LTR one.) Each Arabic letter has up to 4 different forms depending on whether it’s connected to a previous and/or a next letter, plus there are mandatory ligatures.

The CSS3 specification says:

“When shaping scripts such as Arabic wrap […] the characters must still be shaped (their joining forms chosen) as if the word were still whole.”

I assume this is applicable generally, not just for webpages. I also assume that the described behavior is relevant for cropped text too (e.g. when the text crosses the terminal emulator window’s border), that is, even at cropping boundaries the letter should be shaped as if the entire word were visible.

There are two substantially different approaches to perform shaping.

Arabic text files consist of the Arabic letters beginning at U+0600, covered by several Unicode ranges. In addition, there are “presentation form” characters in two ranges, U+FB50–U+FDFF and U+FE70–U+FEFF. These latter ones should not occur in text files; fonts, however, can define glyphs there.

The simpler, older method for shaping is to use a library that replaces the codepoints with their “presentation form” variants (e.g. the FriBidi library has such methods), and then to render these glyphs. Allegedly this approach has some limitations and cannot perfectly render every possible text of every related language.
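To get a feel for what the presentation form characters look like in the Unicode data, the following sketch collects the forms of a given letter by scanning the two presentation form ranges and reading the compatibility decompositions that Python's unicodedata module reports. It only inspects the data; it is not a shaper, and the function name presentation_forms is made up for this example.

```python
import unicodedata

PRESENTATION_RANGES = [(0xFB50, 0xFDFF), (0xFE70, 0xFEFF)]

def presentation_forms(letter):
    """Map form name (isolated, final, initial, medial) to its codepoint."""
    target = [f"{ord(letter):04X}"]
    forms = {}
    for first, last in PRESENTATION_RANGES:
        for cp in range(first, last + 1):
            decomposition = unicodedata.decomposition(chr(cp))
            if not decomposition.startswith("<"):
                continue  # unassigned, or no tagged decomposition
            tag, *codepoints = decomposition.split()
            if codepoints == target:  # single letter only, no ligatures
                forms[tag.strip("<>")] = chr(cp)
    return forms

# ARABIC LETTER BEH, U+0628
for form, ch in presentation_forms("\u0628").items():
    print(f"{form:9} U+{ord(ch):04X} {unicodedata.name(ch)}")
```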

The preferred method is to avoid the presentation form characters, and use a library (e.g. HarfBuzz) that just renders the text as an opaque operation, handling a lot more special cases than the other approach could handle.

Inverse BiDi

The BiDi algorithm converts a logical text (potentially containing BiDi control characters) into a visual one (not containing BiDi control characters).

The “inverse BiDi” algorithm is the opposite: it takes the desired visual order of glyphs, and produces a logical order. The requirement is that applying the BiDi algorithm on the result should give back the exact same visual order, i.e. bidi(inverse_bidi(visual)) == visual.

Quoting from ICU User Guide:

“there is no standard algorithm for it. ICU’s BiDi API provides a setting for "inverse" operation that modifies the standard Unicode Bidi algorithm. However, it may not always produce the expected results.”

The FriBidi library doesn’t even attempt to address this functionality.

Does an “inverse BiDi” algorithm exist in theory which never produces any BiDi control characters? No, there’s no such algorithm. Here’s probably the simplest possible proof. In an LTR paragraph, the logical strings A1 and 1A (A denotes an RTL letter here) are both rendered as visual 1A. Hence, without using BiDi control characters, there cannot be a logical text that appears in an LTR paragraph as visual A1. Or if someone wonders whether an entirely different string could end up being displayed as A1: over a given finite set of characters, the number of possible logical and visual strings of a given length is the same, and the mapping isn’t injective because we’ve just seen a clash, therefore it cannot be surjective either, that is, there must be visual strings that no logical string maps to.
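The counting argument can be checked by brute force on a toy model. The sketch below only covers strings over the two symbols A (an RTL letter) and 1 (a digit) in an LTR paragraph, and implements just the handful of UAX #9 rules that matter for them (W7, I1, L2); the function name toy_visual is made up for this example.

```python
from itertools import product

def toy_visual(logical):
    """Visual order of a string over {'A', '1'} in an LTR paragraph."""
    levels = []
    last_strong = "L"              # start of an LTR paragraph counts as L
    for ch in logical:
        if ch == "A":
            last_strong = "R"
            levels.append(1)       # I1: RTL letter gets level 1
        elif last_strong == "L":
            levels.append(0)       # W7: digit after L behaves like L
        else:
            levels.append(2)       # I1: digit after an RTL letter gets level 2
    chars = list(logical)
    for level in (2, 1):           # L2: reverse runs, highest level first
        i = 0
        while i < len(chars):
            j = i
            while j < len(chars) and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            levels[i:j] = reversed(levels[i:j])
            i = max(j, i + 1)
    return "".join(chars)

all_strings = {"".join(p) for p in product("A1", repeat=2)}
reachable = {toy_visual(s) for s in all_strings}
print(sorted(all_strings - reachable))  # ['A1']
```

Four logical strings map onto only three visual ones, so one visual string, A1, is left without any logical counterpart.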

If BiDi controls are allowed in the result then there’s of course a big hammer: just place the text inside an override (LRO…PDF or RLO…PDF). We’ll later see what the downsides of this big hammer are. There might be more sophisticated algorithms than this big hammer.
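In code, the big hammer is about as simple as it gets. This is only a sketch of the LRO…PDF variant, with a made-up function name; the RLO…PDF variant is not shown.

```python
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def inverse_bidi_big_hammer(visual):
    """Force the given left-to-right visual order by overriding everything."""
    # Inside LRO...PDF every character is treated as a strong LTR character,
    # so the logical order is displayed as it is.
    return f"{LRO}{visual}{PDF}"

print(repr(inverse_bidi_big_hammer("A1")))
```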

On a side note, it’s unclear to me if there’s an “inverse shaping” algorithm, i.e. whether the shaped text using “presentation form” characters can be “unshaped” to the proper logical text.
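One partial data point: Unicode assigns compatibility decompositions to the presentation form characters, so at least a mechanical mapping back to the base letters exists. The sketch below applies NFKC normalization only to characters in the two presentation form ranges. The function name unshape is made up for this example, and I make no claim that this recovers the proper logical text in every case; it merely shows what the Unicode data offers.

```python
import unicodedata

def is_presentation_form(ch):
    return "\ufb50" <= ch <= "\ufdff" or "\ufe70" <= ch <= "\ufeff"

def unshape(text):
    """Replace presentation form characters by their compatibility mapping."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_presentation_form(ch) else ch
        for ch in text
    )

# Initial, medial and final BEH map back to three plain BEH letters;
# the lam-alef ligature maps back to LAM followed by ALEF.
print(unshape("\ufe91\ufe92\ufe90") == "\u0628\u0628\u0628")  # True
print(unshape("\ufefb") == "\u0644\u0627")                    # True
```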

We’ll see later why it’s of any interest at all to us, and we’ll discuss how much we can or should avoid the need for an inverse BiDi or inverse shaping.