RTL and BiDi text handling in general
The very basic concept of RTL and BiDi is that documents contain the characters in their logical order: the order in which the author typically writes them and a reader would read them out loud.
The BiDi algorithm (UAX #9) is applied to this data and rearranges the characters, resulting in the visual order. Readers of this document are encouraged to read up a little on the basics to get a rough idea (probably from a reader-friendly article rather than the spec).
The BiDi algorithm can decide to mirror certain characters. E.g. an opening parenthesis U+0028 “(” keeps its opening parenthesis semantics rather than its exact look (even though Unicode incorrectly calls it left parenthesis), and hence becomes a visual “)” when placed in RTL context. Rendering implementations usually prefer to take the counterpart glyph (this data is available in Unicode) instead of mirroring, which might look ugly (e.g. in italics, or if mirrored after subpixel antialiasing treatment). For copy-pasting and similar purposes it should always remain the original codepoint.
There are situations where applying the BiDi algorithm on the letters isn’t sufficient to get the desired layout. That’s where special BiDi control characters come into play: marks (these are just like English or Arabic/Hebrew letters, except for being zero-width and hence invisible), as well as embeddings, overrides and isolates that have opening and closing characters, selecting a contiguous logical section of the text for special treatment. I recommend that readers of this document become somewhat familiar with them (there’s a brief introduction here as well).
Each paragraph is rendered separately. Unbalanced BiDi control characters are implicitly terminated at the end of the paragraph and cannot have an effect on subsequent paragraphs.
The BiDi algorithm always fills up the first visual line with as many characters as it can from the beginning of the logical string, then continues with the second line and so on. One’s eyes never have to move upwards to read the text. Characters are reordered within each line, but not across lines. (How they are reordered within a line depends on the contents of the entire paragraph, not just the given line.)
Paragraph direction (a.k.a. base direction)
With the transition from 8-bit character sets to UTF-8 (on Unix systems), one of the biggest advantages was that strings can just be carried on their own, without any external piece of meta-information (as the charset used to be, or at least should have been) and they will appear correctly whenever presented to the user. At least, as long as only LTR is in the game.
Is this still true with BiDi in mind? Unfortunately not.
Consider this very simple example. The text is just two words, an English word followed by an Arabic one; the logical order is: english ARABIC.
Within LTR context, the BiDi algorithm lays this out starting with the first (English) word aligned to the left, followed by the Arabic word, which requires reverse rendering. The visual result will be
english CIBARA
aligned to the left margin.
Within RTL context, the first (English) word is placed first, at the starting (that is: right) edge, requiring special treatment (reversing the order of letters from the overall RTL direction). This will be followed by (that is: to its left) the Arabic word. So the overall visual result will be
CIBARA english
aligned to the right.
The so-called paragraph direction or base direction (e.g. the dir attribute in HTML, the direction property in CSS, GtkTextDirection in GTK+ etc.) is an important piece of information that can affect the visual order (hence the correctness of the displayed text), and is something that needs to be carried externally.
Because carrying this information externally is inconvenient, the paragraph direction is often auto-detected based on the text.
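As an illustration of such auto-detection, here is a minimal sketch of the "first strong character" heuristic that UAX #9 describes in its rules P2 and P3. The function name and the fallback parameter are made up for this example, and the sketch deliberately ignores one detail of the real rule (characters enclosed in isolates are supposed to be skipped), so treat it as an approximation rather than a conforming implementation.

```python
import unicodedata


def detect_paragraph_direction(text, fallback="ltr"):
    """Guess the paragraph direction from the first strong character.

    Simplification (assumption of this sketch): text inside isolates is
    not skipped, unlike in UAX #9 rule P2.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right letter
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left letter
            return "rtl"
    # No strong character at all (e.g. only digits and punctuation).
    return fallback


print(detect_paragraph_direction("english \u0639\u0631\u0628\u064a"))
print(detect_paragraph_direction("\u05e9\u05dc\u05d5\u05dd english"))
```

Note that the two-word example above is exactly the case where this heuristic picks a direction purely from whichever word happens to come first.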
Alignment usually follows the paragraph direction by default, but can be overridden (e.g. in HTML/CSS or word processors). In standard legacy terminal emulation there’s no way to override left alignment (a terminal-based application might of course right-align the text using explicit indentation in each line, but the terminal itself cannot right-align). So this is irrelevant for us, and I don’t see the need to introduce the separate concept of alignment for terminal emulators. All paragraphs will still be aligned according to their paragraph direction.
BiDi control characters
This subsection is a quick overview of how the BiDi control characters are usually used, in order to embed potentially foreign direction text (typically when some placeholder is substituted by a value that’s not known at the time of writing the software). Feel free to skip this subsection.
The old standard approach is to use embeddings (LRE…PDF or RLE…PDF) or overrides (LRO…PDF or RLO…PDF). Their design is not as good as it could be: they have more effect outside of their character pairs than desired, which is often referred to as “spillover”.
Example with embeddings and overrides (spaces added for readability):
1. abc [RLE] ARABIC [PDF] [RLE] HEBREW [PDF] jkl
shows up as
abc WERBEH CIBARA jkl
2. abc [RLO] def [PDF] [RLO] ghi [PDF] jkl
shows up as
abc ihg fed jkl
The two embedding or override blocks are joined together and laid out from right to left as one; probably not what you’d first expect. This is an unfortunate result of the old (pre-Unicode 6.3) BiDi algorithm just assigning resolved embedding levels to each character and then eliminating the embedding and override controls. Characters of both these RTL segments receive the same resolved embedding level with nothing between them, so later the algorithm cannot tell that they don’t belong together.
A common practice (workaround) is to add a mark (LRM or RLM) after each PDF to restore the paragraph direction, resulting in the “expected” rendering:
3. abc [RLE] ARABIC [PDF] [LRM] [RLE] HEBREW [PDF] [LRM] jkl
shows up as
abc CIBARA WERBEH jkl
4. abc [RLO] def [PDF] [LRM] [RLO] ghi [PDF] [LRM] jkl
shows up as
abc fed ihg jkl
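The workaround from examples 3 and 4 could be wrapped in a small helper along these lines. This is only a sketch: the helper name and its parameters are invented for illustration, while the code points of the control characters are the ones assigned by Unicode.

```python
# Code points of the BiDi control characters used in the examples above.
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"
LRM, RLM = "\u200e", "\u200f"


def embed_with_mark(value, value_direction, paragraph_direction="ltr"):
    """Wrap value in an embedding, followed by a mark of the paragraph's direction."""
    opener = RLE if value_direction == "rtl" else LRE
    # The trailing mark restores the paragraph direction after the PDF,
    # so that two adjacent embeddings are not laid out as one block.
    mark = RLM if paragraph_direction == "rtl" else LRM
    return f"{opener}{value}{PDF}{mark}"


arabic = "\u0639\u0631\u0628\u064a"        # sample Arabic word
hebrew = "\u05e2\u05d1\u05e8\u05d9\u05ea"  # sample Hebrew word
line = f"abc {embed_with_mark(arabic, 'rtl')} {embed_with_mark(hebrew, 'rtl')} jkl"
print(ascii(line))
```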
Unicode 6.3 and UAX #9 (BiDi algorithm) revision 29, released in 2013, fixed this by introducing isolates. (Obviously the unfortunate behavior of the existing control characters had to remain unchanged.) Isolates make the workaround with a following mark unnecessary. Unlike embeddings and overrides, but like marks, isolate characters remain in the text throughout the BiDi algorithm and don’t get eliminated until the very end. I think they are pretty much a drop-in replacement for embeddings with the mark workaround (example 5), but they don’t override the direction, so if you really want to get reverse visual order (like the reversed English words such as ihg in examples 2 and 4) then you should put an override inside (example 6):
5. abc [RLI] ARABIC [PDI] [RLI] HEBREW [PDI] jkl
showing up identically to 3.
6. abc [RLI] [RLO] def [PDF] [PDI] [RLI] [RLO] ghi [PDF] [PDI] jkl
showing up identically to 4.
Isolates also introduce a variant called FSI which autodetects the embedded string’s directionality, providing the most convenient way of embedding text: enclosing in FSI…PDI.
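For the typical placeholder-substitution case this might look as follows. It is a minimal sketch under the assumption that the renderer supports isolates; the names isolate and format_message are hypothetical helpers, not part of any library.

```python
# Isolate control characters (introduced in Unicode 6.3).
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"


def isolate(value, direction=None):
    """Enclose value in an isolate; with no direction given, FSI autodetects it."""
    opener = {"ltr": LRI, "rtl": RLI, None: FSI}[direction]
    return f"{opener}{value}{PDI}"


def format_message(template, **values):
    """Substitute placeholders, isolating each value from the surrounding text."""
    isolated = {key: isolate(str(val)) for key, val in values.items()}
    return template.format(**isolated)


message = format_message("User {name} logged in", name="\u05d3\u05e0\u05d9")
print(ascii(message))
```

The point of the design is that the code doing the substitution does not need to know the direction of the value in advance.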
As of 2018, isolates are unfortunately still not widely supported. E.g. on Ubuntu, Firefox and LibreOffice Writer support them, the GTK+ stack began supporting them in Ubuntu 18.10 (with the update to FriBidi 1.0), the Qt/KDE stack doesn’t seem to support them and neither does Chromium in Ubuntu 18.10.
BiDi comes hand in hand with shaping, the beautiful connected rendering of cursive Arabic letters. (Strictly speaking, shaping is unrelated to BiDi. It’s a mere coincidence that the script requiring shaping happens to be an RTL one and not an LTR one.) Each Arabic letter has up to 4 different forms depending on whether it’s connected to a previous and/or a next letter, plus there are mandatory ligatures.
The CSS3 specification says:
“When shaping scripts such as Arabic wrap […] the characters must still be shaped (their joining forms chosen) as if the word were still whole.”
I assume this is applicable generally, not just for webpages. I also assume that the described behavior applies to cropped text (e.g. when the text crosses the terminal emulator window’s border), that is, even at cropping boundaries the letter should be shaped as if the entire word was visible.
There are two substantially different approaches to perform shaping.
Arabic text files consist of the Arabic letters beginning at U+0600, covered by several Unicode ranges. In addition, there exist “presentation form” characters beginning at U+FB50, in two ranges. The latter should not occur in text files; fonts, however, can define glyphs there.
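The relationship between a base letter and its presentation forms can be inspected using the compatibility decompositions in the Unicode character database. The sketch below uses Python’s unicodedata module for this; the function name is made up, and the two scanned ranges are the Arabic Presentation Forms-A and Forms-B blocks. It only looks at single-letter forms, not at ligatures.

```python
import unicodedata

# Arabic Presentation Forms-A (U+FB50..U+FDFF) and Forms-B (U+FE70..U+FEFF).
PRESENTATION_RANGES = (range(0xFB50, 0xFE00), range(0xFE70, 0xFF00))
POSITIONS = ("isolated", "final", "initial", "medial")


def presentation_forms(base):
    """Map positional form name -> presentation form character for one base letter."""
    wanted = f"{ord(base):04X}"
    forms = {}
    for block in PRESENTATION_RANGES:
        for cp in block:
            # Decompositions look like "<initial> 0628"; empty if there is none.
            parts = unicodedata.decomposition(chr(cp)).split()
            if len(parts) == 2 and parts[1] == wanted:
                position = parts[0].strip("<>")
                if position in POSITIONS:
                    forms[position] = chr(cp)
    return forms


for position, ch in presentation_forms("\u0628").items():  # ARABIC LETTER BEH
    print(position, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

A letter such as BEH has all four forms, while a letter that never connects to the following one (e.g. ALEF) has only an isolated and a final form.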
The simpler, older method for shaping is to use a library that replaces the codepoints with their “presentation form” variants (e.g. the FriBidi library has such methods), and then to render these glyphs. Allegedly this approach has some limitations and cannot perfectly render every possible text of every related language.
The preferred method is to avoid the presentation form characters, and use a library (e.g. HarfBuzz) that just renders the text as an opaque operation, handling a lot more special cases than the other approach could handle.
The BiDi algorithm converts a logical text (potentially containing BiDi control characters) into a visual one (not containing BiDi control characters).
The “inverse BiDi” algorithm is the opposite: it takes the desired visual order of glyphs, and produces a logical order. The requirement is that applying the BiDi algorithm on the result should give back the exact same visual order, i.e. bidi(inverse_bidi(visual)) == visual.
Quoting from ICU User Guide:
“there is no standard algorithm for it. ICU’s BiDi API provides a setting for "inverse" operation that modifies the standard Unicode Bidi algorithm. However, it may not always produce the expected results.”
The FriBidi library doesn’t even attempt to address this functionality.
Does an “inverse BiDi” algorithm exist in theory which never produces any BiDi control characters? No, there’s no such algorithm. Here’s probably the simplest possible proof. In an LTR paragraph direction, the logical strings 1A and A1 (A denotes an RTL letter here) are both rendered as visual 1A. Hence, without using BiDi control characters, there cannot be a logical text that appears in an LTR paragraph as A1. Or if someone wonders whether an entirely different string could end up being displayed as A1: the number of possible logical and visual strings of a given length is the same, and the mapping isn’t injective because we’ve seen a clash, therefore (by the pigeonhole principle) there must be visual strings that no logical string maps to.
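The counting argument can be checked by brute force on a toy model. The sketch below is not a BiDi implementation: it assumes an LTR paragraph and a three-letter alphabet (a = LTR letter, A = RTL letter, 1 = European digit), for which only the UAX #9 rules W7, I1 and L2 have any effect. The function name toy_bidi is made up for this illustration.

```python
from itertools import product


def toy_bidi(logical):
    """Visual order in an LTR paragraph; lowercase = L, uppercase = R, digit = EN."""
    types = ["EN" if c.isdigit() else "R" if c.isupper() else "L" for c in logical]
    # W7: a digit whose nearest preceding strong type is L (or the start of
    # the LTR paragraph) is treated as L.
    last_strong = "L"
    for i, t in enumerate(types):
        if t == "EN":
            if last_strong == "L":
                types[i] = "L"
        else:
            last_strong = t
    # I1: implicit levels at (even) paragraph level 0.
    levels = [{"L": 0, "R": 1, "EN": 2}[t] for t in types]
    # L2: from the highest level down to 1, reverse each maximal run of
    # positions whose level is at least the current level.
    chars = list(logical)
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < len(chars):
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < len(chars) and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            i = j
    return "".join(chars)


all_strings = ["".join(p) for p in product("aA1", repeat=2)]
reachable = {toy_bidi(s) for s in all_strings}
print(toy_bidi("1A"), toy_bidi("A1"))      # the two clashing logical strings
print(sorted(set(all_strings) - reachable))  # visual strings nothing maps to
```

Within this toy model, nine logical strings map to only eight distinct visual strings, and the one left out is A1.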
If BiDi controls are allowed in the result then there’s of course a big hammer: just place the text inside an override (LRO…PDF or RLO…PDF). We’ll later see what the downsides of this big hammer are. There might be more sophisticated algorithms than this big hammer.
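Expressed in code, the big hammer is almost trivial. This sketch assumes that the visual string is given from left to right and contains no BiDi control characters of its own; the function name is made up.

```python
LRO, PDF = "\u202d", "\u202c"  # left-to-right override, pop directional formatting


def inverse_bidi_big_hammer(visual):
    """Produce a logical string that is forced to be laid out exactly as given."""
    # Inside LRO...PDF every character is treated as a strong LTR letter,
    # so the stored order and the displayed order coincide.
    return f"{LRO}{visual}{PDF}"


logical = inverse_bidi_big_hammer("abc CIBARA jkl")
print(ascii(logical))
```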
On a side note, it’s unclear to me if there’s an “inverse shaping” algorithm, i.e. whether the shaped text using “presentation form” characters can be “unshaped” to the proper logical text.
We’ll see later why it’s of any interest at all to us, and we’ll discuss how much we can or should avoid the need for an inverse BiDi or inverse shaping.