Combining characters

This page is about explicit modes; in implicit modes the handling of combining characters is obvious.

Nonspacing and enclosing marks (Mn, Me)

This BiDi specification doesn’t change the essentials of terminal emulation behavior. As a direct consequence, in all of the modes, a nonspacing or enclosing mark is stored in the same character cell as the preceding base character of the input stream. Each cell is typically expected to be rendered on its own, that is, no combining accent should move to another cell for display purposes, not even when BiDi is involved.

Example: The Hebrew word שָׁלוֹם (Shalom) has the following logical order, denoting the base characters and combining accents as they belong together:

  0      1      2      3      4      5      6
U+05e9 U+05b8 U+05c1 U+05dc U+05d5 U+05b9 U+05dd
‭└╴שָׁ╶───────────────┘ └╴ל╶─┘ └╴וֹ╶────────┘ └╴ם╶─┘‬

If an application wishes to print this word in explicit LTR mode, it has to reverse the order of the base letters but keep every combining accent after its base letter, that is, it needs to emit:

  6      4      5      3      0      1      2
U+05dd U+05d5 U+05b9 U+05dc U+05e9 U+05b8 U+05c1
‭└╴ם╶─┘ └╴וֹ╶────────┘ └╴ל╶─┘ └╴שָׁ╶───────────────┘‬

This is not what the BiDi algorithm produces by default; this can be checked e.g. with the online C or Java reference implementation.

The FriBidi library implements our desired behavior if FRIBIDI_FLAG_REORDER_NSM is passed to fribidi_reorder_line(); this flag is set by default.
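
For a run consisting purely of RTL text, the reordering needed for explicit LTR mode in the Shalom example can be sketched without a BiDi library. This is only an illustration under that assumption, not a replacement for the BiDi algorithm; the helper names `clusters` and `reverse_keeping_marks` are made up for this example.

```python
import unicodedata

def clusters(text):
    """Group each base character with the Mn/Me marks that follow it."""
    result = []
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me") and result:
            result[-1] += ch      # mark joins the preceding base character
        else:
            result.append(ch)     # a new base character starts a new group
    return result

def reverse_keeping_marks(text):
    """Reverse the base letters, keeping every mark after its own base."""
    return "".join(reversed(clusters(text)))

shalom = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"   # logical order
emitted = reverse_keeping_marks(shalom)
print(" ".join(f"U+{ord(c):04x}" for c in emitted))
# prints: U+05dd U+05d5 U+05b9 U+05dc U+05e9 U+05b8 U+05c1
```

The printed sequence matches the order the application needs to emit in the example above. Text that mixes directions still needs a real BiDi implementation.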

In explicit RTL mode, an app would output the BiDi algorithm’s result in reverse memory order (assuming that the BiDi algorithm produces visual LTR order, that is, the beginning of the array corresponds to the left of the screen and the end of the array to the right). Care should be taken to restore the position of combining accents that belong to LTR glyphs. The FriBidi library doesn’t provide out-of-the-box support for this. [TODO: file a feature request.]

Spacing marks (Mc)

Spacing marks are used in Devanagari and several other scripts; I believe all of these scripts are LTR. Similarly to nonspacing marks, they add a combining symbol to the base one, but in addition they also increase the overall width by one character cell.

The added combining symbol does not necessarily appear to the right of the base one; it can appear on its left too, see e.g. U+093F (DEVANAGARI VOWEL SIGN I). Still, the logical order is the base character followed by the mark. This visual swapping is unrelated to the BiDi story.

Even without introducing BiDi, it’s unclear how such characters should be handled in terminal emulation, e.g. whether they should combine in the data layer or in the presentation layer. See e.g. VTE issue 584160.

With BiDi, it’s not entirely clear what the proper output order should be when outputting a reversed string, that is, when printing such LTR text in an explicit RTL paragraph (e.g. an overall Arabic text has some embedded Devanagari word), or, theoretically, if an RTL text with such spacing marks were emitted in explicit LTR mode.

Due to such marks not existing in RTL scripts, and visual RTL not being a common thing, I’m not sure if there’s any best practice here that we could follow. I find “Approach A” to be the cleaner overall design, but input from BiDi experts is desired to make a final decision.

Approach A

Applications are expected to print the base letter first, followed by the spacing mark, even when printing text in reverse order. That is, the order is always base letter followed by spacing mark in the data layer, before BiDi is applied.

This approach is the one that’s compatible with the possible approach of combining these marks in the data layer of terminals.

This approach is the one that’s along the lines of how nonspacing marks are handled. As with nonspacing marks, the out-of-the-box BiDi algorithm doesn’t do what apps would need: apps would have to move the spacing marks back after their base letters. Currently there’s no support for this in FriBidi, but we could ask them to implement it.

This is how Pango, Firefox and Chromium interpret the data in a right-to-left override context, e.g. inside RLO…PDF or <bdo dir="rtl">...</bdo>.
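
As a rough sketch of what Approach A asks of an application, the following Python snippet reverses a Devanagari word while keeping every mark, spacing or not, after its base letter. The function name `reverse_approach_a` and the sample word are chosen for illustration only, and the sketch assumes the whole string is a single LTR run that is to be emitted in reverse.

```python
import unicodedata

# General categories that stay attached to the preceding base under Approach A.
MARKS = ("Mn", "Me", "Mc")

def reverse_approach_a(text):
    """Reverse base letters; every mark (spacing or not) stays after its base."""
    groups = []
    for ch in text:
        if unicodedata.category(ch) in MARKS and groups:
            groups[-1] += ch
        else:
            groups.append(ch)
    return "".join(reversed(groups))

# किताब: ka, vowel sign i (Mc), ta, vowel sign aa (Mc), ba
kitab = "\u0915\u093f\u0924\u093e\u092c"
emitted = reverse_approach_a(kitab)
print(" ".join(f"U+{ord(c):04x}" for c in emitted))
# prints: U+092c U+0924 U+093e U+0915 U+093f
```

In the emitted stream each spacing mark still directly follows its base letter, so a terminal that combines such marks in the data layer can treat them the same way it treats nonspacing marks.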

Approach B

Applications are expected to guarantee that the base letter is the one that’s first according to the letter’s intended direction, followed by the combining mark in that direction. That is, when printing text in reverse order, the spacing mark is printed first, followed by the base letter.

This is what happens to be the result of the BiDi algorithm.

This is the convenient one for apps that don’t wish to distinguish between regular letters and spacing marks at all. (This is only relevant as long as FriBidi doesn’t help and apps would need to manually reorder the marks. Once FriBidi adds support, it’ll be a non-issue.)

A terminal emulator might have to apply some heuristics to determine whether a spacing mark between two letters actually belongs to the LTR letter on its left, or the RTL letter on its right. A reasonable heuristic is to always tie it to the base letter on its left, provided I am right that spacing marks aren’t used in any RTL scripts. Still, IMHO it’s a bit of a hack rather than beautiful engineering.
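
To make Approach B more concrete, here is a minimal Python sketch of both sides: the application emitting a reversed word with spacing marks treated like ordinary letters, and a terminal applying the "tie it to the letter on its left" heuristic. The names `reverse_approach_b` and `attach_to_left` are hypothetical, and the sketch assumes an explicit RTL paragraph in which the first emitted character ends up rightmost on the screen.

```python
import unicodedata

def reverse_approach_b(text):
    """Reverse treating spacing marks (Mc) like letters; only Mn/Me stay with their base."""
    groups = []
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me") and groups:
            groups[-1] += ch
        else:
            groups.append(ch)
    return "".join(reversed(groups))

def attach_to_left(visual):
    """Heuristic: tie each spacing mark to the base letter on its left.

    `visual` holds the characters in left-to-right screen order.
    Returns (base, marks) pairs in the same order.
    """
    pairs = []
    for ch in visual:
        if unicodedata.category(ch) == "Mc" and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs

# किताब: ka, vowel sign i (Mc), ta, vowel sign aa (Mc), ba
kitab = "\u0915\u093f\u0924\u093e\u092c"
emitted = reverse_approach_b(kitab)   # each spacing mark now precedes its base

# The sample contains no nonspacing marks, so a plain reversal of the stream
# gives the left-to-right screen order of an explicit RTL paragraph.
visual = emitted[::-1]
pairs = attach_to_left(visual)
# pairs: (ka, sign i), (ta, sign aa), (ba, no mark)
```

The terminal recovers the intended pairing here only because every spacing mark is assumed to belong to an LTR letter, which is exactly the assumption the heuristic relies on.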