Implicit mode, level 2

Let’s extend the implicit mode to remember BiDi control characters (up to a reasonable extent).

One idea would be to handle them as combining accents are handled (tied to their preceding character). The problem is that we also need to remember BiDi controls at the very beginning of paragraphs, before their first character, and we’d need to define how exactly it would be handled. I don’t like the idea of the first column being handled differently from the others.

Another idea could be to generally introduce storage for every boundary between adjacent cells (plus before the first and after the last cell). One problematic part here is defining exactly when and how these get overwritten. Another is defining the behavior at line (but not paragraph) breaks: how to merge a line’s right edge with the next line’s left edge.

My proposal tries to come up with something that is along the lines of how combining accents are handled, yet free from the problem with the beginning of the paragraph or inter-cell data, and also has relatively nice semantics and helps proper copy-pasting.

For each cell, let’s remember several preceding and several succeeding BiDi controls. Currently a cell contains: base letter + combining accents. From now on it would contain: preceding BiDi controls + base letter + combining accents + succeeding BiDi controls.

Let’s use M for BiDi marks (there are 3 of them), O for opening BiDi controls (7), C for closing ones (2), and lowercase letters for actual letters. The preceding BiDi controls can only be of type M and O, the succeeding ones can only be of type M and C.

As input arrives, after receiving a regular letter the BiDi controls are assigned to this letter as succeeding BiDi chars as long as their type is M or C. When an O is encountered, we move to the next letter: its preceding BiDi controls (or probably the entire character?) are wiped out and filled with the arriving ones.

When there’s an O followed by a C without any actual visible letter in between, the C is ignored. (Or it might remove that previous O, if found? Too complicated.) I don’t think any sane string would open and then close a BiDi context without any printable character in between [correct? How about when the empty string is substituted into a template?].

After cursor movement operations, or probably even attribute changes, arriving BiDi chars would be placed as preceding ones of the next letter, and wouldn’t fiddle with the previous character. Zero-width, non-BiDi-related characters (e.g. U+00AD soft hyphen, U+200B zero width space) would also force moving on to the next character’s preceding BiDi controls. [Turn it into a precise specification if we end up going for this whole approach.]

E.g. if the input stream is M1 O1 M2 x M3 C1 M4 O2 M5 y M6 C2 M7, then the letter x would have M1 O1 M2 as its preceding and M3 C1 M4 as its succeeding BiDi controls; and the letter y would have O2 M5 as preceding and M6 C2 M7 as succeeding.
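The assignment rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a specification: tokens use the M/O/C/lowercase notation of this section, the names `Cell` and `assign_controls` are made up for the example, a C arriving before any letter is ignored, controls left pending at the end of input are dropped, and the per-cell capacity limits and cursor movement are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    base: str
    preceding: list = field(default_factory=list)   # only M and O tokens
    succeeding: list = field(default_factory=list)  # only M and C tokens


def assign_controls(tokens):
    """Distribute BiDi control tokens among the cells of the letters."""
    cells = []
    pending = []            # controls waiting for the next letter
    stick_to_prev = False   # True while controls still attach to the last letter
    for tok in tokens:
        kind = tok[0]
        if kind.islower():
            # A regular letter: it takes the pending controls as preceding ones.
            cells.append(Cell(tok, preceding=pending))
            pending = []
            stick_to_prev = True
        elif kind == "O":
            # An opener always belongs to the next letter.
            stick_to_prev = False
            pending.append(tok)
        elif kind == "M":
            if stick_to_prev:
                cells[-1].succeeding.append(tok)
            else:
                pending.append(tok)
        elif kind == "C":
            if stick_to_prev:
                cells[-1].succeeding.append(tok)
            # Otherwise: a C with no visible letter since the O is ignored.
    return cells


stream = "M1 O1 M2 x M3 C1 M4 O2 M5 y M6 C2 M7".split()
for cell in assign_controls(stream):
    print(cell.preceding, cell.base, cell.succeeding)
```

Run on the example stream, the sketch gives x the controls M1 O1 M2 / M3 C1 M4 and y the controls O2 M5 / M6 C2 M7, matching the assignment described above.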

Assigning M4 to x rather than y breaks symmetry, but that’s not necessarily a problem; the BiDi algorithm isn’t symmetrical either regarding the beginning vs. end of the logical string. A typical scenario is when a piece of text is embedded in RLE…PDF and then immediately followed by an LRM to prevent spillover, but there’s no need for an LRM before the RLE. (See the section BiDi control characters.) So I think it’s actually the better (let alone much more easily implementable) approach if marks stick to the previous character whenever possible.

A preceding BiDi character can be stored on 4 bits (3+7+1 = 11 possible values including “none”), a succeeding one can be stored on 3 bits (3+2+1 = 6 values including “none”). Using simple bitpacking, a 16-bit integer has room for 2 preceding + 2 succeeding, a 32-bit integer per cell has room for 3 preceding + 6 succeeding, or (decided in the source code) 4 preceding + 5 succeeding BiDi chars. Mul/div/mod 6 and 11 arithmetic can store 2+3 on 16 bits, or 3+8, 4+7 or 5+5 on 32 bits. Taking into account that “none” can only occur at the end, 5+6 can also be squeezed into 32 bits. Having room for more succeeding than preceding controls is justified by the asymmetry in the design, as well as the real life RLE…PDF+LRM / LRE…PDF+RLM practice. I think this should be enough for most real life use cases; if more controls are received then they’re dropped. The emulator could first check whether this bitpacked integer is 0 (by far the most typical case) before unpacking. Out of these, 4+7 feels like the best choice to me, but maybe the 16-bit 2+3 is already good enough. It’s just one out of many possible implementations outlined here, of course.
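To make the mul/div/mod variant concrete, here is a small Python sketch of the 4+7 layout. The numeric codes are an assumption made for the example (0 means “none”, 1..10 stand for the preceding M/O controls, 1..5 for the succeeding M/C controls), and the names `pack`, `unpack` and `fits` are hypothetical. Excess controls are dropped, as described above.

```python
PRE_BASE, SUC_BASE = 11, 6   # 10 preceding values + "none", 5 succeeding + "none"
N_PRE, N_SUC = 4, 7          # the 4+7 layout for a 32-bit integer


def fits(n_pre, n_suc, bits):
    """Check whether n_pre + n_suc slots fit with mixed-radix packing."""
    return PRE_BASE ** n_pre * SUC_BASE ** n_suc <= 2 ** bits


def pack(preceding, succeeding):
    """Pack control codes into one integer; 0 means no controls at all."""
    pre = list(preceding[:N_PRE])    # drop anything beyond the capacity
    suc = list(succeeding[:N_SUC])
    pre += [0] * (N_PRE - len(pre))  # pad with "none" at the end
    suc += [0] * (N_SUC - len(suc))
    value = 0
    for code, base in [(c, PRE_BASE) for c in pre] + [(c, SUC_BASE) for c in suc]:
        if not 0 <= code < base:
            raise ValueError(f"code {code} out of range for base {base}")
        value = value * base + code
    return value


def unpack(value):
    """Inverse of pack(): return (preceding codes, succeeding codes)."""
    suc = []
    for _ in range(N_SUC):
        value, code = divmod(value, SUC_BASE)
        suc.append(code)
    pre = []
    for _ in range(N_PRE):
        value, code = divmod(value, PRE_BASE)
        pre.append(code)
    pre.reverse()
    suc.reverse()
    # "none" can only occur at the end, so dropping zeros keeps the order.
    return [c for c in pre if c], [c for c in suc if c]


packed = pack([1, 10], [5, 3, 1])
print(packed, unpack(packed))
print(fits(2, 3, 16), fits(4, 7, 32), fits(5, 6, 32))
```

The `fits` helper only covers the plain mixed-radix scheme, so it reports that 5+6 does not fit in 32 bits; the tighter encoding that exploits “none” occurring only at the end is not implemented here.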

I believe an advantage of this approach is that most emulation features (e.g. insertion, deletion) don’t need to be redesigned; they could just keep working exactly as they used to. Overwriting cells probably also does something quite reasonable to the BiDi controls, and so does copy-pasting (you probably automatically get or don’t get the BiDi controls as it makes sense). [How do graphical toolkits, web browsers etc. copy-paste fragments of BiDi text? When do and when don’t they include BiDi control characters; do they automatically balance out unbalanced ones? Is there a well established practice here?]

For any emulator claiming to support implicit mode level 2, it should be a hard requirement to support isolates. 5+ years after their debut there’s no excuse for picking a BiDi library that still doesn’t support them; it’s not reasonable to force applications to use the legacy embeddings+marks instead of a simpler and technically superior solution.

Alternate forms of BiDi controls

For implicit mode level 2 we might consider introducing alternate forms of BiDi control characters, as well as of other special characters like ZWJ and ZWNJ, as escape sequences rather than the Unicode codepoints.

Emulators would be expected to treat them differently when it comes to copy-pasting: bare Unicode characters would be preserved, but ones that arrived as escape sequences would be dropped. This way they wouldn’t get lost when they’re contained in a file which is cat’ed and then copy-pasted, whereas an app could emit them for nicer formatting (examples follow shortly) in a way that they don’t clutter the actual data when it’s copy-pasted. I believe this is similar to UAX #9’s concept of “higher level protocols”.
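The copy-paste distinction could be modelled by remembering, for each stored control, how it arrived. The following Python sketch is purely illustrative: `StoredControl`, its `from_escape` flag, `Cell` and `copy_text` are hypothetical names, combining accents are left out, and no particular escape sequence syntax is assumed.

```python
from dataclasses import dataclass, field

RLE, PDF, LRM = "\u202b", "\u202c", "\u200e"


@dataclass(frozen=True)
class StoredControl:
    char: str
    from_escape: bool = False   # True if it arrived as an escape sequence


@dataclass
class Cell:
    base: str
    preceding: list = field(default_factory=list)
    succeeding: list = field(default_factory=list)


def copy_text(cells):
    """Build the clipboard text, keeping only controls that arrived bare."""
    out = []
    for cell in cells:
        out.extend(c.char for c in cell.preceding if not c.from_escape)
        out.append(cell.base)
        out.extend(c.char for c in cell.succeeding if not c.from_escape)
    return "".join(out)


cells = [
    Cell("x",
         preceding=[StoredControl(RLE)],
         succeeding=[StoredControl(PDF), StoredControl(LRM, from_escape=True)]),
]
copied = copy_text(cells)
print([hex(ord(ch)) for ch in copied])
```

Here the bare RLE and PDF survive copying, while the LRM that the application emitted as an escape sequence is left out of the clipboard.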

For future extensibility, we might pick an escape sequence framework where the value is the Unicode codepoint itself, rather than defining some arbitrary mapping.