Index: doc/proposals/unicode.html =================================================================== --- doc/proposals/unicode.html (revision 98d4df99d1a62214a4467c21f220c266138e9826) +++ doc/proposals/unicode.html (revision 98d4df99d1a62214a4467c21f220c266138e9826) @@ -0,0 +1,1034 @@ + + + + + + + + + + + + + + + + + + + +

Summary

+ +

+This document presents a string API for Cforall that accommodates +Unicode text while staying safe. The safety challenge is presented +and addressed.

+ +

+UTF-8 is the +format.

+ +

+Index-based operations in the public interface are reduced or +eliminated. Indexing does not play well with UTF-8’s variable-width +characters.

+ +

+Include-Exclude operations, which return a mutable reference into the +original (a capture), completely take their place. These compose find +(text -> index) with substring (index -> text), as text -> +text. The basic mechanism is straight out of Buhr 94; what’s new +here is a total devotion to it, and the decoupling of captures from +in-exclude calls.

+ +

+Composable patterns are the subjects being in/excluded. Leaf pattern +expressivity is exact-sequence and repeat character-class, not +regular expression (same as Buhr 94). A leaf pattern can, but does +not have to, offer a capture. An adjacency-composed pattern can +capture too.

+ +

+Operator syntax, generalizing current sin-sout pipes, is the pattern +glue. Basic-case string parsing looks the same as current sin +reading. The more nuanced cases allow overriding behaviours like +controlling whitespace treatment.

+ +

+A capture wraps and hides a byte-level reference into variable-width +text. It stays valid under length-changing mutations of the original. +In UTF-8, even single-character overwrite is a byte-length changing +operation (when done safely, in the general case).

+ +

+A representation of buffer-gap-buffer enables these stable captures. +The gap changes size under editing, moves under cursor advancing, and +is skipped under reading. The cursor move copies characters from the +gap-adjacent source-end of one buffer to gap-adjacent sink-end of the +other.

+ +

The +Unicode challenge

+ +

UTF-8 +is the current (spring 2019) standard for text encoding. Java and +.NET use two-byte encodings, which is an older approach. UTF-8 +supports the full universe of Unicode text. A new language like CFA +must support UTF-8, likely even treat it as the native format.

+ +

UTF-8 is a +variable-width encoding. Every string from the 128-character ASCII +aphabet is also a UTF-8 string. If the extra bit (which sends a byte +beyond the 128 characters) is set, then the byte is the start of a +character that uses at least two bytes. And so on. Up to 4 bytes, +which covers the 17 defined Unicode planes, including the overhead +the tagging. If 17 planes are not enough for you, but 256 planes +ought to be, then that takes 6 bytes, including the tagging; this +would cover the all the characters expressible with the (obscure) +UTF-32 (aka UCS-4) 4-byte fixed-width encoding.

+ +

The variable-width +encoding is a challenge for the string operations length and +index. An API’s semantics must decide bewteen character- and +byte-based counting. Suppose an API supports operations stuctured as, +“give me the range of length m, starting at position i, from this +string, which you have told me has length n, where I know i <= n – +m + 1.” (This presentation uses a 1-based location count; the +choice of base is independent of the semantics under discussion.) For +example, in all encodings, and in all semantics being discussed, when +doing this operation on “Hello, world!” with i=8, m=5 and n=13, +the result is “world” and has length 5. This sematic uniformity +happens because all letters in the example string have equal length +in all encodings.

+ +

Contrasting more +interesting examples:

+ +

B8 = Byte-counting +sematics on UTF-8 encoding

C8 = +Character-counting semantics on UTF-8 encoding

2B16 = +Double-byte-counting semantics on UTF-16 encoding. This is what Java +and .NET use.

+ +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

+ + + +	+ B8 +	+ C8 +	+ 2B16 +
+ Length of: u +	+ 1 +	+ 1 +	+ 1 +
+ Length of: ü +	+ 2 +	+ 1 +	+ 1 +
+ Length of: 福 +	+ 3 +	+ 1 +	+ 1 +
+ Length of: + + + (U+1D11E) +	+ 4 +	+ 1 +	+ 2 +
+ Are ill-aligned range specifications possible? +	+ Yes +	+ No +	+ Yes +
+ Complexity of length operation, assuming start- and + end-locations known +	+ O(1) +	+ O(n) +	+ O(1) +
+ Complexity of substring operation +	+ O(1) +	+ O(i+m) +	+ O(1) +

+ +

Thus, “B” +semantics put an onus on the programmer that “C” semantics handle +in the library, for a runtime cost.

+ +

Typical string APIs +impose position-based operations on programmers. But programmers +would rather be position-agnostic. Outside of string-APU example +programs, numbers passed to substring operations are usually the +result of find operations. The find-substring composition is text-in, +text-out. Tinkering with the numbers in between often happens, such +as to codify the intention, “except leave off the delimeter.” +This is where bugs happen, because cases like the delimiter being +dynamic, and showing up with a byte-width longer than one are hard to +conceive. [TODO: cite UTF-16 being problematic like this] A degree of +security or internationalization expertise is needed to consider +relevant test cases that leave the Basic Multilingual Plane.

+ +

Length-changing operations

+ +

Mutations that +change a string’s length are prominent in early (pre-2019) work on +CFA string support, and on Buhr-94 strings. For example:

+ +

    replace("123456789", "456", "") // result: "123789"

+The previous unicode discussion clarified that, under B8 semantics, +this too is a length-changing operation:

+ +

    replace("francais", "c", "ç")  // result: "français"

+
+ +

This proposal +supports doing such operations as mutations.

+ +

Solution Principles

+ +

Abstract all +byte-length awareness within the string API. No public B semantics +when a user does parsing.

+ +

Offer powerful +text-to-text operations on the public interface. There is a potential +to extend this expressivity to regular-expression matching, but +current detailed work focuses on character-class and exact-sequence +constructs.

+ +

Remove most +number-based operations from the public interface. No illusion of +general-purpose random access.

+ +

Offer +character-class matching in the public interface. “Match: Any * 3” +is the only way to chunk off characters by number of occurrences; +this has strictly C semantics. This supports off-by-a-couple +adjustements when writing a parsing routine, guarantees +wide-character correctness, and puts the operation in a place where +O(n) is reasonable, as n is expected to be small.

+ +

Refine an API-use +style in which character-widths are only tested once. Provide +examples in this style and explanations of why this happens. Re-work +low-level designs as needed to ensure naive usage avoids re-checks.

+ +

Offer a few coarse +chunking operations that work with a max-bytes parameter, documented +as helpers for batching cases. Return a well-aligned result (of +indeterminate C-semantic length), by leveraging UTF-8 tagging, in +which a start-of-character is easy to find. “Copy to c-string” is +one such operation.

+ +

Support +substrings/selections that straddle mutations points by following an +“open binder” design, introduced below.

+ +

The rest of the doc +presents detail suggesting the above is achievable. The reader may +thus entertain the hypothesis that all desirable string manipulation +can take place without an index-exposing API.

+ +

Text-to-text API, rev 1

+ +

This starter API is +an example of the text-to-text style, and C-style public semantics, +suggested earlier. It is presented mainly to clarify those concepts +before continuing. +

+ +

Note that matched +substrings write through into the original.

+ +

The recommended API +still woks this way, but also adds more control over which +(writeable) parts get captured into variables, enabling more useful +find-replace mutation cases.

+ +

    string s, qs, s1, s2;
+    charclass qcc;

+Split s once, with q on the left:

+ +

    [s1, s2] = include(s, qs);
+    assert s == s1 + s2;
+    assert s1 == "" || s1 == qs;
+
+    [s1, s2] = include(s, qcc);
+    assert s == s1 + s2;
+    for (c1: s1) assert ismatch(c1, qcc);
+    assert s2 == "" || ismatch(first(s2), qcc);

+Split s once, with q on the right:

    [s1, s2] = exclude(s, qs);
+    assert s == s1 + s2;
+    assert ["", _] == include(s1, qs)
+    assert [qs, _] == include(s2, qs)
+
+    [s1, s2] = exclude(s, qcc);
+    assert ["", s1] == include(s1, qcc);
+    assert s2 == "" || include(s2, qcc).0 != ""

+
+ +

All results are +writeable references:

+ +

    // demonstration on 1^st return of 1^st API function; others are similar
+
+    [s1, s2] = include(s, qs);
+    assert s == s1 + s2;
+    assert s1 == "" || s1 == qs;
+
+    s1 = "hi";
+    assert s == "hi" + s2;
+
+    int s_len_old  = len(s), 
+        s2_len_old = len(s2);
+    s1 = "福";                       // (2 ch, 2 b) overwritten with (1 ch, 3 b)
+    assert len(s) == s_len_old – 1; // length counts logical characters
+    assert len(s2) == s2_len_old;   // s2 is intact, logically unaffected, yet at different distance from front

+Splitting this way works in loops. In the examples following, think +of q as a delimiter. Note how two steps are combined into a single +call: get everything up to next q; move past the q.

+ +

Split s repeatedly, +with q ending each match:

+ +

    string ss = "";
+    for ([s1,s2] in split_ex(s, q)) {
+        assert [s1, s2] == exclude(s1 + s2, q);
+        ss += s1;
+        ss += s2;
+    }
+    assert ss == s;

+
+ +

Split s repeatedly, +with q starting each match:

+ +

    string ss = "";
+    for ([s1,s2] in split_in(s, q)) {
+        assert [s1, s2] == include(s1 + s2, q);
+        ss += s1;
+        ss += s2;
+    }
+    assert ss == s;

+
+ +

Recommended API

+ +

This discussion +strives for concreteness at the risk of painting the design into a +corner. Many details still need to be worked through. The most +significant point of feedback sought here is whether the +algebra-of-patterns is appropriate and sufficient.

+ +

The Rev-1 API gets +awkward on cases like:

+ +

    string s = "a=1, b=2, c=3, ";
+    string kv, del, k, eqv, eq, v;
+    for ([kv,del] in split_ex(s, ", ")) {
+        [k,eqv] = exclude(kv, "=");
+        [eq,v] = include(eqv, "=");
+        kv = k + v;
+        del = ":";
+    }
+    assert s == "a1:b2:c3";

+Points of awkwardness are:

+
Multiple API + calls on the "=" matching are still not averted
+
+
No substring + capturing all of "a=1, ". If we had one, say scur, the + following assignment would be more natural than the pair in the + example:
+
+
+
We declare + names for substrings that are only used to target the next API call.
+

+ +

The recommended API +treats the rev-1 points as composable primitives, and separates (in +general) pattern from capture.

+ +

    string s = "a=1, b=2, c=3, ";
+    while (string ss = s, string k, string v, string scur;
+           nextMatch(ss, scur & (k ^ "=" | v ^ ", "))
+          ) {
+        scur = k + v + ":";
+    }
+    assert s == "a1:b2:c3";

+Some work is still needed on the ss declaration, and the +iterator-loop syntactic interaction (nextMatch). Lower-level work has +explored the possibility of integrating it with “0 ~ 10”-like +stepping.

+ +

The top-level +construct is nextMatch(string, pattern). It matches the pattern +against the front of the string; this is described recursively next.

+ +

The combinators |, ^ +and & all do pattern op pattern -> pattern.

+ +

Base cases

+
A read-only + string is a pattern that matches exactly one occurrence of itself.
+
+
A writeable + string is a pattern that matches anything.
+
+
Overloading + to be built out to make, e.g. a writeable int qualify as a pattern + that matches the character class [0-9].
+
+
Note this + presentation considers a separate problem: how we + differentiate a writeable string from a read-only (or COW) string
+
+
Expressivity + = charclass, exact sequence
+

+ +

The operators, in +their natural precendence order from loosest to tightest, and being +naturally left-associative, give the recursive cases:

+ +

+ + + + + + + + + + + + + + + + + + + +

L|R

inclusive then

L consumes all characters that the L-pattern accepts, then R + begins from the first character that L rejects. +

L^R

exclusive then

L consumes all characters that the R-pattern rejects, then R + begins from the first character that R accepts.

L&R

same as

General meaning: Both sides have + to match the same run of characters.

Expected use: pairing a restrictive complex non-writeable + pattern with a permissive simple writeable pattern, the latter + declaring an alias for the result

+ +

The +semantics of failing to match need to be nailed down. Basically, the +suggestion is

+
don’t + throw exception
+
+
don’t + end up in an infinite loop (matching zero more chars each time)
+
+
just + exit the loop
+

+ +

Work +is in progress describing an iterator-processing model for these +patterns which includes modeling no-more-matches.

+ +

Unification with streams

+ +

It is desirable that +a | b | c have analogous meanings, when done on a string or on +standard-in. As it is also done to standard-out, this in turn +suggests the string append operator will also become |.

+ +

The stream sout, +seen as a container of characters, offers one operation: write to +back. Similarly, sin reads from front. Thus, they are ready to work +against a pattern a | b | c, needing no more case refinement. The +type sytem disambiguates sin|p +from sout|p.

+ +

+ + + + + + + + + + + + + + + + + + + +

+ +

read (aka parse)

write

in forward linguistic time order

split from front

(sin does this)

append

(sout does this)

in reverse linguistic time order

split from back

prepend

+ +

A +string user needs to specify which behaviour is desired (myStr|p +is not enough). This choice, happening at the top level, coincides +with the need to pick a syntax for looping (strawmanned above as +while-nextMatch). This work is in progress; elements in +consideration include:

+
++ + with <, vs -- with >, for move/has-next, in forward/reverse + order
+
- +
  plays + nice with for loop, but confuses s > i overloading
  +
+
+
unary + - and + for read-from/write-to
+
+
unary + ~ (or !) to switch to reverse linguistic order from default forward
+
+
top-level + only: >> for read-from; << for write-to
+

+ +

When reading from +standard-in, automatically trimming whitespace is generally desired. +When processing a string (and presuming to obviate an index-based +API), it must be possible to take control of whitespace. Auto-newline +behaviour is similar.

+ +

Idea is:

void ?{}(pat &, int &) // construct pattern from int reference

+
gets + overridden: multiple definitions of the symbol are out there
+
+
the default + one is scope-visible, at background level, upon “#include string”
+
+
user may + elevate a different one to current level by: with + (stringopts.picky) {...}
+

+ +

Illustration without +an SPI, and covering the cases we care about:

+
Default
+
- +
  input + spelling is base-10
  +
- +
  Be + aggressive about clearing whitespace, lax about its absence; + adjacent whitespaces are insignificant
  +
- +
  all + contiguous [0-9] into the int; panic on overflow
  +
- +
  require at + least one [0-9] +
  +
- +
  yadda yadda + leading zeros, atoi-consistent
  +
+
+
Picky
+
- +
  Don’t + touch whitespace
  +
- +
  require + exactly one [0-9], no coalescing
  +
+

+ +

Probably, a +character-level SPI is required, to enable extensibility.

+ +

Open Binder design

+ +

This design +addresses internals hidden by the string API.

+ +

All depicted +references into a string are by byte location.

+ +

The name suggests a +three-ring binder of papers, sitting opened to a middle page, with +the rings un-clasped. It is easy to insert or delete at the current +opened-to point, given the expense that data movement is required to +advance this current point, proportional to the distance moved.

+ +

We are optimizing +for a single iteration through a string, in which substantial +rewriting is happening at the iterator’s current point.

+ +

Suppose we have:

1   string s, s1, s2;
+2   s = "abcdefghijklmnopqrstuvwxyz";
+3   [s1, s2] = exclude(s, 'k');       // s1=="abcdefghij", s2=="klmnopqrstuvwxyz"
+4   s1 += "xxx";                      // s =="abcdefghijxxxklmnopqrstuvwxyz"

+These objects can be represented by the buffers following, where the +__ underscores represent don’t-care bytes.

+ +

After line 2:

           _________abcdefghijklmnopqrstuvwxyz

+After line 3:

           abcdefghij_________klmnopqrstuvwxyz

+After line 4:

           abcdefghijxxx______klmnopqrstuvwxyz

+
+ +

A sub-string is +implemented as

+
having a pair + of references into the buffer (start-byte, end-byte)
+
+
being in a + linked list that incudes all active sub-strings of the buffer, + ordered by start position
+
+
being in a + similar list by end positon
+
+
knowing that + a special element from the substrings’ lists marks the edit gap + (or a is-editable/has-edited state on an iterator, if supporting + multiple edit locations
+

+ +

The string buffer +implements automatically growing/shrinking the gap, by reallocating +at a different size and fixing the iterators.

+ +

Lit Review to Include

+ +

API’s approach to +UTF-8 and indexing in:

+
Go
+
+
Rust
+
+
std::string
+
+
alternate + proposals for Java at BMP-exit JSR
+
- +
  show these + are back-compat challenges unique to java that we can get around
  +
- +
  still, + position wrt c-compat
  +
+
+
(find some + super-high-level, text-processing DSL... what does it do?)
+
- +
  big data / + NLP ??
  +
+

+ +

Existing +char*-implemented algorithms

+
find some
+
+
ensure the + table-implementation for substring match features in my + exact-substring pattern
+
+
clarify + utf8-agnosticism
+
+
case study + for promoting / how-to
+
+
“client ABC + should convert their RDBMS to CFA”
+
+
getopt for + command-line arguments (thanks Thierry for the example)
+
- +
  case of it + reordering argv to do names then positionals; this is how gcc does + mixed
  +
+

+ +

Typical regex +implementation algorithms

+ +