Index: doc/proposals/unicode.html =================================================================== --- doc/proposals/unicode.html (revision 98d4df99d1a62214a4467c21f220c266138e9826) +++ doc/proposals/unicode.html (revision 98d4df99d1a62214a4467c21f220c266138e9826) @@ -0,0 +1,1034 @@ + + +
+ +
+
+
+This document presents a string API for Cforall that accommodates +Unicode text while staying safe. The safety challenge is presented +and addressed.
+
+
+
+UTF-8 is the +format.
+
+
+
+Index-based operations in the public interface are reduced or +eliminated. Indexing does not play well with UTF-8’s variable-width +characters.
+
+
+
+Include-Exclude operations, which return a mutable reference into the +original (a capture), completely take their place. These compose find +(text -> index) with substring (index -> text), as text -> +text. The basic mechanism is straight out of Buhr 94; what’s new +here is a total devotion to it, and the decoupling of captures from +in-exclude calls.
+
+
+
+Composable patterns are the subjects being in/excluded. Leaf pattern +expressivity is exact-sequence and repeat character-class, not +regular expression (same as Buhr 94). A leaf pattern can, but does +not have to, offer a capture. An adjacency-composed pattern can +capture too.
+
+
+
+Operator syntax, generalizing current sin-sout pipes, is the pattern +glue. Basic-case string parsing looks the same as current sin +reading. The more nuanced cases allow overriding behaviours like +controlling whitespace treatment.
+
+
+
+A capture wraps and hides a byte-level reference into variable-width +text. It stays valid under length-changing mutations of the original. +In UTF-8, even single-character overwrite is a byte-length changing +operation (when done safely, in the general case).
+
+
+
+A representation of buffer-gap-buffer enables these stable captures. +The gap changes size under editing, moves under cursor advancing, and +is skipped under reading. The cursor move copies characters from the +gap-adjacent source-end of one buffer to gap-adjacent sink-end of the +other.
+
+
+
+
+
+
+
UTF-8 +is the current (spring 2019) standard for text encoding. Java and +.NET use two-byte encodings, which is an older approach. UTF-8 +supports the full universe of Unicode text. A new language like CFA +must support UTF-8, likely even treat it as the native format.
+
+
+
UTF-8 is a +variable-width encoding. Every string from the 128-character ASCII +aphabet is also a UTF-8 string. If the extra bit (which sends a byte +beyond the 128 characters) is set, then the byte is the start of a +character that uses at least two bytes. And so on. Up to 4 bytes, +which covers the 17 defined Unicode planes, including the overhead +the tagging. If 17 planes are not enough for you, but 256 planes +ought to be, then that takes 6 bytes, including the tagging; this +would cover the all the characters expressible with the (obscure) +UTF-32 (aka UCS-4) 4-byte fixed-width encoding.
+
+
+
The variable-width +encoding is a challenge for the string operations length and +index. An API’s semantics must decide bewteen character- and +byte-based counting. Suppose an API supports operations stuctured as, +“give me the range of length m, starting at position i, from this +string, which you have told me has length n, where I know i <= n – +m + 1.” (This presentation uses a 1-based location count; the +choice of base is independent of the semantics under discussion.) For +example, in all encodings, and in all semantics being discussed, when +doing this operation on “Hello, world!” with i=8, m=5 and n=13, +the result is “world” and has length 5. This sematic uniformity +happens because all letters in the example string have equal length +in all encodings.
+
+
+
Contrasting more +interesting examples:
+
+
+
B8 = Byte-counting +sematics on UTF-8 encoding
+C8 = +Character-counting semantics on UTF-8 encoding
+2B16 = +Double-byte-counting semantics on UTF-16 encoding. This is what Java +and .NET use.
+
+
+
+
|
+
+ B8 + |
+
+ C8 + |
+
+ 2B16 + |
+
+ Length of: u + |
+
+ 1 + |
+
+ 1 + |
+
+ 1 + |
+
+ Length of: ü + |
+
+ 2 + |
+
+ 1 + |
+
+ 1 + |
+
+ Length of: 福 + |
+
+ 3 + |
+
+ 1 + |
+
+ 1 + |
+
+ Length of: + + + (U+1D11E) + |
+
+ 4 + |
+
+ 1 + |
+
+ 2 + |
+
+ Are ill-aligned range specifications possible? + |
+
+ Yes + |
+
+ No + |
+
+ Yes + |
+
+ Complexity of length operation, assuming start- and + end-locations known + |
+
+ O(1) + |
+
+ O(n) + |
+
+ O(1) + |
+
+ Complexity of substring operation + |
+
+ O(1) + |
+
+ O(i+m) + |
+
+ O(1) + |
+
+
+
Thus, “B” +semantics put an onus on the programmer that “C” semantics handle +in the library, for a runtime cost.
+
+
+
Typical string APIs +impose position-based operations on programmers. But programmers +would rather be position-agnostic. Outside of string-APU example +programs, numbers passed to substring operations are usually the +result of find operations. The find-substring composition is text-in, +text-out. Tinkering with the numbers in between often happens, such +as to codify the intention, “except leave off the delimeter.” +This is where bugs happen, because cases like the delimiter being +dynamic, and showing up with a byte-width longer than one are hard to +conceive. [TODO: cite UTF-16 being problematic like this] A degree of +security or internationalization expertise is needed to consider +relevant test cases that leave the Basic Multilingual Plane.
+
+
+
+
+
+
+
+
+
Mutations that +change a string’s length are prominent in early (pre-2019) work on +CFA string support, and on Buhr-94 strings. For example:
+
+
+
replace("123456789", "456", "") // result: "123789"
+The previous unicode discussion clarified that, under B8 semantics, +this too is a length-changing operation:
+
+
+
replace("francais", "c", "ç") // result: "français"
+
+
+
This proposal +supports doing such operations as mutations.
+
+
+
+
+
+
+
+
+
Abstract all +byte-length awareness within the string API. No public B semantics +when a user does parsing.
+
+
+
Offer powerful +text-to-text operations on the public interface. There is a potential +to extend this expressivity to regular-expression matching, but +current detailed work focuses on character-class and exact-sequence +constructs.
+
+
+
Remove most +number-based operations from the public interface. No illusion of +general-purpose random access.
+
+
+
Offer +character-class matching in the public interface. “Match: Any * 3” +is the only way to chunk off characters by number of occurrences; +this has strictly C semantics. This supports off-by-a-couple +adjustements when writing a parsing routine, guarantees +wide-character correctness, and puts the operation in a place where +O(n) is reasonable, as n is expected to be small.
+
+
+
Refine an API-use +style in which character-widths are only tested once. Provide +examples in this style and explanations of why this happens. Re-work +low-level designs as needed to ensure naive usage avoids re-checks.
+
+
+
Offer a few coarse +chunking operations that work with a max-bytes parameter, documented +as helpers for batching cases. Return a well-aligned result (of +indeterminate C-semantic length), by leveraging UTF-8 tagging, in +which a start-of-character is easy to find. “Copy to c-string” is +one such operation.
+
+
+
Support +substrings/selections that straddle mutations points by following an +“open binder” design, introduced below.
+
+
+
The rest of the doc +presents detail suggesting the above is achievable. The reader may +thus entertain the hypothesis that all desirable string manipulation +can take place without an index-exposing API.
+
+
+
+
+
+
+
+
+
This starter API is +an example of the text-to-text style, and C-style public semantics, +suggested earlier. It is presented mainly to clarify those concepts +before continuing. +
+
+
+
Note that matched +substrings write through into the original.
+
+
+
The recommended API +still woks this way, but also adds more control over which +(writeable) parts get captured into variables, enabling more useful +find-replace mutation cases.
+
+
+
string s, qs, s1, s2; + charclass qcc;
+Split s once, with q on the left:
+
+
+
[s1, s2] = include(s, qs); + assert s == s1 + s2; + assert s1 == "" || s1 == qs; + + [s1, s2] = include(s, qcc); + assert s == s1 + s2; + for (c1: s1) assert ismatch(c1, qcc); + assert s2 == "" || ismatch(first(s2), qcc);
+Split s once, with q on the right:
+[s1, s2] = exclude(s, qs); + assert s == s1 + s2; + assert ["", _] == include(s1, qs) + assert [qs, _] == include(s2, qs) + + [s1, s2] = exclude(s, qcc); + assert ["", s1] == include(s1, qcc); + assert s2 == "" || include(s2, qcc).0 != ""
+
+
+
All results are +writeable references:
+
+
+
// demonstration on 1st return of 1st API function; others are similar
+
+ [s1, s2] = include(s, qs);
+ assert s == s1 + s2;
+ assert s1 == "" || s1 == qs;
+
+ s1 = "hi";
+ assert s == "hi" + s2;
+
+ int s_len_old = len(s),
+ s2_len_old = len(s2);
+ s1 = "福"; // (2 ch, 2 b) overwritten with (1 ch, 3 b)
+ assert len(s) == s_len_old – 1; // length counts logical characters
+ assert len(s2) == s2_len_old; // s2 is intact, logically unaffected, yet at different distance from front
+Splitting this way works in loops. In the examples following, think +of q as a delimiter. Note how two steps are combined into a single +call: get everything up to next q; move past the q.
+
+
+
Split s repeatedly, +with q ending each match:
+
+
+
string ss = ""; + for ([s1,s2] in split_ex(s, q)) { + assert [s1, s2] == exclude(s1 + s2, q); + ss += s1; + ss += s2; + } + assert ss == s;
+
+
+
Split s repeatedly, +with q starting each match:
+
+
+
string ss = ""; + for ([s1,s2] in split_in(s, q)) { + assert [s1, s2] == include(s1 + s2, q); + ss += s1; + ss += s2; + } + assert ss == s;
+
+
+
+
+
This discussion +strives for concreteness at the risk of painting the design into a +corner. Many details still need to be worked through. The most +significant point of feedback sought here is whether the +algebra-of-patterns is appropriate and sufficient.
+
+
+
The Rev-1 API gets +awkward on cases like:
+
+
+
string s = "a=1, b=2, c=3, "; + string kv, del, k, eqv, eq, v; + for ([kv,del] in split_ex(s, ", ")) { + [k,eqv] = exclude(kv, "="); + [eq,v] = include(eqv, "="); + kv = k + v; + del = ":"; + } + assert s == "a1:b2:c3";
+Points of awkwardness are:
+Multiple API + calls on the "=" matching are still not averted
+ +No substring + capturing all of "a=1, ". If we had one, say scur, the + following assignment would be more natural than the pair in the + example:
+scur = k + v + ":"+ + +
We declare + names for substrings that are only used to target the next API call.
+ +
+
+
The recommended API +treats the rev-1 points as composable primitives, and separates (in +general) pattern from capture.
+
+
+
string s = "a=1, b=2, c=3, "; + while (string ss = s, string k, string v, string scur; + nextMatch(ss, scur & (k ^ "=" | v ^ ", ")) + ) { + scur = k + v + ":"; + } + assert s == "a1:b2:c3";
+Some work is still needed on the ss declaration, and the +iterator-loop syntactic interaction (nextMatch). Lower-level work has +explored the possibility of integrating it with “0 ~ 10”-like +stepping.
+
+
+
The top-level +construct is nextMatch(string, pattern). It matches the pattern +against the front of the string; this is described recursively next.
+
+
+
The combinators |, ^ +and & all do pattern op pattern -> pattern.
+
+
+
Base cases
+A read-only + string is a pattern that matches exactly one occurrence of itself.
+ +A writeable + string is a pattern that matches anything.
+ +Overloading + to be built out to make, e.g. a writeable int qualify as a pattern + that matches the character class [0-9].
+ +Note this + presentation considers a separate problem: how we + differentiate a writeable string from a read-only (or COW) string
+ +Expressivity + = charclass, exact sequence
+
+
+
The operators, in +their natural precendence order from loosest to tightest, and being +naturally left-associative, give the recursive cases:
+
+
+
+ L|R + |
+
+ inclusive then + |
+
+ L consumes all characters that the L-pattern accepts, then R + begins from the first character that L rejects. + + |
+
+ L^R + |
+
+ exclusive then + |
+
+ L consumes all characters that the R-pattern rejects, then R + begins from the first character that R accepts. + |
+
+ L&R + |
+
+ same as + |
+
+ General meaning: Both sides have + to match the same run of characters. +Expected use: pairing a restrictive complex non-writeable + pattern with a permissive simple writeable pattern, the latter + declaring an alias for the result + |
+
+
+
The +semantics of failing to match need to be nailed down. Basically, the +suggestion is
+don’t + throw exception
+ +don’t + end up in an infinite loop (matching zero more chars each time)
+ +just + exit the loop
+
+
+
Work +is in progress describing an iterator-processing model for these +patterns which includes modeling no-more-matches.
+
+
+
+
+
+
+
+
+
It is desirable that +a | b | c have analogous meanings, when done on a string or on +standard-in. As it is also done to standard-out, this in turn +suggests the string append operator will also become |.
+
+
+
The stream sout, +seen as a container of characters, offers one operation: write to +back. Similarly, sin reads from front. Thus, they are ready to work +against a pattern a | b | c, needing no more case refinement. The +type sytem disambiguates sin|p +from sout|p.
+
+
+
+
|
+
+ read (aka parse) + |
+
+ write + |
+
+ in forward linguistic time order + |
+
+ split from front +(sin does this) + |
+
+ append +(sout does this) + |
+
+ in reverse linguistic time order + |
+
+ split from back + |
+
+ prepend + |
+
+
+
+
+
A +string user needs to specify which behaviour is desired (myStr|p +is not enough). This choice, happening at the top level, coincides +with the need to pick a syntax for looping (strawmanned above as +while-nextMatch). This work is in progress; elements in +consideration include:
+++ + with <, vs -- with >, for move/has-next, in forward/reverse + order
+plays + nice with for loop, but confuses s > i overloading
+unary + - and + for read-from/write-to
+ +unary + ~ (or !) to switch to reverse linguistic order from default forward
+ +top-level + only: >> for read-from; << for write-to
+
+
+
When reading from +standard-in, automatically trimming whitespace is generally desired. +When processing a string (and presuming to obviate an index-based +API), it must be possible to take control of whitespace. Auto-newline +behaviour is similar.
+
+
+
Idea is:
+void ?{}(pat &, int &) // construct pattern from int reference+
gets + overridden: multiple definitions of the symbol are out there
+ +the default + one is scope-visible, at background level, upon “#include string”
+ +user may + elevate a different one to current level by: with + (stringopts.picky) {...}
+
+
+
Illustration without +an SPI, and covering the cases we care about:
+Default
+input + spelling is base-10
+ +Be + aggressive about clearing whitespace, lax about its absence; + adjacent whitespaces are insignificant
+ +all + contiguous [0-9] into the int; panic on overflow
+ +require at + least one [0-9] +
+ +yadda yadda + leading zeros, atoi-consistent
+Picky
+Don’t + touch whitespace
+ +require + exactly one [0-9], no coalescing
+
+
+
Probably, a +character-level SPI is required, to enable extensibility.
+
+
+
+
+
+
+
This design +addresses internals hidden by the string API.
+
+
+
All depicted +references into a string are by byte location.
+
+
+
The name suggests a +three-ring binder of papers, sitting opened to a middle page, with +the rings un-clasped. It is easy to insert or delete at the current +opened-to point, given the expense that data movement is required to +advance this current point, proportional to the distance moved.
+
+
+
We are optimizing +for a single iteration through a string, in which substantial +rewriting is happening at the iterator’s current point.
+
+
+
Suppose we have:
+1 string s, s1, s2; +2 s = "abcdefghijklmnopqrstuvwxyz"; +3 [s1, s2] = exclude(s, 'k'); // s1=="abcdefghij", s2=="klmnopqrstuvwxyz" +4 s1 += "xxx"; // s =="abcdefghijxxxklmnopqrstuvwxyz"
+These objects can be represented by the buffers following, where the +__ underscores represent don’t-care bytes.
+
+
+
After line 2:
+_________abcdefghijklmnopqrstuvwxyz
+After line 3:
+abcdefghij_________klmnopqrstuvwxyz
+After line 4:
+abcdefghijxxx______klmnopqrstuvwxyz
+
+
+
A sub-string is +implemented as
+having a pair + of references into the buffer (start-byte, end-byte)
+ +being in a + linked list that incudes all active sub-strings of the buffer, + ordered by start position
+ +being in a + similar list by end positon
+ +knowing that + a special element from the substrings’ lists marks the edit gap + (or a is-editable/has-edited state on an iterator, if supporting + multiple edit locations
+
+
+
The string buffer +implements automatically growing/shrinking the gap, by reallocating +at a different size and fixing the iterators.
+
+
+
+
+
+
+
+
+
API’s approach to +UTF-8 and indexing in:
+Go
+ +Rust
+ +std::string
+ +alternate + proposals for Java at BMP-exit JSR
+show these + are back-compat challenges unique to java that we can get around
+ +still, + position wrt c-compat
+(find some + super-high-level, text-processing DSL... what does it do?)
+big data / + NLP ??
+
+
+
Existing +char*-implemented algorithms
+find some
+ +ensure the + table-implementation for substring match features in my + exact-substring pattern
+ +clarify + utf8-agnosticism
+ +case study + for promoting / how-to
+ +“client ABC + should convert their RDBMS to CFA”
+ +getopt for + command-line arguments (thanks Thierry for the example)
+case of it + reordering argv to do names then positionals; this is how gcc does + mixed
+
+
+
Typical regex +implementation algorithms
+
+
+
+
+
+
+
+
+
+
+