Summary


This document presents a string API for Cforall that accommodates Unicode text while staying safe. The safety challenge is presented and addressed.


UTF-8 is the format.


Index-based operations in the public interface are reduced or eliminated. Indexing does not play well with UTF-8’s variable-width characters.


Include-Exclude operations, which return a mutable reference into the original (a capture), completely take their place. These compose find (text -> index) with substring (index -> text), as text -> text. The basic mechanism is straight out of Buhr 94; what’s new here is a total devotion to it, and the decoupling of captures from in-exclude calls.


Composable patterns are the subjects being in/excluded. Leaf pattern expressivity is exact-sequence and repeat character-class, not regular expression (same as Buhr 94). A leaf pattern can, but does not have to, offer a capture. An adjacency-composed pattern can capture too.


Operator syntax, generalizing current sin-sout pipes, is the pattern glue. Basic-case string parsing looks the same as current sin reading. The more nuanced cases allow overriding behaviours like controlling whitespace treatment.


A capture wraps and hides a byte-level reference into variable-width text. It stays valid under length-changing mutations of the original. In UTF-8, even single-character overwrite is a byte-length changing operation (when done safely, in the general case).


A representation of buffer-gap-buffer enables these stable captures. The gap changes size under editing, moves under cursor advancing, and is skipped under reading. The cursor move copies characters from the gap-adjacent source-end of one buffer to gap-adjacent sink-end of the other.



The Unicode challenge


UTF-8 is the current (spring 2019) standard for text encoding. Java and .NET use two-byte encodings, which is an older approach. UTF-8 supports the full universe of Unicode text. A new language like CFA must support UTF-8, likely even treat it as the native format.


UTF-8 is a variable-width encoding. Every string from the 128-character ASCII aphabet is also a UTF-8 string. If the extra bit (which sends a byte beyond the 128 characters) is set, then the byte is the start of a character that uses at least two bytes. And so on. Up to 4 bytes, which covers the 17 defined Unicode planes, including the overhead the tagging. If 17 planes are not enough for you, but 256 planes ought to be, then that takes 6 bytes, including the tagging; this would cover the all the characters expressible with the (obscure) UTF-32 (aka UCS-4) 4-byte fixed-width encoding.


The variable-width encoding is a challenge for the string operations length and index. An API’s semantics must decide bewteen character- and byte-based counting. Suppose an API supports operations stuctured as, “give me the range of length m, starting at position i, from this string, which you have told me has length n, where I know i <= n – m + 1.” (This presentation uses a 1-based location count; the choice of base is independent of the semantics under discussion.) For example, in all encodings, and in all semantics being discussed, when doing this operation on “Hello, world!” with i=8, m=5 and n=13, the result is “world” and has length 5. This sematic uniformity happens because all letters in the example string have equal length in all encodings.


Contrasting more interesting examples:


B8 = Byte-counting sematics on UTF-8 encoding

C8 = Character-counting semantics on UTF-8 encoding

2B16 = Double-byte-counting semantics on UTF-16 encoding. This is what Java and .NET use.



B8

C8

2B16

Length of: u

1

1

1

Length of: ü

2

1

1

Length of:

3

1

1

Length of: (U+1D11E)

4

1

2

Are ill-aligned range specifications possible?

Yes

No

Yes

Complexity of length operation, assuming start- and end-locations known

O(1)

O(n)

O(1)

Complexity of substring operation

O(1)

O(i+m)

O(1)


Thus, “B” semantics put an onus on the programmer that “C” semantics handle in the library, for a runtime cost.


Typical string APIs impose position-based operations on programmers. But programmers would rather be position-agnostic. Outside of string-APU example programs, numbers passed to substring operations are usually the result of find operations. The find-substring composition is text-in, text-out. Tinkering with the numbers in between often happens, such as to codify the intention, “except leave off the delimeter.” This is where bugs happen, because cases like the delimiter being dynamic, and showing up with a byte-width longer than one are hard to conceive. [TODO: cite UTF-16 being problematic like this] A degree of security or internationalization expertise is needed to consider relevant test cases that leave the Basic Multilingual Plane.




Length-changing operations


Mutations that change a string’s length are prominent in early (pre-2019) work on CFA string support, and on Buhr-94 strings. For example:


    replace("123456789", "456", "") // result: "123789"

The previous unicode discussion clarified that, under B8 semantics, this too is a length-changing operation:


    replace("francais", "c", "ç")  // result: "français"


This proposal supports doing such operations as mutations.




Solution Principles


Abstract all byte-length awareness within the string API. No public B semantics when a user does parsing.


Offer powerful text-to-text operations on the public interface. There is a potential to extend this expressivity to regular-expression matching, but current detailed work focuses on character-class and exact-sequence constructs.


Remove most number-based operations from the public interface. No illusion of general-purpose random access.


Offer character-class matching in the public interface. “Match: Any * 3” is the only way to chunk off characters by number of occurrences; this has strictly C semantics. This supports off-by-a-couple adjustements when writing a parsing routine, guarantees wide-character correctness, and puts the operation in a place where O(n) is reasonable, as n is expected to be small.


Refine an API-use style in which character-widths are only tested once. Provide examples in this style and explanations of why this happens. Re-work low-level designs as needed to ensure naive usage avoids re-checks.


Offer a few coarse chunking operations that work with a max-bytes parameter, documented as helpers for batching cases. Return a well-aligned result (of indeterminate C-semantic length), by leveraging UTF-8 tagging, in which a start-of-character is easy to find. “Copy to c-string” is one such operation.


Support substrings/selections that straddle mutations points by following an “open binder” design, introduced below.


The rest of the doc presents detail suggesting the above is achievable. The reader may thus entertain the hypothesis that all desirable string manipulation can take place without an index-exposing API.




Text-to-text API, rev 1


This starter API is an example of the text-to-text style, and C-style public semantics, suggested earlier. It is presented mainly to clarify those concepts before continuing.


Note that matched substrings write through into the original.


The recommended API still woks this way, but also adds more control over which (writeable) parts get captured into variables, enabling more useful find-replace mutation cases.


    string s, qs, s1, s2;
    charclass qcc;

Split s once, with q on the left:


    [s1, s2] = include(s, qs);
    assert s == s1 + s2;
    assert s1 == "" || s1 == qs;

    [s1, s2] = include(s, qcc);
    assert s == s1 + s2;
    for (c1: s1) assert ismatch(c1, qcc);
    assert s2 == "" || ismatch(first(s2), qcc);

Split s once, with q on the right:

    [s1, s2] = exclude(s, qs);
    assert s == s1 + s2;
    assert ["", _] == include(s1, qs)
    assert [qs, _] == include(s2, qs)

    [s1, s2] = exclude(s, qcc);
    assert ["", s1] == include(s1, qcc);
    assert s2 == "" || include(s2, qcc).0 != ""


All results are writeable references:


    // demonstration on 1st return of 1st API function; others are similar

    [s1, s2] = include(s, qs);
    assert s == s1 + s2;
    assert s1 == "" || s1 == qs;

    s1 = "hi";
    assert s == "hi" + s2;

    int s_len_old  = len(s), 
        s2_len_old = len(s2);
    s1 = "";                       // (2 ch, 2 b) overwritten with (1 ch, 3 b)
    assert len(s) == s_len_old – 1; // length counts logical characters
    assert len(s2) == s2_len_old;   // s2 is intact, logically unaffected, yet at different distance from front

Splitting this way works in loops. In the examples following, think of q as a delimiter. Note how two steps are combined into a single call: get everything up to next q; move past the q.


Split s repeatedly, with q ending each match:


    string ss = "";
    for ([s1,s2] in split_ex(s, q)) {
        assert [s1, s2] == exclude(s1 + s2, q);
        ss += s1;
        ss += s2;
    }
    assert ss == s;


Split s repeatedly, with q starting each match:


    string ss = "";
    for ([s1,s2] in split_in(s, q)) {
        assert [s1, s2] == include(s1 + s2, q);
        ss += s1;
        ss += s2;
    }
    assert ss == s;


Recommended API


This discussion strives for concreteness at the risk of painting the design into a corner. Many details still need to be worked through. The most significant point of feedback sought here is whether the algebra-of-patterns is appropriate and sufficient.


The Rev-1 API gets awkward on cases like:


    string s = "a=1, b=2, c=3, ";
    string kv, del, k, eqv, eq, v;
    for ([kv,del] in split_ex(s, ", ")) {
        [k,eqv] = exclude(kv, "=");
        [eq,v] = include(eqv, "=");
        kv = k + v;
        del = ":";
    }
    assert s == "a1:b2:c3";

Points of awkwardness are:


The recommended API treats the rev-1 points as composable primitives, and separates (in general) pattern from capture.


    string s = "a=1, b=2, c=3, ";
    while (string ss = s, string k, string v, string scur;
           nextMatch(ss, scur & (k ^ "=" | v ^ ", "))
          ) {
        scur = k + v + ":";
    }
    assert s == "a1:b2:c3";

Some work is still needed on the ss declaration, and the iterator-loop syntactic interaction (nextMatch). Lower-level work has explored the possibility of integrating it with “0 ~ 10”-like stepping.


The top-level construct is nextMatch(string, pattern). It matches the pattern against the front of the string; this is described recursively next.


The combinators |, ^ and & all do pattern op pattern -> pattern.


Base cases


The operators, in their natural precendence order from loosest to tightest, and being naturally left-associative, give the recursive cases:


L|R

inclusive then

L consumes all characters that the L-pattern accepts, then R begins from the first character that L rejects.

L^R

exclusive then

L consumes all characters that the R-pattern rejects, then R begins from the first character that R accepts.

L&R

same as

General meaning: Both sides have to match the same run of characters.

Expected use: pairing a restrictive complex non-writeable pattern with a permissive simple writeable pattern, the latter declaring an alias for the result


The semantics of failing to match need to be nailed down. Basically, the suggestion is


Work is in progress describing an iterator-processing model for these patterns which includes modeling no-more-matches.




Unification with streams


It is desirable that a | b | c have analogous meanings, when done on a string or on standard-in. As it is also done to standard-out, this in turn suggests the string append operator will also become |.


The stream sout, seen as a container of characters, offers one operation: write to back. Similarly, sin reads from front. Thus, they are ready to work against a pattern a | b | c, needing no more case refinement. The type sytem disambiguates sin|p from sout|p.



read (aka parse)

write

in forward linguistic time order

split from front

(sin does this)

append

(sout does this)

in reverse linguistic time order

split from back

prepend



A string user needs to specify which behaviour is desired (myStr|p is not enough). This choice, happening at the top level, coincides with the need to pick a syntax for looping (strawmanned above as while-nextMatch). This work is in progress; elements in consideration include:


When reading from standard-in, automatically trimming whitespace is generally desired. When processing a string (and presuming to obviate an index-based API), it must be possible to take control of whitespace. Auto-newline behaviour is similar.


Idea is:

void ?{}(pat &, int &) // construct pattern from int reference


Illustration without an SPI, and covering the cases we care about:


Probably, a character-level SPI is required, to enable extensibility.



Open Binder design


This design addresses internals hidden by the string API.


All depicted references into a string are by byte location.


The name suggests a three-ring binder of papers, sitting opened to a middle page, with the rings un-clasped. It is easy to insert or delete at the current opened-to point, given the expense that data movement is required to advance this current point, proportional to the distance moved.


We are optimizing for a single iteration through a string, in which substantial rewriting is happening at the iterator’s current point.


Suppose we have:

1   string s, s1, s2;
2   s = "abcdefghijklmnopqrstuvwxyz";
3   [s1, s2] = exclude(s, 'k');       // s1=="abcdefghij", s2=="klmnopqrstuvwxyz"
4   s1 += "xxx";                      // s =="abcdefghijxxxklmnopqrstuvwxyz"

These objects can be represented by the buffers following, where the __ underscores represent don’t-care bytes.


After line 2:

           _________abcdefghijklmnopqrstuvwxyz

After line 3:

           abcdefghij_________klmnopqrstuvwxyz

After line 4:

           abcdefghijxxx______klmnopqrstuvwxyz


A sub-string is implemented as


The string buffer implements automatically growing/shrinking the gap, by reallocating at a different size and fixing the iterators.



Lit Review to Include



API’s approach to UTF-8 and indexing in:


Existing char*-implemented algorithms


Typical regex implementation algorithms