This document presents a string API for Cforall that accommodates Unicode text while staying safe. The safety challenge is presented and addressed.
UTF-8 is the format.
Index-based operations in the public interface are reduced or eliminated. Indexing does not play well with UTF-8’s variable-width characters.
Include-Exclude operations, which return a mutable reference into the original (a capture), take their place completely. These compose find (text -> index) with substring (index -> text), as text -> text. The basic mechanism is straight out of Buhr 94; what's new here is the total devotion to it, and the decoupling of captures from include/exclude calls.
Composable patterns are the subjects being included or excluded. Leaf-pattern expressivity is exact-sequence and repeat character-class, not regular expressions (same as Buhr 94). A leaf pattern can, but does not have to, offer a capture. An adjacency-composed pattern can capture too.
Operator syntax, generalizing current sin-sout pipes, is the pattern glue. Basic-case string parsing looks the same as current sin reading. The more nuanced cases allow overriding behaviours like controlling whitespace treatment.
A capture wraps and hides a byte-level reference into variable-width text. It stays valid under length-changing mutations of the original. In UTF-8, even a single-character overwrite is a byte-length-changing operation (when done safely, in the general case).
A buffer-gap-buffer representation enables these stable captures. The gap changes size under editing, moves under cursor advancing, and is skipped under reading. A cursor move copies characters from the gap-adjacent source end of one buffer to the gap-adjacent sink end of the other.
UTF-8 is the current (spring 2019) standard for text encoding. Java and .NET use two-byte (UTF-16) encodings, an older approach. UTF-8 supports the full universe of Unicode text. A new language like CFA must support UTF-8, and likely even treat it as the native format.
UTF-8 is a variable-width encoding. Every string over the 128-character ASCII alphabet is also a UTF-8 string. If the extra bit (which sends a byte beyond the 128 ASCII characters) is set, then the byte is the start of a character that uses at least two bytes. And so on, up to 4 bytes, which covers the 17 defined Unicode planes, including the tagging overhead. If 17 planes are not enough for you, but 256 planes ought to be, then that takes 6 bytes, including the tagging; this would cover all the characters expressible with the (obscure) UTF-32 (aka UCS-4) 4-byte fixed-width encoding.
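As a concrete reading of the tagging scheme, here is a minimal C sketch (illustrative only, not part of the proposed API) that recovers a character's byte width from its lead byte:

// Byte width of a UTF-8 character, read off its lead byte's tag bits:
//   0xxxxxxx -> 1 byte (ASCII)
//   110xxxxx -> 2 bytes, 1110xxxx -> 3 bytes, 11110xxx -> 4 bytes
//   10xxxxxx -> continuation byte, never a character start
int utf8_char_width(unsigned char lead) {
    if (lead < 0x80)           return 1;  // plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;  // reaches all 17 Unicode planes
    return -1;                            // continuation or invalid lead byte
}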
The variable-width encoding is a challenge for the string operations length and index. An API's semantics must decide between character- and byte-based counting. Suppose an API supports operations structured as, "give me the range of length m, starting at position i, from this string, which you have told me has length n, where I know i <= n - m + 1." (This presentation uses a 1-based location count; the choice of base is independent of the semantics under discussion.) For example, in all encodings, and in all semantics being discussed, when doing this operation on "Hello, world!" with i=8, m=5 and n=13, the result is "world" and has length 5. This semantic uniformity happens because all letters in the example string have equal length in all encodings.
Contrasting more interesting examples:
B8 = Byte-counting semantics on UTF-8 encoding
C8 = Character-counting semantics on UTF-8 encoding
2B16 = Double-byte-counting semantics on UTF-16 encoding. This is what Java and .NET use.
                                                                     |  B8  |  C8    | 2B16
Length of: u                                                         |  1   |  1     |  1
Length of: ü                                                         |  2   |  1     |  1
Length of: 福                                                        |  3   |  1     |  1
Length of: 𝄞 (U+1D11E)                                               |  4   |  1     |  2
Are ill-aligned range specifications possible?                       |  Yes |  No    |  Yes
Complexity of length operation (start- and end-locations known)      | O(1) |  O(n)  | O(1)
Complexity of substring operation                                    | O(1) | O(i+m) | O(1)
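The O(n) cost of C8 length follows directly from the encoding: character counting requires a scan. A minimal C sketch (illustrative, not the proposed API):

#include <stddef.h>

// Count characters in a UTF-8 byte range. Each character contributes
// exactly one byte that is not a continuation byte (continuation bytes
// are tagged 10xxxxxx), so count the non-continuation bytes.
size_t utf8_char_count(const char *s, size_t nbytes) {
    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    return count;
}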
Thus, “B” semantics put an onus on the programmer that “C” semantics handle in the library, for a runtime cost.
Typical string APIs impose position-based operations on programmers. But programmers would rather be position-agnostic. Outside of string-API example programs, the numbers passed to substring operations are usually the results of find operations. The find-substring composition is text-in, text-out. Tinkering with the numbers in between often happens, such as to codify the intention, "except leave off the delimiter." This is where bugs happen, because cases like the delimiter being dynamic, and showing up with a byte width longer than one, are hard to conceive of. [TODO: cite UTF-16 being problematic like this] A degree of security or internationalization expertise is needed to consider relevant test cases that leave the Basic Multilingual Plane.
Mutations that change a string’s length are prominent in early (pre-2019) work on CFA string support, and on Buhr-94 strings. For example:
replace("123456789", "456", "") // result: "123789"
The preceding Unicode discussion clarified that, under B8 semantics, this too is a length-changing operation:
replace("francais", "c", "ç") // result: "français"
This proposal supports doing such operations as mutations.
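The byte-level effect can be checked with nothing but standard C (strlen counts bytes, i.e., B8 semantics):

#include <assert.h>
#include <string.h>

int main(void) {
    // "ç" is U+00E7, which UTF-8 encodes in two bytes (0xC3 0xA7),
    // so the replacement grows the string by one byte.
    assert(strlen("francais") == 8);
    assert(strlen("fran\xC3\xA7ais") == 9);  // "français"
    return 0;
}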
Abstract all byte-length awareness within the string API. No public B semantics when a user does parsing.
Offer powerful text-to-text operations on the public interface. There is a potential to extend this expressivity to regular-expression matching, but current detailed work focuses on character-class and exact-sequence constructs.
Remove most number-based operations from the public interface. No illusion of general-purpose random access.
Offer character-class matching in the public interface. "Match: Any * 3" is the only way to chunk off characters by number of occurrences; this has strictly C semantics. This supports off-by-a-couple adjustments when writing a parsing routine, guarantees wide-character correctness, and puts the operation in a place where O(n) is reasonable, as n is expected to be small.
Refine an API-use style in which character-widths are only tested once. Provide examples in this style and explanations of why this happens. Re-work low-level designs as needed to ensure naive usage avoids re-checks.
Offer a few coarse chunking operations that work with a max-bytes parameter, documented as helpers for batching cases. Return a well-aligned result (of indeterminate C-semantic length) by leveraging UTF-8 tagging, in which a start-of-character is easy to find; see the sketch after this list. "Copy to c-string" is one such operation.
Support substrings/selections that straddle mutation points by following an "open binder" design, introduced below.
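A sketch of the well-aligned chunking helper promised above, assuming only the UTF-8 tagging rules (the name and signature are hypothetical):

#include <stddef.h>

// Largest n <= max_bytes such that s[0..n) ends on a character
// boundary. Continuation bytes are tagged 10xxxxxx, so cutting
// before one would split a character; back up until that is not so.
size_t utf8_align_prefix(const char *s, size_t len, size_t max_bytes) {
    size_t n = len < max_bytes ? len : max_bytes;
    if (n == len) return n;                                 // whole string fits
    while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80)   // s[n] is mid-character
        --n;
    return n;
}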
The rest of the doc presents detail suggesting the above is achievable. The reader may thus entertain the hypothesis that all desirable string manipulation can take place without an index-exposing API.
This starter API is an example of the text-to-text style, and C-style public semantics, suggested earlier. It is presented mainly to clarify those concepts before continuing.
Note that matched substrings write through into the original.
The recommended API still works this way, but also adds more control over which (writeable) parts get captured into variables, enabling more useful find-replace mutation cases.
string s, qs, s1, s2; charclass qcc;
Split s once, with q on the left:
[s1, s2] = include(s, qs);
assert s == s1 + s2;
assert s1 == "" || s1 == qs;

[s1, s2] = include(s, qcc);
assert s == s1 + s2;
for (c1: s1) assert ismatch(c1, qcc);
assert s2 == "" || !ismatch(first(s2), qcc);
Split s once, with q on the right:
[s1, s2] = exclude(s, qs);
assert s == s1 + s2;
assert ["", _] == include(s1, qs);
assert [qs, _] == include(s2, qs);

[s1, s2] = exclude(s, qcc);
assert ["", s1] == include(s1, qcc);
assert s2 == "" || include(s2, qcc).0 != "";
All results are writeable references:
// demonstration on 1st return of 1st API function; others are similar
[s1, s2] = include(s, qs);
assert s == s1 + s2;
assert s1 == "" || s1 == qs;
s1 = "hi";
assert s == "hi" + s2;
int s_len_old = len(s),
s2_len_old = len(s2);
s1 = "福"; // (2 ch, 2 b) overwritten with (1 ch, 3 b)
assert len(s) == s_len_old - 1; // length counts logical characters
assert len(s2) == s2_len_old; // s2 is intact, logically unaffected, yet at different distance from front
Splitting this way works in loops. In the examples following, think of q as a delimiter. Note how two steps are combined into a single call: get everything up to next q; move past the q.
Split s repeatedly, with q ending each match:
string ss = ""; for ([s1,s2] in split_ex(s, q)) { assert [s1, s2] == exclude(s1 + s2, q); ss += s1; ss += s2; } assert ss == s;
Split s repeatedly, with q starting each match:
string ss = ""; for ([s1,s2] in split_in(s, q)) { assert [s1, s2] == include(s1 + s2, q); ss += s1; ss += s2; } assert ss == s;
This discussion strives for concreteness at the risk of painting the design into a corner. Many details still need to be worked through. The most significant point of feedback sought here is whether the algebra-of-patterns is appropriate and sufficient.
The Rev-1 API gets awkward on cases like:
string s = "a=1, b=2, c=3, "; string kv, del, k, eqv, eq, v; for ([kv,del] in split_ex(s, ", ")) { [k,eqv] = exclude(kv, "="); [eq,v] = include(eqv, "="); kv = k + v; del = ":"; } assert s == "a1:b2:c3";
Points of awkwardness are:
Multiple API calls are still needed for the "=" matching
No substring capturing all of "a=1, ". If we had one, say scur, the following assignment would be more natural than the pair in the example:
scur = k + v + ":"
We declare names for substrings that are only used to target the next API call.
The recommended API addresses the rev-1 pain points by treating patterns as composable primitives, and by separating (in general) pattern from capture.
string s = "a=1, b=2, c=3, "; while (string ss = s, string k, string v, string scur; nextMatch(ss, scur & (k ^ "=" | v ^ ", ")) ) { scur = k + v + ":"; } assert s == "a1:b2:c3";
Some work is still needed on the ss declaration, and the iterator-loop syntactic interaction (nextMatch). Lower-level work has explored the possibility of integrating it with “0 ~ 10”-like stepping.
The top-level construct is nextMatch(string, pattern). It matches the pattern against the front of the string; this is described recursively next.
The combinators |, ^ and & all do pattern op pattern -> pattern.
Base cases
A read-only string is a pattern that matches exactly one occurrence of itself.
A writeable string is a pattern that matches anything.
Overloading is to be built out to make, e.g., a writeable int qualify as a pattern that matches the character class [0-9].
Note this presentation treats as a separate problem how to differentiate a writeable string from a read-only (or COW) string.
Expressivity = charclass, exact sequence
The operators, in their natural precedence order from loosest to tightest, and being naturally left-associative, give the recursive cases:
L|R | inclusive then | L consumes all characters that the L-pattern accepts, then R begins from the first character that L rejects.
L^R | exclusive then | L consumes all characters that the R-pattern rejects, then R begins from the first character that R accepts.
L&R | same as        | General meaning: both sides have to match the same run of characters. Expected use: pairing a restrictive complex non-writeable pattern with a permissive simple writeable pattern, the latter declaring an alias for the result.
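As a worked reading of the Rev-2 example above, with pattern and names unchanged, here is how the combinators carve up the front of the subject string:

// Matching the front of "a=1, b=2, c=3, " against scur & (k ^ "=" | v ^ ", "):
//   k ^ "="    -- k consumes "a" (everything "=" rejects), then "=" consumes "="
//   v ^ ", "   -- v consumes "1" (everything ", " rejects), then ", " consumes ", "
//   ... | ...  -- the v-side begins where the k-side stops
//   scur & ... -- scur aliases the whole matched run, "a=1, "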
The semantics of failing to match need to be nailed down. Basically, the suggestion is:
don’t throw exception
don’t end up in an infinite loop (matching zero more chars each time)
just exit the loop
Work is in progress describing an iterator-processing model for these patterns which includes modeling no-more-matches.
It is desirable that a | b | c have analogous meanings, whether done on a string or on standard-in. As the same is also done to standard-out, this in turn suggests the string append operator will also become |.
The stream sout, seen as a container of characters, offers one operation: write to back. Similarly, sin reads from front. Thus, they are ready to work against a pattern a | b | c, needing no more case refinement. The type system disambiguates sin|p from sout|p.
                                 | read (aka parse)                 | write
in forward linguistic time order | split from front (sin does this) | append (sout does this)
in reverse linguistic time order | split from back                  | prepend
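Under this reading, the same pattern glue serves streams and strings alike. A hypothetical sketch (CFA-style, combining the pattern operators above with the current sin/sout pipes; syntax not final):

string k, v;
sin  | k ^ "=" | v ^ "\n";   // read: split from front of standard-in
sout | k | ":" | v;          // write: append to standard-out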
A string user needs to specify which behaviour is desired (myStr|p is not enough). This choice, happening at the top level, coincides with the need to pick a syntax for looping (strawmanned above as while-nextMatch). This work is in progress; elements in consideration include:
++ with <, vs -- with >, for move/has-next, in forward/reverse order
plays nice with for loop, but confuses s > i overloading
unary - and + for read-from/write-to
unary ~ (or !) to switch to reverse linguistic order from default forward
top-level only: >> for read-from; << for write-to
When reading from standard-in, automatically trimming whitespace is generally desired. When processing a string (and presuming to obviate an index-based API), it must be possible to take control of whitespace. Auto-newline behaviour is similar.
The idea is:
void ?{}(pat &, int &) // construct pattern from int reference
gets overridden: multiple definitions of the symbol are out there
the default one is scope-visible, at background level, upon “#include string”
user may elevate a different one to current level by: with (stringopts.picky) {...}
Illustration without an SPI, and covering the cases we care about (a C sketch of the Default rules follows these lists):
Default
input spelling is base-10
Be aggressive about clearing whitespace, lax about its absence; adjacent whitespaces are insignificant
all contiguous [0-9] into the int; panic on overflow
require at least one [0-9]
yadda yadda leading zeros, atoi-consistent
Picky
Don’t touch whitespace
require exactly one [0-9], no coalescing
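A C sketch of the Default rules above (the helper name and signature are hypothetical; it reports failure instead of panicking):

#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

// Default int matching: clear leading whitespace aggressively, require
// at least one [0-9], consume all contiguous digits in base 10
// (atoi-consistent leading zeros), and fail on overflow.
bool match_int_default(const char **ps, int *out) {
    const char *s = *ps;
    while (isspace((unsigned char)*s)) s++;            // whitespace is insignificant
    if (!isdigit((unsigned char)*s)) return false;     // at least one digit required
    errno = 0;
    char *end;
    long v = strtol(s, &end, 10);
    if (errno == ERANGE || v > INT_MAX) return false;  // overflow
    *out = (int)v;
    *ps = end;                                         // consume the digit run
    return true;
}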
Probably, a character-level SPI is required, to enable extensibility.
This design addresses internals hidden by the string API.
All depicted references into a string are by byte location.
The name suggests a three-ring binder of papers, sitting opened to a middle page, with the rings un-clasped. It is easy to insert or delete at the current opened-to point, at the expense that advancing this current point requires data movement, proportional to the distance moved.
We are optimizing for a single iteration through a string, in which substantial rewriting is happening at the iterator’s current point.
Suppose we have:
1  string s, s1, s2;
2  s = "abcdefghijklmnopqrstuvwxyz";
3  [s1, s2] = exclude(s, 'k');   // s1=="abcdefghij", s2=="klmnopqrstuvwxyz"
4  s1 += "xxx";                  // s =="abcdefghijxxxklmnopqrstuvwxyz"
These objects can be represented by the buffers following, where the __ underscores represent don’t-care bytes.
After line 2:
_________abcdefghijklmnopqrstuvwxyz
After line 3:
abcdefghij_________klmnopqrstuvwxyz
After line 4:
abcdefghijxxx______klmnopqrstuvwxyz
A sub-string is implemented as:
having a pair of references into the buffer (start-byte, end-byte)
being in a linked list that includes all active sub-strings of the buffer, ordered by start position
being in a similar list, ordered by end position
knowing that a special element from the substrings' lists marks the edit gap (or an is-editable/has-edited state on an iterator, if supporting multiple edit locations)
The string buffer implements automatically growing/shrinking the gap, by reallocating at a different size and fixing the iterators.
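A C sketch of this representation (all names illustrative, not the CFA implementation):

#include <stddef.h>
#include <string.h>

typedef struct substr substr;
struct substr {
    size_t start, end;       // byte offsets into the buffer
    substr *next_by_start;   // all live sub-strings, ordered by start
    substr *next_by_end;     // ... and ordered by end
};

typedef struct {
    char  *buf;        // one allocation holding both text runs
    size_t cap;        // total bytes allocated
    size_t gap_start;  // first byte of the gap (the "open rings")
    size_t gap_end;    // one past the last gap byte
    substr *by_start, *by_end;
} binder;

// Advance the edit point by nbytes: copy characters from the
// gap-adjacent source end (after the gap) to the gap-adjacent sink
// end (before the gap), then shift the gap right.
void binder_advance(binder *b, size_t nbytes) {
    memmove(b->buf + b->gap_start, b->buf + b->gap_end, nbytes);
    b->gap_start += nbytes;
    b->gap_end   += nbytes;
    // A full implementation also walks the by_start/by_end lists and
    // rebiases any sub-string references inside the moved range.
}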
API’s approach to UTF-8 and indexing in:
Go
Rust
std::string
alternate proposals for Java at BMP-exit JSR
show these are back-compat challenges unique to Java that we can get around
still, position wrt c-compat
(find some super-high-level, text-processing DSL... what does it do?)
big data / NLP ??
Existing char*-implemented algorithms
find some
ensure the table implementation for substring matching features in my exact-substring pattern
clarify utf8-agnosticism
case study for promoting / how-to
“client ABC should convert their RDBMS to CFA”
getopt for command-line arguments (thanks Thierry for the example)
case of it reordering argv to do names then positionals; this is how gcc does mixed
Typical regex implementation algorithms