source: doc/proposals/unicode.html @ 5a3d67c

Last change on this file since 5a3d67c was 686f731, checked in by Michael Brooks <mlbrooks@…>, 6 years ago

unicode strings proposal - initial draft

1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
2<html>
3<head>
4        <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
5        <title></title>
6        <meta name="generator" content="LibreOffice 5.1.6.2 (Linux)"/>
7        <meta name="created" content="00:00:00"/>
8        <meta name="changed" content="2019-04-30T12:17:26.683110610"/>
9        <meta name="created" content="00:00:00">
10        <meta name="changed" content="2019-04-30T11:56:44.860373711">
11        <meta name="created" content="00:00:00">
12        <meta name="changed" content="2019-04-30T11:40:36.209830313">
13        <meta name="created" content="00:00:00">
14        <meta name="changed" content="2019-04-30T11:36:41.526514563">
15        <meta name="created" content="00:00:00">
16        <meta name="changed" content="2019-04-30T11:25:44.054610077">
17        <style type="text/css">
18                @page { margin: 2cm }
19                p { margin-bottom: 0.25cm; line-height: 120% }
20                td p { margin-bottom: 0cm; line-height: 120% }
21                h1 { margin-bottom: 0.21cm }
22                h1.western { font-family: "Liberation Sans", sans-serif; font-size: 18pt }
23                h1.cjk { font-family: "Noto Sans CJK SC Regular"; font-size: 18pt }
24                h1.ctl { font-family: "FreeSans"; font-size: 18pt }
25                pre.cjk { font-family: "Courier New", monospace }
26                a:link { so-language: zxx }
27        </style>
28</head>
29<body lang="en-CA" dir="ltr">
30<h1 class="western">Summary</h1>
31<p style="margin-bottom: 0cm; line-height: 100%"><br/>
32
33</p>
34<p style="margin-bottom: 0cm; font-variant: normal; font-style: normal; line-height: 100%">
35This document presents a string API for Cforall that accommodates
36Unicode text while staying safe. The safety challenge is presented
37and addressed.</p>
38<p style="margin-bottom: 0cm; line-height: 100%"><br/>
39
40</p>
41<p style="margin-bottom: 0cm; font-variant: normal; font-style: normal; line-height: 100%">
42UTF-8 is <span style="text-decoration: none"><i>the</i></span>
43format.</p>
44<p style="margin-bottom: 0cm; line-height: 100%"><br/>
45
46</p>
47<p style="margin-bottom: 0cm; font-variant: normal; font-style: normal; line-height: 100%">
48Index-based operations in the public interface are reduced or
49eliminated. Indexing does not play well with UTF-8’s variable-width
50characters.</p>
51<p style="margin-bottom: 0cm; line-height: 100%"><br/>
52
53</p>
54<p style="margin-bottom: 0cm; font-variant: normal; font-style: normal; line-height: 100%">
55Include-Exclude operations, which return a mutable reference into the
56original (a capture), completely take their place. These compose find
57(text -&gt; index) with substring (index -&gt; text), as text -&gt;
58text. The basic mechanism is straight out of Buhr 94; what’s new
59here is a total devotion to it, and the decoupling of captures from
60in-exclude calls.</p>
61<p style="margin-bottom: 0cm; line-height: 100%"><br/>
62
63</p>
64<p style="margin-bottom: 0cm; font-variant: normal; font-style: normal; line-height: 100%">
65Composable patterns are the subjects being in/excluded. Leaf pattern
66expressivity is exact-sequence and repeat character-class, not
67regular expression (same as Buhr 94). A leaf pattern can, but does
68not have to, offer a capture. An adjacency-composed pattern can
69capture too.</p>
70<p style="margin-bottom: 0cm; line-height: 100%"><br/>
71
72</p>
73<p style="margin-bottom: 0cm; font-variant: normal; font-style: normal; line-height: 100%">
74Operator syntax, generalizing current sin-sout pipes, is the pattern
75glue. Basic-case string parsing looks the same as current sin
76reading. The more nuanced cases allow overriding behaviours like
77controlling whitespace treatment.</p>
78<p style="margin-bottom: 0cm; line-height: 100%"><br/>
79
80</p>
81<p style="margin-bottom: 0cm; font-variant: normal; font-style: normal; line-height: 100%">
82A capture wraps and hides a byte-level reference into variable-width
83text. It stays valid under length-changing mutations of the original.
84In UTF-8, even a single-character overwrite is a byte-length-changing
85operation (when done safely, in the general case).</p>
86<p style="margin-bottom: 0cm; line-height: 100%"><br/>
87
88</p>
89<p style="margin-bottom: 0cm; font-variant: normal; font-style: normal; line-height: 100%">
90A representation of buffer-gap-buffer enables these stable captures.
91The gap changes size under editing, moves under cursor advancing, and
92is skipped under reading. The cursor move copies characters from the
93gap-adjacent source-end of one buffer to gap-adjacent sink-end of the
94other.</p>
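<p style="margin-bottom: 0cm; line-height: 100%">The buffer-gap-buffer
idea can be sketched in plain C. This is only an illustration under
assumptions, not the proposed implementation: the names GapBuf, gb_init,
gb_advance, gb_replace and gb_read are hypothetical, the capacity is
fixed, and the cursor moves by bytes rather than by characters.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { GB_CAP = 64 };        /* fixed capacity keeps the sketch short */

typedef struct {
    char   buf[GB_CAP];
    size_t gap_start;        /* first byte of the gap (the cursor) */
    size_t gap_end;          /* first byte after the gap */
} GapBuf;

/* Start with the cursor at the front: all text sits after the gap. */
void gb_init(GapBuf *g, const char *text) {
    size_t n = strlen(text);
    assert(n <= GB_CAP);
    g->gap_start = 0;
    g->gap_end = GB_CAP - n;
    memcpy(g->buf + g->gap_end, text, n);
}

/* Logical length in bytes: everything except the gap. */
size_t gb_len(const GapBuf *g) {
    return GB_CAP - (g->gap_end - g->gap_start);
}

/* Advance the cursor by n bytes: copy from the gap-adjacent front of the
   right buffer to the gap-adjacent end of the left buffer. */
void gb_advance(GapBuf *g, size_t n) {
    assert(n <= GB_CAP - g->gap_end);
    memmove(g->buf + g->gap_start, g->buf + g->gap_end, n);
    g->gap_start += n;
    g->gap_end += n;
}

/* Overwrite the n bytes after the cursor with `with`. The gap first
   swallows the old bytes, then shrinks by the size of the new ones. */
void gb_replace(GapBuf *g, size_t n, const char *with) {
    size_t w = strlen(with);
    assert(n <= GB_CAP - g->gap_end);
    g->gap_end += n;
    assert(w <= g->gap_end - g->gap_start);
    memcpy(g->buf + g->gap_start, with, w);
    g->gap_start += w;
}

/* Read the whole text, skipping the gap. out needs GB_CAP + 1 bytes. */
void gb_read(const GapBuf *g, char *out) {
    size_t right = GB_CAP - g->gap_end;
    memcpy(out, g->buf, g->gap_start);
    memcpy(out + g->gap_start, g->buf + g->gap_end, right);
    out[g->gap_start + right] = '\0';
}
```

<p style="margin-bottom: 0cm; line-height: 100%">In this sketch,
replacing one byte with two never moves the bytes to the right of the
gap, which is the property a stable capture would rely on.</p>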
95<p style="margin-bottom: 0cm; line-height: 100%"><br/>
96
97</p>
98<p style="margin-bottom: 0cm; line-height: 100%"><br/>
99
100</p>
101<h1 class="western">The
102Unicode challenge</h1>
103<p style="margin-bottom: 0cm; line-height: 100%"><br/>
104
105</p>
106<p style="margin-bottom: 0cm; line-height: 100%">UTF-8
107is the current (spring 2019) standard for text encoding. Java and
108.NET use UTF-16, built on two-byte code units, which is an older approach. UTF-8
109supports the full universe of Unicode text. A new language like CFA
110must support UTF-8, and likely even treat it as <i>the</i> native format.</p>
111<p style="margin-bottom: 0cm; line-height: 100%"><br/>
112
113</p>
114<p style="margin-bottom: 0cm; line-height: 100%">UTF-8 is a
115variable-width encoding. Every string from the 128-character ASCII
116alphabet is also a UTF-8 string. If the extra bit (which sends a byte
117beyond the 128 characters) is set, then the byte belongs to a
118character that uses at least two bytes. And so on, up to 4 bytes,
119which covers the 17 defined Unicode planes, including the overhead
120of the tagging. If 17 planes are not enough for you, but 256 planes
121ought to be, then that takes 6 bytes, including the tagging; this
122would cover all the characters expressible with the (obscure)
123UTF-32 (aka UCS-4) 4-byte fixed-width encoding.</p>
124<p style="margin-bottom: 0cm; line-height: 100%"><br/>
125
126</p>
127<p style="margin-bottom: 0cm; line-height: 100%">The variable-width
128encoding is a challenge for the string operations <i>length</i> and
129<i>index.</i> An API’s semantics must decide between character- and
130byte-based counting. Suppose an API supports operations structured as,
131“give me the range of length m, starting at position i, from this
132string, which you have told me has length n, where I know i &lt;= n –
133m + 1.” (This presentation uses a 1-based location count; the
134choice of base is independent of the semantics under discussion.) For
135example, in all encodings, and in all semantics being discussed, when
136doing this operation on “Hello, world!” with i=8, m=5 and n=13,
137the result is “world” and has length 5. This semantic uniformity
138happens because all letters in the example string have equal length
139in all encodings.</p>
140<p style="margin-bottom: 0cm; line-height: 100%"><br/>
141
142</p>
143<p style="margin-bottom: 0cm; line-height: 100%">Contrasting more
144interesting examples:</p>
145<p style="margin-bottom: 0cm; line-height: 100%"><br/>
146
147</p>
148<p style="margin-bottom: 0cm; line-height: 100%">B8 = Byte-counting
149semantics on UTF-8 encoding</p>
150<p style="margin-bottom: 0cm; line-height: 100%">C8 =
151Character-counting semantics on UTF-8 encoding</p>
152<p style="margin-bottom: 0cm; line-height: 100%">2B16 =
153Double-byte-counting semantics on UTF-16 encoding. This is what Java
154and .NET use.</p>
155<p style="margin-bottom: 0cm; line-height: 100%"><br/>
156
157</p>
158<table width="665" cellpadding="4" cellspacing="0">
159        <col width="456">
160        <col width="71">
161        <col width="56">
162        <col width="48">
163        <tr valign="top">
164                <td width="456" style="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0.1cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
165                        <p><br/>
166
167                        </p>
168                </td>
169                <td width="71" style="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0.1cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
170                        <p align="center">B8</p>
171                </td>
172                <td width="56" style="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0.1cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
173                        <p align="center">C8</p>
174                </td>
175                <td width="48" style="border: 1px solid #000000; padding: 0.1cm">
176                        <p align="center">2B16</p>
177                </td>
178        </tr>
179        <tr valign="top">
180                <td width="456" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
181                        <p>Length of: u</p>
182                </td>
183                <td width="71" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
184                        <p align="center">1</p>
185                </td>
186                <td width="56" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
187                        <p align="center">1</p>
188                </td>
189                <td width="48" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0.1cm">
190                        <p align="center">1</p>
191                </td>
192        </tr>
193        <tr valign="top">
194                <td width="456" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
195                        <p>Length of: ü</p>
196                </td>
197                <td width="71" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
198                        <p align="center">2</p>
199                </td>
200                <td width="56" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
201                        <p align="center">1</p>
202                </td>
203                <td width="48" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0.1cm">
204                        <p align="center">1</p>
205                </td>
206        </tr>
207        <tr valign="top">
208                <td width="456" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
209                        <p>Length of: <font face="Noto Sans CJK SC Regular"><span lang="zh-CN"><font face="Noto Sans CJK SC Regular"><font size="2" style="font-size: 10pt"><span lang="zh-CN"></span></font></font></span></font></p>
210                </td>
211                <td width="71" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
212                        <p align="center">3</p>
213                </td>
214                <td width="56" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
215                        <p align="center">1</p>
216                </td>
217                <td width="48" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0.1cm">
218                        <p align="center">1</p>
219                </td>
220        </tr>
221        <tr valign="top">
222                <td width="456" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
223                        <p>Length of:
224                        <img src="" name="Image1" align="middle" width="13" height="36" border="0"/>
225
226                        (U+1D11E)</p>
227                </td>
228                <td width="71" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
229                        <p align="center">4</p>
230                </td>
231                <td width="56" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
232                        <p align="center">1</p>
233                </td>
234                <td width="48" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0.1cm">
235                        <p align="center">2</p>
236                </td>
237        </tr>
238        <tr valign="top">
239                <td width="456" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
240                        <p>Are ill-aligned range specifications possible?</p>
241                </td>
242                <td width="71" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
243                        <p align="center">Yes</p>
244                </td>
245                <td width="56" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
246                        <p align="center">No</p>
247                </td>
248                <td width="48" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0.1cm">
249                        <p align="center">Yes</p>
250                </td>
251        </tr>
252        <tr valign="top">
253                <td width="456" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
254                        <p>Complexity of length operation, assuming start- and
255                        end-locations known</p>
256                </td>
257                <td width="71" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
258                        <p align="center">O(1)</p>
259                </td>
260                <td width="56" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
261                        <p align="center">O(n)</p>
262                </td>
263                <td width="48" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0.1cm">
264                        <p align="center">O(1)</p>
265                </td>
266        </tr>
267        <tr valign="top">
268                <td width="456" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
269                        <p>Complexity of substring operation</p>
270                </td>
271                <td width="71" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
272                        <p align="center">O(1)</p>
273                </td>
274                <td width="56" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
275                        <p align="center">O(i+m)</p>
276                </td>
277                <td width="48" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0.1cm">
278                        <p align="center">O(1)</p>
279                </td>
280        </tr>
281</table>
282<p style="margin-bottom: 0cm; line-height: 100%"><br/>
283
284</p>
285<p style="margin-bottom: 0cm; line-height: 100%">Thus, “B”
286semantics put an onus on the programmer that “C” semantics handle
287in the library, for a runtime cost.</p>
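<p style="margin-bottom: 0cm; line-height: 100%">The length rows of
the table can be reproduced with a short C sketch. The function names
len_b8, len_c8 and len_2b16 are hypothetical, “character” is taken to
mean one Unicode code point, and the input is assumed to be valid,
NUL-terminated UTF-8.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* B8: length in bytes. */
size_t len_b8(const char *s) {
    assert(s != NULL);
    return strlen(s);
}

/* C8: length in code points. Every byte that is not a continuation
   byte (bit pattern 10xxxxxx) starts a new character. */
size_t len_c8(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* 2B16: number of UTF-16 code units the same text would occupy.
   A 4-byte UTF-8 sequence (lead byte 11110xxx) encodes a code point
   outside the Basic Multilingual Plane, which needs a surrogate pair. */
size_t len_2b16(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        unsigned char b = (unsigned char)*s;
        if ((b & 0xC0) == 0x80)
            continue;                       /* continuation byte */
        n += ((b & 0xF8) == 0xF0) ? 2 : 1;  /* surrogate pair or single unit */
    }
    return n;
}
```

<p style="margin-bottom: 0cm; line-height: 100%">Note that len_c8 and
len_2b16 walk the whole string, which is the O(n) cost the table
attributes to C semantics.</p>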
288<p style="margin-bottom: 0cm; line-height: 100%"><br/>
289
290</p>
291<p style="margin-bottom: 0cm; line-height: 100%">Typical string APIs
292impose position-based operations on programmers. But programmers
293would rather be position-agnostic. Outside of string-API example
294programs, numbers passed to substring operations are usually the
295result of find operations. The find-substring composition is text-in,
296text-out. Tinkering with the numbers in between often happens, such
297as to codify the intention, “except leave off the delimiter.”
298This is where bugs happen, because cases like the delimiter being
299dynamic and showing up with a byte-width longer than one are hard to
300anticipate. [TODO: cite UTF-16 being problematic like this] A degree of
301security or internationalization expertise is needed to consider
302relevant test cases that leave the Basic Multilingual Plane.</p>
303<p style="margin-bottom: 0cm; line-height: 100%"><br/>
304
305</p>
306<p style="margin-bottom: 0cm; line-height: 100%"><br/>
307
308</p>
309<p style="margin-bottom: 0cm; line-height: 100%"><br/>
310
311</p>
312<h1 class="western">Length-changing operations</h1>
313<p style="margin-bottom: 0cm; line-height: 100%"><br/>
314
315</p>
316<p style="margin-bottom: 0cm; line-height: 100%">Mutations that
317change a string’s length are prominent in early (pre-2019) work on
318CFA string support, and on Buhr-94 strings. For example:</p>
319<p style="margin-bottom: 0cm; line-height: 100%"><br/>
320
321</p>
322<pre class="western" style="margin-bottom: 0.5cm">    replace(&quot;123456789&quot;, &quot;456&quot;, &quot;&quot;) // result: &quot;123789&quot;</pre><p style="margin-bottom: 0cm; line-height: 100%">
323The previous Unicode discussion clarified that, under B8 semantics,
324this too is a length-changing operation:</p>
325<p style="margin-bottom: 0cm; line-height: 100%"><br/>
326
327</p>
328<pre class="western" style="margin-bottom: 0.5cm">    <font face="Liberation Mono, monospace">replace(&quot;francais&quot;, &quot;c&quot;, &quot;ç&quot;)  // result: &quot;français&quot;</font></pre><p style="margin-bottom: 0cm; line-height: 100%">
329<br/>
330
331</p>
332<p style="margin-bottom: 0cm; line-height: 100%">This proposal
333supports doing such operations as mutations.</p>
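<p style="margin-bottom: 0cm; line-height: 100%">The byte-length
change can be observed directly in C. The helper replace_first below is
a hypothetical illustration: it returns a fresh copy rather than
mutating in place, so it shows only the length effect, not the mutation
semantics this proposal aims for.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of s in which the first occurrence of
   `from` is replaced by `to`. The caller frees the result. Returns NULL
   if allocation fails. */
char *replace_first(const char *s, const char *from, const char *to) {
    assert(s != NULL && from != NULL && to != NULL && from[0] != '\0');
    size_t slen = strlen(s);
    const char *hit = strstr(s, from);
    if (hit == NULL) {                      /* nothing to replace: plain copy */
        char *copy = malloc(slen + 1);
        if (copy != NULL)
            memcpy(copy, s, slen + 1);
        return copy;
    }
    size_t flen = strlen(from), tlen = strlen(to);
    size_t head = (size_t)(hit - s);        /* bytes before the match */
    char *out = malloc(slen - flen + tlen + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, s, head);
    memcpy(out + head, to, tlen);
    /* the tail copy includes the terminating NUL */
    memcpy(out + head + tlen, hit + flen, slen - head - flen + 1);
    return out;
}
```

<p style="margin-bottom: 0cm; line-height: 100%">With “ç” written as
the two UTF-8 bytes 0xC3 0xA7, the second example grows from 8 bytes to
9 even though the character count stays at 8.</p>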
334<p style="margin-bottom: 0cm; line-height: 100%"><br/>
335
336</p>
337<p style="margin-bottom: 0cm; line-height: 100%"><br/>
338
339</p>
340<p style="margin-bottom: 0cm; line-height: 100%"><br/>
341
342</p>
343<h1 class="western">Solution Principles</h1>
344<p style="margin-bottom: 0cm; line-height: 100%"><br/>
345
346</p>
347<p style="margin-bottom: 0cm; line-height: 100%">Abstract all
348byte-length awareness within the string API. No public B semantics
349when a user does parsing.</p>
350<p style="margin-bottom: 0cm; line-height: 100%"><br/>
351
352</p>
353<p style="margin-bottom: 0cm; line-height: 100%">Offer powerful
354text-to-text operations on the public interface. There is a potential
355to extend this expressivity to regular-expression matching, but
356current detailed work focuses on character-class and exact-sequence
357constructs.</p>
358<p style="margin-bottom: 0cm; line-height: 100%"><br/>
359
360</p>
361<p style="margin-bottom: 0cm; line-height: 100%">Remove most
362number-based operations from the public interface. No illusion of
363general-purpose random access.</p>
364<p style="margin-bottom: 0cm; line-height: 100%"><br/>
365
366</p>
367<p style="margin-bottom: 0cm; line-height: 100%">Offer
368character-class matching in the public interface. “Match: Any * 3”
369is the only way to chunk off characters by number of occurrences;
370this has strictly C semantics. This supports off-by-a-couple
371adjustments when writing a parsing routine, guarantees
372wide-character correctness, and puts the operation in a place where
373O(n) is reasonable, as n is expected to be small.</p>
374<p style="margin-bottom: 0cm; line-height: 100%"><br/>
375
376</p>
377<p style="margin-bottom: 0cm; line-height: 100%">Refine an API-use
378style in which character-widths are only tested once. Provide
379examples in this style and explanations of why this happens. Re-work
380low-level designs as needed to ensure naive usage avoids re-checks.</p>
381<p style="margin-bottom: 0cm; line-height: 100%"><br/>
382
383</p>
384<p style="margin-bottom: 0cm; line-height: 100%">Offer a few coarse
385chunking operations that work with a max-bytes parameter, documented
386as helpers for batching cases. Return a well-aligned result (of
387indeterminate C-semantic length), by leveraging UTF-8 tagging, in
388which a start-of-character is easy to find. “Copy to c-string” is
389one such operation.</p>
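<p style="margin-bottom: 0cm; line-height: 100%">A minimal sketch of
such a max-bytes helper, assuming valid UTF-8 input; the name
utf8_chunk_len is hypothetical. It relies on the tagging property that
continuation bytes always have the bit pattern 10xxxxxx.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Returns the largest k <= max_bytes such that the first k bytes of s
   end on a character boundary. s holds len bytes of valid UTF-8. */
size_t utf8_chunk_len(const char *s, size_t len, size_t max_bytes) {
    assert(s != NULL);
    if (len <= max_bytes)
        return len;                 /* the whole string fits */
    size_t k = max_bytes;
    /* s[k] is the first byte left out of the chunk. While it is a
       continuation byte, the cut would split a character: back up. */
    while (k > 0 && ((unsigned char)s[k] & 0xC0) == 0x80)
        k--;
    return k;
}
```

<p style="margin-bottom: 0cm; line-height: 100%">The result may be
shorter than max_bytes, and may be zero when max_bytes is smaller than
the first character; a caller batching output would need to handle that
case.</p>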
390<p style="margin-bottom: 0cm; line-height: 100%"><br/>
391
392</p>
393<p style="margin-bottom: 0cm; line-height: 100%">Support
394substrings/selections that straddle mutation points by following an
395“open binder” design, introduced below.</p>
396<p style="margin-bottom: 0cm; line-height: 100%"><br/>
397
398</p>
399<p style="margin-bottom: 0cm; line-height: 100%">The rest of the doc
400presents detail suggesting the above is achievable. The reader may
401thus entertain the hypothesis that all desirable string manipulation
402can take place without an index-exposing API.</p>
403<p style="margin-bottom: 0cm; line-height: 100%"><br/>
404
405</p>
406<p style="margin-bottom: 0cm; line-height: 100%"><br/>
407
408</p>
409<p style="margin-bottom: 0cm; line-height: 100%"><br/>
410
411</p>
412<h1 class="western">Text-to-text API, rev 1</h1>
413<p style="margin-bottom: 0cm; line-height: 100%"><br/>
414
415</p>
416<p style="margin-bottom: 0cm; line-height: 100%">This starter API is
417an example of the text-to-text style, and C-style public semantics,
418suggested earlier. It is presented mainly to clarify those concepts
419before continuing.
420</p>
421<p style="margin-bottom: 0cm; line-height: 100%"><br/>
422
423</p>
424<p style="margin-bottom: 0cm; line-height: 100%">Note that matched
425substrings write through into the original.</p>
426<p style="margin-bottom: 0cm; line-height: 100%"><br/>
427
428</p>
429<p style="margin-bottom: 0cm; line-height: 100%">The recommended API
430still works this way, but also adds more control over which
431(writeable) parts get captured into variables, enabling more useful
432find-replace mutation cases.</p>
433<p style="margin-bottom: 0cm; line-height: 100%"><br/>
434
435</p>
436<pre class="western">    string s, qs, s1, s2;
437    charclass qcc;</pre><p style="margin-bottom: 0cm; line-height: 100%">
438Split s once, with q on the left:</p>
439<p style="margin-bottom: 0cm; line-height: 100%"><br/>
440
441</p>
442<pre class="western">    [s1, s2] = include(s, qs);
443    assert s == s1 + s2;
444    assert s1 == &quot;&quot; || s1 == qs;
445
446    [s1, s2] = include(s, qcc);
447    assert s == s1 + s2;
448    for (c1: s1) assert ismatch(c1, qcc);
449    assert s2 == &quot;&quot; || !ismatch(first(s2), qcc);</pre><p>
450Split s once, with q on the right:</p>
451<pre class="western">    [s1, s2] = exclude(s, qs);
452    assert s == s1 + s2;
453    assert [&quot;&quot;, _] == include(s1, qs);
454    assert [qs, _] == include(s2, qs);
455
456    [s1, s2] = exclude(s, qcc);
457    assert [&quot;&quot;, s1] == include(s1, qcc);
458    assert s2 == &quot;&quot; || include(s2, qcc).0 != &quot;&quot;;</pre><p style="margin-bottom: 0cm; line-height: 100%">
459<br/>
460
461</p>
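<p style="margin-bottom: 0cm; line-height: 100%">For readers who know
the C library, the character-class forms of include and exclude
resemble strspn and strcspn. The sketch below is only an analogy under
assumptions: the names Split, include_cc and exclude_cc are
hypothetical, the class is given as a string of ASCII characters, and
the results are byte lengths rather than writeable references, which is
exactly the B-semantics detail the proposed API would hide.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte lengths of the two halves, so that s == s1 + s2. */
typedef struct {
    size_t left;    /* length of s1 */
    size_t right;   /* length of s2 */
} Split;

/* s1 is the longest prefix made only of characters in cc. */
Split include_cc(const char *s, const char *cc) {
    assert(s != NULL && cc != NULL);
    size_t n = strlen(s);
    size_t k = strspn(s, cc);
    return (Split){ k, n - k };
}

/* s1 is the longest prefix containing no character in cc. */
Split exclude_cc(const char *s, const char *cc) {
    assert(s != NULL && cc != NULL);
    size_t n = strlen(s);
    size_t k = strcspn(s, cc);
    return (Split){ k, n - k };
}
```

<p style="margin-bottom: 0cm; line-height: 100%">Because strspn and
strcspn compare single bytes, this analogy holds only for ASCII
classes.</p>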
462<p style="margin-bottom: 0cm; line-height: 100%">All results are
463writeable references:</p>
464<p style="margin-bottom: 0cm; line-height: 100%"><br/>
465
466</p>
467<pre class="western">    // demonstration on 1<sup>st</sup> return of 1<sup>st</sup> API function; others are similar
468
469    [s1, s2] = include(s, qs);
470    assert s == s1 + s2;
471    assert s1 == &quot;&quot; || s1 == qs;
472
473    s1 = &quot;hi&quot;;
474    assert s == &quot;hi&quot; + s2;
475
476    int s_len_old  = len(s),
477        s2_len_old = len(s2);
478    s1 = &quot;<font face="Courier New, monospace"><span lang="zh-CN"><font face="Noto Sans CJK SC Regular"><font size="2" style="font-size: 10pt"><span lang="zh-CN"></span></font></font></span></font>&quot;;                       // (2 ch, 2 b) overwritten with (1 ch, 3 b)
479    assert len(s) == s_len_old - 1; // length counts logical characters
480    assert len(s2) == s2_len_old;   // s2 is intact, logically unaffected, yet at different distance from front</pre><p style="margin-bottom: 0cm; line-height: 100%">
481Splitting this way works in loops. In the examples following, think
482of q as a delimiter. Note how two steps are combined into a single
483call: get everything up to next q; move past the q.</p>
484<p style="margin-bottom: 0cm; line-height: 100%"><br/>
485
486</p>
487<p style="margin-bottom: 0cm; line-height: 100%">Split s repeatedly,
488with q ending each match:</p>
489<p style="margin-bottom: 0cm; line-height: 100%"><br/>
490
491</p>
492<pre class="western">    string ss = &quot;&quot;;
493    for ([s1,s2] in split_ex(s, q)) {
494        assert [s1, s2] == exclude(s1 + s2, q);
495        ss += s1;
496        ss += s2;
497    }
498    assert ss == s;</pre><p style="margin-bottom: 0cm; line-height: 100%">
499<br/>
500
501</p>
502<p style="margin-bottom: 0cm; line-height: 100%">Split s repeatedly,
503with q starting each match:</p>
504<p style="margin-bottom: 0cm; line-height: 100%"><br/>
505
506</p>
507<pre class="western">    string ss = &quot;&quot;;
508    for ([s1,s2] in split_in(s, q)) {
509        assert [s1, s2] == include(s1 + s2, q);
510        ss += s1;
511        ss += s2;
512    }
513    assert ss == s;</pre><p style="margin-bottom: 0cm; line-height: 100%">
514<br/>
515
516</p>
517<h1 class="western">Recommended API</h1>
518<p style="margin-bottom: 0cm; line-height: 100%"><br/>
519
520</p>
521<p style="margin-bottom: 0cm; line-height: 100%">This discussion
522strives for concreteness at the risk of painting the design into a
523corner. Many details still need to be worked through. The most
524significant point of feedback sought here is whether the
525algebra-of-patterns is appropriate and sufficient.</p>
526<p style="margin-bottom: 0cm; line-height: 100%"><br/>
527
528</p>
529<p style="margin-bottom: 0cm; line-height: 100%">The Rev-1 API gets
530awkward on cases like:</p>
531<p style="margin-bottom: 0cm; line-height: 100%"><br/>
532
533</p>
534<pre class="western">    string s = &quot;a=1, b=2, c=3, &quot;;
535    string kv, del, k, eqv, eq, v;
536    for ([kv,del] in split_ex(s, &quot;, &quot;)) {
537        [k,eqv] = exclude(kv, &quot;=&quot;);
538        [eq,v] = include(eqv, &quot;=&quot;);
539        kv = k + v;
540        del = &quot;:&quot;;
541    }
542    assert s == &quot;a1:b2:c3:&quot;;</pre><p style="margin-bottom: 0cm; line-height: 100%">
543Points of awkwardness are:</p>
544<ul>
545        <li/>
546<p style="margin-bottom: 0cm; line-height: 100%">Multiple API
547        calls on the &quot;=&quot; matching are still not averted</p>
548        <li/>
549<p style="margin-bottom: 0cm; line-height: 100%">No substring
550        capturing all of &quot;a=1, &quot;. If we had one, say scur, the
551        following assignment would be more natural than the pair in the
552        example:</p>
553        <ul>
554                <pre class="western" style="margin-bottom: 0.5cm">scur = k + v + &quot;:&quot;</pre>
555        </ul>
556        <li/>
557<p style="margin-bottom: 0cm; line-height: 100%">We declare
558        names for substrings that are only used to target the next API call.</p>
559</ul>
560<p style="margin-bottom: 0cm; line-height: 100%"><br/>
561
562</p>
563<p style="margin-bottom: 0cm; line-height: 100%">The recommended API
564treats the rev-1 points as composable primitives, and separates (in
565general) pattern from capture.</p>
566<p style="margin-bottom: 0cm; line-height: 100%"><br/>
567
568</p>
569<pre class="western">    string s = &quot;a=1, b=2, c=3, &quot;;
570    while (string ss = s, string k, string v, string scur;
571           nextMatch(ss, scur &amp; (k ^ &quot;=&quot; | v ^ &quot;, &quot;))
572          ) {
573        scur = k + v + &quot;:&quot;;
574    }
575    assert s == &quot;a1:b2:c3:&quot;;</pre><p style="margin-bottom: 0cm; line-height: 100%">
576Some work is still needed on the ss declaration, and the
577iterator-loop syntactic interaction (nextMatch). Lower-level work has
578explored the possibility of integrating it with “0 ~ 10”-like
579stepping.</p>
580<p style="margin-bottom: 0cm; line-height: 100%"><br/>
581
582</p>
583<p style="margin-bottom: 0cm; line-height: 100%">The top-level
584construct is nextMatch(string, pattern). It matches the pattern
585against the front of the string; this is described recursively next.</p>
586<p style="margin-bottom: 0cm; line-height: 100%"><br/>
587
588</p>
<p style="margin-bottom: 0cm; line-height: 100%">The combinators |, ^
and &amp; are all binary operators that take two patterns and yield a pattern.</p>
591<p style="margin-bottom: 0cm; line-height: 100%"><br/>
592
593</p>
594<p style="margin-bottom: 0cm; line-height: 100%">Base cases</p>
595<ul>
596        <li/>
597<p style="margin-bottom: 0cm; line-height: 100%">A read-only
598        string is a pattern that matches exactly one occurrence of itself.</p>
599        <li/>
600<p style="margin-bottom: 0cm; line-height: 100%">A writeable
601        string is a pattern that matches anything.</p>
602        <li/>
<p style="margin-bottom: 0cm; line-height: 100%">Overloading
        is to be built out so that, e.g., a writeable int qualifies as a pattern
        that matches the character class [0-9].</p>
606        <li/>
<p style="margin-bottom: 0cm; line-height: 100%">Note that this
        presentation treats as a separate problem <i>how</i> we
        differentiate a writeable string from a read-only (or COW) string.</p>
610        <li/>
611<p style="margin-bottom: 0cm; line-height: 100%">Expressivity
612        = charclass, exact sequence</p>
613</ul>
614<p style="margin-bottom: 0cm; line-height: 100%"><br/>
615
616</p>
<p style="margin-bottom: 0cm; line-height: 100%">The operators, in
their natural precedence order from loosest to tightest, and being
naturally left-associative, give the recursive cases:</p>
620<p style="margin-bottom: 0cm; line-height: 100%"><br/>
621
622</p>
623<table width="665" cellpadding="4" cellspacing="0">
624        <col width="36">
625        <col width="95">
626        <col width="508">
627        <tr valign="top">
628                <td width="36" style="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0.1cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
629                        <p>L|R</p>
630                </td>
631                <td width="95" style="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0.1cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
632                        <p>inclusive then</p>
633                </td>
634                <td width="508" style="border: 1px solid #000000; padding: 0.1cm">
635                        <p>L consumes all characters that the L-pattern accepts, then R
636                        begins from the first character that L rejects.
637                        </p>
638                </td>
639        </tr>
640        <tr valign="top">
641                <td width="36" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
642                        <p>L^R</p>
643                </td>
644                <td width="95" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
645                        <p>exclusive then</p>
646                </td>
647                <td width="508" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0.1cm">
648                        <p>L consumes all characters that the R-pattern rejects, then R
649                        begins from the first character that R accepts.</p>
650                </td>
651        </tr>
652        <tr valign="top">
653                <td width="36" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
654                        <p>L&amp;R</p>
655                </td>
656                <td width="95" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
657                        <p>same as</p>
658                </td>
659                <td width="508" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0.1cm">
660                        <p style="margin-bottom: 0.5cm">General meaning: Both sides have
661                        to match the same run of characters.</p>
662                        <p>Expected use: pairing a restrictive complex non-writeable
663                        pattern with a permissive simple writeable pattern, the latter
664                        declaring an alias for the result</p>
665                </td>
666        </tr>
667</table>
668<p style="margin-bottom: 0cm; line-height: 100%"><br/>
669
670</p>
671<p style="margin-bottom: 0cm; line-height: 100%"><font color="#000000">The
672semantics of failing to match need to be nailed down. Basically, the
673suggestion is</font></p>
674<ul>
675        <li/>
676<p style="margin-bottom: 0cm; line-height: 100%"><font color="#000000">don’t
677        throw exception</font></p>
678        <li/>
679<p style="margin-bottom: 0cm; line-height: 100%"><font color="#000000">don’t
680        end up in an infinite loop (matching zero more chars each time)</font></p>
681        <li/>
682<p style="margin-bottom: 0cm; line-height: 100%"><font color="#000000">just
683        exit the loop</font></p>
684</ul>
685<p style="margin-bottom: 0cm; line-height: 100%"><br/>
686
687</p>
688<p style="margin-bottom: 0cm; line-height: 100%"><font color="#000000">Work
689is in progress describing an iterator-processing model for these
690patterns which includes modeling no-more-matches.</font></p>
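<p style="margin-bottom: 0cm; line-height: 100%">The recursive cases
and the suggested failure rules can be modelled in a few lines. The
following is a minimal sketch in Python rather than CFA, under stated
assumptions: the class names Lit, Cap, Seq, Until and Same and the
helper rewrite are hypothetical, a writeable string is only supported
on the left of ^ and &amp;, and the replacement is always written over
the whole matched run. Python's |, ^ and &amp; have the same relative
precedence as in C, so the pattern can be spelled as in the example.</p>

```python
from dataclasses import dataclass


class Pat:
    """Base class: |, ^ and & each combine two patterns into a pattern."""
    def __or__(self, other):
        return Seq(self, other)      # L | R  (inclusive then)

    def __xor__(self, other):
        return Until(self, other)    # L ^ R  (exclusive then)

    def __and__(self, other):
        return Same(self, other)     # L & R  (same as)


@dataclass
class Lit(Pat):
    """Read-only string: matches exactly one occurrence of itself."""
    text: str

    def match(self, s, pos):
        if s.startswith(self.text, pos):
            return pos + len(self.text), {}
        return None


@dataclass
class Cap(Pat):
    """Writeable string: matches anything; context decides its extent."""
    name: str

    def match(self, s, pos):
        return len(s), {self.name: (pos, len(s))}


@dataclass
class Seq(Pat):
    left: Pat
    right: Pat

    def match(self, s, pos):
        first = self.left.match(s, pos)
        if first is None:
            return None
        mid, caps_l = first
        second = self.right.match(s, mid)   # R begins where L stopped
        if second is None:
            return None
        end, caps_r = second
        return end, {**caps_l, **caps_r}


@dataclass
class Until(Pat):
    left: Cap      # assumption: only a writeable string on the left
    right: Pat

    def match(self, s, pos):
        # L takes everything R rejects; R starts at the first place it accepts.
        for p in range(pos, len(s) + 1):
            hit = self.right.match(s, p)
            if hit is not None:
                end, caps = hit
                return end, {self.left.name: (pos, p), **caps}
        return None


@dataclass
class Same(Pat):
    left: Cap      # assumption: the writeable side only names the run
    right: Pat

    def match(self, s, pos):
        hit = self.right.match(s, pos)
        if hit is None:
            return None
        end, caps = hit
        return end, {self.left.name: (pos, end), **caps}


def rewrite(s, pattern, build):
    """Model of the while-nextMatch loop with write-through of the match."""
    pos = 0
    while pos < len(s):
        hit = pattern.match(s, pos)
        if hit is None or hit[0] == pos:
            break                    # no match, or zero progress: exit the loop
        end, spans = hit
        caps = {name: s[a:b] for name, (a, b) in spans.items()}
        new = build(caps)
        s = s[:pos] + new + s[end:]  # overwrite the matched run in place
        pos += len(new)
    return s


pattern = Cap("scur") & (Cap("k") ^ Lit("=") | Cap("v") ^ Lit(", "))
result = rewrite("a=1, b=2, c=3, ", pattern, lambda c: c["k"] + c["v"] + ":")
```

<p style="margin-bottom: 0cm; line-height: 100%">In this model each
&quot;key=value, &quot; run is replaced by the key, the value and a
colon. A failed match raises no exception and makes no zero-length
step; the loop simply ends and leaves the remainder of the string
untouched.</p>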
691<p style="margin-bottom: 0cm; line-height: 100%"><br/>
692
693</p>
694<p style="margin-bottom: 0cm; line-height: 100%"><br/>
695
696</p>
697<p style="margin-bottom: 0cm; line-height: 100%"><br/>
698
699</p>
700<h1 class="western">Unification with streams</h1>
701<p style="margin-bottom: 0cm; line-height: 100%"><br/>
702
703</p>
<p style="margin-bottom: 0cm; line-height: 100%">It is desirable that
a | b | c have analogous meanings whether applied to a string or to
standard-in. Since the same form is also applied to standard-out, this in turn
suggests that the string append operator should also become |.</p>
708<p style="margin-bottom: 0cm; line-height: 100%"><br/>
709
710</p>
711<p style="margin-bottom: 0cm; line-height: 100%">The stream sout,
712seen as a container of characters, offers one operation: write to
713back. Similarly, sin reads from front. Thus, they are ready to work
714against a pattern a | b | c, needing no more case refinement. The
type system disambiguates <font face="Liberation Mono, monospace">sin|p</font>
716from <font face="Liberation Mono, monospace">sout|p</font>.</p>
717<p style="margin-bottom: 0cm; line-height: 100%"><br/>
718
719</p>
720<table width="453" cellpadding="4" cellspacing="0">
721        <col width="208">
722        <col width="105">
723        <col width="116">
724        <tr valign="top">
725                <td width="208" style="border: none; padding: 0cm">
726                        <p><br/>
727
728                        </p>
729                </td>
730                <td width="105" style="border: none; padding: 0cm">
731                        <p align="center">read (aka parse)</p>
732                </td>
733                <td width="116" style="border: none; padding: 0cm">
734                        <p align="center">write</p>
735                </td>
736        </tr>
737        <tr>
738                <td width="208" style="border: none; padding: 0cm">
739                        <p align="right">in forward linguistic time order</p>
740                </td>
741                <td width="105" valign="top" style="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0.1cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
742                        <p align="center" style="margin-bottom: 0.5cm">split from front</p>
743                        <p align="center">(sin does this)</p>
744                </td>
745                <td width="116" valign="top" style="border: 1px solid #000000; padding: 0.1cm">
746                        <p align="center" style="margin-bottom: 0.5cm">append</p>
747                        <p align="center">(sout does this)</p>
748                </td>
749        </tr>
750        <tr>
751                <td width="208" style="border: none; padding: 0cm">
752                        <p align="right">in reverse linguistic time order</p>
753                </td>
754                <td width="105" valign="top" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0cm">
755                        <p align="center">split from back</p>
756                </td>
757                <td width="116" valign="top" style="border-top: none; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: 1px solid #000000; padding-top: 0cm; padding-bottom: 0.1cm; padding-left: 0.1cm; padding-right: 0.1cm">
758                        <p align="center">prepend</p>
759                </td>
760        </tr>
761</table>
762<p style="margin-bottom: 0cm; line-height: 100%"><br/>
763
764</p>
765<p style="margin-bottom: 0cm; line-height: 100%"><br/>
766
767</p>
768<p style="margin-bottom: 0cm; line-height: 100%"><font color="#000000">A
769string user needs to specify which behaviour is desired (<font face="Liberation Mono, monospace">myStr|p</font>
770is not enough). This choice, happening at the top level, coincides
771with the need to pick a syntax for looping (strawmanned above as
772while-nextMatch).  This work is in progress; elements in
773consideration include:</font></p>
774<ul>
775        <li/>
776<p style="margin-bottom: 0cm; line-height: 100%"><font color="#000000">++
777        with &lt;, vs -- with &gt;, for move/has-next, in forward/reverse
778        order</font></p>
779        <ul>
780                <li/>
781<p style="margin-bottom: 0cm; line-height: 100%"><font color="#000000">plays
782                nice with for loop, but confuses s &gt; i overloading</font></p>
783        </ul>
784        <li/>
785<p style="margin-bottom: 0cm; line-height: 100%"><font color="#000000">unary
786        - and + for read-from/write-to</font></p>
787        <li/>
788<p style="margin-bottom: 0cm; line-height: 100%"><font color="#000000">unary
789        ~ (or !) to switch to reverse linguistic order from default forward</font></p>
790        <li/>
791<p style="margin-bottom: 0cm; line-height: 100%"><font color="#000000">top-level
792        only: &gt;&gt; for read-from; &lt;&lt; for write-to</font></p>
793</ul>
794<p style="margin-bottom: 0cm; line-height: 100%"><br/>
795
796</p>
797<p style="margin-bottom: 0cm; line-height: 100%">When reading from
798standard-in, automatically trimming whitespace is generally desired.
799When processing a string (and presuming to obviate an index-based
800API), it must be possible to take control of whitespace. Auto-newline
801behaviour is similar.</p>
802<p style="margin-bottom: 0cm; line-height: 100%"><br/>
803
804</p>
805<p style="margin-bottom: 0cm; line-height: 100%">Idea is:</p>
806<pre class="western" style="margin-bottom: 0.5cm">void ?{}(pat &amp;, int &amp;) // construct pattern from int reference</pre>
807<ul>
808        <li/>
809<p style="margin-bottom: 0cm; line-height: 100%">gets
810        overridden: multiple definitions of the symbol are out there</p>
811        <li/>
812<p style="margin-bottom: 0cm; line-height: 100%">the default
813        one is scope-visible, at background level, upon “#include string”</p>
814        <li/>
815<p style="margin-bottom: 0cm; line-height: 100%">user may
816        elevate a different one to current level by: <font face="Liberation Mono, monospace">with
817        (stringopts.picky) {...}</font></p>
818</ul>
819<p style="margin-bottom: 0cm; line-height: 100%"><br/>
820
821</p>
822<p style="margin-bottom: 0cm; line-height: 100%">Illustration without
823an SPI, and covering the cases we care about:</p>
824<ul>
825        <li/>
826<p style="margin-bottom: 0cm; line-height: 100%">Default</p>
827        <ul>
828                <li/>
829<p style="margin-bottom: 0cm; line-height: 100%">input
830                spelling is base-10</p>
831                <li/>
832<p style="margin-bottom: 0cm; line-height: 100%">Be
833                aggressive about clearing whitespace, lax about its absence;
834                adjacent whitespaces are insignificant</p>
835                <li/>
836<p style="margin-bottom: 0cm; line-height: 100%">all
837                contiguous [0-9] into the int; panic on overflow</p>
838                <li/>
839<p style="margin-bottom: 0cm; line-height: 100%">require at
840                least one [0-9]
841                </p>
842                <li/>
<p style="margin-bottom: 0cm; line-height: 100%">leading zeros
                and similar details are handled consistently with atoi</p>
845        </ul>
846        <li/>
847<p style="margin-bottom: 0cm; line-height: 100%">Picky</p>
848        <ul>
849                <li/>
850<p style="margin-bottom: 0cm; line-height: 100%">Don’t
851                touch whitespace</p>
852                <li/>
853<p style="margin-bottom: 0cm; line-height: 100%">require
854                exactly one [0-9], no coalescing</p>
855        </ul>
856</ul>
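<p style="margin-bottom: 0cm; line-height: 100%">The two behaviours
listed above can be contrasted with a small sketch. It is written in
Python rather than CFA, and the function names read_int_default and
read_int_picky, the 32-bit INT_MAX limit, and the choice to report a
failed match as None are all assumptions made for illustration.</p>

```python
INT_MAX = 2**31 - 1          # assumption: the target int is 32-bit
DIGITS = "0123456789"        # the character class [0-9]


def read_int_default(s, pos=0):
    """Default: clear whitespace on both sides, take all contiguous digits."""
    while pos < len(s) and s[pos].isspace():
        pos += 1                         # aggressive about leading whitespace
    start = pos
    while pos < len(s) and s[pos] in DIGITS:
        pos += 1                         # coalesce contiguous [0-9]
    if pos == start:
        return None                      # at least one digit is required
    value = int(s[start:pos])            # base-10 input spelling
    if value > INT_MAX:
        raise OverflowError("int overflow while reading " + s[start:pos])
    while pos < len(s) and s[pos].isspace():
        pos += 1                         # lax: trailing whitespace is optional
    return value, pos


def read_int_picky(s, pos=0):
    """Picky: leave whitespace alone and accept exactly one digit."""
    if pos < len(s) and s[pos] in DIGITS:
        return int(s[pos]), pos + 1
    return None
```

<p style="margin-bottom: 0cm; line-height: 100%">Both functions return
the value together with the position at which the next pattern would
begin, so the same input can give different results depending on
which definition is in scope.</p>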
857<p style="margin-bottom: 0cm; line-height: 100%"><br/>
858
859</p>
860<p style="margin-bottom: 0cm; line-height: 100%">Probably, a
861character-level SPI is required, to enable extensibility.</p>
862<p style="margin-bottom: 0cm; line-height: 100%"><br/>
863
864</p>
865<p style="margin-bottom: 0cm; line-height: 100%"><br/>
866
867</p>
868<h1 class="western">Open Binder design</h1>
869<p style="margin-bottom: 0cm; line-height: 100%"><br/>
870
871</p>
872<p style="margin-bottom: 0cm; line-height: 100%">This design
873addresses internals hidden by the string API.</p>
874<p style="margin-bottom: 0cm; line-height: 100%"><br/>
875
876</p>
877<p style="margin-bottom: 0cm; line-height: 100%">All depicted
878references into a string are by byte location.</p>
879<p style="margin-bottom: 0cm; line-height: 100%"><br/>
880
881</p>
882<p style="margin-bottom: 0cm; line-height: 100%">The name suggests a
883three-ring binder of papers, sitting opened to a middle page, with
the rings un-clasped. It is easy to insert or delete at the current
opened-to point, at the expense that advancing this point requires
data movement proportional to the distance moved.</p>
887<p style="margin-bottom: 0cm; line-height: 100%"><br/>
888
889</p>
890<p style="margin-bottom: 0cm; line-height: 100%">We are optimizing
891for a single iteration through a string, in which substantial
892rewriting is happening at the iterator’s current point.</p>
893<p style="margin-bottom: 0cm; line-height: 100%"><br/>
894
895</p>
896<p style="margin-bottom: 0cm; line-height: 100%">Suppose we have:</p>
<pre class="western">1   string s, s1, s2;
2   s = &quot;abcdefghijklmnopqrstuvwxyz&quot;;
3   [s1, s2] = exclude(s, 'k');       // s1==&quot;abcdefghij&quot;, s2==&quot;klmnopqrstuvwxyz&quot;
4   s1 += &quot;xxx&quot;;                      // s ==&quot;abcdefghijxxxklmnopqrstuvwxyz&quot;</pre><p style="margin-bottom: 0cm; line-height: 100%">
These objects can be represented by the following buffers, where the
underscores represent don’t-care bytes.</p>
903<p style="margin-bottom: 0cm; line-height: 100%"><br/>
904
905</p>
906<p style="margin-bottom: 0cm; line-height: 100%">After line 2:</p>
907<pre class="western" style="margin-bottom: 0.5cm">           _________abcdefghijklmnopqrstuvwxyz</pre><p style="margin-bottom: 0cm; line-height: 100%">
908After line 3:</p>
909<pre class="western" style="margin-bottom: 0.5cm">           abcdefghij_________klmnopqrstuvwxyz</pre><p style="margin-bottom: 0cm; line-height: 100%">
910After line 4:</p>
911<pre class="western" style="margin-bottom: 0.5cm">           abcdefghijxxx______klmnopqrstuvwxyz</pre><p style="margin-bottom: 0cm; line-height: 100%">
912<br/>
913
914</p>
915<p style="margin-bottom: 0cm; line-height: 100%">A sub-string is
916implemented as</p>
917<ul>
918        <li/>
919<p style="margin-bottom: 0cm; line-height: 100%">having a pair
920        of references into the buffer (start-byte, end-byte)</p>
921        <li/>
<p style="margin-bottom: 0cm; line-height: 100%">being in a
        linked list that includes all active sub-strings of the buffer,
        ordered by start position</p>
925        <li/>
<p style="margin-bottom: 0cm; line-height: 100%">being in a
        similar list by end position</p>
928        <li/>
<p style="margin-bottom: 0cm; line-height: 100%">knowing that
        a special element from the substrings’ lists marks the edit gap
        (or an is-editable/has-edited state on an iterator, if supporting
        multiple edit locations)</p>
933</ul>
934<p style="margin-bottom: 0cm; line-height: 100%"><br/>
935
936</p>
<p style="margin-bottom: 0cm; line-height: 100%">The string buffer
automatically grows and shrinks the gap by reallocating
at a different size and fixing up the iterators.</p>
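<p style="margin-bottom: 0cm; line-height: 100%">The buffer behaviour
described in this section resembles a gap buffer, and the three
pictures above can be reproduced with a minimal sketch. It is written
in Python rather than CFA; the class name GapBuffer, its method names,
the initial gap of 9 bytes and the growth policy are assumptions, and
the linked lists of sub-strings are left out.</p>

```python
class GapBuffer:
    """Text stored in one buffer with a movable gap of don't-care bytes."""

    def __init__(self, text, gap=9):
        self.buf = ["_"] * gap + list(text)    # gap starts at the front
        self.gap_start, self.gap_end = 0, gap  # gap is buf[gap_start:gap_end]

    def __str__(self):
        return "".join(self.buf[:self.gap_start] + self.buf[self.gap_end:])

    def raw(self):
        """The whole buffer, with '_' standing for don't-care bytes."""
        return "".join(self.buf)

    def move_gap(self, pos):
        """Place the gap before logical index pos; cost grows with distance."""
        assert 0 <= pos <= len(self.buf) - (self.gap_end - self.gap_start)
        while self.gap_start < pos:            # advance: carry one char left
            ch = self.buf[self.gap_end]
            self.buf[self.gap_end] = "_"
            self.buf[self.gap_start] = ch
            self.gap_start += 1
            self.gap_end += 1
        while self.gap_start > pos:            # retreat: carry one char right
            self.gap_start -= 1
            self.gap_end -= 1
            ch = self.buf[self.gap_start]
            self.buf[self.gap_start] = "_"
            self.buf[self.gap_end] = ch

    def insert(self, text):
        """Write text at the current point, consuming gap bytes."""
        if len(text) > self.gap_end - self.gap_start:
            self._grow(len(text))
        for ch in text:
            self.buf[self.gap_start] = ch
            self.gap_start += 1

    def _grow(self, need):
        """Reallocate with a wider gap (assumed policy: at least double)."""
        extra = max(need, len(self.buf))
        self.buf[self.gap_end:self.gap_end] = ["_"] * extra
        self.gap_end += extra


s = GapBuffer("abcdefghijklmnopqrstuvwxyz")  # state after line 2
s.move_gap(10)                               # line 3: split in front of 'k'
s.insert("xxx")                              # line 4: append to s1
```

<p style="margin-bottom: 0cm; line-height: 100%">Inserting at the gap
only touches the inserted bytes, whereas move_gap copies one byte per
position moved, which is the trade-off the binder analogy describes.</p>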
940<p style="margin-bottom: 0cm; line-height: 100%"><br/>
941
942</p>
943<p style="margin-bottom: 0cm; line-height: 100%"><br/>
944
945</p>
946<h1 class="western">Lit Review to Include</h1>
947<p style="margin-bottom: 0cm; line-height: 100%"><br/>
948
949</p>
950<p style="margin-bottom: 0cm; line-height: 100%"><br/>
951
952</p>
953<p style="margin-bottom: 0cm; line-height: 100%">API’s approach to
954UTF-8 and indexing in:</p>
955<ul>
956        <li/>
957<p style="margin-bottom: 0cm; line-height: 100%">Go</p>
958        <li/>
959<p style="margin-bottom: 0cm; line-height: 100%">Rust</p>
960        <li/>
961<p style="margin-bottom: 0cm; line-height: 100%">std::string</p>
962        <li/>
963<p style="margin-bottom: 0cm; line-height: 100%">alternate
964        proposals for Java at BMP-exit JSR</p>
965        <ul>
966                <li/>
967<p style="margin-bottom: 0cm; line-height: 100%">show these
968                are back-compat challenges unique to java that we can get around</p>
969                <li/>
970<p style="margin-bottom: 0cm; line-height: 100%">still,
971                position wrt c-compat</p>
972        </ul>
973        <li/>
974<p style="margin-bottom: 0cm; line-height: 100%">(find some
975        super-high-level, text-processing DSL... what does it do?)</p>
976        <ul>
977                <li/>
978<p style="margin-bottom: 0cm; line-height: 100%">big data /
979                NLP ??</p>
980        </ul>
981</ul>
982<p style="margin-bottom: 0cm; line-height: 100%"><br/>
983
984</p>
985<p style="margin-bottom: 0cm; line-height: 100%">Existing
986char*-implemented algorithms</p>
987<ul>
988        <li/>
989<p style="margin-bottom: 0cm; line-height: 100%">find some</p>
990        <li/>
991<p style="margin-bottom: 0cm; line-height: 100%">ensure the
992        table-implementation for substring match features in my
993        exact-substring pattern</p>
994        <li/>
995<p style="margin-bottom: 0cm; line-height: 100%">clarify
996        utf8-agnosticism</p>
997        <li/>
998<p style="margin-bottom: 0cm; line-height: 100%">case study
999        for promoting / how-to</p>
1000        <li/>
1001<p style="margin-bottom: 0cm; line-height: 100%">“client ABC
1002        should convert their RDBMS to CFA”</p>
1003        <li/>
1004<p style="margin-bottom: 0cm; line-height: 100%">getopt for
1005        command-line arguments (thanks Thierry for the example)</p>
1006        <ul>
1007                <li/>
1008<p style="margin-bottom: 0cm; line-height: 100%">case of it
1009                reordering argv to do names then positionals; this is how gcc does
1010                mixed</p>
1011        </ul>
1012</ul>
1013<p style="margin-bottom: 0cm; line-height: 100%"><br/>
1014
1015</p>
1016<p style="margin-bottom: 0cm; line-height: 100%">Typical regex
1017implementation algorithms</p>
1018<p style="margin-bottom: 0cm; line-height: 100%"><br/>
1019
1020</p>
1021<p style="margin-bottom: 0cm; line-height: 100%"><br/>
1022
1023</p>
1024<p style="margin-bottom: 0cm; line-height: 100%"><br/>
1025
1026</p>
1027<p style="margin-bottom: 0cm; line-height: 100%"><br/>
1028
1029</p>
1030<p style="margin-bottom: 0cm; line-height: 100%"><br/>
1031
1032</p>
1033</body>
1034</html>