Ignore:
Timestamp:
Apr 13, 2025, 9:16:33 PM (5 months ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master
Children:
d9aee90
Parents:
5ad6f0d
Message:

more proofreading of string chapter

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/mike_brooks_MMath/string.tex

    r5ad6f0d rb0296dba  
    3737\end{figure}
    3838
    39 As mentioned in \VRef{s:String}, a C string differs from other string types as it uses null termination rather than a length, which leads to explicit storage management;
    40 hence, most of its group operations are error prone and expensive.
     39As mentioned in \VRef{s:String}, a C string uses null termination rather than a length, which leads to explicit storage management;
     40hence, most of its group operations are error prone and expensive due to copying.
    4141Most high-level string libraries use a separate length field and specialized storage management to implement group operations.
    4242Interestingly, \CC strings retain null termination in case it is needed to interface with C library functions.
     
    6161A \CFA string manages its length separately from the string, so there is no null (@'\0'@) terminating value at the end of a string value.
    6262Hence, a \CFA string cannot be passed to a C string manipulation function, such as @strcat@.
    63 Like C strings, characters in a @string@ are numbered from the left starting at 0, and in \CFA numbered from the right starting at -1.
    64 \begin{cquote}
    65 \sf
     63Like C strings, characters in a @string@ are numbered from the left starting at 0 (because subscripting is zero-origin), and in \CFA numbered from the right starting at -1.
     64\begin{cquote}
     65\rm
    6666\begin{tabular}{@{}rrrrll@{}}
    6767\small\tt "a & \small\tt b & \small\tt c & \small\tt d & \small\tt e" \\
     
    7070\end{tabular}
    7171\end{cquote}
    72 The following operations have been defined to manipulate an instance of type @string@.
    73 The discussion assumes the following declarations and assignment statements are executed.
     72The following operations manipulate an instance of type @string@, where the discussion assumes the following declarations.
    7473\begin{cfa}
    7574#include @<string.hfa>@
    7675@string@ s = "abcde", name = "MIKE", digit = "0123456789";
    77 const char cs[] = "abc";
     76const char cs[$\,$] = "abc";
    7877int i;
    7978\end{cfa}
     
    8988\begin{tabular}{@{}l|ll|l@{}}
    9089\begin{cfa}
    91 //      string s = 5;
    92         s = 'x';
    93         s = "abc";
    94         s = cs;
    95         s = 45hh;
    96         s = 45h;
     90string s;
     91s = 'x';
     92s = "abc";
     93s = cs;
     94s = 45hh;
     95s = 45h;
    9796\end{cfa}
    9897&
     
    160159
    161160The @len@ operation (short for @strlen@) returns the length of a C or \CFA string.
    162 For consistency, @strlen@ also works with \CFA strings.
     161For compatibility, @strlen@ also works with \CFA strings.
    163162\begin{cquote}
    164163\setlength{\tabcolsep}{15pt}
     
    189188The binary relational, @<@, @<=@, @>@, @>=@, and equality, @==@, @!=@, operators compare \CFA string values using lexicographical ordering, where longer strings are greater than shorter strings.
    190189In C, these operators compare the C string pointer not its value, which does not match programmer expectation.
    191 C strings use function @strcmp@, as the relational/equality operator for string values.
     190C strings use function @strcmp@ to lexicographically compare the string value.
    192191
    193192
    194193\subsection{Concatenation}
    195194
    196 The binary operators @+@ and @+=@ concatenate characters, C strings and \CFA strings, creating the sum of the characters.
    197 \begin{cquote}
     195The binary operators @+@ and @+=@ concatenate C @char@, @char *@ and \CFA strings, creating the sum of the characters.
     196\par\noindent
    198197\begin{tabular}{@{}l|l@{\hspace{15pt}}l|l@{\hspace{15pt}}l|l@{}}
    199198\begin{cfa}
     
    245244\end{cfa}
    246245\end{tabular}
    247 \end{cquote}
    248 For these operations to meet programmer expectations, \CFA introduces two C non-backward compatibilities.
    249 Note, subtracting pointers or characters has a low-level use-case.
     246\par\noindent
     247However, including @<string.hfa>@ can result in ambiguous uses of the overloaded @+@ operator.\footnote{Combining multiple packages in any programming language can result in name clashes or ambiguities.}
     248While subtracting characters or pointers has a low-level use-case
    250249\begin{cfa}
    251250ch - '0'    $\C[2in]{// find character offset}$
    252 cp1 - cp2;  $\C{// find pointer offset}\CRT$
    253 \end{cfa}
    254 However, there is no obvious use case for addition.
     251cs - cs2;  $\C{// find pointer offset}\CRT$
     252\end{cfa}
     253addition is less obvious
    255254\begin{cfa}
    256255ch + 'b'    $\C[2in]{// add character values}$
    257 cp1 + 'a';  $\C{// move pointer cp1['a']}\CRT$
    258 \end{cfa}
    259 Adding character values or advancing a pointer with a character are unusual operations, and hence, unlikely to existing in C programs.
    260 There is a legitimate use case for arithmetic on @signed@/@unsigned@ characters (bytes), but these type are treated differently from @char@ in \CC and \CFA.
    261 However, for backwards compatibility reasons it is impossible to restrict or remove arithmetic on type @char@.
    262 Stealing these two cases for use with strings, allows all combinations of concatenation among @char@, @char *@, and @string@.
    263 Note, stealing only occurs if a program includes @<string.hfa>@, resulting is ambiguities in existing C code where there is no way to disambiguate.
    264 \begin{cfa}
    265 ch = 'a' + 'b'; $\C[2in]{// LHS disambiguate, add character values}$
    266 s = 'a' + 'b'; $\C{// LHS disambiguate, concatenation characters}$
    267 sout | 'a' + 'b'; $\C{// ambiguous with <string.hfa>, add or concatenate?}$
    268 sout | (char)'a' + 'b'; $\C{// disambiguate}$
    269 sout | "a" + "b"; $\C{// disambiguate}\CRT$
    270 \end{cfa}
    271 Again, introducing disambiguates for this scenario are rare, as adding characters is uncommon.
    272 
    273 \CC cannot support this generality because it does not use the left-hand side of assignment in expression resolution.
     256cs + 'a';  $\C{// move pointer cs['a']}\CRT$
     257\end{cfa}
     258There is a legitimate use case for arithmetic with @signed@/@unsigned@ characters (bytes), but these types are treated differently from @char@ in \CC and \CFA.
     259However, backwards compatibility makes is impossible to restrict or remove addition on type @char@.
     260Similarly, it is impossible to restrict or remove addition on type @char *@ because (unfortunately) it is subscripting: @cs + 'a'@ implies @cs['a']@ or @'a'[cs]@.
     261
     262Fortunately, the prior concatenation examples show complex mixed-mode interactions among @char@, @char *@, and @string@ (variables are the same as constants) work correctly.
     263The reason is that the \CFA type-system handles this kind of overloading well using the left-hand assignment-type and complex conversion costs.
     264Hence, the type system correctly handles all uses of addition (explicit or implicit) for @char *@.
     265\begin{cfa}
     266printf( "%s %s %s %c %c\n", "abc", cs, cs + 3, cs['a'], 'a'[cs] );
     267\end{cfa}
     268Only @char@ addition can result in ambiguities, and only when there is no left-hand information.
     269\begin{cfa}
     270ch = ch + 'b'; $\C[2in]{// LHS disambiguate, add character values}$
     271s = 'a' + 'b'; $\C{// LHS disambiguate, concatenate characters}$
     272printf( "%c\n", @'a' + 'b'@ ); $\C[2in]{// no LHS information, ambiguous}$
     273printf( "%c\n", @(return char)@('a' + 'b') ); $\C{// disambiguate with ascription cast}$
     274\end{cfa}
     275The ascription cast, @(return T)@, disambiguates by stating a (LHS) type to use during expression resolution (not a conversion).
     276Fortunately, character addition without LHS information is rare in C/\CFA programs, so repurposing the operator @+@ for @string@ types is not a problem.
     277Note, other programming languages that repurpose @+@ for concatenation, could have a similar ambiguity issue.
     278
     279Interestingly, \CC cannot support this generality because it does not use the left-hand side of assignment in expression resolution.
    274280While it can special case some combinations:
    275281\begin{c++}
     
    288294The binary operators @*@ and @*=@ repeat a string $N$ times.
    289295If $N = 0$, a zero length string, @""@, is returned.
    290 Like concatenation, multiplication is stolen for @char@;
    291 multiplication for pointers does not exist in C.
    292 \begin{cquote}
    293 \setlength{\tabcolsep}{15pt}
    294 \begin{tabular}{@{}l|l@{}}
    295 \begin{cfa}
     296\begin{cquote}
     297\setlength{\tabcolsep}{15pt}
     298\begin{tabular}{@{}l|l@{}}
     299\begin{cfa}
     300s = 'x' * 0;
    296301s = 'x' * 3;
    297302s = "abc" * 3;
     
    300305&
    301306\begin{cfa}
     307"
    302308"xxx"
    303309"abcabcabc"
     
    306312\end{tabular}
    307313\end{cquote}
     314Like concatenation, there is a potential ambiguity with multiplication of characters;
     315multiplication for pointers does not exist in C.
     316\begin{cfa}
     317ch = ch * 3; $\C[2in]{// LHS disambiguate, multiply character values}$
     318s = 'a' * 3; $\C{// LHS disambiguate, concatenate characters}$
     319printf( "%c\n", @'a' * 3@ ); $\C[2in]{// no LHS information, ambiguous}$
     320printf( "%c\n", @(return char)@('a' * 3) ); $\C{// disambiguate with ascription cast}$
     321\end{cfa}
     322Fortunately, character multiplication without LHS information is even rarer than addition, so repurposing the operator @*@ for @string@ types is not a problem.
    308323
    309324
     
    355370If the substring request extends beyond the beginning or end of the string, it is clipped (shortened) to the bounds of the string.
    356371If the substring request is completely outside of the original string, a null string is returned.
    357 The pattern form either returns the pattern string is the pattern matches or a null string if the pattern does not match.
     372The pattern-form either returns the pattern string is the pattern matches or a null string if the pattern does not match.
    358373The usefulness of this mechanism is discussed next.
    359374
    360 The substring operation can also appear on the left side of an assignment and replaced by the string value on the right side.
    361 The length of the right string may be shorter, the same length, or longer than the length of left string.
     375The substring operation can appear on the left side of assignment, where it defines a replacement substring.
     376The length of the right string may be shorter, the same, or longer than the length of left string.
    362377Hence, the left string may decrease, stay the same, or increase in length.
    363378\begin{cquote}
     
    398413Extending the pattern to a regular expression is a possible extension.
    399414
    400 The replace operation returns a string in which all occurrences of a substring are replaced by another string.
     415The replace operation extensions substring to substitute all occurrences.
    401416\begin{cquote}
    402417\setlength{\tabcolsep}{15pt}
     
    420435\subsection{Searching}
    421436
    422 The find operation returns the position of the first occurrence of a key string in a string.
    423 If the key does not appear in the string, the length of the string plus one is returned.
     437The find operation returns the position of the first occurrence of a key in a string.
     438If the key does not appear in the string, the length of the string is returned.
    424439\begin{cquote}
    425440\setlength{\tabcolsep}{15pt}
     
    428443i = find( digit, '3' );
    429444i = find( digit, "45" );
    430 string x = "567";
    431 i = find( digit, x );
     445i = find( digit, "abc" );
    432446\end{cfa}
    433447&
     
    4354493
    4364504
    437 
    438 5
    439 \end{cfa}
    440 \end{tabular}
    441 \end{cquote}
    442 The character-class operations indicate if a string is composed completely of a particular class of characters, \eg, alphabetic, numeric, vowels, \etc.
     45110
     452\end{cfa}
     453\end{tabular}
     454\end{cquote}
     455
     456A character-class operation indicate if a string is composed completely of a particular class of characters, \eg, alphabetic, numeric, vowels, \etc.
    443457\begin{cquote}
    444458\setlength{\tabcolsep}{15pt}
     
    457471\end{tabular}
    458472\end{cquote}
    459 @vowels@ defines a character class and function @include@ checks if all characters in the string are included in the class (compliance).
    460 The position of the last character plus 1 is return if the string is compliant or the position of the first non-compliant character.
     473@vowels@ defines a character class and function @include@ checks if all characters in the string appear in the class (compliance).
     474The position of the last character is returned if the string is compliant or the position of the first non-compliant character.
    461475There is no relationship between the order of characters in the two strings.
    462476Function @exclude@ is the reverse of @include@, checking if all characters in the string are excluded from the class (compliance).
     
    495509\end{cquote}
    496510
    497 The test operation checks if each character in a string is in one of the C character classes.\footnote{It is part of the hereditary madness of C that these function take and return an \lstinline{int} rather than a \lstinline{char}.}
    498 \begin{cquote}
    499 \setlength{\tabcolsep}{15pt}
    500 \begin{tabular}{@{}l|l@{}}
    501 \begin{cfa}
    502 i = test( "1FeC34aB", @isxdigit@ );
    503 i = test( ".,;'!\"", @ispunct@ );
    504 i = test( "XXXx", @isupper@ );
     511There are versions of @include@ and @exclude@, returning a position or string, taking a validation function, like one of the C character-class routines.\footnote{It is part of the hereditary of C that these function take and return an \lstinline{int} rather than a \lstinline{char}, which affects the function type.}
     512\begin{cquote}
     513\setlength{\tabcolsep}{15pt}
     514\begin{tabular}{@{}l|l@{}}
     515\begin{cfa}
     516i = include( "1FeC34aB", @isxdigit@ );
     517i = include( ".,;'!\"", @ispunct@ );
     518i = include( "XXXx", @isupper@ );
    505519\end{cfa}
    506520&
     
    512526\end{tabular}
    513527\end{cquote}
    514 The position of the last character plus 1 is return if the string is compliant or the position of the first non-compliant character.
    515 
    516 Combining substring and search allows actions like trimming whitespace from the start of a line.
    517 \begin{cquote}
    518 \setlength{\tabcolsep}{15pt}
    519 \begin{tabular}{@{}l|l@{}}
    520 \begin{cfa}
    521 string line = "  \t  xxx yyy zzz";
    522 string trim = line( test( line, isspace ) );
    523 \end{cfa}
    524 &
    525 \begin{cfa}
    526 
    527 "xxx yyy zzz"
    528 \end{cfa}
    529 \end{tabular}
    530 \end{cquote}
     528These operations perform an apply of the validation function to each character, and it returns a boolean indicating a stopping condition.
     529The position of the last character is returned if the string is compliant or the position of the first non-compliant character.
    531530
    532531The translate operation returns a string with each character transformed by one of the C character transformation functions.
     
    551550
    552551
    553 \subsection{Returning N+1 on Search Failure}
    554 
    555 String search functions can fail to find the key in the target string.
    556 The failure must be returned as an alternate outcome, possibly an exception.
    557 Many string types use a return code to indicate the failure, such as @0@ or @-1@ (PL/I~\cite{PLI} returns @0@).
    558 \CFA adopts the approach used by the index-of function in APL~\cite{apl}, which returns length of the target string plus 1 ($N+1$).
    559 
    560 When a search is performed to find the starting location for a substring operation, returning $N+1$ is arguably the best choice.
    561 For example, in extracting words from a string, it is necessary to scan from left to right over whitespace until the first alphabetic character is found.
    562 \begin{cfa}
    563 line = line( exclude( line, alpha ) );  // find start of word
    564 \end{cfa}
    565 If the line contains all whitespace and @exclude@ returns 0 or -1, the result of the substring is unclear.
    566 Most string types generate an error, or clip the starting value to 1, resulting in the entire whitespace string being selected.
    567 This behaviour leads to the awkward pattern:
    568 \begin{cfa}
    569 i = exclude( line, alpha );
    570 if ( i != -1 ) line = line( i );
    571 else line = "";
    572 \end{cfa}
    573 If @exclude@ returns $N+1$, the starting position for the substring operation is beyond the end of the string leaving a null string.
    574 This scenario is repeated when scanning off the word.
    575 \begin{cfa}
    576 word = line( 0, include( line, alpha ) - 1 );  // scan off word
    577 \end{cfa}
    578 If the entire line is composed of a word, the @include@ fails to find a non-alphabetic character, resulting in the same awkward pattern.
     552\subsection{Returning N on Search Failure}
     553
     554Some of the prior string operations are composite, \eg string operations returning the longest substring of compliant characters (@include@) are built using a search and then substring the appropriate text.
     555However, string search can fail, which is reported as an alternate search outcome, possibly an exception.
     556Many string libraries use a return code to indicate search failure, with a failure value of @0@ or @-1@ (PL/I~\cite{PLI} returns @0@).
     557This semantics leads to the awkward pattern, which can appear many times in a string library or user code.
     558\begin{cfa}
     559i = exclude( s, alpha );
     560if ( i != -1 ) return s( 0, i );
     561else return "";
     562\end{cfa}
     563
     564\CFA also adopts a return code but the failure value is taken from the index-of function in APL~\cite{apl}, which returns the length of the target string $N$ (or $N+1$ for 1 origin).
     565This semantics allows many search and substring functions to be written without conditions, \eg:
     566\begin{cfa}
     567string include( const string & s, int (*f)( int ) ) { return @s( 0, include( s, f ) )@; }
     568string exclude( const string & s, int (*f)( int ) ) { return @s( 0, exclude( s, f ) )@; }
     569\end{cfa}
    579570In string systems with an $O(1)$ length operator, checking for failure is low cost.
    580571\begin{cfa}
    581572if ( include( line, alpha ) == len( line ) ) ... // not found, 0 origin
    582573\end{cfa}
     574\VRef[Figure]{f:ExtractingWordsText} compares \CC and \CFA string code for extracting words from a line of text, repeatedly removing non-word text and then a word until the line is empty.
     575The \CFA code is simpler solely because of the choice for indicating search failure.
     576(It is possible to simplify the \CC version by concatenating a sentinel character at the end of the line so the call to @find_first_not_of@ does not fail.)
     577
     578\begin{figure}
     579\begin{cquote}
     580\setlength{\tabcolsep}{15pt}
     581\begin{tabular}{@{}l|l@{}}
     582\multicolumn{1}{c}{\textbf{\CC}} & \multicolumn{1}{c}{\textbf{\CFA}} \\
     583\begin{cfa}
     584for ( ;; ) {
     585        string::size_type posn = line.find_first_of( alpha );
     586  if ( posn == string::npos ) break;
     587        line = line.substr( posn );
     588        posn = line.find_first_not_of( alpha );
     589        if ( posn != string::npos ) {
     590                cout << line.substr( 0, posn ) << endl;
     591                line = line.substr( posn );
     592        } else {
     593                cout << line << endl;
     594                line = "";
     595        }
     596}
     597\end{cfa}
     598&
     599\begin{cfa}
     600for ( ;; ) {
     601        size_t posn = exclude( line, alpha );
     602  if ( posn == len( line ) ) break;
     603        line = line( posn );
     604        posn = include( line, alpha );
     605
     606        sout | line( 0, posn );
     607        line = line( posn );
     608
     609
     610
     611
     612}
     613\end{cfa}
     614\end{tabular}
     615\end{cquote}
     616\caption{Extracting Words from Line of Text}
     617\label{f:ExtractingWordsText}
     618\end{figure}
    583619
    584620
     
    589625\begin{cfa}
    590626char s[32];   // string s;
     627strlen( s );
     628strnlen( s, 3 );
     629strcmp( s, "abc" );
     630strncmp( s, "abc", 3 );
    591631strcpy( s, "abc" );
    592632strncpy( s, "abcdef", 3 );
     
    598638
    599639
    600 \subsection{Parameter Passing}
    601 
    602 A substring is treated as a pointer into the base (substringed) string rather than creating a copy of the subtext.
    603 Hence, if the referenced item is changed, then the pointer sees the change.
    604 Pointers to the result value of a substring operation are defined to always start at the same location in their base string as long as that starting location exists, independent of changes to themselves or the base string.
    605 However, if the base string value changes, this may affect the values of one or more of the substrings to that base string.
    606 If the base string value shortens so that its end is before the starting location of a substring, resulting in the substring starting location disappearing, the substring becomes a null string located at the end of the base string.
    607 
    608 \VRef[Figure]{f:ParameterPassing} shows passing the results of substring operations by reference and by value to a subprogram.
    609 Notice the side-effects to other reference parameters as one is modified.
    610 
    611 \begin{figure}
    612 \begin{cfa}
    613 // x, a, b, c, & d are substring results passed by reference
    614 // e is a substring result passed by value
    615 void test(string &x, string &a, string &b, string &c, string &d, string e) {
    616                                                                         $\C{//   x                                a               b               c               d               e}$
    617         a( 1, 2 ) = "aaa";                              $\C{// aaaxxxxxxxxxxx   aaax    axx             xxxxx   xxxxx   xxxxx}$
    618         b( 2, 12 ) = "bbb";                             $\C{// aaabbbxxxxxxxxx  aaab    abbb    bbxxx   xxxxx   xxxxx}$
    619         c( 4, 5 ) = "ccc";                              $\C{// aaabbbxcccxxxxxx aaab    abbb    bbxccc  ccxxx   xxxxx}$
    620         c = "yyy";                                              $\C{// aaabyyyxxxxxx    aaab    abyy    yyy             xxxxx   xxxxx}$
    621         d( 1, 3 ) = "ddd";                              $\C{// aaabyyyxdddxx    aaab    abyy    yyy             dddxx   xxxxx}$
    622         e( 1, 3 ) = "eee";                              $\C{// aaabyyyxdddxx    aaab    abyy    yyy             dddxx   eeexx}$
    623         x = e;                                                  $\C{// eeexx                    eeex    exx             x                               eeexx}$
    624 }
    625 int main() {
    626         string x = "xxxxxxxxxxxxx";
    627         test( x, x(1,3), x(3,3), x(5,5), x(9,5), x(9,5) );
    628 }
    629 \end{cfa}
    630 \caption{Parameter Passing}
    631 \label{f:ParameterPassing}
    632 \end{figure}
    633 
    634 
    635640\subsection{I/O Operators}
    636641
    637 The ability to read and print strings is as essential as for any other type.
    638 The goal for character I/O is to work with groups rather than individual characters.
     642The ability to input and output strings is as essential as for any other type.
     643The goal for character I/O is to also work with groups rather than individual characters.
    639644A comparison with \CC string I/O is presented as a counterpoint to \CFA string I/O.
    640645
     
    658663
    659664The \CFA input/output operator @|@ is defined on type @string@.
    660 \CFA output for @char@, @char *@, and @string@ are the similar.
     665\CFA output for @char@, @char *@, and @string@ are similar.
    661666The \CFA manipulators are @bin@, @oct@, @hex@, @wd@, and its associated width control and @left@.
    662667\begin{cquote}
     
    685690\end{cquote}
    686691
    687 \CC input matching for @char@, @char *@, and @string@ are the similar, where \emph{all} input characters are read from the current point in the input stream to the end of the type size, format width, whitespace, end of line (@'\n'@), or end of file.
     692\CC input matching for @char@, @char *@, and @string@ are similar, where \emph{all} input characters are read from the current point in the input stream to the end of the type size, format width, whitespace, end of line (@'\n'@), or end of file.
    688693The \CC manipulator is @setw@ to restrict the size.
    689694Reading into a @char@ is safe as the size is 1, @char *@ is unsafe without using @setw@ to constraint the length (which includes @'\0'@), @string@ is safe as its grows dynamically as characters are read.
     
    695700string s;
    696701cin >> ch >> setw( 5 ) >> c  >> s;
    697 abcde   fg
     702@abcde   fg@
    698703\end{c++}
    699704&
     
    706711\end{tabular}
    707712\end{cquote}
    708 Input text can be gulped from the current point to an arbitrary delimiter character using @getline@, which reads whitespace.
    709 
    710 The \CFA philosophy for input is that for every constant type in C, these constants should be usable as input.
     713Input text can be gulped, including whitespace, from the current point to an arbitrary delimiter character using @getline@.
     714
     715The \CFA philosophy for input is that, for every constant type in C, these constants should be usable as input.
    711716For example, the complex constant @3.5+4.1i@ can appear as input to a complex variable.
    712717\CFA input matching for @char@, @char *@, and @string@ are similar.
    713718C-strings may only be read with a width field, which should match the string size.
    714719Certain input manipulators support a scanset, which is a simple regular expression from @printf@.
    715 The \CFA manipulators for these types are @wdi@\footnote{Due to an overloading issue in the type-resolver, the input width name must be temporarily different from the output, \lstinline{wdi} versus \lstinline{wd}.},
    716 and its associated width control and @left@, @quote@, @incl@, @excl@, and @getline@.
     720The \CFA manipulators for these types are @wdi@,\footnote{Due to an overloading issue in the type-resolver, the input width name must be temporarily different from the output, \lstinline{wdi} versus \lstinline{wd}.} and its associated width control and @left@, @quote@, @incl@, @excl@, and @getline@.
    717721\begin{cquote}
    718722\setlength{\tabcolsep}{10pt}
     
    722726string s;
    723727sin | ch | wdi( 5, c ) | s;
    724 abcde fg
     728@abcde fg@
    725729sin | quote( ch ) | quote( wdi( sizeof(c), c ) ) | quote( s, '[', ']' ) | nl;
    726 $'a' "bcde" [fg]$
     730@$'a' "bcde" [fg]$@
    727731sin | incl( "a-zA-Z0-9 ?!&\n", s ) | nl;
    728 x?&000xyz TOM !.
     732@x?&000xyz TOM !.@
    729733sin | excl( "a-zA-Z0-9 ?!&\n", s );
    730 <>{}{}STOP
     734@<>{}{}STOP@
    731735\end{c++}
    732736&
     
    734738
    735739
    736 'a' "bcde" [fg]
    737 
    738 'a' "bcde" [fg]
     740'a' "bcde" "fg"
     741
     742'a' "bcde" "fg"
    739743
    740744"x?&000xyz TOM !"
     
    745749\end{tabular}
    746750\end{cquote}
    747 
     751Note, the ability to read in quoted strings to match with program strings.
     752The @nl@ at the end of an input ignores the rest of the line.
    748753
    749754
     
    751756
    752757While \VRef[Figure]{f:StrApiCompare} emphasizes cross-language similarities, it elides many specific operational differences.
    753 For example, the @replace@ function selects a substring in the target and substitutes it with the source string, which can be smaller or larger than the substring.
    754 \CC performs the modification on the mutable receiver object
    755 \begin{cfa}
     758For example, the \CC @replace@ function selects a substring in the target and substitutes it with the source string, which can be smaller or larger than the substring.
     759\CC modifies the mutable receiver object, replacing by position (zero origin) and length.
     760\begin{cquote}
     761\setlength{\tabcolsep}{15pt}
     762\begin{tabular}{@{}l|l@{}}
     763\begin{c++}
    756764string s1 = "abcde";
    757 s1.replace( 2, 3, "xy" );  $\C[2.25in]{// replace by position (zero origin) and length, mutable}\CRT$
    758 cout << s1 << endl;
    759 $\texttt{\small abxy}$
    760 \end{cfa}
    761 while Java allocates and returns a new string with the result, leaving the receiver unmodified.
     765s1.replace( 2, 3, "xy" );
     766\end{c++}
     767&
     768\begin{c++}
     769
     770"abxy"
     771\end{c++}
     772\end{tabular}
     773\end{cquote}
     774Java cannot modify the receiver (immutable strings) so it returns a new string, replacing by text.
    762775\label{p:JavaReplace}
     776\begin{cquote}
     777\setlength{\tabcolsep}{15pt}
     778\begin{tabular}{@{}l|l@{}}
    763779\begin{java}
    764780String s = "abcde";
    765 String r = s.replace( "cde", "xy" );  $\C[2.25in]{// replace by text, immutable}$
    766 System.out.println( s + ' ' + r );
    767 $\texttt{\small abcde abxy}$
     781String r = s.replace( "cde", "xy" );
    768782\end{java}
    769 % Generally, Java's @String@ type is immutable.
    770 Java provides a @StringBuffer@ near-analog that is mutable.
     783&
     784\begin{java}
     785
     786"abxy"
     787\end{java}
     788\end{tabular}
     789\end{cquote}
     790Java also provides a mutable @StringBuffer@, replacing by position (zero origin) and length.
     791\begin{cquote}
     792\setlength{\tabcolsep}{15pt}
     793\begin{tabular}{@{}l|l@{}}
    771794\begin{java}
    772795StringBuffer sb = new StringBuffer( "abcde" );
    773 sb.replace( 2, 5, "xy" );  $\C[2.25in]{// replace by position, mutable}\CRT$
    774 System.out.println( sb );
    775 $\texttt{\small abxy}$
     796sb.replace( 2, 5, "xy" );
    776797\end{java}
    777 However, there are significant differences;
    778 \eg, @StringBuffer@'s @substring@ function returns a @String@ copy that is immutable.
    779 Finally, the operations between these type are asymmetric, \eg @String@ has @replace@ by text but not replace by position and vice versa for @StringBuffer@.
    780 
    781 More significant operational differences relate to storage management, often appearing through assignment (@target = source@), and are summarized in \VRef[Figure]{f:StrSemanticCompare}.
    782 % It calls out the consequences of each language taking a different approach on ``internal'' storage management.
     798&
     799\begin{java}
     800
     801"abxy"
     802\end{java}
     803\end{tabular}
     804\end{cquote}
     805However, there are anomalies.
     806@StringBuffer@'s @substring@ returns a @String@ copy that is immutable rather than modifying the receiver.
     807As well, the operations are asymmetric, \eg @String@ has @replace@ by text but not replace by position and vice versa for @StringBuffer@.
     808
     809More significant operational differences relate to storage management, often appearing through assignment (@target = source@), and are summarized in \VRef[Figure]{f:StrSemanticCompare}, defining properties: type abstraction, state, symmetry, and referent.
    783810The following discussion justifies the figure's yes/no entries per language.
    784811
     
    807834                                        & Laxed: The target's type is anything string-like; it may have a different status concerning ownership.
    808835                                                                & Strict: The target's type is the same as the source; both strings are equivalent peers concerning ownership.
    809                                                                                         & --            & no    & yes   & yes \\
     836                                                                                        & N/A           & no    & yes   & yes \\
    810837\hline
    811838Referent
     
    835862char s[$\,$] = "abcde";
    836863\end{cfa}
    837 creates a second-class fixed-sized string-variable, as it can only be used in its lexical context;
    838 it cannot be passed by value to string operations or user functions as C array's cannot be copied because there is no string-length information passed to the function.
     864creates a second-class fixed-sized string-variable, as it can only be used in its lexical context, \ie it cannot be passed by value to string operations or user functions.
     865The reason is that there is no implicit mechanism to pass the string-length information to the function.
    839866Therefore, only pointers to strings are first-class, and discussed further.
    840867\begin{cfa}
     
    865892The lax symmetry reflects how the validity of @s1@ depends on the content and lifetime of @s@.
    866893It is common practice in \CC to use the @s1@-style pass by reference, with the understanding that the callee only uses the referenced string for the duration of the call, \ie no side-effect using the parameter.
    867 So, when the called function is a constructor, it is typical to use an @s2@-style copy-initialization to string-object-typed member.
     894So, when the called function is a constructor, it is typical to use an @s2@-style copy-initialization.
    868895Exceptions to this pattern are possible, but require the programmer to assure safety where the type system does not.
    869896The @s3@ initialization is constrained to copy the substring because @c_str@ always provides a null-terminated character, which may be different from the source string.
     
    946973\input{sharing1.tex}
    947974Here, the aliasing (@`share@) causes partial changes (subscripting) to flow in both directions.
     975(In the following examples, watch how @s1@ and @s1a@ change together, and @s2@ is independent.)
    948976\input{sharing2.tex}
    949977Similarly for complete changes.
    950978\input{sharing3.tex}
    951 
    952979Because string assignment copies the value, RHS aliasing is irrelevant.
    953980Hence, aliasing of the LHS is unaffected.
     
    966993The following rules explain aliasing substrings that flow in the opposite direction, large to small.
    967994
    968 Growth and shrinkage are natural extensions, as for the text-editor example mentioned earlier, where an empty substring is as real real as an empty string.
     995Growth and shrinkage are natural extensions, as for the text-editor example mentioned earlier, where an empty substring is as real as an empty string.
    969996\input{sharing8.tex}
    970997
     
    9831010
    9841011%\input{sharing-demo.tex}
     1012
     1013\VRef[Figure]{f:ParameterPassing} shows similar relationships when passing the results of substring operations by reference and by value to a subprogram.
     1014Again, notice the side-effects to other reference parameters as one is modified.
     1015
     1016\begin{figure}
     1017\begin{cfa}
     1018// x, a, b, c, & d are substring results passed by reference
     1019// e is a substring result passed by value
     1020void test( string & x, string & a, string & b, string & c, string & d, string e ) {
     1021\end{cfa}
     1022\begin{cquote}
     1023\setlength{\tabcolsep}{2pt}
     1024\begin{tabular}{@{}ll@{}}
     1025\begin{cfa}
     1026
     1027        a( 0, 2 ) = "aaa";
     1028        b( 1, 12 ) = "bbb";
     1029        c( 4, 5 ) = "ccc";
     1030        c = "yyy";
     1031        d( 0, 3 ) = "ddd";
     1032        e( 0, 3 ) = "eee";
     1033        x = e;
     1034}
     1035\end{cfa}
     1036&
     1037\sf
     1038\setlength{\extrarowheight}{-0.5pt}
     1039\begin{tabular}{@{}llllll@{}}
     1040x                                       & a             & b             & c             & d             & e             \\
     1041@"aaaxxxxxxxxx"@        & @"aaax"@      & @"xxx"@       & @"xxxxx"@     & @"xxx"@       & @"xxx"@       \\
     1042@"aaaxbbbxxxxxx"@       & @"aaax"@      & @"xbbb"@      & @"xxxx"@      & @"xxx"@       & @"xxx"@       \\
     1043@"aaaxbbbxxxcccxx"@     & @"aaax"@      & @"xbbb"@      & @"xxxccc"@& @"cccxx"@ & @"xxx"@       \\
     1044@"aaaxbbbyyyxx"@        & @"aaax"@      & @"aaab"@      & @"yyy"@       & @"xx"@        & @"xxx"@       \\
     1045@"aaaxbbbyyyddd"@       & @"aaax"@      & @"xbbb"@      & @"yyy"@       & @"ddd"@       & @"xxx"@       \\
     1046@"aaaxbbbyyyddd"@       & @"aaax"@      & @"xbbb"@      & @"yyy"@       & @"ddd"@       & @"eee"@       \\
     1047@"eee"@                         & @""@  & @""@  & @""@          & @"eee"@ \\
     1048 & \\
     1049\end{tabular}
     1050\end{tabular}
     1051\end{cquote}
     1052\begin{cfa}
     1053int main() {
     1054        string x = "xxxxxxxxxxx";
     1055        test( x, x(0, 3), x(2, 3), x(4, 5), x(8, 5), x(8, 5) );
     1056}
     1057\end{cfa}
     1058\caption{Parameter Passing}
     1059\label{f:ParameterPassing}
     1060\end{figure}
    9851061
    9861062
     
    10771153The heap header and text buffer define a sharing context.
    10781154Normally, one global sharing context is appropriate for an entire program;
    1079 concurrent exceptions are discussed in \VRef{s:AvoidingImplicitSharing}.
     1155concurrent exceptions are discussed in \VRef{s:ControllingImplicitSharing}.
    10801156A string is a handle into the buffer and linked into a list.
    10811157The list is doubly linked for $O(1)$ insertion and removal at any location.
     
    11561232
    11571233
    1158 \subsection{Avoiding implicit sharing}
    1159 \label{s:AvoidingImplicitSharing}
    1160 
    1161 There are tradeoffs associated with the copy-on-write mechanism.
    1162 Several qualitative matters are detailed in \VRef{s:PerformanceAssessment} and the qualitative issue of multi-threaded support is introduced here.
    1163 The \CFA string library provides a switch to disable threads allocating from the string buffer, when string sharing is unsafe.
    1164 When toggled, string management is moved to the storage allocator, specifically @malloc@/@free@, where the storage allocator is assumed to be thread-safe.
    1165 
    1166 In detail, string sharing has inter-linked string handles, so any participant managing one string is also managing, directly, the neighbouring strings, and from there, a data structure of the ``set of all strings.''
    1167 This string structure is intended for sequential access.
    1168 Hence, multiple threads using shared strings need to avoid modifying (concurrently) an instance of this structure (like Java immutable strings).
    1169 A positive consequence of this approach is that independent threads can use the sharing buffer without locking overhead.
    1170 
    1171 When the string library is running with sharing disabled, it runs without implicit thread-safety challenges, which is the same as the \CC STL, and with performance goals similar to the STL.
    1172 Running with sharing disabled can be thought of as a STL-emulation mode.
    1173 Hence, concurrent users of string objects must still bring their own mutual exclusion, but the string library does not add any cross thread uses that are not apparent in a user's code.
    1174 
    1175 The \CFA string library provides the type @string_sharectx@ to control an ambient sharing context for a current thread.
    1176 It allows two adjustments: to opt out of sharing entirely or to begin sharing within a private context.
    1177 Either way, the chosen mode applies only to the current thread, for the duration of the lifetime of the created  @string_sharectx@ object, up to being suspended by child lifetimes of different contexts.
    1178 \VRef[Figure]{fig:string-sharectx} illustrates its behaviour.
    1179 Executing the example does not produce an interesting outcome.
    1180 But the comments indicate when the logical copy operation runs with
    1181 \begin{description}
    1182     \item[share:] the copy being deferred, as described through the rest of this section (fast), or
    1183     \item[copy:] the copy performed eagerly (slow).
    1184 \end{description}
    1185 Only eager copies can cross @string_sharectx@ boundaries.
    1186 The intended use is with stack-managed lifetimes, in which the established context lasts until the current function returns, and affects all functions called that do not create their own contexts.
    1187 In this example, the single-letter functions are called in alphabetic order.
    1188 The functions @a@, @b@ and @g@ share string character ranges with each other, because they occupy a common sharing-enabled context.
    1189 The function @e@ shares within itself (because its is in a sharing-enabled context), but not with the rest of the program (because its context is not occupied by any of the rest of the program).
    1190 The functions @c@, @d@ and @f@ never share anything, because they are in a sharing-disabled context.
    1191 
     1234\subsection{Controlling implicit sharing}
     1235\label{s:ControllingImplicitSharing}
     1236
     1237There are tradeoffs associated with sharing and its implicit copy-on-write mechanism.
     1238Several qualitative matters are detailed in \VRef{s:PerformanceAssessment}.
     1239In detail, string sharing has inter-linked string handles, so managing one string is also managing the neighbouring strings, and from there, a data structure of the ``set of all strings.''
     1240Therefore, it is useful to toggle this capability on or off when it is not providing any application benefit.
    11921241
    11931242\begin{figure}
     
    12011250\end{figure}
    12021251
     1252The \CFA string library provides the type @string_sharectx@ to control an ambient sharing context.
     1253It allows two adjustments: to opt out of sharing entirely or to begin sharing within a private context.
     1254Running with sharing disabled can be thought of as a \CC STL-emulation mode, where each string is dynamically allocated.
     1255The chosen mode applies for the duration of the lifetime of the created  @string_sharectx@ object, up to being suspended by child lifetimes of different contexts.
     1256\VRef[Figure]{fig:string-sharectx} illustrates this behaviour by showing the stack frames of a program in execution.
     1257In this example, the single-letter functions are called in alphabetic order.
     1258The functions @a@, @b@ and @g@ share string character ranges with each other, because they occupy a common sharing-enabled context.
     1259The function @e@ shares within itself (because its is in a sharing-enabled context), but not with the rest of the program (because its context is not occupied by any of the rest of the program).
     1260The functions @c@, @d@ and @f@ never share anything, because they are in a sharing-disabled context.
     1261Executing the example does not produce an interesting outcome, but the comments in the picture indicate when the logical copy operation runs with
     1262\begin{description}
     1263    \item[share:] the copy being deferred, as described through the rest of this section (fast), or
     1264    \item[copy:] the copy performed eagerly (slow).
     1265\end{description}
     1266Only eager copies can cross @string_sharectx@ boundaries.
     1267The intended use is with stack-managed lifetimes, in which the established context lasts until the current function returns, and affects all functions called that do not create their own contexts.
    12031268
    12041269[ TODO: true up with ``is thread local'' (implement that and expand this discussion to give a concurrent example, or adjust this wording) ]
    12051270
    12061271
     1272\subsection{Sharing and threading}
     1273
     1274The \CFA string library provides no thread safety, the same as \CC string, providing similar performance goals.
     1275Threads can create their own string buffers and avoid passing these strings to other threads, or require that shared strings be immutable, as concurrent reading is safe.
     1276A positive consequence of this approach is that independent threads can use the sharing buffer without locking overhead.
     1277When string sharing amongst threads is required, program-wide string-management can toggled to non-sharing using @malloc@/@free@, where the storage allocator is assumed to be thread-safe.
     1278Finally, concurrent users of string objects can provide their own mutual exclusion.
     1279
     1280
    12071281\subsection{Future work}
    12081282
    1209 To discuss: Unicode
    1210 
    1211 To discuss: Small-string optimization
     1283Implementing the small-string optimization is straightforward, as a string header contains a pointer to the string text in the buffer.
     1284This pointer could be marked with a flag and contain a small string.
     1285However, there is now a conditional check required on the fast-path to switch between small and large string operations.
     1286
     1287It might be possible to pack 16- or 32-bit Unicode characters within the same string buffer as 8-bit characters.
     1288Again, locations for identification flags must be found and checked along the fast path to select the correct actions.
     1289Handling utf8 (variable length), is more problematic because simple pointer arithmetic cannot be used to stride through the variable-length characters.
     1290Trying to use a secondary array of fixed-sized pointers/offsets to the characters is possible, but raises the question of storage management for the utf8 characters themselves.
    12121291
    12131292
Note: See TracChangeset for help on using the changeset viewer.