-              r5ad6f0d
+              rb0296dba
 \end{figure}
 As mentioned in \VRef{s:String}, a C string differs from other string types as it uses null termination rather than a length, which leads to explicit storage management;
 hence, most of its group operations are error prone and expensive.
+As mentioned in \VRef{s:String}, a C string uses null termination rather than a length, which leads to explicit storage management;
+hence, most of its group operations are error prone and expensive due to copying.
 Most high-level string libraries use a separate length field and specialized storage management to implement group operations.
 Interestingly, \CC strings retain null termination in case it is needed to interface with C library functions.
 …
 A \CFA string manages its length separately from the string, so there is no null (@'\0'@) terminating value at the end of a string value.
 Hence, a \CFA string cannot be passed to a C string manipulation function, such as @strcat@.
 Like C strings, characters in a @string@ are numbered from the left starting at 0, and in \CFA numbered from the right starting at -1.
 \begin{cquote}
 \sf
+Like C strings, characters in a @string@ are numbered from the left starting at 0 (because subscripting is zero-origin), and in \CFA numbered from the right starting at -1.
+\begin{cquote}
+\rm
 \begin{tabular}{@{}rrrrll@{}}
 \small\tt "a & \small\tt b & \small\tt c & \small\tt d & \small\tt e" \\
 …
 \end{tabular}
 \end{cquote}
+The following operations have been defined to manipulate an instance of type @string@.
+The discussion assumes the following declarations and assignment statements are executed.
+The following operations manipulate an instance of type @string@, where the discussion assumes the following declarations.
 \begin{cfa}
 #include @<string.hfa>@
 @string@ s = "abcde", name = "MIKE", digit = "0123456789";
 const char cs[] = "abc";
+const char cs[$\,$] = "abc";
 int i;
 \end{cfa}
 …
 \begin{tabular}{@{}l|ll|l@{}}
 \begin{cfa}
 //      string s = 5;
         s = 'x';
         s = "abc";
         s = cs;
         s = 45hh;
         s = 45h;
+string s;
+s = 'x';
+s = "abc";
+s = cs;
+s = 45hh;
+s = 45h;
 \end{cfa}
+&
 …
 The @len@ operation (short for @strlen@) returns the length of a C or \CFA string.
 For consistency, @strlen@ also works with \CFA strings.
+For compatibility, @strlen@ also works with \CFA strings.
 \begin{cquote}
 \setlength{\tabcolsep}{15pt}
 …
 The binary relational, @<@, @<=@, @>@, @>=@, and equality, @==@, @!=@, operators compare \CFA string values using lexicographical ordering, where longer strings are greater than shorter strings.
 In C, these operators compare the C string pointer not its value, which does not match programmer expectation.
 C strings use function @strcmp@, as the relational/equality operator for string values.
+C strings use function @strcmp@ to lexicographically compare the string value.
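The difference between comparing a string pointer and comparing a string value can be sketched in \CC, which shares the C behaviour for @char *@. This is a minimal illustration, not part of the \CFA library; the helper names @same_pointer@, @same_value@, and @less_value@ are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Pointer comparison: true only when both arguments refer to the same storage.
// This is what the C relational/equality operators do for char * operands.
bool same_pointer(const char* a, const char* b) { return a == b; }

// Value comparison for C strings: strcmp returns 0 when the characters match.
bool same_value(const char* a, const char* b) { return std::strcmp(a, b) == 0; }

// Value comparison for a high-level string type (std::string here),
// where operator< is lexicographic on the characters.
bool less_value(const std::string& a, const std::string& b) { return a < b; }
```

Two distinct arrays holding @"abc"@ are equal by value but not by pointer, which is the mismatch with programmer expectation noted above.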
 \subsection{Concatenation}
 The binary operators @+@ and @+=@ concatenate characters, C strings and \CFA strings, creating the sum of the characters.
 \begin{cquote}
+The binary operators @+@ and @+=@ concatenate C @char@, @char *@ and \CFA strings, creating the sum of the characters.
+\par\noindent
 \begin{tabular}{@{}l|l@{\hspace{15pt}}l|l@{\hspace{15pt}}l|l@{}}
 \begin{cfa}
 …
 \end{cfa}
 \end{tabular}
 \end{cquote}
+For these operations to meet programmer expectations, \CFA introduces two C non-backward compatibilities.
+Note, subtracting pointers or characters has a low-level use-case.
+\par\noindent
+However, including @<string.hfa>@ can result in ambiguous uses of the overloaded @+@ operator.\footnote{Combining multiple packages in any programming language can result in name clashes or ambiguities.}
+While subtracting characters or pointers has a low-level use-case
 \begin{cfa}
 ch - '0'    $\C[2in]{// find character offset}$
 cp1 - cp2;  $\C{// find pointer offset}\CRT$
 \end{cfa}
+However, there is no obvious use case for addition.
+cs - cs2;  $\C{// find pointer offset}\CRT$
+\end{cfa}
+addition is less obvious
 \begin{cfa}
 ch + 'b'    $\C[2in]{// add character values}$
+cp1 + 'a';  $\C{// move pointer cp1['a']}\CRT$
+\end{cfa}
+Adding character values or advancing a pointer with a character are unusual operations, and hence, unlikely to exist in C programs.
+There is a legitimate use case for arithmetic on @signed@/@unsigned@ characters (bytes), but these types are treated differently from @char@ in \CC and \CFA.
+However, for backwards compatibility reasons it is impossible to restrict or remove arithmetic on type @char@.
+Stealing these two cases for use with strings allows all combinations of concatenation among @char@, @char *@, and @string@.
+Note, stealing only occurs if a program includes @<string.hfa>@, resulting in ambiguities in existing C code where there is no way to disambiguate.
+\begin{cfa}
+ch = 'a' + 'b'; $\C[2in]{// LHS disambiguate, add character values}$
+s = 'a' + 'b'; $\C{// LHS disambiguate, concatenate characters}$
+sout | 'a' + 'b'; $\C{// ambiguous with <string.hfa>, add or concatenate?}$
+sout | (char)'a' + 'b'; $\C{// disambiguate}$
+sout | "a" + "b"; $\C{// disambiguate}\CRT$
+\end{cfa}
+Again, the need to disambiguate in this scenario is rare, as adding characters is uncommon.
+\CC cannot support this generality because it does not use the left-hand side of assignment in expression resolution.
+cs + 'a';  $\C{// move pointer cs['a']}\CRT$
+\end{cfa}
+There is a legitimate use case for arithmetic with @signed@/@unsigned@ characters (bytes), but these types are treated differently from @char@ in \CC and \CFA.
+However, backwards compatibility makes it impossible to restrict or remove addition on type @char@.
+Similarly, it is impossible to restrict or remove addition on type @char *@ because (unfortunately) it is subscripting: @cs + 'a'@ implies @cs['a']@ or @'a'[cs]@.
+Fortunately, the prior concatenation examples show that complex mixed-mode interactions among @char@, @char *@, and @string@ (variables are the same as constants) work correctly.
+The reason is that the \CFA type-system handles this kind of overloading well using the left-hand assignment-type and complex conversion costs.
+Hence, the type system correctly handles all uses of addition (explicit or implicit) for @char *@.
+\begin{cfa}
+printf( "%s %s %s %c %c\n", "abc", cs, cs + 3, cs['a'], 'a'[cs] );
+\end{cfa}
+Only @char@ addition can result in ambiguities, and only when there is no left-hand information.
+\begin{cfa}
+ch = ch + 'b'; $\C[2in]{// LHS disambiguate, add character values}$
+s = 'a' + 'b'; $\C{// LHS disambiguate, concatenate characters}$
+printf( "%c\n", @'a' + 'b'@ ); $\C[2in]{// no LHS information, ambiguous}$
+printf( "%c\n", @(return char)@('a' + 'b') ); $\C{// disambiguate with ascription cast}$
+\end{cfa}
+The ascription cast, @(return T)@, disambiguates by stating a (LHS) type to use during expression resolution (not a conversion).
+Fortunately, character addition without LHS information is rare in C/\CFA programs, so repurposing the operator @+@ for @string@ types is not a problem.
+Note, other programming languages that repurpose @+@ for concatenation could have a similar ambiguity issue.
+Interestingly, \CC cannot support this generality because it does not use the left-hand side of assignment in expression resolution.
 While it can special case some combinations:
 \begin{c++}
 …
 The binary operators @*@ and @*=@ repeat a string $N$ times.
 If $N = 0$, a zero length string, @""@, is returned.
+Like concatenation, multiplication is stolen for @char@;
+multiplication for pointers does not exist in C.
+\begin{cquote}
+\setlength{\tabcolsep}{15pt}
+\begin{tabular}{@{}l|l@{}}
+\begin{cfa}
+\begin{cquote}
+\setlength{\tabcolsep}{15pt}
+\begin{tabular}{@{}l|l@{}}
+\begin{cfa}
+s = 'x' * 0;
 s = 'x' * 3;
 s = "abc" * 3;
 …
+&
 \begin{cfa}
+""
 "xxx"
 "abcabcabc"
 …
 \end{tabular}
 \end{cquote}
+Like concatenation, there is a potential ambiguity with multiplication of characters;
+multiplication for pointers does not exist in C.
+\begin{cfa}
+ch = ch * 3; $\C[2in]{// LHS disambiguate, multiply character values}$
+s = 'a' * 3; $\C{// LHS disambiguate, concatenate characters}$
+printf( "%c\n", @'a' * 3@ ); $\C[2in]{// no LHS information, ambiguous}$
+printf( "%c\n", @(return char)@('a' * 3) ); $\C{// disambiguate with ascription cast}$
+\end{cfa}
+Fortunately, character multiplication without LHS information is even rarer than addition, so repurposing the operator @*@ for @string@ types is not a problem.
 …
 If the substring request extends beyond the beginning or end of the string, it is clipped (shortened) to the bounds of the string.
 If the substring request is completely outside of the original string, a null string is returned.
 The pattern form either returns the pattern string if the pattern matches or a null string if the pattern does not match.
+The pattern-form either returns the pattern string if the pattern matches or a null string if the pattern does not match.
 The usefulness of this mechanism is discussed next.
 The substring operation can also appear on the left side of an assignment and replaced by the string value on the right side.
 The length of the right string may be shorter, the same length, or longer than the length of left string.
+The substring operation can appear on the left side of assignment, where it defines a replacement substring.
+The length of the right string may be shorter, the same, or longer than the length of the left string.
 Hence, the left string may decrease, stay the same, or increase in length.
 \begin{cquote}
 …
 Extending the pattern to a regular expression is a possible extension.
 The replace operation returns a string in which all occurrences of a substring are replaced by another string.
+The replace operation extends substring to substitute all occurrences.
 \begin{cquote}
 \setlength{\tabcolsep}{15pt}
 …
 \subsection{Searching}
 The find operation returns the position of the first occurrence of a key string in a string.
 If the key does not appear in the string, the length of the string plus one is returned.
+The find operation returns the position of the first occurrence of a key in a string.
+If the key does not appear in the string, the length of the string is returned.
 \begin{cquote}
 \setlength{\tabcolsep}{15pt}
 …
 i = find( digit, '3' );
 i = find( digit, "45" );
+string x = "567";
+i = find( digit, x );
+i = find( digit, "abc" );
 \end{cfa}
+&
 …
 \end{cfa}
 \end{tabular}
+\end{cquote}
 The character-class operations indicate if a string is composed completely of a particular class of characters, \eg, alphabetic, numeric, vowels, \etc.
+\end{cfa}
+\end{tabular}
+\end{cquote}
+A character-class operation indicates if a string is composed completely of a particular class of characters, \eg, alphabetic, numeric, vowels, \etc.
 \begin{cquote}
 \setlength{\tabcolsep}{15pt}
 …
 \end{tabular}
 \end{cquote}
 @vowels@ defines a character class and function @include@ checks if all characters in the string are included in the class (compliance).
 The position of the last character plus 1 is return if the string is compliant or the position of the first non-compliant character.
+@vowels@ defines a character class and function @include@ checks if all characters in the string appear in the class (compliance).
+The position of the last character is returned if the string is compliant or the position of the first non-compliant character.
 There is no relationship between the order of characters in the two strings.
 Function @exclude@ is the reverse of @include@, checking if all characters in the string are excluded from the class (compliance).
 …
 \end{cquote}
 The test operation checks if each character in a string is in one of the C character classes.\footnote{It is part of the hereditary madness of C that these function take and return an \lstinline{int} rather than a \lstinline{char}.}
 \begin{cquote}
 \setlength{\tabcolsep}{15pt}
 \begin{tabular}{@{}l|l@{}}
 \begin{cfa}
 i = test( "1FeC34aB", @isxdigit@ );
 i = test( ".,;'!\"", @ispunct@ );
 i = test( "XXXx", @isupper@ );
+There are versions of @include@ and @exclude@, returning a position or string, taking a validation function, like one of the C character-class routines.\footnote{It is part of the heritage of C that these functions take and return an \lstinline{int} rather than a \lstinline{char}, which affects the function type.}
+\begin{cquote}
+\setlength{\tabcolsep}{15pt}
+\begin{tabular}{@{}l|l@{}}
+\begin{cfa}
+i = include( "1FeC34aB", @isxdigit@ );
+i = include( ".,;'!\"", @ispunct@ );
+i = include( "XXXx", @isupper@ );
 \end{cfa}
+&
 …
 \end{tabular}
 \end{cquote}
+The position of the last character plus 1 is returned if the string is compliant or the position of the first non-compliant character.
+Combining substring and search allows actions like trimming whitespace from the start of a line.
+\begin{cquote}
+\setlength{\tabcolsep}{15pt}
+\begin{tabular}{@{}l|l@{}}
+\begin{cfa}
+string line = "  \t  xxx yyy zzz";
+string trim = line( test( line, isspace ) );
+\end{cfa}
+&
+\begin{cfa}
+"xxx yyy zzz"
+\end{cfa}
+\end{tabular}
+\end{cquote}
+These operations apply the validation function to each character, and the function returns a boolean indicating a stopping condition.
+The position of the last character is returned if the string is compliant or the position of the first non-compliant character.
 The translate operation returns a string with each character transformed by one of the C character transformation functions.
 …
+\subsection{Returning N+1 on Search Failure}
+String search functions can fail to find the key in the target string.
+The failure must be returned as an alternate outcome, possibly an exception.
+Many string types use a return code to indicate the failure, such as @0@ or @-1@ (PL/I~\cite{PLI} returns @0@).
+\CFA adopts the approach used by the index-of function in APL~\cite{apl}, which returns the length of the target string plus 1 ($N+1$).
+When a search is performed to find the starting location for a substring operation, returning $N+1$ is arguably the best choice.
+For example, in extracting words from a string, it is necessary to scan from left to right over whitespace until the first alphabetic character is found.
+\begin{cfa}
+line = line( exclude( line, alpha ) );  // find start of word
+\end{cfa}
+If the line contains all whitespace and @exclude@ returns 0 or -1, the result of the substring is unclear.
+Most string types generate an error, or clip the starting value to 1, resulting in the entire whitespace string being selected.
+This behaviour leads to the awkward pattern:
+\begin{cfa}
+i = exclude( line, alpha );
+if ( i != -1 ) line = line( i );
+else line = "";
+\end{cfa}
+If @exclude@ returns $N+1$, the starting position for the substring operation is beyond the end of the string leaving a null string.
+This scenario is repeated when scanning off the word.
+\begin{cfa}
+word = line( 0, include( line, alpha ) - 1 );  // scan off word
+\end{cfa}
+If the entire line is composed of a word, the @include@ fails to find a non-alphabetic character, resulting in the same awkward pattern.
+\subsection{Returning N on Search Failure}
+Some of the prior string operations are composite, \eg string operations returning the longest substring of compliant characters (@include@) are built using a search followed by a substring of the appropriate text.
+However, string search can fail, which is reported as an alternate search outcome, possibly an exception.
+Many string libraries use a return code to indicate search failure, with a failure value of @0@ or @-1@ (PL/I~\cite{PLI} returns @0@).
+This semantics leads to the awkward pattern, which can appear many times in a string library or user code.
+\begin{cfa}
+i = exclude( s, alpha );
+if ( i != -1 ) return s( 0, i );
+else return "";
+\end{cfa}
+\CFA also adopts a return code but the failure value is taken from the index-of function in APL~\cite{apl}, which returns the length of the target string $N$ (or $N+1$ for 1 origin).
+This semantics allows many search and substring functions to be written without conditions, \eg:
+\begin{cfa}
+string include( const string & s, int (*f)( int ) ) { return @s( 0, include( s, f ) )@; }
+string exclude( const string & s, int (*f)( int ) ) { return @s( 0, exclude( s, f ) )@; }
+\end{cfa}
 In string systems with an $O(1)$ length operator, checking for failure is low cost.
 \begin{cfa}
 if ( include( line, alpha ) == len( line ) ) ... // not found, 0 origin
 \end{cfa}
+\VRef[Figure]{f:ExtractingWordsText} compares \CC and \CFA string code for extracting words from a line of text, repeatedly removing non-word text and then a word until the line is empty.
+The \CFA code is simpler solely because of the choice for indicating search failure.
+(It is possible to simplify the \CC version by concatenating a sentinel character at the end of the line so the call to @find_first_not_of@ does not fail.)
+\begin{figure}
+\begin{cquote}
+\setlength{\tabcolsep}{15pt}
+\begin{tabular}{@{}l|l@{}}
+\multicolumn{1}{c}{\textbf{\CC}} & \multicolumn{1}{c}{\textbf{\CFA}} \\
+\begin{cfa}
+for ( ;; ) {
+        string::size_type posn = line.find_first_of( alpha );
+  if ( posn == string::npos ) break;
+        line = line.substr( posn );
+        posn = line.find_first_not_of( alpha );
+        if ( posn != string::npos ) {
+                cout << line.substr( 0, posn ) << endl;
+                line = line.substr( posn );
+        } else {
+                cout << line << endl;
+                line = "";
+        }
+}
+\end{cfa}
+&
+\begin{cfa}
+for ( ;; ) {
+        size_t posn = exclude( line, alpha );
+  if ( posn == len( line ) ) break;
+        line = line( posn );
+        posn = include( line, alpha );
+        sout | line( 0, posn );
+        line = line( posn );
+}
+\end{cfa}
+\end{tabular}
+\end{cquote}
+\caption{Extracting Words from Line of Text}
+\label{f:ExtractingWordsText}
+\end{figure}
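The effect of the failure value can be sketched directly in \CC by wrapping the standard search functions so they return the string length instead of @npos@. This is an illustrative sketch only: @exclude_pos@, @include_pos@, and @words@ are hypothetical names introduced here, not part of the \CC standard library or the \CFA string library, and the mapping of @exclude@/@include@ onto @find_first_of@/@find_first_not_of@ is an assumption based on the figure.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Position of the first character of s that IS in cls, or s.size() if none
// (assumed analogue of the CFA exclude search in the figure).
std::size_t exclude_pos(const std::string& s, const std::string& cls) {
    std::size_t p = s.find_first_of(cls);
    return p == std::string::npos ? s.size() : p;
}

// Position of the first character of s that is NOT in cls, or s.size() if none
// (assumed analogue of the CFA include search in the figure).
std::size_t include_pos(const std::string& s, const std::string& cls) {
    std::size_t p = s.find_first_not_of(cls);
    return p == std::string::npos ? s.size() : p;
}

// Extract the words of a line; no conditional is needed after the second
// search because a failure value of size() is a valid substring bound.
std::vector<std::string> words(std::string line, const std::string& alpha) {
    std::vector<std::string> result;
    for (;;) {
        std::size_t posn = exclude_pos(line, alpha);   // skip non-word text
        if (posn == line.size()) break;                // no word remains
        line = line.substr(posn);
        posn = include_pos(line, alpha);               // end of the word
        result.push_back(line.substr(0, posn));
        line = line.substr(posn);                      // substr(size()) is ""
    }
    return result;
}
```

With these wrappers the loop body has the same shape as the \CFA column of the figure, which suggests the simplification comes from the failure value rather than from any other language feature.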
 …
 \begin{cfa}
 char s[32];   // string s;
+strlen( s );
+strnlen( s, 3 );
+strcmp( s, "abc" );
+strncmp( s, "abc", 3 );
 strcpy( s, "abc" );
 strncpy( s, "abcdef", 3 );
 …
-\subsection{Parameter Passing}
-A substring is treated as a pointer into the base (substringed) string rather than creating a copy of the subtext.
-Hence, if the referenced item is changed, then the pointer sees the change.
-Pointers to the result value of a substring operation are defined to always start at the same location in their base string as long as that starting location exists, independent of changes to themselves or the base string.
-However, if the base string value changes, this may affect the values of one or more of the substrings to that base string.
-If the base string value shortens so that its end is before the starting location of a substring, resulting in the substring starting location disappearing, the substring becomes a null string located at the end of the base string.
-\VRef[Figure]{f:ParameterPassing} shows passing the results of substring operations by reference and by value to a subprogram.
-Notice the side-effects to other reference parameters as one is modified.
-\begin{figure}
-\begin{cfa}
-// x, a, b, c, & d are substring results passed by reference
-// e is a substring result passed by value
-void test(string &x, string &a, string &b, string &c, string &d, string e) {
-                                                                        $\C{//   x                                a               b               c               d               e}$
-        a( 1, 2 ) = "aaa";                              $\C{// aaaxxxxxxxxxxx   aaax    axx             xxxxx   xxxxx   xxxxx}$
-        b( 2, 12 ) = "bbb";                             $\C{// aaabbbxxxxxxxxx  aaab    abbb    bbxxx   xxxxx   xxxxx}$
-        c( 4, 5 ) = "ccc";                              $\C{// aaabbbxcccxxxxxx aaab    abbb    bbxccc  ccxxx   xxxxx}$
-        c = "yyy";                                              $\C{// aaabyyyxxxxxx    aaab    abyy    yyy             xxxxx   xxxxx}$
-        d( 1, 3 ) = "ddd";                              $\C{// aaabyyyxdddxx    aaab    abyy    yyy             dddxx   xxxxx}$
-        e( 1, 3 ) = "eee";                              $\C{// aaabyyyxdddxx    aaab    abyy    yyy             dddxx   eeexx}$
-        x = e;                                                  $\C{// eeexx                    eeex    exx             x                               eeexx}$
+}
-int main() {
-        string x = "xxxxxxxxxxxxx";
-        test( x, x(1,3), x(3,3), x(5,5), x(9,5), x(9,5) );
+}
-\end{cfa}
-\caption{Parameter Passing}
-\label{f:ParameterPassing}
-\end{figure}
 \subsection{I/O Operators}
 The ability to read and print strings is as essential as for any other type.
 The goal for character I/O is to work with groups rather than individual characters.
+The ability to input and output strings is as essential as for any other type.
+The goal for character I/O is to also work with groups rather than individual characters.
 A comparison with \CC string I/O is presented as a counterpoint to \CFA string I/O.
 …
 The \CFA input/output operator @|@ is defined on type @string@.
 \CFA output for @char@, @char *@, and @string@ are the similar.
+\CFA output for @char@, @char *@, and @string@ are similar.
 The \CFA manipulators are @bin@, @oct@, @hex@, @wd@, and its associated width control and @left@.
 \begin{cquote}
 …
 \end{cquote}
 \CC input matching for @char@, @char *@, and @string@ are the similar, where \emph{all} input characters are read from the current point in the input stream to the end of the type size, format width, whitespace, end of line (@'\n'@), or end of file.
+\CC input matching for @char@, @char *@, and @string@ are similar, where \emph{all} input characters are read from the current point in the input stream to the end of the type size, format width, whitespace, end of line (@'\n'@), or end of file.
 The \CC manipulator is @setw@ to restrict the size.
 Reading into a @char@ is safe as the size is 1, @char *@ is unsafe without using @setw@ to constrain the length (which includes @'\0'@), and @string@ is safe as it grows dynamically as characters are read.
 …
 string s;
 cin >> ch >> setw( 5 ) >> c  >> s;
+abcde   fg
+@abcde   fg@
 \end{c++}
+&
 …
 \end{tabular}
 \end{cquote}
 Input text can be gulped from the current point to an arbitrary delimiter character using @getline@, which reads whitespace.
 The \CFA philosophy for input is that for every constant type in C, these constants should be usable as input.
+Input text can be gulped, including whitespace, from the current point to an arbitrary delimiter character using @getline@.
+The \CFA philosophy for input is that, for every constant type in C, these constants should be usable as input.
 For example, the complex constant @3.5+4.1i@ can appear as input to a complex variable.
 \CFA input matching for @char@, @char *@, and @string@ are similar.
 C-strings may only be read with a width field, which should match the string size.
 Certain input manipulators support a scanset, which is a simple regular expression from @printf@.
+The \CFA manipulators for these types are @wdi@\footnote{Due to an overloading issue in the type-resolver, the input width name must be temporarily different from the output, \lstinline{wdi} versus \lstinline{wd}.},
+and its associated width control and @left@, @quote@, @incl@, @excl@, and @getline@.
+The \CFA manipulators for these types are @wdi@,\footnote{Due to an overloading issue in the type-resolver, the input width name must be temporarily different from the output, \lstinline{wdi} versus \lstinline{wd}.} and its associated width control and @left@, @quote@, @incl@, @excl@, and @getline@.
 \begin{cquote}
 \setlength{\tabcolsep}{10pt}
 …
 string s;
 sin | ch | wdi( 5, c ) | s;
+abcde fg
+@abcde fg@
 sin | quote( ch ) | quote( wdi( sizeof(c), c ) ) | quote( s, '[', ']' ) | nl;
+$'a' "bcde" [fg]$
+@$'a' "bcde" [fg]$@
 sin | incl( "a-zA-Z0-9 ?!&\n", s ) | nl;
+x?&000xyz TOM !.
+@x?&000xyz TOM !.@
 sin | excl( "a-zA-Z0-9 ?!&\n", s );
+<>{}{}STOP
+@<>{}{}STOP@
 \end{c++}
+&
 …
 'a' "bcde" [fg]
 'a' "bcde" [fg]
+'a' "bcde" "fg"
+'a' "bcde" "fg"
 "x?&000xyz TOM !"
 …
 \end{tabular}
 \end{cquote}
+Note, quoted strings can be read so that input matches the form of program strings.
+The @nl@ at the end of an input ignores the rest of the line.
 …
 While \VRef[Figure]{f:StrApiCompare} emphasizes cross-language similarities, it elides many specific operational differences.
+For example, the @replace@ function selects a substring in the target and substitutes it with the source string, which can be smaller or larger than the substring.
+\CC performs the modification on the mutable receiver object
+\begin{cfa}
+For example, the \CC @replace@ function selects a substring in the target and substitutes it with the source string, which can be smaller or larger than the substring.
+\CC modifies the mutable receiver object, replacing by position (zero origin) and length.
+\begin{cquote}
+\setlength{\tabcolsep}{15pt}
+\begin{tabular}{@{}l|l@{}}
+\begin{c++}
 string s1 = "abcde";
+s1.replace( 2, 3, "xy" );  $\C[2.25in]{// replace by position (zero origin) and length, mutable}\CRT$
+cout << s1 << endl;
+$\texttt{\small abxy}$
+\end{cfa}
+while Java allocates and returns a new string with the result, leaving the receiver unmodified.
+s1.replace( 2, 3, "xy" );
+\end{c++}
+&
+\begin{c++}
+"abxy"
+\end{c++}
+\end{tabular}
+\end{cquote}
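The two replacement styles, by position and by text, can both be expressed with \CC @std::string@. The following is a hedged sketch: @replace_at@ and @replace_text@ are hypothetical helper names for this example, each taking its argument by value so the caller's string is left unmodified. Unlike Java's @String.replace@, the by-text helper here substitutes only the first occurrence.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replace by position (zero origin) and length.
// std::string::replace mutates its receiver, so work on a by-value copy.
std::string replace_at(std::string s, std::size_t pos, std::size_t len,
                       const std::string& with) {
    s.replace(pos, len, with);
    return s;
}

// Replace by text: substitute the first occurrence of `from`, if any.
// When `from` does not occur, the string is returned unchanged.
std::string replace_text(std::string s, const std::string& from,
                         const std::string& with) {
    std::size_t pos = s.find(from);
    if (pos != std::string::npos) s.replace(pos, from.size(), with);
    return s;
}
```

Both calls @replace_at( "abcde", 2, 3, "xy" )@ and @replace_text( "abcde", "cde", "xy" )@ yield @"abxy"@, showing that replace by text is a search followed by replace by position.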
+Java cannot modify the receiver (immutable strings) so it returns a new string, replacing by text.
 \label{p:JavaReplace}
+\begin{cquote}
+\setlength{\tabcolsep}{15pt}
+\begin{tabular}{@{}l|l@{}}
 \begin{java}
 String s = "abcde";
+String r = s.replace( "cde", "xy" );  $\C[2.25in]{// replace by text, immutable}$
+System.out.println( s + ' ' + r );
+$\texttt{\small abcde abxy}$
+String r = s.replace( "cde", "xy" );
 \end{java}
+% Generally, Java's @String@ type is immutable.
+Java provides a @StringBuffer@ near-analog that is mutable.
+&
+\begin{java}
+"abxy"
+\end{java}
+\end{tabular}
+\end{cquote}
+Java also provides a mutable @StringBuffer@, replacing by position (zero origin) and length.
+\begin{cquote}
+\setlength{\tabcolsep}{15pt}
+\begin{tabular}{@{}l|l@{}}
 \begin{java}
 StringBuffer sb = new StringBuffer( "abcde" );
+sb.replace( 2, 5, "xy" );  $\C[2.25in]{// replace by position, mutable}\CRT$
+System.out.println( sb );
+$\texttt{\small abxy}$
+sb.replace( 2, 5, "xy" );
 \end{java}
+However, there are significant differences;
+\eg, @StringBuffer@'s @substring@ function returns a @String@ copy that is immutable.
+Finally, the operations between these type are asymmetric, \eg @String@ has @replace@ by text but not replace by position and vice versa for @StringBuffer@.
+More significant operational differences relate to storage management, often appearing through assignment (@target = source@), and are summarized in \VRef[Figure]{f:StrSemanticCompare}.
+% It calls out the consequences of each language taking a different approach on ``internal'' storage management.
+&
+\begin{java}
+"abxy"
+\end{java}
+\end{tabular}
+\end{cquote}
+However, there are anomalies.
+@StringBuffer@'s @substring@ returns a @String@ copy that is immutable rather than modifying the receiver.
+As well, the operations are asymmetric, \eg @String@ has @replace@ by text but not replace by position and vice versa for @StringBuffer@.
+More significant operational differences relate to storage management, often appearing through assignment (@target = source@), and are summarized in \VRef[Figure]{f:StrSemanticCompare}, defining properties: type abstraction, state, symmetry, and referent.
 The following discussion justifies the figure's yes/no entries per language.
 …
                                         & Laxed: The target's type is anything string-like; it may have a different status concerning ownership.
                                                                 & Strict: The target's type is the same as the source; both strings are equivalent peers concerning ownership.
                                                                                         & --            & no    & yes   & yes \\
+                                                                                        & N/A           & no    & yes   & yes \\
 \hline
 Referent
 …
 char s[$\,$] = "abcde";
 \end{cfa}
 creates a second-class fixed-sized string-variable, as it can only be used in its lexical context;
 it cannot be passed by value to string operations or user functions as C array's cannot be copied because there is no string-length information passed to the function.
+creates a second-class fixed-sized string-variable, as it can only be used in its lexical context, \ie it cannot be passed by value to string operations or user functions.
+The reason is that there is no implicit mechanism to pass the string-length information to the function.
 Therefore, only pointers to strings are first-class, and discussed further.
 \begin{cfa}
 …
 The lax symmetry reflects how the validity of @s1@ depends on the content and lifetime of @s@.
 It is common practice in \CC to use the @s1@-style pass by reference, with the understanding that the callee only uses the referenced string for the duration of the call, \ie no side-effect using the parameter.
 So, when the called function is a constructor, it is typical to use an @s2@-style copy-initialization to string-object-typed member.
+So, when the called function is a constructor, it is typical to use an @s2@-style copy-initialization.
 Exceptions to this pattern are possible, but require the programmer to assure safety where the type system does not.
 The @s3@ initialization is constrained to copy the substring because @c_str@ always provides a null-terminated character, which may be different from the source string.
 …
 \input{sharing1.tex}
 Here, the aliasing (@`share@) causes partial changes (subscripting) to flow in both directions.
+(In the following examples, watch how @s1@ and @s1a@ change together, and @s2@ is independent.)
 \input{sharing2.tex}
 Similarly for complete changes.
 \input{sharing3.tex}
 Because string assignment copies the value, RHS aliasing is irrelevant.
 Hence, aliasing of the LHS is unaffected.
 …
 The following rules explain aliasing substrings that flow in the opposite direction, large to small.
 Growth and shrinkage are natural extensions, as for the text-editor example mentioned earlier, where an empty substring is as real real as an empty string.
+Growth and shrinkage are natural extensions, as for the text-editor example mentioned earlier, where an empty substring is as real as an empty string.
 \input{sharing8.tex}
 …
 %\input{sharing-demo.tex}
+\VRef[Figure]{f:ParameterPassing} shows similar relationships when passing the results of substring operations by reference and by value to a subprogram.
+Again, notice the side-effects to other reference parameters as one is modified.
+\begin{figure}
+\begin{cfa}
+// x, a, b, c, & d are substring results passed by reference
+// e is a substring result passed by value
+void test( string & x, string & a, string & b, string & c, string & d, string e ) {
+\end{cfa}
+\begin{cquote}
+\setlength{\tabcolsep}{2pt}
+\begin{tabular}{@{}ll@{}}
+\begin{cfa}
+        a( 0, 2 ) = "aaa";
+        b( 1, 12 ) = "bbb";
+        c( 4, 5 ) = "ccc";
+        c = "yyy";
+        d( 0, 3 ) = "ddd";
+        e( 0, 3 ) = "eee";
+        x = e;
+}
+\end{cfa}
+&
+\sf
+\setlength{\extrarowheight}{-0.5pt}
+\begin{tabular}{@{}llllll@{}}
+x                                       & a             & b             & c             & d             & e             \\
+@"aaaxxxxxxxxx"@        & @"aaax"@      & @"xxx"@       & @"xxxxx"@     & @"xxx"@       & @"xxx"@       \\
+@"aaaxbbbxxxxxx"@       & @"aaax"@      & @"xbbb"@      & @"xxxx"@      & @"xxx"@       & @"xxx"@       \\
+@"aaaxbbbxxxcccxx"@     & @"aaax"@      & @"xbbb"@      & @"xxxccc"@& @"cccxx"@ & @"xxx"@       \\
+@"aaaxbbbyyyxx"@        & @"aaax"@      & @"aaab"@      & @"yyy"@       & @"xx"@        & @"xxx"@       \\
+@"aaaxbbbyyyddd"@       & @"aaax"@      & @"xbbb"@      & @"yyy"@       & @"ddd"@       & @"xxx"@       \\
+@"aaaxbbbyyyddd"@       & @"aaax"@      & @"xbbb"@      & @"yyy"@       & @"ddd"@       & @"eee"@       \\
+@"eee"@                         & @""@  & @""@  & @""@          & @"eee"@ \\
+ & \\
+\end{tabular}
+\end{tabular}
+\end{cquote}
+\begin{cfa}
+int main() {
+        string x = "xxxxxxxxxxx";
+        test( x, x(0, 3), x(2, 3), x(4, 5), x(8, 5), x(8, 5) );
+}
+\end{cfa}
+\caption{Parameter Passing}
+\label{f:ParameterPassing}
+\end{figure}
 …
 The heap header and text buffer define a sharing context.
 Normally, one global sharing context is appropriate for an entire program;
 concurrent exceptions are discussed in \VRef{s:AvoidingImplicitSharing}.
+concurrent exceptions are discussed in \VRef{s:ControllingImplicitSharing}.
 A string is a handle into the buffer and linked into a list.
 The list is doubly linked for $O(1)$ insertion and removal at any location.
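The $O(1)$ claim can be illustrated with a minimal C sketch of a circular doubly-linked list of handles.
The type @handle@, its field names, and the sentinel node are assumptions made for illustration; they are not taken from the \CFA implementation.

```c
#include <stddef.h>

// Hypothetical handle layout; the real CFA string handle may differ.
typedef struct handle {
    const char *text;            // start of this string's characters in the shared buffer
    size_t len;                  // length is stored separately, no null terminator
    struct handle *prev, *next;  // links to neighbouring handles
} handle;

// Initialize an empty circular list; the sentinel points at itself.
void list_init(handle *sentinel) {
    sentinel->text = NULL;
    sentinel->len = 0;
    sentinel->prev = sentinel->next = sentinel;
}

// Insert h immediately after pos: four pointer updates, independent of list size.
void insert_after(handle *pos, handle *h) {
    h->prev = pos;
    h->next = pos->next;
    pos->next->prev = h;
    pos->next = h;
}

// Unlink h from wherever it sits: two pointer updates, no traversal.
void remove_handle(handle *h) {
    h->prev->next = h->next;
    h->next->prev = h->prev;
    h->prev = h->next = h;
}
```

Because every handle carries both links, neither operation needs to search for a predecessor, which is what a singly-linked list would require for removal.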
 …
+\subsection{Avoiding implicit sharing}
+\label{s:AvoidingImplicitSharing}
+There are tradeoffs associated with the copy-on-write mechanism.
+Several quantitative matters are detailed in \VRef{s:PerformanceAssessment} and the qualitative issue of multi-threaded support is introduced here.
+The \CFA string library provides a switch to disable threads allocating from the string buffer, when string sharing is unsafe.
+When toggled, string management is moved to the storage allocator, specifically @malloc@/@free@, where the storage allocator is assumed to be thread-safe.
+In detail, string sharing has inter-linked string handles, so any participant managing one string is also managing, directly, the neighbouring strings, and from there, a data structure of the ``set of all strings.''
+This string structure is intended for sequential access.
+Hence, multiple threads using shared strings need to avoid modifying (concurrently) an instance of this structure (like Java immutable strings).
+A positive consequence of this approach is that independent threads can use the sharing buffer without locking overhead.
+When the string library is running with sharing disabled, it runs without implicit thread-safety challenges, which is the same as the \CC STL, and with performance goals similar to the STL.
+Running with sharing disabled can be thought of as a STL-emulation mode.
+Hence, concurrent users of string objects must still bring their own mutual exclusion, but the string library does not add any cross thread uses that are not apparent in a user's code.
+The \CFA string library provides the type @string_sharectx@ to control an ambient sharing context for a current thread.
+It allows two adjustments: to opt out of sharing entirely or to begin sharing within a private context.
+Either way, the chosen mode applies only to the current thread, for the duration of the lifetime of the created  @string_sharectx@ object, up to being suspended by child lifetimes of different contexts.
+\VRef[Figure]{fig:string-sharectx} illustrates its behaviour.
+Executing the example does not produce an interesting outcome.
+But the comments indicate when the logical copy operation runs with
+\begin{description}
+    \item[share:] the copy being deferred, as described through the rest of this section (fast), or
+    \item[copy:] the copy performed eagerly (slow).
+\end{description}
+Only eager copies can cross @string_sharectx@ boundaries.
+The intended use is with stack-managed lifetimes, in which the established context lasts until the current function returns, and affects all functions called that do not create their own contexts.
+In this example, the single-letter functions are called in alphabetic order.
+The functions @a@, @b@ and @g@ share string character ranges with each other, because they occupy a common sharing-enabled context.
+The function @e@ shares within itself (because it is in a sharing-enabled context), but not with the rest of the program (because its context is not occupied by any of the rest of the program).
+The functions @c@, @d@ and @f@ never share anything, because they are in a sharing-disabled context.
+\subsection{Controlling implicit sharing}
+\label{s:ControllingImplicitSharing}
+There are tradeoffs associated with sharing and its implicit copy-on-write mechanism.
+Several qualitative matters are detailed in \VRef{s:PerformanceAssessment}.
+In detail, string sharing has inter-linked string handles, so managing one string is also managing the neighbouring strings, and from there, a data structure of the ``set of all strings.''
+Therefore, it is useful to toggle this capability on or off when it is not providing any application benefit.
 \begin{figure}
 …
 \end{figure}
+The \CFA string library provides the type @string_sharectx@ to control an ambient sharing context.
+It allows two adjustments: to opt out of sharing entirely or to begin sharing within a private context.
+Running with sharing disabled can be thought of as a \CC STL-emulation mode, where each string is dynamically allocated.
+The chosen mode applies for the duration of the lifetime of the created  @string_sharectx@ object, up to being suspended by child lifetimes of different contexts.
+\VRef[Figure]{fig:string-sharectx} illustrates this behaviour by showing the stack frames of a program in execution.
+In this example, the single-letter functions are called in alphabetic order.
+The functions @a@, @b@ and @g@ share string character ranges with each other, because they occupy a common sharing-enabled context.
+The function @e@ shares within itself (because it is in a sharing-enabled context), but not with the rest of the program (because its context is not occupied by any of the rest of the program).
+The functions @c@, @d@ and @f@ never share anything, because they are in a sharing-disabled context.
+Executing the example does not produce an interesting outcome, but the comments in the picture indicate when the logical copy operation runs with
+\begin{description}
+    \item[share:] the copy being deferred, as described through the rest of this section (fast), or
+    \item[copy:] the copy performed eagerly (slow).
+\end{description}
+Only eager copies can cross @string_sharectx@ boundaries.
+The intended use is with stack-managed lifetimes, in which the established context lasts until the current function returns, and affects all functions called that do not create their own contexts.
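The nesting rule can be modelled with a small C sketch of a context stack.
This is a hypothetical model of the behaviour described above, not the \CFA implementation: the names @share_ctx@, @ctx_push@, @ctx_pop@ and @copy_is_deferred@ are invented for illustration, and the model assumes a logical copy is deferred only when the source text belongs to the currently active, sharing-enabled context.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical model of a sharing context; names are illustrative only.
typedef struct share_ctx {
    bool sharing;              // false models a context that opted out of sharing
    struct share_ctx *parent;  // context suspended while this one is active
} share_ctx;

static share_ctx global_ctx = { true, NULL };  // program-wide default context
static share_ctx *active = &global_ctx;

// Entering a scope that declares a context: suspend the current one.
void ctx_push(share_ctx *c, bool sharing) {
    c->sharing = sharing;
    c->parent = active;
    active = c;
}

// Leaving that scope: resume the suspended parent context.
void ctx_pop(share_ctx *c) {
    active = c->parent;
}

share_ctx *ctx_active(void) {
    return active;
}

// A logical copy is deferred (shared) only within one sharing-enabled context;
// a copy whose source lives in another context must be performed eagerly.
bool copy_is_deferred(const share_ctx *src_owner) {
    return active->sharing && src_owner == active;
}
```

In this model, pushing a sharing-disabled context corresponds to opting out of sharing, pushing a sharing-enabled one corresponds to starting a private context, and a called function that pushes nothing inherits the caller's context.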
 [ TODO: true up with ``is thread local'' (implement that and expand this discussion to give a concurrent example, or adjust this wording) ]
+\subsection{Sharing and threading}
+The \CFA string library provides no thread safety, the same as the \CC string type, and it has similar performance goals.
+Threads can create their own string buffers and avoid passing these strings to other threads, or require that shared strings be immutable, as concurrent reading is safe.
+A positive consequence of this approach is that independent threads can use the sharing buffer without locking overhead.
+When string sharing amongst threads is required, program-wide string-management can be toggled to non-sharing using @malloc@/@free@, where the storage allocator is assumed to be thread-safe.
+Finally, concurrent users of string objects can provide their own mutual exclusion.
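As a sketch of user-supplied mutual exclusion, the following C fragment pairs a text buffer with a POSIX mutex, so every modification is serialized by the caller's own lock.
The type @guarded_text@ and its functions are hypothetical, and a fixed-size character array stands in for a string object; the same pattern would wrap any string that several threads modify.

```c
#include <pthread.h>
#include <string.h>

// Hypothetical wrapper: the lock travels with the data it protects.
typedef struct {
    pthread_mutex_t lock;
    char text[64];   // stand-in for a string object
    size_t len;
} guarded_text;

void guarded_init(guarded_text *g) {
    pthread_mutex_init(&g->lock, NULL);
    g->len = 0;
    g->text[0] = '\0';
}

// Append s under the lock; returns 1 on success, 0 if s does not fit.
int guarded_append(guarded_text *g, const char *s) {
    size_t n = strlen(s);
    int ok = 0;
    pthread_mutex_lock(&g->lock);
    if (g->len + n < sizeof g->text) {
        memcpy(g->text + g->len, s, n + 1);  // copies the terminator too
        g->len += n;
        ok = 1;
    }
    pthread_mutex_unlock(&g->lock);
    return ok;
}
```

The library adds no locking of its own, so the cost of the mutex is paid only by programs that actually share mutable strings across threads.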
 \subsection{Future work}
+To discuss: Unicode
+To discuss: Small-string optimization
+Implementing the small-string optimization is straightforward, as a string header contains a pointer to the string text in the buffer.
+This pointer could be marked with a flag and contain a small string.
+However, there is now a conditional check required on the fast-path to switch between small and large string operations.
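The extra fast-path check can be seen in a minimal C sketch.
For portability, this sketch keeps the flag in a separate field rather than marking the pointer itself, which is a simplification of the idea above; the type @str_header@ and its functions are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { SMALL_CAP = sizeof(char *) };  // characters that fit in the pointer's storage

// Hypothetical header: the pointer's storage doubles as space for a small string.
typedef struct {
    union {
        const char *ptr;        // large string: text lives in the shared buffer
        char small[SMALL_CAP];  // small string: text stored in place of the pointer
    } u;
    size_t len;
    bool is_small;              // assumption: separate flag instead of a marked pointer
} str_header;

// Record a string, storing it inline when it is short enough.
void header_set(str_header *h, const char *text, size_t len) {
    h->len = len;
    h->is_small = len <= SMALL_CAP;
    if (h->is_small) memcpy(h->u.small, text, len);
    else h->u.ptr = text;
}

// Every access to the characters now pays for this branch.
const char *header_text(const str_header *h) {
    return h->is_small ? h->u.small : h->u.ptr;
}
```

A small string needs no buffer space and no participation in sharing, but @header_text@ shows the cost: a test on each access that the current design does not have.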
+It might be possible to pack 16- or 32-bit Unicode characters within the same string buffer as 8-bit characters.
+Again, locations for identification flags must be found and checked along the fast path to select the correct actions.
+Handling UTF-8 (variable length) is more problematic because simple pointer arithmetic cannot be used to stride through the variable-length characters.
+Trying to use a secondary array of fixed-sized pointers/offsets to the characters is possible, but raises the question of storage management for the UTF-8 characters themselves.


Changeset b0296dba for doc/theses/mike_brooks_MMath/string.tex
