-              rc4f8c4bf
+              r602ac05
 The maximum storage for a \CFA @string@ value is @size_t@ characters, which is $2^{32}$ or $2^{64}$ on 32-bit and 64-bit platforms, respectively.
 A \CFA string manages its length separately from the string, so there is no null (@'\0'@) terminating value at the end of a string value.
 Hence, a \CFA string cannot be passed to a C string manipulation routine, such as @strcat@.
+Hence, a \CFA string cannot be passed to a C string manipulation function, such as @strcat@.
 Like C strings, characters in a @string@ are numbered from the left starting at 0, and in \CFA numbered from the right starting at -1.
 \begin{cquote}
 …
 s = (string){ 5.5 };                            $\C{// converts double to string}$
 \end{cfa}
+Conversions from @string@ to @char *@, attempt to be safe:
+Conversions from @string@ to @char *@ attempt to be safe:
 either by requiring the maximum length of the @char *@ storage (@strncpy@) or allocating the @char *@ storage for the string characters (ownership), meaning the programmer must free the storage.
 Note, a C string is always null terminated, implying a minimum size of 1 character.
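The bounded form of this conversion can be sketched in plain C.
The helper @to_cstring@ below is hypothetical (it is not part of the \CFA library): it copies length-managed characters into caller-supplied @char *@ storage, clips to the stated capacity, and always writes the terminating @'\0'@, which is why the minimum capacity is 1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Copies at most cap-1 of the len characters in text (which need not be
 * null terminated) into dst, then null terminates dst.
 * Returns the number of characters copied, excluding the '\0'. */
size_t to_cstring( char * dst, size_t cap, const char * text, size_t len ) {
    assert( dst != NULL && text != NULL );
    assert( cap >= 1 );                         /* room for at least the '\0' */
    size_t n = len < cap - 1 ? len : cap - 1;   /* clip to the available storage */
    memcpy( dst, text, n );
    dst[n] = '\0';
    return n;
}
```

Unlike a bare @strncpy@, this sketch never leaves the destination unterminated when the source is longer than the buffer.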
 …
 \subsection{Comparison Operators}
+The binary relational, @<@, @<=@, @>@, @>=@, and equality, @==@, @!=@, operators compare strings using lexicographical ordering, where longer strings are greater than shorter strings.
+C strings use function @strcmp@, as the relational/equality operators compare C string pointers not their values, which does not match programmer expectation.
+The binary relational, @<@, @<=@, @>@, @>=@, and equality, @==@, @!=@, operators compare \CFA string values using lexicographical ordering, where longer strings are greater than shorter strings.
+In C, these operators compare the C string pointer not its value, which does not match programmer expectation.
+C strings use function @strcmp@ as the relational/equality operator for string values.
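The C side of this contrast can be shown with a short sketch.
The wrapper names @same_text@ and @text_less@ are hypothetical and only illustrate that value comparison in C must go through @strcmp@, whereas @==@ and @<@ on @char *@ compare addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Value equality for C strings: compares characters, not addresses. */
bool same_text( const char * a, const char * b ) {
    assert( a != NULL && b != NULL );
    return strcmp( a, b ) == 0;
}

/* Lexicographic "less than" for C strings, as defined by strcmp. */
bool text_less( const char * a, const char * b ) {
    assert( a != NULL && b != NULL );
    return strcmp( a, b ) < 0;
}
```

Two distinct arrays that both hold @"abc"@ live at different addresses, so a pointer comparison reports them unequal even though @same_text@ reports them equal.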
 …
 The binary operators @+@ and @+=@ concatenate characters, C strings and \CFA strings, creating the sum of the characters.
 \begin{cquote}
 \begin{tabular}{@{}l|l@{\hspace{25pt}}l|l@{\hspace{25pt}}l|l@{}}
+\begin{tabular}{@{}l|l@{\hspace{15pt}}l|l@{\hspace{15pt}}l|l@{}}
 \begin{cfa}
 s = "";
 …
 \end{cquote}
 For these operations to meet programmer expectations, \CFA introduces two backward incompatibilities with C.
 Note, subtracting pointers or characters has a low-level use case.
+Note, subtracting pointers or characters has a low-level use-case.
 \begin{cfa}
 ch - '0'    $\C[2in]{// find character offset}$
 …
 \end{cfa}
 Adding character values or advancing a pointer with a character are unusual operations, and hence, unlikely to exist in C programs.
+There is a legitimate use case for arithmetic on @signed@/@unsigned@ characters (bytes), but these types are treated differently from @char@ in \CC and \CFA.
+However, for backwards compatibility reasons it is impossible to restrict or remove arithmetic on type @char@.
 Stealing these two cases for use with strings allows all combinations of concatenation among @char@, @char *@, and @string@.
 Note, stealing only occurs if a program includes @string.hfa@, resulting is ambiguities in existing C code where there is no way to disambiguate.
+Note, stealing only occurs if a program includes @<string.hfa>@, resulting in ambiguities in existing C code where there is no way to disambiguate.
 \begin{cfa}
 ch = 'a' + 'b'; $\C[2in]{// LHS disambiguate, add character values}$
 s = 'a' + 'b'; $\C{// LHS disambiguate, concatenate characters}$
 sout | 'a' + 'b'; $\C{// ambiguous with string.hfa, add or concatenate?}$
+sout | 'a' + 'b'; $\C{// ambiguous with <string.hfa>, add or concatenate?}$
 sout | (char)'a' + 'b'; $\C{// disambiguate}$
 sout | "a" + "b"; $\C{// disambiguate}\CRT$
 \end{cfa}
 Again, the possibility of this scenario is extremely rare, as adding characters is meaningless.
+Again, the need to disambiguate in this scenario is rare, as adding characters is uncommon.
 \CC cannot support this generality because it does not use the left-hand side of assignment in expression resolution.
 …
 \setlength{\tabcolsep}{10pt}
 \begin{tabular}{@{}l|ll|l@{}}
+\multicolumn{2}{c}{\textbf{length}} & \multicolumn{2}{c}{\textbf{pattern}} \\
 \begin{cfa}
 s = name( 2, 2 );
 …
 "KE"
 "IK"
 "KE", clipped length to 2
 "", beyond string clipped to null
+"KE", clip length to 2
+"", beyond string clip to null
 "K"
 "IKE", to end of string
 …
 If the substring request is completely outside of the original string, a null string is returned.
 The pattern form returns either the pattern string if the pattern matches or a null string if the pattern does not match.
 This mechanism is discussed next.
+The usefulness of this mechanism is discussed next.
 The substring operation can also appear on the left side of an assignment, where it is replaced by the string value on the right side.
 …
 \end{tabular}
 \end{cquote}
 Pattern matching is useful on the left-hand side of the assignment.
+Now pattern matching is useful on the left-hand side of assignment.
 \begin{cquote}
 \setlength{\tabcolsep}{15pt}
 …
 The find operation returns the position of the first occurrence of a key string in a string.
 If the key does not appear in the current string, the length of the current string plus one is returned.
+If the key does not appear in the string, the length of the string plus one is returned.
 \begin{cquote}
 \setlength{\tabcolsep}{15pt}
 …
 \begin{cfa}
 i = find( digit, '3' );
 i = "45" ^ digit; // python style "45" in digit
+i = find( digit, "45" );
 string x = "567";
 i = find( digit, x );
 …
 \end{tabular}
 \end{cquote}
 The character-class operations indicates if a string is composed completely of a particular class of characters, \eg, alphabetic, numeric, vowels, \etc.
+The character-class operations indicate if a string is composed completely of a particular class of characters, \eg, alphabetic, numeric, vowels, \etc.
 \begin{cquote}
 \setlength{\tabcolsep}{15pt}
 …
 \end{cquote}
 The test operation checks if each character in a string is in one of the C character classes.
+The test operation checks if each character in a string is in one of the C character classes.\footnote{It is part of the hereditary madness of C that these functions take and return an \lstinline{int} rather than a \lstinline{char}.}
 \begin{cquote}
 \setlength{\tabcolsep}{15pt}
 …
+\subsection{Returning N+1 on Failure}
+Any of the string search routines can fail at some point during the search.
+When this happens it is necessary to return indicating the failure.
+Many string types in other languages use some special value to indicate the failure.
+This value is often 0 or -1 (PL/I returns 0).
+This section argues that a value of N+1, where N is the length of the base string in the search, is a more useful value to return.
+The index-of function in APL returns N+1.
+These are the boundary situations and are often overlooked when designing a string type.
+The situation that can be optimized by returning N+1 is when a search is performed to find the starting location for a substring operation.
+For example, in a program that is extracting words from a text file, it is necessary to scan from left to right over whitespace until the first alphabetic character is found.
+\begin{cfa}
+line = line( line.exclude( alpha ) );
+\end{cfa}
+If a text line contains all whitespaces, the exclude operation fails to find an alphabetic character.
+If @exclude@ returns 0 or -1, the result of the substring operation is unclear.
+\subsection{Returning N+1 on Search Failure}
+String search functions can fail to find the key in the target string.
+The failure must be returned as an alternate outcome, possibly an exception.
+Many string types use a return code to indicate the failure, such as @0@ or @-1@ (PL/I~\cite{PLI} returns @0@).
+\CFA adopts the approach used by the index-of function in APL~\cite{apl}, which returns the length of the target string plus 1 ($N+1$).
+When a search is performed to find the starting location for a substring operation, returning $N+1$ is arguably the best choice.
+For example, in extracting words from a string, it is necessary to scan from left to right over whitespace until the first alphabetic character is found.
+\begin{cfa}
+line = line( exclude( line, alpha ) );  // find start of word
+\end{cfa}
+If the line contains all whitespace and @exclude@ returns 0 or -1, the result of the substring is unclear.
 Most string types generate an error, or clip the starting value to 1, resulting in the entire whitespace string being selected.
+If @exclude@ returns N+1, the starting position for the substring operation is beyond the end of the string leaving a null string.
+The same situation occurs when scanning off a word.
+\begin{cfa}
+start = line.include(alpha);
+word = line(1, start - 1);
+\end{cfa}
+If the entire line is composed of a word, the include operation will fail to find a non-alphabetic character.
+In general, returning 0 or -1 is not an appropriate starting position for the substring, which must substring off the word leaving a null string.
+However, returning N+1 will substring off the word leaving a null string.
+This behaviour leads to the awkward pattern:
+\begin{cfa}
+i = exclude( line, alpha );
+if ( i != -1 ) line = line( i );
+else line = "";
+\end{cfa}
+If @exclude@ returns $N+1$, the starting position for the substring operation is beyond the end of the string leaving a null string.
+This scenario is repeated when scanning off the word.
+\begin{cfa}
+word = line( 0, include( line, alpha ) - 1 );  // scan off word
+\end{cfa}
+If the entire line is composed of a word, the @include@ fails to find a non-alphabetic character, resulting in the same awkward pattern.
+In string systems with an $O(1)$ length operator, checking for failure is low cost.
+\begin{cfa}
+if ( include( line, alpha ) == len( line ) ) ... // not found, 0 origin
+\end{cfa}
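The same idea can be sketched in plain C, where the standard functions @strcspn@ and @strspn@ already return the string length (the 0-origin analogue of $N+1$) when the scan runs off the end.
The helper names @skip_to_word@ and @word_length@ and the @alpha@ set are assumptions made for this illustration, not part of the \CFA string API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative character class: ASCII letters only. */
static const char alpha[] =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

/* Skip leading non-alphabetic characters.
 * strcspn returns strlen( line ) when no alphabetic character exists, so the
 * result then points at the terminating '\0', i.e., an empty string. */
const char * skip_to_word( const char * line ) {
    assert( line != NULL );
    return line + strcspn( line, alpha );
}

/* Length of the word at the start of word.
 * strspn returns strlen( word ) when the whole string is alphabetic. */
size_t word_length( const char * word ) {
    assert( word != NULL );
    return strspn( word, alpha );
}
```

Because failure is reported as a position just past the last character, neither helper needs a separate not-found branch: an all-whitespace line yields an empty string, and a line that is a single word yields the whole line.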
 \subsection{C Compatibility}
+To ease conversion from C to \CFA, there are companion @string@ routines for C strings.
+\VRef[Table]{t:CompanionStringRoutines} shows the C routines on the left that also work with @string@ and the rough equivalent @string@ operation on the right.
+Hence, it is possible to directly convert a block of C string operations into @string@ just by changing the
+\begin{table}
+\begin{cquote}
+\begin{tabular}{@{}l|l@{}}
+\multicolumn{1}{c|}{\lstinline{char []}}        & \multicolumn{1}{c}{\lstinline{string}}        \\
+\hline
+@strcpy@, @strncpy@             & @=@                                                                   \\
+@strcat@, @strncat@             & @+@                                                                   \\
+@strcmp@, @strncmp@             & @==@, @!=@, @<@, @<=@, @>@, @>=@              \\
+@strlen@                                & @size@                                                                \\
+@[]@                                    & @[]@                                                                  \\
+@strstr@                                & @find@                                                                \\
+@strcspn@                               & @find_first_of@, @find_last_of@               \\
+@strspn@                                & @find_first_not_of@, @find_last_not_of@
+\end{tabular}
+\end{cquote}
+\caption{Companion Routines for \CFA \lstinline{string} to C Strings}
+\label{t:CompanionStringRoutines}
+\end{table}
+For example, this block of C code can be converted to \CFA by simply changing the type of variable @s@ from @char []@ to @string@.
+\begin{cfa}
+        char s[32];
+        //string s;
+        strcpy( s, "abc" );                             PRINT( %s, s );
+        strncpy( s, "abcdef", 3 );              PRINT( %s, s );
+        strcat( s, "xyz" );                             PRINT( %s, s );
+        strncat( s, "uvwxyz", 3 );              PRINT( %s, s );
+        PRINT( %zd, strlen( s ) );
+        PRINT( %c, s[3] );
+        PRINT( %s, strstr( s, "yzu" ) ) ;
+        PRINT( %s, strstr( s, 'y' ) ) ;
+To ease conversion from C to \CFA, \CFA provides companion C @string@ functions.
+Hence, it is possible to convert a block of C string operations to \CFA strings just by changing the type @char *@ to @string@.
+\begin{cfa}
+char s[32];   // string s;
+strcpy( s, "abc" );
+strncpy( s, "abcdef", 3 );
+strcat( s, "xyz" );
+strncat( s, "uvwxyz", 3 );
 \end{cfa}
 However, the conversion fails with I/O: @printf@ cannot print a @string@ using format code @%s@ because \CFA strings are not null terminated.
+Nevertheless, this capability does provide a useful starting point for conversion to safer \CFA strings.
 …
 If the base string value shortens so that its end is before the starting location of a substring, resulting in the substring starting location disappearing, the substring becomes a null string located at the end of the base string.
 The following example illustrates passing the results of substring operations by reference and by value to a subprogram.
+\VRef[Figure]{f:ParameterPassing} shows passing the results of substring operations by reference and by value to a subprogram.
 Notice the side-effects to other reference parameters as one is modified.
+\begin{cfa}
+main() {
+        string x = "xxxxxxxxxxxxx";
+        test( x, x(1,3), x(3,3), x(5,5), x(9,5), x(9,5) );
+}
+\begin{figure}
+\begin{cfa}
 // x, a, b, c, & d are substring results passed by reference
 // e is a substring result passed by value
 …
         x = e;                                                  $\C{// eeexx                    eeex    exx             x                               eeexx}$
+}
+\end{cfa}
+\subsection{Input/Output Operators}
+Both the \CC operators @<<@ and @>>@ are defined on type @string@.
+However, input of a string value is different from input of a @char *@ value.
+When a string value is read, \emph{all} input characters from the current point in the input stream to either the end of line (@'\n'@) or the end of file are read.
+\section{Implementation}
+int main() {
+        string x = "xxxxxxxxxxxxx";
+        test( x, x(1,3), x(3,3), x(5,5), x(9,5), x(9,5) );
+}
+\end{cfa}
+\caption{Parameter Passing}
+\label{f:ParameterPassing}
+\end{figure}
+\subsection{I/O Operators}
+The ability to read and print strings is as essential as for any other type.
+The goal for character I/O is to work with groups rather than individual characters.
+A comparison with \CC string I/O is presented as a counterpoint to \CFA string I/O.
+The \CC output @<<@ and input @>>@ operators are defined on type @string@.
+\CC output for @char@, @char *@, and @string@ are similar.
+The \CC manipulators are @setw@ and its associated width controls @left@, @right@, and @setfill@.
+\begin{cquote}
+\setlength{\tabcolsep}{15pt}
+\begin{tabular}{@{}l|l@{}}
+\begin{c++}
+string s = "abc";
+cout << setw(10) << left << setfill( 'x' ) << s << endl;
+\end{c++}
+&
+\begin{c++}
+"abcxxxxxxx"
+\end{c++}
+\end{tabular}
+\end{cquote}
+The \CFA input/output operator @|@ is defined on type @string@.
+\CFA output for @char@, @char *@, and @string@ are similar.
+The \CFA manipulators are @bin@, @oct@, @hex@, @wd@, and its associated width control and @left@.
+\begin{cquote}
+\setlength{\tabcolsep}{15pt}
+\begin{tabular}{@{}l|l@{}}
+\begin{cfa}
+string s = "abc";
+sout | bin( s ) | nl
+           | oct( s ) | nl
+           | hex( s ) | nl
+           | wd( 10, s ) | nl
+           | wd( 10, 2, s ) | nl
+           | left( wd( 10, s ) );
+\end{cfa}
+&
+\begin{cfa}
+"0b1100001 0b1100010 0b1100011"
+"0141 0142 0143"
+"0x61 0x62 0x63"
+"       abc"
+"        ab"
+"abc       "
+\end{cfa}
+\end{tabular}
+\end{cquote}
+\CC input matching for @char@, @char *@, and @string@ are similar, where \emph{all} input characters are read from the current point in the input stream to the end of the type size, format width, whitespace, end of line (@'\n'@), or end of file.
+The \CC manipulator is @setw@ to restrict the size.
+Reading into a @char@ is safe as the size is 1, @char *@ is unsafe without using @setw@ to constrain the length (which includes @'\0'@), and @string@ is safe as it grows dynamically as characters are read.
+\begin{cquote}
+\setlength{\tabcolsep}{15pt}
+\begin{tabular}{@{}l|l@{}}
+\begin{c++}
+char ch, c[10];
+string s;
+cin >> ch >> setw( 5 ) >> c  >> s;
+abcde   fg
+\end{c++}
+&
+\begin{c++}
+'a' "bcde" "fg"
+\end{c++}
+\end{tabular}
+\end{cquote}
+Input text can be gulped from the current point to an arbitrary delimiter character using @getline@, which reads whitespace.
+The \CFA philosophy for input is that for every constant type in C, these constants should be usable as input.
+For example, the complex constant @3.5+4.1i@ can appear as input to a complex variable.
+\CFA input matching for @char@, @char *@, and @string@ are similar.
+C-strings may only be read with a width field, which should match the string size.
+Certain input manipulators support a scanset, which is a simple regular expression from @scanf@.
+The \CFA manipulators for these types are @wdi@\footnote{Due to an overloading issue in the type-resolver, the input width name must be temporarily different from the output, \lstinline{wdi} versus \lstinline{wd}.},
+and its associated width control and @left@, @quote@, @incl@, @excl@, and @getline@.
+\begin{cquote}
+\setlength{\tabcolsep}{10pt}
+\begin{tabular}{@{}l|l@{}}
+\begin{c++}
+char ch, c[10];
+string s;
+sin | ch | wdi( 5, c ) | s;
+abcde fg
+sin | quote( ch ) | quote( wdi( sizeof(c), c ) ) | quote( s, '[', ']' ) | nl;
+$'a' "bcde" [fg]$
+sin | incl( "a-zA-Z0-9 ?!&\n", s ) | nl;
+x?&000xyz TOM !.
+sin | excl( "a-zA-Z0-9 ?!&\n", s );
+<>{}{}STOP
+\end{c++}
+&
+\begin{c++}
+'a' "bcde" [fg]
+'a' "bcde" [fg]
+"x?&000xyz TOM !"
+"<>{}{}"
+\end{c++}
+\end{tabular}
+\end{cquote}
+\subsection{Assignment}
 While \VRef[Figure]{f:StrApiCompare} emphasizes cross-language similarities, it elides many specific operational differences.
 …
 Object lifecycle events are the \emph{subscription-management} triggers in such a service.
 There are two fundamental string-creation routines: importing external text like a C-string or reading a string, and initialization from an existing \CFA string.
+There are two fundamental string-creation functions: importing external text like a C-string or reading a string, and initialization from an existing \CFA string.
 When importing, storage comes from the end of the buffer, into which the text is copied.
 The new string handle is inserted at the end of the handle list because the new text is at the end of the buffer.
 …
 Here, \emph{reusing a logical allocation}, means that the program variable, into which the user is concatenating, previously held a long string.
 In general, a user should not have to care about this difference, yet the STL performs differently in these cases.
 Furthermore, if a routine takes a string by reference, if cannot use the fresh approach.
+Furthermore, if a function takes a string by reference, it cannot use the fresh approach.
 Concretely, both cases incur the cost of copying characters into the target string, but only the allocation-fresh case incurs a further reallocation cost, which is generally paid at points of doubling the length.
 For the STL, this cost includes obtaining a fresh buffer from the memory allocator and copying older characters into the new buffer, while \CFA-sharing hides such a cost entirely.

Changeset 602ac05 for doc/theses

doc/theses/mike_brooks_MMath/string.tex