Changeset 602ac05 for doc/theses


Timestamp:
Apr 11, 2025, 4:31:29 PM (5 months ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master
Children:
a800a19
Parents:
c4f8c4bf
Message:

more proofreading of string chapter

File:
1 edited

  • doc/theses/mike_brooks_MMath/string.tex

    rc4f8c4bf r602ac05  
    6060The maximum storage for a \CFA @string@ value is @size_t@ characters, which is $2^{32}$ or $2^{64}$ on 32-bit and 64-bit platforms, respectively.
    6161A \CFA string manages its length separately from the string, so there is no null (@'\0'@) terminating value at the end of a string value.
    62 Hence, a \CFA string cannot be passed to a C string manipulation routine, such as @strcat@.
     62Hence, a \CFA string cannot be passed to a C string manipulation function, such as @strcat@.
    6363Like C strings, characters in a @string@ are numbered from the left starting at 0, and in \CFA numbered from the right starting at -1.
    6464\begin{cquote}
     
    131131s = (string){ 5.5 };                            $\C{// converts double to string}$
    132132\end{cfa}
    133 Conversions from @string@ to @char *@, attempt to be safe:
     133
     134Conversions from @string@ to @char *@ attempt to be safe:
    134135either by requiring the maximum length of the @char *@ storage (@strncpy@) or allocating the @char *@ storage for the string characters (ownership), meaning the programmer must free the storage.
    135136Note, a C string is always null terminated, implying a minimum size of 1 character.
     
    186187\subsection{Comparison Operators}
    187188
    188 The binary relational, @<@, @<=@, @>@, @>=@, and equality, @==@, @!=@, operators compare strings using lexicographical ordering, where longer strings are greater than shorter strings.
    189 C strings use function @strcmp@, as the relational/equality operators compare C string pointers not their values, which does not match programmer expectation.
     189The binary relational, @<@, @<=@, @>@, @>=@, and equality, @==@, @!=@, operators compare \CFA string values using lexicographical ordering, where longer strings are greater than shorter strings.
     190In C, these operators compare the C string pointer not its value, which does not match programmer expectation.
     191C strings use function @strcmp@ as the relational/equality operator for string values.
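
The difference between comparing C string pointers and comparing their values can be sketched in plain C. The helper names @same_pointer@ and @same_value@ are hypothetical and introduced only for illustration; they are not part of \CFA or the C library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: compares the addresses of two C strings,
// which is what the == operator does when applied to char * operands.
bool same_pointer( const char * a, const char * b ) {
	assert( a != NULL && b != NULL );
	return a == b;
}

// Hypothetical helper: compares the characters of two C strings using strcmp.
bool same_value( const char * a, const char * b ) {
	assert( a != NULL && b != NULL );
	return strcmp( a, b ) == 0;
}
```

Two distinct arrays holding the same characters differ under @same_pointer@ but are equal under @same_value@, which is the behaviour a programmer normally expects from a relational operator on strings.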
    190192
    191193
     
    194196The binary operators @+@ and @+=@ concatenate characters, C strings and \CFA strings, creating the sum of the characters.
    195197\begin{cquote}
    196 \begin{tabular}{@{}l|l@{\hspace{25pt}}l|l@{\hspace{25pt}}l|l@{}}
     198\begin{tabular}{@{}l|l@{\hspace{15pt}}l|l@{\hspace{15pt}}l|l@{}}
    197199\begin{cfa}
    198200s = "";
     
    245247\end{cquote}
    246248For these operations to meet programmer expectations, \CFA introduces two backward incompatibilities with C.
    247 Note, subtracting pointers or characters has a low-level use case.
     249Note, subtracting pointers or characters has a low-level use-case.
    248250\begin{cfa}
    249251ch - '0'    $\C[2in]{// find character offset}$
     
    256258\end{cfa}
    257259Adding character values or advancing a pointer with a character are unusual operations, and hence, unlikely to exist in C programs.
     260There is a legitimate use case for arithmetic on @signed@/@unsigned@ characters (bytes), but these types are treated differently from @char@ in \CC and \CFA.
     261However, for backwards compatibility reasons it is impossible to restrict or remove arithmetic on type @char@.
    258262Stealing these two cases for use with strings allows all combinations of concatenation among @char@, @char *@, and @string@.
    259 Note, stealing only occurs if a program includes @string.hfa@, resulting is ambiguities in existing C code where there is no way to disambiguate.
     263Note, stealing only occurs if a program includes @<string.hfa>@, resulting in ambiguities in existing C code where there is no way to disambiguate.
    260264\begin{cfa}
    261265ch = 'a' + 'b'; $\C[2in]{// LHS disambiguate, add character values}$
    262266s = 'a' + 'b'; $\C{// LHS disambiguate, concatenate characters}$
    263 sout | 'a' + 'b'; $\C{// ambiguous with string.hfa, add or concatenate?}$
     267sout | 'a' + 'b'; $\C{// ambiguous with <string.hfa>, add or concatenate?}$
    264268sout | (char)'a' + 'b'; $\C{// disambiguate}$
    265269sout | "a" + "b"; $\C{// disambiguate}\CRT$
    266270\end{cfa}
    267 Again, the possibility of this scenario is extremely rare, as adding characters is meaningless.
     271Again, the need to disambiguate in this scenario is rare, as adding characters is uncommon.
    268272
    269273\CC cannot support this generality because it does not use the left-hand side of assignment in expression resolution.
     
    309313\setlength{\tabcolsep}{10pt}
    310314\begin{tabular}{@{}l|ll|l@{}}
     315\multicolumn{2}{c}{\textbf{length}} & \multicolumn{2}{c}{\textbf{pattern}} \\
    311316\begin{cfa}
    312317s = name( 2, 2 );
     
    321326"KE"
    322327"IK"
    323 "KE", clipped length to 2
    324 "", beyond string clipped to null
     328"KE", clip length to 2
     329"", beyond string clip to null
    325330"K"
    326331"IKE", to end of string
     
    351356If the substring request is completely outside of the original string, a null string is returned.
    352357The pattern form either returns the pattern string if the pattern matches or a null string if the pattern does not match.
    353 This mechanism is discussed next.
     358The usefulness of this mechanism is discussed next.
    354359
    355360The substring operation can also appear on the left side of an assignment, where it is replaced by the string value on the right side.
     
    376381\end{tabular}
    377382\end{cquote}
    378 Pattern matching is useful on the left-hand side of the assignment.
     383Now pattern matching is useful on the left-hand side of assignment.
    379384\begin{cquote}
    380385\setlength{\tabcolsep}{15pt}
     
    416421
    417422The find operation returns the position of the first occurrence of a key string in a string.
    418 If the key does not appear in the current string, the length of the current string plus one is returned.
     423If the key does not appear in the string, the length of the string plus one is returned.
    419424\begin{cquote}
    420425\setlength{\tabcolsep}{15pt}
     
    422427\begin{cfa}
    423428i = find( digit, '3' );
    424 i = "45" ^ digit; // python style "45" in digit
     429i = find( digit, "45" );
    425430string x = "567";
    426431i = find( digit, x );
     
    435440\end{tabular}
    436441\end{cquote}
    437 The character-class operations indicates if a string is composed completely of a particular class of characters, \eg, alphabetic, numeric, vowels, \etc.
     442The character-class operations indicate if a string is composed completely of a particular class of characters, \eg, alphabetic, numeric, vowels, \etc.
    438443\begin{cquote}
    439444\setlength{\tabcolsep}{15pt}
     
    490495\end{cquote}
    491496
    492 The test operation checks if each character in a string is in one of the C character classes.
     497The test operation checks if each character in a string is in one of the C character classes.\footnote{It is part of the hereditary madness of C that these functions take and return an \lstinline{int} rather than a \lstinline{char}.}
    493498\begin{cquote}
    494499\setlength{\tabcolsep}{15pt}
     
    546551
    547552
    548 \subsection{Returning N+1 on Failure}
    549 
    550 Any of the string search routines can fail at some point during the search.
    551 When this happens it is necessary to return indicating the failure.
    552 Many string types in other languages use some special value to indicate the failure.
    553 This value is often 0 or -1 (PL/I returns 0).
    554 This section argues that a value of N+1, where N is the length of the base string in the search, is a more useful value to return.
    555 The index-of function in APL returns N+1.
    556 These are the boundary situations and are often overlooked when designing a string type.
    557 
    558 The situation that can be optimized by returning N+1 is when a search is performed to find the starting location for a substring operation.
    559 For example, in a program that is extracting words from a text file, it is necessary to scan from left to right over whitespace until the first alphabetic character is found.
    560 \begin{cfa}
    561 line = line( line.exclude( alpha ) );
    562 \end{cfa}
    563 If a text line contains all whitespaces, the exclude operation fails to find an alphabetic character.
    564 If @exclude@ returns 0 or -1, the result of the substring operation is unclear.
     553\subsection{Returning N+1 on Search Failure}
     554
     555String search functions can fail to find the key in the target string.
     556The failure must be returned as an alternate outcome, possibly an exception.
     557Many string types use a return code to indicate the failure, such as @0@ or @-1@ (PL/I~\cite{PLI} returns @0@).
     558\CFA adopts the approach used by the index-of function in APL~\cite{apl}, which returns the length of the target string plus 1 ($N+1$).
     559
     560When a search is performed to find the starting location for a substring operation, returning $N+1$ is arguably the best choice.
     561For example, in extracting words from a string, it is necessary to scan from left to right over whitespace until the first alphabetic character is found.
     562\begin{cfa}
     563line = line( exclude( line, alpha ) );  // find start of word
     564\end{cfa}
     565If the line contains all whitespace and @exclude@ returns 0 or -1, the result of the substring is unclear.
    565566Most string types generate an error, or clip the starting value to 1, resulting in the entire whitespace string being selected.
    566 If @exclude@ returns N+1, the starting position for the substring operation is beyond the end of the string leaving a null string.
    567 
    568 The same situation occurs when scanning off a word.
    569 \begin{cfa}
    570 start = line.include(alpha);
    571 word = line(1, start - 1);
    572 \end{cfa}
    573 If the entire line is composed of a word, the include operation will  fail to find a non-alphabetic character.
    574 In general, returning 0 or -1 is not an appropriate starting position for the substring, which must substring off the word leaving a null string.
    575 However, returning N+1 will substring off the word leaving a null string.
     567This behaviour leads to the awkward pattern:
     568\begin{cfa}
     569i = exclude( line, alpha );
     570if ( i != -1 ) line = line( i );
     571else line = "";
     572\end{cfa}
     573If @exclude@ returns $N+1$, the starting position for the substring operation is beyond the end of the string leaving a null string.
     574This scenario is repeated when scanning off the word.
     575\begin{cfa}
     576word = line( 0, include( line, alpha ) - 1 );  // scan off word
     577\end{cfa}
     578If the entire line is composed of a word, the @include@ fails to find a non-alphabetic character, resulting in the same awkward pattern.
     579In string systems with an $O(1)$ length operator, checking for failure is low cost.
     580\begin{cfa}
     581if ( include( line, alpha ) == len( line ) ) ... // not found, 0 origin
     582\end{cfa}
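
The benefit of a past-the-end failure value can be sketched in plain C with 0-origin indexing, where the failure value is the length $N$ rather than $N+1$. The helpers @first_alpha@ and @suffix_from@ are hypothetical stand-ins for a search and a clipping substring; they are assumptions for illustration and not the \CFA implementation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper: index of the first alphabetic character in line,
// or strlen( line ) (one past the last valid 0-origin index) if there is none.
size_t first_alpha( const char * line ) {
	assert( line != NULL );
	size_t i = 0;
	while ( line[i] != '\0' && ! isalpha( (unsigned char)line[i] ) ) i += 1;
	return i;
}

// Hypothetical helper: suffix of line starting at position start,
// clipping a start at or beyond the end to the empty string.
const char * suffix_from( const char * line, size_t start ) {
	assert( line != NULL );
	size_t n = strlen( line );
	return line + ( start < n ? start : n );
}
```

With these definitions, @suffix_from( line, first_alpha( line ) )@ needs no explicit failure check: for an all-whitespace line, the search result is past the end and the substring is the empty string.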
    576583
    577584
    578585\subsection{C Compatibility}
    579586
    580 To ease conversion from C to \CFA, there are companion @string@ routines for C strings.
    581 \VRef[Table]{t:CompanionStringRoutines} shows the C routines on the left that also work with @string@ and the rough equivalent @string@ operation of the right.
    582 Hence, it is possible to directly convert a block of C string operations into @string@ just by changing the
    583 
    584 \begin{table}
    585 \begin{cquote}
    586 \begin{tabular}{@{}l|l@{}}
    587 \multicolumn{1}{c|}{\lstinline{char []}}        & \multicolumn{1}{c}{\lstinline{string}}        \\
    588 \hline
    589 @strcpy@, @strncpy@             & @=@                                                                   \\
    590 @strcat@, @strncat@             & @+@                                                                   \\
    591 @strcmp@, @strncmp@             & @==@, @!=@, @<@, @<=@, @>@, @>=@              \\
    592 @strlen@                                & @size@                                                                \\
    593 @[]@                                    & @[]@                                                                  \\
    594 @strstr@                                & @find@                                                                \\
    595 @strcspn@                               & @find_first_of@, @find_last_of@               \\
    596 @strspc@                                & @find_fist_not_of@, @find_last_not_of@
    597 \end{tabular}
    598 \end{cquote}
    599 \caption{Companion Routines for \CFA \lstinline{string} to C Strings}
    600 \label{t:CompanionStringRoutines}
    601 \end{table}
    602 
    603 For example, this block of C code can be converted to \CFA by simply changing the type of variable @s@ from @char []@ to @string@.
    604 \begin{cfa}
    605         char s[32];
    606         //string s;
    607         strcpy( s, "abc" );                             PRINT( %s, s );
    608         strncpy( s, "abcdef", 3 );              PRINT( %s, s );
    609         strcat( s, "xyz" );                             PRINT( %s, s );
    610         strncat( s, "uvwxyz", 3 );              PRINT( %s, s );
    611         PRINT( %zd, strlen( s ) );
    612         PRINT( %c, s[3] );
    613         PRINT( %s, strstr( s, "yzu" ) ) ;
    614         PRINT( %s, strstr( s, 'y' ) ) ;
     587To ease conversion from C to \CFA, \CFA provides companion C @string@ functions.
     588Hence, it is possible to convert a block of C string operations to \CFA strings just by changing the type @char *@ to @string@.
     589\begin{cfa}
     590char s[32];   // string s;
     591strcpy( s, "abc" );
     592strncpy( s, "abcdef", 3 );
     593strcat( s, "xyz" );
     594strncat( s, "uvwxyz", 3 );
    615595\end{cfa}
    616596However, the conversion fails with I/O: @printf@ cannot print a @string@ using format code @%s@ because \CFA strings are not null terminated.
     597Nevertheless, this capability does provide a useful starting point for conversion to safer \CFA strings.
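
For comparison, standard C can format characters that are not null terminated by giving @%s@ an explicit precision. The following is a minimal sketch, assuming a hypothetical length-managed type @lstring@ and helper @format_lstring@; neither is the \CFA string representation or part of its API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical stand-in for a length-managed string:
// characters plus an explicit length, with no '\0' terminator.
typedef struct {
	const char * chars;
	size_t len;
} lstring;

// Format the characters of s into buf (capacity cap) using a precision-bounded %s,
// so the source characters do not need a terminator.
// Returns the value from snprintf, i.e., the number of characters that would be written.
int format_lstring( char * buf, size_t cap, lstring s ) {
	assert( buf != NULL && s.chars != NULL );
	return snprintf( buf, cap, "%.*s", (int)s.len, s.chars );
}
```

The precision argument bounds how many characters @%s@ reads, so the length must be supplied at every call site, which is the manual bookkeeping that a direct type change from @char *@ to @string@ cannot provide automatically.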
    617598
    618599
     
    625606If the base string value shortens so that its end is before the starting location of a substring, resulting in the substring starting location disappearing, the substring becomes a null string located at the end of the base string.
    626607
    627 The following example illustrates passing the results of substring operations by reference and by value to a subprogram.
     608\VRef[Figure]{f:ParameterPassing} shows passing the results of substring operations by reference and by value to a subprogram.
    628609Notice the side-effects to other reference parameters as one is modified.
    629 \begin{cfa}
    630 main() {
    631         string x = "xxxxxxxxxxxxx";
    632         test( x, x(1,3), x(3,3), x(5,5), x(9,5), x(9,5) );
    633 }
    634 
     610
     611\begin{figure}
     612\begin{cfa}
    635613// x, a, b, c, & d are substring results passed by reference
    636614// e is a substring result passed by value
     
    645623        x = e;                                                  $\C{// eeexx                    eeex    exx             x                               eeexx}$
    646624}
    647 \end{cfa}
    648 
    649 
    650 \subsection{Input/Output Operators}
    651 
    652 Both the \CC operators @<<@ and @>>@ are defined on type @string@.
    653 However, input of a string value is different from input of a @char *@ value.
    654 When a string value is read, \emph{all} input characters from the current point in the input stream to either the end of line (@'\n'@) or the end of file are read.
    655 
    656 
    657 \section{Implementation}
     625int main() {
     626        string x = "xxxxxxxxxxxxx";
     627        test( x, x(1,3), x(3,3), x(5,5), x(9,5), x(9,5) );
     628}
     629\end{cfa}
     630\caption{Parameter Passing}
     631\label{f:ParameterPassing}
     632\end{figure}
     633
     634
     635\subsection{I/O Operators}
     636
     637The ability to read and print strings is as essential as for any other type.
     638The goal for character I/O is to work with groups rather than individual characters.
     639A comparison with \CC string I/O is presented as a counterpoint to \CFA string I/O.
     640
     641The \CC output @<<@ and input @>>@ operators are defined on type @string@.
     642\CC output for @char@, @char *@, and @string@ are similar.
     643The \CC manipulators are @setw@, and its associated width controls @left@, @right@ and @setfill@.
     644\begin{cquote}
     645\setlength{\tabcolsep}{15pt}
     646\begin{tabular}{@{}l|l@{}}
     647\begin{c++}
     648string s = "abc";
     649cout << setw(10) << left << setfill( 'x' ) << s << endl;
     650\end{c++}
     651&
     652\begin{c++}
     653
     654"abcxxxxxxx"
     655\end{c++}
     656\end{tabular}
     657\end{cquote}
     658
     659The \CFA input/output operator @|@ is defined on type @string@.
     660\CFA output for @char@, @char *@, and @string@ are similar.
     661The \CFA manipulators are @bin@, @oct@, @hex@, @wd@, and its associated width control and @left@.
     662\begin{cquote}
     663\setlength{\tabcolsep}{15pt}
     664\begin{tabular}{@{}l|l@{}}
     665\begin{cfa}
     666string s = "abc";
     667sout | bin( s ) | nl
     668           | oct( s ) | nl
     669           | hex( s ) | nl
     670           | wd( 10, s ) | nl
     671           | wd( 10, 2, s ) | nl
     672           | left( wd( 10, s ) );
     673\end{cfa}
     674&
     675\begin{cfa}
     676
     677"0b1100001 0b1100010 0b1100011"
     678"0141 0142 0143"
     679"0x61 0x62 0x63"
     680"       abc"
     681"        ab"
     682"abc       "
     683\end{cfa}
     684\end{tabular}
     685\end{cquote}
     686
     687\CC input matching for @char@, @char *@, and @string@ are similar, where \emph{all} input characters are read from the current point in the input stream to the end of the type size, format width, whitespace, end of line (@'\n'@), or end of file.
     688The \CC manipulator is @setw@ to restrict the size.
     689Reading into a @char@ is safe as the size is 1, @char *@ is unsafe without using @setw@ to constrain the length (which includes @'\0'@), and @string@ is safe as it grows dynamically as characters are read.
     690\begin{cquote}
     691\setlength{\tabcolsep}{15pt}
     692\begin{tabular}{@{}l|l@{}}
     693\begin{c++}
     694char ch, c[10];
     695string s;
     696cin >> ch >> setw( 5 ) >> c  >> s;
     697abcde   fg
     698\end{c++}
     699&
     700\begin{c++}
     701
     702
     703'a' "bcde" "fg"
     704
     705\end{c++}
     706\end{tabular}
     707\end{cquote}
     708Input text can be gulped from the current point to an arbitrary delimiter character using @getline@, which reads whitespace.
     709
     710The \CFA philosophy for input is that for every constant type in C, these constants should be usable as input.
     711For example, the complex constant @3.5+4.1i@ can appear as input to a complex variable.
     712\CFA input matching for @char@, @char *@, and @string@ are similar.
     713C-strings may only be read with a width field, which should match the string size.
     714Certain input manipulators support a scanset, which is a simple regular expression from @scanf@.
     715The \CFA manipulators for these types are @wdi@\footnote{Due to an overloading issue in the type-resolver, the input width name must be temporarily different from the output, \lstinline{wdi} versus \lstinline{wd}.},
     716and its associated width control and @left@, @quote@, @incl@, @excl@, and @getline@.
     717\begin{cquote}
     718\setlength{\tabcolsep}{10pt}
     719\begin{tabular}{@{}l|l@{}}
     720\begin{c++}
     721char ch, c[10];
     722string s;
     723sin | ch | wdi( 5, c ) | s;
     724abcde fg
     725sin | quote( ch ) | quote( wdi( sizeof(c), c ) ) | quote( s, '[', ']' ) | nl;
     726$'a' "bcde" [fg]$
     727sin | incl( "a-zA-Z0-9 ?!&\n", s ) | nl;
     728x?&000xyz TOM !.
     729sin | excl( "a-zA-Z0-9 ?!&\n", s );
     730<>{}{}STOP
     731\end{c++}
     732&
     733\begin{c++}
     734
     735
     736'a' "bcde" [fg]
     737
     738'a' "bcde" [fg]
     739
     740"x?&000xyz TOM !"
     741
     742"<>{}{}"
     743
     744\end{c++}
     745\end{tabular}
     746\end{cquote}
     747
     748
     749
     750\subsection{Assignment}
    658751
    659752While \VRef[Figure]{f:StrApiCompare} emphasizes cross-language similarities, it elides many specific operational differences.
     
    10051098
    10061099Object lifecycle events are the \emph{subscription-management} triggers in such a service.
    1007 There are two fundamental string-creation routines: importing external text like a C-string or reading a string, and initialization from an existing \CFA string.
     1100There are two fundamental string-creation functions: importing external text like a C-string or reading a string, and initialization from an existing \CFA string.
    10081101When importing, storage comes from the end of the buffer, into which the text is copied.
    10091102The new string handle is inserted at the end of the handle list because the new text is at the end of the buffer.
     
    12181311Here, \emph{reusing a logical allocation}, means that the program variable, into which the user is concatenating, previously held a long string.
    12191312In general, a user should not have to care about this difference, yet the STL performs differently in these cases.
    1220 Furthermore, if a routine takes a string by reference, if cannot use the fresh approach.
     1313Furthermore, if a function takes a string by reference, it cannot use the fresh approach.
    12211314Concretely, both cases incur the cost of copying characters into the target string, but only the allocation-fresh case incurs a further reallocation cost, which is generally paid at points of doubling the length.
    12221315For the STL, this cost includes obtaining a fresh buffer from the memory allocator and copying older characters into the new buffer, while \CFA-sharing hides such a cost entirely.