Changeset 602ac05
- Timestamp:
- Apr 11, 2025, 4:31:29 PM (5 months ago)
- Branches:
- master
- Children:
- a800a19
- Parents:
- c4f8c4bf
- File:
- 1 edited
doc/theses/mike_brooks_MMath/string.tex
rc4f8c4bf r602ac05
  60   60    The maximum storage for a \CFA @string@ value is @size_t@ characters, which is $2^{32}$ or $2^{64}$ respectively.
  61   61    A \CFA string manages its length separately from the string, so there is no null (@'\0'@) terminating value at the end of a string value.
  62       - Hence, a \CFA string cannot be passed to a C string manipulation routine, such as @strcat@.
       62  + Hence, a \CFA string cannot be passed to a C string manipulation function, such as @strcat@.
  63   63    Like C strings, characters in a @string@ are numbered from the left starting at 0, and in \CFA numbered from the right starting at -1.
  64   64    \begin{cquote}
   …    …
 131  131    s = (string){ 5.5 }; $\C{// converts double to string}$
 132  132    \end{cfa}
 133       - Conversions from @string@ to @char *@, attempt to be safe:
      133  +
      134  + Conversions from @string@ to @char *@ attempt to be safe:
 134  135    either by requiring the maximum length of the @char *@ storage (@strncpy@) or allocating the @char *@ storage for the string characters (ownership), meaning the programmer must free the storage.
 135  136    Note, a C string is always null terminated, implying a minimum size of 1 character.
   …    …
 186  187    \subsection{Comparison Operators}
 187  188
 188       - The binary relational, @<@, @<=@, @>@, @>=@, and equality, @==@, @!=@, operators compare strings using lexicographical ordering, where longer strings are greater than shorter strings.
 189       - C strings use function @strcmp@, as the relational/equality operators compare C string pointers not their values, which does not match programmer expectation.
      189  + The binary relational, @<@, @<=@, @>@, @>=@, and equality, @==@, @!=@, operators compare \CFA string values using lexicographical ordering, where longer strings are greater than shorter strings.
      190  + In C, these operators compare the C string pointer not its value, which does not match programmer expectation.
      191  + C strings use function @strcmp@, as the relational/equality operator for string values.
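The pointer-versus-value distinction stated in the Comparison Operators text can be sketched in plain C. This is a minimal illustration under stated assumptions, not code from the thesis: the helper names `same_value` and `same_pointer` are hypothetical, and only the standard `strcmp` is used.

```c
#include <stdbool.h>
#include <string.h>

/* Value comparison: true when both C strings hold the same characters.
   This is the comparison strcmp provides. (Hypothetical helper name.) */
bool same_value( const char * a, const char * b ) {
    return strcmp( a, b ) == 0;
}

/* Pointer comparison: what the C == operator does on two char * operands.
   It is true only when both operands refer to the same storage.
   (Hypothetical helper name.) */
bool same_pointer( const char * a, const char * b ) {
    return a == b;
}
```

Two distinct arrays that both hold `abc` satisfy `same_value` but not `same_pointer`, which is the mismatch with programmer expectation that the text describes. A negative `strcmp` result for `"abc"` against `"abcd"` also matches the rule that a longer string is greater than its own prefix.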
 190  192
 191  193
   …    …
 194  196    The binary operators @+@ and @+=@ concatenate characters, C strings and \CFA strings, creating the sum of the characters.
 195  197    \begin{cquote}
 196       - \begin{tabular}{@{}l|l@{\hspace{25pt}}l|l@{\hspace{25pt}}l|l@{}}
      198  + \begin{tabular}{@{}l|l@{\hspace{15pt}}l|l@{\hspace{15pt}}l|l@{}}
 197  199    \begin{cfa}
 198  200    s = "";
   …    …
 245  247    \end{cquote}
 246  248    For these operations to meet programmer expectations, \CFA introduces two C non-backward compatibilities.
 247       - Note, subtracting pointers or characters has a low-level use
      249  + Note, subtracting pointers or characters has a low-level use-case.
 248  250    \begin{cfa}
 249  251    ch - '0' $\C[2in]{// find character offset}$
   …    …
 256  258    \end{cfa}
 257  259    Adding character values or advancing a pointer with a character are unusual operations, and hence, unlikely to existing in C programs.
      260  + There is a legitimate use case for arithmetic on @signed@/@unsigned@ characters (bytes), but these type are treated differently from @char@ in \CC and \CFA.
      261  + However, for backwards compatibility reasons it is impossible to restrict or remove arithmetic on type @char@.
 258  262    Stealing these two cases for use with strings, allows all combinations of concatenation among @char@, @char *@, and @string@.
 259       - Note, stealing only occurs if a program includes @string.hfa@, resulting is ambiguities in existing C code where there is no way to disambiguate.
      263  + Note, stealing only occurs if a program includes @<string.hfa>@, resulting is ambiguities in existing C code where there is no way to disambiguate.
 260  264    \begin{cfa}
 261  265    ch = 'a' + 'b'; $\C[2in]{// LHS disambiguate, add character values}$
 262  266    s = 'a' + 'b'; $\C{// LHS disambiguate, concatenation characters}$
 263       - sout | 'a' + 'b'; $\C{// ambiguous with string.hfa, add or concatenate?}$
      267  + sout | 'a' + 'b'; $\C{// ambiguous with <string.hfa>, add or concatenate?}$
 264  268    sout | (char)'a' + 'b'; $\C{// disambiguate}$
 265  269    sout | "a" + "b"; $\C{// disambiguate}\CRT$
 266  270    \end{cfa}
 267       - Again, the possibility of this scenario is extremely rare, as adding characters is meaningless.
      271  + Again, introducing disambiguates for this scenario are rare, as adding characters is uncommon.
 268  272
 269  273    \CC cannot support this generality because it does not use the left-hand side of assignment in expression resolution.
   …    …
 309  313    \setlength{\tabcolsep}{10pt}
 310  314    \begin{tabular}{@{}l|ll|l@{}}
      315  + \multicolumn{2}{c}{\textbf{length}} & \multicolumn{2}{c}{\textbf{pattern}} \\
 311  316    \begin{cfa}
 312  317    s = name( 2, 2 );
   …    …
 321  326    "KE"
 322  327    "IK"
 323       - "KE", clipped length to 2
 324       - "", beyond string clipped to null
      328  + "KE", clip length to 2
      329  + "", beyond string clip to null
 325  330    "K"
 326  331    "IKE", to end of string
   …    …
 351  356    If the substring request is completely outside of the original string, a null string is returned.
 352  357    The pattern form either returns the pattern string is the pattern matches or a null string if the pattern does not match.
 353       - This mechanism is discussed next.
      358  + The usefulness of this mechanism is discussed next.
 354  359
 355  360    The substring operation can also appear on the left side of an assignment and replaced by the string value on the right side.
   …    …
 376  381    \end{tabular}
 377  382    \end{cquote}
 378       - Pattern matching is useful on the left-hand side of the assignment.
      383  + Now pattern matching is useful on the left-hand side of assignment.
 379  384    \begin{cquote}
 380  385    \setlength{\tabcolsep}{15pt}
   …    …
 416  421
 417  422    The find operation returns the position of the first occurrence of a key string in a string.
 418       - If the key does not appear in the current string, the length of the current string plus one is returned.
      423  + If the key does not appear in the string, the length of the string plus one is returned.
 419  424    \begin{cquote}
 420  425    \setlength{\tabcolsep}{15pt}
   …    …
 422  427    \begin{cfa}
 423  428    i = find( digit, '3' );
 424       - i = "45" ^ digit; // python style "45" in digit
      429  + i = find( digit, "45" );
 425  430    string x = "567";
 426  431    i = find( digit, x );
   …    …
 435  440    \end{tabular}
 436  441    \end{cquote}
 437       - The character-class operations indicates if a string is composed completely of a particular class of characters, \eg, alphabetic, numeric, vowels, \etc.
      442  + The character-class operations indicate if a string is composed completely of a particular class of characters, \eg, alphabetic, numeric, vowels, \etc.
 438  443    \begin{cquote}
 439  444    \setlength{\tabcolsep}{15pt}
   …    …
 490  495    \end{cquote}
 491  496
 492       - The test operation checks if each character in a string is in one of the C character classes.
      497  + The test operation checks if each character in a string is in one of the C character classes.\footnote{It is part of the hereditary madness of C that these function take and return an \lstinline{int} rather than a \lstinline{char}.}
 493  498    \begin{cquote}
 494  499    \setlength{\tabcolsep}{15pt}
   …    …
 546  551
 547  552
 548       - \subsection{Returning N+1 on Failure}
 549       -
 550       - Any of the string search routines can fail at some point during the search.
 551       - When this happens it is necessary to return indicating the failure.
 552       - Many string types in other languages use some special value to indicate the failure.
 553       - This value is often 0 or -1 (PL/I returns 0).
 554       - This section argues that a value of N+1, where N is the length of the base string in the search, is a more useful value to return.
 555       - The index-of function in APL returns N+1.
 556       - These are the boundary situations and are often overlooked when designing a string type.
 557       -
 558       - The situation that can be optimized by returning N+1 is when a search is performed to find the starting location for a substring operation.
 559       - For example, in a program that is extracting words from a text file, it is necessary to scan from left to right over whitespace until the first alphabetic character is found.
 560       - \begin{cfa}
 561       - line = line( line.exclude( alpha ) );
 562       - \end{cfa}
 563       - If a text line contains all whitespaces, the exclude operation fails to find an alphabetic character.
 564       - If @exclude@ returns 0 or -1, the result of the substring operation is unclear.
      553  + \subsection{Returning N+1 on Search Failure}
      554  +
      555  + String search functions can fail to find the key in the target string.
      556  + The failure must be returned as an alternate outcome, possibly an exception.
      557  + Many string types use a return code to indicate the failure, such as @0@ or @-1@ (PL/I~\cite{PLI} returns @0@).
      558  + \CFA adopts the approach used by the index-of function in APL~\cite{apl}, which returns length of the target string plus 1 ($N+1$).
      559  +
      560  + When a search is performed to find the starting location for a substring operation, returning $N+1$ is arguably the best choice.
      561  + For example, in extracting words from a string, it is necessary to scan from left to right over whitespace until the first alphabetic character is found.
      562  + \begin{cfa}
      563  + line = line( exclude( line, alpha ) ); // find start of word
      564  + \end{cfa}
      565  + If the line contains all whitespace and @exclude@ returns 0 or -1, the result of the substring is unclear.
 565  566    Most string types generate an error, or clip the starting value to 1, resulting in the entire whitespace string being selected.
 566       - If @exclude@ returns N+1, the starting position for the substring operation is beyond the end of the string leaving a null string.
 567       -
 568       - The same situation occurs when scanning off a word.
 569       - \begin{cfa}
 570       - start = line.include(alpha);
 571       - word = line(1, start - 1);
 572       - \end{cfa}
 573       - If the entire line is composed of a word, the include operation will fail to find a non-alphabetic character.
 574       - In general, returning 0 or -1 is not an appropriate starting position for the substring, which must substring off the word leaving a null string.
 575       - However, returning N+1 will substring off the word leaving a null string.
      567  + This behaviour leads to the awkward pattern:
      568  + \begin{cfa}
      569  + i = exclude( line, alpha );
      570  + if ( i != -1 ) line = line( i );
      571  + else line = "";
      572  + \end{cfa}
      573  + If @exclude@ returns $N+1$, the starting position for the substring operation is beyond the end of the string leaving a null string.
      574  + This scenario is repeated when scanning off the word.
      575  + \begin{cfa}
      576  + word = line( 0, include( line, alpha ) - 1 ); // scan off word
      577  + \end{cfa}
      578  + If the entire line is composed of a word, the @include@ fails to find a non-alphabetic character, resulting in the same awkward pattern.
      579  + In string systems with an $O(1)$ length operator, checking for failure is low cost.
      580  + \begin{cfa}
      581  + if ( include( line, alpha ) == len( line ) ) ... // not found, 0 origin
      582  + \end{cfa}
 576  583
 577  584
 578  585    \subsection{C Compatibility}
 579  586
 580       - To ease conversion from C to \CFA, there are companion @string@ routines for C strings.
 581       - \VRef[Table]{t:CompanionStringRoutines} shows the C routines on the left that also work with @string@ and the rough equivalent @string@ operation of the right.
 582       - Hence, it is possible to directly convert a block of C string operations into @string@ just by changing the
 583       -
 584       - \begin{table}
 585       - \begin{cquote}
 586       - \begin{tabular}{@{}l|l@{}}
 587       - \multicolumn{1}{c|}{\lstinline{char []}} & \multicolumn{1}{c}{\lstinline{string}} \\
 588       - \hline
 589       - @strcpy@, @strncpy@ & @=@ \\
 590       - @strcat@, @strncat@ & @+@ \\
 591       - @strcmp@, @strncmp@ & @==@, @!=@, @<@, @<=@, @>@, @>=@ \\
 592       - @strlen@ & @size@ \\
 593       - @[]@ & @[]@ \\
 594       - @strstr@ & @find@ \\
 595       - @strcspn@ & @find_first_of@, @find_last_of@ \\
 596       - @strspc@ & @find_fist_not_of@, @find_last_not_of@
 597       - \end{tabular}
 598       - \end{cquote}
 599       - \caption{Companion Routines for \CFA \lstinline{string} to C Strings}
 600       - \label{t:CompanionStringRoutines}
 601       - \end{table}
 602       -
 603       - For example, this block of C code can be converted to \CFA by simply changing the type of variable @s@ from @char []@ to @string@.
 604       - \begin{cfa}
 605       - char s[32];
 606       - //string s;
 607       - strcpy( s, "abc" ); PRINT( %s, s );
 608       - strncpy( s, "abcdef", 3 ); PRINT( %s, s );
 609       - strcat( s, "xyz" ); PRINT( %s, s );
 610       - strncat( s, "uvwxyz", 3 ); PRINT( %s, s );
 611       - PRINT( %zd, strlen( s ) );
 612       - PRINT( %c, s[3] );
 613       - PRINT( %s, strstr( s, "yzu" ) ) ;
 614       - PRINT( %s, strstr( s, 'y' ) ) ;
      587  + To ease conversion from C to \CFA, \CFA provides companion C @string@ functions.
      588  + Hence, it is possible to convert a block of C string operations to \CFA strings just by changing the type @char *@ to @string@.
      589  + \begin{cfa}
      590  + char s[32]; // string s;
      591  + strcpy( s, "abc" );
      592  + strncpy( s, "abcdef", 3 );
      593  + strcat( s, "xyz" );
      594  + strncat( s, "uvwxyz", 3 );
 615  595    \end{cfa}
 616  596    However, the conversion fails with I/O because @printf@ cannot print a @string@ using format code @%s@ because \CFA strings are not null terminated.
      597  + Nevertheless, this capability does provide a useful starting point for conversion to safer \CFA strings.
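The C side of the C Compatibility example can be traced with a small sketch in standard C. The function names `build_example` and `format_counted` are hypothetical and are not part of \CFA or of the changeset. The second function shows one common C idiom, assumed here for illustration only, for printing text that is not null terminated: passing the length as an explicit precision with `%.*s`.

```c
#include <stdio.h>
#include <string.h>

/* Runs the strcpy/strncpy/strcat/strncat sequence from the example on a
   caller-supplied buffer (at least 10 bytes) and returns the final length.
   (Hypothetical helper name.) */
size_t build_example( char * s ) {
    strcpy( s, "abc" );           /* s is "abc" */
    strncpy( s, "abcdef", 3 );    /* copies 3 characters and adds no '\0';
                                     s is still "abc" because s[3] is already '\0' */
    strcat( s, "xyz" );           /* s is "abcxyz" */
    strncat( s, "uvwxyz", 3 );    /* appends 3 characters plus '\0'; s is "abcxyzuvw" */
    return strlen( s );
}

/* Formats len characters of data, which need not be null terminated,
   into out. The precision in "%.*s" bounds how many characters are read.
   Returns the snprintf result. (Hypothetical helper name.) */
int format_counted( char * out, size_t outsz, const char * data, size_t len ) {
    return snprintf( out, outsz, "%.*s", (int)len, data );
}
```

Calling `build_example` on a 32-byte buffer, as in the example, leaves nine characters in the buffer. `format_counted` is only a C-level workaround for length-delimited text; how \CFA itself prints a `string` is covered by the I/O operators described elsewhere in the changeset.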
 617  598
 618  599
   …    …
 625  606    If the base string value shortens so that its end is before the starting location of a substring, resulting in the substring starting location disappearing, the substring becomes a null string located at the end of the base string.
 626  607
 627       - The following example illustrates passing the results of substring operations by reference and by value to a subprogram.
      608  + \VRef[Figure]{f:ParameterPassing} shows passing the results of substring operations by reference and by value to a subprogram.
 628  609    Notice the side-effects to other reference parameters as one is modified.
 629       - \begin{cfa}
 630       - main() {
 631       - string x = "xxxxxxxxxxxxx";
 632       - test( x, x(1,3), x(3,3), x(5,5), x(9,5), x(9,5) );
 633       - }
 634       -
      610  +
      611  + \begin{figure}
      612  + \begin{cfa}
 635  613    // x, a, b, c, & d are substring results passed by reference
 636  614    // e is a substring result passed by value
   …    …
 645  623    x = e; $\C{// eeexx eeex exx x eeexx}$
 646  624    }
 647       - \end{cfa}
 648       -
 649       -
 650       - \subsection{Input/Output Operators}
 651       -
 652       - Both the \CC operators @<<@ and @>>@ are defined on type @string@.
 653       - However, input of a string value is different from input of a @char *@ value.
 654       - When a string value is read, \emph{all} input characters from the current point in the input stream to either the end of line (@'\n'@) or the end of file are read.
 655       -
 656       -
 657       - \section{Implementation}
      625  + int main() {
      626  + string x = "xxxxxxxxxxxxx";
      627  + test( x, x(1,3), x(3,3), x(5,5), x(9,5), x(9,5) );
      628  + }
      629  + \end{cfa}
      630  + \caption{Parameter Passing}
      631  + \label{f:ParameterPassing}
      632  + \end{figure}
      633  +
      634  +
      635  + \subsection{I/O Operators}
      636  +
      637  + The ability to read and print strings is as essential as for any other type.
      638  + The goal for character I/O is to work with groups rather than individual characters.
      639  + A comparison with \CC string I/O is presented as a counterpoint to \CFA string I/O.
      640  +
      641  + The \CC output @<<@ and input @>>@ operators are defined on type @string@.
      642  + \CC output for @char@, @char *@, and @string@ are similar.
      643  + The \CC manipulators are @setw@, and its associated width controls @left@, @right@ and @setfill@.
      644  + \begin{cquote}
      645  + \setlength{\tabcolsep}{15pt}
      646  + \begin{tabular}{@{}l|l@{}}
      647  + \begin{c++}
      648  + string s = "abc";
      649  + cout << setw(10) << left << setfill( 'x' ) << s << endl;
      650  + \end{c++}
      651  + &
      652  + \begin{c++}
      653  +
      654  + "abcxxxxxxx"
      655  + \end{c++}
      656  + \end{tabular}
      657  + \end{cquote}
      658  +
      659  + The \CFA input/output operator @|@ is defined on type @string@.
      660  + \CFA output for @char@, @char *@, and @string@ are the similar.
      661  + The \CFA manipulators are @bin@, @oct@, @hex@, @wd@, and its associated width control and @left@.
      662  + \begin{cquote}
      663  + \setlength{\tabcolsep}{15pt}
      664  + \begin{tabular}{@{}l|l@{}}
      665  + \begin{cfa}
      666  + string s = "abc";
      667  + sout | bin( s ) | nl
      668  + | oct( s ) | nl
      669  + | hex( s ) | nl
      670  + | wd( 10, s ) | nl
      671  + | wd( 10, 2, s ) | nl
      672  + | left( wd( 10, s ) );
      673  + \end{cfa}
      674  + &
      675  + \begin{cfa}
      676  +
      677  + "0b1100001 0b1100010 0b1100011"
      678  + "0141 0142 0143"
      679  + "0x61 0x62 0x63"
      680  + " abc"
      681  + " ab"
      682  + "abc "
      683  + \end{cfa}
      684  + \end{tabular}
      685  + \end{cquote}
      686  +
      687  + \CC input matching for @char@, @char *@, and @string@ are the similar, where \emph{all} input characters are read from the current point in the input stream to the end of the type size, format width, whitespace, end of line (@'\n'@), or end of file.
      688  + The \CC manipulator is @setw@ to restrict the size.
      689  + Reading into a @char@ is safe as the size is 1, @char *@ is unsafe without using @setw@ to constraint the length (which includes @'\0'@), @string@ is safe as its grows dynamically as characters are read.
      690  + \begin{cquote}
      691  + \setlength{\tabcolsep}{15pt}
      692  + \begin{tabular}{@{}l|l@{}}
      693  + \begin{c++}
      694  + char ch, c[10];
      695  + string s;
      696  + cin >> ch >> setw( 5 ) >> c >> s;
      697  + abcde fg
      698  + \end{c++}
      699  + &
      700  + \begin{c++}
      701  +
      702  +
      703  + 'a' "bcde" "fg"
      704  +
      705  + \end{c++}
      706  + \end{tabular}
      707  + \end{cquote}
      708  + Input text can be gulped from the current point to an arbitrary delimiter character using @getline@, which reads whitespace.
      709  +
      710  + The \CFA philosophy for input is that for every constant type in C, these constants should be usable as input.
      711  + For example, the complex constant @3.5+4.1i@ can appear as input to a complex variable.
      712  + \CFA input matching for @char@, @char *@, and @string@ are similar.
      713  + C-strings may only be read with a width field, which should match the string size.
      714  + Certain input manipulators support a scanset, which is a simple regular expression from @printf@.
      715  + The \CFA manipulators for these types are @wdi@\footnote{Due to an overloading issue in the type-resolver, the input width name must be temporarily different from the output, \lstinline{wdi} versus \lstinline{wd}.},
      716  + and its associated width control and @left@, @quote@, @incl@, @excl@, and @getline@.
      717  + \begin{cquote}
      718  + \setlength{\tabcolsep}{10pt}
      719  + \begin{tabular}{@{}l|l@{}}
      720  + \begin{c++}
      721  + char ch, c[10];
      722  + string s;
      723  + sin | ch | wdi( 5, c ) | s;
      724  + abcde fg
      725  + sin | quote( ch ) | quote( wdi( sizeof(c), c ) ) | quote( s, '[', ']' ) | nl;
      726  + $'a' "bcde" [fg]$
      727  + sin | incl( "a-zA-Z0-9 ?!&\n", s ) | nl;
      728  + x?&000xyz TOM !.
      729  + sin | excl( "a-zA-Z0-9 ?!&\n", s );
      730  + <>{}{}STOP
      731  + \end{c++}
      732  + &
      733  + \begin{c++}
      734  +
      735  +
      736  + 'a' "bcde" [fg]
      737  +
      738  + 'a' "bcde" [fg]
      739  +
      740  + "x?&000xyz TOM !"
      741  +
      742  + "<>{}{}"
      743  +
      744  + \end{c++}
      745  + \end{tabular}
      746  + \end{cquote}
      747  +
      748  +
      749  +
      750  + \subsection{Assignment}
 658  751
 659  752    While \VRef[Figure]{f:StrApiCompare} emphasizes cross-language similarities, it elides many specific operational differences.
   …    …
1005 1098
1006 1099    Object lifecycle events are the \emph{subscription-management} triggers in such a service.
1007       - There are two fundamental string-creation routines: importing external text like a C-string or reading a string, and initialization from an existing \CFA string.
     1100  + There are two fundamental string-creation functions: importing external text like a C-string or reading a string, and initialization from an existing \CFA string.
1008 1101    When importing, storage comes from the end of the buffer, into which the text is copied.
1009 1102    The new string handle is inserted at the end of the handle list because the new text is at the end of the buffer.
   …    …
1218 1311    Here, \emph{reusing a logical allocation}, means that the program variable, into which the user is concatenating, previously held a long string.
1219 1312    In general, a user should not have to care about this difference, yet the STL performs differently in these cases.
1220       - Furthermore, if a routine takes a string by reference, if cannot use the fresh approach.
     1313  + Furthermore, if a function takes a string by reference, if cannot use the fresh approach.
1221 1314    Concretely, both cases incur the cost of copying characters into the target string, but only the allocation-fresh case incurs a further reallocation cost, which is generally paid at points of doubling the length.
1222 1315    For the STL, this cost includes obtaining a fresh buffer from the memory allocator and copying older characters into the new buffer, while \CFA-sharing hides such a cost entirely.
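The reallocation cost "paid at points of doubling the length" can be made concrete with a generic doubling buffer. This is a sketch under stated assumptions: it models neither the STL implementation nor \CFA string sharing, and the type `Buf` and the functions `buf_append` and `buf_clear` are hypothetical names introduced only for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* A growable character buffer that counts its reallocations.
   (Hypothetical type; not the STL or \CFA representation.) */
typedef struct {
    char * data;       /* heap storage, not null terminated */
    size_t len;        /* characters in use */
    size_t cap;        /* characters allocated */
    unsigned grows;    /* number of reallocations performed */
} Buf;

/* Append n characters, doubling the capacity until the text fits.
   Returns 0 on success and -1 if the allocator fails. */
int buf_append( Buf * b, const char * src, size_t n ) {
    if ( b->len + n > b->cap ) {
        size_t cap = b->cap ? b->cap : 1;
        while ( cap < b->len + n ) cap *= 2;
        char * p = realloc( b->data, cap );   /* may copy the older characters */
        if ( p == NULL ) return -1;
        b->data = p;
        b->cap = cap;
        b->grows += 1;
    }
    memcpy( b->data + b->len, src, n );       /* copy cost paid in both cases */
    b->len += n;
    return 0;
}

/* Reuse the logical allocation: keep the storage, discard the contents. */
void buf_clear( Buf * b ) {
    b->len = 0;
}
```

Appending one character at a time to a fresh `Buf` grows the storage at capacities 1, 2, 4, and so on. After `buf_clear`, the same appends copy characters but perform no reallocation, which is the fresh-versus-reuse difference the text describes.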