Changeset b0296dba for doc/theses/mike_brooks_MMath/string.tex
- Timestamp:
- Apr 13, 2025, 9:16:33 PM (5 months ago)
- Branches:
- master
- Children:
- d9aee90
- Parents:
- 5ad6f0d
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
doc/theses/mike_brooks_MMath/string.tex
r5ad6f0d rb0296dba 37 37 \end{figure} 38 38 39 As mentioned in \VRef{s:String}, a C string differs from other string types as ituses null termination rather than a length, which leads to explicit storage management;40 hence, most of its group operations are error prone and expensive .39 As mentioned in \VRef{s:String}, a C string uses null termination rather than a length, which leads to explicit storage management; 40 hence, most of its group operations are error prone and expensive due to copying. 41 41 Most high-level string libraries use a separate length field and specialized storage management to implement group operations. 42 42 Interestingly, \CC strings retain null termination in case it is needed to interface with C library functions. … … 61 61 A \CFA string manages its length separately from the string, so there is no null (@'\0'@) terminating value at the end of a string value. 62 62 Hence, a \CFA string cannot be passed to a C string manipulation function, such as @strcat@. 63 Like C strings, characters in a @string@ are numbered from the left starting at 0 , and in \CFA numbered from the right starting at -1.64 \begin{cquote} 65 \ sf63 Like C strings, characters in a @string@ are numbered from the left starting at 0 (because subscripting is zero-origin), and in \CFA numbered from the right starting at -1. 64 \begin{cquote} 65 \rm 66 66 \begin{tabular}{@{}rrrrll@{}} 67 67 \small\tt "a & \small\tt b & \small\tt c & \small\tt d & \small\tt e" \\ … … 70 70 \end{tabular} 71 71 \end{cquote} 72 The following operations have been defined to manipulate an instance of type @string@. 73 The discussion assumes the following declarations and assignment statements are executed. 72 The following operations manipulate an instance of type @string@, where the discussion assumes the following declarations. 74 73 \begin{cfa} 75 74 #include @<string.hfa>@ 76 75 @string@ s = "abcde", name = "MIKE", digit = "0123456789"; 77 const char cs[ ] = "abc";76 const char cs[$\,$] = "abc"; 78 77 int i; 79 78 \end{cfa} … … 89 88 \begin{tabular}{@{}l|ll|l@{}} 90 89 \begin{cfa} 91 // string s = 5;92 93 94 95 96 90 string s; 91 s = 'x'; 92 s = "abc"; 93 s = cs; 94 s = 45hh; 95 s = 45h; 97 96 \end{cfa} 98 97 & … … 160 159 161 160 The @len@ operation (short for @strlen@) returns the length of a C or \CFA string. 162 For co nsistency, @strlen@ also works with \CFA strings.161 For compatibility, @strlen@ also works with \CFA strings. 163 162 \begin{cquote} 164 163 \setlength{\tabcolsep}{15pt} … … 189 188 The binary relational, @<@, @<=@, @>@, @>=@, and equality, @==@, @!=@, operators compare \CFA string values using lexicographical ordering, where longer strings are greater than shorter strings. 190 189 In C, these operators compare the C string pointer not its value, which does not match programmer expectation. 191 C strings use function @strcmp@ , as the relational/equality operator for string values.190 C strings use function @strcmp@ to lexicographically compare the string value. 192 191 193 192 194 193 \subsection{Concatenation} 195 194 196 The binary operators @+@ and @+=@ concatenate characters, C stringsand \CFA strings, creating the sum of the characters.197 \ begin{cquote}195 The binary operators @+@ and @+=@ concatenate C @char@, @char *@ and \CFA strings, creating the sum of the characters. 196 \par\noindent 198 197 \begin{tabular}{@{}l|l@{\hspace{15pt}}l|l@{\hspace{15pt}}l|l@{}} 199 198 \begin{cfa} … … 245 244 \end{cfa} 246 245 \end{tabular} 247 \ end{cquote}248 For these operations to meet programmer expectations, \CFA introduces two C non-backward compatibilities. 249 Note, subtracting pointers or characters has a low-level use-case. 246 \par\noindent 247 However, including @<string.hfa>@ can result in ambiguous uses of the overloaded @+@ operator.\footnote{Combining multiple packages in any programming language can result in name clashes or ambiguities.} 248 While subtracting characters or pointers has a low-level use-case 250 249 \begin{cfa} 251 250 ch - '0' $\C[2in]{// find character offset}$ 252 c p1 - cp2; $\C{// find pointer offset}\CRT$253 \end{cfa} 254 However, there is no obvious use case for addition. 251 cs - cs2; $\C{// find pointer offset}\CRT$ 252 \end{cfa} 253 addition is less obvious 255 254 \begin{cfa} 256 255 ch + 'b' $\C[2in]{// add character values}$ 257 cp1 + 'a'; $\C{// move pointer cp1['a']}\CRT$ 258 \end{cfa} 259 Adding character values or advancing a pointer with a character are unusual operations, and hence, unlikely to existing in C programs. 260 There is a legitimate use case for arithmetic on @signed@/@unsigned@ characters (bytes), but these type are treated differently from @char@ in \CC and \CFA. 261 However, for backwards compatibility reasons it is impossible to restrict or remove arithmetic on type @char@. 262 Stealing these two cases for use with strings, allows all combinations of concatenation among @char@, @char *@, and @string@. 263 Note, stealing only occurs if a program includes @<string.hfa>@, resulting is ambiguities in existing C code where there is no way to disambiguate. 264 \begin{cfa} 265 ch = 'a' + 'b'; $\C[2in]{// LHS disambiguate, add character values}$ 266 s = 'a' + 'b'; $\C{// LHS disambiguate, concatenation characters}$ 267 sout | 'a' + 'b'; $\C{// ambiguous with <string.hfa>, add or concatenate?}$ 268 sout | (char)'a' + 'b'; $\C{// disambiguate}$ 269 sout | "a" + "b"; $\C{// disambiguate}\CRT$ 270 \end{cfa} 271 Again, introducing disambiguates for this scenario are rare, as adding characters is uncommon. 272 273 \CC cannot support this generality because it does not use the left-hand side of assignment in expression resolution. 256 cs + 'a'; $\C{// move pointer cs['a']}\CRT$ 257 \end{cfa} 258 There is a legitimate use case for arithmetic with @signed@/@unsigned@ characters (bytes), but these types are treated differently from @char@ in \CC and \CFA. 259 However, backwards compatibility makes is impossible to restrict or remove addition on type @char@. 260 Similarly, it is impossible to restrict or remove addition on type @char *@ because (unfortunately) it is subscripting: @cs + 'a'@ implies @cs['a']@ or @'a'[cs]@. 261 262 Fortunately, the prior concatenation examples show complex mixed-mode interactions among @char@, @char *@, and @string@ (variables are the same as constants) work correctly. 263 The reason is that the \CFA type-system handles this kind of overloading well using the left-hand assignment-type and complex conversion costs. 264 Hence, the type system correctly handles all uses of addition (explicit or implicit) for @char *@. 265 \begin{cfa} 266 printf( "%s %s %s %c %c\n", "abc", cs, cs + 3, cs['a'], 'a'[cs] ); 267 \end{cfa} 268 Only @char@ addition can result in ambiguities, and only when there is no left-hand information. 269 \begin{cfa} 270 ch = ch + 'b'; $\C[2in]{// LHS disambiguate, add character values}$ 271 s = 'a' + 'b'; $\C{// LHS disambiguate, concatenate characters}$ 272 printf( "%c\n", @'a' + 'b'@ ); $\C[2in]{// no LHS information, ambiguous}$ 273 printf( "%c\n", @(return char)@('a' + 'b') ); $\C{// disambiguate with ascription cast}$ 274 \end{cfa} 275 The ascription cast, @(return T)@, disambiguates by stating a (LHS) type to use during expression resolution (not a conversion). 276 Fortunately, character addition without LHS information is rare in C/\CFA programs, so repurposing the operator @+@ for @string@ types is not a problem. 277 Note, other programming languages that repurpose @+@ for concatenation, could have a similar ambiguity issue. 278 279 Interestingly, \CC cannot support this generality because it does not use the left-hand side of assignment in expression resolution. 274 280 While it can special case some combinations: 275 281 \begin{c++} … … 288 294 The binary operators @*@ and @*=@ repeat a string $N$ times. 289 295 If $N = 0$, a zero length string, @""@, is returned. 290 Like concatenation, multiplication is stolen for @char@; 291 multiplication for pointers does not exist in C. 292 \begin{cquote} 293 \setlength{\tabcolsep}{15pt} 294 \begin{tabular}{@{}l|l@{}} 295 \begin{cfa} 296 \begin{cquote} 297 \setlength{\tabcolsep}{15pt} 298 \begin{tabular}{@{}l|l@{}} 299 \begin{cfa} 300 s = 'x' * 0; 296 301 s = 'x' * 3; 297 302 s = "abc" * 3; … … 300 305 & 301 306 \begin{cfa} 307 " 302 308 "xxx" 303 309 "abcabcabc" … … 306 312 \end{tabular} 307 313 \end{cquote} 314 Like concatenation, there is a potential ambiguity with multiplication of characters; 315 multiplication for pointers does not exist in C. 316 \begin{cfa} 317 ch = ch * 3; $\C[2in]{// LHS disambiguate, multiply character values}$ 318 s = 'a' * 3; $\C{// LHS disambiguate, concatenate characters}$ 319 printf( "%c\n", @'a' * 3@ ); $\C[2in]{// no LHS information, ambiguous}$ 320 printf( "%c\n", @(return char)@('a' * 3) ); $\C{// disambiguate with ascription cast}$ 321 \end{cfa} 322 Fortunately, character multiplication without LHS information is even rarer than addition, so repurposing the operator @*@ for @string@ types is not a problem. 308 323 309 324 … … 355 370 If the substring request extends beyond the beginning or end of the string, it is clipped (shortened) to the bounds of the string. 356 371 If the substring request is completely outside of the original string, a null string is returned. 357 The pattern 372 The pattern-form either returns the pattern string is the pattern matches or a null string if the pattern does not match. 358 373 The usefulness of this mechanism is discussed next. 359 374 360 The substring operation can a lso appear on the left side of an assignment and replaced by the string value on the right side.361 The length of the right string may be shorter, the same length, or longer than the length of left string.375 The substring operation can appear on the left side of assignment, where it defines a replacement substring. 376 The length of the right string may be shorter, the same, or longer than the length of left string. 362 377 Hence, the left string may decrease, stay the same, or increase in length. 363 378 \begin{cquote} … … 398 413 Extending the pattern to a regular expression is a possible extension. 399 414 400 The replace operation returns a string in which all occurrences of a substring are replaced by another string.415 The replace operation extensions substring to substitute all occurrences. 401 416 \begin{cquote} 402 417 \setlength{\tabcolsep}{15pt} … … 420 435 \subsection{Searching} 421 436 422 The find operation returns the position of the first occurrence of a key stringin a string.423 If the key does not appear in the string, the length of the string plus oneis returned.437 The find operation returns the position of the first occurrence of a key in a string. 438 If the key does not appear in the string, the length of the string is returned. 424 439 \begin{cquote} 425 440 \setlength{\tabcolsep}{15pt} … … 428 443 i = find( digit, '3' ); 429 444 i = find( digit, "45" ); 430 string x = "567"; 431 i = find( digit, x ); 445 i = find( digit, "abc" ); 432 446 \end{cfa} 433 447 & … … 435 449 3 436 450 4 437 438 5 439 \end{ cfa}440 \end{ tabular}441 \end{cquote} 442 The character-class operationsindicate if a string is composed completely of a particular class of characters, \eg, alphabetic, numeric, vowels, \etc.451 10 452 \end{cfa} 453 \end{tabular} 454 \end{cquote} 455 456 A character-class operation indicate if a string is composed completely of a particular class of characters, \eg, alphabetic, numeric, vowels, \etc. 443 457 \begin{cquote} 444 458 \setlength{\tabcolsep}{15pt} … … 457 471 \end{tabular} 458 472 \end{cquote} 459 @vowels@ defines a character class and function @include@ checks if all characters in the string a re includedin the class (compliance).460 The position of the last character plus 1 is returnif the string is compliant or the position of the first non-compliant character.473 @vowels@ defines a character class and function @include@ checks if all characters in the string appear in the class (compliance). 474 The position of the last character is returned if the string is compliant or the position of the first non-compliant character. 461 475 There is no relationship between the order of characters in the two strings. 462 476 Function @exclude@ is the reverse of @include@, checking if all characters in the string are excluded from the class (compliance). … … 495 509 \end{cquote} 496 510 497 The test operation checks if each character in a string is in one of the C character classes.\footnote{It is part of the hereditary madness of C that these function take and return an \lstinline{int} rather than a \lstinline{char}.}498 \begin{cquote} 499 \setlength{\tabcolsep}{15pt} 500 \begin{tabular}{@{}l|l@{}} 501 \begin{cfa} 502 i = test( "1FeC34aB", @isxdigit@ );503 i = test( ".,;'!\"", @ispunct@ );504 i = test( "XXXx", @isupper@ );511 There are versions of @include@ and @exclude@, returning a position or string, taking a validation function, like one of the C character-class routines.\footnote{It is part of the hereditary of C that these function take and return an \lstinline{int} rather than a \lstinline{char}, which affects the function type.} 512 \begin{cquote} 513 \setlength{\tabcolsep}{15pt} 514 \begin{tabular}{@{}l|l@{}} 515 \begin{cfa} 516 i = include( "1FeC34aB", @isxdigit@ ); 517 i = include( ".,;'!\"", @ispunct@ ); 518 i = include( "XXXx", @isupper@ ); 505 519 \end{cfa} 506 520 & … … 512 526 \end{tabular} 513 527 \end{cquote} 514 The position of the last character plus 1 is return if the string is compliant or the position of the first non-compliant character. 515 516 Combining substring and search allows actions like trimming whitespace from the start of a line. 517 \begin{cquote} 518 \setlength{\tabcolsep}{15pt} 519 \begin{tabular}{@{}l|l@{}} 520 \begin{cfa} 521 string line = " \t xxx yyy zzz"; 522 string trim = line( test( line, isspace ) ); 523 \end{cfa} 524 & 525 \begin{cfa} 526 527 "xxx yyy zzz" 528 \end{cfa} 529 \end{tabular} 530 \end{cquote} 528 These operations perform an apply of the validation function to each character, and it returns a boolean indicating a stopping condition. 529 The position of the last character is returned if the string is compliant or the position of the first non-compliant character. 531 530 532 531 The translate operation returns a string with each character transformed by one of the C character transformation functions. … … 551 550 552 551 553 \subsection{Returning N+1 on Search Failure} 554 555 String search functions can fail to find the key in the target string. 556 The failure must be returned as an alternate outcome, possibly an exception. 557 Many string types use a return code to indicate the failure, such as @0@ or @-1@ (PL/I~\cite{PLI} returns @0@). 558 \CFA adopts the approach used by the index-of function in APL~\cite{apl}, which returns length of the target string plus 1 ($N+1$). 559 560 When a search is performed to find the starting location for a substring operation, returning $N+1$ is arguably the best choice. 561 For example, in extracting words from a string, it is necessary to scan from left to right over whitespace until the first alphabetic character is found. 562 \begin{cfa} 563 line = line( exclude( line, alpha ) ); // find start of word 564 \end{cfa} 565 If the line contains all whitespace and @exclude@ returns 0 or -1, the result of the substring is unclear. 566 Most string types generate an error, or clip the starting value to 1, resulting in the entire whitespace string being selected. 567 This behaviour leads to the awkward pattern: 568 \begin{cfa} 569 i = exclude( line, alpha ); 570 if ( i != -1 ) line = line( i ); 571 else line = ""; 572 \end{cfa} 573 If @exclude@ returns $N+1$, the starting position for the substring operation is beyond the end of the string leaving a null string. 574 This scenario is repeated when scanning off the word. 575 \begin{cfa} 576 word = line( 0, include( line, alpha ) - 1 ); // scan off word 577 \end{cfa} 578 If the entire line is composed of a word, the @include@ fails to find a non-alphabetic character, resulting in the same awkward pattern. 552 \subsection{Returning N on Search Failure} 553 554 Some of the prior string operations are composite, \eg string operations returning the longest substring of compliant characters (@include@) are built using a search and then substring the appropriate text. 555 However, string search can fail, which is reported as an alternate search outcome, possibly an exception. 556 Many string libraries use a return code to indicate search failure, with a failure value of @0@ or @-1@ (PL/I~\cite{PLI} returns @0@). 557 This semantics leads to the awkward pattern, which can appear many times in a string library or user code. 558 \begin{cfa} 559 i = exclude( s, alpha ); 560 if ( i != -1 ) return s( 0, i ); 561 else return ""; 562 \end{cfa} 563 564 \CFA also adopts a return code but the failure value is taken from the index-of function in APL~\cite{apl}, which returns the length of the target string $N$ (or $N+1$ for 1 origin). 565 This semantics allows many search and substring functions to be written without conditions, \eg: 566 \begin{cfa} 567 string include( const string & s, int (*f)( int ) ) { return @s( 0, include( s, f ) )@; } 568 string exclude( const string & s, int (*f)( int ) ) { return @s( 0, exclude( s, f ) )@; } 569 \end{cfa} 579 570 In string systems with an $O(1)$ length operator, checking for failure is low cost. 580 571 \begin{cfa} 581 572 if ( include( line, alpha ) == len( line ) ) ... // not found, 0 origin 582 573 \end{cfa} 574 \VRef[Figure]{f:ExtractingWordsText} compares \CC and \CFA string code for extracting words from a line of text, repeatedly removing non-word text and then a word until the line is empty. 575 The \CFA code is simpler solely because of the choice for indicating search failure. 576 (It is possible to simplify the \CC version by concatenating a sentinel character at the end of the line so the call to @find_first_not_of@ does not fail.) 577 578 \begin{figure} 579 \begin{cquote} 580 \setlength{\tabcolsep}{15pt} 581 \begin{tabular}{@{}l|l@{}} 582 \multicolumn{1}{c}{\textbf{\CC}} & \multicolumn{1}{c}{\textbf{\CFA}} \\ 583 \begin{cfa} 584 for ( ;; ) { 585 string::size_type posn = line.find_first_of( alpha ); 586 if ( posn == string::npos ) break; 587 line = line.substr( posn ); 588 posn = line.find_first_not_of( alpha ); 589 if ( posn != string::npos ) { 590 cout << line.substr( 0, posn ) << endl; 591 line = line.substr( posn ); 592 } else { 593 cout << line << endl; 594 line = ""; 595 } 596 } 597 \end{cfa} 598 & 599 \begin{cfa} 600 for ( ;; ) { 601 size_t posn = exclude( line, alpha ); 602 if ( posn == len( line ) ) break; 603 line = line( posn ); 604 posn = include( line, alpha ); 605 606 sout | line( 0, posn ); 607 line = line( posn ); 608 609 610 611 612 } 613 \end{cfa} 614 \end{tabular} 615 \end{cquote} 616 \caption{Extracting Words from Line of Text} 617 \label{f:ExtractingWordsText} 618 \end{figure} 583 619 584 620 … … 589 625 \begin{cfa} 590 626 char s[32]; // string s; 627 strlen( s ); 628 strnlen( s, 3 ); 629 strcmp( s, "abc" ); 630 strncmp( s, "abc", 3 ); 591 631 strcpy( s, "abc" ); 592 632 strncpy( s, "abcdef", 3 ); … … 598 638 599 639 600 \subsection{Parameter Passing}601 602 A substring is treated as a pointer into the base (substringed) string rather than creating a copy of the subtext.603 Hence, if the referenced item is changed, then the pointer sees the change.604 Pointers to the result value of a substring operation are defined to always start at the same location in their base string as long as that starting location exists, independent of changes to themselves or the base string.605 However, if the base string value changes, this may affect the values of one or more of the substrings to that base string.606 If the base string value shortens so that its end is before the starting location of a substring, resulting in the substring starting location disappearing, the substring becomes a null string located at the end of the base string.607 608 \VRef[Figure]{f:ParameterPassing} shows passing the results of substring operations by reference and by value to a subprogram.609 Notice the side-effects to other reference parameters as one is modified.610 611 \begin{figure}612 \begin{cfa}613 // x, a, b, c, & d are substring results passed by reference614 // e is a substring result passed by value615 void test(string &x, string &a, string &b, string &c, string &d, string e) {616 $\C{// x a b c d e}$617 a( 1, 2 ) = "aaa"; $\C{// aaaxxxxxxxxxxx aaax axx xxxxx xxxxx xxxxx}$618 b( 2, 12 ) = "bbb"; $\C{// aaabbbxxxxxxxxx aaab abbb bbxxx xxxxx xxxxx}$619 c( 4, 5 ) = "ccc"; $\C{// aaabbbxcccxxxxxx aaab abbb bbxccc ccxxx xxxxx}$620 c = "yyy"; $\C{// aaabyyyxxxxxx aaab abyy yyy xxxxx xxxxx}$621 d( 1, 3 ) = "ddd"; $\C{// aaabyyyxdddxx aaab abyy yyy dddxx xxxxx}$622 e( 1, 3 ) = "eee"; $\C{// aaabyyyxdddxx aaab abyy yyy dddxx eeexx}$623 x = e; $\C{// eeexx eeex exx x eeexx}$624 }625 int main() {626 string x = "xxxxxxxxxxxxx";627 test( x, x(1,3), x(3,3), x(5,5), x(9,5), x(9,5) );628 }629 \end{cfa}630 \caption{Parameter Passing}631 \label{f:ParameterPassing}632 \end{figure}633 634 635 640 \subsection{I/O Operators} 636 641 637 The ability to read and print strings is as essential as for any other type.638 The goal for character I/O is to work with groups rather than individual characters.642 The ability to input and output strings is as essential as for any other type. 643 The goal for character I/O is to also work with groups rather than individual characters. 639 644 A comparison with \CC string I/O is presented as a counterpoint to \CFA string I/O. 640 645 … … 658 663 659 664 The \CFA input/output operator @|@ is defined on type @string@. 660 \CFA output for @char@, @char *@, and @string@ are thesimilar.665 \CFA output for @char@, @char *@, and @string@ are similar. 661 666 The \CFA manipulators are @bin@, @oct@, @hex@, @wd@, and its associated width control and @left@. 662 667 \begin{cquote} … … 685 690 \end{cquote} 686 691 687 \CC input matching for @char@, @char *@, and @string@ are thesimilar, where \emph{all} input characters are read from the current point in the input stream to the end of the type size, format width, whitespace, end of line (@'\n'@), or end of file.692 \CC input matching for @char@, @char *@, and @string@ are similar, where \emph{all} input characters are read from the current point in the input stream to the end of the type size, format width, whitespace, end of line (@'\n'@), or end of file. 688 693 The \CC manipulator is @setw@ to restrict the size. 689 694 Reading into a @char@ is safe as the size is 1, @char *@ is unsafe without using @setw@ to constraint the length (which includes @'\0'@), @string@ is safe as its grows dynamically as characters are read. … … 695 700 string s; 696 701 cin >> ch >> setw( 5 ) >> c >> s; 697 abcde fg 702 @abcde fg@ 698 703 \end{c++} 699 704 & … … 706 711 \end{tabular} 707 712 \end{cquote} 708 Input text can be gulped from the current point to an arbitrary delimiter character using @getline@, which reads whitespace.709 710 The \CFA philosophy for input is that for every constant type in C, these constants should be usable as input.713 Input text can be gulped, including whitespace, from the current point to an arbitrary delimiter character using @getline@. 714 715 The \CFA philosophy for input is that, for every constant type in C, these constants should be usable as input. 711 716 For example, the complex constant @3.5+4.1i@ can appear as input to a complex variable. 712 717 \CFA input matching for @char@, @char *@, and @string@ are similar. 713 718 C-strings may only be read with a width field, which should match the string size. 714 719 Certain input manipulators support a scanset, which is a simple regular expression from @printf@. 715 The \CFA manipulators for these types are @wdi@\footnote{Due to an overloading issue in the type-resolver, the input width name must be temporarily different from the output, \lstinline{wdi} versus \lstinline{wd}.}, 716 and its associated width control and @left@, @quote@, @incl@, @excl@, and @getline@. 720 The \CFA manipulators for these types are @wdi@,\footnote{Due to an overloading issue in the type-resolver, the input width name must be temporarily different from the output, \lstinline{wdi} versus \lstinline{wd}.} and its associated width control and @left@, @quote@, @incl@, @excl@, and @getline@. 717 721 \begin{cquote} 718 722 \setlength{\tabcolsep}{10pt} … … 722 726 string s; 723 727 sin | ch | wdi( 5, c ) | s; 724 abcde fg 728 @abcde fg@ 725 729 sin | quote( ch ) | quote( wdi( sizeof(c), c ) ) | quote( s, '[', ']' ) | nl; 726 $'a' "bcde" [fg]$ 730 @$'a' "bcde" [fg]$@ 727 731 sin | incl( "a-zA-Z0-9 ?!&\n", s ) | nl; 728 x?&000xyz TOM !. 732 @x?&000xyz TOM !.@ 729 733 sin | excl( "a-zA-Z0-9 ?!&\n", s ); 730 <>{}{}STOP 734 @<>{}{}STOP@ 731 735 \end{c++} 732 736 & … … 734 738 735 739 736 'a' "bcde" [fg]737 738 'a' "bcde" [fg]740 'a' "bcde" "fg" 741 742 'a' "bcde" "fg" 739 743 740 744 "x?&000xyz TOM !" … … 745 749 \end{tabular} 746 750 \end{cquote} 747 751 Note, the ability to read in quoted strings to match with program strings. 752 The @nl@ at the end of an input ignores the rest of the line. 748 753 749 754 … … 751 756 752 757 While \VRef[Figure]{f:StrApiCompare} emphasizes cross-language similarities, it elides many specific operational differences. 753 For example, the @replace@ function selects a substring in the target and substitutes it with the source string, which can be smaller or larger than the substring. 754 \CC performs the modification on the mutable receiver object 755 \begin{cfa} 758 For example, the \CC @replace@ function selects a substring in the target and substitutes it with the source string, which can be smaller or larger than the substring. 759 \CC modifies the mutable receiver object, replacing by position (zero origin) and length. 760 \begin{cquote} 761 \setlength{\tabcolsep}{15pt} 762 \begin{tabular}{@{}l|l@{}} 763 \begin{c++} 756 764 string s1 = "abcde"; 757 s1.replace( 2, 3, "xy" ); $\C[2.25in]{// replace by position (zero origin) and length, mutable}\CRT$ 758 cout << s1 << endl; 759 $\texttt{\small abxy}$ 760 \end{cfa} 761 while Java allocates and returns a new string with the result, leaving the receiver unmodified. 765 s1.replace( 2, 3, "xy" ); 766 \end{c++} 767 & 768 \begin{c++} 769 770 "abxy" 771 \end{c++} 772 \end{tabular} 773 \end{cquote} 774 Java cannot modify the receiver (immutable strings) so it returns a new string, replacing by text. 762 775 \label{p:JavaReplace} 776 \begin{cquote} 777 \setlength{\tabcolsep}{15pt} 778 \begin{tabular}{@{}l|l@{}} 763 779 \begin{java} 764 780 String s = "abcde"; 765 String r = s.replace( "cde", "xy" ); $\C[2.25in]{// replace by text, immutable}$ 766 System.out.println( s + ' ' + r ); 767 $\texttt{\small abcde abxy}$ 781 String r = s.replace( "cde", "xy" ); 768 782 \end{java} 769 % Generally, Java's @String@ type is immutable. 770 Java provides a @StringBuffer@ near-analog that is mutable. 783 & 784 \begin{java} 785 786 "abxy" 787 \end{java} 788 \end{tabular} 789 \end{cquote} 790 Java also provides a mutable @StringBuffer@, replacing by position (zero origin) and length. 791 \begin{cquote} 792 \setlength{\tabcolsep}{15pt} 793 \begin{tabular}{@{}l|l@{}} 771 794 \begin{java} 772 795 StringBuffer sb = new StringBuffer( "abcde" ); 773 sb.replace( 2, 5, "xy" ); $\C[2.25in]{// replace by position, mutable}\CRT$ 774 System.out.println( sb ); 775 $\texttt{\small abxy}$ 796 sb.replace( 2, 5, "xy" ); 776 797 \end{java} 777 However, there are significant differences; 778 \eg, @StringBuffer@'s @substring@ function returns a @String@ copy that is immutable. 779 Finally, the operations between these type are asymmetric, \eg @String@ has @replace@ by text but not replace by position and vice versa for @StringBuffer@. 780 781 More significant operational differences relate to storage management, often appearing through assignment (@target = source@), and are summarized in \VRef[Figure]{f:StrSemanticCompare}. 782 % It calls out the consequences of each language taking a different approach on ``internal'' storage management. 798 & 799 \begin{java} 800 801 "abxy" 802 \end{java} 803 \end{tabular} 804 \end{cquote} 805 However, there are anomalies. 806 @StringBuffer@'s @substring@ returns a @String@ copy that is immutable rather than modifying the receiver. 807 As well, the operations are asymmetric, \eg @String@ has @replace@ by text but not replace by position and vice versa for @StringBuffer@. 808 809 More significant operational differences relate to storage management, often appearing through assignment (@target = source@), and are summarized in \VRef[Figure]{f:StrSemanticCompare}, defining properties: type abstraction, state, symmetry, and referent. 783 810 The following discussion justifies the figure's yes/no entries per language. 784 811 … … 807 834 & Laxed: The target's type is anything string-like; it may have a different status concerning ownership. 808 835 & Strict: The target's type is the same as the source; both strings are equivalent peers concerning ownership. 809 & --& no & yes & yes \\836 & N/A & no & yes & yes \\ 810 837 \hline 811 838 Referent … … 835 862 char s[$\,$] = "abcde"; 836 863 \end{cfa} 837 creates a second-class fixed-sized string-variable, as it can only be used in its lexical context ;838 it cannot be passed by value to string operations or user functions as C array's cannot be copied because there is no string-length information passedto the function.864 creates a second-class fixed-sized string-variable, as it can only be used in its lexical context, \ie it cannot be passed by value to string operations or user functions. 865 The reason is that there is no implicit mechanism to pass the string-length information to the function. 839 866 Therefore, only pointers to strings are first-class, and discussed further. 840 867 \begin{cfa} … … 865 892 The lax symmetry reflects how the validity of @s1@ depends on the content and lifetime of @s@. 866 893 It is common practice in \CC to use the @s1@-style pass by reference, with the understanding that the callee only uses the referenced string for the duration of the call, \ie no side-effect using the parameter. 867 So, when the called function is a constructor, it is typical to use an @s2@-style copy-initialization to string-object-typed member.894 So, when the called function is a constructor, it is typical to use an @s2@-style copy-initialization. 868 895 Exceptions to this pattern are possible, but require the programmer to assure safety where the type system does not. 869 896 The @s3@ initialization is constrained to copy the substring because @c_str@ always provides a null-terminated character, which may be different from the source string. … … 946 973 \input{sharing1.tex} 947 974 Here, the aliasing (@`share@) causes partial changes (subscripting) to flow in both directions. 975 (In the following examples, watch how @s1@ and @s1a@ change together, and @s2@ is independent.) 948 976 \input{sharing2.tex} 949 977 Similarly for complete changes. 950 978 \input{sharing3.tex} 951 952 979 Because string assignment copies the value, RHS aliasing is irrelevant. 953 980 Hence, aliasing of the LHS is unaffected. … … 966 993 The following rules explain aliasing substrings that flow in the opposite direction, large to small. 967 994 968 Growth and shrinkage are natural extensions, as for the text-editor example mentioned earlier, where an empty substring is as real realas an empty string.995 Growth and shrinkage are natural extensions, as for the text-editor example mentioned earlier, where an empty substring is as real as an empty string. 969 996 \input{sharing8.tex} 970 997 … … 983 1010 984 1011 %\input{sharing-demo.tex} 1012 1013 \VRef[Figure]{f:ParameterPassing} shows similar relationships when passing the results of substring operations by reference and by value to a subprogram. 1014 Again, notice the side-effects to other reference parameters as one is modified. 1015 1016 \begin{figure} 1017 \begin{cfa} 1018 // x, a, b, c, & d are substring results passed by reference 1019 // e is a substring result passed by value 1020 void test( string & x, string & a, string & b, string & c, string & d, string e ) { 1021 \end{cfa} 1022 \begin{cquote} 1023 \setlength{\tabcolsep}{2pt} 1024 \begin{tabular}{@{}ll@{}} 1025 \begin{cfa} 1026 1027 a( 0, 2 ) = "aaa"; 1028 b( 1, 12 ) = "bbb"; 1029 c( 4, 5 ) = "ccc"; 1030 c = "yyy"; 1031 d( 0, 3 ) = "ddd"; 1032 e( 0, 3 ) = "eee"; 1033 x = e; 1034 } 1035 \end{cfa} 1036 & 1037 \sf 1038 \setlength{\extrarowheight}{-0.5pt} 1039 \begin{tabular}{@{}llllll@{}} 1040 x & a & b & c & d & e \\ 1041 @"aaaxxxxxxxxx"@ & @"aaax"@ & @"xxx"@ & @"xxxxx"@ & @"xxx"@ & @"xxx"@ \\ 1042 @"aaaxbbbxxxxxx"@ & @"aaax"@ & @"xbbb"@ & @"xxxx"@ & @"xxx"@ & @"xxx"@ \\ 1043 @"aaaxbbbxxxcccxx"@ & @"aaax"@ & @"xbbb"@ & @"xxxccc"@& @"cccxx"@ & @"xxx"@ \\ 1044 @"aaaxbbbyyyxx"@ & @"aaax"@ & @"aaab"@ & @"yyy"@ & @"xx"@ & @"xxx"@ \\ 1045 @"aaaxbbbyyyddd"@ & @"aaax"@ & @"xbbb"@ & @"yyy"@ & @"ddd"@ & @"xxx"@ \\ 1046 @"aaaxbbbyyyddd"@ & @"aaax"@ & @"xbbb"@ & @"yyy"@ & @"ddd"@ & @"eee"@ \\ 1047 @"eee"@ & @""@ & @""@ & @""@ & @"eee"@ \\ 1048 & \\ 1049 \end{tabular} 1050 \end{tabular} 1051 \end{cquote} 1052 \begin{cfa} 1053 int main() { 1054 string x = "xxxxxxxxxxx"; 1055 test( x, x(0, 3), x(2, 3), x(4, 5), x(8, 5), x(8, 5) ); 1056 } 1057 \end{cfa} 1058 \caption{Parameter Passing} 1059 \label{f:ParameterPassing} 1060 \end{figure} 985 1061 986 1062 … … 1077 1153 The heap header and text buffer define a sharing context. 1078 1154 Normally, one global sharing context is appropriate for an entire program; 1079 concurrent exceptions are discussed in \VRef{s: AvoidingImplicitSharing}.1155 concurrent exceptions are discussed in \VRef{s:ControllingImplicitSharing}. 1080 1156 A string is a handle into the buffer and linked into a list. 1081 1157 The list is doubly linked for $O(1)$ insertion and removal at any location. … … 1156 1232 1157 1233 1158 \subsection{Avoiding implicit sharing} 1159 \label{s:AvoidingImplicitSharing} 1160 1161 There are tradeoffs associated with the copy-on-write mechanism. 1162 Several qualitative matters are detailed in \VRef{s:PerformanceAssessment} and the qualitative issue of multi-threaded support is introduced here. 1163 The \CFA string library provides a switch to disable threads allocating from the string buffer, when string sharing is unsafe. 1164 When toggled, string management is moved to the storage allocator, specifically @malloc@/@free@, where the storage allocator is assumed to be thread-safe. 1165 1166 In detail, string sharing has inter-linked string handles, so any participant managing one string is also managing, directly, the neighbouring strings, and from there, a data structure of the ``set of all strings.'' 1167 This string structure is intended for sequential access. 1168 Hence, multiple threads using shared strings need to avoid modifying (concurrently) an instance of this structure (like Java immutable strings). 1169 A positive consequence of this approach is that independent threads can use the sharing buffer without locking overhead. 1170 1171 When the string library is running with sharing disabled, it runs without implicit thread-safety challenges, which is the same as the \CC STL, and with performance goals similar to the STL. 1172 Running with sharing disabled can be thought of as a STL-emulation mode. 1173 Hence, concurrent users of string objects must still bring their own mutual exclusion, but the string library does not add any cross thread uses that are not apparent in a user's code. 1174 1175 The \CFA string library provides the type @string_sharectx@ to control an ambient sharing context for a current thread. 1176 It allows two adjustments: to opt out of sharing entirely or to begin sharing within a private context. 1177 Either way, the chosen mode applies only to the current thread, for the duration of the lifetime of the created @string_sharectx@ object, up to being suspended by child lifetimes of different contexts. 1178 \VRef[Figure]{fig:string-sharectx} illustrates its behaviour. 1179 Executing the example does not produce an interesting outcome. 1180 But the comments indicate when the logical copy operation runs with 1181 \begin{description} 1182 \item[share:] the copy being deferred, as described through the rest of this section (fast), or 1183 \item[copy:] the copy performed eagerly (slow). 1184 \end{description} 1185 Only eager copies can cross @string_sharectx@ boundaries. 1186 The intended use is with stack-managed lifetimes, in which the established context lasts until the current function returns, and affects all functions called that do not create their own contexts. 1187 In this example, the single-letter functions are called in alphabetic order. 1188 The functions @a@, @b@ and @g@ share string character ranges with each other, because they occupy a common sharing-enabled context. 1189 The function @e@ shares within itself (because its is in a sharing-enabled context), but not with the rest of the program (because its context is not occupied by any of the rest of the program). 1190 The functions @c@, @d@ and @f@ never share anything, because they are in a sharing-disabled context. 1191 1234 \subsection{Controlling implicit sharing} 1235 \label{s:ControllingImplicitSharing} 1236 1237 There are tradeoffs associated with sharing and its implicit copy-on-write mechanism. 1238 Several qualitative matters are detailed in \VRef{s:PerformanceAssessment}. 1239 In detail, string sharing has inter-linked string handles, so managing one string is also managing the neighbouring strings, and from there, a data structure of the ``set of all strings.'' 1240 Therefore, it is useful to toggle this capability on or off when it is not providing any application benefit. 1192 1241 1193 1242 \begin{figure} … … 1201 1250 \end{figure} 1202 1251 1252 The \CFA string library provides the type @string_sharectx@ to control an ambient sharing context. 1253 It allows two adjustments: to opt out of sharing entirely or to begin sharing within a private context. 1254 Running with sharing disabled can be thought of as a \CC STL-emulation mode, where each string is dynamically allocated. 1255 The chosen mode applies for the duration of the lifetime of the created @string_sharectx@ object, up to being suspended by child lifetimes of different contexts. 1256 \VRef[Figure]{fig:string-sharectx} illustrates this behaviour by showing the stack frames of a program in execution. 1257 In this example, the single-letter functions are called in alphabetic order. 1258 The functions @a@, @b@ and @g@ share string character ranges with each other, because they occupy a common sharing-enabled context. 1259 The function @e@ shares within itself (because its is in a sharing-enabled context), but not with the rest of the program (because its context is not occupied by any of the rest of the program). 1260 The functions @c@, @d@ and @f@ never share anything, because they are in a sharing-disabled context. 1261 Executing the example does not produce an interesting outcome, but the comments in the picture indicate when the logical copy operation runs with 1262 \begin{description} 1263 \item[share:] the copy being deferred, as described through the rest of this section (fast), or 1264 \item[copy:] the copy performed eagerly (slow). 1265 \end{description} 1266 Only eager copies can cross @string_sharectx@ boundaries. 1267 The intended use is with stack-managed lifetimes, in which the established context lasts until the current function returns, and affects all functions called that do not create their own contexts. 1203 1268 1204 1269 [ TODO: true up with ``is thread local'' (implement that and expand this discussion to give a concurrent example, or adjust this wording) ] 1205 1270 1206 1271 1272 \subsection{Sharing and threading} 1273 1274 The \CFA string library provides no thread safety, the same as \CC string, providing similar performance goals. 1275 Threads can create their own string buffers and avoid passing these strings to other threads, or require that shared strings be immutable, as concurrent reading is safe. 1276 A positive consequence of this approach is that independent threads can use the sharing buffer without locking overhead. 1277 When string sharing amongst threads is required, program-wide string-management can toggled to non-sharing using @malloc@/@free@, where the storage allocator is assumed to be thread-safe. 1278 Finally, concurrent users of string objects can provide their own mutual exclusion. 1279 1280 1207 1281 \subsection{Future work} 1208 1282 1209 To discuss: Unicode 1210 1211 To discuss: Small-string optimization 1283 Implementing the small-string optimization is straightforward, as a string header contains a pointer to the string text in the buffer. 1284 This pointer could be marked with a flag and contain a small string. 1285 However, there is now a conditional check required on the fast-path to switch between small and large string operations. 1286 1287 It might be possible to pack 16- or 32-bit Unicode characters within the same string buffer as 8-bit characters. 1288 Again, locations for identification flags must be found and checked along the fast path to select the correct actions. 1289 Handling utf8 (variable length), is more problematic because simple pointer arithmetic cannot be used to stride through the variable-length characters. 1290 Trying to use a secondary array of fixed-sized pointers/offsets to the characters is possible, but raises the question of storage management for the utf8 characters themselves. 1212 1291 1213 1292
Note:
See TracChangeset
for help on using the changeset viewer.