Changeset fc8ec54 for doc


Ignore:
Timestamp:
Mar 31, 2025, 9:33:29 PM (7 months ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master
Children:
7e8db9e
Parents:
41c3e46
Message:

first attempt at describing string API

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/mike_brooks_MMath/string.tex

    r41c3e46 rfc8ec54  
    4848% Instead, each \CC string is null terminated just in case it might be needed for this purpose.
    4949Providing this backwards compatibility with C has a ubiquitous performance and storage cost.
     50
     51
     52\section{\CFA \lstinline{string} type}
     53\label{s:stringType}
     54
     55The \CFA string type is for manipulation of dynamically-sized character-strings versus C @char *@ type for manipulation of statically-sized null-terminated character-strings.
     56Hence, the amount of storage for a \CFA string changes dynamically at runtime to fit the string size, whereas the amount of storage for a C string is fixed at compile time.
     57As a result, a @string@ declaration does not specify a maximum length, where a C string must.
     58The maximum storage for a \CFA @string@ value is @size_t@ characters, which is $2^{32}$ or $2^{64}$ respectively.
     59A \CFA string manages its length separately from the string, so there is no null (@'\0'@) terminating value at the end of a string value.
     60Hence, a \CFA string cannot be passed to a C string manipulation routine, such as @strcat@.
     61Like C strings, the characters in a @string@ are numbered starting from 0.
     62
     63The following operations have been defined to manipulate an instance of type @string@.
     64The discussion assumes the following declarations and assignment statements are executed.
     65\begin{cfa}
     66#include @<string.hfa>@
     67@string@ s, name, digit, alpha, punctuation, ifstmt;
     68int i;
     69name  = "MIKE";
     70digit  = "0123456789";
     71punctuation = "().,";
     72ifstmt = "IF (A > B) {";
     73\end{cfa}
     74Note, the include file @string.hfa@ to access type @string@.
     75
     76
     77\subsection{Implicit String Conversions}
     78
     79The ability to convert from internal (machine) to external (human) format is useful in situations other than I/O.
     80Hence, the basic types @char@, @char *@, @int@, @double@, @_Complex@, including any signness and size variations, implicitly convert to type @string@.
     81\VRef[Figure]{f:ImplicitConversionsString} shows examples of implicit conversions.
     82Conversions can be explicitly specified using a compound literal:
     83\begin{cfa}
     84s = (string){ "abc" };                          $\C{// converts char * to string}$
     85s = (string){ 5 };                                      $\C{// converts int to string}$
     86s = (string){ 5.5 };                            $\C{// converts double to string}$
     87\end{cfa}
     88Conversions from @string@ to @char *@, attempt to be safe:
     89either by requiring the maximum length of the @char *@ storage (@strncpy@) or allocating the @char *@ storage for the string characters (ownership), meaning the programmer must free the storage.
     90As well, a C string is always null terminates, implying a minimum size of 1 character.
     91\begin{cquote}
     92\setlength{\tabcolsep}{15pt}
     93\begin{tabular}{@{}l|l@{}}
     94\begin{cfa}
     95string s = "abcde";
     96char cs[3];
     97strncpy( cs, s, sizeof(cs) );
     98char * cp = s;
     99delete( cp );
     100cp = s + ' ' + s;
     101delete( cp );
     102\end{cfa}
     103&
     104\begin{cfa}
     105"abcde"
     106
     107"ab\0", in place
     108"abcde\0", malloc
     109
     110"abcde abcde\0", malloc
     111
     112\end{cfa}
     113\end{tabular}
     114\end{cquote}
     115
     116\begin{figure}
     117\begin{tabular}{@{}l|l@{}}
     118\setlength{\tabcolsep}{15pt}
     119\begin{cfa}
     120//      string s = 5;
     121        string s;
     122        // conversion of char and char * to string
     123        s = 'x';
     124        s = "abc";
     125        char cs[5] = "abc";
     126        s = cs;
     127        // conversion of integral, floating-point, and complex to string
     128        s = 45hh;
     129        s = 45h;
     130        s = -(ssize_t)MAX - 1;
     131        s = (size_t)MAX;
     132        s = 5.5;
     133        s = 5.5L;
     134        s = 5.5+3.4i;
     135        s = 5.5L+3.4Li;
     136\end{cfa}
     137&
     138\begin{cfa}
     139
     140
     141
     142"x"
     143"abc"
     144
     145"abc"
     146
     147"45"
     148"45"
     149"-9223372036854775808"
     150"18446744073709551615"
     151"5.5"
     152"5.5"
     153"5.5+3.4i"
     154"5.5+3.4i"
     155\end{cfa}
     156\end{tabular}
     157\caption{Implicit Conversions to String}
     158\label{f:ImplicitConversionsString}
     159\end{figure}
     160
     161
     162\subsection{Length}
     163
     164The @len@ operation returns the length of a string using prefix call.
     165\begin{cquote}
     166\setlength{\tabcolsep}{15pt}
     167\begin{tabular}{@{}l|l@{}}
     168\begin{cfa}
     169const char * cs = "abc";
     170i = ""`len;
     171i = "abc"`len;
     172i = cs`len;
     173i = name`len;
     174\end{cfa}
     175&
     176\begin{cfa}
     177
     1780
     1793
     1803
     1814
     182\end{cfa}
     183\end{tabular}
     184\end{cquote}
     185
     186
     187\subsection{Comparison Operators}
     188
     189The binary relational, @<@, @<=@, @>@, @>=@, and equality, @==@, @!=@, operators compare strings using lexicographical ordering, where longer strings are greater than shorter strings.
     190C strings use function @strcmp@, as the relational/equality operators compare C string pointers not their values, which does not match normal programmer expectation.
     191
     192
     193\subsection{Concatenation}
     194
     195The binary operators @+@ and @+=@ concatenate two strings, creating the sum of the strings.
     196\begin{cquote}
     197\setlength{\tabcolsep}{15pt}
     198\begin{tabular}{@{}l|l@{}}
     199\begin{cfa}
     200s = name + ' ' + digit;
     201s += name;
     202s = s + 'a' + 'b';
     203s = s + "a" + "abc";
     204s = 'a' + 'b' + s;
     205s = "a" + "abc" + s;
     206\end{cfa}
     207&
     208\begin{cfa}
     209"MIKE 0123456789"
     210"MIKE 0123456789MIKE"
     211
     212
     213$\CC$ unsupported
     214$\CC$ unsupported
     215\end{cfa}
     216\end{tabular}
     217\end{cquote}
     218The \CFA type-system allows full  commutativity with character and C strings;
     219\CC does not.
     220
     221
     222\subsection{Repetition}
     223
     224The binary operators @*@ and @*=@ repeat a string $N$ times.
     225If $N = 0$, a zero length string, @""@ is returned.
     226\begin{cquote}
     227\setlength{\tabcolsep}{15pt}
     228\begin{tabular}{@{}l|l@{}}
     229\begin{cfa}
     230s = 'x' * 3;
     231s = "abc" * 3;
     232s = (name + ' ') * 3;
     233\end{cfa}
     234&
     235\begin{cfa}
     236xxx
     237abcabcabc
     238MIKE MIKE MIKE
     239\end{cfa}
     240\end{tabular}
     241\end{cquote}
     242
     243
     244\subsection{Substring}
     245The substring operation returns a subset of the string starting at a position in the string and traversing a length.
     246\begin{cquote}
     247\setlength{\tabcolsep}{15pt}
     248\begin{tabular}{@{}l|l@{}}
     249\begin{cfa}
     250s = name( 2, 2 );
     251s = name( 3, -2 );
     252s = name( 2, 8 );
     253s = name( 0, -1 );
     254s = name( -1, -1 );
     255s = name( -3 );
     256\end{cfa}
     257&
     258\begin{cfa}
     259"KE"
     260"IK", length is opposite direction
     261"KE", length is clipped to 2
     262"", beyond string so clipped to null
     263"K", start $and$ length are negative
     264"IKE", to end of string
     265\end{cfa}
     266\end{tabular}
     267\end{cquote}
     268A negative starting position is a specification from the right end of the string.
     269A negative length means that characters are selected in the opposite (right to left) direction from the starting position.
     270If the substring request extends beyond the beginning or end of the string, it is clipped (shortened) to the bounds of the string.
     271If the substring request is completely outside of the original string, a null string located at the end of the original string is returned.
     272
     273The substring operation can also appear on the left hand side of the assignment operator.
     274The substring is replaced by the value on the right hand side of the assignment.
     275The length of the right-hand-side value may be shorter, the same length, or longer than the length of the substring that is selected on the left hand side of the assignment.
     276\begin{cfa}
     277digit( 3, 3 ) = "";                             $\C{// digit is assigned "0156789"}$
     278digit( 4, 3 ) = "xyz";                          $\C{// digit is assigned "015xyz9"}$
     279digit( 7, 0 ) = "***";                          $\C{// digit is assigned "015xyz***9"}$
     280digit(-4, 3 ) = "$\tt\$\$\$$";          $\C{// digit is assigned "015xyz\$\$\$9"}$
     281\end{cfa}
     282A substring is treated as a pointer into the base (substringed) string rather than creating a copy of the subtext.
     283As with all pointers, if the item they are pointing at is changed, then the pointer is referring to the changed item.
     284Pointers to the result value of a substring operation are defined to always start at the same location in their base string as long as that starting location exists, independent of changes to themselves or the base string.
     285However, if the base string value changes, this may affect the values of one or more of the substrings to that base string.
     286If the base string value shortens so that its end is before the starting location of a substring, resulting in the substring starting location disappearing, the substring becomes a null string located at the end of the base string.
     287
     288The following example illustrates passing the results of substring operations by reference and by value to a subprogram.
     289Notice the side-effects to other reference parameters as one is modified.
     290\begin{cfa}
     291main() {
     292        string x = "xxxxxxxxxxxxx";
     293        test( x, x(1,3), x(3,3), x(5,5), x(9,5), x(9,5) );
     294}
     295
     296// x, a, b, c, & d are substring results passed by reference
     297// e is a substring result passed by value
     298void test(string &x, string &a, string &b, string &c, string &d, string e) {
     299                                                                        $\C{//   x                                a               b               c               d               e}$
     300        a( 1, 2 ) = "aaa";                              $\C{// aaaxxxxxxxxxxx   aaax    axx             xxxxx   xxxxx   xxxxx}$
     301        b( 2, 12 ) = "bbb";                             $\C{// aaabbbxxxxxxxxx  aaab    abbb    bbxxx   xxxxx   xxxxx}$
     302        c( 4, 5 ) = "ccc";                              $\C{// aaabbbxcccxxxxxx aaab    abbb    bbxccc  ccxxx   xxxxx}$
     303        c = "yyy";                                              $\C{// aaabyyyxxxxxx    aaab    abyy    yyy             xxxxx   xxxxx}$
     304        d( 1, 3 ) = "ddd";                              $\C{// aaabyyyxdddxx    aaab    abyy    yyy             dddxx   xxxxx}$
     305        e( 1, 3 ) = "eee";                              $\C{// aaabyyyxdddxx    aaab    abyy    yyy             dddxx   eeexx}$
     306        x = e;                                                  $\C{// eeexx                    eeex    exx             x                               eeexx}$
     307}
     308\end{cfa}
     309
     310There is an assignment form of substring in which only the starting position is specified and the length is assumed to be the remainder of the string.
     311\begin{cfa}
     312string operator () (int start);
     313\end{cfa}
     314For example:
     315\begin{cfa}
     316s = name( 2 );                                          $\C{// s is assigned "ETER"}$
     317name( 2 ) = "IPER";                             $\C{// name is assigned "PIPER"}$
     318\end{cfa}
     319It is also possible to substring using a string as the index for selecting the substring portion of the string.
     320\begin{cfa}
     321string operator () (const string &index);
     322\end{cfa}
     323For example:
     324\begin{cfa}[mathescape=false]
     325digit( "xyz$\$\$\$$" ) = "678";         $\C{// digit is assigned "0156789"}$
     326digit( "234") = "***";                          $\C{// digit is assigned "0156789***"}$
     327\end{cfa}
     328
     329
     330\subsection{Searching}
     331
     332The @index@ operation
     333\begin{cfa}
     334int index( const string &key, int start = 1, occurrence occ = first );
     335\end{cfa}
     336returns the position of the first or last occurrence of the @key@ (depending on the occurrence indicator @occ@ that is either @first@ or @last@) in the current string starting the search at position @start@.
     337If the @key@ does not appear in the current string, the length of the current string plus one is returned.
     338%If the @key@ has zero length, the value 1 is returned regardless of what the current string contains.
     339A negative starting position is a specification from the right end of the string.
     340\begin{cfa}
     341i = digit.index( "567" );                       $\C{// i is assigned 3}$
     342i = digit.index( "567", 7 );            $\C{// i is assigned 11}$
     343i = digit.index( "567", -1, last );     $\C{// i is assigned 3}$
     344i = name.index( "E", 5, last ); $\C{// i is assigned 4}$
     345\end{cfa}
     346
     347The next two string operations test a string to see if it is or is not composed completely of a particular class of characters.
     348For example, are the characters of a string all alphabetic or all numeric?
     349Use of these operations involves a two step operation.
     350First, it is necessary to create an instance of type @strmask@ and initialize it to a string containing the characters of the particular character class, as in:
     351\begin{cfa}
     352strmask digitmask = digit;
     353strmask alphamask = string( "abcdefghijklmnopqrstuvwxyz" );
     354\end{cfa}
     355Second, the character mask is used in the functions @include@ and @exclude@ to check a string for compliance of its characters with the characters indicated by the mask.
     356
     357The @include@ operation
     358\begin{cfa}
     359int include( const strmask &, int = 1, occurrence occ = first );
     360\end{cfa}
     361returns the position of the first or last character (depending on the occurrence indicator, which is either @first@ or @last@) in the current string that does not appear in the @mask@ starting the search at position @start@;
     362hence it skips over characters in the current string that are included (in) the @mask@.
     363The characters in the current string do not have to be in the same order as the @mask@.
     364If all the characters in the current string appear in the @mask@, the length of the current string plus one is returned, regardless of which occurrence is being searched for.
     365A negative starting position is a specification from the right end of the string.
     366\begin{cfa}
     367i = name.include( digitmask );          $\C{// i is assigned 1}$
     368i = name.include( alphamask );          $\C{// i is assigned 6}$
     369\end{cfa}
     370
     371The @exclude@ operation
     372\begin{cfa}
     373int exclude( string &mask, int start = 1, occurrence occ = first )
     374\end{cfa}
     375returns the position of the first or last character (depending on the occurrence indicator, which is either @first@ or @last@) in the current string that does appear in the @mask@ string starting the search at position @start@;
     376hence it skips over characters in the current string that are excluded from (not in) in the @mask@ string.
     377The characters in the current string do not have to be in the same order as the @mask@ string.
     378If all the characters in the current string do NOT appear in the @mask@ string, the length of the current string plus one is returned, regardless of which occurrence is being searched for.
     379A negative starting position is a specification from the right end of the string.
     380\begin{cfa}
     381i = name.exclude( digitmask );          $\C{// i is assigned 6}$
     382i = ifstmt.exclude( strmask( punctuation ) ); $\C{// i is assigned 4}$
     383\end{cfa}
     384
     385The @includeStr@ operation:
     386\begin{cfa}
     387string includeStr( strmask &mask, int start = 1, occurrence occ = first )
     388\end{cfa}
     389returns the longest substring of leading or trailing characters (depending on the occurrence indicator, which is either @first@ or @last@) of the current string that ARE included in the @mask@ string starting the search at position @start@.
     390A negative starting position is a specification from the right end of the string.
     391\begin{cfa}
     392s = name.includeStr( alphamask );       $\C{// s is assigned "MIKE"}$
     393s = ifstmt.includeStr( alphamask );     $\C{// s is assigned "IF"}$
     394s = name.includeStr( digitmask );       $\C{// s is assigned ""}$
     395\end{cfa}
     396
     397The @excludeStr@ operation:
     398\begin{cfa}
     399string excludeStr( strmask &mask, int start = 1, occurrence = first )
     400\end{cfa}
     401returns the longest substring of leading or trailing characters (depending on the occurrence indicator, which is either @first@ or @last@) of the current string that are excluded (NOT) in the @mask@ string starting the search at position @start@.
     402A negative starting position is a specification from the right end of the string.
     403\begin{cfa}
     404s = name.excludeStr( digitmask);        $\C{// s is assigned "MIKE"}$
     405s = ifstmt.excludeStr( strmask( punctuation ) ); $\C{// s is assigned "IF "}$
     406s = name.excludeStr( alphamask);        $\C{// s is assigned ""}$
     407\end{cfa}
     408
     409
     410\subsection{Miscellaneous}
     411
     412The @trim@ operation
     413\begin{cfa}
     414string trim( string &mask, occurrence occ = first )
     415\end{cfa}
     416returns a string in that is the longest substring of leading or trailing characters (depending on the occurrence indicator, which is either @first@ or @last@) which ARE included in the @mask@ are removed.
     417\begin{cfa}
     418// remove leading blanks
     419s = string( "   ABC" ).trim( " " );     $\C{// s is assigned "ABC",}$
     420// remove trailing blanks
     421s = string( "ABC   " ).trim( " ", last ); $\C{// s is assigned "ABC",}$
     422\end{cfa}
     423
     424The @translate@ operation
     425\begin{cfa}
     426string translate( string &from, string &to )
     427\end{cfa}
     428returns a string that is the same length as the original string in which all occurrences of the characters that appear in the @from@ string have been translated into their corresponding character in the @to@ string.
     429Translation is done on a character by character basis between the @from@ and @to@ strings; hence these two strings must be the same length.
     430If a character in the original string does not appear in the @from@ string, then it simply appears as is in the resulting string.
     431\begin{cfa}
     432// upper to lower case
     433name = name.translate( "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz" );
     434                        // name is assigned "name"
     435s = ifstmt.translate( "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz" );
     436                        // ifstmt is assigned "if (a > b) {"
     437// lower to upper case
     438name = name.translate( "abcdefghijklmnopqrstuvwxyz", "ABCDEFGHIJKLMNOPQRSTUVWXYZ" );
     439                        // name is assigned "MIKE"
     440\end{cfa}
     441
     442The @replace@ operation
     443\begin{cfa}
     444string replace( string &from, string &to )
     445\end{cfa}
     446returns a string in which all occurrences of the @from@ string in the current string have been replaced by the @to@ string.
     447\begin{cfa}
     448s = name.replace( "E", "XX" );          $\C{// s is assigned "PXXTXXR"}$
     449\end{cfa}
     450The replacement is done left-to-right.
     451When an instance of the @from@ string is found and changed to the @to@ string, it is NOT examined again for further replacement.
     452
     453\subsection{Returning N+1 on Failure}
     454
     455Any of the string search routines can fail at some point during the search.
     456When this happens it is necessary to return indicating the failure.
     457Many string types in other languages use some special value to indicate the failure.
     458This value is often 0 or -1 (PL/I returns 0).
     459This section argues that a value of N+1, where N is the length of the base string in the search, is a more useful value to return.
     460The index-of function in APL returns N+1.
     461These are the boundary situations and are often overlooked when designing a string type.
     462
     463The situation that can be optimized by returning N+1 is when a search is performed to find the starting location for a substring operation.
     464For example, in a program that is extracting words from a text file, it is necessary to scan from left to right over whitespace until the first alphabetic character is found.
     465\begin{cfa}
     466line = line( line.exclude( alpha ) );
     467\end{cfa}
     468If a text line contains all whitespaces, the exclude operation fails to find an alphabetic character.
     469If @exclude@ returns 0 or -1, the result of the substring operation is unclear.
     470Most string types generate an error, or clip the starting value to 1, resulting in the entire whitespace string being selected.
     471If @exclude@ returns N+1, the starting position for the substring operation is beyond the end of the string leaving a null string.
     472
     473The same situation occurs when scanning off a word.
     474\begin{cfa}
     475start = line.include(alpha);
     476word = line(1, start - 1);
     477\end{cfa}
     478If the entire line is composed of a word, the include operation will  fail to find a non-alphabetic character.
     479In general, returning 0 or -1 is not an appropriate starting position for the substring, which must substring off the word leaving a null string.
     480However, returning N+1 will substring off the word leaving a null string.
     481
     482
     483\subsection{C Compatibility}
     484
     485To ease conversion from C to \CFA, there are companion @string@ routines for C strings.
     486\VRef[Table]{t:CompanionStringRoutines} shows the C routines on the left that also work with @string@ and the rough equivalent @string@ opeation of the right.
     487Hence, it is possible to directly convert a block of C string operations into @string@ just by changing the
     488
     489\begin{table}
     490\begin{cquote}
     491\begin{tabular}{@{}l|l@{}}
     492\multicolumn{1}{c|}{\lstinline{char []}}        & \multicolumn{1}{c}{\lstinline{string}}        \\
     493\hline
     494@strcpy@, @strncpy@             & @=@                                                                   \\
     495@strcat@, @strncat@             & @+@                                                                   \\
     496@strcmp@, @strncmp@             & @==@, @!=@, @<@, @<=@, @>@, @>=@              \\
     497@strlen@                                & @size@                                                                \\
     498@[]@                                    & @[]@                                                                  \\
     499@strstr@                                & @find@                                                                \\
     500@strcspn@                               & @find_first_of@, @find_last_of@               \\
     501@strspc@                                & @find_fist_not_of@, @find_last_not_of@
     502\end{tabular}
     503\end{cquote}
     504\caption{Companion Routines for \CFA \lstinline{string} to C Strings}
     505\label{t:CompanionStringRoutines}
     506\end{table}
     507
     508For example, this block of C code can be converted to \CFA by simply changing the type of variable @s@ from @char []@ to @string@.
     509\begin{cfa}
     510        char s[32];
     511        //string s;
     512        strcpy( s, "abc" );                             PRINT( %s, s );
     513        strncpy( s, "abcdef", 3 );              PRINT( %s, s );
     514        strcat( s, "xyz" );                             PRINT( %s, s );
     515        strncat( s, "uvwxyz", 3 );              PRINT( %s, s );
     516        PRINT( %zd, strlen( s ) );
     517        PRINT( %c, s[3] );
     518        PRINT( %s, strstr( s, "yzu" ) ) ;
     519        PRINT( %s, strstr( s, 'y' ) ) ;
     520\end{cfa}
     521However, the conversion fails with I/O because @printf@ cannot print a @string@ using format code @%s@ because \CFA strings are not null terminated.
     522
     523
     524\subsection{Input/Output Operators}
     525
     526Both the \CC operators @<<@ and @>>@ are defined on type @string@.
     527However, input of a string value is different from input of a @char *@ value.
     528When a string value is read, \emph{all} input characters from the current point in the input stream to either the end of line (@'\n'@) or the end of file are read.
     529
     530
     531\section{Implementation Details}
    50532
    51533While \VRef[Figure]{f:StrApiCompare} emphasizes cross-language similarities, it elides many specific operational differences.
     
    5321014\subsection{Methodology}
    5331015
    534 These tests use randomly generated varying-length strings (string content is immaterial).
    535 A collection of such fragments is a \emph{corpus}.
    536 The mean length of a fragment from a corpus is a typical explanatory variable.
    537 Such a length is used in one of three modes:
     1016These tests use a \emph{corpus} of strings (string content is immaterial).
     1017For varying-length strings, the mean length comes from a geometric distribution, which implies that lengths much longer than the mean occur frequently.
     1018The string sizes are:
    5381019\begin{description}
    539         \item [Fixed-size] means all string fragments are of the stated size.
    540         \item [Varying from 1] means string lengths are drawn from a geometric distribution with the stated mean, and all lengths occur.
    541         \item [Varying from 16] means string lengths are drawn from a geometric distribution with the stated mean, but only lengths 16 and above occur; thus, the stated mean is above 16.
     1020        \item [Fixed-size] all string lengths are of the stated size.
     1021        \item [Varying from 1 to N] means the string lengths are drawn from the geometric distribution with a stated mean and all lengths occur.
     1022        \item [Varying from 16 to N] means string lengths are drawn from the geometric distribution with the stated mean, but only lengths 16 and above occur; thus, the stated mean is above 16.
    5421023\end{description}
    543 The geometric distribution implies that lengths much longer than the mean occur frequently.
    544 The special treatment of length 16 deals with comparison to STL, given that STL has a short-string optimization (see [TODO: write and cross-ref future-work SSO]), currently not implemented in \CFA.
     1024The means for the geometric distribution are the X-axis values in experiments.
     1025The special treatment of length 16 deals with the short-string optimization (SSO) in STL @string@, currently not implemented in \CFA.
     1026When an STL string can fit into a heap pointer, the optimization uses the pointer storage to eliminate using the heap.
     1027\begin{c++}
     1028class string {
     1029        union {
     1030                struct { $\C{// long string, string storage in heap}$
     1031                        size_t size;
     1032                        char * strptr;
     1033                } lstr;
     1034                char sstr[sizeof(lstr)]; $\C{// short string 8-16 characters, in situ}$
     1035        };
     1036        bool tag; $\C{// string kind, short or long}$
     1037        ... $\C{// other storage}$
     1038};
     1039\end{c++}
     1040
    5451041When success is illustrated, notwithstanding SSO, a fixed-size or from-16 distribution ensures that extra-optimized cases are not part of the mix on the STL side.
    546 In all experiments that use a corpus, its text is generated and loaded into the SUT before the timed phase begins.
     1042In all experiments that use a corpus, its text is generated and loaded into the system under test before the timed phase begins.
    5471043
    5481044To discuss: vocabulary for reused case variables
     
    5521048To discuss: hardware and such
    5531049
    554 To discuss: memory allocator
     1050To ensure comparable results, a common memory allocator is used for \CFA and \CC.
     1051The llheap allocator~\cite{Zulfiqar22} is embedded into \CFA and is used standalone with \CC.
    5551052
    5561053
    5571054\subsection{Test: Append}
    5581055
    559 This test measures the speed of appending fragments of text onto a growing string.
    560 Its subcases include both \CFA being similar to STL and their designs offering a tradeoff.
    561 
    562 One experimental variable is logically equivalent operations such as @a = a + b@ \vs @a += b@.
     1056This test measures the speed of appending randomly-size text onto a growing string.
     1057\begin{cquote}
     1058\setlength{\tabcolsep}{20pt}
     1059\begin{tabular}{@{}ll@{}}
     1060% \multicolumn{1}{c}{\textbf{fresh}} & \multicolumn{1}{c}{\textbf{reuse}} \\
     1061\begin{cfa}
     1062
     1063for ( ... ) {
     1064        @string x;@       // fresh
     1065        for ( ... )
     1066                x @+=@ ...
     1067}
     1068\end{cfa}
     1069&
     1070\begin{cfa}
     1071string x;
     1072for ( ... ) {
     1073        @x = "";@  $\C[1in]{// reuse}$
     1074        for ( ... )
     1075                x @+=@ ...  $\C{// append, alternative x = x + ...}\CRT$
     1076}
     1077\end{cfa}
     1078\end{tabular}
     1079\end{cquote}
     1080The benchmark's outer loop executes ``until a sample-worthy amount of execution has happened'' and an inner loop for ``building up the desired-length string.''
     1081Its subcases include,
     1082\begin{enumerate}[leftmargin=*]
     1083\item
     1084\CFA nosharing/sharing \vs \CC nosharing.
     1085\item
     1086Difference between the logically equivalent operations @x += ...@ \vs @x = x + ...@.
    5631087For numeric types, the generated code is equivalence, giving identical performance.
    5641088However, for string types there can be a significant difference in performance, especially if this code appears in a loop iterating a large number of times.
    5651089This difference might not be intuitive to beginners.
    566 
    567 Another experimental variable is whether the user's logical allocation is fresh \vs reused.
    568 Here, \emph{reusing a logical allocation}, means that the program variable, into which the user is concatenating, previously held a long string:
    569 \begin{cquote}
    570 \setlength{\tabcolsep}{20pt}
    571 \begin{tabular}{@{}ll@{}}
    572 \multicolumn{1}{c}{\textbf{fresh}} & \multicolumn{1}{c}{\textbf{reuse}} \\
    573 \begin{cfa}
    574 
    575 for ( ... ) {
    576         @string x;@
    577         for ( ... )
    578                 x += ...
    579 }
    580 \end{cfa}
    581 &
    582 \begin{cfa}
    583 string x;
    584 for ( ... ) {
    585         @x = "";@
    586         for ( ... )
    587                 x += ...
    588 }
    589 \end{cfa}
    590 \end{tabular}
    591 \end{cquote}
    592 
    593 All benchmark drivers have an outer loop for ``until a sample-worthy amount of execution has happened'' and an inner loop for ``building up the desired-length string.''
     1090\item
     1091Coding practice where the user's logical allocation is fresh \vs reused.
     1092Here, \emph{reusing a logical allocation}, means that the program variable, into which the user is concatenating, previously held a long string.
    5941093In general, a user should not have to care about this difference, yet the STL performs differently in these cases.
    5951094Furthermore, if a routine takes a string by reference, if cannot use the fresh approach.
    5961095Concretely, both cases incur the cost of copying characters into the target string, but only the allocation-fresh case incurs a further reallocation cost, which is generally paid at points of doubling the length.
    5971096For the STL, this cost includes obtaining a fresh buffer from the memory allocator and copying older characters into the new buffer, while \CFA-sharing hides such a cost entirely.
    598 The fresh \vs reuse distinction is only relevant in the \emph{append} tests.
     1097%The fresh \vs reuse distinction is only relevant in the \emph{append} tests.
     1098\end{enumerate}
    5991099
    6001100\begin{figure}
     
    6061106\end{figure}
    6071107
    608 The \emph{append} tests use the varying-from-1 corpus construction, \ie they assume the STL's advantage of small-string optimization.
     1108This tests use the varying-from-1 corpus construction, \ie it assumes the STL's advantage of small-string optimization.
    6091109\PAB{To discuss: any other case variables introduced in the performance intro}
    6101110\VRef[Figure]{fig:string-graph-peq-cppemu} shows this behaviour, by the STL and by \CFA in STL emulation mode.
    6111111\CFA reproduces STL's performance, up to a 15\% penalty averaged over the cases shown, diminishing with larger strings, and 50\% in the worst case.
    6121112This penalty characterizes the amount of implementation fine tuning done with STL and not done with \CFA in present state.
    613 \PAB{The larger inherent penalty, for a user mismanaging reuse, is 40\% averaged over the cases shown, is minimally 24\%, shows up consistently between the STL and \CFA implementations, and increases with larger strings.}
     1113There is a larger penalty for redeclaring the string each loop iteration (fresh) versus hosting it out of the loop and reseting it to the null string (reuse).
     1114The cost is 40\% averaged over the cases shown and minimally 24\%, and shows up consistently between the \CFA and STL implementations, and increases with larger strings.
    6141115
    6151116\begin{figure}
     
    6401141\subsubsection{Test: Pass argument}
    6411142
    642 To have introduced:  STL string library forces users to think about memory management when communicating values across a function call
     1143STL has a penalty for passing a string by value, which indirectly forces users to think about memory management when communicating values to a function.
    6431144\begin{cfa}
    6441145void foo( string s );
     
    6461147foo( s );
    6471148\end{cfa}
    648 STL charges a prohibitive penalty for passing a string by value.
    6491149With implicit sharing active, \CFA treats this operation as normal and supported.
    6501150This test illustrates a main advantage of the \CFA sharing algorithm.
     
    7421242}
    7431243\end{lstlisting}
    744 
    745 \section{String I/O}
Note: See TracChangeset for help on using the changeset viewer.