
-                      r3483185
+                      r195f43d
 \chapter{String}
+\vspace*{-20pt}
 This chapter presents my work on designing and building a modern string type in \CFA.
 The discussion starts with examples of interesting string problems, followed by examples of how these issues are solved in my design.
+The discussion starts with examples of interesting string problems, followed by examples of how these issues are resolved in my design.
 …
 To prepare for the following discussion, comparisons among C, \CC, Java and \CFA strings are presented, beginning in \VRef[Figure]{f:StrApiCompare}.
+It provides a classic ``cheat sheet'' presentation, summarizing the names of the typical operations.
+\begin{figure}
+It provides a classic ``cheat sheet'' presentation, summarizing the names of the most common closely-equivalent operations.
+The over-arching commonality is that operations work on groups of characters for assigning, copying, scanning, and updating.
+\begin{figure}[h]
 \begin{cquote}
 \begin{tabular}{@{}l|l|l|l@{}}
 …
 \end{figure}
+The key commonality is that operations work on groups of characters for assigning, copying, scanning, and updating.
 Because a C string is null terminated and requires explicit storage management \see{\VRef{s:String}}, most of its group operations are error prone and expensive.
 Most high-level string libraries use a separate length field and specialized storage management to support group operations.
 \CC strings retain null termination to interface with library functions requiring C strings.
 \begin{cfa}
 int open( const char * pathname, int flags );
+As mentioned in \VRef{s:String}, a C string differs from other string types as it uses null termination rather than a length, which leads to explicit storage management;
+hence, most of its group operations are error prone and expensive.
+Most high-level string libraries use a separate length field and specialized storage management to implement group operations.
+Interestingly, \CC strings retain null termination just in case it is needed to interface with C library functions.
+\begin{cfa}
+int open( @const char * pathname@, int flags );
 string fname{ "test.cc" };
 open( fname.@c_str()@, O_RDONLY );
 \end{cfa}
 The function @c_str@ does not create a new null-terminated C string from the \CC string, as that requires passing ownership of the C string to the caller for eventual deletion.\footnote{
+Here, the \CC @c_str@ function does not create a new null-terminated C string from the \CC string, as that requires passing ownership of the C string to the caller for eventual deletion.\footnote{
 C functions like \lstinline{strdup} do return allocated storage that must be freed by the caller.}
 Instead, each \CC string is null terminated just in case it might be needed for this purpose.
+% Instead, each \CC string is null terminated just in case it might be needed for this purpose.
 Providing this backwards compatibility with C has a ubiquitous performance and storage cost.
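The non-allocating behaviour of @c_str@ can be checked directly. The following is a minimal sketch, assuming a C++11 or later standard library; the helper names @c_str_shares_buffer@ and @c_length@ are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: true when c_str() exposes the string's own buffer
// rather than a freshly allocated copy.
bool c_str_shares_buffer( const std::string & s ) {
	const char * p = s.c_str();
	// Since C++11, c_str() and data() return the same pointer, and the
	// character at index size() is the null terminator.
	return p == s.data() && p[s.size()] == '\0';
}

// Hypothetical helper: the length a C function sees through c_str().
// Assumes the string holds no embedded null characters.
std::size_t c_length( const std::string & s ) {
	assert( s.find( '\0' ) == std::string::npos );
	return std::strlen( s.c_str() );
}
```

Because the pointer refers to storage owned by the string object, the caller must not free it, and it is only usable while the string is alive and unmodified.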
+While \VRef[Figure]{f:StrApiCompare} emphasizes cross-language similarities, it elides differences in how a certain function is used.
+For example, the @replace@ function in \CC performs a modification on its @this@ parameter, while the Java one allocates and returns a new string with the result, leaving @this@ unmodified.
+While \VRef[Figure]{f:StrApiCompare} emphasizes cross-language similarities, it elides many specific operational differences.
+For example, the @replace@ function selects a substring in the target and substitutes it with the source string, which can be smaller or larger than the substring.
+\CC performs the modification on the mutable receiver object.
+\begin{cfa}
+string s1 = "abcde";
+s1.replace( 2, 3, "xy" );  $\C[2.25in]{// replace by position (zero origin) and length, mutable}\CRT$
+cout << s1 << endl;
+$\texttt{abxy}$
+\end{cfa}
+while Java allocates and returns a new string with the result, leaving the receiver unmodified.
+\begin{java}
+String s = "abcde";
+String r = s.replace( "cde", "xy" );  $\C[2.25in]{// replace by text, immutable}$
+StringBuffer sb = new StringBuffer( "abcde" );
+sb.replace( 2, 5, "xy" );  $\C{// replace by position, mutable}\CRT$
+System.out.println( s + ' ' + r + ' ' + sb );
+$\texttt{abcde abxy abxy}$
+\end{java}
 Generally, Java's @String@ type is immutable.
+Java provides a @StringBuffer@ near-analog that is mutable, but the differences are significant; for example, this class's @substring@ functions still return copies rather than mutable selections.
+These more significant differences are summarized in \VRef[Figure]{f:StrSemanticCompare}.  It calls out the consequences of each language taking a different approach on the ``internal'' issues, like storage management and null-terminator interoperability.  The discussion following justifies the figure's yes/no entries.
+\begin{figure}
+\begin{tabular}{@{}p{0.5in}p{2in}p{2in}>{\centering\arraybackslash}p{0.2in}>{\centering\arraybackslash}>{\centering\arraybackslash}p{0.2in}>{\centering\arraybackslash}p{0.2in}>{\centering\arraybackslash}p{0.2in}@{}}
+                                        &                       &                       & \multicolumn{4}{c}{\underline{Supports Helpful?}} \\
+Java provides a @StringBuffer@ near-analog that is mutable, but the differences are significant; for example, @StringBuffer@'s @substring@ functions still return copies rather than mutable selections.
+Finally, the operations between these types are asymmetric, \eg @String@ has @replace@ by text but not replace by position and vice versa for @StringBuffer@.
+More significant operational differences relate to storage management, often appearing through assignment (@target = source@), and are summarized in \VRef[Figure]{f:StrSemanticCompare}.
+% It calls out the consequences of each language taking a different approach on ``internal'' storage management.
+The following discussion justifies the figure's yes/no entries per language.
+\begin{figure}
+\setlength{\extrarowheight}{2pt}
+\begin{tabularx}{\textwidth}{@{}p{0.6in}XXcccc@{}}
+                                        &                       &                       & \multicolumn{4}{@{}c@{}}{\underline{Supports Helpful?}} \\
                                         & Required      & Helpful       & C                     & \CC           & Java          & \CFA \\
 \hline
 Type abst'n
                                         & Low-level: A ``string'' type represents a varying amount of text that is communicated with a function as a parameter/return.
                                                                 & High-level: Using a string-typed variable relieves the user of managing a distinct allocation for the text.
                                                                                         & \xmark        & \cmark        & \cmark        & \cmark \\
+                                        & Low-level: The string type is a varying amount of text communicated via a parameter or return.
+                                                                & High-level: The string type relieves the user of managing memory for the text.
+                                                                                        & no    & yes   & yes   & yes \\
 \hline
+\multirow{2}{0.5in}
+{State}
+State
                                         & \multirow{2}{2in}
                                         {Fast Initialize: The target receives the characters of the original, but no time is spent copying characters.  The result is one of Alias or Snapshot.}
                                                                 & Alias: The target is a further name for the text in the original; changes made in either variable are visible in both.
                                                                                         & \cmark        & \cmark        & \xmark        & \cmark \\
+                                        {Fast Initialize: The target receives the characters of the source without copying the characters, resulting in an Alias or Snapshot.}
+                                                                & Alias: The target name is within the source text; changes made in either variable are visible in both.
+                                                                                        & yes   & yes   & no    & yes \\
 \cline{3-7}
+                                        &
                                                                 & Snapshot: The target (resp.\ source) contains the value of the source at the time of the initialization until the target (resp.\ source) is explicitly changed.
                                                                                         & \xmark        & \xmark        & \cmark        & \cmark \\
+                                                                & Snapshot: The target is an alias within the source until the target changes (copy on write).
+                                                                                        & no    & no    & yes   & yes \\
 \hline
 Symmetry
                                         & Laxed: The target’s type is anything string-like; it may have a different status concerning ownership.
                                                                 & Strict: The target’s type is the same as the original; both strings are equivalent peers concerning ownership.
                                                                                         & --            & \xmark        & \cmark        & \cmark \\
+                                        & Lax: The target's type is anything string-like; it may have a different status concerning ownership.
+                                                                & Strict: The target's type is the same as the source; both strings are equivalent peers concerning ownership.
+                                                                                        & --            & no    & yes   & yes \\
 \hline
 Referent
                                         & Variable-Constrained: The target can accept the entire text of the original.
                                                                 & Fragment: The target can accept an arbitrary substring of the original.
                                                                                         & \xmark        & \xmark        & \cmark        & \cmark
 \end{tabular}
+                                        & Variable-Constrained: The target can accept the entire text of the source.
+                                                                & Fragment: The target can accept an arbitrary substring of the source.
+                                                                                        & no    & no    & yes   & yes
+\end{tabularx}
 \noindent
 Notes
 \begin{itemize}
+\begin{itemize}[parsep=0pt]
 \item
         All languages support Required in all criteria.
 …
         A language gets ``Supports Helpful'' in one criterion if it can do so without sacrificing the Required achievement on all other criteria.
 \item
         The C ``string'' is @char *@, under the conventions that @<string.h>@ requires.  Symmetry is not applicable to C.
+        The C ``string'' is actually @char []@, under the conventions that @<string.h>@ requires. Hence, there is no actual string type in C, so symmetry does not apply.
 \item
         The Java @String@ class is analyzed; its @StringBuffer@ class behaves similarly to @C++@.
 \end{itemize}
 \caption{Comparison of languages' strings, assignment-semantics perspective.}
+\caption{Comparison of languages' strings, storage management perspective.}
 \label{f:StrSemanticCompare}
 \end{figure}
+In C:
+\begin{cfa}
+char * s1 = ...; // assumed
+char * s2 = s1;  // alias state, variable-constrained referent
+char * s3 = s1 + 2;  // alias state, variable-constrained referent
+\end{cfa}
+The issue of symmetry is trivial for a low-level type, and so, scored as not applicable to C.
+With the type not managing the text buffer, there is no ownership question, \ie nothing done with the @s1@ or @s2@ variables leads to the memory that their text currently occupies becoming reusable.
+While @s3@ is a valid C-string that contains a proper substring of @s1@, the @s3@ technique does not constitute having a fragment referent because null termination implies the substring cannot be chosen arbitrarily; the technique works only for suffixes.
+In \CC:
+\begin{cfa}
+string s1 = ...; // assumed
+string & s2 = s1;  // alias state, lax symmetry, variable-constrained referent
+string s3 = s1;  // NOT fast-initialize (strict symmetry, variable-constrained referent)
+string s4 = s1.substr(2,4);  // NOT fast-initialize (strict symmetry, fragment referent)
+string & s5 = s1.substr(2,4);  // memory-use error
+\end{cfa}
+The lax symmetry of the @s2@ technique reflects how the validity of @s2@ depends on the lifetime of @s1@.
+It is common practice in \CC to use the @s2@ technique for parameter passing, but the safest-bet advice has to be that the callee can only use the referenced string for the duration of the call.
+So, when the called function is a constructor, it is typical that the implementation is doing an @s3@-style initialization of a string-object-typed member.
+Exceptions to this pattern are possible, of course, but they represent the programmer taking responsibility to assure safety where the type system does not.
+The @s4@ initialization is constrained by specification to copy the substring because of @c_str@ being specified to be a null-terminated character run that is not its own allocation.
+TODO: address caveat that @s3@ could be done fast by reference counting in the text area.
+In Java:
+\begin{cfa}
+String s1 = ...;  // assumed
+String s2 = s1;  // snapshot state, strict symmetry, variable-constrained referent
+String s3 = s1.substring(2,4);  // snapshot state (possible), strict symmetry, fragment referent
+\end{cfa}
+Here, facts about Java's implicit pointers and pointer equality can overcomplicate the picture.
+The further fact of Java's string immutability means that string variables behave as simple values.
+The result in @s2@ is the value of @s1@, and their pointer equality certainly assures that no time is spent copying characters.
+With @s3@, the case for fast-copy is more subtle.
+Certainly, its value is not pointer-equal to @s1@, implying at least a further allocation.
+In C, the declaration
+\begin{cfa}
+char s[$\,$] = "abcde";
+\end{cfa}
+creates a second-class fixed-sized string-variable, as it can only be used in its lexical context;
+it cannot be passed by value to string operations or user functions, because C arrays cannot be copied and no string-length information is passed to the function.
+Therefore, only pointers to strings are first-class, and discussed further.
+\begin{cfa}
+(const) char * s = "abcde";  $\C[2.25in]{// alias state, n/a symmetry, variable-constrained referent}$
+char * s1 = s;  $\C{// alias state, n/a symmetry, variable-constrained referent}$
+char * s2 = s;  $\C{// alias state, n/a symmetry, variable-constrained referent}$
+char * s3 = &s[1];  $\C{// alias state, n/a symmetry, variable-constrained referent}$
+char * s4 = &s3[1];  $\C{// alias state, n/a symmetry, variable-constrained referent}\CRT$
+printf( "%s %s %s %s %s\n", s, s1, s2, s3, s4 );
+$\texttt{\small abcde abcde abcde bcde cde}$
+\end{cfa}
+Note, all of these strings rely on the single null termination character at the end of @s@.
+The issue of symmetry does not apply to C strings because the value and pointer strings are essentially different types, and so this feature is scored as not applicable for C.
+With the type not managing the text buffer, there is no ownership question, \ie operations on @s1@ or @s2@ never lead to their memory becoming reusable.
+While @s3@ is a valid C-string that contains a proper substring of @s@, the @s3@ technique does not constitute having a fragment referent because null termination implies the substring cannot be chosen arbitrarily; the technique works only for suffixes.
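The suffix-only restriction can be made concrete. The following is a minimal sketch, written in \CC for checkability but using only C-style pointers for the alias case; the helper names @suffix_alias@ and @interior_copy@ are hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: a suffix can alias the source because it reuses the
// source's single null terminator.
const char * suffix_alias( const char * s, std::size_t start ) {
	assert( start <= std::strlen( s ) );
	return s + start;                        // no copy, shares storage with s
}

// Hypothetical helper: an interior fragment needs its own terminator, and
// therefore its own storage.
std::string interior_copy( const char * s, std::size_t start, std::size_t len ) {
	assert( start + len <= std::strlen( s ) );
	return std::string( s + start, len );    // copies len characters
}
```

The only way to obtain an interior fragment without copying is to write a null character into the source, which truncates every other alias of that text.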
+In \CC, @string@ offers a high-level abstraction.
+\begin{cfa}
+string s = "abcde";
+string & s1 = s;  $\C[2.25in]{// alias state, lax symmetry, variable-constrained referent}$
+string s2 = s;  $\C{// copy (strict symmetry, variable-constrained referent)}$
+string s3 = s.substr( 1, 2 );  $\C{// copy (strict symmetry, fragment referent)}$
+string s4 = s3.substr( 1, 1 );  $\C{// copy (strict symmetry, fragment referent)}$
+cout << s << ' ' << s1 << ' ' << s2 << ' ' << s3 << ' ' << s4 << endl;
+$\texttt{\small abcde abcde abcde bc c}$
+string & s5 = s.substr(2,4);  $\C{// error: cannot point to temporary}\CRT$
+\end{cfa}
+The lax symmetry reflects how the validity of @s1@ depends on the content and lifetime of @s@.
+It is common practice in \CC to use the @s1@-style pass by reference, with the understanding that the callee only uses the referenced string for the duration of the call, \ie no side effects using the parameter.
+So, when the called function is a constructor, it is typical to use an @s2@-style copy initialization of a string-object-typed member.
+Exceptions to this pattern are possible, but require the programmer to assure safety where the type system does not.
+The @s3@ initialization is constrained to copy the substring because @c_str@ always provides a null-terminated character sequence, which an interior substring of the source string cannot supply in place.
+@s3@ assignment could be made fast by reference counting the text area and using copy-on-write, but doing so would require an implementation upgrade.
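The difference between the alias and copy initializations becomes visible once the source is edited after binding. The following is a minimal sketch; the @Observed@ struct and the @edit_after_binding@ function are hypothetical names introduced only to report the four values.

```cpp
#include <cassert>
#include <string>

// Hypothetical record of the values seen through each name after the edit.
struct Observed { std::string source, alias, copy, fragment; };

Observed edit_after_binding() {
	std::string s = "abcde";
	std::string & s1 = s;                // alias: another name for s
	std::string s2 = s;                  // copy: independent storage
	std::string s3 = s.substr( 1, 2 );   // copy of the fragment "bc"
	s[1] = 'X';                          // edit the source after binding
	assert( &s1 == &s );                 // the reference designates the same object
	return { s, s1, s2, s3 };
}
```

Only the reference observes the edit; both the whole-string copy and the fragment copy keep the text they received at initialization.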
+In Java, @String@ also offers a high-level abstraction:
+\begin{cfa}
+String s = "abcde";
+String s1 = s;  $\C[2.25in]{// snapshot state, strict symmetry, variable-constrained referent}$
+String s2 = s.substring( 1, 3 );  $\C{// snapshot state (possible), strict symmetry, fragment referent}$
+String s3 = s2.substring( 1, 2 );  $\C{// snapshot state (possible), strict symmetry, fragment referent}\CRT$
+System.out.println( s + ' ' + s1 + ' ' + s2 + ' ' + s3 );
+System.out.println( (s == s1) + " " + (s == s2) + " " + (s2 == s3) );
+$\texttt{\small abcde abcde bc c}$
+$\texttt{\small true false false}$
+\end{cfa}
+Note, @substring@ takes a start and end position, rather than a start position and length.
+Here, facts about Java's implicit pointers and pointer equality can overcomplicate the picture, and so are ignored.
+Furthermore, Java's string immutability means string variables behave as simple values.
+The result in @s1@ is the pointer in @s@, and their pointer equality confirms no time is spent copying characters.
+With @s2@, the case for fast-copy is more subtle.
+Certainly, its value is not pointer-equal to @s@, implying at least a further allocation.
 TODO: finish the fast-copy case.
 Java strings lacking mutation means that aliasing is not possible with the @String@ type.
+Java's immutable strings mean aliasing is impossible with the @String@ type.
 Java's @StringBuffer@ provides aliasing, though without supporting symmetric treatment of a fragment referent; as a result, @StringBuffer@ scores as \CC.
 The easy symmetry that the Java string enjoys is aided by Java's garbage collection; Java's @s2@ is doing effectively the operation of \CC's @s2@, though without the consequence of complicating memory management.
+Finally, in \CFA,
+\begin{cfa}
+string s1 = ...; // assumed
+string s2 = s1; // snapshot state, strict symmetry, variable-constrained referent
+string s3 = s1`shareEdits; // alias state, strict symmetry, variable-constrained referent
+string s4 = s1(2,4); // snapshot state, strict symmetry, fragment referent
+string s5 = s1(2,4)`shareEdits; // alias state, strict symmetry, fragment referent
+\end{cfa}
+all helpful criteria of \VRef[Figure]{f:StrSemanticCompare} are satisfied.
+The \CFA string manages storage, handles all assignments, including those of fragment referents, with fast initialization, provides the choice between snapshot and alias semantics, does so symmetrically with one type (which assures text validity according to the lifecycles of the string variables).
+Finally, in \CFA, @string@ also offers a high-level abstraction:
+\begin{cfa}
+string s = "abcde";
+string & s1 = s; $\C[2.25in]{// alias state, strict symmetry, variable-constrained referent}$
+string s2 = s; $\C{// snapshot state, strict symmetry, variable-constrained referent}$
+string s3 = s`shareEdits; $\C{// alias state, strict symmetry, variable-constrained referent}\CRT$
+string s4 = s( 1, 2 );
+string s5 = s4( 1, 1 );
+sout | s | s1 | s2 | s3 | s4 | s5;
+$\texttt{\small abcde abcde abcde abcde bc c}$
+\end{cfa}
+% all helpful criteria of \VRef[Figure]{f:StrSemanticCompare} are satisfied.
+The \CFA string manages storage, handles all assignments, including those of fragment referents with fast initialization, provides the choice between snapshot and alias semantics, and does so symmetrically with one type (which assures text validity according to the lifecycles of the string variables).
 With aliasing, the intuition is that each string is an editor on an open shared document.
 With fragment aliasing, the intuition is that these editor views have been scolled or zoomed to overlapping, but potentially different, ranges.
+With fragment aliasing, the intuition is that these editor views have been scrolled or zoomed to overlapping, but potentially different, ranges.
 The remainder of this chapter explains how the \CFA string achieves this usage style.
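The editor intuition can be modelled in a few lines of \CC. This is a toy sketch, not the \CFA implementation: the @Editor@ class and its members are hypothetical, the snapshot is an eager copy for simplicity (whereas the text above describes fast initialization), and fragment views are omitted.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical model: each handle is an "editor" onto a shared document.
class Editor {
	std::shared_ptr<std::string> doc;
  public:
	explicit Editor( const std::string & text )
		: doc( std::make_shared<std::string>( text ) ) {}
	// Alias: the new editor is opened on the same document.
	Editor share_edits() const { return *this; }
	// Snapshot: the new editor gets a private document (copied eagerly here).
	Editor snapshot() const { return Editor( *doc ); }
	// Overwrite one character of the document.
	void put( std::size_t i, char c ) { assert( i < doc->size() ); (*doc)[i] = c; }
	const std::string & text() const { return *doc; }
};
```

An edit made through one alias is visible through every other alias of the same document, while a snapshot keeps the text it had when it was taken.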
 …
 Earlier work on \CFA~\cite[ch.~2]{Schluntz17} implemented object constructors and destructors for all types (basic and user defined).
 A constructor is a user-defined function run implicitly \emph{after} an object's declaration-storage is created, and a destructor is a user-defined function run \emph{before} an object's declaration-storage is deleted.
 This feature, called RAII~\cite[p.~389]{Stroustrup94}, guarantees pre invariants for users before accessing an object and post invariants for the programming environment after an object terminates.
+This feature, called RAII~\cite[p.~389]{Stroustrup94}, guarantees pre-invariants for users before accessing an object and post-invariants for the programming environment after an object terminates.
 The purpose of these invariants goes beyond ensuring authentic values inside an object.
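The ordering guarantee behind RAII can be observed with a small \CC sketch. It illustrates the general mechanism in \CC rather than the \CFA constructors and destructors themselves; the @Tracked@ type and the @lifecycle@ function are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical type that records its lifecycle events in a shared log.
struct Tracked {
	std::vector<std::string> & log;
	std::string name;
	Tracked( std::vector<std::string> & l, const std::string & n ) : log( l ), name( n ) {
		log.push_back( "ctor " + name );             // runs after storage exists
	}
	~Tracked() { log.push_back( "dtor " + name ); }  // runs before storage is released
};

std::vector<std::string> lifecycle() {
	std::vector<std::string> log;
	{
		Tracked a( log, "a" );
		Tracked b( log, "b" );
		assert( log.size() == 2 );   // both constructors ran before first use
	}                                // b is destroyed first, then a
	return log;
}
```

Constructors run implicitly at declaration and destructors run implicitly in reverse order at scope exit, so neither step can be forgotten by the user.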

Changeset 195f43d for doc/theses/mike_brooks_MMath/string.tex