Index: doc/theses/mike_brooks_MMath/string.tex
===================================================================
--- doc/theses/mike_brooks_MMath/string.tex	(revision eeefc0ce7b34d6158108a35960cd5f58cbed09fa)
+++ doc/theses/mike_brooks_MMath/string.tex	(revision ac9c0ee8184815223b208b1dc7955395e38f770f)
@@ -2181,5 +2181,5 @@
 This plot breaks down the time spent, comparing STL--\CFA tradeoffs, at successful len-50 with unsuccessful len-200.
 Data are sourced from running the experiment under \emph{perf}, recording samples of the call stack, and imposing a mutually-exclusive classification on these call stacks.
-Reading a stacked bar from the top down, \emph{text import} captures samples where a routine like @memcpy@ is running, so that the string library is copying characters from the corpus source into its allocated working space.
+Reading a stacked bar from the top down, \emph{text import} captures samples where a routine like @memcpy@ is running, so that the string library is copying characters from the corpus source into its allocated text buffer.
 \emph{Malloc-free} means literally one of those routines (taking nontrivial time only for STL), while \emph{gc} is the direct \CFA(-only) equivalent.
 All of the attributions so far occur while a string is live; further time spent in a string's lifecycle management functions is attributed \emph{ctor-dtor}.
@@ -2211,21 +2211,12 @@
 If the STL monolithic compilation advantage is removed from consideration, the \emph{text-import} difference is the only reason that \CFA is not beating STL on speed, by about 10\%, across the board.
 
-An investigation\footnote{
-	\MLB{Peter, you need to be okay with this.}
-	The description of this investigation that appears in the current draft is my best recollection concerning work done previously.
-	But, so far, I have been unable to find this actual work.
-	90\% case: I find it or reproduce it and save the details properly; this footnote disappears.
-	10\% case: I can't do so; I retract the explanation above. 
-} into the \emph{text-import} difference revealed an interesting optimization opportunity.
-Both implementations use a @memcpy@ operation, sourcing from the program's @argv@ representation, targeting the string library's working space.
-The @memcpy@ action is inlined into its call site successfully, in both implementations.
-But STL's, which runs faster, does the data movement with vector instructions, while \CFA's does not.
-This STL-only instruction sequence appears to be correct only when the source and destination have their starting byte at the same offset within a vector chunk.
-The \CFA implementation has made no provision for this quality, so it is good for correctness that \CFA does not receive the vector version.
-Presumably, the optimizer (or check affecting the instruction stream) has noticed STL arranging for the destination to line up with the source.
-It could do so either by matching a known alignment (statically) or choosing to match the source's unaligned chunk offset (dynamically).
-Either possibility would be a choice to incur further fragmentation, when allocating working space (the copy's destination), in exchange for a faster copy.
-The \CFA implementation may benefit from attempting such a scheme.
-At present, incorporating the necessary fragementation into the working heap management is too disruptive.
+An investigation into the \emph{text-import} difference revealed an interesting optimization opportunity.
+Both implementations use a @memcpy@ operation, sourcing from the program's @argv@ strings (corpus strings specified on the command line), targeting the string library's text buffer.
+In both implementations, the @memcpy@ code is inlined.
+However, at runtime, STL's version runs faster because it performs the data movement with vector instructions, while \CFA's does not.
+This STL optimization occurs only when the source and destination are aligned on a memory boundary matching the vector-data alignment (64-byte alignment).
+However, strings in the \CFA text buffer are only byte aligned, whereas the \CC SSO and @malloc@ed strings are 16-byte aligned, increasing the possibility of vector alignment or of an optimization that ultimately results in vector operations.
+The \CFA implementation may benefit from such a scheme by wasting a small amount of space to position strings at a larger alignment boundary.
+At present, incorporating this optimization into the heap management is too disruptive.
 So, this discovery is left as a potential improvement.