-              reeefc0c
+              rac9c0ee8
 This plot breaks down the time spent, comparing STL--\CFA tradeoffs, at successful len-50 with unsuccessful len-200.
 Data are sourced from running the experiment under \emph{perf}, recording samples of the call stack, and imposing a mutually-exclusive classification on these call stacks.
 Reading a stacked bar from the top down, \emph{text import} captures samples where a routine like @memcpy@ is running, so that the string library is copying characters from the corpus source into its allocated working space.
+Reading a stacked bar from the top down, \emph{text import} captures samples where a routine like @memcpy@ is running, so that the string library is copying characters from the corpus source into its allocated text buffer.
 \emph{Malloc-free} means literally one of those routines (taking nontrivial time only for STL), while \emph{gc} is the direct \CFA(-only) equivalent.
 All of the attributions so far occur while a string is live; further time spent in a string's lifecycle management functions is attributed \emph{ctor-dtor}.
 …
 If the STL monolithic compilation advantage is removed from consideration, the \emph{text-import} difference is the only reason that \CFA is not beating STL on speed, by about 10\%, across the board.
+An investigation\footnote{
+        \MLB{Peter, you need to be okay with this.}
+        The description of this investigation that appears in the current draft is my best recollection concerning work done previously.
+        But, so far, I have been unable to find this actual work.
+\% case: I find it or reproduce it and save the details properly; this footnote disappears.
+\% case: I can't do so; I retract the explanation above.
+} into the \emph{text-import} difference revealed an interesting optimization opportunity.
+Both implementations use a @memcpy@ operation, sourcing from the program's @argv@ representation, targeting the string library's working space.
+The @memcpy@ action is inlined into its call site successfully, in both implementations.
+But STL's, which runs faster, does the data movement with vector instructions, while \CFA's does not.
+This STL-only instruction sequence appears to be correct only when the source and destination have their starting byte at the same offset within a vector chunk.
+The \CFA implementation has made no provision for this quality, so it is good for correctness that \CFA does not receive the vector version.
+Presumably, the optimizer (or check affecting the instruction stream) has noticed STL arranging for the destination to line up with the source.
+It could do so either by matching a known alignment (statically) or choosing to match the source's unaligned chunk offset (dynamically).
+Either possibility would be a choice to incur further fragmentation, when allocating working space (the copy's destination), in exchange for a faster copy.
+The \CFA implementation may benefit from attempting such a scheme.
+At present, incorporating the necessary fragmentation into the working heap management is too disruptive.
+An investigation into the \emph{text-import} difference revealed an interesting optimization opportunity.
+Both implementations use a @memcpy@ operation, sourcing from the program's @argv@ strings (corpus strings specified on the command line), targeting the string library's text buffer.
+In both implementations, the @memcpy@ code is inlined.
+However, at runtime, the STL's version runs faster, by doing the data movement with vector instructions, while \CFA's does not.
+This STL optimization only occurs when the source and destination are aligned on a memory boundary matching the vector-data alignment (64-byte alignment).
+However, strings in the \CFA text buffer are only byte aligned, whereas the \CC SSO and @malloc@ed strings are 16-byte aligned, increasing the possibility of vector alignment or an optimization that ultimately results in vector operations.
+The \CFA implementation may benefit from such a scheme by wasting a small amount of space to position strings at a larger alignment boundary.
+At present, incorporating this optimization into the heap management is too disruptive.
 So, this discovery is left as a potential improvement.
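
The space-for-alignment trade described above can be illustrated with a minimal sketch in C. It is not the \CFA string library's allocator: the names @TextBuffer@ and @text_alloc@ are hypothetical, and the sketch assumes a simple bump allocator over a contiguous text buffer. Each string's starting address is rounded up to a requested power-of-two boundary, and the skipped padding bytes are the wasted space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical bump-allocated text buffer (illustration only).
typedef struct {
    char *base;   // start of the buffer
    size_t used;  // bytes consumed so far, including padding
    size_t cap;   // total capacity in bytes
} TextBuffer;

// Reserve len bytes whose starting ADDRESS is a multiple of align.
// align must be a power of two. Returns NULL when the buffer is full.
static char *text_alloc(TextBuffer *tb, size_t len, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t cur = (uintptr_t)(tb->base + tb->used);
    // Round the current address up to the next multiple of align.
    uintptr_t start = (cur + align - 1) & ~(uintptr_t)(align - 1);
    size_t pad = (size_t)(start - cur);  // wasted bytes, at most align - 1
    if (pad > tb->cap - tb->used || len > tb->cap - tb->used - pad) {
        return NULL;
    }
    tb->used += pad + len;
    return (char *)start;
}
```

With a 16-byte boundary, each allocation wastes at most 15 bytes; a 64-byte boundary raises that bound to 63 bytes. Whether the compiler then selects a vector copy for the destination is a separate question that this sketch does not settle.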

Changeset ac9c0ee8

doc/theses/mike_brooks_MMath/string.tex