Timestamp:
Jun 10, 2025, 5:19:58 PM (3 months ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master
Children:
2214e81
Parents:
9c1880b
Message:

proofread Section 5.4.4

File:
1 edited

  • doc/theses/mike_brooks_MMath/string.tex

    r9c1880b ra17f496  
    971971%This alias model is assigning to an aliased substring for two strings overlapping by total containment: one is the selected string, the other is the document.
    972972Extend the metaphor to two selected areas, where one area can be drag-and-dropped into another, changing the text in the drop area and correspondingly changing the document.
    973 When the selected areas are indenpendent, the semantics of the drag-and-drop are straightforward.
     973When the selected areas are independent, the semantics of the drag-and-drop are straightforward.
    974974However, for overlapping selections, either partial or full, there are multiple useful semantics.
    975975For example, two areas overlap at the top, or bottom, or a block at a corner, where one area is dropped into the other.
     
    18331833The test shows that \CFA enables faster speed in exchange for greater memory usage.
    18341834
    1835 A garbage collector, afforded the freedom of managed memory (where it knows about all the pointers and is allowed to modify them), often runs faster than malloc-free in an amortized analysis, even though it must occasionally stop to collect.
    1836 The speedup happens because GC is able to use its collection time to move objects.
     1835A precise garbage collector, which knows all pointers and may modify them, often runs faster than malloc-free in an amortized analysis, even though it must occasionally stop to collect (latency issues).
     1836The speedup happens indirectly because GC is able to move objects.
    18371837(In the case of the mini-allocator powering the \CFA string library, objects are runs of text.)
    1838 Moving objects lets fresh allocations consume from a large contiguous store of available memory; the ``bump pointer'' bookkeeping for such a scheme is very light.
    1839 A malloc-free implementation without the freedom to move objects must, in the general case, allocate in the spaces between existing objects; doing so entails the heavier bookkeeping of maintaining a linked structure of freed allocations and/or coalescing freed allocations.
    1840 
    1841 A garbage collector keeps allocations around for longer than the using program can reach them.
    1842 By contrast, a program (correctly) using malloc-free releases allocations exactly when they are no longer reachable.
    1843 Therefore, the same harness will use more memory while running under garbage collection.
    1844 A garbage collector can minimize the memory overhead by searching for these dead allocations aggressively, that is, by collecting more often.
    1845 Tuned in this way, it spends a lot of time collecting, easily so much as to overwhelm its speed advantage from bump-pointer allocation.
    1846 If it is tuned to collect rarely, then it leaves a lot of garbage allocated (waiting to be collected) but gains the advantage of little time spent doing collection.
    1847 
    1848 [TODO: find citations for the above knowledge]
    1849 
    1850 The speed for memory tradeoff is, therefore, standard for comparisons like \CFA--STL string allocations.
    1851 The test verifies that it is so and quantifies the returns available.
     1838Moving objects creates a large contiguous store of available memory;
     1839fresh allocations consume from this area using light-weight \emph{bump-allocation}.
     1840A malloc-free implementation cannot move objects because the pointers to allocated objects are unknown;
     1841hence, free-space management is more complex, maintaining linked structures of freed allocations and possibly coalescing freed blocks.
     1842
     1843GC occurs lazily, so the lifetime of unreachable allocations is unknown, \eg GC may never be triggered.
     1844A garbage collector can minimize the memory footprint for unreachable allocations by collecting eagerly.
     1845However, this approach can negate GC's speed advantage.
     1846Alternatively, collecting infrequently can consume large amounts of resident and virtual memory.
     1847In contrast, a program using malloc-free tries to release an allocation as soon as it is unreachable.
     1848This approach reduces the memory footprint (depending on the allocator), but incurs a per-allocation cost for each free.
     1849% [TODO: find citations for the above knowledge]
     1850Therefore, a speed \vs memory tradeoff is a good comparator for \CFA--STL string allocations.
     1851%The test verifies that it is so and quantifies the returns available.
    18521852
    18531853These tests manipulate a tuning knob that controls how much extra space to use.
    1854 Specific values of this knob are not user-visible and are not presented in the results here.
     1854Specific values of this knob are not user-visible and are not presented in the results.
    18551855Instead, its two effects (amount of space used and time per operation) are shown.
    1856 The independent variable is the liveness target, which is the fraction of the text buffer that is in use at the end of a collection.
    1857 The allocator will expand its text buffer during a collection if the actual fraction live exceeds this target.
     1856The independent variable is the liveness target, which is the fraction of the text buffer in use at the end of a collection.
     1857The allocator expands its text buffer during a collection if the actual live fraction exceeds this target.
    18581858
    18591859\begin{figure}
     
    18821882\end{cfa}
    18831883The runs presented use a call depth of 1000 and a fan-out of 1.006, which means that approximately one call in 167 makes two recursive calls, while the rest make one.
    1884 This sizing keeps an average of 836 strings live.
    1885 This sizing was chosen to keep the amounts of consumed memory within the machine's last-level cache.
    1886 
    1887 The time measurement is of nanoseconds per such @f@-call, be it internal or leaf. 
     1884This sizing keeps an average of 836 strings live and keeps the amounts of consumed memory within the machine's last-level cache.
     1885The time measurement is nanoseconds per @f@-call, for internal and leaf calls.
    18881886
    18891887% String lifetime (measured in number of subsequent string allocations) is ?? distributed, because each node in the call tree survives as long as its descendent calls.
     
    18931891\centering
    18941892  \includegraphics{plot-string-allocn.pdf}
    1895   \begin{tabularx}{\linewidth}{>{\centering\arraybackslash}X >{\centering\arraybackslash}X} (a) & (b) \end{tabularx}
    1896   \caption{Space and time performance, under varying fraction-live targets, for the five string lengths shown, at \emph{Varying 16 and up} corpus construction.
    1897 }
     1893  \begin{tabularx}{\linewidth}{>{\centering\arraybackslash}X >{\centering\arraybackslash}X} (a) Vertical, Lower is better. \\ Horizontal, leftward is better. & (b) \hspace*{5pt} STL CFA \hspace*{20pt} STL \hspace*{10pt} CFA \hspace*{10pt} \end{tabularx}
     1894  \caption{Space and time performance, under varying fraction-live targets, for the five string lengths shown, at \emph{Varying 16 and up} corpus construction.}
    18981895  \label{fig:string-graph-allocn}
    18991896\end{figure}
    19001897
    1901 \VRef[Figure]{fig:string-graph-allocn} shows the results of this experiment.
    1902 In plot (a), there is one labeled curve for each of the five chosen string sizes.
     1898\VRef[Figure]{fig:string-graph-allocn} shows the experimental results.
     1899In plot (a), there is one labeled curve for each of the five chosen string sizes, 20, 50, 100, 200, 500.
    19031900A labeled curve shows the \CFA speed and space combinations achievable by varying the fraction-live tuning parameter.
    1904 Stepping along the curve from top-left toward bottom-right means demanding that the string library have a smaller fraction of if its heap remain live after collection, \ie that the heap should double its size more readily.
    1905 So a subsequent data point on a curve is approximately double the prior point's memory.
    1906 This additional memory allows for less time spent collecting, so the point shows a lower time per operation.
    1907 This benefit of doubling is diminishing, so the curves begin steep and end flatter.
    1908 The point chosen as \CFA's default liveness threshold (20\%) is marked.
    1909 At most string lengths, this point occurs as the doubling benefit becomes subjectively negligible.
    1910 At len-500, the amount of space needed to achieve 20\% liveness is so much that last-level cache misses begin occurring, and a further slowdown occurs.
    1911 This effect is an anomaly of the experimental setup trying to compare such varied sizes in one view; the len-500 default point is included only to provide a holisitc picture of the STL comparisons (discussed next).
    1912 
    1913 Staying within plot (a), each ``tradeoff'' line connects the default-tuned \CFA performance point with its lenght-equivalent STL performance point.
    1914 STL always uses less memory than (occurs to the left of) its \CFA equivalent.
    1915 Ideally, \CFA would boast that its results are always faster (STL occurs above it); however, this effect is only realized for smaller string lenghts.
    1916 There, the STL-to-\CFA tradeoff line goes down and to the right, but for longer strings, it goes up and to the right.
     1901Stepping along the curve from top-left toward bottom-right means the string library keeps a smaller fraction of its heap live after collection, \ie it doubles its size more often.
     1902Hence, memory approximately doubles between points.
     1903The additional memory means less time is spent collecting, so the curve shows lower (better) times per operation (y-axis).
     1904This doubling benefit diminishes, so the curves begin to flatten.
     1905
     1906The point chosen as \CFA's default liveness threshold (20\%) is marked with a circle.
     1907For most string lengths, this point occurs as the doubling benefit becomes subjectively negligible.
     1908At len-500, the amount of space needed to achieve 20\% liveness is so much that last-level cache misses begin occurring, generating a further slowdown.
     1909This effect is an anomaly of the experimental setup trying to compare such varied sizes in one view;
     1910the len-500 default point is included only to provide a holistic picture of the STL comparisons (discussed next).
     1911
     1912Staying within plot (a), each \emph{tradeoff} line (green) connects the default-tuned \CFA performance point with its length-equivalent STL performance point.
     1913STL always uses less memory than (occurs to the left of) its \CFA equivalent.
     1914Ideally, \CFA results would be faster (STL occurs above it);
     1915however, this inversion only occurs for smaller string lengths.
     1916That is, the STL-to-\CFA tradeoff line goes down and to the right for small strings, but for longer strings, it goes up and to the right.
    19171917
    19181918Plot (b) offers insight into why larger lengths are not currently showing the anticipated tradeoff, and suggests possible remedies.
    19191919This plot breaks down the time spent, comparing STL--\CFA tradeoffs, at successful len-50 with unsuccessful len-200.
    1920 Data are sourced from running the experiment under \emph{perf}, recording samples of the call stack, and imposing a mutially-exclusive classification on these call stacks.
    1921 Reading a stacked bar from the top down, \emph{text import} capturs samples where a routine like @memcpy@ is running, so that the string library is copying characters from the corpus source into its allocated working space.
    1922 \emph{Malloc-free} means literally one of those routines (taking nontrivial time only for STL), while \emph{gc} is the direct \CFA(-only) eqivalent.
    1923 All of the attributions so far occur while a string constructor is live; further time spent in a string's lifecycle management functions is attributed \emph{ctor-dtor}.
     1920Data are sourced from running the experiment under \emph{perf}, recording samples of the call stack, and imposing a mutually-exclusive classification on these call stacks.
     1921Reading a stacked bar from the top down, \emph{text import} captures samples where a routine like @memcpy@ is running, so that the string library is copying characters from the corpus source into its allocated working space.
     1922\emph{Malloc-free} means literally one of those routines (taking nontrivial time only for STL), while \emph{gc} is the direct \CFA(-only) equivalent.
     1923All of the attributions so far occur while a string is live; further time spent in a string's lifecycle management functions is attributed \emph{ctor-dtor}.
    19241924Skipping to the bottom-most attribution, \emph{harness-leaf} represents samples where the youngest live stack frame is simply the recursive @helper@ routine of \VRef[Figure]{fig:string-perf-alloc}.
    19251925The skipped attribution \emph{Other} has samples that meet none of the criteria above.
     
    19311931This overhead figure was obtained by hacking a version of the \CFA string library to have a header-only implementation and measuring the resulting speed.
    19321932The difference between separately compiled (normal) and header-only (hacked) versions is the reported overhead.
    1933 It represents how much \CFA could speed up if it switched to a header-only implementation, to match STL.
     1933It represents how much \CFA could speed up if it switched to a header-only implementation to match STL.
    19341934The difference jumps noticeably between the different string sizes, but has little correlation with string size ($r=0.3$).
    19351935Associating this potential improvement with the \emph{other} attribution is a stylistic choice.
     
    19381938The \emph{harness-leaf} category is called out as a quality check; this attribution is equal across all four setups, ruling out mysterious ``everything is X\% slower'' effects.
    19391939But with the \emph{other} attribution, STL beats \CFA by an amount that matches the STL's benefit from avoiding separate compilation.
    1940 Continuing upward, the significant differences between STL and \CFA implementations occur.
    19411940In the \emph{ctor-dtor} category, \CFA is spending more time than STL to inter-link its string handles.
    19421941Comparing STL's \emph{malloc-free} with \CFA's \emph{gc}, \CFA's is considerably shorter, in spite of its copying, because \emph{gc} only spends time on live strings, while \emph{malloc-free} works on all strings.
     
    19441943So the critical comparison is STL's $(\textit{ctor-dtor} + \textit{malloc-free})$ \vs \CFA's $(\textit{ctor-dtor} + \textit{gc})$.
    19451944Here, \CFA is a clear winner.
    1946 Moreover, the discussion so far holds for both the successful and unsuccessful string lengths, albeit with the length-correlated duration of \emph{gc} (due to copying) encroching on \CFA's advantage, though not eliminating it.
     1945Moreover, the discussion so far holds for both the successful and unsuccessful string lengths, albeit with the length-correlated duration of \emph{gc} (due to copying) encroaching on \CFA's advantage, though not eliminating it.
    19471946The total attributions so far see \CFA ahead, possibly by more than is shown, depending on how one views the separate-compilation difference.
    1948 Under a \CFA-generous separate-compilation stance, \CFA is equal or ahead in every super-category so far.
     1947Under a \CFA-generous separate-compilation stance, \CFA is equal or ahead in every important category so far.
    19491948
    19501949The remaining attribution, \emph{text-import}, is thus a major reason that \CFA is currently unsuccessful at delivering improved speed with long strings.  \PAB{From Mike: need to finish this point.}