Timestamp:
Jun 10, 2025, 5:19:58 PM (3 months ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master
Children:
2214e81
Parents:
9c1880b
Message:

proofread Section 5.4.4

File:
1 edited

  • doc/theses/mike_brooks_MMath/string.tex

    r9c1880b ra17f496  
    971971%This alias model is assigning to an aliased substring for two strings overlapping by total containment: one is the selected string, the other is the document.
    972972Extend the metaphor to two selected areas, where one area can be drag-and-dropped into another, changing the text in the drop area and correspondingly changing the document.
    973 When the selected areas are indenpendent, the semantics of the drag-and-drop are straightforward.
     973When the selected areas are independent, the semantics of the drag-and-drop are straightforward.
    974974However, for overlapping selections, either partial or full, there are multiple useful semantics.
    975975For example, two areas overlap at the top, or bottom, or a block at a corner, where one area is dropped into the other.
     
    18331833The test shows that \CFA enables faster speed in exchange for greater memory usage.
    18341834
    1835 A garbage collector, afforded the freedom of managed memory (where it knows about all the pointers and is allowed to modify them), often runs faster than malloc-free in an amortized analysis, even though it must occasionally stop to collect.
    1836 The speedup happens because GC is able to use its collection time to move objects.
     1835A precise garbage collector, which knows all pointers and may modify them, often runs faster than malloc-free in an amortized analysis, even though it must occasionally stop to collect (latency issues).
     1836The speedup happens indirectly because GC is able to move objects.
    18371837(In the case of the mini-allocator powering the \CFA string library, objects are runs of text.)
    1838 Moving objects lets fresh allocations consume from a large contiguous store of available memory; the ``bump pointer'' bookkeeping for such a scheme is very light.
    1839 A malloc-free implementation without the freedom to move objects must, in the general case, allocate in the spaces between existing objects; doing so entails the heavier bookkeeping of maintaining a linked structure of freed allocations and/or coalescing freed allocations.
    1840 
    1841 A garbage collector keeps allocations around for longer than the using program can reach them.
    1842 By contrast, a program (correctly) using malloc-free releases allocations exactly when they are no longer reachable.
    1843 Therefore, the same harness will use more memory while running under garbage collection.
    1844 A garbage collector can minimize the memory overhead by searching for these dead allocations aggressively, that is, by collecting more often.
    1845 Tuned in this way, it spends a lot of time collecting, easily so much as to overwhelm its speed advantage from bump-pointer allocation.
    1846 If it is tuned to collect rarely, then it leaves a lot of garbage allocated (waiting to be collected) but gains the advantage of little time spent doing collection.
    1847 
    1848 [TODO: find citations for the above knowledge]
    1849 
    1850 The speed for memory tradeoff is, therefore, standard for comparisons like \CFA--STL string allocations.
    1851 The test verifies that it is so and quantifies the returns available.
     1838Moving objects creates a large contiguous store of available memory;
     1839fresh allocations consume from this area using light-weight \emph{bump-allocation}.
     1840A malloc-free implementation cannot move objects because the pointers to allocated objects are unknown;
     1841hence, free-space management is more complex, maintaining linked structures of freed allocations and possibly coalescing freed blocks.
     1842
     1843GC occurs lazily, so the lifetime of unreachable allocations is unknown, \eg GC may never be triggered.
     1844A garbage collector can minimize the memory footprint for unreachable allocations by collecting eagerly.
     1845However, this approach can negate GC's speed advantage.
     1846Alternatively, collecting infrequently can consume large amounts of resident and virtual memory.
     1847In contrast, a program using malloc-free tries to release an allocation as soon as it is unreachable.
     1848This approach reduces the memory footprint (depending on the allocator), but incurs a per-allocation cost for each free.
     1849% [TODO: find citations for the above knowledge]
     1850Therefore, a speed \vs memory tradeoff is a good comparator for \CFA--STL string allocations.
     1851%The test verifies that it is so and quantifies the returns available.
    18521852
    18531853These tests manipulate a tuning knob that controls how much extra space to use.
    1854 Specific values of this knob are not user-visible and are not presented in the results here.
     1854Specific values of this knob are not user-visible and are not presented in the results.
    18551855Instead, its two effects (amount of space used and time per operation) are shown.
    1856 The independent variable is the liveness target, which is the fraction of the text buffer that is in use at the end of a collection.
    1857 The allocator will expand its text buffer during a collection if the actual fraction live exceeds this target.
     1856The independent variable is the liveness target, which is the fraction of the text buffer in use at the end of a collection.
     1857The allocator expands its text buffer during a collection if the actual live fraction exceeds this target.
    18581858
    18591859\begin{figure}
     
    18821882\end{cfa}
    18831883The runs presented use a call depth of 1000 and a fan-out of 1.006, which means that approximately one call in 167 makes two recursive calls, while the rest make one.
    1884 This sizing keeps an average of 836 strings live.
    1885 This sizing was chosen to keep the amounts of consumed memory within the machine's last-level cache.
    1886 
    1887 The time measurement is of nanoseconds per such @f@-call, be it internal or leaf. 
     1884This sizing keeps an average of 836 strings live and keeps the amounts of consumed memory within the machine's last-level cache.
     1885The time measurement is nanoseconds per @f@-call, for internal and leaf calls.
    18881886
    18891887% String lifetime (measured in number of subsequent string allocations) is ?? distributed, because each node in the call tree survives as long as its descendent calls.
     
    18931891\centering
    18941892  \includegraphics{plot-string-allocn.pdf}
    1895   \begin{tabularx}{\linewidth}{>{\centering\arraybackslash}X >{\centering\arraybackslash}X} (a) & (b) \end{tabularx}
    1896   \caption{Space and time performance, under varying fraction-live targets, for the five string lengths shown, at \emph{Varying 16 and up} corpus construction.
    1897 }
     1893  \begin{tabularx}{\linewidth}{>{\centering\arraybackslash}X >{\centering\arraybackslash}X} (a) Vertical, Lower is better. \\ Horizontal, leftward is better. & (b) \hspace*{5pt} STL CFA \hspace*{20pt} STL \hspace*{10pt} CFA \hspace*{10pt} \end{tabularx}
     1894  \caption{Space and time performance, under varying fraction-live targets, for the five string lengths shown, at \emph{Varying 16 and up} corpus construction.}
    18981895  \label{fig:string-graph-allocn}
    18991896\end{figure}
    19001897
    1901 \VRef[Figure]{fig:string-graph-allocn} shows the results of this experiment.
    1902 In plot (a), there is one labeled curve for each of the five chosen string sizes.
     1898\VRef[Figure]{fig:string-graph-allocn} shows the experimental results.
     1899In plot (a), there is one labeled curve for each of the five chosen string sizes, 20, 50, 100, 200, 500.
    19031900A labeled curve shows the \CFA speed and space combinations achievable by varying the fraction-live tuning parameter.
    1904 Stepping along the curve from top-left toward bottom-right means demanding that the string library have a smaller fraction of if its heap remain live after collection, \ie that the heap should double its size more readily.
    1905 So a subsequent data point on a curve is approximately double the prior point's memory.
    1906 This additional memory allows for less time spent collecting, so the point shows a lower time per operation.
    1907 This benefit of doubling is diminishing, so the curves begin steep and end flatter.
    1908 The point chosen as \CFA's default liveness threshold (20\%) is marked.
    1909 At most string lengths, this point occurs as the doubling benefit becomes subjectively negligible.
    1910 At len-500, the amount of space needed to achieve 20\% liveness is so much that last-level cache misses begin occurring, and a further slowdown occurs.
    1911 This effect is an anomaly of the experimental setup trying to compare such varied sizes in one view; the len-500 default point is included only to provide a holisitc picture of the STL comparisons (discussed next).
    1912 
    1913 Staying within plot (a), each ``tradeoff'' line connects the default-tuned \CFA performance point with its lenght-equivalent STL performance point.
    1914 STL always uses less memory than (occurs to the left of) its \CFA equivalent.
    1915 Ideally, \CFA would boast that its results are always faster (STL occurs above it); however, this effect is only realized for smaller string lenghts.
    1916 There, the STL-to-\CFA tradeoff line goes down and to the right, but for longer strings, it goes up and to the right.
     1901Stepping along the curve from top-left toward bottom-right means the string library keeps a smaller fraction of its heap live after collection, \ie it doubles its size more often.
     1902Hence, memory approximately doubles between points.
     1903The additional memory means less time is spent collecting, so the curve shows lower (better) times per operation (y-axis).
     1904This doubling benefit diminishes, so the curves begin to flatten.
     1905
     1906The point chosen as \CFA's default liveness threshold (20\%) is marked with a circle.
     1907For most string lengths, this point occurs as the doubling benefit becomes subjectively negligible.
     1908At len-500, the amount of space needed to achieve 20\% liveness is so much that last-level cache misses begin occurring, generating a further slowdown.
     1909This effect is an anomaly of the experimental setup trying to compare such varied sizes in one view;
     1910the len-500 default point is included only to provide a holistic picture of the STL comparisons (discussed next).
     1911
     1912Staying within plot (a), each \emph{tradeoff} line (green) connects the default-tuned \CFA performance point with its length-equivalent STL performance point.
     1913STL always uses less memory than (occurs to the left of) its \CFA equivalent.
     1914Ideally, \CFA results would be faster (STL occurs above it);
     1915however, this inversion only occurs for smaller string lengths.
     1916That is, the STL-to-\CFA tradeoff line goes down and to the right for small strings, but for longer strings, it goes up and to the right.
    19171917
    19181918Plot (b) offers insight into why larger lengths are not currently showing the anticipated tradeoff, and suggests possible remedies.
    19191919This plot breaks down the time spent, comparing STL--\CFA tradeoffs, at successful len-50 with unsuccessful len-200.
    1920 Data are sourced from running the experiment under \emph{perf}, recording samples of the call stack, and imposing a mutially-exclusive classification on these call stacks.
    1921 Reading a stacked bar from the top down, \emph{text import} capturs samples where a routine like @memcpy@ is running, so that the string library is copying characters from the corpus source into its allocated working space.
    1922 \emph{Malloc-free} means literally one of those routines (taking nontrivial time only for STL), while \emph{gc} is the direct \CFA(-only) eqivalent.
    1923 All of the attributions so far occur while a string constructor is live; further time spent in a string's lifecycle management functions is attributed \emph{ctor-dtor}.
     1920Data are sourced from running the experiment under \emph{perf}, recording samples of the call stack, and imposing a mutually-exclusive classification on these call stacks.
     1921Reading a stacked bar from the top down, \emph{text import} captures samples where a routine like @memcpy@ is running, so that the string library is copying characters from the corpus source into its allocated working space.
     1922\emph{Malloc-free} means literally one of those routines (taking nontrivial time only for STL), while \emph{gc} is the direct \CFA(-only) equivalent.
     1923All of the attributions so far occur while a string is live; further time spent in a string's lifecycle management functions is attributed \emph{ctor-dtor}.
    19241924Skipping to the bottom-most attribution, \emph{harness-leaf} represents samples where the youngest live stack frame is simply the recursive @helper@ routine of \VRef[Figure]{fig:string-perf-alloc}.
    19251925The skipped attribution \emph{Other} has samples that meet none of the criteria above.
     
    19311931This overhead figure was obtained by hacking a version of the \CFA string library to have a header-only implementation and measuring the resulting speed.
    19321932The difference between separately compiled (normal) and header-only (hacked) versions is the reported overhead.
    1933 It represents how much \CFA could speed up if it switched to a header-only implementation, to match STL.
     1933It represents how much \CFA could speed up if it switched to a header-only implementation to match STL.
    19341934The difference jumps noticeably between the different string sizes, but has little correlation with string size ($r=0.3$).
    19351935Associating this potential improvement with the \emph{other} attribution is a stylistic choice.
     
    19381938The \emph{harness-leaf} category is called out as a quality check; this attribution is equal across all four setups, ruling out mysterious ``everything is X\% slower'' effects.
    19391939But with the \emph{other} attribution, STL beats \CFA by an amount that matches the STL's benefit from avoiding separate compilation.
    1940 Continuing upward, the significant differences between STL and \CFA implementations occur.
    19411940In the \emph{ctor-dtor} category, \CFA is spending more time than STL to inter-link its string handles.
    19421941Comparing STL's \emph{malloc-free} with \CFA's \emph{gc}, \CFA's is considerably shorter, in spite of its copying, because \emph{gc} only spends time on live strings, while \emph{malloc-free} works on all strings.
     
    19441943So the critical comparison is STL's $(\textit{ctor-dtor} + \textit{malloc-free})$ \vs \CFA's $(\textit{ctor-dtor} + \textit{gc})$.
    19451944Here, \CFA is a clear winner.
    1946 Moreover, the discussion so far holds for both the successful and unsuccessful string lengths, albeit with the length-correlated duration of \emph{gc} (due to copying) encroching on \CFA's advantage, though not eliminating it.
     1945Moreover, the discussion so far holds for both the successful and unsuccessful string lengths, albeit with the length-correlated duration of \emph{gc} (due to copying) encroaching on \CFA's advantage, though not eliminating it.
    19471946The total attributions so far see \CFA ahead, possibly by more than is shown, depending on how one views the separate-compilation difference.
    1948 Under a \CFA-generous separate-compilation stance, \CFA is equal or ahead in every super-category so far.
     1947Under a \CFA-generous separate-compilation stance, \CFA is equal or ahead in every important category so far.
    19491948
    19501949The remaining attribution, \emph{text-import}, is thus a major reason that \CFA is currently unsuccessful at delivering improved speed with long strings.  \PAB{From Mike: need to finish this point.}