Changeset a17f496 for doc/theses/mike_brooks_MMath
- Timestamp:
- Jun 10, 2025, 5:19:58 PM (3 months ago)
- Branches:
- master
- Children:
- 2214e81
- Parents:
- 9c1880b
- File:
-
- 1 edited
doc/theses/mike_brooks_MMath/string.tex
--- doc/theses/mike_brooks_MMath/string.tex (r9c1880b)
+++ doc/theses/mike_brooks_MMath/string.tex (ra17f496)
@@ -971,5 +971,5 @@
 %This alias model is assigning to an aliased substring for two strings overlapping by total containment: one is the selected string, the other is the document.
 Extend the metaphor to two selected areas, where one area can be drag-and-dropped into another, changing the text in the drop area and correspondingly changing the document.
-When the selected areas are indenpendent, the semantics of the drag-and-drop are straightforward.
+When the selected areas are independent, the semantics of the drag-and-drop are straightforward.
 However, for overlapping selections, either partial or full, there are multiple useful semantics.
 For example, two areas overlap at the top, or bottom, or a block at a corner, where one areas is dropped into the other.
@@ -1833,27 +1833,27 @@
 The test shows that \CFA enables faster speed in exchange for greater memory usage.

-A garbage collector, afforded the freedom of managed memory (where it knows about all the pointers and is allowed to modify them), often runs faster than malloc-free in an amortized analysis, even though it must occasionally stop to collect.
-The speedup happens because GC is able to use its collection time to move objects.
+A precise garbage-collector, which knows all pointers and may modify them, often runs faster than malloc-free in an amortized analysis, even though it must occasionally stop to collect (latency issues).
+The speedup happens indirectly because GC is able to move objects.
 (In the case of the mini-allocator powering the \CFA string library, objects are runs of text.)
-Moving objects lets fresh allocations consume from a large contiguous store of available memory; the ``bump pointer'' bookkeeping for such a scheme is very light.
-A malloc-free implementation without the freedom to move objects must, in the general case, allocate in the spaces between existing objects; doing so entails the heavier bookkeeping of maintaining a linked structure of freed allocations and/or coalescing freed allocations.
-
-A garbage collector keeps allocations around for longer than the using program can reach them.
-By contrast, a program (correctly) using malloc-free releases allocations exactly when they are no longer reachable.
-Therefore, the same harness will use more memory while running under garbage collection.
-A garbage collector can minimize the memory overhead by searching for these dead allocations aggressively, that is, by collecting more often.
-Tuned in this way, it spends a lot of time collecting, easily so much as to overwhelm its speed advantage from bump-pointer allocation.
-If it is tuned to collect rarely, then it leaves a lot of garbage allocated (waiting to be collected) but gains the advantage of little time spent doing collection.
-
-[TODO: find citations for the above knowledge]
-
-The speed for memory tradeoff is, therefore, standard for comparisons like \CFA--STL string allocations.
-The test verifies that it is so and quantifies the returns available.
+Moving objects creates a large contiguous store of available memory;
+fresh allocations consume from this area using light-weight \emph{bump-allocation}.
+A malloc-free implementation cannot move objects because the location of allocated objects is unknown;
+hence, free-space management is more complex managing linked structures of freed allocations and possibly coalescing freed blocks.
+
+GC occurs lazily, so the lifetime of unreachable allocations is unknown, \eg GC may never be triggered.
+A garbage collector can minimize the memory footprint for unreachable allocations by collecting eagerly.
+However, this approach can negate GC's speed advantage.
+Alternatively, collecting infrequently can consume large amounts of resident and virtual memory.
+In contrast, a program using malloc-free tries to release an allocation as soon as it is unreachable.
+This approach reduces the memory footprint (depending on the allocator), but has a per allocation cost for each free.
+% [TODO: find citations for the above knowledge]
+Therefore, a speed \vs memory tradeoff is a good comparator for \CFA--STL string allocations.
+%The test verifies that it is so and quantifies the returns available.

 These tests manipulate a tuning knob that controls how much extra space to use.
-Specific values of this knob are not user-visible and are not presented in the results here.
+Specific values of this knob are not user-visible and are not presented in the results.
 Instead, its two effects (amount of space used and time per operation) are shown.
-The independent variable is the liveness target, which is the fraction of the text buffer that is in use at the end of a collection.
-The allocator will expand its text buffer during a collection if the actual fraction live exceeds this target.
+The independent variable is the liveness target, which is the fraction of the text buffer in use at the end of a collection.
+The allocator expands its text buffer during a collection if the actual live fraction exceeds this target.

 \begin{figure}
@@ -1882,8 +1882,6 @@
 \end{cfa}
 The runs presented use a call depth of 1000 and a fan-out of 1.006, which means that approximately one call in 167 makes two recursive calls, while the rest make one.
-This sizing keeps an average of 836 strings live.
-This sizing was chosen to keep the amounts of consumed memory within the machine's last-level cache.
-
-The time measurement is of nanoseconds per such @f@-call, be it internal or leaf.
+This sizing keeps an average of 836 strings live and keeps the amounts of consumed memory within the machine's last-level cache.
+The time measurement is nanoseconds per @f@-call, for internal or leaf strings.

 % String lifetime (measured in number of subsequent string allocations) is ?? distributed, because each node in the call tree survives as long as its descendent calls.
@@ -1893,33 +1891,35 @@
 \centering
 \includegraphics{plot-string-allocn.pdf}
-\begin{tabularx}{\linewidth}{>{\centering\arraybackslash}X >{\centering\arraybackslash}X} (a) & (b) \end{tabularx}
-\caption{Space and time performance, under varying fraction-live targets, for the five string lengths shown, at \emph{Varying 16 and up} corpus construction.
-}
+\begin{tabularx}{\linewidth}{>{\centering\arraybackslash}X >{\centering\arraybackslash}X} (a) Vertical, Lower is better. \\ Horizontal, leftward is better. & (b) \hspace*{5pt} STL CFA \hspace*{20pt} STL \hspace*{10pt} CFA \hspace*{10pt} \end{tabularx}
+\caption{Space and time performance, under varying fraction-live targets, for the five string lengths shown, at \emph{Varying 16 and up} corpus construction.}
 \label{fig:string-graph-allocn}
 \end{figure}

-\VRef[Figure]{fig:string-graph-allocn} shows the results of this experiment.
-In plot (a), there is one labeled curve for each of the five chosen string sizes.
+\VRef[Figure]{fig:string-graph-allocn} shows the experimental results.
+In plot (a), there is one labeled curve for each of the five chosen string sizes, 20, 50, 100, 200, 500.
 A labeled curve shows the \CFA speed and space combinations achievable by varying the fraction-live tuning parameter.
-Stepping along the curve from top-left toward bottom-right means demanding that the string library have a smaller fraction of its heap remain live after collection, \ie that the heap should double its size more readily.
-So a subsequent data point on a curve is approximately double the prior point's memory.
-This additional memory allows for less time spent collecting, so the point shows a lower time per operation.
-This benefit of doubling is diminishing, so the curves begin steep and end flatter.
-The point chosen as \CFA's default liveness threshold (20\%) is marked.
-At most string lengths, this point occurs as the doubling benefit becomes subjectively negligible.
-At len-500, the amount of space needed to achieve 20\% liveness is so much that last-level cache misses begin occurring, and a further slowdown occurs.
-This effect is an anomaly of the experimental setup trying to compare such varied sizes in one view; the len-500 default point is included only to provide a holisitc picture of the STL comparisons (discussed next).
-
-Staying within plot (a), each ``tradeoff'' line connects the default-tuned \CFA performance point with its lenght-equivalent STL performance point.
-STL always uses less memory than (occurs to the left of) its \CFA equivalent.
-Ideally, \CFA would boast that its results are always faster (STL occurs above it); however, this effect is only realized for smaller string lenghts.
-There, the STL-to-\CFA tradeoff line goes down and to the right, but for longer strings, it goes up and to the right.
+Stepping along the curve from top-left toward bottom-right means the string library has a smaller fraction of its heap remain live after collection, \ie it doubles its size more often.
+Hence, memory approximately doubles between points.
+The additional memory means less time is spent collecting, so the curve shows lower (better) times per operation (y-axis).
+This doubling benefit diminishes, so the curves begin to flatten.
+
+The point chosen as \CFA's default liveness threshold (20\%) is marked with a circle.
+For most string lengths, this point occurs as the doubling benefit becomes subjectively negligible.
+At len-500, the amount of space needed to achieve 20\% liveness is so much that last-level cache misses begin occurring generating a further slowdown.
+This effect is an anomaly of the experimental setup trying to compare such varied sizes in one view;
+the len-500 default point is included only to provide a holistic picture of the STL comparisons (discussed next).
+
+Staying within plot (a), each \emph{tradeoff} line (green) connects the default-tuned \CFA performance point with its length-equivalent STL performance point.
+STL always uses less memory (occurs to the left of) than its \CFA equivalent.
+Ideally, \CFA results would be faster (STL occurs above it);
+however, this inversion only occurs for smaller string lengths.
+That is, the STL-to-\CFA tradeoff line goes down and to the right for small strings, but for longer strings, it goes up and to the right.

 Plot (b) offers insight into why larger lengths are not currently showing the anticipated tradeoff, and suggests possible remedies.
 This plot breaks down the time spent, comparing STL--\CFA tradeoffs, at successful len-50 with unsuccessful len-200.
-Data are sourced from running the experiment under \emph{perf}, recording samples of the call stack, and imposing a mutially-exclusive classification on these call stacks.
-Reading a stacked bar from the top down, \emph{text import} capturs samples where a routine like @memcpy@ is running, so that the string library is copying characters from the corpus source into its allocated working space.
-\emph{Malloc-free} means literally one of those routines (taking nontrivial time only for STL), while \emph{gc} is the direct \CFA(-only) eqivalent.
-All of the attributions so far occur while a string constructor is live; further time spent in a string's lifecycle management functions is attributed \emph{ctor-dtor}.
+Data are sourced from running the experiment under \emph{perf}, recording samples of the call stack, and imposing a mutually-exclusive classification on these call stacks.
+Reading a stacked bar from the top down, \emph{text import} captures samples where a routine like @memcpy@ is running, so that the string library is copying characters from the corpus source into its allocated working space.
+\emph{Malloc-free} means literally one of those routines (taking nontrivial time only for STL), while \emph{gc} is the direct \CFA(-only) equivalent.
+All of the attributions so far occur while a string is live; further time spent in a string's lifecycle management functions is attributed \emph{ctor-dtor}.
 Skipping to the bottom-most attribution, \emph{harness-leaf} represents samples where the youngest live stack frame is simply the recursive @helper@ routine of \VRef[Figure]{fig:string-perf-alloc}.
 The skipped attribution \emph{Other} has samples that meet none of the criteria above.
@@ -1931,5 +1931,5 @@
 This overhead figure was obtained by hacking a version of the \CFA string library to have a header-only implementation and measuring the resulting speed.
 The difference between separately compiled (normal) and header-only (hacked) versions is the reported overhead.
-It represents how much \CFA could speed up if it switched to a header-only implementation, to match STL.
+It represents how much \CFA could speed up if it switched to a header-only implementation to match STL.
 The difference jumps noticeably between the different string sizes, but has little correlation with string size ($r=0.3$).
 Associating this potential improvement with the \emph{other} attribution is a stylistic choice.
@@ -1938,5 +1938,4 @@
 The \emph{harness-leaf} category is called out as a quality check; this attribution is equal across all four setups, ruling out mysterious ``everything is X\% slower'' effects.
 But with the \emph{other} attribution, STL beats \CFA by an amount that matches the STL's benefit from avoiding separate compilation.
-Continuing upward, the significant differences between STL and \CFA implementations occur.
 In the \emph{ctor-dtor} category, \CFA is spending more time than STL to inter-link its string handles.
 Comparing STL's \emph{malloc-free} with \CFA's \emph{gc}, \CFA's is considerably shorter, in spite of its copying, because \emph{gc} only spends time on live strings, while \emph{malloc-free} works on all strings.
@@ -1944,7 +1943,7 @@
 So the critical comparison is STL's $(\textit{ctor-dtor} + \textit{malloc-free})$ \vs \CFA's $(\textit{ctor-dtor} + \textit{gc})$.
 Here, \CFA is a clear winner.
-Moreover, the discussion so far holds for both the successful and unsuccessful string lengths, albeit with the length-correlated duration of \emph{gc} (due to copying) encroching on \CFA's advantage, though not eliminating it.
+Moreover, the discussion so far holds for both the successful and unsuccessful string lengths, albeit with the length-correlated duration of \emph{gc} (due to copying) encroaching on \CFA's advantage, though not eliminating it.
 The total attributions so far see \CFA ahead, possibly by more than is shown, depending on how one views the separate-compilation difference.
-Under a \CFA-generous separate-compilation stance, \CFA is equal or ahead in every super-category so far.
+Under a \CFA-generous separate-compilation stance, \CFA is equal or ahead in every important category so far.

 The remaining attribution, \emph{text-import}, is thus a major reason that \CFA is currently unsuccessful at delivering improved speed with long strings. \PAB{From Mike: need to finish this point.}