Changeset eeefc0c for doc/theses/mike_brooks_MMath/list.tex
- Timestamp: Apr 26, 2026, 10:51:51 AM (3 days ago)
- Branches: master
- Children: ac9c0ee8
- Parents: 7184906
- File: doc/theses/mike_brooks_MMath/list.tex (modified) (29 diffs)
doc/theses/mike_brooks_MMath/list.tex
r7184906 reeefc0c
The expansion and underlying API are under discussion.
TODO: explain pivot from ``is at done?'' to ``has more?''
Advantages of this change include being able to pass ranges to functions, for example, projecting a numerically regular subsequence of array entries, and being able to use the loop syntax to cover more collection types, such as looping over the keys of a hash table.

When iterating an empty list, the question, ``Is there a further element?'' needs to be posed once, receiving the answer, ``no.''
… …
The goal is to show the \CFA lists are competitive with other designs, but the different list designs may not have equivalent functionality, so it is impossible to select a winner encompassing both functionality and execution performance.


\subsection{Experiment Design}

This section explains how the experiment is built.
Many of the following parts define terminology concerning tuning knobs.
\VRef[Figure]{f:ListPerfGlossary} provides a consolidated reference.

\begin{figure}
… …
-- Movement & \\
\quad $\ni$ stack
& IRs happen at the same end. \\
\quad $\ni$ queue
& IRs happen at opposite ends. \\
-- Polarity
& Which of the two orientations in which the movement happens. \\
… …
& How an insertion position, or removal element, is specified. The same position/element is picked either way.
\\
\quad $\ni$ all head
& IRs both through the head \\
\quad $\ni$ insert element
& insert by element and remove through the head \\
… …

\subsubsection{Add-Remove Performance}
\label{s:AddRemovePerformance}
… …
This experiment takes the position that:
\begin{enumerate}[leftmargin=*]
\item The total time to add and remove is relevant, as opposed to having one time for adding and a separate time for removing.
Adds without removes quickly fill memory;
… …
\item Speed differences caused by the host machine's memory hierarchy need to be identified and explained,
but do not represent advantages of one framework over another.
\end{enumerate}

The experiment used to measure IR cost measures the mean duration of a sequence of additions and removals.
The distribution of speeds experienced by an individual add-remove pair (tail latency) is not discussed.
Space efficiency is shown only indirectly, by way of caches' impact on speed.
… …
In all cases, the quantity discussed is the duration of one insert-remove (IR).
An IR is the time taken to do one innermost insertion-loop iteration, one innermost removal-loop iteration, and its share of all overheads, amortized.
Lower IR is better.
This experiment typically does an IR in 1--10 ns.
The short end of this range has durations of single-digit clock-cycle counts.
… …
Often, an IR duration value needs to be considered relatively.
For example, \VRef{s:SweetSoreSpots} asks whether one linked-list implementation is more sensitive than another to the computer architecture.
A finding might be that a machine slows implementation A by 10\% and B by 20\%.
This finding is not saying that A is faster than B (on either machine).
The finding could stand if B starts faster and then levels off, if B starts slower and gets worse, or in myriad other cases.
The finding asserts that such distinctions are not what is immediately relevant.
The arithmetic producing the percentages removes the information about which implementation starts, or ends up, faster.
Each implementation's to-machine duration is stated relative to \emph{the same implementation's} from-machine duration.
The resulting measure is still about a duration.
… …
}
stopTimer();
reportedDuration = getTimerDuration() / totalOpsDone; // throughput per IR operation
\end{cfa}
To reduce administrative overhead, the $n$ nodes for each experiment list are preallocated in an array (on the stack), which removes dynamic allocations for this storage.
… …
After each round, a counter is incremented by $n$ (for throughput).
Time is measured outside the loop because a large $n$ can overrun the time duration before the @CONTINUE@ flag is tested.
Hence, there is a minimum of one outer (@CONTINUE@) loop iteration for large lists.
The loop duration is divided by the counter and this throughput is reported.
In a scatter-plot, each dot is one throughput, which means insert + remove + harness overhead.
… …
To test list operations, the experiment performs the inserts/removes in different patterns, \eg insert and remove from front, insert from front and remove from back, random insert and remove, \etc.
Unfortunately, the @std::list@ does \emph{not} support direct IR from a node without an iterator, \ie no @erase( node )@, even though the list is doubly-linked.
To eliminate the iterator, a trick is used for random insertions without replacement, which takes advantage of the array nature of the nodes.
The @i@ fields in each node are initialized from @0..n-1@.
… …
Hence, the traversal is the same as the non-random traversal above.
To level the experiments, an explicit access to the random node is inserted after the insertion, @temp.j = 0@, for the wrapped experiment.
Furthermore, it is rare to IR nodes and not access them.

% \emph{Interleaving} allows for movements other than pure stack and queue.
… …
\subsubsection{Use Cases}
\label{s:UseCases}

Where \VRef[Figure]{f:ListPerfGlossary} enumerates the specific values, recall the use-case dimensions are:
\begin{description}
\item[movement ($\times 2$)]
In these experiments, strict stack and queue patterns are tested.
\item[polarity ($\times 2$)]
Obtain one polarity from the other by reversing uses of first/last.
\item[accessor ($\times 3$)]
Giving an add/remove location by a list head's first/last, \vs by a preexisting reference to an individual element.
\end{description}

\begin{figure}
… …
\end{figure}

A use case is a specific selection of movement, polarity and accessor.
These experiments run twelve use cases.
… …
The accessor patterns, at the (\CFA) API level, are:
\begin{description}
\item[all (through the) head:] Both IRs happen through the list head. The list head operations are @insert_first@, @insert_last@, @remove_first@ and @remove_last@.
\begin{sloppypar}
\item[insert (of known) element] \dots and remove through head: Inserts use @insert_before(e, first)@ or @insert_after(e, last)@, where @e@ is being inserted and @first@/@last@ are element references known by list-independent means.
\end{sloppypar}
\item[remove (of known) element] \dots and insert through the head: Removes use @remove(e)@, where @e@ is being removed. List-independent knowledge establishes that @e@ is first or last, as appropriate.
\end{description}
… …
Comparing all-head with remove-element gives the relative performance of head-mediated \vs element-oriented removal, because both use the same insertion style.


\subsubsection{Sizing}

Intuition suggests measuring IR for different sized lists should just be a multiple of the single linking/unlinking of a node.
However, there is a scaling issue as more memory is being read and written, where caching comes into play.
But caching is more than the amount of memory being accessed;
the access pattern is equally important.
Aggressive instruction-level parallelism scheduling, which enables short IR times, is the amplifier, \eg a data dependency is a critical path in one situation but not in another.
Therefore, the duration response to size is not a steady worsening as size increases.
Often, each size-independent configuration responds to size increases in steps of slowdown.
Occasionally a slowdown step is followed by some performance increase, where an incurred penalty begins to amortize away.
Hence, performance results can have interesting jitter as size increases.
The analysis treats these behaviours as incidental.
It does not try to characterize various exact-size responses.
Rather, size zones are picked, specific effects inside of a zone are averaged away, and the story at one zone is compared to that at another.

% It is true, but perhaps not obvious, that buildind and destroying long lists is slower than building and destroying short lists.
% Obviously, indeed, it takes longer to fuse and divide a hundred neighbours than five.
% But the key metric in this work, AII, is about a single link--unlink.
% So, critically, linking and unlinking a hundred neighbours actually takes \emph{more} than $20\times$ the time for five neighbours.
% The main reason is caching; when more neighbours are being manipulated, more memory is being read and written.
%
% But caching success is about more than the amount of memory worked on.
% Subtle changes in pattern become butterfly effects.
% Aggressive ILP scheduling, which enables short AIR times, is the amplifier.
% A data dependency, present in one framework but not another, is critical path in one situation but in not another.
% So, duration's response to size is not a steady worsening as size increases.
% Rather, each size-independent configuration often responds to size increases with leaps of worsening.
% Occasionally a leap is even followed size-run of retrograde response, where a suddenly incurred penalty has a chance to ammortize away.
% The frameworks tend to leapfrog over each other, at different points, as size increases.
%
% The analysis treats these behaviours as incidental.
% It does not try to characterize various exact-size responses.
% Rather, size zones are picked, specific effects inside of a zone are averaged away, and the story at one zone is compared to that at another zone.

\begin{figure}
… …
}
\end{tabular}
\caption{Variety of IR duration \vs list length, at small--medium lengths. Two example use cases are shown: I, stack movement with head-only access (plot a); VIII, queue movement with element-oriented removal access (plot b); both use cases have insert-first polarity.
One example is run on each machine: UC-I on AMD (plot a); UC-VIII on Intel (plot b). Lower is better.}
\label{fig:plot-list-zoomin-abs}
\end{figure}

\VRef[Figure]{fig:plot-list-zoomin-abs} gives two example responses to size.
% The dataset here is a small portion of the overall result and it is premature to attempt conclusions about framework differences from it.
These two example cases show how differently a pair of individual configurations behave.
% Of more immediate significance, they also have a pattern repeated, in all eight of their size responses.
% Note the ``small'' and ``medium'' overlaid boxes, which call out the size zones' definitions.
Outside of an identified box, size response is erratic.
Inside a box, size response is relatively smooth.
Within and among boxes there are identifiable patterns, which occur throughout all the experimental results.
Each individual configuration is tested by five trials, giving the error bars at min and max.
The amount of error here is typical across the configurations.
With a few exceptions, it is modest, so experiments are repeatable.

To preview, \VRef{s:ResultCoarseComparisonStyles} dismisses large sizes (above 150 elements) and wrapped lists, because the performance story is dominated by the amount of memory touched, not by intrusive \vs wrapped lists.
At smaller sizes, \VRef{s:ComparingIntrusiveImplementations} shows differences appear among the intrusive-list implementations.
Among the ``not Large'' sizes ($\le$ 150), two zones, Small and Medium, are selected as representatives of what can vary when the scale is changed.
These particular ranges are chosen because each range tends to have one story repeated across its constituent sizes.
For example, if \CFA's duration increases across Small, then the other frameworks' usually do too, or if \CFA is beating \uCpp at the low end of Medium, then it usually is at the high end too.
% The leapfrogging tends to happen outside of these two ranges.

Finally, on the AMD architecture, \CFA performed poorly at size 1, on queue movements only, and no other framework saw the same effect.
This extreme outlier is not plotted in the graphs.
After exploring the phenomenon in depth (not presented), the conclusion is a quirky interaction between the hardware and the testing harness.
A side experiment (that does not enrich the overall comparisons) saw user-induced gaps of $\approx$10 ns between same-list operations hide the effect completely.
These gaps are realistic because when an item goes on a list, another action comes back to it \emph{later}.
The pattern that the general harness uses, concentrating time-adjacent operations on one list, is useful for measuring the ``small'' size zone, but is contrived from the perspective of a data hazard that only this pattern exposes.
The general comparisons do not see the effect at all, because they use only the Small and Medium zones, with a shortest length of 4.

% A spot of poor performance appears in the general results for \CFA at size 1.
% Section \MLB{TODO:xref} explores the phenomenon and concludes that it is an anomaly due to a quirky interaction with the testing rig.
% To do so, two it considers size as either length or width.
% Length is the number of elements in a list.
% Width is a number of these lists being kept, worked upon in round-robin order.
% Outside of \MLB{TODO:xref}, size always means length, and width is 1.

… …
The call to @pass@ can prevent a small number of compiler optimizations but this cost is the same for all lists.

The main difference in the machines is their cache structure.
The AMD has smaller caches that are shared less, while the Intel shares larger caches among more processors.
This difference, while an interesting tradeoff for highly concurrent use, is rather one-sided for sequential use, such as this experiment.
Specifically, the Intel offers a single processor a bigger cache.


\subsubsection{Recap and Master Legend}

For experiments performed in later sections, there are 12 use cases, which are all combinations of 2 movements, 2 polarities and 3 accessors.
There are 4 physical contexts, which are all combinations of 2 machines and 2 size (length) zones.
Each physical context samples 4 specific sizes.
There are 3.25 frameworks.
This accounting considers how LQ-list supports only the movement--polarity combination ``stack, insert first.''
So LQ-list fills a quarter of the otherwise-orthogonal space.
Use case, physical context and framework are the explanatory factors.
Taking all combinations of the explanatory factors gives 12 $\times$ 4 $\times$ 4 $\times$ 3.25 = 624 individual configurations.

% \[
% \textrm{624 individual configurations} =
% \sum_{\substack{
% \textrm{12 use cases}\\
% \textrm{4 physical contexts}\\
% \textrm{4 specific sizes}\\
% \textrm{3.25 frameworks}
% }}
% \textrm{1 individual configuration}
% \]

\begin{figure}
… …
}
\end{tabular}
\caption{IR duration, transformed for general analysis. The analysis follows the single example setup of \VRef[Figure]{f:zoomin-abs-i-swift}, \ie Use Case I on AMD, where IR is given as absolute duration. Plot (a) transforms the source dataset by conditioning on specific size. Plot (b) takes the results from only the identified size zones, discards their specific-size information, and shows the resulting distribution.
Lower is better.}
\label{fig:plot-list-rel}
\end{figure}

\begin{comment}
32 of these individual configurations, plucked from \VRef[Figure]{f:zoomin-abs-i-swift}, are the subject of \VRef[Figure]{fig:plot-list-rel}, where they are now transformed into the format used for general analysis.
In \subref*{f:zoomin-rel-i-swift}, each of the 56 data points is an individual configuration; the subset within the two boxes has the 32 of interest.
… …
That is, inter-configuration rollups discard the modest trial-repeatability error.
The girth of a histogram's distribution is entirely the inter-configuration variance, of its configurations' expected performance.
\end{comment}

It is impossible to present this large amount of information in graphs.
Therefore, a condensed graphing style is used in subsequent plots.
\VRef[Figure]{fig:plot-list-rel} shows how the condensed graphing style is generated from raw data.
\VRef[Figure]{f:zoomin-rel-i-swift} is formed from the data in \VRef[Figure]{f:zoomin-abs-i-swift}, restructured on the Y-axis using a relative duration.
\VRef[Figure]{f:zoomin-histo-i-swift} shows the interesting data within the two boxes (Small/Medium) and their combination (Both).
This graph plots a vertical histogram for each of the 4 lists.
The light-shaded histogram is the raw data (similar data values overlap), and the dark histogram is the geomean average when there are multiple experiments condensed in a column.
The caption indicates the number of values condensed into this histogram, \eg ``/4'' $\Rightarrow$ 4 data points.
The vertical relationship among the averages gives a quick result for a specific experiment, where lower is better.
The relative duration smooths the results, where smoothness diminishes as size increases.
This smoothness gives nicely separated histograms.
Thus, in the forthcoming comparison plots:

% \begin{itemize}[leftmargin=*]
% \item
% The measure is mean IR among the middle 3 trials out of 5, that occurred for an individual configuration.
% \item
% The number of individual configurations per histogram is stated as ``/N,'' at a relevant granularity.
% \item
% All reported averages are geometric means and all IR duration axes (verticals) are logarithmic.
% \item
% Unless indicated otherwise, all explanatory factors appearing on a plot are marginalized, while those not appearing on the plot are conditioned.
% \end{itemize}


\subsection{Result: Coarse comparison of styles}
\label{s:ResultCoarseComparisonStyles}

This comparison establishes how an intrusive list performs compared with a wrapped-reference list.
\VRef[Figure]{fig:plot-list-zoomout} presents throughput at various list lengths for a linear and random (shuffled) IR test.
Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing different ends of the access spectrum.
In the graphs, all four intrusive lists (lq-list, lq-tailq, \uCpp, \CFA, see Framework in \VRef[Figure]{f:ListPerfGlossary}) are plotted with the same symbol;
sometimes these symbols clump on top of each other, showing the performance difference among intrusive lists is small in comparison to the wrapped list (std::list).
See~\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists.

The list lengths start at 10 due to the short IR times of 2--4 ns for intrusive lists \vs STL's wrapped-reference list of 15--20 ns.
For very short lists, like 4, the experiment time of 4 $\times$ 2.5 ns and experiment overhead (loops) of 2--4 ns results in an artificial administrative bump at the start of the graph having nothing to do with the IR times.
As the list size grows, the administrative overhead for intrusive lists quickly disappears.
… …
\caption{Insert/remove duration \vs list length.
Lengths go as large as possible without error.
One example use case is shown: stack movement, insert-first polarity and head-mediated access.
Lower is better.}
\label{fig:plot-list-zoomout}
\end{figure}

The key performance factor between intrusive and wrapped-reference lists is the dynamic allocation for the wrapped nodes.
Hence, this experiment is largely measuring the cost of @malloc@/\-@free@ rather than IR, and is sensitive to the layout of memory by the allocator.
For intrusive-list IR, the cost is manipulating the link fields, which is seen by the relatively similar results for the different intrusive lists.
For wrapped-reference IR, the costs are: dynamically allocating/deallocating a wrapped node, copying an external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
the allocation dominates these costs.
For example, the experiment was run with both the glibc and llheap memory allocators, where llheap reduced the cost from 20 to 16 ns, still far from the 2--4 ns for linking an intrusive node.
… …
The data is selected from the start of \VRef[Figures]{f:Linear-swift}--\subref*{f:Linear-java}, but the start of \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} is largely the same.
 \begin{figure}
 \centering
-\includegraphics{plot-list-1ord.pdf} \\
+\includegraphics{plot-list-1ord.pdf}
 \small{\textsuperscript{\textdagger} LQ-@list@ is (/48) by its incomplete (25\%) use case coverage. Its bars are scaled to match.}
 \caption{Histogram of IR durations, decomposed by all first-order effects.
-Each of the three breakdowns divides the entire population of test results into its mutually disjoint constituents. Lower duration is better.
+Each of the three breakdowns divides the entire population of test results into its mutually disjoint constituents. Lower is better.
 }
 \label{fig:plot-list-1ord}
…
 \VRef[Figure]{fig:plot-list-1ord} gives the first-order effects.
-The first breakdown, architecture/size-zone (left), showing the overall performance of all 12 experiment on the two different hardware architectures.
-The relative experiment duration for each experiment is shown as a bar in each column and the black bar in that column shows the average of all 12 experiments.
-By inspection, Intel runs faster than AMD.
-As well, the small zone (lists of 4--16 elements) runs faster than the medium zone (lists of 50--200 elements).
+The first breakdown, architecture/size-zone (left), shows the overall performance of all 12 experiments on the two different hardware architectures for small and medium lists (624 / 4 = 156 experiments per column).
+% The relative experiment duration for each experiment is shown as a bar in each column and the black bar in that column shows the average of all 12 experiments.
+By inspection of the averages, Intel runs faster than AMD.
+Within an architecture, the small zone (lists of 4--16 elements) runs faster than the medium zone (lists of 50--200 elements).
+The overall slower execution on the AMD results from its smaller L3 cache \vs the larger cache on the Intel.
 (No NUMA effects for these list sizes.)
 Specifically, a 20\% standard deviation exists here, between the means of the four physical-effect categories.
-The key takeaway for this comparison is the context it establishes for interpreting the following framework comparisons.
-Both the particulars of a machine's cache design, and a list length's effect on the program's cache friendliness, affect insert/remove speed in the manner illlustrated in this breakdown.
-That is, if you are running on an unknown machine, at a scale above anomaly-prone individuals, and below where major LLC caching effects take over the general intrusive-list advantage, but with an unknown relationship to the sizing of your fickle low-level caches, you are likely to experience an unpredictable speed impact on the order of 20\%.
-
-A similar situation comes from \VRef[Figure]{fig:plot-list-1ord}'s second comparison, by use case.
-Specific interactions do occur, like framework X doing better on stacks than on queues; a selection of these is addressed in \VRef[Figure]{fig:plot-list-2ord} and discussed shortly.
-But they are so irrelevant to the issue of picking a winning framework that it is sufficient here to number the use cases opaquely.
-Whether a given list framework is suitable for a language's general library succeeds or fails without knowledge of whether your use will have stack or queue movement.
-So you face another lottery, with a likely win-loss range of the standard deviation of the individual use cases' means: 9\%.
-
-This context helps interpret \VRef[Figure]{fig:plot-list-1ord}'s final comparison, by framework.
-In this result, \CFA runs similarly to \uCpp and LQ-@list@ runs similarly to @tailq@.
+These hardware effects are accounted for when interpreting the following framework comparisons.
+% The key takeaway for this comparison is the context it establishes for interpreting the following framework comparisons.
+% Both the particulars of a machine's cache design, and a list length's effect on the program's cache friendliness, affect IR speed in the manner illustrated in this breakdown.
+% That is, if you are running on an unknown machine, at a scale above anomaly-prone individuals, and below where major LLC caching effects take over the general intrusive-list advantage, but with an unknown relationship to the sizing of your fickle low-level caches, you are likely to experience an unpredictable speed impact on the order of 20\%.
+
+The second breakdown, use case (middle), shows the overall performance for each of the 12 use cases from \VRef[Figure]{f:ExperimentOperations} (624 / 12 = 52 experiments per column).
+% A similar situation comes from \VRef[Figure]{fig:plot-list-1ord}'s second comparison, by use case.
+While specific differences do occur, like framework X doing better on stacks than on queues, the overall range of the standard deviation of the individual use cases' means is only 9\%, indicating no unusual cases.
+A more detailed analysis occurs in the discussion of \VRef[Figure]{fig:plot-list-2ord}.
+% But they are so irrelevant to the issue of picking a winning framework that it is sufficient here to number the use cases opaquely.
+% Whether a given list framework is suitable for a language's general library succeeds or fails without knowledge of whether your use will have stack or queue movement.
+% So you face another lottery, with a likely win-loss range of the standard deviation of the individual use cases' means: 9\%.
+
+The third breakdown, framework (right), shows the overall performance of the 4 list implementations (624 / 3.25 = 192 experiments per column).
+Here, \CFA runs similarly to \uCpp and LQ-@list@ runs similarly to @tailq@.
 The standard deviation of the frameworks' means is 8\%.
-Framework choice has, therefore, less impact on your speed than the lottery tickets you already hold.
-
-Now, the LQs do indeed beat the UW languages by 15\%, a fact explored further in \MLB{TODO: xref}.
+% Framework choice has, therefore, less impact on your speed than the lottery tickets you already hold.
+Now, \CFA/\uCpp run slower than LQ-@list@/@tailq@ by 15\%, a fact explored further in \VRef{s:SweetSoreSpots}.
 But so too does use case VIII typically beat use case IV by 38\%.
 As does a small size on the Intel typically beat a medium size on the AMD by 66\%.
-Framework choice is simply not where you stand to win or lose the most.
+Hence, architecture and usage patterns have a significant effect, often larger than the choice of framework.
 
 
 \subsection{Result: Sweet and Sore Spots}
+\label{s:SweetSoreSpots}
 
 \begin{figure}
 \centering
 \includegraphics{plot-list-2ord.pdf}\\
-
-
 \small{
 \textsuperscript{\textdagger} LQ-@list@ is absent from Movement and Polarity comparisons because it does not support queue and insert-last, respectively.\\
…
 }
 \caption{Histogram of IR durations, illustrating interactions with framework.
-Each distribution shows how its framework reacts to a single other factor being varied across one pair of options.
-Every (binned and mean-contributing) individual data point represents a pair of test setups, one with the criterion set to the option labelled at the top; the other setup uses the bottom option.
-This point's y-axis score is the ratio of these setups' durations.
-The point lands in a bin closer to the label of the option that performs better.
+Higher favours top option; lower favours bottom option.
 }
 \label{fig:plot-list-2ord}
 \end{figure}
 
-\VRef[Figure]{fig:plot-list-1ord} stays razor-focused on only first-order effects in order to contextualize a winner/loser framework observation.
-But this perspective cannot address questions like, ``Where are \CFA's sore spots?''
-Moreover, the shallow threatment of use cases by ordinals said nothing about how stack usage compares with queues'.
-
-\VRef[Figure]{fig:plot-list-2ord} provides such answers.
-Its size-zone criterion refines the obvious notion that a small size runs faster than a big size; this issue is by how much.
+% \VRef[Figure]{fig:plot-list-1ord} is focused on only first-order effects in order to contextualize a winner/loser framework observation.
+% But this perspective cannot address questions like, ``Where are \CFA's sore spots?''
+% Moreover, the shallow treatment of use cases by ordinals said nothing about how stack usage compares with queues.
+
+\VRef[Figure]{fig:plot-list-2ord} shows how frameworks react to a single other factor being varied across one pair of options.
+Every (binned and mean-contributing) individual data point represents a pair of test setups, one with the criterion set to the option labelled at the top; the other setup uses the bottom option.
+This point's y-axis score is the ratio of these setups' durations.
+The point lands in a bin closer to the label of the option that performs better.
+
+The first breakdown, size zone (left), refines the notion that a small size runs faster than a big size;
+this issue is by how much.
 Indeed, all means favour small and few tails favour medium.
 But the various frameworks do not respond to the different sizes and machines uniformly.
-On the AMD, \CFA and \uCpp have a modest size sensitivity, LQ-tailq's is moderate amd LQ-list seems unaffected.
-On the Intel, \CFA's increases to moderate, while \uCpp is now unaffected, and both LQs have a dramatic response.
-The Intel is more sensitive to size than the AMD.
-
-Turning next to movement and polarity, the responses appear more subdued.
-Note that LQ-list has no represntation in these comparisons because it only supports stacks that push and pop with the first element.
+On the AMD, \CFA and \uCpp have a modest size sensitivity, LQ-tailq's is moderate, and LQ-list seems unaffected.
+On the Intel, \CFA's increases to moderate, while \uCpp is now unaffected, and both LQs have a large effect.
+Hence, the Intel is more sensitive to size than the AMD.
+
+In the second breakdown, movement and polarity (middle), the responses are more subdued.
+Note, LQ-list has no representation in these comparisons because it only supports stacks that push and pop with the first element.
 \CFA is completely stable under movement and polarity changes.
 \uCpp and LQ show modest responses favouring queues and insertion at last.
 
-Finally, with accessor, a \CFA sore spot emerges.
+In the third breakdown, accessor (right), the responses are close, except for \CFA.
 Note the pair of two-way comparisons pulled from the three experiment setups used.
-First, the all-head/insert-element opposition addresses which insertion style is better---by-head (top) and by-element (bottom).
-Then, the all-head/remove-element opposition addresses which removal style is better---by-head (top) and by-element (bottom).
+First, the all-head/insert-element comparison addresses which insertion style is better---by-head (top) and by-element (bottom).
+Then, the all-head/remove-element comparison addresses which removal style is better---by-head (top) and by-element (bottom).
 The LQs favour insertion by head and removal by element.
 \CFA and \uCpp favour both operations by head.
…
 
+\begin{comment}
 \subsection{\CFA Tiny-Size Anomaly}
…
 \label{fig:plot-list-wide}
 \end{figure}
+\end{comment}
 
 \begin{comment}
…
 
 \begin{comment}
-
-
 \VRef[Figure]{fig:plot-list-zoomin} shows the sizes below 150 blown up.
-% The same scenario as the coarse comparison is used: a stack, with insertions and removals happening at the end called ``first,'' ``head'' or ``front,'' and all changes occurring through a head-provided insert/remove operation.
+% The same scenario as the coarse comparison is used: a stack, with insertions and removals happening at the end called ``first,'' ``head'' or ``front,'' and all changes occurring through a head-provided IR operation.
 The error bars show fastest and slowest time seen on five trials, and the central point is the mean of the remaining three trials.
 For readability, the points are slightly staggered at a given horizontal value, where the points might otherwise appear on top of each other.
…
 Ultimately, this analysis provides options for a future effort that needs to get the most speed out of the \CFA list.
-
-
 \end{comment}