Changeset eeefc0c


Ignore:
Timestamp:
Apr 26, 2026, 10:51:51 AM (3 days ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master
Children:
ac9c0ee8
Parents:
7184906
Message:

proofread list chapter

Location:
doc/theses/mike_brooks_MMath
Files:
3 edited

  • doc/theses/mike_brooks_MMath/list.tex

    r7184906 reeefc0c  
    503503The expansion and underlying API are under discussion.
    504504TODO: explain pivot from ``is at done?'' to ``has more?''
    505 Advantages of this change include being able to pass ranges to functions, for example, projecting a numerically regular subsequence of array entries, and being able to use the loop syntax to cover more collection types, such as looping over the keys of a hashtable.
     505Advantages of this change include being able to pass ranges to functions, for example, projecting a numerically regular subsequence of array entries, and being able to use the loop syntax to cover more collection types, such as looping over the keys of a hash table.
    506506
    507507When iterating an empty list, the question, ``Is there a further element?'' needs to be posed once, receiving the answer, ``no.''
     
    653653The goal is to show the \CFA lists are competitive with other designs, but the different list designs may not have equivalent functionality, so it is impossible to select a winner encompassing both functionality and execution performance.
    654654
     655
    655656\subsection{Experiment Design}
     657
     658This section explains how the experiment is built.
     659Many of the following parts define terminology concerning tuning knobs.
     660\VRef[Figure]{f:ListPerfGlossary} provides a consolidated reference.
    656661
    657662\begin{figure}
     
    664669-- Movement & \\
    665670        \quad $\ni$ stack
    666                 & Inserts and removes happen at the same end. \\
     671                & IRs happen at the same end. \\
    667672        \quad $\ni$ queue
    668                 & Inserts and removes happen at opposite ends.  \\
     673                & IRs happen at opposite ends.  \\
    669674-- Polarity
    670675                & Which of the two orientations in which the movement happens. \\
     
    676681                & How an insertion position, or removal element, is specified.  The same position/element is picked either way. \\
    677682        \quad $\ni$ all head
    678                 & inserts and removes both through the head \\
     683                & IRs both through the head \\
    679684        \quad $\ni$ insert element
    680685                & insert by element and remove through the head \\
     
    717722
    718723
    719 This section explains how the experiment is built.
    720 Many of the parts following define terminology concerning tuning knobs.
    721 \VRef[Figure]{f:ListPerfGlossary} provides a consolidated refence.
    722 
    723 
    724724\subsubsection{Add-Remove Performance}
    725725\label{s:AddRemovePerformance}
     
    733733
    734734This experiment takes the position that:
    735 \begin{itemize}[leftmargin=*]
     735\begin{enumerate}[leftmargin=*]
    736736        \item The total time to add and remove is relevant, as opposed to having one time for adding and a separate time for removing.
    737737                  Adds without removes quickly fill memory;
     
    743743        \item Speed differences caused by the host machine's memory hierarchy need to be identified and explained,
    744744                  but do not represent advantages of one framework over another.
    745 \end{itemize}
    746 
    747 The experiment used to measure insert/remove cost measures the mean duration of a sequence of additions and removals.
     745\end{enumerate}
     746
     747The experiment used to measure IR cost measures the mean duration of a sequence of additions and removals.
    748748The distribution of speeds experienced by an individual add-remove pair (tail latency) is not discussed.
    749749Space efficiency is shown only indirectly, by way of caches' impact on speed.
     
    755755
    756756In all cases, the quantity discussed is the duration of one insert-remove (IR).
    757 An IR is the time taken to do one innermost insertion-loop iteration, one innermost removal-loop iteration, and its share of all overheads, ammortized.
    758 
    759 Lower IR duration is better.
     757An IR is the time taken to do one innermost insertion-loop iteration, one innermost removal-loop iteration, and its share of all overheads, amortized.
     758Lower IR is better.
    760759This experiment typically does an IR in 1--10 ns.
    761760The short end of this range has durations of single-digit clock-cycle counts.
     
    763762
    764763Often, an IR duration value needs to be considered relatively.
    765 For example, \VRef[Section]{toc:sweet-sore} asks whether one linked list implementation is more sensitive than another to changing which computer runs the test.
    766 A finding might be that a machine change slows implementation A by 10\% and B by 20\%.
     764For example, \VRef{s:SweetSoreSpots} asks whether one linked list implementation is more sensitive than another to the computer architecture.
     765A finding might be that a machine change slows implementation A by 10\% and B by 20\%.
    767766This finding is not saying that A is faster than B (on either machine).
    768 The finding could stand if B started 10\% faster and the machine change levelled them off, if B started slower and got worse, or in myriad other cases.
    769 The finding asserts that such distinctions are not what's immediately relevant.
    770 The arithmetic that produces the 10\% and 20\% answers is removing the information about which one starts, or ends up, faster.
     767The finding could stand if B starts faster and then levels off, if B starts slower and gets worse, or in myriad other cases.
     768The finding asserts that such distinctions are not what is immediately relevant.
     769The arithmetic producing the 10\% and 20\% answers removes the information about which one starts, or ends up, faster.
    771770Each implementation's to-machine duration is stated relative to \emph{the same implementation's} from-machine duration.
    772771The resulting measure is still about a duration.
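The relative measure can be made concrete with invented durations (the 10\%/20\% figures from the example above); each ratio is within one implementation, so which implementation is absolutely faster is discarded:

```c
#include <assert.h>

// Relative duration: an implementation's to-machine time over the *same*
// implementation's from-machine time; cross-implementation ordering is discarded.
double relative( double from_ns, double to_ns ) { return to_ns / from_ns; }
```

For instance, A going 10 ns to 11 ns and B going 9 ns to 10.8 ns give relative slowdowns of 10\% and 20\%, even though B is absolutely faster on both machines.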
     
    799798}
    800799stopTimer();
    801 reportedDuration = getTimerDuration() / totalOpsDone; // throughput per insert/remove operation
     800reportedDuration = getTimerDuration() / totalOpsDone; // throughput per IR operation
    802801\end{cfa}
    803802To reduce administrative overhead, the $n$ nodes for each experiment list are preallocated in an array (on the stack), which removes dynamic allocations for this storage.
     
    807806After each round, a counter is incremented by $n$ (for throughput).
    808807Time is measured outside the loop because a large $n$ can overrun the time duration before the @CONTINUE@ flag is tested.
    809 Hence, there is minimum of one outer (@CONTINUE@) loop iteration for large lists.
     808Hence, there is a minimum of one outer (@CONTINUE@) loop iteration for large lists.
    810809The loop duration is divided by the counter and this throughput is reported.
    811810In a scatter-plot, each dot is one throughput, which means insert + remove + harness overhead.
     
    814813
    815814To test list operations, the experiment performs the inserts/removes in different patterns, \eg insert and remove from front, insert from front and remove from back, random insert and remove, \etc.
    816 Unfortunately, the @std::list@ does \emph{not} support direct insert/remove from a node without an iterator, \ie no @erase( node )@, even though the list is doubly-linked.
     815Unfortunately, the @std::list@ does \emph{not} support direct IR from a node without an iterator, \ie no @erase( node )@, even though the list is doubly-linked.
    817816To eliminate the iterator, a trick is used for random insertions without replacement, which takes advantage of the array nature of the nodes.
    818817The @i@ fields in each node are initialized from @0..n-1@.
     
    840839Hence, the traversal is the same as the non-random traversal above.
    841840To level the experiments, an explicit access to the random node is inserted after the insertion, @temp.j = 0@, for the wrapped experiment.
    842 Furthermore, it is rare to insert/remove nodes and not access them.
     841Furthermore, it is rare to IR nodes and not access them.
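The without-replacement selection can be illustrated with a Fisher-Yates shuffle over the node indices; because the nodes sit in an array, @&nodes[idx[k]]@ is a direct element reference, so removal needs no iterator. (A hypothetical sketch; the harness's actual shuffling code is not shown here.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Fisher-Yates shuffle of the indices 0..n-1: each index is produced
// exactly once, giving random selection *without replacement*.
void shuffle( size_t idx[], size_t n, unsigned seed ) {
	srand( seed );
	for ( size_t k = 0; k < n; k += 1 ) idx[k] = k;
	for ( size_t k = n; k > 1; k -= 1 ) {	// swap slot k-1 with a random slot in 0..k-1
		size_t j = (size_t)rand() % k;
		size_t t = idx[k - 1]; idx[k - 1] = idx[j]; idx[j] = t;
	}
}
```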
    843842
    844843% \emph{Interleaving} allows for movements other than pure stack and queue.
     
    885884\subsubsection{Use Cases}
    886885\label{s:UseCases}
     886
     887Where \VRef[Figure]{f:ListPerfGlossary} enumerates the specific values, recall the use-case dimensions are:
     888\begin{description}
     889        \item[movement ($\times 2$)]
     890          In these experiments, strict stack and queue patterns are tested.
     891        \item[polarity ($\times 2$)]
     892          Obtain one polarity from the other by reversing uses of first/last.
     893        \item[accessor ($\times 3$)]
     894          Giving an add/remove location by a list head's first/last, \vs by a preexisting reference to an individual element.
     895\end{description}
    887896
    888897\begin{figure}
     
    963972\end{figure}
    964973
    965 Where \VRef[Figure]{f:ListPerfGlossary} enumerates the specific values, recall the use-case dimensions are:
    966 \begin{description}
    967         \item[movement ($\times 2$)]
    968           In these experiments, strict stack and queue patterns are tested.
    969         \item[polarity ($\times 2$)]
    970           Obtain one polarity from the other by reversing uses of first/last.
    971         \item[accessor ($\times 3$)]
    972           Giving an add/remove location by a list head's first/last, \vs by a preexisting reference to an individual element?
    973 \end{description}
    974 
    975974A use case is a specific selection of movement, polarity and accessor.
    976975These experiments run twelve use cases.
     
    985984The accessor patterns, at the (\CFA) API level, are:
    986985\begin{description}
    987   \item[all (through the) head:]  Both inserts and removes happen through the list head.  The list head operations are @insert_first@, @insert_last@, @remove_first@ and @remove_last@. \\
     986  \item[all (through the) head:]  Both IRs happen through the list head.  The list head operations are @insert_first@, @insert_last@, @remove_first@ and @remove_last@.
     987  \begin{sloppypar}
    988988  \item[insert (of known) element] \dots and remove through head:  Inserts use @insert_before(e, first)@ or @insert_after(e, last)@, where @e@ is being inserted and @first@/@last@ are element references known by list-independent means.
     989  \end{sloppypar}
    989990  \item[remove (of known) element] \dots and insert through the head:  Removes use @remove(e)@, where @e@ is being removed.  List-independent knowledge establishes that @e@ is first or last, as appropriate.
    990991\end{description}
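The accessor styles can be sketched with a minimal intrusive doubly-linked list in C. The operation names mirror the discussion (@insert_first@, @remove_first@, element-oriented @remove@), but the bodies are an invented approximation, not the \CFA library implementation:

```c
#include <assert.h>
#include <stddef.h>

// Intrusive doubly-linked list: link fields live inside the element.
struct node { struct node *next, *prev; int v; };
struct list { struct node *first, *last; };

// Head-mediated insertion/removal (the "all head" accessor).
void insert_first( struct list *l, struct node *n ) {
	n->prev = NULL; n->next = l->first;
	if ( l->first ) l->first->prev = n; else l->last = n;
	l->first = n;
}
struct node *remove_first( struct list *l ) {
	struct node *n = l->first;
	l->first = n->next;
	if ( l->first ) l->first->prev = NULL; else l->last = NULL;
	return n;
}
// Element-oriented removal: the node address identifies the element,
// so no head traversal and no iterator are needed.
void remove_elem( struct list *l, struct node *n ) {
	if ( n->prev ) n->prev->next = n->next; else l->first = n->next;
	if ( n->next ) n->next->prev = n->prev; else l->last = n->prev;
}
```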
     
    993994Comparing all-head with remove-element gives the relative performance of head-mediated \vs element-oriented removal, because both use the same insertion style.
    994995
     996
    995997\subsubsection{Sizing}
    996998
    997 
    998 It is true, but perhaps not obvious, that buildind and destroying long lists is slower than building and destroying short lists.
    999 Obviously, indeed, it takes longer to fuse and divide a hundred neighbours than five.
    1000 But the key metric in this work, AII, is about a single link--unlink.
    1001 So, critically, linking and unlinking a hundred neighbours actually takes \emph{more} than $20\times$ the time for five neighbours.
    1002 The main reason is caching; when more neighbours are being manipulated, more memory is being read and written.
    1003 
    1004 But caching success is about more than the amount of memory worked on.
    1005 Subtle changes in pattern become butterfly effects.
    1006 Aggressive ILP scheduling, which enables short AIR times, is the amplifier.
    1007 A data dependency, present in one framework but not another, is critical path in one situation but in not another.
    1008 So, duration's response to size is not a steady worsening as size increases.
    1009 Rather, each size-independent configuration often responds to size increases with leaps of worsening.
    1010 Occasionally a leap is even followed size-run of retrograde response, where a suddenly incurred penalty has a chance to ammortize away.
    1011 The frameworks tend to leapfrog over each other, at different points, as size increases.
    1012 
     999Intuition suggests measuring IR for different-sized lists should be just a multiple of linking/unlinking a single node.
     1000However, there is a scaling issue as more memory is being read and written, where caching comes into play.
     1001But caching is more than the amount of memory being accessed;
     1002the access pattern is equally important.
     1003Aggressive instruction-level parallelism scheduling, which enables short IR times, is the amplifier, \eg a data dependency is a critical path in one situation but not in another.
     1004Therefore, the duration response to size is not a steady worsening as size increases.
     1005Often, each size-independent configuration responds to size increases in discrete slowdown steps.
     1006Occasionally, a slowdown step is followed by some performance increase, where an incurred penalty begins to amortize away.
     1007Hence, performance results can have interesting jitter as size increases.
    10131008The analysis treats these behaviours as incidental.
    10141009It does not try to characterize various exact-size responses.
    1015 Rather, size zones are picked, specific effects inside of a zone are averaged away, and the story at one zone is compared to that at another zone.
     1010Rather, size zones are picked, specific effects inside of a zone are averaged away, and the story at one zone is compared to that at another.
     1011
     1012% It is true, but perhaps not obvious, that buildind and destroying long lists is slower than building and destroying short lists.
     1013% Obviously, indeed, it takes longer to fuse and divide a hundred neighbours than five.
     1014% But the key metric in this work, AII, is about a single link--unlink.
     1015% So, critically, linking and unlinking a hundred neighbours actually takes \emph{more} than $20\times$ the time for five neighbours.
     1016% The main reason is caching; when more neighbours are being manipulated, more memory is being read and written.
     1017%
     1018% But caching success is about more than the amount of memory worked on.
     1019% Subtle changes in pattern become butterfly effects.
     1020% Aggressive ILP scheduling, which enables short AIR times, is the amplifier.
     1021% A data dependency, present in one framework but not another, is critical path in one situation but in not another.
     1022% So, duration's response to size is not a steady worsening as size increases.
     1023% Rather, each size-independent configuration often responds to size increases with leaps of worsening.
     1024% Occasionally a leap is even followed size-run of retrograde response, where a suddenly incurred penalty has a chance to ammortize away.
     1025% The frameworks tend to leapfrog over each other, at different points, as size increases.
     1026%
     1027% The analysis treats these behaviours as incidental.
     1028% It does not try to characterize various exact-size responses.
     1029% Rather, size zones are picked, specific effects inside of a zone are averaged away, and the story at one zone is compared to that at another zone.
    10161030
    10171031\begin{figure}
     
    10261040        }
    10271041  \end{tabular}
    1028   \caption{Variety of IR duration \vs list length, at small--medium lengths.  Two example use cases are shown: I, stack movement with head-only access (plot a); VIII, queue movement with element-oriented removal access (plot b); both use cases have insert-first polarity.  One example is run on each machine: UC-I on AMD (ploat a); UC-VIII on Intel (plot b).  Lower duration is better.}
     1042  \caption{Variety of IR duration \vs list length, at small--medium lengths.  Two example use cases are shown: I, stack movement with head-only access (plot a); VIII, queue movement with element-oriented removal access (plot b); both use cases have insert-first polarity.  One example is run on each machine: UC-I on AMD (plot a); UC-VIII on Intel (plot b).  Lower is better.}
    10291043  \label{fig:plot-list-zoomin-abs}
    10301044\end{figure}
    10311045
    10321046\VRef[Figure]{fig:plot-list-zoomin-abs} gives two example responses to size.
    1033 The dataset here is a small portion of the overall result and it is premature to attempt conclusions about framework differences from it.
    1034 These two examples show, firstly, how differently a pair of individual configurations can behave.
    1035 Of more immediate significance, they also have a pattern repeated, in all eight of their size responses.
    1036 Note the ``small'' and ``medium'' overlaid boxes, which call out the size zones' definitions.
     1047% The dataset here is a small portion of the overall result and it is premature to attempt conclusions about framework differences from it.
     1048These two example cases show how differently two individual configurations can behave.
     1049% Of more immediate significance, they also have a pattern repeated, in all eight of their size responses.
     1050% Note the ``small'' and ``medium'' overlaid boxes, which call out the size zones' definitions.
    10371051Outside of an identified box, size response is erratic.
    1038 Inside a box, size response is smooth.
    1039 This pattern occurs generally throughout all the experimental results, beyond the two examples here.
    1040 
    1041 To preview, \VRef[Section]{toc:coarse-compre} dismisses ``Large'' sizes (above 150 elements), where the performance story is dominated by the amount of memory touched, inherently, by the choice of intrusive list \vs wrapped, and where one intrusive framework is quite obviously as good as another.
    1042 At smaller sizes, comparing one intrusive framework to another makes sense; this comparison occurs in the remaining ``Result'' sections.
    1043 Among the remaining ``not Large'' sizes, two further zones, \VRef[Figure]{fig:plot-list-zoomin-abs}'s Small and Medium, are selected as representatives of what can vary when the scale is changed.
    1044 These particular ranges were chosen beacause each range tends to have one story repeated across its constituent sizes.
    1045 If \CFA's duration increases across Small, then the other frameworks' usually do too.
    1046 If \CFA is beating \uCpp at the low end of Large, then it usually is at the high end too.
    1047 The leapfrogging tends to happen outside of these two ranges.
    1048 
    1049 A spot of poor performance appears in the general results for \CFA at size 1.
    1050 Section \MLB{TODO:xref} explores the phenomenon and concludes that it is an anomaly due to a quirky interaction with the testing rig.
    1051 To do so, two it considers size as either length or width.
    1052 Length is the number of elements in a list.
    1053 Width is a number of these lists being kept, worked upon in round-robin order.
    1054 Outside of \MLB{TODO:xref}, size always means length, and width is 1.
    1055 
     1052Inside a box, size response is relatively smooth.
     1053Within and among boxes there are identifiable patterns, which occur throughout all the experimental results.
     1054Each individual configuration is tested by five trials, giving the error bars at min and max.
     1055The amount of error here is typical across the configurations.
     1056With a few exceptions, it is modest, so experiments are repeatable.
     1057
     1058To preview, \VRef{s:ResultCoarseComparisonStyles} dismisses large sizes (above 150 elements) and wrapped lists, because the performance story is dominated by the amount of memory touched, not by intrusive \vs wrapped lists.
     1059At smaller sizes, \VRef{s:ComparingIntrusiveImplementations} shows differences appear among the intrusive-list implementations.
     1060Among the ``not Large'' sizes ($\le$ 150), two zones, Small and Medium, are selected as representatives of what can vary when the scale is changed.
     1061These particular ranges are chosen because each range tends to have one story repeated across its constituent sizes.
     1062For example, if \CFA's duration increases across Small, then the other frameworks' usually do too, or if \CFA is beating \uCpp across Medium, then it usually is at the high end too.
     1063% The leapfrogging tends to happen outside of these two ranges.
     1064
     1065Finally, on the AMD architecture, \CFA performed poorly at size 1, on queue movements only, and no other framework saw the same effect.
     1066This extreme outlier is not plotted in graphs.
     1067After exploring the phenomenon in depth (not presented), the conclusion is a quirky interaction between the hardware and the testing harness.
     1068A side experiment (that does not enrich the overall comparisons) saw user-induced gaps of $\approx$10 ns between same-list operations hide the effect completely.
     1069These gaps are realistic because when an item goes on a list another action comes back to it \emph{later}.
     1070The pattern that the general harness uses, concentrating time-adjacent operations on one list, is useful for measuring the ``small'' size zone, but is contrived from the perspective of a data hazard that only this pattern exposes.
     1071The general comparisons do not see the effect at all, because they use only the Small and Medium zones, whose shortest length is 4.
     1072
     1073% A spot of poor performance appears in the general results for \CFA at size 1.
     1074% Section \MLB{TODO:xref} explores the phenomenon and concludes that it is an anomaly due to a quirky interaction with the testing rig.
     1075% To do so, two it considers size as either length or width.
     1076% Length is the number of elements in a list.
     1077% Width is a number of these lists being kept, worked upon in round-robin order.
     1078% Outside of \MLB{TODO:xref}, size always means length, and width is 1.
    10561079
    10571080
     
    10831106The call to @pass@ can prevent a small number of compiler optimizations but this cost is the same for all lists.
    10841107
    1085 
    10861108The main difference in the machines is their cache structure.
    10871109The AMD has smaller caches that are shared less, while the Intel shares larger caches among more processors.
    1088 This difference, while an interesting tradeoff for highly concurrent use, is rather one-sided for sequential use, such as this experiment's.
    1089 The Intel offers a single processor a bigger cache.
     1110This difference, while an interesting tradeoff for highly concurrent use, is rather one-sided for sequential use, such as this experiment.
     1111Specifically, the Intel offers a single processor a bigger cache.
     1112
    10901113
    10911114\subsubsection{Recap and Master Legend}
    10921115
    1093 There are 12 use cases, which are all combinations of 2 movents, 2 polarities and 3 accessors.
    1094 There are 4 pysical contexts, which are all combinations of 2 machines and 2 size (length) zones (and 1 width, of value 1).
     1116For experiments performed in later sections, there are 12 use cases, which are all combinations of 2 movements, 2 polarities and 3 accessors.
     1117There are 4 physical contexts, which are all combinations of 2 machines and 2 size (length) zones.
    10951118Each physical context samples 4 specific sizes.
    1096 
    10971119There are 3.25 frameworks.
    10981120This accounting reflects that LQ-list supports only the movement--polarity combination ``stack, insert first''.
    10991121So LQ-list fills a quarter of the otherwise-orthogonal space.
    1100 
    11011122Use case, physical context and framework are the explanatory factors.
    1102 Taking all combinations of the explanatory factors gives 624 individual configurations:
    1103 \[
    1104         \textrm{624 individual configurations} =
    1105         \sum_{\substack{
    1106                 \textrm{12 use cases}\\
    1107                 \textrm{4 physical contexts}\\
    1108                 \textrm{4 specific sizes}\\
    1109                 \textrm{3.25 frameworks}
    1110         }}
    1111         \textrm{1 individual configuration}
    1112 \]
     1123Taking all combinations of the explanatory factors gives 12 $\times$ 4 $\times$ 4 $\times$ 3.25 = 624 individual configurations.
     1124
     1125% \[
     1126%       \textrm{624 individual configurations} =
     1127%       \sum_{\substack{
     1128%               \textrm{12 use cases}\\
     1129%               \textrm{4 physical contexts}\\
     1130%               \textrm{4 specific sizes}\\
     1131%               \textrm{3.25 frameworks}
     1132%       }}
     1133%       \textrm{1 individual configuration}
     1134% \]
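The accounting can be checked directly: LQ-list contributes one quarter (1 of the 4 movement--polarity combinations) on top of the 3 full frameworks.

```c
#include <assert.h>

// 12 use cases x 4 physical contexts x 4 specific sizes x 3.25 frameworks,
// where LQ-list counts as 1/4 because it covers 1 of 4 movement-polarity combinations.
double configurations( void ) { return 12 * 4 * 4 * ( 3 + 1.0 / 4 ); }
```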
    11131135
    11141136\begin{figure}
     
    11241146        }
    11251147  \end{tabular}
    1126   \caption{IR duration, transformed for general anaysis.  The analysis follows the single example setup of \VRef[Figure]{f:zoomin-abs-i-swift}, \ie Use Case I on AMD, where IR is given as absolute duration.  Plot (a) transforms the source dataset by conditioning on specific size.  Plot (b) takes the results from only the identified size zones, discards their specific-size information, and shows the resulting distribution.  Lower duration is better.}
     1148  \caption{IR duration, transformed for general analysis.  The analysis follows the single example setup of \VRef[Figure]{f:zoomin-abs-i-swift}, \ie Use Case I on AMD, where IR is given as absolute duration.  Plot (a) transforms the source dataset by conditioning on specific size.  Plot (b) takes the results from only the identified size zones, discards their specific-size information, and shows the resulting distribution.  Lower is better.}
    11271149  \label{fig:plot-list-rel}
    11281150\end{figure}
    11291151
     1152\begin{comment}
    1130115332 of these individual configurations, plucked from  \VRef[Figure]{f:zoomin-abs-i-swift}, are the subject of \VRef[Figure]{fig:plot-list-rel}, where they are now transformed into the format used for general analysis.
    11311154In \subref*{f:zoomin-rel-i-swift}, each of the 56 data points is an individual configuration; the subset within the two boxes has the 32 of interest.
     
    11691192That is, inter-configuration rollups discard the modest trial-repeatability error.
    11701193The girth of a histogram's distribution is entirely the inter-configuration variance, of its configurations' expected performance.
    1171 
     1194\end{comment}
     1195
     1196It is impossible to present this large amount of information in graphs.
     1197Therefore, a condensed graphing style is used in subsequent plots.
     1198\VRef[Figure]{fig:plot-list-rel} shows how the condensed graphing style is generated from raw data.
     1199\VRef[Figure]{f:zoomin-rel-i-swift} is formed from the data in \VRef[Figure]{f:zoomin-abs-i-swift}, restructured on the Y-axis using a relative duration.
     1200\VRef[Figure]{f:zoomin-histo-i-swift} shows the interesting data within the two boxes (Small/Medium) and their combination (Both).
     1201This graph plots a vertical histogram for each of the 4 lists.
     1202The light-shaded histogram is the raw data (similar data values overlap), and the dark histogram is the geomean average when there are multiple experiments condensed in a column.
     1203The caption indicates the number of values condensed into this histogram, \eg ``/4'' $\Rightarrow$ 4 data points.
     1204The vertical relationship among the averages gives a quick result for a specific experiment, where lower is better.
     1205The relative duration smooths the results, although the smoothness diminishes as size increases.
     1206This relative flatness gives nicely separated histograms.
    11721207Thus, the forthcoming comparison plots all use this condensed presentation.
    1173 \begin{itemize}
    1174 \item
    1175 The measure is mean IR among the middle 3 trials out of 5, that occurred for an individual configuration.
    1176 \item
    1177 The number of individual configurations per histogram is stated as ``/N,'' at a relevant grnaularity.
    1178 \item
    1179 All reported averages are geometric means and all IR duration axes (verticals) are logarithmic.
    1180 \item
    1181 Unless indicated otherwise, all explanatory factors appearing on a plot are marginalized, while those not appearing on the plot are conditioned.
    1182 \end{itemize}
     1208
     1209% \begin{itemize}[leftmargin=*]
     1210% \item
     1211% The measure is mean IR among the middle 3 trials out of 5, that occurred for an individual configuration.
     1212% \item
     1213% The number of individual configurations per histogram is stated as ``/N,'' at a relevant granularity.
     1214% \item
     1215% All reported averages are geometric means and all IR duration axes (verticals) are logarithmic.
     1216% \item
     1217% Unless indicated otherwise, all explanatory factors appearing on a plot are marginalized, while those not appearing on the plot are conditioned.
     1218% \end{itemize}
    11831219
    11841220
    11851221\subsection{Result: Coarse comparison of styles}
     1222\label{s:ResultCoarseComparisonStyles}
    11861223
    11871224This comparison establishes how an intrusive list performs compared with a wrapped-reference list.
    1188 \VRef[Figure]{fig:plot-list-zoomout} presents throughput at various list lengths for a linear and random (shuffled) insert/remove test.
     1225\VRef[Figure]{fig:plot-list-zoomout} presents throughput at various list lengths for a linear and random (shuffled) IR test.
    11891226Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing different ends of the access spectrum.
    1190 In the graphs, all four intrusive lists (lq-list, lq-tailq, upp-upp, cfa-cfa, see end of \VRef{s:Contenders}) are plotted with the same symbol;
     1227In the graphs, all four intrusive lists (lq-list, lq-tailq, \uCpp, \CFA, see Framework in \VRef[Figure]{f:ListPerfGlossary}) are plotted with the same symbol;
    11911228sometimes these symbols clump on top of each other, showing the performance difference among intrusive lists is small in comparison to the wrapped list (@std::list@).
    11921229See~\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists.
    11931230
    1194 The list lengths start at 10 due to the short insert/remove times of 2--4 ns, for intrusive lists, \vs STL's wrapped-reference list of 15--20 ns.
    1195 For very short lists, like 4, the experiment time of 4 $\times$ 2.5 ns and experiment overhead (loops) of 2--4 ns, results in an artificial administrative bump at the start of the graph having nothing to do with the insert/remove times.
     1231The list lengths start at 10 due to the short IR times of 2--4 ns, for intrusive lists \vs STL's wrapped-reference list of 15--20 ns.
     1232For very short lists, like 4, the experiment time of 4 $\times$ 2.5 ns and experiment overhead (loops) of 2--4 ns, results in an artificial administrative bump at the start of the graph having nothing to do with the IR times.
    11961233As the list size grows, the administrative overhead for intrusive lists quickly disappears.
    11971234
     
    12221259  \caption{Insert/remove duration \vs list length.
    12231260  Lengths go as large as possible without error.
    1224   One example use case is shown: I, stack movement, insert-first polarity and head-mediated access. Lower duration is better.}
     1261  One example use case is shown: stack movement, insert-first polarity and head-mediated access. Lower is better.}
    12251262  \label{fig:plot-list-zoomout}
    12261263\end{figure}
    12271264
    1228 The key performance factor between the intrusive and the wrapped-reference lists is the dynamic allocation for the wrapped nodes.
    1229 Hence, this experiment is largely measuring the cost of @malloc@/\-@free@ rather than insert/remove, and is sensitive to the layout of memory by the allocator.
    1230 For insert/remove of an intrusive list, the cost is manipulating the link fields, which is seen by the relatively similar results for the different intrusive lists.
    1231 For insert/remove of a wrapped-reference list, the costs are: dynamically allocating/deallocating a wrapped node, copying a external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
     1265The key performance factor between intrusive and wrapped-reference lists is the dynamic allocation for the wrapped nodes.
     1266Hence, this experiment is largely measuring the cost of @malloc@/\-@free@ rather than IR, and is sensitive to the layout of memory by the allocator.
     1267For intrusive-list IR, the cost is manipulating the link fields, which is seen by the relatively similar results for the different intrusive lists.
     1268For wrapped-reference IR, the costs are: dynamically allocating/deallocating a wrapped node, copying an external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
    12321269the allocation dominates these costs.
    12331270For example, the experiment was run with both glibc and llheap memory allocators, where llheap performance reduced the cost from 20 to 16 ns, still far from the 2--4 ns for linking an intrusive node.
     
    13061343The data is selected from the start of \VRef[Figures]{f:Linear-swift}--\subref*{f:Linear-java}, but the start of \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} is largely the same.
    13071344
    1308 
    13091345\begin{figure}
    13101346  \centering
    1311   \includegraphics{plot-list-1ord.pdf} \\
     1347  \includegraphics{plot-list-1ord.pdf}
    13121348  \small{\textsuperscript{\textdagger} LQ-@list@ is (/48) by its incomplete (25\%) use case coverage.  Its bars are scaled to match.}
    13131349  \caption{Histogram of IR durations, decomposed by all first-order effects.
    1314   Each of the three breakdowns divides the entire population of test results into its mutually disjoint constituents. Lower duration is better.
     1350  Each of the three breakdowns divides the entire population of test results into its mutually disjoint constituents. Lower is better.
    13151351  }
    13161352  \label{fig:plot-list-1ord}
     
    13181354
    13191355\VRef[Figure]{fig:plot-list-1ord} gives the first-order effects.
    1320 The first breakdown, architecture/size-zone (left), showing the overall performance of all 12 experiment on the two different hardware architectures.
    1321 The relative experiment duration for each experiment is shown as a bar in each column and the black bar in that column shows the average of all 12 experiments.
    1322 By inspection, Intel runs faster than AMD.
    1323 As well, the small zone (lists of 4--16 elements) runs faster than the medium zone (lists of 50--200 elements).
    1324 The size effect is more pronounced on the AMD with its smaller L3 cache than it is on the Intel.
     1356The first breakdown, architecture/size-zone (left), shows the overall performance of all 12 experiments on the two different hardware architectures for small and medium lists (624 / 4 = 156 experiments per column).
     1357% The relative experiment duration for each experiment is shown as a bar in each column and the black bar in that column shows the average of all 12 experiments.
     1358By inspection of the averages, Intel runs faster than AMD.
     1359Within an architecture, the small zone (lists of 4--16 elements) runs faster than the medium zone (lists of 50--200 elements).
     1360The overall slower execution on the AMD results from its smaller L3 cache \vs the larger cache on the Intel.
    13251361(No NUMA effects for these list sizes.)
    13261362Specifically, a 20\% standard deviation exists here, between the means of the four physical-effect categories.
    1327 The key takeaway for this comparison is the context it establishes for interpreting the following framework comparisons.
    1328 Both the particulars of a machine's cache design, and a list length's effect on the program's cache friendliness, affect insert/remove speed in the manner illlustrated in this breakdown.
    1329 That is, if you are running on an unknown machine, at a scale above anomaly-prone individuals, and below where major LLC caching effects take over the general intrusive-list advantage, but with an unknown relationship to the sizing of your fickle low-level caches, you are likely to experience an unpredictable speed impact on the order of 20\%.
    1330 
    1331 A similar situation comes from \VRef[Figure]{fig:plot-list-1ord}'s second comparison, by use case.
    1332 Specific interactions do occur, like framework X doing better on stacks than on queues; a selection of these is addressed in \VRef[Figure]{fig:plot-list-2ord} and discussed shortly.
    1333 But they are so irrelevant to the issue of picking a winning framework that it is sufficient here to number the use cases opaquely.
    1334 Whether a given list framework is suitable for a language's general library succeeds or fails without knowledge of whether your use will have stack or queue movement.
    1335 So you face another lottery, with a likely win-loss range of the standard deviation of the individual use cases' means: 9\%.
    1336 
    1337 This context helps interpret \VRef[Figure]{fig:plot-list-1ord}'s final comparison, by framework.
    1338 In this result, \CFA runs similarly to \uCpp and LQ-@list@ runs similarly to @tailq@.
     1363These hardware effects are accounted for when interpreting the following framework comparisons.
     1364% The key takeaway for this comparison is the context it establishes for interpreting the following framework comparisons.
     1365% Both the particulars of a machine's cache design, and a list length's effect on the program's cache friendliness, affect IR speed in the manner illustrated in this breakdown.
     1366% That is, if you are running on an unknown machine, at a scale above anomaly-prone individuals, and below where major LLC caching effects take over the general intrusive-list advantage, but with an unknown relationship to the sizing of your fickle low-level caches, you are likely to experience an unpredictable speed impact on the order of 20\%.
     1367
     1368The second breakdown, use case (middle), shows the overall performance for each of the 12 use cases from \VRef[Figure]{f:ExperimentOperations} (624 / 12 = 52 experiments per column).
     1369% A similar situation comes from \VRef[Figure]{fig:plot-list-1ord}'s second comparison, by use case.
     1370While specific differences do occur, like framework X doing better on stacks than on queues, the standard deviation of the individual use-case means is only 9\%, indicating no unusual cases.
     1371A more detailed analysis occurs in the discussion of \VRef[Figure]{fig:plot-list-2ord}.
     1372% But they are so irrelevant to the issue of picking a winning framework that it is sufficient here to number the use cases opaquely.
     1373% Whether a given list framework is suitable for a language's general library succeeds or fails without knowledge of whether your use will have stack or queue movement.
     1374% So you face another lottery, with a likely win-loss range of the standard deviation of the individual use cases' means: 9\%.
     1375
     1376The third breakdown, framework (right), shows the overall performance of the 4 list implementations (624 / 3.25 = 192 experiments per full column).
     1377Here, \CFA runs similarly to \uCpp and LQ-@list@ runs similarly to @tailq@.
    13391378The standard deviation of the frameworks' means is 8\%.
    1340 Framework choice has, therefore, less impact on your speed than the lottery tickets you already hold.
    1341 
    1342 Now, the LQs do indeed beat the UW languages by 15\%, a fact explored further in \MLB{TODO: xref}.
     1379% Framework choice has, therefore, less impact on your speed than the lottery tickets you already hold.
     1380Now, \CFA/\uCpp run slower than LQ-@list@/@tailq@ by 15\%, a fact explored further in \VRef{s:SweetSoreSpots}.
    13431381But so too does use case VIII typically beat use case IV by 38\%.
    13441382As does a small size on the Intel typically beat a medium size on the AMD by 66\%.
    1345 Framework choice is simply not where you stand to win or lose the most.
     1383Hence, architecture and usage patterns have a greater effect on performance than the choice of framework.
    13461384
    13471385
    13481386\subsection{Result: Sweet and Sore Spots}
     1387\label{s:SweetSoreSpots}
    13491388
    13501389\begin{figure}
    13511390        \centering
    13521391        \includegraphics{plot-list-2ord.pdf}\\
    1353 
    13541392        \small{
    13551393        \textsuperscript{\textdagger} LQ-@list@ is absent from Movement and Polarity comparisons because it does not support queue and insert-last, respectively.\\
     
    13581396        }
    13591397        \caption{Histogram of IR durations, illustrating interactions with framework.
    1360         Each distribution shows how its framework reacts to a single other factor being varied across one pair of options.
    1361         Every (binned and mean-contributing) individual data point represents a pair of test setups, one with the criterion set to the option labelled at the top; the other setup uses the bottom option.
    1362         This point's y-axis score is the ratio of these setups' durations.
    1363         The point lands in a bin closer to the label of the option that performs better.
     1398        Higher favours top option; lower favours bottom option.
    13641399        }
    13651400        \label{fig:plot-list-2ord}
    13661401\end{figure} 
    13671402
    1368 \VRef[Figure]{fig:plot-list-1ord} stays razor-focused on only first-order effects in order to contextualize a winner/loser framework observation.
    1369 But this perspective cannot address questions like, ``Where are \CFA's sore spots?''
    1370 Moreover, the shallow threatment of use cases by ordinals said nothing about how stack usage compares with queues'.
    1371 
    1372 \VRef[Figure]{fig:plot-list-2ord} provides such answers.
    1373 Its size-zone criterion refines the obvious notion that a small size runs faster than a big size; this issue is by how much.
     1403% \VRef[Figure]{fig:plot-list-1ord} is focused on only first-order effects in order to contextualize a winner/loser framework observation.
     1404% But this perspective cannot address questions like, ``Where are \CFA's sore spots?''
     1405% Moreover, the shallow treatment of use cases by ordinals said nothing about how stack usage compares with queues.
     1406
     1407\VRef[Figure]{fig:plot-list-2ord} shows how frameworks react to a single other factor being varied across one pair of options.
     1408Every (binned and mean-contributing) individual data point represents a pair of test setups, one with the criterion set to the option labelled at the top; the other setup uses the bottom option.
     1409This point's y-axis score is the ratio of these setups' durations.
     1410The point lands in a bin closer to the label of the option that performs better.
     1411
     1412The first breakdown, size zone (left), refines the notion that a small size runs faster than a big size;
     1413the question is by how much.
    13741414Indeed, all means favour small and few tails favour medium.
    13751415But the various frameworks do not respond to the different sizes and machines uniformly.
    1376 On the AMD, \CFA and \uCpp have a modest size sensitivity, LQ-tailq's is moderate amd LQ-list seems unaffected.
    1377 On the Intel, \CFA's increases to moderate, while \uCpp is now unaffected, and both LQs have a dramatic response.
    1378 The Intel is more sensitive to size than the AMD.
    1379 
    1380 Turning next to movement and polarity, the responses appear more subdued.
    1381 Note that LQ-list has no represntation in these comparisons because it only supports stacks that push and pop with the first element.
     1416On the AMD, \CFA and \uCpp have a modest size sensitivity, LQ-tailq's is moderate, and LQ-list seems unaffected.
     1417On the Intel, \CFA's increases to moderate, while \uCpp is now unaffected, and both LQs have a large effect.
     1418Hence, the Intel is more sensitive to size than the AMD.
     1419
     1420In the second breakdown, movement and polarity (middle), the responses are more subdued.
     1421Note, LQ-list has no representation in these comparisons because it only supports stacks that push and pop with the first element.
    13821422\CFA is completely stable under movement and polarity changes.
    13831423\uCpp and LQ show modest responses favouring queues and insertion at last.
    13841424
    1385 Finally, with accessor, a \CFA sore spot emerges.
     1425In the third breakdown, accessor (right), the responses are close, except for \CFA.
    13861426Note the pair of two-way comparisons pulled from the three experiment setups used.
    1387 First, the all-head/insert-element opposition addresses which insertion style is better---by-head (top) and by-element (bottom).
    1388 Then, the all-head/remove-element opposition addresses which removal style is better---by-head (top) and by-element (bottom).
     1427First, the all-head/insert-element comparison addresses which insertion style is better---by-head (top) and by-element (bottom).
     1428Then, the all-head/remove-element comparison addresses which removal style is better---by-head (top) and by-element (bottom).
    13891429The LQs favour insertion by head and removal by element.
    13901430\CFA and \uCpp favour both operations by head.
     
    13921432
    13931433
     1434\begin{comment}
    13941435\subsection{\CFA Tiny-Size Anomaly}
    13951436
     
    14431484  \label{fig:plot-list-wide}
    14441485\end{figure}
     1486\end{comment}
    14451487
    14461488\begin{comment}
     
    14641506
    14651507\begin{comment}
    1466 
    1467 
    14681508\VRef[Figure]{fig:plot-list-zoomin} shows the sizes below 150 blown up.
    1469 % The same scenario as the coarse comparison is used: a stack, with insertions and removals happening at the end called ``first,'' ``head'' or ``front,'' and all changes occurring through a head-provided insert/remove operation.
     1509% The same scenario as the coarse comparison is used: a stack, with insertions and removals happening at the end called ``first,'' ``head'' or ``front,'' and all changes occurring through a head-provided IR operation.
    14701510The error bars show fastest and slowest time seen on five trials, and the central point is the mean of the remaining three trials.
    14711511For readability, the points are slightly staggered at a given horizontal value, where the points might otherwise appear on top of each other.
     
    16271667
    16281668Ultimately, this analysis provides options for a future effort that needs to get the most speed out of the \CFA list.
    1629 
    16301669\end{comment}
    16311670
  • doc/theses/mike_brooks_MMath/plots/list-1ord.gp

    r7184906 reeefc0c  
    2929
    3030set xrange [-5.5:17.5];
    31 set xlabel "Machine, Size Zone (/156);                  Use Case (/52);                             Framework (/192^†)       \nPrevalence                                   Prevalence                                     Prevalence"
     31set xlabel "Architecture, Size Zone (/156);                  Use Case (/52);                             Framework (/192^†)       \nPrevalence                                   Prevalence                                     Prevalence"
    3232set xtics ( \
    3333   "AMD, sm"    -5, \
  • doc/theses/mike_brooks_MMath/plots/list-mchn-szz.gp

    r7184906 reeefc0c  
    4646
    4747set xrange [-0.25:4.25];
    48 set xlabel "Machine, Size Zone; Prevalence" offset 2,-1
     48set xlabel "Architecture, Size Zone; Prevalence" offset 2,-1
    4949set xtics (\
    5050    "AMD\nsmall" 0, \
Note: See TracChangeset for help on using the changeset viewer.