Changeset 4cf8832


Ignore:
Timestamp:
Apr 17, 2026, 9:46:46 AM (33 hours ago)
Author:
Michael Brooks <mlbrooks@…>
Branches:
master
Parents:
99bc47b
Message:

Anticipated last new content for list perf analysis

Location:
doc/theses/mike_brooks_MMath
Files:
2 added
7 edited

  • doc/theses/mike_brooks_MMath/list.tex

    r99bc47b r4cf8832  
    653653The goal is to show the \CFA lists are competitive with other designs, but the different list designs may not have equivalent functionality, so it is impossible to select a winner encompassing both functionality and execution performance.
    654654
    655 
    656 \subsection{Add-Remove Performance}
     655\subsection{Experiment Design}
     656
     657\begin{figure}
     658\noindent
     659\begin{tabular}{p{1.75in}@{\ }p{4.5in}}
     660Insert-Remove (IR)
      661                & The atomic unit of work being measured: one insertion plus one removal (plus all looping/tracking overheads) \\
     662Use Case
     663                & Pattern of add-remove calls. \\
     664-- Movement & \\
     665        \quad $\ni$ stack
     666                & Inserts and removes happen at the same end. \\
     667        \quad $\ni$ queue
     668                & Inserts and removes happen at opposite ends.  \\
     669-- Polarity
     670                & Which of the two orientations in which the movement happens. \\
     671        \quad $\ni$ insert-first
     672                & All inserts at front; stack removes at front; queue removes at back. \\
     673        \quad $\ni$ insert-last
     674                & All inserts at back; stack removes at back; queue removes at front. \\
     675-- Accessor
     676                & How an insertion position, or removal element, is specified.  The same position/element is picked either way. \\
     677        \quad $\ni$ all head
     678                & inserts and removes both through the head \\
     679        \quad $\ni$ insert element
     680                & insert by element and remove through the head \\
     681        \quad $\ni$ remove element
     682                & insert through head and remove by element\\
     683Physical Context & \\
     684-- Size (number) &  Number of nodes being linked.  Unless specified, equals the \emph{length} of the program's sole list.  \emph{Width}, rarely used, is the number of lists. \\
     685-- Size Zone
     686                & Contiguous range of sizes, chosen to avoid known anomalies and to sample a brief plateau.  Each zone buckets four specific sizes.     \\
     687   \quad $\ni$ small
     688                & lists of 4--16 elements \\
     689   \quad $\ni$ medium
     690                & lists of 50--200 elements \\
     691   \quad $\ni$ (other)
     692                &   Not used for comparing intrusive frameworks. \\
     693-- machine
     694                & Computer running the experiment \\
     695        \quad $\ni$ AMD
     696                & smaller cache \\
     697        \quad $\ni$ Intel
     698                & bigger cache \\
     699Framework & A particular linked-list implementation (within its host language) \\
     700$\ni$ \CC       & The @std::list@ type of g++. \\
     701$\ni$ lq-list   & The @list@ type of LQ from glibc of gcc. \\
     702$\ni$ lq-tailq  & The @tailq@ type of the same. \\
     703$\ni$ \uCpp     & \uCpp's @uSequence@ \\
     704$\ni$ \CFA      & \CFA's @dlist@ \\
     705Explanation being
     706                & How independent explanatory variable X is analyzed \\
     707-- Marginalized
     708                & Left alone, allowed to vary, yielding a more absolute measure.  Shows the effect that X causes.  If all explanations are marginalized, then absolute times are available and a relative time has a peer group that is the entire population.  \\
     709-- Conditioned
     710                & Held constant, yielding a more relative measure.  Hides the effect that X causes.  Conditioning on X creates more, smaller relative-measure peer groups, by isolating each X-domain value.  Resulting interpretation is, ``Assuming no change in X.'' \\
     711\end{tabular}
     712\caption{
     713        Glossary of terms used in the list performance evaluation.
     714}
     715\label{f:ListPerfGlossary}
     716\end{figure}
     717
     718
     719This section explains how the experiment is built.
     720Many of the following parts define terminology for the experiment's tuning knobs.
     721\VRef[Figure]{f:ListPerfGlossary} provides a consolidated reference.
     722
     723
     724\subsubsection{Add-Remove Performance}
    657725\label{s:AddRemovePerformance}
    658726
     
    661729Thus, adding and removing an element are the sole primitive actions.
    662730
    663 Repeated adding and removing is necessary to measure timing because these operations can be as simple as a dozen instructions.
     731Repeated adding and removing is necessary to measure timing because these operations can be as short as a dozen instructions.
    664732These instruction sequences may have cases that proceed (in a modern, deep pipeline) without a stall.
    665733
    666734This experiment takes the position that:
    667735\begin{itemize}[leftmargin=*]
    668         \item The total time to add and remove is relevant, as opposed to having one individual time for adding and a separate time for removing.
     736        \item The total time to add and remove is relevant, as opposed to having one time for adding and a separate time for removing.
    669737                  Adds without removes quickly fill memory;
    670                   removes without adds is meaningless.
    671         \item A relevant breakdown ``by operation'' is:
    672                 \begin{description}
    673                 \item[movement]
    674                   Is the add/remove applied to a stack, queue, or something else?
    675                   In these experiments, strict stack and queue shapes are tested (two movements).
    676                 \item[polarity]
    677                   In which direction does the action apply?
    678                   For a queue, do items flow from first to last or last to first?
    679                   For a stack, is the first or last end used for adding and removing?
    680                   In these experiments, both polarities are considered, labelled insert-first and insert-last (two polarities).
    681                 \item[accessor]
    682                   Is an add/remove location given by a list head's first/last (@insertFirst@, @removeLast@), or by a reference to an individual element (@insertAfter@, @remove@ of element)?
    683                   In these experiments, the (three) accessors are:
    684                   \begin{itemize}
    685                   \item
    686               inserts and removes both through the head ("all-head")
    687                   \item
    688           insert by element and remove through the head ("insert-element")
    689                   \item
    690                   insert through head and remove by element ("remove-element")
    691                   \end{itemize}
    692                 \end{description}
    693         \item So, an "operating scenario" is a specific selection of movement, polarity and accessor. These experiments run twelve operating scenarios.
     738                  removing without adding is impossible.
     739        \item A relevant breakdown ``by operation'' is, rather, the usage pattern of the add/remove calls.
      740        An example pattern choice is adding and removing at the same end, making a stack, or at opposite ends, making a queue.
     741        Another is pushing on the front by calling @insert_first(lst, e)@ \vs @insert(e, old_first_elm)@; this aspect provides the test's API coverage.
     742        \VRef[Section]{s:UseCases} gives the full breakdown.
    694743        \item Speed differences caused by the host machine's memory hierarchy need to be identified and explained,
    695                   but do not represent advantages of one linked-list implementation over another.
     744                  but do not represent advantages of one framework over another.
    696745\end{itemize}
    697746
     698747The insert/remove-cost experiment measures the mean duration of a sequence of additions and removals.
    699 Confidence bounds, on this mean, are discussed.
    700748The distribution of speeds experienced by an individual add-remove pair (tail latency) is not discussed.
    701749Space efficiency is shown only indirectly, by way of caches' impact on speed.
     
    706754\end{itemize}
    707755
    708 In the result analysis, where list length is a performance-influencing factor, once ``large'' lengths are dismissed, these zones are identified as representing different patterns:
    709 \begin{description}
    710         \item[size zone ``small''] lists of 4--16 elements
    711         \item[size zone ``medium''] lists of 50--200 elements
    712 \end{description}
    713 Each zone buckets four specific sizes at which trials are run.
    714 
    715 
    716 \subsubsection{Experiment setup}
    717 \label{s:ExperimentSetup}
     756In all cases, the quantity discussed is the duration of one insert-remove (IR).
      757An IR is the time taken to do one innermost insertion-loop iteration, one innermost removal-loop iteration, and an amortized share of all overheads.
     758
     759Lower IR duration is better.
     760This experiment typically does an IR in 1--10 ns.
     761The short end of this range has durations of single-digit clock-cycle counts.
      762Therefore, the situations that achieve the best times must be keeping the instruction pipeline saturated.
     763
     764Often, an IR duration value needs to be considered relatively.
     765For example, \VRef[Section]{toc:sweet-sore} asks whether one linked list implementation is more sensitive than another to changing which computer runs the test.
     766A finding might be that a machine change slows implementation A by 10\% and B by 20\%.
     767This finding is not saying that A is faster than B (on either machine).
     768The finding could stand if B started 10\% faster and the machine change levelled them off, if B started slower and got worse, or in myriad other cases.
      769The finding asserts that such distinctions are not what is immediately relevant.
      770The arithmetic that produces the 10\% and 20\% answers removes the information about which one starts, or ends up, faster.
      771Each implementation's to-machine duration is stated relative to \emph{the same implementation's} from-machine duration.
     772The resulting measure is still about a duration.
     773The framework with the lower from-machine-relative duration handles the change better.
     774
     775
     776\subsubsection{Test Program}
     777\label{s:TetProgram}
    718778
     719779The experiment driver defines an (intrusive) node type:
     
     750810The loop duration is divided by the iteration count and this amortized per-operation time is reported.
     751811In a scatter-plot, each dot is one such measurement, which means insert + remove + harness overhead.
    752 The harness overhead is constant when comparing linked-list implementations and keep as small as possible.
     812The harness overhead is constant when comparing linked-list frameworks and is kept as small as possible.
    753813% The remainder of the setup section discusses the choices that affected the harness overhead.
    754814
     
    822882% This harness avoids telling the hardware what the SUT is about to do.
    823883
    824 The comparator linked-list implementations are:
    825 \begin{description}
    826 \item[std::list]  The @list@ type of g++.
    827 \item[lq-list]  The @list@ type of LQ from glibc of gcc.
    828 \item[lq-tailq] The @tailq@ type of the same.
    829 \item[upp-upp]  \uCpp provided @uSequence@
    830 \item[cfa-cfa]  \CFA's @dlist@
    831 \end{description}
    832 
    833 
    834 \subsubsection{Execution Environment}
    835 \label{s:ExperimentalEnvironment}
    836 
    837 The performance experiments are run on:
    838 \begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
    839 %\item[PC]
    840 %with a 64-bit eight-core AMD FX-8370E, with ``taskset'' pinning to core \#6.  The machine has 16 GB of RAM and 8 MB of last-level cache.
    841 %\item[ARM]
    842 %Gigabyte E252-P31 128-core socket 3.0 GHz, WO memory model
    843 \item[AMD]
    844 Supermicro AS--1125HS--TNR EPYC 9754 128--core socket, hyper-threading $\times$ 2 sockets (512 processing units) 2.25 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 1024KB L2, 16MB L3, where each L3 cache covers 1 NUMA node and 8 cores (16 processors).
    845 \item[Intel]
     846 Supermicro SYS-121H-TNR Xeon Gold 6530 32--core, hyper-threading $\times$ 2 sockets (128 processing units) 2.1 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 2048KB L2, 160MB L3, where each L3 cache covers 2 NUMA nodes and 32 cores (64 processors).
    847 \end{description}
     848 The experiments are single-threaded and pinned to a single core to prevent any OS movement, which might cause cache or NUMA effects that perturb the experiment.
    849 
    850 The compiler is gcc/g++-14.2.0 running on the Linux v6.8.0-52-generic OS.
    851 Switching between the default memory allocators @glibc@ and @llheap@ is done with @LD_PRELOAD@.
    852 To prevent eliding certain code patterns, crucial parts of a test are wrapped by the function @pass@
    853 \begin{cfa}
    854 // prevent eliding, cheaper than volatile
    855 static inline void * pass( void * v ) {  __asm__  __volatile__( "" : "+r"(v) );  return v;  }
    856 ...
    857 pass( &remove_first( lst ) );                   // wrap call to prevent elision, insert cannot be elided now
    858 \end{cfa}
    859 The call to @pass@ can prevent a small number of compiler optimizations but this cost is the same for all lists.
    860 
    861 
    862 \subsection{Result: Coarse comparison of styles}
    863 
    864 This comparison establishes how an intrusive list performs compared with a wrapped-reference list.
    865 \VRef[Figure]{fig:plot-list-zoomout} presents throughput at various list lengths for a linear and random (shuffled) insert/remove test.
     866 Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing different ends of the access spectrum.
    867 In the graphs, all four intrusive lists (lq-list, lq-tailq, upp-upp, cfa-cfa, see end of \VRef{s:ExperimentSetup}) are plotted with the same symbol;
     868 sometimes these symbols clump on top of each other, showing the performance difference among intrusive lists is small in comparison to the wrapped list (std::list).
    869 See~\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists.
    870 
    871 The list lengths start at 10 due to the short insert/remove times of 2--4 ns, for intrusive lists, \vs STL's wrapped-reference list of 15--20 ns.
    872 For very short lists, like 4, the experiment time of 4 $\times$ 2.5 ns and experiment overhead (loops) of 2--4 ns, results in an artificial administrative bump at the start of the graph having nothing to do with the insert/remove times.
    873 As the list size grows, the administrative overhead for intrusive lists quickly disappears.
    874 
    875 \begin{figure}
    876   \centering
    877   \setlength{\tabcolsep}{0pt}
    878   \begin{tabular}{p{0.75in}p{2.75in}p{3in}}
    879   &
    880   \subfloat[Linear List Nodes, AMD]{\label{f:Linear-swift}
    881         \hspace*{-0.75in}
    882         \includegraphics{plot-list-zoomout-noshuf-swift.pdf}
    883   } % subfigure
    884   &
    885   \subfloat[Linear List Nodes, Intel]{\label{f:Linear-java}
    886         \includegraphics{plot-list-zoomout-noshuf-java.pdf}
    887   } % subfigure
    888   \\
    889   &
    890   \subfloat[Random List Nodes, AMD]{\label{f:Random-swift}
    891         \hspace*{-0.75in}
    892         \includegraphics{plot-list-zoomout-shuf-swift.pdf}
    893   } % subfigure
    894   &
    895   \subfloat[Random List Nodes, Intel]{\label{f:Random-java}
    896         \includegraphics{plot-list-zoomout-shuf-java.pdf}
    897   } % subfigure
    898   \end{tabular}
    899   \caption{Insert/remove duration \vs list length.
     900 Lengths go as large as possible without error.
    901   One example operation is shown: stack movement, insert-first polarity and head-mediated access. Lower is better.}
    902   \label{fig:plot-list-zoomout}
    903 \end{figure}
    904 
    905 The key performance factor between the intrusive and the wrapped-reference lists is the dynamic allocation for the wrapped nodes.
    906 Hence, this experiment is largely measuring the cost of @malloc@/\-@free@ rather than insert/remove, and is sensitive to the layout of memory by the allocator.
    907 For insert/remove of an intrusive list, the cost is manipulating the link fields, which is seen by the relatively similar results for the different intrusive lists.
     908 For insert/remove of a wrapped-reference list, the costs are: dynamically allocating/deallocating a wrapped node, copying an external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
    909 the allocation dominates these costs.
    910 For example, the experiment was run with both glibc and llheap memory allocators, where llheap performance reduced the cost from 20 to 16 ns, still far from the 2--4 ns for linking an intrusive node.
    911 Hence, there is no way to tease apart the allocation, copying, and linking costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage that storage.
    912 
    913 In detail, \VRef[Figure]{f:Linear-swift}--\subref*{f:Linear-java} shows linear insertion of all the nodes and then linear removal, both in the same direction.
    914 For intrusive lists, the nodes are adjacent in memory from being preallocated in an array.
    915 For wrapped lists, the wrapped nodes happen to be adjacent because the memory allocator uses bump allocation during the initial phase of allocation.
     916 These memory layouts result in high spatial and temporal locality for both kinds of lists during the linear array traversal.
    917 With address look-ahead, the hardware does an excellent job of managing the multi-level cache.
     918 Hence, performance is largely constant for both kinds of lists, until L3 cache and NUMA boundaries are crossed for longer lists and the costs increase consistently.
    919 For example, on AMD (\VRef[Figure]{f:Linear-swift}), there is one NUMA node but many small L3 caches, so performance slows down quickly as multiple L3 caches come into play, and remains constant at that level, except for some anomalies for very large lists.
     920 On Intel (\VRef[Figure]{f:Linear-java}), there are four NUMA nodes and four slowdown steps as list-length increases.
    921 At each step, the difference between the kinds of lists decreases as the NUMA effect increases.
    922 
    923 In detail, \VRef[Figure]{f:Random-swift}--\subref*{f:Random-java} shows random insertion and removal of the nodes.
    924 As for linear, there is the issue of memory allocation for the wrapped list.
    925 As well, the consecutive storage-layout is the same (array and bump allocation).
     926 Hence, the difference is the random linking among nodes, producing random accesses even though the list is traversed linearly, and thus similar cache events for both kinds of lists.
    927 Both \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} show the slowdown of random access as the list-length grows resulting from stepping out of caches into main memory and crossing NUMA nodes.
    928 % Insert and remove operations act on both sides of a link.
    929 %Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location.
    930 As for linear, the Intel (\VRef[Figure]{f:Random-java}) graph shows steps from the four NUMA nodes.
    931 Interestingly, after $10^6$ nodes, intrusive lists are slower than wrapped.
    932 I did not have time to track down this anomaly, but I speculate it results from the difference in touching the data in the accessed node, as the data and links are together for intrusive and separated for wrapped.
     933 For the llheap memory-allocator and the two tested architectures, intrusive lists outperform wrapped lists up to size $10^3$ for both linear and random, and performance begins to converge around $10^6$ nodes as architectural issues begin to dominate.
     934 Clearly, memory allocator and hardware architecture play a large role in the total cost and the crossover points as list-size increases.
    935 % In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a non-intrusive list may be preferred for lists of approximately the cache's size.
    936 
     937 The takeaway from this experiment is that wrapped-list operations are expensive because memory allocation is expensive at this fine-grained level of execution.
    938 Hence, when possible, using intrusive links can produce a significant performance gain, even if nodes must be dynamically allocated, because the wrapping allocations are eliminated.
    939 Even when space is a consideration, intrusive links may not use more storage if a node is often linked.
    940 Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures or their tool-set does not provide them.
    941 
    942 % Note, linear access may not be realistic unless dynamic size changes may occur;
    943 % if the nodes are known to be adjacent, use an array.
    944 
    945 % In a wrapped-reference list, list nodes are allocated separately from the items put into the list.
    946 % Intrusive beats wrapped at the smaller lengths, and when shuffling is avoided, because intrusive avoids dynamic memory allocation for list nodes.
    947 
    948 % STL's performance is not affected by element order in memory.
    949 %The field of intrusive lists begins with length-1 operations costing around 10 ns and enjoys a ``sweet spot'' in lengths 10--100 of 5--7-ns operations.
    950 % This much is also unaffected by element order.
    951 % Beyond this point, shuffled-element list performance worsens drastically, losing to STL beyond about half a million elements, and never particularly leveling off.
    952 % In the same range, an unshuffled list sees some degradation, but holds onto a 1--2 $\times$ speedup over STL.
    953 
    954 % The apparent intrusive ``sweet spot,'' particularly its better-than-length-1 speed, is not because of list operations truly running faster.
    955 % Rather, the worsening as length decreases reflects the per-operation share of harness overheads incurred at the outer-loop level.
    956 % Disabling the harness's ability to drive interleaving, even though the current scenario is using a ``never work in middle'' interleave, made this rise disappear.
    957 % Subsequent analyses use length-controlled relative performance when comparing intrusive implementations, making this curiosity disappear.
    958 
    959 % The remaining big-swing comparison points say more about a computer's memory hierarchy than about linked lists.
    960 % The tests in this chapter are only inserting and removing.
    961 % They are not operating on any user payload data that is being listed.
    962 % The drastic differences at large list lengths reflect differences in link-field storage density and in correlation of link-field order to element order.
    963 % These differences are inherent to the two list models.
    964 
    965 % A wrapped-reference list's separate nodes are allocated right beside each other in this experiment, because no other memory allocation action is happening.
    966 % As a result, the interlinked nodes of the STL list are generally referencing their immediate neighbours.
    967 % This pattern occurs regardless of user-item shuffling because this test's ``use'' of the user-items' array is limited to storing element addresses. 
    968 % This experiment, driving an STL list, is simply not touching the memory that holds the user data.
    969 % Because the interlinked nodes, being the only touched memory, are generally adjacent, this case too has high memory locality and stays fast.
    970 
    971 % But the comparison of unshuffled intrusive with wrapped-reference gives the performance of these two styles, with their the common impediment of overfilling the cache removed.
    972 % Intrusive consistently beats wrapped-reference by about 20 ns, at all sizes. 
    973 % This difference is appreciable below list length 0.5 M, and enormous below 10 K.
    974 
    975 
    976 \subsection{Result: Comparing intrusive implementations}
    977 \label{s:ComparingIntrusiveImplementations}
    978 
     979 The preceding result shows the intrusive implementations have better performance than the wrapped lists for small to medium sized lists.
    980 This analysis covers the experiment position taken in \VRef{s:AddRemovePerformance} for movement, polarity, and accessor.
    981 \VRef[Figure]{f:ExperimentOperations} shows the experiment operations tested, which results in 12 experiments (I--XII) for comparing intrusive implementations.
     982 To preclude hardware interference, only list sizes below 150 are examined to differentiate among the intrusive implementations.
    983 The data is selected from the start of \VRef[Figures]{f:Linear-swift}--\subref*{f:Linear-java}, but the start of \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} is largely the same.
    984 
    985 \begin{figure}
     884
     885\subsubsection{Use Cases}
     886\label{s:UseCases}
     887
     888\begin{figure}
     889\begin{comment}
    986890\centering
    987891\setlength{\tabcolsep}{8pt}
     
    1026930        \small
    1027931        \begin{tabular}{@{}ll@{}}
    1028         I:      & stack, insert first, I-head / R-head \\
    1029         II:     & stack, insert first, I-list / R-head \\
    1030         III:& stack, insert first, I-head / R-list \\
    1031         IV:     & stack, insert last, I-head / R-head \\
    1032         V:      & stack, insert last, I-list / R-head \\
    1033         VI:     & stack, insert last, I-head / R-list \\
    1034         VII:& queue, insert first, I-head / R-head \\
    1035         VIII:& queue, insert first, I-list / R-head \\
    1036         IX:     & queue, insert first, I-head / R-list \\
    1037         X:      &  queue, insert last, I-head / R-head \\
    1038         XI:     & queue, insert last, iI-list / R-head \\
    1039         XII:& queue, insert last, I-head / R-list \\
     932        I:      & stack, insert first, all head \\
     933        II:     & stack, insert first, insert element \\
     934        III:& stack, insert first, remove element \\
     935        IV:     & stack, insert last, all head \\
     936        V:      & stack, insert last, insert element \\
     937        VI:     & stack, insert last, remove element \\
     938        VII:& queue, insert first, all head \\
     939        VIII:& queue, insert first, insert element \\
     940        IX:     & queue, insert first, remove element \\
     941        X:      &  queue, insert last, all head \\
     942        XI:     & queue, insert last, iinsert element \\
     943        XII:& queue, insert last, remove element \\
    1040944        \end{tabular}
    1041945\end{tabular}
    1042 \caption{Experiment Operations}
     946\end{comment}
     947\setlength{\tabcolsep}{5pt}
     948\small
     949\begin{tabular}{rcccccccccccc}
     950& I & II        & III & IV & V & VI & VII & VIII & IX & X & XI & XII \\
     951Movement &
     952stack & stack & stack & stack & stack & stack &
     953queue & queue & queue & queue & queue & queue \\
     954Polarity &
     955i-first & i-first & i-first & i-last & i-last & i-last &
     956i-first & i-first & i-first & i-last & i-last & i-last \\
      957Accessor &
     958all hd & ins-e & rem-e & all hd & ins-e & rem-e &
     959all hd & ins-e & rem-e & all hd & ins-e & rem-e
     960\end{tabular}
     961\caption{Experiment use cases, numbered.}
    1043962\label{f:ExperimentOperations}
    1044963\end{figure}
     964
      965Where \VRef[Figure]{f:ListPerfGlossary} enumerates the specific values, recall the use-case dimensions are:
     966\begin{description}
     967        \item[movement ($\times 2$)]
     968          In these experiments, strict stack and queue patterns are tested.
     969        \item[polarity ($\times 2$)]
     970          Obtain one polarity from the other by reversing uses of first/last.
     971        \item[accessor ($\times 3$)]
      972          Is an add/remove location given by a list head's first/last, or by a preexisting reference to an individual element?
     973\end{description}
     974
     975A use case is a specific selection of movement, polarity and accessor.
     976These experiments run twelve use cases.
      977When a comparison shows only what can happen when switching among use cases (as opposed to \eg how stacks differ from queues), the numbering scheme of \VRef[Figure]{f:ExperimentOperations} is used.
     978
     979With accessor, when an action names its insertion position or removal element, the harness either
     980\begin{itemize}
     981\item defers to the list-head's tracking of first/last (``through the head''), or
     982\item applies its own knowledge of the current pattern, to name a position/element that happens to be first/last (``of known element'').
     983\end{itemize}
     984
     985The accessor patterns, at the (\CFA) API level, are:
     986\begin{description}
      987  \item[all (through the) head:]  Both inserts and removes happen through the list head.  The list head operations are @insert_first@, @insert_last@, @remove_first@ and @remove_last@.
     988  \item[insert (of known) element] \dots and remove through head:  Inserts use @insert_before(e, first)@ or @insert_after(e, last)@, where @e@ is being inserted and @first@/@last@ are element references known by list-independent means.
     989  \item[remove (of known) element] \dots and insert through the head:  Removes use @remove(e)@, where @e@ is being removed.  List-independent knowledge establishes that @e@ is first or last, as appropriate.
     990\end{description}
     991
     992Comparing all-head with insert-element gives the relative performance of head-mediated \vs element-oriented insertion, because both use the same removal style.
     993Comparing all-head with remove-element gives the relative performance of head-mediated \vs element-oriented removal, because both use the same insertion style.
     994
     995\subsubsection{Sizing}
     996
     997
      998It is true, but perhaps not obvious, that building and destroying long lists is slower than building and destroying short lists.
      999Certainly, it takes longer to fuse and divide a hundred neighbours than five.
      1000But the key metric in this work, IR, is about a single link--unlink.
      1001So, critically, linking and unlinking a hundred neighbours actually takes \emph{more} than $20\times$ the time for five neighbours.
     1002The main reason is caching; when more neighbours are being manipulated, more memory is being read and written.
     1003
     1004But caching success is about more than the amount of memory worked on.
     1005Subtle changes in pattern become butterfly effects.
1006Aggressive ILP scheduling, which enables short IR times, is the amplifier.
1007A data dependency, present in one framework but not another, is on the critical path in one situation but not in another.
1008So, duration's response to size is not a steady worsening as size increases.
1009Rather, each size-independent configuration often responds to size increases with abrupt leaps of worsening.
1010Occasionally a leap is even followed by a size-run of retrograde response, where a suddenly incurred penalty has a chance to amortize away.
     1011The frameworks tend to leapfrog over each other, at different points, as size increases.
     1012
     1013The analysis treats these behaviours as incidental.
     1014It does not try to characterize various exact-size responses.
     1015Rather, size zones are picked, specific effects inside of a zone are averaged away, and the story at one zone is compared to that at another zone.
     1016
     1017To preview, \VRef[Section]{toc:coarse-compre} dismisses ``Large'' sizes (above 150 elements), where the performance story is dominated by the amount of memory touched, inherently, by the choice of intrusive list \vs wrapped, and where one intrusive framework is quite obviously as good as another.
1018At smaller sizes, comparing one intrusive framework to another makes sense; this comparison occurs in the remaining ``Result'' sections.
     1019Among the ``not Large'' sizes used there, two further zones, Small and Medium, are selected as representatives of what can vary when the scale is changed.
1020These particular ranges were chosen because each range tends to have one story repeated across its constituent sizes.
     1021If \CFA's duration increases across Small, then the other frameworks' usually do too.
1022If \CFA is beating \uCpp at the low end of Medium, then it usually is at the high end too.
     1023The leapfrogging tends to happen outside of these two ranges.
     1024
     1025A spot of poor performance appears in the general results for \CFA at size 1.
     1026Section \MLB{TODO:xref} explores the phenomenon and concludes that it is an anomaly due to a quirky interaction with the testing rig.
1027To do so, it considers size as two dimensions: length and width.
     1028Length is the number of elements in a list.
     1029Width is a number of these lists being kept, worked upon in round-robin order.
     1030Outside of \MLB{TODO:xref}, size always means length, and width is 1.
     1031
     1032
     1033
     1034\subsubsection{Execution Environment}
     1035\label{s:ExperimentalEnvironment}
     1036
     1037The performance experiments are run on:
     1038\begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
     1039%\item[PC]
     1040%with a 64-bit eight-core AMD FX-8370E, with ``taskset'' pinning to core \#6.  The machine has 16 GB of RAM and 8 MB of last-level cache.
     1041%\item[ARM]
     1042%Gigabyte E252-P31 128-core socket 3.0 GHz, WO memory model
     1043\item[AMD]
     1044Supermicro AS--1125HS--TNR EPYC 9754 128--core socket, hyper-threading $\times$ 2 sockets (512 processing units) 2.25 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 1024KB L2, 16MB L3, where each L3 cache covers 1 NUMA node and 8 cores (16 processors).
     1045\item[Intel]
1046Supermicro SYS-121H-TNR Xeon Gold 6530 32--core, hyper-threading $\times$ 2 sockets (128 processing units) 2.1 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 2048KB L2, 160MB L3, where each L3 cache covers 2 NUMA nodes and 32 cores (64 processors).
     1047\end{description}
1048The experiments are single threaded and pinned to a single core to prevent any OS movement, which might cause cache or NUMA effects that perturb the experiment.
     1049
     1050The compiler is gcc/g++-14.2.0 running on the Linux v6.8.0-52-generic OS.
     1051Switching between the default memory allocators @glibc@ and @llheap@ is done with @LD_PRELOAD@.
     1052To prevent eliding certain code patterns, crucial parts of a test are wrapped by the function @pass@
     1053\begin{cfa}
     1054// prevent eliding, cheaper than volatile
     1055static inline void * pass( void * v ) {  __asm__  __volatile__( "" : "+r"(v) );  return v;  }
     1056...
     1057pass( &remove_first( lst ) );                   // wrap call to prevent elision, insert cannot be elided now
     1058\end{cfa}
     1059The call to @pass@ can prevent a small number of compiler optimizations but this cost is the same for all lists.
     1060
     1061
     1062The main difference in the machines is their cache structure.
     1063The AMD has smaller caches that are shared less, while the Intel shares larger caches among more processors.
     1064This difference, while an interesting tradeoff for highly concurrent use, is rather one-sided for sequential use, such as this experiment's.
     1065The Intel offers a single processor a bigger cache.
     1066
     1067\subsubsection{Recap and Master Legend}
     1068
1069There are 12 use cases, which are all combinations of 2 movements, 2 polarities, and 3 accessors.
1070There are 4 physical contexts, which are all combinations of 2 machines and 2 size (length) zones (and 1 width, of value 1).
     1071Each physical context samples 4 specific sizes.
     1072
1073There are 3.25 frameworks.
1074This fractional count reflects that LQ-list supports only the movement--polarity combination ``stack, insert-first.''
1075So LQ-list fills a quarter of the otherwise-orthogonal space.
     1076
     1077Use case, physical context and framework are the explanatory factors.
     1078Taking all combinations of the explanatory factors gives 624 individual configurations.
     1079
1080Though there are multiple experimental trials of each configuration (to assure repeatability), the usual measure is the mean IR duration among the trials, considered for each of the 624 individual configurations.
     1081
     1082All means reported in this analysis are geometric.
     1083
     1084\MLB{TODO: add example plots; explain histogram of 624}
     1085
     1086
     1087\subsection{Result: Coarse comparison of styles}
     1088
     1089This comparison establishes how an intrusive list performs compared with a wrapped-reference list.
     1090\VRef[Figure]{fig:plot-list-zoomout} presents throughput at various list lengths for a linear and random (shuffled) insert/remove test.
1091Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing different ends of the access spectrum.
     1092In the graphs, all four intrusive lists (lq-list, lq-tailq, upp-upp, cfa-cfa, see end of \VRef{s:Contenders}) are plotted with the same symbol;
1093sometimes these symbols clump on top of each other, showing the performance difference among intrusive lists is small in comparison to the wrapped list (std::list).
     1094See~\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists.
     1095
     1096The list lengths start at 10 due to the short insert/remove times of 2--4 ns, for intrusive lists, \vs STL's wrapped-reference list of 15--20 ns.
1097For very short lists, like 4, the experiment time of 4 $\times$ 2.5 ns and the experiment overhead (loops) of 2--4 ns result in an artificial administrative bump at the start of the graph that has nothing to do with the insert/remove times.
     1098As the list size grows, the administrative overhead for intrusive lists quickly disappears.
     1099
     1100\begin{figure}
     1101  \centering
     1102  \setlength{\tabcolsep}{0pt}
     1103  \begin{tabular}{p{0.75in}p{2.75in}p{3in}}
     1104  &
     1105  \subfloat[Linear List Nodes, AMD]{\label{f:Linear-swift}
     1106        \hspace*{-0.75in}
     1107        \includegraphics{plot-list-zoomout-noshuf-swift.pdf}
     1108  } % subfigure
     1109  &
     1110  \subfloat[Linear List Nodes, Intel]{\label{f:Linear-java}
     1111        \includegraphics{plot-list-zoomout-noshuf-java.pdf}
     1112  } % subfigure
     1113  \\
     1114  &
     1115  \subfloat[Random List Nodes, AMD]{\label{f:Random-swift}
     1116        \hspace*{-0.75in}
     1117        \includegraphics{plot-list-zoomout-shuf-swift.pdf}
     1118  } % subfigure
     1119  &
     1120  \subfloat[Random List Nodes, Intel]{\label{f:Random-java}
     1121        \includegraphics{plot-list-zoomout-shuf-java.pdf}
     1122  } % subfigure
     1123  \end{tabular}
     1124  \caption{Insert/remove duration \vs list length.
1125  Lengths go as large as possible without error.
     1126  One example use case is shown: stack movement, insert-first polarity and head-mediated access. Lower is better.}
     1127  \label{fig:plot-list-zoomout}
     1128\end{figure}
     1129
     1130The key performance factor between the intrusive and the wrapped-reference lists is the dynamic allocation for the wrapped nodes.
     1131Hence, this experiment is largely measuring the cost of @malloc@/\-@free@ rather than insert/remove, and is sensitive to the layout of memory by the allocator.
     1132For insert/remove of an intrusive list, the cost is manipulating the link fields, which is seen by the relatively similar results for the different intrusive lists.
1133For insert/remove of a wrapped-reference list, the costs are: dynamically allocating/deallocating a wrapped node, copying an external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
     1134the allocation dominates these costs.
1135For example, the experiment was run with both glibc and llheap memory allocators, where llheap reduced the cost from 20 to 16 ns, still far from the 2--4 ns for linking an intrusive node.
     1136Hence, there is no way to tease apart the allocation, copying, and linking costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage that storage.
     1137
     1138In detail, \VRef[Figure]{f:Linear-swift}--\subref*{f:Linear-java} shows linear insertion of all the nodes and then linear removal, both in the same direction.
     1139For intrusive lists, the nodes are adjacent in memory from being preallocated in an array.
     1140For wrapped lists, the wrapped nodes happen to be adjacent because the memory allocator uses bump allocation during the initial phase of allocation.
     1141As a result, these memory layouts result in high spatial and temporal locality for both kinds of lists during the linear array traversal.
     1142With address look-ahead, the hardware does an excellent job of managing the multi-level cache.
     1143Hence, performance is largely constant for both kinds of lists, until L3 cache and NUMA boundaries are crossed for longer lists and the costs increase consistently for both kinds of lists.
     1144For example, on AMD (\VRef[Figure]{f:Linear-swift}), there is one NUMA node but many small L3 caches, so performance slows down quickly as multiple L3 caches come into play, and remains constant at that level, except for some anomalies for very large lists.
1145On Intel (\VRef[Figure]{f:Linear-java}), there are four NUMA nodes and four slowdown steps as list-length increases.
     1146At each step, the difference between the kinds of lists decreases as the NUMA effect increases.
     1147
     1148In detail, \VRef[Figure]{f:Random-swift}--\subref*{f:Random-java} shows random insertion and removal of the nodes.
     1149As for linear, there is the issue of memory allocation for the wrapped list.
     1150As well, the consecutive storage-layout is the same (array and bump allocation).
1151Hence, the difference is the random linking among nodes, which produces random accesses even though the list is traversed linearly, resulting in similar cache events for both kinds of lists.
     1152Both \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} show the slowdown of random access as the list-length grows resulting from stepping out of caches into main memory and crossing NUMA nodes.
     1153% Insert and remove operations act on both sides of a link.
     1154%Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location.
     1155As for linear, the Intel (\VRef[Figure]{f:Random-java}) graph shows steps from the four NUMA nodes.
     1156Interestingly, after $10^6$ nodes, intrusive lists are slower than wrapped.
     1157I did not have time to track down this anomaly, but I speculate it results from the difference in touching the data in the accessed node, as the data and links are together for intrusive and separated for wrapped.
1158For the llheap memory-allocator and the two tested architectures, intrusive lists outperform wrapped lists up to size $10^3$ for both linear and random, and performance begins to converge around $10^6$ nodes as architectural issues begin to dominate.
1159Clearly, the memory allocator and hardware architecture play a large role in the total cost and the crossover points as list-size increases.
     1160% In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a non-intrusive list may be preferred for lists of approximately the cache's size.
     1161
1162The takeaway from this experiment is that wrapped-list operations are expensive because memory allocation is expensive at this fine-grained level of execution.
     1163Hence, when possible, using intrusive links can produce a significant performance gain, even if nodes must be dynamically allocated, because the wrapping allocations are eliminated.
     1164Even when space is a consideration, intrusive links may not use more storage if a node is often linked.
     1165Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures or their tool-set does not provide them.
     1166
     1167% Note, linear access may not be realistic unless dynamic size changes may occur;
     1168% if the nodes are known to be adjacent, use an array.
     1169
     1170% In a wrapped-reference list, list nodes are allocated separately from the items put into the list.
     1171% Intrusive beats wrapped at the smaller lengths, and when shuffling is avoided, because intrusive avoids dynamic memory allocation for list nodes.
     1172
     1173% STL's performance is not affected by element order in memory.
     1174%The field of intrusive lists begins with length-1 operations costing around 10 ns and enjoys a ``sweet spot'' in lengths 10--100 of 5--7-ns operations.
     1175% This much is also unaffected by element order.
     1176% Beyond this point, shuffled-element list performance worsens drastically, losing to STL beyond about half a million elements, and never particularly leveling off.
     1177% In the same range, an unshuffled list sees some degradation, but holds onto a 1--2 $\times$ speedup over STL.
     1178
     1179% The apparent intrusive ``sweet spot,'' particularly its better-than-length-1 speed, is not because of list operations truly running faster.
     1180% Rather, the worsening as length decreases reflects the per-operation share of harness overheads incurred at the outer-loop level.
     1181% Disabling the harness's ability to drive interleaving, even though the current scenario is using a ``never work in middle'' interleave, made this rise disappear.
     1182% Subsequent analyses use length-controlled relative performance when comparing intrusive implementations, making this curiosity disappear.
     1183
     1184% The remaining big-swing comparison points say more about a computer's memory hierarchy than about linked lists.
     1185% The tests in this chapter are only inserting and removing.
     1186% They are not operating on any user payload data that is being listed.
     1187% The drastic differences at large list lengths reflect differences in link-field storage density and in correlation of link-field order to element order.
     1188% These differences are inherent to the two list models.
     1189
     1190% A wrapped-reference list's separate nodes are allocated right beside each other in this experiment, because no other memory allocation action is happening.
     1191% As a result, the interlinked nodes of the STL list are generally referencing their immediate neighbours.
     1192% This pattern occurs regardless of user-item shuffling because this test's ``use'' of the user-items' array is limited to storing element addresses. 
     1193% This experiment, driving an STL list, is simply not touching the memory that holds the user data.
     1194% Because the interlinked nodes, being the only touched memory, are generally adjacent, this case too has high memory locality and stays fast.
     1195
     1196% But the comparison of unshuffled intrusive with wrapped-reference gives the performance of these two styles, with their the common impediment of overfilling the cache removed.
     1197% Intrusive consistently beats wrapped-reference by about 20 ns, at all sizes. 
     1198% This difference is appreciable below list length 0.5 M, and enormous below 10 K.
     1199
     1200
     1201\subsection{Result: Intrusive Winners and Losers}
     1202\label{s:ComparingIntrusiveImplementations}
     1203
1204The preceding result shows the intrusive frameworks have better performance than the wrapped lists for small- to medium-sized lists.
1205This analysis covers the experiment-design dimensions of \VRef{s:AddRemovePerformance}: movement, polarity, and accessor.
     1206\VRef[Figure]{f:ExperimentOperations} shows the experiment use cases tested, which results in 12 experiments (I--XII) for comparing intrusive frameworks.
1207To preclude hardware interference, only list sizes below 150 are examined to differentiate among the intrusive frameworks.
     1208The data is selected from the start of \VRef[Figures]{f:Linear-swift}--\subref*{f:Linear-java}, but the start of \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} is largely the same.
     1209
    10451210
    10461211\begin{figure}
    10471212  \centering
    10481213  \includegraphics{plot-list-1ord.pdf}
    1049   \caption{Histogram of operation durations, decomposed by all first-order effects.
    1050   Each of the three breakdowns divides the entire population of test results into its mutually disjoint constituents. Higher in column is better}
     1214  \caption{Histogram of IR durations, decomposed by all first-order effects.
     1215  Each of the three breakdowns divides the entire population of test results into its mutually disjoint constituents. The measure is duration; lower is better.}
    10511216  \label{fig:plot-list-1ord}
    10521217\end{figure}
     
    10591224The size effect is more pronounced on the AMD with its smaller L3 cache than it is on the Intel.
    10601225(No NUMA effects for these list sizes.)
    1061 Specifically, a 20\% standard deviation exists here, between the means four physical-effect categories.
     1226Specifically, a 20\% standard deviation exists here, between the means of the four physical-effect categories.
    10621227The key takeaway for this comparison is the context it establishes for interpreting the following framework comparisons.
    1063 Both the particulars of a the machine's cache design, and a list length's effect on the program's cache friendliness, affect insert/remove speed in the manner illlustrated in this breakdown.
1228Both the particulars of a machine's cache design, and a list length's effect on the program's cache friendliness, affect insert/remove speed in the manner illustrated in this breakdown.
    10641229That is, if you are running on an unknown machine, at a scale above anomaly-prone individuals, and below where major LLC caching effects take over the general intrusive-list advantage, but with an unknown relationship to the sizing of your fickle low-level caches, you are likely to experience an unpredictable speed impact on the order of 20\%.
    10651230
    1066 A similar situation comes from \VRef[Figure]{fig:plot-list-1ord}'s second comparison, by operation type.
     1231A similar situation comes from \VRef[Figure]{fig:plot-list-1ord}'s second comparison, by use case.
    10671232Specific interactions do occur, like framework X doing better on stacks than on queues; a selection of these is addressed in \VRef[Figure]{fig:plot-list-2ord} and discussed shortly.
    1068 But they are so irrelevant to the issue of picking a winning framework that it is sufficient here to number the operations opaquely.
    1069 Whether a given list implementation is suitable for a language's general library succeeds or fails without knowledge of whether your use will have stack or queue movement.
    1070 So you face another lottery, with a likely win-loss range of the standard deviation of the individual operations' means: 9\%.
     1233But they are so irrelevant to the issue of picking a winning framework that it is sufficient here to number the use cases opaquely.
1234Whether a given list framework is suitable for a language's general library is decided without knowledge of whether your use will have stack or queue movement.
     1235So you face another lottery, with a likely win-loss range of the standard deviation of the individual use cases' means: 9\%.
    10711236
    10721237This context helps interpret \VRef[Figure]{fig:plot-list-1ord}'s final comparison, by framework.
     
    10761241
    10771242Now, the LQs do indeed beat the UW languages by 15\%, a fact explored further in \MLB{TODO: xref}.
    1078 But so too does operation VIII typically beat operation IV by 38\%.
     1243But so too does use case VIII typically beat use case IV by 38\%.
    10791244As does a small size on the Intel typically beat a medium size on the AMD by 66\%.
    10801245Framework choice is simply not where you stand to win or lose the most.
    10811246
     1247
     1248\subsection{Intrusive Sweet and Sore Spots}
     1249
    10821250\begin{figure}
    10831251        \centering
    10841252        \includegraphics{plot-list-2ord.pdf}
    1085         \caption{Histogram of operation durations, illustrating interactions with framework.
     1253        \caption{Histogram of IR durations, illustrating interactions with framework.
    10861254        Each distribution shows how its framework reacts to a single other factor being varied across one pair of options.
    10871255        Every (binned and mean-contributing) individual data point represents a pair of test setups, one with the criterion set to the option labelled at the top; the other setup uses the bottom option.
     
    10941262\VRef[Figure]{fig:plot-list-1ord} stays razor-focused on only first-order effects in order to contextualize a winner/loser framework observation.
    10951263But this perspective cannot address questions like, ``Where are \CFA's sore spots?''
    1096 Moreover, the shallow threatment of operations by ordinals said nothing about how stack usage compares with queues'.
1264Moreover, the shallow treatment of use cases by ordinals said nothing about how stack usage compares with queues'.
    10971265
    10981266\VRef[Figure]{fig:plot-list-2ord} provides such answers.
     
    11171285The strongest effect is \CFA's aversion to removal by element---certainly an opportunity for improvement.
    11181286
     1287
     1288\subsection{\CFA Tiny-Size Anomaly}
     1289
     1290The \CFA list occasionally showed a concerning slowdown at length 1.
1291The issue, seen in the top-left corner of \VRef[Figure]{fig:plot-list-short}, has \CFA taking above 10 ns per IR.
     1292It occurs only for the queue movement, only on the AMD machine, and only for the \CFA framework.
     1293
1294Length-1 performance is an important case.
1295Lists like those of waiting threads are frequently left empty, with the occasional thread (or few) momentarily joining.
1296These scenarios need to perform well.
     1297
     1298A cause of this behaviour was never determined.
     1299Speculation is that \CFA's increased data dependency, a result of the tagging scheme, pairs poorly with the situation implied by queue movement.
     1300The aliasing, at length 1, is: the head's first element is the head's last element.
     1301With stack movement, one of these aliases is used twice, while with queue movement, both are used in alternation.
     1302
1303The breakdowns earlier in the performance assessment work by varying length only.
     1304That is, they see the story down the leftmost column in a triangle.
     1305The insight for contextualizing this issue was to inspect both length and width.
     1306
1307The issue is practically mitigated by noticing that the difficulty fades away as width increases.
     1308This effect is seen both in \VRef[Figure]{fig:plot-list-short}'s easement across the top triangle rows, and, zoomed farther out, in \VRef[Figure]{fig:plot-list-wide}.
     1309
     1310Increasing the width matters to the aliasing hypothesis.
     1311In a narrow experiment, one element's insert and remove happen in rapid succession.
1312So, the two aliases are exercised closer together, making a data hazard (that lacks ideal hardware treatment) stretch the instruction-pipeline schedule more significantly.
     1313Increasing the width adds harness-induced gaps between the uses of each alias, behind which a potential hazard can hide.
     1314
     1315In the practical scenario that judges length-1 performance as relevant, width 1 is contrived.
1316A thread putting itself on an often-empty waiters' list is not doing so on one such list repeatedly, at least not without taking other situation-induced pauses.
     1317
1318Thus, the congestion at low width and length comes from the harness using repetition (in order to obtain a measurable time).
     1319It does not reflect the situation that motivates the legitimate desire for good length-1 performance.
     1320
1321There likely is a real hazard, unique to the \CFA framework, when a queue movement is repeated on a tiny list \emph{without other intervening action}.
     1322Doing so is believed to occur only in contrived situations.
     1323
     1324
     1325\begin{figure}
     1326\centering
     1327  \includegraphics[trim={00in, 5.5in, 0in, 0in}, clip, scale=0.8]{plot-list-short-temp.pdf}
     1328  \caption{Behaviour at very short lengths.}
     1329  \label{fig:plot-list-short}
     1330\end{figure}
     1331
     1332\begin{figure}
     1333\centering
     1334  \includegraphics[trim={0.25in, 1in, 0.25in, 1in}, clip, scale=0.5]{plot-list-wide-temp.pdf}
     1335  \caption{Length-1 anomaly resolving at modest width.  Points are for varying widths, at fixed length 1.}
     1336  \label{fig:plot-list-wide}
     1337\end{figure}
     1338
    11191339\begin{comment}
    11201340        These remarks are mostly about 3ord over 2ord.
     
    11231343They illustrate the difficult signal-to-noise ratio that I had to overcome in preparing this data.
    11241344They may serve as a reference guiding future \CFA linked-list work by informing on where to target improvements.
    1125 Finally, the findings offer the conclusion that \CFA's list offers more consostent performance across usage scenarios, than the other lists.
     1345Finally, the findings offer the conclusion that \CFA's list offers more consistent performance across usage scenarios, than the other lists.
    11261346\end{comment}
    11271347
     
    11361356\end{comment}
    11371357
    1138 
    1139 
    1140 \MLB{ TODO: find a home for these original conclusions:
    1141 cfa-upp similarity holde for all halves by movement or polarity;
    1142 splitting on accessor, \CFA has a poor result on element removal, LQ-list has a great result on the other accessors, and uC++ is unaffected. }
    1143 
     1358\begin{comment}
    11441359
    11451360\begin{figure}
     
    11751390The error bars show fastest and slowest time seen on five trials, and the central point is the mean of the remaining three trials.
    11761391For readability, the points are slightly staggered at a given horizontal value, where the points might otherwise appear on top of each other.
    1177 The experiment runs twelve operating scenarios;
     1392The experiment runs twelve use cases;
    11781393the ones chosen for their variety are scenarios I and VIII from the listing of \VRef[Figure]{fig:plot-list-mchn-szz}, and their results appear in the rows.
    11791394As in the previous experiment, each hardware architecture appears in a column.
     
12181433With this adjustment, absolute duration values (in nanoseconds) are lost.
    12191434In return, the physical quadrants are re-combined, enabling assessment of the non-physical factors.
     1435\end{comment}
    12201436
    12211437\begin{comment}
  • doc/theses/mike_brooks_MMath/plots/ListCommon.py

    r99bc47b r4cf8832  
    6868
    6969explanations = ['movement', 'polarity', 'accessor',
    70                 'NumNodes',
     70                'NumNodes', 'Width', 'Length',
    7171                'SizeZone', # note fd: NumNodes -> SizeZone
    7272                'fx',
     
    192192
    193193def getSingleResults(
    194         dsname = 'general',
     194        dsnames = ['general'],
    195195        machines = allMachines,
    196196        *,
     
    202202
    203203    timings = pd.concat([
    204         getMachineDataset( dsname, m )
     204        getMachineDataset( d, m )
     205        for d in dsnames
    205206        for m in machines ])
    206207   
     
    287288
    288289def printManySummary(*,
    289         dsname = 'general',
     290        dsnames = ['general'],
    290291        machines = allMachines,
    291292        metafileCore,
     
    302303
    303304    for op in metadata.itertuples():
    304         timings = getSingleResults(dsname, machines,
     305        timings = getSingleResults(dsnames, machines,
    305306            fxs=fxs,
    306307            tgtMovement = op.movement,
     
    324325
    325326def printSingleDetail(
    326         dsname = 'general',
     327        dsnames = ['general'],
    327328        machines = allMachines,
    328329        *,
     
    336337
    337338
    338     timings = getSingleResults(dsname, machines,
     339    timings = getSingleResults(dsnames, machines,
    339340        fxs = fxs,
    340341        tgtMovement = tgtMovement,
  • doc/theses/mike_brooks_MMath/plots/list-1ord.py

    r99bc47b r4cf8832  
    1313ops = ['movement', 'polarity', 'accessor']
    1414fx = ['fx']
    15 bkgnd = ['NumNodes']  # never drilled/marginalized, always conditioned
     15bkgnd = ['NumNodes']            # never drilled/marginalized, always conditioned
     16ignore = [ 'InterleaveFrac',    # unused ever and always zero
     17           'Width',             # unused here and always one
     18           'Length' ]           # unused here and always =NumNodes
    1619
     20# assure every explanation is classified
    1721assert( set( explanations )
    18         - set( ['InterleaveFrac'] ) # unused and always zero
     22        - set( ignore )
    1923        ==
    2024        set(physicals) | set(ops) | set(fx) | set(bkgnd) )
  • doc/theses/mike_brooks_MMath/plots/list-zoomout-noshuf-java.py

    r99bc47b r4cf8832  
    88
    99printSingleDetail(
    10     dsname='zoomout-noshuf',
     10    dsnames=['zoomout-noshuf'],
    1111    machines=['java'],
    1212    tgtMovement = 'stack',
  • doc/theses/mike_brooks_MMath/plots/list-zoomout-noshuf-swift.py

    r99bc47b r4cf8832  
    88
    99printSingleDetail(
    10     dsname='zoomout-noshuf',
     10    dsnames=['zoomout-noshuf'],
    1111    machines=['swift'],
    1212    tgtMovement = 'stack',
  • doc/theses/mike_brooks_MMath/plots/list-zoomout-shuf-java.py

    r99bc47b r4cf8832  
    88
    99printSingleDetail(
    10     dsname='zoomout-shuf',
     10    dsnames=['zoomout-shuf'],
    1111    machines=['java'],
    1212    tgtMovement = 'stack',
  • doc/theses/mike_brooks_MMath/plots/list-zoomout-shuf-swift.py

    r99bc47b r4cf8832  
    88
    99printSingleDetail(
    10     dsname='zoomout-shuf',
     10    dsnames=['zoomout-shuf'],
    1111    machines=['swift'],
    1212    tgtMovement = 'stack',