Changeset 4cf8832
- Timestamp:
- Apr 17, 2026, 9:46:46 AM (33 hours ago)
- Branches:
- master
- Parents:
- 99bc47b
- Location:
- doc/theses/mike_brooks_MMath
- Files:
-
- 2 added
- 7 edited
-
list.tex (modified) (14 diffs)
-
pictures/plot-list-short-temp.pdf (added)
-
pictures/plot-list-wide-temp.pdf (added)
-
plots/ListCommon.py (modified) (7 diffs)
-
plots/list-1ord.py (modified) (1 diff)
-
plots/list-zoomout-noshuf-java.py (modified) (1 diff)
-
plots/list-zoomout-noshuf-swift.py (modified) (1 diff)
-
plots/list-zoomout-shuf-java.py (modified) (1 diff)
-
plots/list-zoomout-shuf-swift.py (modified) (1 diff)
-
doc/theses/mike_brooks_MMath/list.tex
r99bc47b r4cf8832 653 653 The goal is to show the \CFA lists are competitive with other designs, but the different list designs may not have equivalent functionality, so it is impossible to select a winner encompassing both functionality and execution performance. 654 654 655 656 \subsection{Add-Remove Performance} 655 \subsection{Experiment Design} 656 657 \begin{figure} 658 \noindent 659 \begin{tabular}{p{1.75in}@{\ }p{4.5in}} 660 Insert-Remove (IR) 661 & The atomic unit of work being measured: one insertion plus one removal (plus all looping/tracking overheads) \\ 662 Use Case 663 & Pattern of add-remove calls. \\ 664 -- Movement & \\ 665 \quad $\ni$ stack 666 & Inserts and removes happen at the same end. \\ 667 \quad $\ni$ queue 668 & Inserts and removes happen at opposite ends. \\ 669 -- Polarity 670 & Which of the two orientations the movement happens in. \\ 671 \quad $\ni$ insert-first 672 & All inserts at front; stack removes at front; queue removes at back. \\ 673 \quad $\ni$ insert-last 674 & All inserts at back; stack removes at back; queue removes at front. \\ 675 -- Accessor 676 & How an insertion position, or removal element, is specified. The same position/element is picked either way. \\ 677 \quad $\ni$ all head 678 & inserts and removes both through the head \\ 679 \quad $\ni$ insert element 680 & insert by element and remove through the head \\ 681 \quad $\ni$ remove element 682 & insert through head and remove by element \\ 683 Physical Context & \\ 684 -- Size (number) & Number of nodes being linked. Unless specified, equals the \emph{length} of the program's sole list. \emph{Width}, rarely used, is the number of lists. \\ 685 -- Size Zone 686 & Contiguous range of sizes, chosen to avoid known anomalies and to sample a brief plateau. Each zone buckets four specific sizes. 
\\ 687 \quad $\ni$ small 688 & lists of 4--16 elements \\ 689 \quad $\ni$ medium 690 & lists of 50--200 elements \\ 691 \quad $\ni$ (other) 692 & Not used for comparing intrusive frameworks. \\ 693 -- machine 694 & Computer running the experiment \\ 695 \quad $\ni$ AMD 696 & smaller cache \\ 697 \quad $\ni$ Intel 698 & bigger cache \\ 699 Framework & A particular linked-list implementation (within its host language) \\ 700 $\ni$ \CC & The @std::list@ type of g++. \\ 701 $\ni$ lq-list & The @list@ type of LQ from glibc of gcc. \\ 702 $\ni$ lq-tailq & The @tailq@ type of the same. \\ 703 $\ni$ \uCpp & \uCpp's @uSequence@ \\ 704 $\ni$ \CFA & \CFA's @dlist@ \\ 705 Explanation being 706 & How independent explanatory variable X is analyzed \\ 707 -- Marginalized 708 & Left alone, allowed to vary, yielding a more absolute measure. Shows the effect that X causes. If all explanations are marginalized, then absolute times are available and a relative time has a peer group that is the entire population. \\ 709 -- Conditioned 710 & Held constant, yielding a more relative measure. Hides the effect that X causes. Conditioning on X creates more, smaller relative-measure peer groups, by isolating each X-domain value. Resulting interpretation is, ``Assuming no change in X.'' \\ 711 \end{tabular} 712 \caption{ 713 Glossary of terms used in the list performance evaluation. 714 } 715 \label{f:ListPerfGlossary} 716 \end{figure} 717 718 719 This section explains how the experiment is built. 720 Many of the following parts define terminology concerning tuning knobs. 721 \VRef[Figure]{f:ListPerfGlossary} provides a consolidated reference. 722 723 724 \subsubsection{Add-Remove Performance} 657 725 \label{s:AddRemovePerformance} 658 726 … … 661 729 Thus, adding and removing an element are the sole primitive actions. 
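To make the primitive actions concrete, the following plain-C sketch shows a minimal intrusive doubly-linked list whose insert/remove are only a handful of pointer assignments. This is an illustrative sketch with hypothetical names (@node@, @insert_first@, @remove_first@), not the \CFA @dlist@ nor any of the measured frameworks.

```c
#include <stddef.h>

// Hypothetical intrusive node: the link fields live inside the element,
// so inserting requires no separate allocation.
struct node {
	struct node * next;
	struct node * prev;
	int payload;                         // user data shares storage with the links
};

struct list {
	struct node * first;
	struct node * last;
};

// Push a node onto the front: a few pointer writes, no allocation.
static void insert_first( struct list * l, struct node * n ) {
	n->prev = NULL;
	n->next = l->first;
	if ( l->first ) l->first->prev = n;
	else l->last = n;                    // list was empty
	l->first = n;
}

// Pop the front node; caller must ensure the list is non-empty.
static struct node * remove_first( struct list * l ) {
	struct node * n = l->first;
	l->first = n->next;
	if ( l->first ) l->first->prev = NULL;
	else l->last = NULL;                 // list became empty
	n->next = n->prev = NULL;
	return n;
}
```

Each call compiles to roughly a dozen instructions, which is why repeated calls must be timed in aggregate.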
662 730 663 Repeated adding and removing is necessary to measure timing because these operations can be as simple as a dozen instructions.731 These instruction sequences may have cases that proceed (in a modern, deep pipeline) without a stall. 664 732 666 734 This experiment takes the position that: 667 735 \begin{itemize}[leftmargin=*] 668 \item The total time to add and remove is relevant, as opposed to having one individual time for adding and a separate time for removing.736 Adds without removes quickly fill memory; 670 removes without adds is meaningless. 671 \item A relevant breakdown ``by operation'' is: 672 \begin{description} 673 \item[movement] 674 Is the add/remove applied to a stack, queue, or something else? 675 In these experiments, strict stack and queue shapes are tested (two movements). 676 \item[polarity] 677 In which direction does the action apply? 678 For a queue, do items flow from first to last or last to first? 679 For a stack, is the first or last end used for adding and removing? 680 In these experiments, both polarities are considered, labelled insert-first and insert-last (two polarities). 681 \item[accessor] 682 Is an add/remove location given by a list head's first/last (@insertFirst@, @removeLast@), or by a reference to an individual element (@insertAfter@, @remove@ of element)? 
683 In these experiments, the (three) accessors are: 684 \begin{itemize} 685 \item 686 inserts and removes both through the head ("all-head") 687 \item 688 insert by element and remove through the head ("insert-element") 689 \item 690 insert through head and remove by element ("remove-element") 691 \end{itemize} 692 \end{description} 693 \item So, an "operating scenario" is a specific selection of movement, polarity and accessor. These experiments run twelve operating scenarios. 738 removing without adding is impossible. 739 \item A relevant breakdown ``by operation'' is, rather, the usage pattern of the add/remove calls. 740 An example pattern choice is adding and removing at the same end, making a stack, or opposite ends, for a queue. 741 Another is pushing on the front by calling @insert_first(lst, e)@ \vs @insert(e, old_first_elm)@; this aspect provides the test's API coverage. 742 \VRef[Section]{s:UseCases} gives the full breakdown. 694 743 \item Speed differences caused by the host machine's memory hierarchy need to be identified and explained, 695 but do not represent advantages of one linked-list implementation over another.744 but do not represent advantages of one framework over another. 696 745 \end{itemize} 697 746 698 747 The experiment used to measure insert/remove cost measures the mean duration of a sequence of additions and removals. 699 Confidence bounds, on this mean, are discussed.700 748 The distribution of speeds experienced by an individual add-remove pair (tail latency) is not discussed. 701 749 Space efficiency is shown only indirectly, by way of caches' impact on speed. 
… … 706 754 \end{itemize} 707 755 708 In the result analysis, where list length is a performance-influencing factor, once ``large'' lengths are dismissed, these zones are identified as representing different patterns: 709 \begin{description} 710 \item[size zone ``small''] lists of 4--16 elements 711 \item[size zone ``medium''] lists of 50--200 elements 712 \end{description} 713 Each zone buckets four specific sizes at which trials are run. 714 715 716 \subsubsection{Experiment setup} 717 \label{s:ExperimentSetup} 756 In all cases, the quantity discussed is the duration of one insert-remove (IR). 757 An IR is the time taken to do one innermost insertion-loop iteration, one innermost removal-loop iteration, and its share of all overheads, amortized. 758 759 Lower IR duration is better. 760 This experiment typically does an IR in 1--10 ns. 761 The short end of this range has durations of single-digit clock-cycle counts. 762 Therefore, the situations that achieve the best times are saturating the instruction pipeline successfully. 763 764 Often, an IR duration value needs to be considered relatively. 765 For example, \VRef[Section]{toc:sweet-sore} asks whether one linked list implementation is more sensitive than another to changing which computer runs the test. 766 A finding might be that a machine change slows implementation A by 10\% and B by 20\%. 767 This finding is not saying that A is faster than B (on either machine). 768 The finding could stand if B started 10\% faster and the machine change levelled them off, if B started slower and got worse, or in myriad other cases. 769 The finding asserts that such distinctions are not what is immediately relevant. 770 The arithmetic that produces the 10\% and 20\% answers is removing the information about which one starts, or ends up, faster. 771 Each implementation's to-machine duration is stated relative to \emph{the same implementation's} from-machine duration. 772 The resulting measure is still about a duration. 
773 The framework with the lower from-machine-relative duration handles the change better. 774 775 776 \subsubsection{Test Program} 777 \label{s:TetProgram} 718 778 719 779 The experiment driver defines an (intrusive) node type: … … 750 810 The loop duration is divided by the counter and this throughput is reported. 751 811 In a scatter-plot, each dot is one throughput, which means insert + remove + harness overhead. 752 The harness overhead is constant when comparing linked-list implementations and keep as small as possible.812 The harness overhead is constant when comparing linked-list frameworks and is kept as small as possible. 753 813 % The remainder of the setup section discusses the choices that affected the harness overhead. 754 814 … … 822 882 % This harness avoids telling the hardware what the SUT is about to do. 823 883 824 The comparator linked-list implementations are: 825 \begin{description} 826 \item[std::list] The @list@ type of g++. 827 \item[lq-list] The @list@ type of LQ from glibc of gcc. 828 \item[lq-tailq] The @tailq@ type of the same. 829 \item[upp-upp] \uCpp's @uSequence@ 830 \item[cfa-cfa] \CFA's @dlist@ 831 \end{description} 832 833 834 \subsubsection{Execution Environment} 835 \label{s:ExperimentalEnvironment} 836 837 The performance experiments are run on: 838 \begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt] 839 %\item[PC] 840 %with a 64-bit eight-core AMD FX-8370E, with ``taskset'' pinning to core \#6. The machine has 16 GB of RAM and 8 MB of last-level cache. 841 %\item[ARM] 842 %Gigabyte E252-P31 128-core socket 3.0 GHz, WO memory model 843 \item[AMD] 844 Supermicro AS--1125HS--TNR EPYC 9754 128--core socket, hyper-threading $\times$ 2 sockets (512 processing units) 2.25 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 1024KB L2, 16MB L3, where each L3 cache covers 1 NUMA node and 8 cores (16 processors). 
845 \item[Intel] 846 Supermicro SYS-121H-TNR Xeon Gold 6530 32--core, hyper-threading $\times$ 2 sockets (128 processing units) 2.1 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 2048KB L2, 160MB L3, where each L3 cache covers 2 NUMA nodes and 32 cores (64 processors). 847 \end{description} 848 The experiments are single-threaded and pinned to a single core to prevent any OS movement, which might cause cache or NUMA effects perturbing the experiment. 849 850 The compiler is gcc/g++-14.2.0 running on the Linux v6.8.0-52-generic OS. 851 Switching between the default memory allocators @glibc@ and @llheap@ is done with @LD_PRELOAD@. 852 To prevent eliding certain code patterns, crucial parts of a test are wrapped by the function @pass@ 853 \begin{cfa} 854 // prevent eliding, cheaper than volatile 855 static inline void * pass( void * v ) { __asm__ __volatile__( "" : "+r"(v) ); return v; } 856 ... 857 pass( &remove_first( lst ) ); // wrap call to prevent elision, insert cannot be elided now 858 \end{cfa} 859 The call to @pass@ can prevent a small number of compiler optimizations but this cost is the same for all lists. 860 861 862 \subsection{Result: Coarse comparison of styles} 863 864 This comparison establishes how an intrusive list performs compared with a wrapped-reference list. 865 \VRef[Figure]{fig:plot-list-zoomout} presents throughput at various list lengths for a linear and random (shuffled) insert/remove test. 866 Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing different ends of the access spectrum. 867 In the graphs, all four intrusive lists (lq-list, lq-tailq, upp-upp, cfa-cfa, see end of \VRef{s:ExperimentSetup}) are plotted with the same symbol; 868 sometimes these symbols clump on top of each other, showing the performance difference among intrusive lists is small in comparison to the wrapped list (std::list). 
869 See~\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists. 870 871 The list lengths start at 10 due to the short insert/remove times of 2--4 ns, for intrusive lists, \vs STL's wrapped-reference list of 15--20 ns. 872 For very short lists, like 4, the experiment time of 4 $\times$ 2.5 ns and experiment overhead (loops) of 2--4 ns results in an artificial administrative bump at the start of the graph having nothing to do with the insert/remove times. 873 As the list size grows, the administrative overhead for intrusive lists quickly disappears. 874 875 \begin{figure} 876 \centering 877 \setlength{\tabcolsep}{0pt} 878 \begin{tabular}{p{0.75in}p{2.75in}p{3in}} 879 & 880 \subfloat[Linear List Nodes, AMD]{\label{f:Linear-swift} 881 \hspace*{-0.75in} 882 \includegraphics{plot-list-zoomout-noshuf-swift.pdf} 883 } % subfigure 884 & 885 \subfloat[Linear List Nodes, Intel]{\label{f:Linear-java} 886 \includegraphics{plot-list-zoomout-noshuf-java.pdf} 887 } % subfigure 888 \\ 889 & 890 \subfloat[Random List Nodes, AMD]{\label{f:Random-swift} 891 \hspace*{-0.75in} 892 \includegraphics{plot-list-zoomout-shuf-swift.pdf} 893 } % subfigure 894 & 895 \subfloat[Random List Nodes, Intel]{\label{f:Random-java} 896 \includegraphics{plot-list-zoomout-shuf-java.pdf} 897 } % subfigure 898 \end{tabular} 899 \caption{Insert/remove duration \vs list length. 900 Lengths go as large as possible without error. 901 One example operation is shown: stack movement, insert-first polarity and head-mediated access. Lower is better.} 902 \label{fig:plot-list-zoomout} 903 \end{figure} 904 905 The key performance factor between the intrusive and the wrapped-reference lists is the dynamic allocation for the wrapped nodes. 906 Hence, this experiment is largely measuring the cost of @malloc@/\-@free@ rather than insert/remove, and is sensitive to the layout of memory by the allocator. 
907 For insert/remove of an intrusive list, the cost is manipulating the link fields, which is seen by the relatively similar results for the different intrusive lists. 908 For insert/remove of a wrapped-reference list, the costs are: dynamically allocating/deallocating a wrapped node, copying an external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list; 909 the allocation dominates these costs. 910 For example, the experiment was run with both glibc and llheap memory allocators, where llheap reduced the cost from 20 to 16 ns, still far from the 2--4 ns for linking an intrusive node. 911 Hence, there is no way to tease apart the allocation, copying, and linking costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage that storage. 912 913 In detail, \VRef[Figure]{f:Linear-swift}--\subref*{f:Linear-java} shows linear insertion of all the nodes and then linear removal, both in the same direction. 914 For intrusive lists, the nodes are adjacent in memory from being preallocated in an array. 915 For wrapped lists, the wrapped nodes happen to be adjacent because the memory allocator uses bump allocation during the initial phase of allocation. 916 As a result, these memory layouts give high spatial and temporal locality for both kinds of lists during the linear array traversal. 917 With address look-ahead, the hardware does an excellent job of managing the multi-level cache. 918 Hence, performance is largely constant for both kinds of lists, until L3 cache and NUMA boundaries are crossed for longer lists and the costs increase consistently for both kinds of lists. 919 For example, on AMD (\VRef[Figure]{f:Linear-swift}), there is one NUMA node but many small L3 caches, so performance slows down quickly as multiple L3 caches come into play, and remains constant at that level, except for some anomalies for very large lists. 
920 On Intel (\VRef[Figure]{f:Linear-java}), there are four NUMA nodes and four slowdown steps as list-length increases. 921 At each step, the difference between the kinds of lists decreases as the NUMA effect increases. 922 923 In detail, \VRef[Figure]{f:Random-swift}--\subref*{f:Random-java} shows random insertion and removal of the nodes. 924 As for linear, there is the issue of memory allocation for the wrapped list. 925 As well, the consecutive storage-layout is the same (array and bump allocation). 926 Hence, the difference is the random linking among nodes, resulting in random accesses, even though the list is traversed linearly, and hence similar cache events for both kinds of lists. 927 Both \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} show the slowdown of random access as the list-length grows resulting from stepping out of caches into main memory and crossing NUMA nodes. 928 % Insert and remove operations act on both sides of a link. 929 %Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location. 930 As for linear, the Intel (\VRef[Figure]{f:Random-java}) graph shows steps from the four NUMA nodes. 931 Interestingly, after $10^6$ nodes, intrusive lists are slower than wrapped. 932 I did not have time to track down this anomaly, but I speculate it results from the difference in touching the data in the accessed node, as the data and links are together for intrusive and separated for wrapped. 933 For the llheap memory-allocator and the two tested architectures, intrusive lists outperform wrapped lists up to size $10^3$ for both linear and random, and performance begins to converge around $10^6$ nodes as architectural issues begin to dominate. 934 Clearly, the memory allocator and hardware architecture play a large role in the total cost and the crossover points as list-size increases. 
935 % In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a non-intrusive list may be preferred for lists of approximately the cache's size. 936 937 The takeaway from this experiment is that wrapped-list operations are expensive because memory allocation is expensive at this fine-grained level of execution. 938 Hence, when possible, using intrusive links can produce a significant performance gain, even if nodes must be dynamically allocated, because the wrapping allocations are eliminated. 939 Even when space is a consideration, intrusive links may not use more storage if a node is often linked. 940 Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures or their tool-set does not provide them. 941 942 % Note, linear access may not be realistic unless dynamic size changes may occur; 943 % if the nodes are known to be adjacent, use an array. 944 945 % In a wrapped-reference list, list nodes are allocated separately from the items put into the list. 946 % Intrusive beats wrapped at the smaller lengths, and when shuffling is avoided, because intrusive avoids dynamic memory allocation for list nodes. 947 948 % STL's performance is not affected by element order in memory. 949 %The field of intrusive lists begins with length-1 operations costing around 10 ns and enjoys a ``sweet spot'' in lengths 10--100 of 5--7-ns operations. 950 % This much is also unaffected by element order. 951 % Beyond this point, shuffled-element list performance worsens drastically, losing to STL beyond about half a million elements, and never particularly leveling off. 952 % In the same range, an unshuffled list sees some degradation, but holds onto a 1--2 $\times$ speedup over STL. 
953 954 % The apparent intrusive ``sweet spot,'' particularly its better-than-length-1 speed, is not because of list operations truly running faster. 955 % Rather, the worsening as length decreases reflects the per-operation share of harness overheads incurred at the outer-loop level. 956 % Disabling the harness's ability to drive interleaving, even though the current scenario is using a ``never work in middle'' interleave, made this rise disappear. 957 % Subsequent analyses use length-controlled relative performance when comparing intrusive implementations, making this curiosity disappear. 958 959 % The remaining big-swing comparison points say more about a computer's memory hierarchy than about linked lists. 960 % The tests in this chapter are only inserting and removing. 961 % They are not operating on any user payload data that is being listed. 962 % The drastic differences at large list lengths reflect differences in link-field storage density and in correlation of link-field order to element order. 963 % These differences are inherent to the two list models. 964 965 % A wrapped-reference list's separate nodes are allocated right beside each other in this experiment, because no other memory allocation action is happening. 966 % As a result, the interlinked nodes of the STL list are generally referencing their immediate neighbours. 967 % This pattern occurs regardless of user-item shuffling because this test's ``use'' of the user-items' array is limited to storing element addresses. 968 % This experiment, driving an STL list, is simply not touching the memory that holds the user data. 969 % Because the interlinked nodes, being the only touched memory, are generally adjacent, this case too has high memory locality and stays fast. 970 971 % But the comparison of unshuffled intrusive with wrapped-reference gives the performance of these two styles, with their the common impediment of overfilling the cache removed. 
972 % Intrusive consistently beats wrapped-reference by about 20 ns, at all sizes. 973 % This difference is appreciable below list length 0.5 M, and enormous below 10 K. 974 975 976 \subsection{Result: Comparing intrusive implementations} 977 \label{s:ComparingIntrusiveImplementations} 978 979 The preceding result shows the intrusive implementations have better performance than the wrapped lists for small- to medium-sized lists. 980 This analysis covers the experiment position taken in \VRef{s:AddRemovePerformance} for movement, polarity, and accessor. 981 \VRef[Figure]{f:ExperimentOperations} shows the experiment operations tested, which results in 12 experiments (I--XII) for comparing intrusive implementations. 982 To preclude hardware interference, only list sizes below 150 are examined to differentiate among the intrusive implementations. 983 The data is selected from the start of \VRef[Figures]{f:Linear-swift}--\subref*{f:Linear-java}, but the start of \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} is largely the same. 
984 985 \begin{figure} 884 885 \subsubsection{Use Cases} 886 \label{s:UseCases} 887 888 \begin{figure} 889 \begin{comment} 986 890 \centering 987 891 \setlength{\tabcolsep}{8pt} … … 1026 930 \small 1027 931 \begin{tabular}{@{}ll@{}} 1028 I: & stack, insert first, I-head / R-head \\1029 II: & stack, insert first, I-list / R-head\\1030 III:& stack, insert first, I-head / R-list \\1031 IV: & stack, insert last, I-head / R-head \\1032 V: & stack, insert last, I-list / R-head\\1033 VI: & stack, insert last, I-head / R-list \\1034 VII:& queue, insert first, I-head / R-head \\1035 VIII:& queue, insert first, I-list / R-head\\1036 IX: & queue, insert first, I-head / R-list \\1037 X: & queue, insert last, I-head / R-head \\1038 XI: & queue, insert last, i I-list / R-head\\1039 XII:& queue, insert last, I-head / R-list \\932 I: & stack, insert first, all head \\ 933 II: & stack, insert first, insert element \\ 934 III:& stack, insert first, remove element \\ 935 IV: & stack, insert last, all head \\ 936 V: & stack, insert last, insert element \\ 937 VI: & stack, insert last, remove element \\ 938 VII:& queue, insert first, all head \\ 939 VIII:& queue, insert first, insert element \\ 940 IX: & queue, insert first, remove element \\ 941 X: & queue, insert last, all head \\ 942 XI: & queue, insert last, insert element \\ 943 XII:& queue, insert last, remove element \\ 1040 944 \end{tabular} 1041 945 \end{tabular} 1042 \caption{Experiment Operations} 946 \end{comment} 947 \setlength{\tabcolsep}{5pt} 948 \small 949 \begin{tabular}{rcccccccccccc} 950 & I & II & III & IV & V & VI & VII & VIII & IX & X & XI & XII \\ 951 Movement & 952 stack & stack & stack & stack & stack & stack & 953 queue & queue & queue & queue & queue & queue \\ 954 Polarity & 955 i-first & i-first & i-first & i-last & i-last & i-last & 956 i-first & i-first & i-first & i-last & i-last & i-last \\ 957 Accessor & 958 all hd & ins-e & rem-e & all hd & ins-e & rem-e & 959 all hd & ins-e & rem-e & all hd & ins-e & 
rem-e 960 \end{tabular} 961 \caption{Experiment use cases, numbered.} 1043 962 \label{f:ExperimentOperations} 1044 963 \end{figure} 964 965 Where \VRef[Figure]{f:ListPerfGlossary} enumerates the specific values, recall the use-case dimensions are: 966 \begin{description} 967 \item[movement ($\times 2$)] 968 In these experiments, strict stack and queue patterns are tested. 969 \item[polarity ($\times 2$)] 970 Obtain one polarity from the other by reversing uses of first/last. 971 \item[accessor ($\times 3$)] 972 An add/remove location is given by a list head's first/last, \vs by a preexisting reference to an individual element. 973 \end{description} 974 975 A use case is a specific selection of movement, polarity and accessor. 976 These experiments run twelve use cases. 977 When a comparison is showing only what can happen when switching among use cases (as opposed to \eg how stacks are different from queues), the numbering scheme of \VRef[Figure]{f:ExperimentOperations} is used. 978 979 With accessor, when an action names its insertion position or removal element, the harness either 980 \begin{itemize} 981 \item defers to the list-head's tracking of first/last (``through the head''), or 982 \item applies its own knowledge of the current pattern, to name a position/element that happens to be first/last (``of known element''). 983 \end{itemize} 984 985 The accessor patterns, at the (\CFA) API level, are: 986 \begin{description} 987 \item[all (through the) head:] Both inserts and removes happen through the list head. The list head operations are @insert_first@, @insert_last@, @remove_first@ and @remove_last@. 988 \item[insert (of known) element] \dots and remove through head: Inserts use @insert_before(e, first)@ or @insert_after(e, last)@, where @e@ is being inserted and @first@/@last@ are element references known by list-independent means. 989 \item[remove (of known) element] \dots and insert through the head: Removes use @remove(e)@, where @e@ is being removed. 
List-independent knowledge establishes that @e@ is first or last, as appropriate. 990 \end{description} 991 992 Comparing all-head with insert-element gives the relative performance of head-mediated \vs element-oriented insertion, because both use the same removal style. 993 Comparing all-head with remove-element gives the relative performance of head-mediated \vs element-oriented removal, because both use the same insertion style. 994 995 \subsubsection{Sizing} 996 997 998 It is true, but perhaps not obvious, that building and destroying long lists is slower than building and destroying short lists. 999 Obviously, indeed, it takes longer to fuse and divide a hundred neighbours than five. 1000 But the key metric in this work, IR, is about a single link--unlink. 1001 So, critically, linking and unlinking a hundred neighbours actually takes \emph{more} than $20\times$ the time for five neighbours. 1002 The main reason is caching; when more neighbours are being manipulated, more memory is being read and written. 1003 1004 But caching success is about more than the amount of memory worked on. 1005 Subtle changes in pattern become butterfly effects. 1006 Aggressive ILP scheduling, which enables short IR times, is the amplifier. 1007 A data dependency, present in one framework but not another, is on the critical path in one situation but not in another. 1008 So, duration's response to size is not a steady worsening as size increases. 1009 Rather, each size-independent configuration often responds to size increases with leaps of worsening. 1010 Occasionally a leap is even followed by a size-run of retrograde response, where a suddenly incurred penalty has a chance to amortize away. 1011 The frameworks tend to leapfrog over each other, at different points, as size increases. 1012 1013 The analysis treats these behaviours as incidental. 1014 It does not try to characterize various exact-size responses. 
1015 Rather, size zones are picked, specific effects inside of a zone are averaged away, and the story at one zone is compared to that at another zone. 1016 1017 To preview, \VRef[Section]{toc:coarse-compre} dismisses ``Large'' sizes (above 150 elements), where the performance story is dominated by the amount of memory touched, inherently, by the choice of intrusive list \vs wrapped, and where one intrusive framework is quite obviously as good as another. 1018 At smaller sizes, comparing one intrusive framework to another makes sense; this comparison occurs in the remaining ``Result'' sections. 1019 Among the ``not Large'' sizes used there, two further zones, Small and Medium, are selected as representatives of what can vary when the scale is changed. 1020 These particular ranges were chosen because each range tends to have one story repeated across its constituent sizes. 1021 If \CFA's duration increases across Small, then the other frameworks' usually do too. 1022 If \CFA is beating \uCpp at the low end of Large, then it usually is at the high end too. 1023 The leapfrogging tends to happen outside of these two ranges. 1024 1025 A spot of poor performance appears in the general results for \CFA at size 1. 1026 Section \MLB{TODO:xref} explores the phenomenon and concludes that it is an anomaly due to a quirky interaction with the testing rig. 1027 To do so, it considers size as either length or width. 1028 Length is the number of elements in a list. 1029 Width is the number of these lists being kept, worked upon in round-robin order. 1030 Outside of \MLB{TODO:xref}, size always means length, and width is 1. 1031 1032 1033 1034 \subsubsection{Execution Environment} 1035 \label{s:ExperimentalEnvironment} 1036 1037 The performance experiments are run on: 1038 \begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt] 1039 %\item[PC] 1040 %with a 64-bit eight-core AMD FX-8370E, with ``taskset'' pinning to core \#6. 
%The machine has 16 GB of RAM and 8 MB of last-level cache. 1041 %\item[ARM] 1042 %Gigabyte E252-P31 128-core socket 3.0 GHz, WO memory model 1043 \item[AMD] 1044 Supermicro AS--1125HS--TNR EPYC 9754 128--core socket, hyper-threading $\times$ 2 sockets (512 processing units) 2.25 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 1024KB L2, 16MB L3, where each L3 cache covers 1 NUMA node and 8 cores (16 processors). 1045 \item[Intel] 1046 Supermicro SYS-121H-TNR Xeon Gold 6530 32--core, hyper-threading $\times$ 2 sockets (128 processing units) 2.1 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 2048KB L2, 160MB L3, where each L3 cache covers 2 NUMA nodes and 32 cores (64 processors). 1047 \end{description} 1048 The experiments are single-threaded and pinned to a single core to prevent any OS movement, which might cause cache or NUMA effects perturbing the experiment. 1049 1050 The compiler is gcc/g++-14.2.0 running on the Linux v6.8.0-52-generic OS. 1051 Switching between the default memory allocators @glibc@ and @llheap@ is done with @LD_PRELOAD@. 1052 To prevent eliding certain code patterns, crucial parts of a test are wrapped by the function @pass@: 1053 \begin{cfa} 1054 // prevent eliding, cheaper than volatile 1055 static inline void * pass( void * v ) { __asm__ __volatile__( "" : "+r"(v) ); return v; } 1056 ... 1057 pass( &remove_first( lst ) ); // wrap call to prevent elision, insert cannot be elided now 1058 \end{cfa} 1059 The call to @pass@ can prevent a small number of compiler optimizations, but this cost is the same for all lists. 1060 1061 1062 The main difference in the machines is their cache structure. 1063 The AMD has smaller caches that are shared less, while the Intel shares larger caches among more processors. 1064 This difference, while an interesting tradeoff for highly concurrent use, is rather one-sided for sequential use, such as this experiment's. 1065 The Intel offers a single processor a bigger cache.
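To make the cache discussion concrete, a quick back-of-envelope script translates the listed cache capacities into approximate list lengths that fit each level. This is an illustrative sketch only, not part of the thesis tooling: the 32-byte node size is an assumption (payload plus two link fields), and the Intel L2 figure is taken as 2048KB.

```python
# Rough capacity check: how many list nodes fit in each cache level.
# Per-core cache sizes follow the machine descriptions above; the
# 32-byte node is an assumed illustrative size.
NODE_BYTES = 32

caches = {
    "AMD":   {"L1d": 32 * 1024, "L2": 1024 * 1024, "L3": 16 * 1024 * 1024},
    "Intel": {"L1d": 32 * 1024, "L2": 2048 * 1024, "L3": 160 * 1024 * 1024},
}

def nodes_fitting(machine):
    """Map each cache level to the number of nodes it can hold."""
    return {level: size // NODE_BYTES for level, size in caches[machine].items()}

print(nodes_fitting("AMD"))  # {'L1d': 1024, 'L2': 32768, 'L3': 524288}
```

Under these assumptions, the ``below 150 elements'' sizes used later sit comfortably inside even L1, so cache-capacity effects there come from the harness's total footprint rather than the list alone.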
1066 1067 \subsubsection{Recap and Master Legend} 1068 1069 There are 12 use cases, which are all combinations of 2 movements, 2 polarities and 3 accessors. 1070 There are 4 physical contexts, which are all combinations of 2 machines and 2 size (length) zones (and 1 width, of value 1). 1071 Each physical context samples 4 specific sizes. 1072 1073 There are 3.25 frameworks. 1074 This accounting considers how LQ-list supports only the movement--polarity combination ``stack, insert first.'' 1075 So LQ-list fills a quarter of the otherwise-orthogonal space. 1076 1077 Use case, physical context and framework are the explanatory factors. 1078 Taking all combinations of the explanatory factors gives 624 individual configurations. 1079 1080 Though there are multiple experimental trials of each configuration (to assure repeatability), the usual measure is mean IR duration among the trials, considered for each of the 624 individual configurations. 1081 1082 All means reported in this analysis are geometric. 1083 1084 \MLB{TODO: add example plots; explain histogram of 624} 1085 1086 1087 \subsection{Result: Coarse comparison of styles} 1088 1089 This comparison establishes how an intrusive list performs compared with a wrapped-reference list. 1090 \VRef[Figure]{fig:plot-list-zoomout} presents insert/remove duration at various list lengths for a linear and random (shuffled) insert/remove test. 1091 Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing different ends of the access spectrum. 1092 In the graphs, all four intrusive lists (lq-list, lq-tailq, upp-upp, cfa-cfa, see end of \VRef{s:Contenders}) are plotted with the same symbol; 1093 sometimes these symbols clump on top of each other, showing the performance difference among intrusive lists is small in comparison to the wrapped list (std::list). 1094 See~\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists.
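The configuration accounting in the recap (12 use cases, 4 physical contexts sampling 4 sizes each, and ``3.25'' frameworks yielding 624 configurations) and the geometric-mean convention can be sanity-checked with a short script. This is an illustrative sketch, not the thesis's plotting code; the factor labels come from the text, while the helper names are invented here.

```python
from itertools import product
from math import prod

# Use cases: 2 movements x 2 polarities x 3 accessors = 12
movements = ["stack", "queue"]
polarities = ["insert-first", "insert-last"]
accessors = ["all head", "insert element", "remove element"]
use_cases = list(product(movements, polarities, accessors))

# Physical contexts: 2 machines x 2 size zones, each sampling 4 sizes
contexts = list(product(["AMD", "Intel"], ["Small", "Medium"]))
sizes_per_zone = 4

# "3.25 frameworks": three full frameworks, plus lq-list, which covers only
# the stack/insert-first quarter of the movement--polarity space
full_frameworks = ["lq-tailq", "upp-upp", "cfa-cfa"]
lq_list_cases = [uc for uc in use_cases if uc[:2] == ("stack", "insert-first")]

configurations = (len(full_frameworks) * len(use_cases) + len(lq_list_cases)) \
                 * len(contexts) * sizes_per_zone
print(configurations)  # (3*12 + 3) = 39 framework/use-case pairs, x 16 context/size cells = 624

def geometric_mean(durations):
    """All means reported in the analysis are geometric."""
    return prod(durations) ** (1.0 / len(durations))
```

The geometric mean is the appropriate aggregate here because IR durations vary multiplicatively across configurations, so one slow outlier cannot swamp the summary the way it would an arithmetic mean.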
1095 1096 The list lengths start at 10 due to the short insert/remove times of 2--4 ns, for intrusive lists, \vs STL's wrapped-reference list of 15--20 ns. 1097 For very short lists, like 4, the experiment time of 4 $\times$ 2.5 ns and experiment overhead (loops) of 2--4 ns, result in an artificial administrative bump at the start of the graph having nothing to do with the insert/remove times. 1098 As the list size grows, the administrative overhead for intrusive lists quickly disappears. 1099 1100 \begin{figure} 1101 \centering 1102 \setlength{\tabcolsep}{0pt} 1103 \begin{tabular}{p{0.75in}p{2.75in}p{3in}} 1104 & 1105 \subfloat[Linear List Nodes, AMD]{\label{f:Linear-swift} 1106 \hspace*{-0.75in} 1107 \includegraphics{plot-list-zoomout-noshuf-swift.pdf} 1108 } % subfigure 1109 & 1110 \subfloat[Linear List Nodes, Intel]{\label{f:Linear-java} 1111 \includegraphics{plot-list-zoomout-noshuf-java.pdf} 1112 } % subfigure 1113 \\ 1114 & 1115 \subfloat[Random List Nodes, AMD]{\label{f:Random-swift} 1116 \hspace*{-0.75in} 1117 \includegraphics{plot-list-zoomout-shuf-swift.pdf} 1118 } % subfigure 1119 & 1120 \subfloat[Random List Nodes, Intel]{\label{f:Random-java} 1121 \includegraphics{plot-list-zoomout-shuf-java.pdf} 1122 } % subfigure 1123 \end{tabular} 1124 \caption{Insert/remove duration \vs list length. 1125 Lengths go as large as possible without error. 1126 One example use case is shown: stack movement, insert-first polarity and head-mediated access. Lower is better.} 1127 \label{fig:plot-list-zoomout} 1128 \end{figure} 1129 1130 The key performance factor between the intrusive and the wrapped-reference lists is the dynamic allocation for the wrapped nodes. 1131 Hence, this experiment is largely measuring the cost of @malloc@/\-@free@ rather than insert/remove, and is sensitive to the layout of memory by the allocator.
For insert/remove of an intrusive list, the cost is manipulating the link fields, as seen in the relatively similar results for the different intrusive lists. 1133 For insert/remove of a wrapped-reference list, the costs are: dynamically allocating/deallocating a wrapped node, copying an external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list; 1134 the allocation dominates these costs. 1135 For example, the experiment was run with both glibc and llheap memory allocators, where llheap reduced the cost from 20 to 16 ns, still far from the 2--4 ns for linking an intrusive node. 1136 Hence, there is no way to tease apart the allocation, copying, and linking costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage that storage. 1137 1138 In detail, \VRef[Figure]{f:Linear-swift}--\subref*{f:Linear-java} shows linear insertion of all the nodes and then linear removal, both in the same direction. 1139 For intrusive lists, the nodes are adjacent in memory from being preallocated in an array. 1140 For wrapped lists, the wrapped nodes happen to be adjacent because the memory allocator uses bump allocation during the initial phase of allocation. 1141 As a result, these memory layouts result in high spatial and temporal locality for both kinds of lists during the linear array traversal. 1142 With address look-ahead, the hardware does an excellent job of managing the multi-level cache. 1143 Hence, performance is largely constant for both kinds of lists, until L3 cache and NUMA boundaries are crossed for longer lists and the costs increase consistently for both kinds of lists. 1144 For example, on AMD (\VRef[Figure]{f:Linear-swift}), there is one NUMA node but many small L3 caches, so performance slows down quickly as multiple L3 caches come into play, and remains constant at that level, except for some anomalies for very large lists.
1145 On Intel (\VRef[Figure]{f:Linear-java}), there are four NUMA nodes and four slowdown steps as list-length increases. 1146 At each step, the difference between the kinds of lists decreases as the NUMA effect increases. 1147 1148 In detail, \VRef[Figure]{f:Random-swift}--\subref*{f:Random-java} shows random insertion and removal of the nodes. 1149 As for linear, there is the issue of memory allocation for the wrapped list. 1150 As well, the consecutive storage-layout is the same (array and bump allocation). 1151 Hence, the difference is the random linking among nodes, resulting in random accesses, even though the list is traversed linearly, resulting in similar cache events for both kinds of lists. 1152 Both \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} show the slowdown of random access as the list-length grows resulting from stepping out of caches into main memory and crossing NUMA nodes. 1153 % Insert and remove operations act on both sides of a link. 1154 %Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location. 1155 As for linear, the Intel (\VRef[Figure]{f:Random-java}) graph shows steps from the four NUMA nodes. 1156 Interestingly, after $10^6$ nodes, intrusive lists are slower than wrapped. 1157 I did not have time to track down this anomaly, but I speculate it results from the difference in touching the data in the accessed node, as the data and links are together for intrusive and separated for wrapped. 1158 For the llheap memory-allocator and the two tested architectures, intrusive lists outperform wrapped lists up to size $10^3$ for both linear and random, and performance begins to converge around $10^6$ nodes as architectural issues begin to dominate. 1159 Clearly, memory allocator and hardware architecture play a large role in the total cost and the crossover points as list-size increases.
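The structural difference driving these costs can be sketched abstractly (in Python here, matching the repository's tooling language; the C/\CFA versions differ in detail but not in shape): an intrusive element carries its own link fields, so insertion is pure link manipulation, while a wrapped-reference list must allocate a separate node per insertion, the analogue of the @malloc@ that dominates the measurements. The class and method names below are invented for illustration.

```python
class Item:
    """User item with an intrusive link field embedded in it."""
    def __init__(self, value):
        self.value = value
        self.next = None            # intrusive link: storage lives in the item


class IntrusiveStack:
    """Insertion/removal is pure link manipulation -- no per-operation allocation."""
    def __init__(self):
        self.head = None

    def insert_first(self, item):
        item.next, self.head = self.head, item

    def remove_first(self):
        item, self.head = self.head, self.head.next
        item.next = None
        return item


class _WrapNode:
    """Separate node allocated on every insertion (the malloc analogue)."""
    __slots__ = ("item", "next")
    def __init__(self, item, nxt):
        self.item, self.next = item, nxt


class WrappedStack:
    """Wrapped-reference style: the item is untouched; a node wraps a reference to it."""
    def __init__(self):
        self.head = None

    def insert_first(self, item):
        self.head = _WrapNode(item, self.head)   # per-insert allocation

    def remove_first(self):
        node, self.head = self.head, self.head.next
        return node.item                          # node becomes garbage (the free analogue)
```

Both stacks round-trip the same items in the same order; only the intrusive one avoids a per-operation allocation, which is why the measured gap largely reflects allocator cost rather than linking cost.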
1160 % In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a non-intrusive list may be preferred for lists of approximately the cache's size. 1161 1162 The takeaway from this experiment is that wrapped-list operations are expensive because memory allocation is expensive at this fine-grained level of execution. 1163 Hence, when possible, using intrusive links can produce a significant performance gain, even if nodes must be dynamically allocated, because the wrapping allocations are eliminated. 1164 Even when space is a consideration, intrusive links may not use more storage if a node is often linked. 1165 Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures or their tool-set does not provide them. 1166 1167 % Note, linear access may not be realistic unless dynamic size changes may occur; 1168 % if the nodes are known to be adjacent, use an array. 1169 1170 % In a wrapped-reference list, list nodes are allocated separately from the items put into the list. 1171 % Intrusive beats wrapped at the smaller lengths, and when shuffling is avoided, because intrusive avoids dynamic memory allocation for list nodes. 1172 1173 % STL's performance is not affected by element order in memory. 1174 %The field of intrusive lists begins with length-1 operations costing around 10 ns and enjoys a ``sweet spot'' in lengths 10--100 of 5--7-ns operations. 1175 % This much is also unaffected by element order. 1176 % Beyond this point, shuffled-element list performance worsens drastically, losing to STL beyond about half a million elements, and never particularly leveling off. 1177 % In the same range, an unshuffled list sees some degradation, but holds onto a 1--2 $\times$ speedup over STL.
1178 1179 % The apparent intrusive ``sweet spot,'' particularly its better-than-length-1 speed, is not because of list operations truly running faster. 1180 % Rather, the worsening as length decreases reflects the per-operation share of harness overheads incurred at the outer-loop level. 1181 % Disabling the harness's ability to drive interleaving, even though the current scenario is using a ``never work in middle'' interleave, made this rise disappear. 1182 % Subsequent analyses use length-controlled relative performance when comparing intrusive implementations, making this curiosity disappear. 1183 1184 % The remaining big-swing comparison points say more about a computer's memory hierarchy than about linked lists. 1185 % The tests in this chapter are only inserting and removing. 1186 % They are not operating on any user payload data that is being listed. 1187 % The drastic differences at large list lengths reflect differences in link-field storage density and in correlation of link-field order to element order. 1188 % These differences are inherent to the two list models. 1189 1190 % A wrapped-reference list's separate nodes are allocated right beside each other in this experiment, because no other memory allocation action is happening. 1191 % As a result, the interlinked nodes of the STL list are generally referencing their immediate neighbours. 1192 % This pattern occurs regardless of user-item shuffling because this test's ``use'' of the user-items' array is limited to storing element addresses. 1193 % This experiment, driving an STL list, is simply not touching the memory that holds the user data. 1194 % Because the interlinked nodes, being the only touched memory, are generally adjacent, this case too has high memory locality and stays fast. 1195 1196 % But the comparison of unshuffled intrusive with wrapped-reference gives the performance of these two styles, with their the common impediment of overfilling the cache removed. 
% Intrusive consistently beats wrapped-reference by about 20 ns, at all sizes. 1198 % This difference is appreciable below list length 0.5 M, and enormous below 10 K. 1199 1200 1201 \subsection{Result: Intrusive Winners and Losers} 1202 \label{s:ComparingIntrusiveImplementations} 1203 1204 The preceding result shows the intrusive frameworks have better performance than the wrapped lists for small- to medium-sized lists. 1205 This analysis covers the experiment position taken in \VRef{s:AddRemovePerformance} for movement, polarity, and accessor. 1206 \VRef[Figure]{f:ExperimentOperations} shows the experiment use cases tested, which results in 12 experiments (I--XII) for comparing intrusive frameworks. 1207 To preclude hardware interference, only list sizes below 150 are examined to differentiate among the intrusive frameworks. 1208 The data is selected from the start of \VRef[Figures]{f:Linear-swift}--\subref*{f:Linear-java}, but the start of \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} is largely the same. 1209 1045 1210 1046 1211 \begin{figure} 1047 1212 \centering 1048 1213 \includegraphics{plot-list-1ord.pdf} 1049 \caption{Histogram of operationdurations, decomposed by all first-order effects.1050 Each of the three breakdowns divides the entire population of test results into its mutually disjoint constituents. Higher in column is better}1214 \caption{Histogram of IR durations, decomposed by all first-order effects. 1215 Each of the three breakdowns divides the entire population of test results into its mutually disjoint constituents. The measure is duration; lower is better.} 1051 1216 \label{fig:plot-list-1ord} 1052 1217 \end{figure} … … 1059 1224 The size effect is more pronounced on the AMD with its smaller L3 cache than it is on the Intel. 1060 1225 (No NUMA effects for these list sizes.)
1061 Specifically, a 20\% standard deviation exists here, between the means four physical-effect categories.1226 Specifically, a 20\% standard deviation exists here, between the means of the four physical-effect categories. 1062 1227 The key takeaway for this comparison is the context it establishes for interpreting the following framework comparisons. 1063 Both the particulars of a themachine's cache design, and a list length's effect on the program's cache friendliness, affect insert/remove speed in the manner illlustrated in this breakdown.1228 Both the particulars of a machine's cache design, and a list length's effect on the program's cache friendliness, affect insert/remove speed in the manner illustrated in this breakdown. 1064 1229 That is, if you are running on an unknown machine, at a scale above anomaly-prone individuals, and below where major LLC caching effects take over the general intrusive-list advantage, but with an unknown relationship to the sizing of your fickle low-level caches, you are likely to experience an unpredictable speed impact on the order of 20\%. 1065 1230 1066 A similar situation comes from \VRef[Figure]{fig:plot-list-1ord}'s second comparison, by operation type.1231 A similar situation comes from \VRef[Figure]{fig:plot-list-1ord}'s second comparison, by use case. 1067 1232 Specific interactions do occur, like framework X doing better on stacks than on queues; a selection of these is addressed in \VRef[Figure]{fig:plot-list-2ord} and discussed shortly.
1068 But they are so irrelevant to the issue of picking a winning framework that it is sufficient here to number the operations opaquely.1069 Whether a given list implementationis suitable for a language's general library succeeds or fails without knowledge of whether your use will have stack or queue movement.1070 So you face another lottery, with a likely win-loss range of the standard deviation of the individual operations' means: 9\%.1233 But they are so irrelevant to the issue of picking a winning framework that it is sufficient here to number the use cases opaquely. 1234 Whether a given list framework is suitable for a language's general library succeeds or fails without knowledge of whether your use will have stack or queue movement. 1235 So you face another lottery, with a likely win-loss range of the standard deviation of the individual use cases' means: 9\%. 1071 1236 1072 1237 This context helps interpret \VRef[Figure]{fig:plot-list-1ord}'s final comparison, by framework. … … 1076 1241 1077 1242 Now, the LQs do indeed beat the UW languages by 15\%, a fact explored further in \MLB{TODO: xref}. 1078 But so too does operation VIII typically beat operationIV by 38\%.1243 But so too does use case VIII typically beat use case IV by 38\%. 1079 1244 As does a small size on the Intel typically beat a medium size on the AMD by 66\%. 1080 1245 Framework choice is simply not where you stand to win or lose the most. 1081 1246 1247 1248 \subsection{Intrusive Sweet and Sore Spots} 1249 1082 1250 \begin{figure} 1083 1251 \centering 1084 1252 \includegraphics{plot-list-2ord.pdf} 1085 \caption{Histogram of operationdurations, illustrating interactions with framework.1253 \caption{Histogram of IR durations, illustrating interactions with framework. 1086 1254 Each distribution shows how its framework reacts to a single other factor being varied across one pair of options. 
1087 1255 Every (binned and mean-contributing) individual data point represents a pair of test setups, one with the criterion set to the option labelled at the top; the other setup uses the bottom option. … … 1094 1262 \VRef[Figure]{fig:plot-list-1ord} stays razor-focused on only first-order effects in order to contextualize a winner/loser framework observation. 1095 1263 But this perspective cannot address questions like, ``Where are \CFA's sore spots?'' 1096 Moreover, the shallow threatment of operations by ordinals said nothing about how stack usage compares with queues'.1264 Moreover, the shallow treatment of use cases by ordinals said nothing about how stack usage compares with queues'. 1097 1265 1098 1266 \VRef[Figure]{fig:plot-list-2ord} provides such answers. … … 1117 1285 The strongest effect is \CFA's aversion to removal by element---certainly an opportunity for improvement. 1118 1286 1287 1288 \subsection{\CFA Tiny-Size Anomaly} 1289 1290 The \CFA list occasionally showed a concerning slowdown at length 1. 1291 The issue, seen in \VRef[Figure]{fig:plot-list-short} (top-left corner), has \CFA taking above 10 ns per IR. 1292 It occurs only for the queue movement, only on the AMD machine, and only for the \CFA framework. 1293 1294 Length-1 performance is an important case. 1295 Lists like those of waiting threads are frequently left empty, with the occasional thread (or few) momentarily joining. 1296 These scenarios must perform well. 1297 1298 A cause of this behaviour was never determined. 1299 Speculation is that \CFA's increased data dependency, a result of the tagging scheme, pairs poorly with the situation implied by queue movement. 1300 The aliasing, at length 1, is: the head's first element is the head's last element. 1301 With stack movement, one of these aliases is used twice, while with queue movement, both are used in alternation. 1302 1303 The breakdowns earlier in the performance assessment work by varying length only.
1304 That is, they see the story down the leftmost column in a triangle. 1305 The insight for contextualizing this issue was to inspect both length and width. 1306 1307 The issue is practically mitigated by noticing that the difficulty fades away as width increases. 1308 This effect is seen both in \VRef[Figure]{fig:plot-list-short}'s easement across the top triangle rows, and, zoomed farther out, in \VRef[Figure]{fig:plot-list-wide}. 1309 1310 Increasing the width matters to the aliasing hypothesis. 1311 In a narrow experiment, one element's insert and remove happen in rapid succession. 1312 So, the two aliases are exercised closer together, making a data hazard (that lacks ideal hardware treatment) stretch the instruction-pipeline schedule more significantly. 1313 Increasing the width adds harness-induced gaps between the uses of each alias, behind which a potential hazard can hide. 1314 1315 In the practical scenario that judges length-1 performance as relevant, width 1 is contrived. 1316 A thread putting itself on an often-empty waiters' list is not doing so on one such list repeatedly, at least not without taking other situation-induced pauses. 1317 1318 Thus, the congestion at low width and length comes from the harness using repetition (in order to obtain a measurable time). 1319 It does not reflect the situation that motivates the legitimate desire for good length-1 performance. 1320 1321 There likely is a real hazard, unique to the \CFA framework, when a queue movement is repeated on a tiny list \emph{without other intervening action}. 1322 Doing so is believed to occur only in contrived situations.
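The width mitigation can be illustrated with a toy harness, a sketch only, using Python deques rather than the real rig: at width $W$, the repetition that times one IR is spread round-robin over $W$ length-1 lists, so a gap separates successive touches of any one list's aliased first/last element. The function name and structure are invented for illustration.

```python
from collections import deque

def run_queue_ir(width, operations):
    """Round-robin queue movement over `width` lists, each held at length 1.

    Each IR inserts at the front and removes at the back of one list, then
    moves to the next list; larger widths add a gap before the same list's
    aliased first/last element is exercised again.
    """
    lists = [deque([None]) for _ in range(width)]
    for i in range(operations):
        lst = lists[i % width]   # round-robin selection across the width
        lst.appendleft(i)        # insert-first ...
        lst.pop()                # ... queue movement: remove at the other end
    return [len(lst) for lst in lists]
```

Width 1 degenerates to hammering a single list back-to-back, the contrived case; any larger width interleaves the lists, modelling the gaps a real waiters'-list workload would have, while every list stays at length 1 throughout.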
1323 1324 1325 \begin{figure} 1326 \centering 1327 \includegraphics[trim={00in, 5.5in, 0in, 0in}, clip, scale=0.8]{plot-list-short-temp.pdf} 1328 \caption{Behaviour at very short lengths.} 1329 \label{fig:plot-list-short} 1330 \end{figure} 1331 1332 \begin{figure} 1333 \centering 1334 \includegraphics[trim={0.25in, 1in, 0.25in, 1in}, clip, scale=0.5]{plot-list-wide-temp.pdf} 1335 \caption{Length-1 anomaly resolving at modest width. Points are for varying widths, at fixed length 1.} 1336 \label{fig:plot-list-wide} 1337 \end{figure} 1338 1119 1339 \begin{comment} 1120 1340 These remarks are mostly about 3ord over 2ord. … … 1123 1343 They illustrate the difficult signal-to-noise ratio that I had to overcome in preparing this data. 1124 1344 They may serve as a reference guiding future \CFA linked-list work by informing on where to target improvements. 1125 Finally, the findings offer the conclusion that \CFA's list offers more cons ostent performance across usage scenarios, than the other lists.1345 Finally, the findings offer the conclusion that \CFA's list offers more consistent performance across usage scenarios, than the other lists. 1126 1346 \end{comment} 1127 1347 … … 1136 1356 \end{comment} 1137 1357 1138 1139 1140 \MLB{ TODO: find a home for these original conclusions: 1141 cfa-upp similarity holde for all halves by movement or polarity; 1142 splitting on accessor, \CFA has a poor result on element removal, LQ-list has a great result on the other accessors, and uC++ is unaffected. } 1143 1358 \begin{comment} 1144 1359 1145 1360 \begin{figure} … … 1175 1390 The error bars show fastest and slowest time seen on five trials, and the central point is the mean of the remaining three trials. 1176 1391 For readability, the points are slightly staggered at a given horizontal value, where the points might otherwise appear on top of each other. 
1177 The experiment runs twelve operating scenarios;1392 The experiment runs twelve use cases; 1178 1393 the ones chosen for their variety are scenarios I and VIII from the listing of \VRef[Figure]{fig:plot-list-mchn-szz}, and their results appear in the rows. 1179 1394 As in the previous experiment, each hardware architecture appears in a column. … … 1218 1433 With this adjustment, absolute duration values (in nonsecods) are lost. 1219 1434 In return, the physical quadrants are re-combined, enabling assessment of the non-physical factors. 1435 \end{comment} 1220 1436 1221 1437 \begin{comment} -
doc/theses/mike_brooks_MMath/plots/ListCommon.py
r99bc47b r4cf8832 68 68 69 69 explanations = ['movement', 'polarity', 'accessor', 70 'NumNodes', 70 'NumNodes', 'Width', 'Length', 71 71 'SizeZone', # note fd: NumNodes -> SizeZone 72 72 'fx', … … 192 192 193 193 def getSingleResults( 194 dsname = 'general',194 dsnames = ['general'], 195 195 machines = allMachines, 196 196 *, … … 202 202 203 203 timings = pd.concat([ 204 getMachineDataset( dsname, m ) 204 getMachineDataset( d, m ) 205 for d in dsnames 205 206 for m in machines ]) 206 207 … … 287 288 288 289 def printManySummary(*, 289 dsname = 'general',290 dsnames = ['general'], 290 291 machines = allMachines, 291 292 metafileCore, … … 302 303 303 304 for op in metadata.itertuples(): 304 timings = getSingleResults(dsname , machines,305 timings = getSingleResults(dsnames, machines, 305 306 fxs=fxs, 306 307 tgtMovement = op.movement, … … 324 325 325 326 def printSingleDetail( 326 dsname = 'general',327 dsnames = ['general'], 327 328 machines = allMachines, 328 329 *, … … 336 337 337 338 338 timings = getSingleResults(dsname , machines,339 timings = getSingleResults(dsnames, machines, 339 340 fxs = fxs, 340 341 tgtMovement = tgtMovement, -
doc/theses/mike_brooks_MMath/plots/list-1ord.py
r99bc47b r4cf8832 13 13 ops = ['movement', 'polarity', 'accessor'] 14 14 fx = ['fx'] 15 bkgnd = ['NumNodes'] # never drilled/marginalized, always conditioned 15 bkgnd = ['NumNodes'] # never drilled/marginalized, always conditioned 16 ignore = [ 'InterleaveFrac', # unused ever and always zero 17 'Width', # unused here and always one 18 'Length' ] # unused here and always =NumNodes 16 19 20 # assure every explanation is classified 17 21 assert( set( explanations ) 18 - set( ['InterleaveFrac'] ) # unused and always zero22 - set( ignore ) 19 23 == 20 24 set(physicals) | set(ops) | set(fx) | set(bkgnd) ) -
doc/theses/mike_brooks_MMath/plots/list-zoomout-noshuf-java.py
r99bc47b r4cf8832 8 8 9 9 printSingleDetail( 10 dsname ='zoomout-noshuf',10 dsnames=['zoomout-noshuf'], 11 11 machines=['java'], 12 12 tgtMovement = 'stack', -
doc/theses/mike_brooks_MMath/plots/list-zoomout-noshuf-swift.py
r99bc47b r4cf8832 8 8 9 9 printSingleDetail( 10 dsname ='zoomout-noshuf',10 dsnames=['zoomout-noshuf'], 11 11 machines=['swift'], 12 12 tgtMovement = 'stack', -
doc/theses/mike_brooks_MMath/plots/list-zoomout-shuf-java.py
r99bc47b r4cf8832 8 8 9 9 printSingleDetail( 10 dsname ='zoomout-shuf',10 dsnames=['zoomout-shuf'], 11 11 machines=['java'], 12 12 tgtMovement = 'stack', -
doc/theses/mike_brooks_MMath/plots/list-zoomout-shuf-swift.py
r99bc47b r4cf8832 8 8 9 9 printSingleDetail( 10 dsname ='zoomout-shuf',10 dsnames=['zoomout-shuf'], 11 11 machines=['swift'], 12 12 tgtMovement = 'stack',