Changeset 16a843b for doc/theses/mike_brooks_MMath
- Timestamp:
- Jul 19, 2025, 2:35:56 AM (2 months ago)
- Branches:
- master
- Children:
- 7640ff5, da10157
- Parents:
- 47064914
- Location:
- doc/theses/mike_brooks_MMath
- Files:
-
- 40 added
- 3 edited
doc/theses/mike_brooks_MMath/Makefile
r47064914 r16a843b 95 95 -include $(Plots)/*.d 96 96 97 ${Build}/plot-%.dat: ${Plots}/%.py ${Plots}/common.py ${Plots}/%.py.INPUTS | ${Build} 97 ${Build}/plot-%.dat: ${Plots}/%.py ${Plots}/common.py ${Plots}/ListCommon.py ${Plots}/%.py.INPUTS | ${Build} 98 98 python3 $< > $@ 99 99 … … 102 102 103 103 #-include ${Build}/*.d 104 105 106 # troubleshooting, e.g. `make echo_DEMOS` runs `echo $(DEMOS)` 107 echo_% : 108 @echo '$($(@:echo_%=%))' 109 -
doc/theses/mike_brooks_MMath/list.tex
r47064914 r16a843b 473 473 474 474 475 \section{Assessment} 476 \label{toc:lst:assess} 477 478 \subsection{Add-Remove Performance} 479 480 The fundamental job of a linked-list library is to manage the links that connect users' items. 481 Any link management is an action that causes pair(s) of elements to become, or cease to be, adjacent. 482 Thus, adding and removing an element are the sole primitive actions. 483 484 Repeated adding and removing is necessary to measure timing because these operations can be as simple as a dozen instructions. 485 These instruction sequences may have cases that proceed (in a modern, deep pipeline) without a stall. 486 487 This experiment takes the position that 488 \begin{itemize} 489 \item The total time to add and remove is relevant; an attribution of time spent adding vs.\ removing is not. 490 Any use case for which addition speed matters necessarily has removes paired with adds. 491 Otherwise, the alleged usage would exhaust the amount of work expressible as a main memory full of nodes within a few seconds. 492 \item A relevant breakdown ``by operation'' is, rather, one that considers the structural context of these requests. 493 \begin{description} 494 \item[movement] 495 Is the add/remove order that of a stack, a queue, or something else? 496 \item[polarity] 497 In which direction does the movement's action apply? For a queue, do items flow from first to last or last to first? For a stack, is the first-end or the last-end used for adding and removing? 498 \item[accessor] 499 Is an add/remove location given by a list head's ``first''/``last'', or by a reference to an individual element? 500 \end{description} 501 \item Speed differences caused by the host machine's memory hierarchy need to be identified and explained, 502 but do not represent advantages of one linked-list implementation over another. 503 \end{itemize} 504 505 This experiment measures the mean duration of a list addition and removal. 
506 Confidence bounds, on this mean, are discussed. 507 The distribution of speeds experienced by an individual add-remove pair (tail latency) is not discussed. 508 509 Space efficiency is shown only indirectly, by way of caches' impact on speed. 510 511 %~MONITOR 512 % If able to show cases with CFA doing better, reword. 513 The goal is to show the \CFA library performing comparably to other intrusive libraries, 514 in an experimental context sensitive enough to show also: 515 \begin{itemize} 516 \item intrusive lists performing (majorly) differently from wrapped lists 517 \item a space of (minor) performance differences typical of existing intrusive lists 518 \end{itemize} 519 520 521 \subsubsection{Experiment setup} 522 523 The experiment defines a user's datatype and considers 524 the speed of building, and tearing down, a list of $n$ instances of the user's type. 525 526 The timings are taken with a fixed-duration method based on checks of @clock()@. 527 In a typical 5-sec run, the outer looping checks the clock about 200 times. 528 A number of experimental rounds per clock check is precalculated to be appropriate to the value of $n$. 529 530 \begin{cfa} 531 // simplified harness: CFA implementation, 532 // stack movement, insert-first polarity, head-mediated access 533 size_t totalOpsDone = 0; 534 dlist( item_t ) lst; 535 item_t items[ n ]; 536 startTimer(); 537 while ( SHOULD_CONTINUE ) { 538 for ( i; n ) insert_first( lst, items[i] ); 539 for ( i; n ) remove_first( lst ); 540 totalOpsDone += n; 541 } 542 stopTimer(); 543 reportedDuration = getTimerDuration() / totalOpsDone; 544 \end{cfa} 545 546 One experimental round is, first, a tight loop of inserting $n$ elements into a list, followed by another, to remove these $n$ elements. 547 A counter is incremented by $n$ each round. 548 When the whole experiment is done, the total elapsed time, divided by the final value of the operation counter, 549 is reported as the observed mean operation duration. 
550 In a scatterplot presentation, each dot would be one such reported mean duration. 551 So, ``operation'' really means insert + remove + harness overhead. 552 553 The harness overheads are held constant when comparing linked-list implementations. 554 The remainder of the setup section discusses the choices that affected the harness overhead. 555 556 An \emph{iterators' array} provides support for element-level operations on non-intrusive lists. 557 As elaborated in Section \ref{toc:lst:issue:attach}, 558 wrapped-attachment lists use a distinct type (at a distinct memory location) to represent ``an item that's in the list.'' 559 Operations like insert-after and remove-here consume iterators. 560 In the STL implementation, an iterator is a pointer to a \lstinline{std::_List_node}. 561 For the STL case, the driver obtains an iterator value 562 at the time of adding to the list, and stores the iterator in an array, for consumption by subsequent element-oriented operations. 563 For intrusive-list cases, the driver stores the user object's address in the iterators' array. 564 565 \begin{c++} 566 // further simplified harness (bookkeeping elided): STL implementation, 567 // stack movement, insert-first polarity, element-based remove access 568 list< item_t * > lst; 569 item_t items[ n ]; 570 while ( SHOULD_CONTINUE ) { 571 @list< item_t * >::iterator iters[ n ];@ 572 for ( int i = 0; i < n; i += 1 ) { 573 lst.push_front( & items[i] ); 574 @iters[i]@ = lst.begin(); 575 } 576 for ( int i = 0; i < n; i += 1 ) { 577 lst.erase( @iters[i]@ ); 578 } 579 } 580 \end{c++} 581 582 %~MONITOR 583 % If running insert-random scenarios, revise the assessment 584 585 A \emph{shuffling array} helps control the memory layout of user items. 586 The required control is exercised when choosing the next item to insert. 587 The user items are allocated in a contiguous array. 
588 Without shuffling, the driver's insert phase visits these items in order, producing a list whose adjacency links hop in uniform strides. 589 With shuffling active, the driver's insert phase visits only the shuffling array in order, 590 which applies pseudo-random indirection to the selection of a next-to-insert element from the user-item array. 591 The result is a list whose links travel randomly far. 592 593 \begin{cfa} 594 // harness (bookkeeping and iterators elided): CFA implementation, 595 // stack movement, insert-first polarity, head-mediated access 596 dlist( item_t ) lst; 597 item_t items[ n ]; 598 size_t insert_ord[ n ]; // elided: populate with shuffled [0,n) 599 while ( SHOULD_CONTINUE ) { 600 for ( i; n ) insert_first( lst, items[ @insert_ord[@ i @]@ ] ); 601 for ( i; n ) remove_first( lst ); 602 } 603 \end{cfa} 604 605 \emph{Interleaving} allows for movements other than pure stack and queue. 606 Note that the earlier example of using the iterators' array is still a pure stack: the item selected for @erase(...)@ is always the first. 607 Including a less predictable movement is important because real applications that justify doubly linked lists use such movements. 608 Freedom to remove from arbitrary places (and to insert under more relaxed assumptions) is the defining capability of a doubly linked list. 609 A queue with drop-out is an example of such a movement. 610 A list implementation can show unrepresentative speed under a simple movement, for example, by enjoying unchallenged ``Is first element?'' branch predictions. 611 612 Interleaving brings ``at middle of list'' cases into a stream of add or remove invocations, which would otherwise be exclusively ``at end''. 613 A chosen split, like half middle and half end, populates a boolean array, which is then shuffled. 614 These booleans then direct the action to end-\vs-middle. 
615 616 \begin{cfa} 617 // harness (bookkeeping and shuffling elided): CFA implementation, 618 // stack movement, insert-first polarity, interleaved element-based remove access 619 dlist( item_t ) lst; 620 item_t items[ n ]; 621 @bool interl[ n ];@ // elided: populate with weighted, shuffled [0,1] 622 while ( SHOULD_CONTINUE ) { 623 item_t * iters[ n ]; 624 for ( i; n ) { 625 insert_first( lst, items[i] ); 626 iters[i] = & items[i]; 627 } 628 @item_t ** crsr[ 2 ]@ = { // two cursors into iters 629 & iters[ @0@ ], // at stack-insert-first's removal end 630 & iters[ @n / interl_frac@ ] // in middle 631 }; 632 for ( i; n ) { 633 item_t *** crsr_use = & crsr[ @interl[ i ]@ ]; 634 remove( *** crsr_use ); // removing from either middle or end 635 *crsr_use += 1; // that item is done 636 } 637 assert( crsr[0] == & iters[ @n / interl_frac@ ] ); // through second's start 638 assert( crsr[1] == & iters[ @n@ ] ); // did the rest 639 } 640 \end{cfa} 641 642 By using the pair of cursors, the harness avoids branches, which could incur prediction stall times themselves, or prime a branch in the SUT (system under test). 643 This harness avoids telling the hardware what the SUT is about to do. 644 645 These experiments are single-threaded. They run on a PC with a 64-bit eight-core AMD FX-8370E, with ``taskset'' pinning to core \#6. The machine has 16 GB of RAM and 8 MB of last-level cache. 646 647 The comparator linked-list implementations are: 648 \begin{description} 649 \item[lq-list] The @list@ type of LQ from glibc of GCC-11. 650 \item[lq-tailq] The @tailq@ type of the same. 651 \item[upp-upp] The @uSequence@ type provided by uC++. 652 \item[cfa-cfa] \CFA's @dlist@. 653 \end{description} 654 655 656 \subsubsection{Result: Coarse comparison of styles} 657 658 This comparison establishes how an intrusive list performs, compared with a wrapped-reference list. 659 It also establishes the context within which it is meaningful to compare one intrusive list to another. 
660 661 %These goals notwithstanding, the effect of the host machine's memory hierarchy is more significant here than linked-list implementation. 662 663 \begin{figure} 664 \centering 665 \begin{tabular}{c} 666 \includegraphics{plot-list-zoomout-shuf.pdf} \\ 667 (a) \\ 668 \includegraphics{plot-list-zoomout-noshuf.pdf} \\ 669 (b) \\ 670 \end{tabular} 671 \caption{Operation duration \vs list length across the full spectrum of list lengths. One example operation is shown: stack movement, insert-first polarity and head-mediated access. Lengths go as large as the run completes without error. Version (a) uses shuffled items, while version (b) links items with their physical neighbours.} 672 \label{fig:plot-list-zoomout} 673 \end{figure} 674 675 \VRef[Figure]{fig:plot-list-zoomout} presents the speed measures at various list lengths. 676 STL's wrapped-reference list begins with operations on a length-1 list costing around 30 ns. 677 This time grows modestly as list length increases, apart from more drastic worsening at the largest lengths. 678 STL's performance is not affected by element order in memory. 679 The field of intrusive lists begins with length-1 operations costing around 10 ns and enjoys a ``sweet spot'' in lengths 10--100 of 5--7-ns operations. 680 This much is also unaffected by element order. 681 Beyond this point, shuffled-element list performance worsens drastically, losing to STL beyond about half a million elements, and never particularly leveling off. 682 In the same range, an unshuffled list sees some degradation, but holds onto a 1--2 $\times$ speedup over STL. 683 684 The apparent intrusive ``sweet spot,'' particularly its better-than-length-1 speed, is not because of list operations truly running faster. 685 Rather, the worsening as length decreases reflects the per-operation share of harness overheads incurred at the outer-loop level. 
686 Disabling the harness's ability to drive interleaving, even though the current scenario uses a ``never work in middle'' interleave, made this rise disappear. 687 Subsequent analyses use length-controlled relative performance when comparing intrusive implementations, making this curiosity disappear. 688 689 In a wrapped-reference list, list nodes are allocated separately from the items put into the list. 690 Intrusive beats wrapped at the smaller lengths, and when shuffling is avoided, because intrusive avoids dynamic memory allocation for list nodes. 691 692 The remaining big-swing comparison points say more about a computer's memory hierarchy than about linked lists. 693 The tests in this chapter are only inserting and removing. 694 They are not operating on any user payload data that is being listed. 695 The drastic differences at large list lengths reflect differences in link-field storage density and in correlation of link-field order to element order. 696 These differences are inherent to the two list models. 697 698 The slowdown of shuffled intrusive occurs as the experiment's memory footprint grows out of last-level cache into main memory. 699 Insert and remove operations act on both sides of a link. 700 Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location. 701 Each time a next item is processed, the memory access is a hop to a randomly far address. 702 The target is not available in cache and a slowdown results. 703 704 With the unshuffled intrusive list, each link connects to an adjacent location. So, this case has high memory locality and stays fast. But the unshuffled assumption is simply not realistic: if you know items are adjacent, you don't need a linked list. 
705 706 A wrapped-reference list's separate nodes are allocated right beside each other in this experiment, because no other memory allocation action is happening. 707 As a result, the interlinked nodes of the STL list are generally referencing their immediate neighbours. 708 This pattern occurs regardless of user-item shuffling because this test's ``use'' of the user-items' array is limited to storing element addresses. 709 This experiment, driving an STL list, is simply not touching the memory that holds the user data. 710 Because the interlinked nodes, being the only touched memory, are generally adjacent, this case too has high memory locality and stays fast. 711 But the user-data no-touch assumption is often unrealistic: decisions like, ``Should I remove this item?'' need to look at the item. 712 In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a non-intrusive list may be preferred for lists of approximately the cache's size. 713 714 Therefore, under clearly typical situational assumptions, both intrusive and wrapped-reference lists will suffer similarly from a large list overfilling the memory cache, experiencing degradation like shuffled intrusive shows here. 715 716 But the comparison of unshuffled intrusive with wrapped-reference gives the performance of these two styles, with the common impediment of overfilling the cache removed. 717 Intrusive consistently beats wrapped-reference by about 20 ns, at all sizes. 718 This difference is appreciable below list length 0.5 M, and enormous below 10 K. 719 720 721 \subsubsection{Result: Comparing intrusive implementations} 722 723 The preceding result shows that intrusive implementations have noteworthy performance differences below 150 nodes. 724 This analysis zooms in on this area and identifies the participants. 
725 726 \begin{figure} 727 \centering 728 \begin{tabular}{c} 729 \includegraphics{plot-list-zoomin-abs.pdf} \\ 730 (a) \\ 731 \includegraphics{plot-list-zoomin-rel.pdf} \\ 732 (b) \\ 733 \end{tabular} 734 \caption{Operation duration \vs list length at small-medium lengths. One example operation is shown: stack movement, insert-first polarity and head-mediated access. (a) has absolute times. (b) has times relative to those of LQ-\lstinline{tailq}.} 735 \label{fig:plot-list-zoomin} 736 \end{figure} 737 738 \VRef{fig:plot-list-zoomin} part (a) shows exactly this zoom-in. 739 The same scenario as the coarse comparison is used: a stack, with insertions and removals happening at the end called ``first,'' ``head'' or ``front,'' and all changes occurring through a head-provided insert/remove operation. 740 The error bars show the fastest and slowest times seen on five trials, and the central point is the mean of the remaining three trials. 741 For readability, the frameworks are slightly staggered in the horizontal, but all trials near a given size were run at the same size. 742 743 For this particular operation, uC++ fares the worst, followed by \CFA, then LQ's @tailq@. 744 LQ's @list@ does the best at smaller lengths but loses its edge above a dozen elements. 745 746 Moving toward being able to consider several scenarios, \VRef{fig:plot-list-zoomin} part (b) shows the same result, adjusted to treat @tailq@ as a benchmark, and expressing all the results relative to it. 747 This change does not affect the who-wins statements; it just removes the ``sweet spot'' bend that the earlier discussion dismissed as incidental. 748 Runs faster than @tailq@'s are below zero and slower runs are above; @tailq@'s mean is always zero by definition, but its error bars, representing a single scenario's re-run stability, are still meaningful. 749 With this bend straightened out, aggregating across lengths is feasible. 
750 751 \begin{figure} 752 \centering 753 \begin{tabular}{c} 754 \includegraphics{plot-list-cmp-exout.pdf} \\ 755 (a) \\ 756 \includegraphics{plot-list-cmp-survey.pdf} \\ 757 (b) \\ 758 \end{tabular} 759 \caption{Operation duration ranges across operational scenarios. (a) has the supersets of the running example operation. (b) has the first-level slices of the full space of operations.} 760 \label{fig:plot-list-cmp-overall} 761 \end{figure} 762 763 \VRef{fig:plot-list-cmp-overall} introduces the resulting format. 764 Part (a)'s first column summarizes all the data of \VRef{fig:plot-list-zoomin}-(b). 765 Its x-axis label, ``stack/insfirst/allhead,'' names the concrete scenario that has been discussed until now. 766 Moving across the columns, the next three each stretch to include more scenarios on each of the operation dimensions, one at a time. 767 The second column considers the scenarios $\{\mathrm{stack}\} \times \{\mathrm{insfirst}\} \times \{\mathrm{allhead}, \mathrm{inselem}, \mathrm{remelem}\}$, 768 while the third stretches polarity and the fourth stretches accessor. 769 The next three columns each stretch two scenario dimensions and the last column stretches all three. 770 The \CFA bar in the last column is summarizing 840 test-program runs: 14 list lengths, 2 movements, 2 polarities, 3 accessors and 5 repetitions. 771 772 In the earlier plots of one scenario broken down by length, each data point, with its error bars, represents just 5 repetitions. 773 With a couple of exceptions, this reproducibility error was small. 774 Now, for a \CFA bar, summarizing 70 (first column) to 840 (last column) runs, a bar's height is dominated by the different behaviours of the scenarios and list lengths that it summarizes. 775 Accordingly, the first column's bars are short and the last column's are tall. 776 A box represents the inner 68\% of the durations, while its lines extend to cover 95\%. 777 The symbol on the bar is the mean duration. 
778 779 The chosen benchmark of LQ-@tailq@ is not shown in this format because it would be trivial here. 780 With inter-scenario differences dominating the bar size, and @tailq@'s mean performance defined to be zero in all scenarios, a @tailq@ bar on this plot would only show @tailq@'s re-run stability, which is of no comparison value. 781 782 The LQ-@list@ implementation does not support all scenarios, only stack movement with insert-first polarity. 783 So, its 1\textsuperscript{st}, 3\textsuperscript{rd}, 4\textsuperscript{th} and 7\textsuperscript{th} bars all summarize the same set of points (those with accessor constrained to all-head), as do its 2\textsuperscript{nd}, 5\textsuperscript{th}, 6\textsuperscript{th} and 8\textsuperscript{th} (those with accessor unconstrained). 784 785 Rather than exploring from one scenario out, \VRef{fig:plot-list-cmp-overall}-(b) gives a more systematic breakdown of the entire experimental space. 786 Other than the last grand-total column, each breakdown column shows one value from one operation dimension. 787 788 LQ-@list@'s partial scenario coverage gives missing bars where it does not support the operation. 789 And, again, it gives repetition where all data points occur in several columns' intersection, such as stack/*/* and */insfirst/*. 790 791 In the grand total, and in all halves by movement or polarity, \CFA and uC++ are equivalent, while LQ-@list@ beats them slightly. 792 Splitting on accessor, \CFA has a poor result on element removal, LQ-@list@ has a great result on the other accessors, and uC++ is unaffected. 793 The unseen @tailq@ dominates across every category and beats \CFA and uC++ by 15--20\%. 
794 795 % \begin{figure} 796 % \centering 797 % \begin{tabular}{c} 798 % \includegraphics{plot-list-cmp-intrl-shift.pdf} \\ 799 % (a) \\ 800 % \includegraphics{plot-list-cmp-intrl-outcome.pdf} \\ 801 % (b) \\ 802 % \end{tabular} 803 % \caption{Caption TODO} 804 % \label{fig:plot-list-cmp-intrl} 805 % \end{figure} 806 807 \subsubsection{Result: CFA cost attribution} 808 809 This comparison loosely itemizes the reasons that the \CFA implementation runs 15--20\% slower than LQ. Each reason provides for safer programming. For each reason, a version of the \CFA list was measured that forgoes its safety and regains some performance. These potential sacrifices are: 810 \newcommand{\mandhead}{\emph{mand-head}} 811 \newcommand{\nolisted}{\emph{no-listed}} 812 \newcommand{\noiter}{\emph{no-iter}} 813 \begin{description} 814 \item[mand(atory)-head] Removing support for headless lists. 815 A specific explanation of why headless support causes a slowdown is not offered. 816 But it is reasonable for a cost to result from making one piece of code handle multiple cases; the subset of the \CFA list API that applies to headless lists shares its implementation with headed lists. 817 In the \mandhead case, disabling the feature in \CFA means using an older version of the implementation, from before headless support was added. 818 In the pre-headless library, trying to form a headless list (instructing, ``Insert loose element B after loose element A,'') is a checked runtime error. 819 LQ does not support headless lists\footnote{ 820 Though its documentation does not mention the headless use case, headless use is impossible because, in every list model, one of its insert-before or insert-after routines is unusable. 821 For \lstinline{tailq}, the API requires a head. 822 For \lstinline{list}, this usage causes an ``uncaught'' runtime crash.}. 823 \item[no-listed] Removing support for the @is_listed@ API query. 
824 Along with it goes error checking such as ``When inserting an element, it must not already be listed, \ie be referred to from somewhere else.'' 825 These abilities have a cost because, in order to support them, a listed element that is being removed must be written to, to record its change in state. 826 In \CFA's representation, this cost is two pointer writes. 827 To disable the feature, these writes, and the error checking that consumes their result, are put behind an @#ifdef@. 828 The result is that a removed element sees itself as still having neighbours (though these quasi-neighbours see it differently). 829 This state is how LQ leaves a removed element; LQ does not offer an is-listed query. 830 \item[no-iter(ation)] Removing support for well-terminating iteration. 831 The \CFA list uses bit-manipulation tagging on link pointers (rather than \eg null links) to express, ``No more elements this way.'' 832 This tagging has the cost of submitting a retrieved value to the ALU, and awaiting this operation's completion, before dereferencing a link pointer. 833 In some cases, the is-terminating bit is transferred from one link to another, or has a similar influence on a resulting link value; this logic adds register pressure and more data dependency. 834 To disable the feature, the @#ifdef@-controlled tag manipulation logic compiles in answers like, ``No, that link is not a terminator,'' ``The dereferenceable pointer is the value you read from memory,'' and ``The terminator-marked value you need to write is the pointer you started with.'' 835 Without this termination marking, repeated requests for a next valid item will always provide a positive response; when it should be negative, the indicated next element is garbage data at an address unlikely to trigger a memory error. 836 LQ has a well-terminating iteration for listed elements. 837 In the \noiter case, the slowdown is not inherent; it represents a \CFA optimization opportunity. 
838 \end{description} 839 \MLB{Ensure benefits are discussed earlier and cross-reference} % an LQ programmer must know not to ask, ``Who's next?'' about an unlisted element; an LQ programmer cannot write assertions about an item being listed; LQ requiring a head parameter is an opportunity for the user to provide inconsistent data 840 841 \begin{figure} 842 \centering 843 \begin{tabular}{c} 844 \includegraphics{plot-list-cfa-attrib.pdf} \\ 845 (a) \\ 846 \includegraphics{plot-list-cfa-attrib-remelem.pdf} \\ 847 (b) \\ 848 \end{tabular} 849 \caption{Operation duration ranges for functionality-reduced \CFA list implementations. (a) has the top level slices. (b) has the next level of slicing within the slower element-based removal operation.} 850 \label{fig:plot-list-cfa-attrib} 851 \end{figure} 852 853 \VRef[Figure]{fig:plot-list-cfa-attrib} shows the \CFA list performance with these features, and their combinations, turned on and off. When a series name is one of the three sacrifices above, the series is showing this sacrifice in isolation. These further series names give combinations: 854 \newcommand{\attribFull}{\emph{full}} 855 \newcommand{\attribParity}{\emph{parity}} 856 \newcommand{\attribStrip}{\emph{strip}} 857 \begin{description} 858 \item[full] No sacrifices. Same as measurements presented earlier. 859 \item[parity] \mandhead + \nolisted. Feature parity with LQ. 860 \item[strip] \mandhead + \nolisted + \noiter. All options set to ``faster.'' 861 \end{description} 862 All list implementations are \CFA, possibly stripped. 863 The plot uses the same LQ-relative basis as earlier. 864 So getting to zero means matching LQ's @tailq@. 865 866 \VRef[Figure]{fig:plot-list-cfa-attrib}-(a) summarizes the time attribution across the main operating scenarios. 867 The \attribFull series is repeated from \VRef[Figure]{fig:plot-list-cmp-overall}, part (b), while the series showing feature sacrifices are new. 
868 Going all the way to \attribStrip at least nearly matches LQ in all operating scenarios, beats LQ often, and slightly beats LQ overall. 869 Except within the accessor splits, both sacrifices contribute improvements individually, \noiter helps more than \attribParity, and the total \attribStrip benefit depends on both contributions. 870 When the accessor is not element removal, the \attribParity shift appears to be counterproductive, leaving \noiter to deliver most of the benefit. 871 For element removals, \attribParity is the heavy hitter, with \noiter contributing modestly. 872 873 The counterproductive shift outside of element removals is likely due to an optimization done in the \attribFull version after implementing headless support, \ie not present in the \mandhead version. 874 This work streamlined both head-based operations (head-based removal being half the work of the element-insertion test). 875 This improvement could be ported to a \mandhead-style implementation, which would bring down the \attribParity time in these cases. 876 877 More significantly, missing this optimization affects every \attribParity result because they all use head-based inserts or removes for at least half their operations. 878 It is likely a reason that \attribParity is not delivering as well overall as \noiter. 879 It even represents plausible further improvements in \attribStrip. 880 881 \VRef[Figure]{fig:plot-list-cfa-attrib}-(b) addresses element removal being the overall \CFA slow spot and element removal having a peculiar shape in the (a) analysis. 882 Here, the \attribParity sacrifice bundle is broken out into its two constituents. 883 The result is the same regardless of the operation. 884 All three individual sacrifices contribute noteworthy improvements (\nolisted slightly less). 885 The fullest improvement requires all of them. 886 887 The \noiter feature sacrifice is unpalatable. 
888 But because it is not an inherent slowdown, there may be room to pursue a \noiter-level speed improvement without the \noiter feature sacrifice. 889 The performance crux for \noiter is the pointer-bit tagging scheme. 890 Alternative designs that may offer speedup with acceptable consequences include keeping the tag information in a separate field, and (for 64-bit architectures) keeping it in the high-order byte, \ie using byte- rather than bit-oriented instructions to access it. 891 The \noiter speed improvement would bring \CFA to +5\% of LQ overall, and from high twenties to high teens, in the worst case of element removal. 892 893 Ultimately, this analysis provides options for a future effort that needs to get the most speed out of the \CFA list. 894 895 896 897 475 898 \section{Future Work} 476 899 \label{toc:lst:futwork} -
doc/theses/mike_brooks_MMath/uw-ethesis.tex
r47064914 r16a843b 116 116 \newcommand{\uCpp}{$\mu$\CC} 117 117 \newcommand{\PAB}[1]{{\color{red}PAB: #1}} 118 \newcommand{\MLB}[1]{{\color{red}MLB: #1}} 118 119 119 120 % Hyperlinks make it very easy to navigate an electronic document.