Ignore:
Timestamp:
Aug 11, 2025, 9:21:35 PM (5 weeks ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master
Children:
e57a405
Parents:
502ded0
Message:

more proofreading of list chapter

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/mike_brooks_MMath/list.tex

    r502ded0 r9367cd6  
    77
    88
     9\section{Plan-9 Inheritance}
     10\label{s:Plan9Inheritance}
     11
     12This chapter uses a form of inheritance from the Plan-9 C dialect~\cite[\S~3.3]{Thompson90new}, which is supported by @gcc@ and @clang@ using @-fplan9-extensions@.
     13\CFA has its own variation of the Plan-9 mechanism, where the nested type is denoted using @inline@.
     14\begin{cfa}
     15union U {  int x;  double y;  char z; };
     16struct W { int i;  double j;  char k; };
     17struct S {
     18        @inline@ struct W;  $\C{// extended Plan-9 inheritance}$
     19        unsigned int tag;
     20        @inline@ U;                     $\C{// extended Plan-9 inheritance}$
     21} s;
     22\end{cfa}
     23Inline inheritance is containment, where the inlined field is unnamed and the type's internal fields are hoisted into the containing structure.
     24Hence, the field names must be unique, unlike \CC nested types, but any inlined type-names are at a nested scope level, unlike aggregate nesting in C.
     25Note, the position of the containment is normally unimportant, unless there is some form of memory or @union@ overlay.
     26Finally, the inheritance declaration of @U@ is not prefixed with @union@.
     27Like \CC, \CFA allows optional prefixing of type names with their kind, \eg @struct@, @union@, and @enum@, unless there is ambiguity with variable names in the same scope.
     28
     29\VRef[Figure]{f:Plan9Polymorphism} shows the key polymorphic feature of Plan-9 inheritance: implicit conversion of values and pointers for nested types.
     30In the example, there are implicit conversions from @S@ to @U@ and @S@ to @W@, extracting the appropriate value or pointer for the substructure.
     31\VRef[Figure]{f:DiamondInheritancePattern} shows complex multiple inheritance patterns are possible, like the \newterm{diamond pattern}~\cite[\S~6.1]{Stroustrup89}\cite[\S~4]{Cargill91}.
     32Currently, \CFA type-system does not support @virtual@ inheritance.
     33
     34\begin{figure}
     35\begin{cfa}
     36void f( U, U * ); $\C{// value, pointer}$
     37void g( W, W * ); $\C{// value, pointer}$
     38U u, * up;   W w, * wp;   S s, * sp;
     39u = s;   up = sp; $\C{// value, pointer}$
     40w = s;   wp = sp; $\C{// value, pointer}$
     41f( s, &s ); $\C{// value, pointer}$
     42g( s, &s ); $\C{// value, pointer}$
     43\end{cfa}
     44\caption{Plan-9 Polymorphism}
     45\label{f:Plan9Polymorphism}
     46\end{figure}
     47
     48\begin{figure}
     49\setlength{\tabcolsep}{10pt}
     50\begin{tabular}{ll@{}}
     51\multicolumn{1}{c}{\CC} & \multicolumn{1}{c}{\CFA}      \\
     52\begin{c++}
     53struct B { int b; };
     54struct L : @public B@ { int l, w; };
     55struct R : @public B@ { int r, w; };
     56struct T : @public L, R@ { int t; };
     57
     58T t = { { { 5 }, 3, 4 }, { { 8 }, 6, 7 }, 3 };
     59t.t = 42;
     60t.l = 42;
     61t.r = 42;
     62((L &)t).b = 42;  // disambiguate
     63\end{c++}
     64&
     65\begin{cfa}
     66struct B { int b; };
     67struct L { @inline B;@ int l, w; };
     68struct R { @inline B;@ int r, w; };
     69struct T { @inline L; inline R;@ int t; };
     70
     71T t = { { { 5 }, 3, 4 }, { { 8 }, 6, 7 }, 3 };
     72t.t = 42;
     73t.l = 42;
     74t.r = 42;
     75((L &)t).b = 42;  // disambiguate, proposed solution t.L.b = 42;
     76\end{cfa}
     77\end{tabular}
     78\caption{Diamond Non-Virtual Inheritance Pattern}
     79\label{f:DiamondInheritancePattern}
     80\end{figure}
     81
     82
    983\section{Features}
    1084
     
    1791This design covers system and data management issues stated in \VRef{toc:lst:issue}.
    1892
    19 \VRef[Figure]{fig:lst-features-intro} continues the running @req@ example from \VRef[Figure]{fig:lst-issues-attach} using the \CFA list.
     93\VRef[Figure]{fig:lst-features-intro} continues the running @req@ example from \VRef[Figure]{fig:lst-issues-attach} using the \CFA list @dlist@.
    2094The \CFA link attachment is intrusive so the resulting memory layout is per user node, as for the LQ version of \VRef[Figure]{f:Intrusive}.
    21 The \CFA framework provides generic type @dlink( T, T )@ for the two link fields (front and back).
    22 A user inserts the links into the @req@ structure via \CFA inline-inheritance from the Plan-9 C dialect~\cite[\S~3.3]{Thompson90new}.
    23 Inline inheritance is containment, where the inlined field is unnamed but the type's internal fields are hoisted into the containing structure.
    24 Hence, the field names must be unique, unlike \CC nested types, but the type names are at a nested scope level, unlike aggregate nesting in C.
    25 Note, the position of the containment is normally unimportant, unless there is some form of memory or @union@ overlay.
    26 The key feature of inlined inheritance is that a pointer to the containing structure is automatically converted to a pointer to any anonymous inline field for assignments and function calls, providing containment inheritance with implicit subtyping.
    27 Therefore, a reference to a @req@ is implicitly convertible to @dlink@ in assignments and function calls.
    28 % These links have a nontrivial, user-specified location within the @req@ structure;
    29 % this convention encapsulates the implied pointer arithmetic safely.
    30 The links in @dlist@ point at (links) in the containing node, know the offsets of all links (data is abstract), and any field-offset arithmetic or link-value changes are safe and abstract.
     95The \CFA framework provides generic type @dlink( T )@ (not to be confused with @dlist@) for the two link fields (front and back).
     96A user inserts the links into the @req@ structure via \CFA inline-inheritance \see{\VRef{s:Plan9Inheritance}}.
     97Lists leverage the automatic conversion of a pointer to anonymous inline field for assignments and function calls.
     98Therefore, a reference to a @req@ is implicitly convertible to @dlink@ in both contexts.
     99The links in @dlist@ point at another embedded @dlist@ node, know the offsets of all links (data is abstract but accessible), and any field-offset arithmetic or link-value changes are safe and abstract.
    31100
    32101\begin{figure}
     
    39108\end{figure}
    40109
     110\VRef[Figure]{f:dlistOutline} shows the outline for types @dlink@ and @dlist@.
     111Note, the first @forall@ clause is distributed across all the declaration in its containing block, eliminating repetition on each declaration.
     112The second nested @forall@ on @dlist@ is also distributed and adds an optional second type parameter, @tLinks@, denoting the linking axis \see{\VRef{s:Axis}}, \ie the kind of list this node can appear on.
     113
     114\begin{figure}
     115\begin{cfa}
     116forall( tE & ) {                                        $\C{// distributed}$
     117        struct @dlink@ { ...  };                $\C{// abstract type}$
     118        static inline void ?{}( dlink( tE ) & this );  $\C{// constructor}$
     119
     120        forall( tLinks & = dlink( tE ) ) { $\C{// distributed, default type for axis}$
     121                struct @dlist@ { ... };         $\C{// abstract type}$
     122                static inline void ?{}( dlist( tE, tLinks ) & this ); $\C{// constructor}$
     123        }
     124}
     125\end{cfa}
     126\caption{\lstinline{dlisk} / \lstinline{dlist} Outline}
     127\label{f:dlistOutline}
     128\end{figure}
     129
    41130\VRef[Figure]{fig:lst-features-multidir} shows how the \CFA library supports multi-inline links, so a node can be on one or more lists simultaneously.
    42131The declaration of @req@ has two inline-inheriting @dlink@ occurrences.
    43132The first of these gives a type named @req.by_pri@, @req@ inherits from it, and it inherits from @dlink@.
    44133The second line @req.by_rqr@ is similar to @req.by_pri@.
    45 Thus, there is a diamond, non-virtual, inheritance from @req@ to @dlink@, with @by_pri@ and @by_rqr@ being the mid-level types.
    46 
    47 Disambiguation occurs in the declarations of the list-head objects: @reqs_pri_global@, @reqs_rqr_42@, @reqs_rqr_17@, and @reqs_rqr_99@.
    48 The type of the variable @reqs_pri_global@ is @dlist(req, req.by_pri)@, meaning operations called on @reqs_pri_global@ are implicitly disambiguated.
    49 In the example, the calls @insert_first(reqs_pri_global, ...)@ imply, ``here, we are working by priority.''
     134Thus, there is a diamond, non-virtual, inheritance from @req@ to @dlink@, with @by_pri@ and @by_rqr@ being the mid-level types \see{\VRef[Figure]{f:DiamondInheritancePattern}}.
     135
     136The declarations of the list-head objects, @reqs_pri@, @reqs_rqr_42@, @reqs_rqr_17@, and @reqs_rqr_99@, bind which link nodes in @req@ are used by this list.
     137Hence, the type of the variable @reqs_pri@, @dlist(req, req.by_pri)@, means operations called on @reqs_pri@ implicitly select (disambiguate) the correct @dlink@s, \eg the calls @insert_first(reqs_pri, ...)@ imply, ``here, we are working by priority.''
    50138As in \VRef[Figure]{fig:lst-issues-multi-static}, three lists are constructed, a priority list containing all nodes, a list with only nodes containing the value 42, and a list with only nodes containing the value 17.
    51139
     
    69157\end{figure}
    70158
    71 The \CFA library also supports the common case, of single directionality, more naturally than LQ. 
    72 \VRef[Figure]{fig:lst-features-intro} shows a single-direction list done with no contrived name for the link direction,
    73 where \VRef[Figure]{f:Intrusive} adds the unnecessary field name, @d@.
    74 In \CFA, a user doing a single direction (\VRef[Figure]{fig:lst-features-intro}) sets up a simple inheritance with @dlink@, and declares a list head to have the simpler type @dlist( T )@.
    75 In contrast, (\VRef[Figure]{fig:lst-features-multidir}) sets up a diamond inheritance with @dlink@, and declares a list head to have the more-informed type @dlist( T, DIR )@.
    76 
    77 The directionality issue also has an advanced corner-case that needs treatment.
    78 When working with multiple directions, calls like @insert_first@ benefit from implicit direction disambiguation;
    79 however, other calls like @insert_after@ still require explicit disambiguation, \eg the call
    80 \begin{cfa}
    81 insert_after(r1, r2);
    82 \end{cfa}
    83 does not have enough information to clarify which of a request's simultaneous list directions is intended.
     159The list library also supports the common case of single directionality more naturally than LQ. 
     160Returning to \VRef[Figure]{fig:lst-features-intro}, the single-direction list has no contrived name for the link direction as it uses the default type in the definition of @dlist@;
     161in contrast, the LQ list in \VRef[Figure]{f:Intrusive} adds the unnecessary field name @d@.
     162In \CFA, a single direction list sets up a single inheritance with @dlink@, and the default list axis is to itself.
     163
     164When operating on a list with several directions and operations that do not take the list head, the list axis can be ambiguous.
     165For example, a call like @insert_after( r1, r2 )@ does not have enough information to know which axes to select implicitly.
    84166Is @r2@ supposed to be the next-priority request after @r1@, or is @r2@ supposed to join the same-requester list of @r1@?
    85 As such, the \CFA compiler gives an ambiguity error for this call.
    86 To resolve the ambiguity, the list library provides a hook for applying the \CFA language's scoping and priority rules.
    87 It applies as:
    88 \begin{cfa}
    89 with ( DLINK_VIA(req, req.pri) ) insert_after(r1, r2);
     167As such, the \CFA type-system gives an ambiguity error for this call.
     168There are multiple ways to resolve the ambiguity.
     169The simplest is an explicit cast on each call to select the specific axis, \eg @insert_after( (by_pri)r1, r2 )@.
     170However, multiple explicit casts are tedious and error-prone.
     171To mitigate this issue, the list library provides a hook for applying the \CFA language's scoping and priority rules.
     172\begin{cfa}
     173with ( DLINK_VIA(  req, req.pri ) ) insert_after( r1, r2 );
    90174\end{cfa}
    91175Here, the @with@ statement opens the scope of the object type for the expression;
    92176hence, the @DLINK_VIA@ result causes one of the list directions to become a more attractive candidate to \CFA's overload resolution.
    93 This boost applies within the scope of the following statement, but could also be a custom block or an entire function body.
     177This boost can be applied across multiple statements in a block or an entire function body.
    94178\begin{cquote}
    95179\setlength{\tabcolsep}{15pt}
    96180\begin{tabular}{@{}ll@{}}
    97181\begin{cfa}
    98 void f() @with( DLINK_VIA(req, req.pri) )@ {
    99         ...
    100 
    101         insert_after(r1, r2);
    102 
    103         ...
    104 }
     182@with( DLINK_VIA( req, req.pri ) ) {@
     183        ...  insert_after( r1, r2 );  ...
     184@}@
    105185\end{cfa}
    106186&
    107187\begin{cfa}
    108 void f() {
    109         ...
    110         @with( DLINK_VIA(req, req.pri) )@ {
    111                 ...  insert_after(r1, r2);  ...
    112         }
    113         ...
    114 }
     188void f() @with( DLINK_VIA( req, req.pri ) ) {@
     189        ... insert_after( r1, r2 ); ...
     190@}@
    115191\end{cfa}
    116192\end{tabular}
    117193\end{cquote}
    118 By using a larger scope, a user can put code within that acts as if there is only one list direction.
    119 This boost is needed only when operating on a list with several directions, using operations that do not take the list head.
    120 Otherwise, the sole applicable list direction \emph{just works}.
    121 
    122 Unlike \CC templates container-types, the \CFA library works completely within the type system;
    123 both @dlink@ and @dlist@ are ordinary types.
     194Within the @with@, the code acts as if there is only one list direction.
     195
     196Unlike the \CC template container-types, the \CFA library works completely within the type system;
     197both @dlink@ and @dlist@ are ordinary types, not language macros.
    124198There is no textual expansion other than header-included static-inline function for performance.
    125 Errors in user code are reported only with mention of the library's declarations.
    126 Finally, the library is separately compiled from the usage code.
    127 
    128 The \CFA library works in headed and headless modes.  TODO: elaborate.
     199Hence, errors in user code are reported only with mention of the library's declarations.
     200Finally, the library is separately compiled from the usage code, modulo inlining.
     201
     202The \CFA library works in headed and headless modes.
     203\PAB{TODO: elaborate.}
    129204
    130205
     
    142217@last@ returns a reference to the last node of the list without removing it or @0p@ if the list is empty.
    143218\item
    144 @insert_before@ adds a node before the specified node and returns the added node for cascading.
    145 \item
    146 @insert_after@ adds a node after the specified node and returns the added node for cascading.
    147 \item
    148 @remove@ removes the specified node from the list (any location) and returns a reference to the node.
     219@insert_before@ adds a node before a specified node \see{\lstinline{insert_last} for insertion at the end}\footnote{
     220Some list packages allow \lstinline{0p} (\lstinline{nullptr}) for the before/after node implying insert/remove at the start/end of the list, respectively.
     221However, this inserts an \lstinline{if} statement in the fastpath of a potentially commonly used list operation.}
     222\item
     223@insert_after@ adds a node after a specified node \see{\lstinline{insert_first} for insertion at the start}\footnotemark[\value{footnote}].
     224\item
     225@remove@ removes a specified node from the list (any location) and returns a reference to the node.
    149226\item
    150227@iter@ create an iterator for the list.
     
    158235@isLast@ returns true if the node is the last node in the list and false otherwise.
    159236\item
    160 @pred@ returns a reference to the previous (predecessor, towards first) node before the specified node or @0p@ if the specified node is the first node in the list.
    161 \item
    162 @next@ returns a reference to the next (successor, towards last) node after the specified node or @0p@ if the specified node is the last node in the list.
     237@pred@ returns a reference to the previous (predecessor, towards first) node before a specified node or @0p@ if a specified node is the first node in the list.
     238\item
     239@next@ returns a reference to the next (successor, towards last) node after a specified node or @0p@ if a specified node is the last node in the list.
    163240\item
    164241@insert_first@ adds a node to the start of the list so it becomes the first node and returns a reference to the node for cascading.
     
    236313The choice depends on how the programmer wants to access the fields: @it->f@ or @it.f@.
    237314The following examples use a reference because the loop body manipulates the node values rather than the list pointers.
    238 The end of iteration is denoted by the loop cursor having the null pointer, denoted @0p@ in \CFA.
     315The end of iteration is denoted by the loop cursor returning @0p@.
    239316
    240317\noindent
     
    443520
    444521\VRef[Figure]{fig:lst-impl-links} continues the running @req@ example, now showing the \CFA list library's internal representation.
    445 The @dlink@ structure contains exactly two pointers: @next@ and @prev@, which are opaque to a user.
     522The @dlink@ structure contains exactly two pointers: @next@ and @prev@, which are opaque.
    446523Even though the user-facing list model is linear, the CFA library implements all listing as circular.
    447 This choice helps achieve uniform end treatment and TODO finish summarizing benefit.
     524This choice helps achieve uniform end treatment and \PAB{TODO finish summarizing benefit}.
    448525A link pointer targets a neighbouring @dlink@ structure, rather than a neighbouring @req@.
    449526(Recall, the running example has the user putting a @dlink@ within a @req@.)
     
    451528\begin{figure}
    452529        \centering
    453         \includegraphics{lst-impl-links.pdf}
     530        \includegraphics[width=\textwidth]{lst-impl-links.pdf}
    454531        \caption{
    455532                \CFA list library representations for the cases under discussion.
     
    458535\end{figure}
    459536
    460 System-added link pointers (dashed lines) are internally tagged to indicate linear endpoints that are the circular pointers.
     537Circular link-pointers (dashed lines) are tagged internally in the pointer to indicate linear endpoints.
    461538Links among neighbour nodes are not tagged.
    462 Iteration reports ``has more elements'' when crossing natural links, and ``no more elements'' upon reaching a tagged link.
     539Iteration reports ``has more elements'' when accessing untagged links, and ``no more elements'' when accessing tagged links.
     540Hence, the tags are set on the links that a user cannot navigate.
    463541
    464542In a headed list, the list head (@dlist(req)@) acts as an extra element in the implementation-level circularly-linked list.
    465 The content of a @dlist@ is a (private) @dlink@, with the @next@ pointer to the first element, and the @prev@ pointer to the last element.
    466 Since the head wraps a @dlink@, as does a @req@ does too, and since a link-pointer targets a @dlink@, the resulting cycle is among @dlink@ structures, situated inside of header/node.
     543The content of a @dlist@ header is a (private) @dlink@, with the @next@ pointer to the first element, and the @prev@ pointer to the last element.
     544Since the head wraps a @dlink@, as does @req@, and since a link-pointer targets a @dlink@, the resulting cycle is among @dlink@ structures, situated inside of header/node.
    467545An untagged pointer points within a @req@, while a tagged pointer points within a list head.
    468546In a headless list, the circular backing list is only among @dlink@s within @req@s.
    469 The tags are set on the links that a user cannot navigate.
    470 
    471 No distinction is made between an unlisted item under a headed model and a singleton list under a headless model.
     547
     548No distinction is made between an unlisted node (top left and middle) under a headed model and a singleton list under a headless model.
    472549Both are represented as an item referring to itself, with both tags set.
    473550
     
    476553\label{toc:lst:assess}
    477554
     555This section examines the performance of the discussed list implementations.
     556The goal is to show the \CFA lists are competitive with other designs, but the different list designs may not have equivalent functionality, so it is impossible to select a winner encompassing both functionality and execution performance.
     557
     558
    478559\subsection{Add-Remove Performance}
    479560
    480 The fundamental job of a linked-list library is to manage the links that connect users' items.
    481 Any link management is an action that causes pair(s) of elements to become, or cease to be, adjacent.
     561The fundamental job of a linked-list library is to manage the links that connect nodes.
     562Any link management is an action that causes pair(s) of elements to become, or cease to be, adjacent in the list.
    482563Thus, adding and removing an element are the sole primitive actions.
    483564
     
    485566These instruction sequences may have cases that proceed (in a modern, deep pipeline) without a stall.
    486567
    487 This experiment takes the position that
     568This experiment takes the position that:
    488569\begin{itemize}
    489     \item The total time to add and remove is relevant; an attribution of time spent adding vs.\ removing is not.
    490           Any use case for which addition speed matters necessarily has removes paired with adds.
    491                   For otherwise, the alleged usage would exhaust the amount of work expressable as a main-memory full of nodes within a few seconds.
    492     \item A relevant breakdown ``by operation'' is, rather, one that considers the structural context of these requests.
     570    \item The total time to add and remove is relevant \vs individual times for add and remove.
     571          Adds without removes quickly fill memory;
     572                  removes without adds is meaningless.
     573    \item A relevant breakdown ``by operation'' is:
    493574                \begin{description}
    494575                \item[movement]
    495           Is the add/remove order that of a stack, a queue, or something else?
     576          Is the add/remove applied to a stack, queue, or something else?
    496577                \item[polarity]
    497                   In which direction does the movement's action apply?  For a queue, do items flow from first to last or last to first?  For a stack, is the first-end or the last-end used for adding and removing?
     578                  In which direction does the action apply?
     579                  For a queue, do items flow from first to last or last to first?
     580                  For a stack, is the first or last end used for adding and removing?
    498581                \item[accessor]
    499582          Is an add/remove location given by a list head's ``first''/``last'', or by a reference to an individual element?
     
    503586\end{itemize}
    504587
    505 This experiment measures the mean duration of a list addition and removal.
     588The experiment used to measure insert/remove cost measures the mean duration of a sequence of additions and removals.
    506589Confidence bounds, on this mean, are discussed.
    507590The distribution of speeds experienced by an individual add-remove pair (tail latency) is not discussed.
    508 
    509591Space efficiency is shown only indirectly, by way of caches' impact on speed.
    510 
    511 %~MONITOR
    512 % If able to show cases with CFA doing better, reword.
    513 The goal is to show the \CFA library performing comparably to other intrusive libraries,
    514 in an experimental context sensitive enough to show also:
     592The experiment is sensitive enough to show:
    515593\begin{itemize}
    516     \item intrusive lists performing (majorly) differently than wrapped lists
    517     \item a space of (minor) performance differences typical of existing intrusive lists
     594    \item intrusive lists performing (majorly) differently than wrapped lists,
     595    \item a space of (minor) performance differences among the intrusive lists.
    518596\end{itemize}
    519597
     
    521599\subsubsection{Experiment setup}
    522600
    523 The experiment defines a user's datatype and considers
    524 the speed of building, and tearing down, a list of $n$ instances of the user's type.
    525 
    526 The timings are taken with a fixed-duration method based on checks @clock()@.
    527 In a typical 5-sec run, the outer looping checks the clock about 200 times.
    528 A number of experimental rounds per clock check is precalculated to be appropriate to the value of $n$.
    529 
     601The experiment driver defines a (intrusive) node type:
     602\begin{cfa}
     603struct Node {
     604        int i, j, k;  // fields
     605        // possible intrusive links
     606};
     607\end{cfa}
     608and considers the speed of building and tearing down a list of $n$ instances of it.
     609% A number of experimental rounds per clock check is precalculated to be appropriate to the value of $n$.
    530610\begin{cfa}
    531611// simplified harness: CFA implementation,
    532612// stack movement, insert-first polarity, head-mediated access
    533613size_t totalOpsDone = 0;
    534 dlist( item_t ) lst;
    535 item_t items[ n ];
     614dlist( node_t ) lst;
     615node_t nodes[ n ];                                              $\C{// preallocated list storage}$
    536616startTimer();
    537 while ( SHOULD_CONTINUE ) {
    538         for ( i; n ) insert_first( lst, items[i] );
    539         for ( i; n ) remove_first( lst );
     617while ( CONTINUE ) {                                    $\C{// \(\approx\) 20 second duration}$
     618        for ( i; n ) insert_first( lst, nodes[i] );  $\C{// build up}$
     619        for ( i; n ) remove_first( lst );       $\C{// tear down}$
    540620        totalOpsDone += n;
    541621}
    542622stopTimer();
    543 reportedDuration = getTimerDuration() / totalOpsDone;
    544 \end{cfa}
    545 
    546 One experimental round is, first, a tight loop of inserting $n$ elements into a list, followed by another, to remove these $n$ elements.
    547 A counter is incremented by $n$ each round.
    548 When the whole experiment is done, the total elapsed time, divided by final value of the operation counter,
    549 is reported as the observed mean operation duration.
    550 In a scatterplot presentation, each dot would be one such reported mean duration.
    551 So, ``operation'' really means insert + remove + harness overhead.
    552 
    553 The harness overheads are held constant when comparing linked-list implementations.
    554 The remainder of the setup section discusses the choices that affected the harness overhead.
    555 
    556 An \emph{iterators' array} provides support for element-level operations on non-intrusive lists.
    557 As elaborated in Section \ref{toc:lst:issue:attach},
    558 wrapped-attachment lists use a distinct type (at a distinct memory location) to represent ``an item that's in the list.''
    559 Operations like insert-after and remove-here consume iterators.
    560 In the STL implementation, an iterator is a pointer to a \lstinline{std::_List_node}.
    561 For the STL case, the driver obtains an iterator value
    562 at the time of adding to the list, and stores the iterator in an array, for consumption by subsequent element-oriented operations.
    563 For intrusive-list cases, the driver stores the user object's address in the iterators' array.
    564 
     623reportedDuration = getTimerDuration() / totalOpsDone; // throughput per insert/remove operation
     624\end{cfa}
     625To reduce administrative overhead, the $n$ nodes for each experiment list are preallocated in an array (on the stack), which removes dynamic allocations for this storage.
     626These nodes contain an intrusive pointer for intrusive lists, or a pointer to it is stored in a dynamically-allocated internal-node for a wrapper list.
     627Copying the node for wrapped lists skews the results with administration costs.
     628The list insertion/removal operations are repeated for a typical 20+ second duration.
     629After each round, a counter is incremented by $n$ (for throughput).
     630Time is measured outside the loop because a large $n$ can overrun the time duration before the @CONTINUE@ flag is tested.
     631The loop duration is divided by the counter and this throughput is reported.
     632In a scatter-plot, each dot is one throughput, which means insert + remove + harness overhead.
     633The harness overhead is constant when comparing linked-list implementations and keep as small as possible.
     634% The remainder of the setup section discusses the choices that affected the harness overhead.
     635
     636To test list operations, the experiment performs the inserts/removes in different patterns, \eg insert and remove from front, insert from front and remove from back, random insert and remove, \etc.
     637Unfortunately, the @std::list@ does \emph{not} support direct insert/remove from a node without an iterator, \ie no @erase( node )@, even though the list is doubly-linked.
     638To eliminate this additional cost in the harness, a trick is used for random insertions without replacement.
     639The @i@ fields in each node are initialized from @0..n-1@, these @i@ values are then shuffled in the nodes, and the @i@ value is used to represent an indirection to that node for insertion, so the nodes are inserted in random order, and hence, removed in th same random order.
     640$\label{p:Shuffle}$
    565641\begin{c++}
    566 // further simplified harness (bookkeeping elided): STL implementation,
    567 // stack movement, insert-first polarity, element-based remove access
    568 list< item_t * > lst;
    569 item_t items[ n ];
    570 while ( SHOULD_CONTINUE ) {
    571         @list< item_t * >::iterator iters[ n ];@
    572         for ( int i = 0; i < n; i += 1 ) {
    573                 lst.push_front( & items[i] );
    574                 @iters[i]@ = lst.begin();
     642        for ( i; n ) @nodes[i].i = i@;          $\C{// indirection}$
     643        shuffle( nodes, n );                            $\C{// random shuffle indirects within nodes}$
     644
     645        while ( CONTINUE ) {
     646                for ( i; n ) insert_first( lst, nodes[ @nodes[i].i@ ] ); $\C{// build up}$
     647                for ( i; n ) pass( &remove_first( lst ) );      $\C{// tear down}$
     648                totalOpsDone += n;
    575649        }
    576         for ( int i = 0; i < n; i += 1 ) {
    577                 lst.erase( @iters[i]@ );
    578         }
    579 }
    580650\end{c++}
    581 
    582 %~MONITOR
    583 % If running insert-random scenarios, revise the assessment
    584 
    585 A \emph{shuffling array} helps control the memory layout of user items.
    586 The control required is when choosing a next item to insert.
    587 The user items are allocated in a contiguous array.
    588 Without shuffling, the driver's insert phase visits these items in order, producing a list whose adjavency links hop uniform strides.
    589 With shuffling active, the driver's insert phase visits only the shuffling array in order,
    590 which applies pseudo-random indirection to the selection of a next-to-insert element from the user-item array.
    591 The result is a list whose links travel randomly far.
    592 
    593 \begin{cfa}
    594 // harness (bookkeeping and iterators elided): CFA implementation,
    595 // stack movement, insert-first polarity, head-mediated access
    596 dlist( item_t ) lst;
    597 item_t items[ n ];
    598 size_t insert_ord[ n ];  // elided: populate with shuffled [0,n)
    599 while ( SHOULD_CONTINUE ) {
    600         for ( i; n ) insert_first( lst, items[ @insert_ord[@ i @]@ ] );
    601         for ( i; n ) remove_first( lst );
    602 }
    603 \end{cfa}
    604 
    605 \emph{Interleaving} allows for movements other than pure stack and queue.
    606 Note that the earlier example of using the iterators' array is still a pure stack: the item selected for @erase(...)@ is always the first.
    607 Including a less predictable movement is important because real applications that justify doubly linked lists use them.
    608 Freedom to remove from arbitrary places (and to insert under more relaxed assumptions) is the characteristic function of a doubly linked list.
    609 A queue with drop-out is an example of such a movement.
    610 A list implementation can show unrepresentative speed under a simple movement, for example, by enjoying unchallenged ``Is first element?'' branch predictions.
    611 
    612 Interleaving brings ``at middle of list'' cases into a stream of add or remove invocations, which would otherwise be exclusively ``at end''.
    613 A chosen split, like half middle and half end, populates a boolean array, which is then shuffled.
    614 These booleans then direct the action to end-\vs-middle.
    615 
    616 \begin{cfa}
    617 // harness (bookkeeping and shuffling elided): CFA implementation,
    618 // stack movement, insert-first polarity, interleaved element-based remove access
    619 dlist( item_t ) lst;
    620 item_t items[ n ];
    621 @bool interl[ n ];@  // elided: populate with weighted, shuffled [0,1]
    622 while ( SHOULD_CONTINUE ) {
    623         item_t * iters[ n ];
    624         for ( i; n ) {
    625                 insert_first( items[i] );
    626                 iters[i] = & items[i];
    627         }
    628         @item_t ** crsr[ 2 ]@ = { // two cursors into iters
    629                 & iters[ @0@ ], // at stack-insert-first's removal end
    630                 & iters[ @n / interl_frac@ ]  // in middle
    631         };
    632         for ( i; n ) {
    633                 item *** crsr_use = & crsr[ interl[ i ] ]@;
    634                 remove( *** crsr_use );  // removing from either middle or end
    635                 *crsr_use += 1;  // that item is done
    636         }
    637         assert( crsr[0] == & iters[ @n / interl_frac@ ] ); // through second's start
    638         assert( crsr[1] == & iters[ @n@ ] );  // did the rest
    639 }
    640 \end{cfa}
    641 
    642 By using the pair of cursors, the harness avoids branches, which could incur prediction stall times themselves, or prime a branch in the SUT.
    643 This harness avoids telling the hardware what the SUT is about to do.
    644 
    645 These experiments are single threaded.  They run on a PC with a 64-bit eight-core AMD FX-8370E, with ``taskset'' pinning to core \#6.  The machine has 16 GB of RAM and 8 MB of last-level cache.
     651This approach works across intrusive and wrapped lists.
     652
     653% \emph{Interleaving} allows for movements other than pure stack and queue.
     654% Note that the earlier example of using the iterators' array is still a pure stack: the item selected for @erase(...)@ is always the first.
     655% Including a less predictable movement is important because real applications that justify doubly linked lists use them.
     656% Freedom to remove from arbitrary places (and to insert under more relaxed assumptions) is the characteristic function of a doubly linked list.
     657% A queue with drop-out is an example of such a movement.
     658% A list implementation can show unrepresentative speed under a simple movement, for example, by enjoying unchallenged ``Is first element?'' branch predictions.
     659
     660% Interleaving brings ``at middle of list'' cases into a stream of add or remove invocations, which would otherwise be exclusively ``at end''.
     661% A chosen split, like half middle and half end, populates a boolean array, which is then shuffled.
     662% These booleans then direct the action to end-\vs-middle.
     663%
     664% \begin{cfa}
     665% // harness (bookkeeping and shuffling elided): CFA implementation,
     666% // stack movement, insert-first polarity, interleaved element-based remove access
     667% dlist( item_t ) lst;
     668% item_t items[ n ];
     669% @bool interl[ n ];@  // elided: populate with weighted, shuffled [0,1]
     670% while ( CONTINUE ) {
     671%       item_t * iters[ n ];
     672%       for ( i; n ) {
     673%               insert_first( items[i] );
     674%               iters[i] = & items[i];
     675%       }
     676%       @item_t ** crsr[ 2 ]@ = { // two cursors into iters
     677%               & iters[ @0@ ], // at stack-insert-first's removal end
     678%               & iters[ @n / interl_frac@ ]  // in middle
     679%       };
     680%       for ( i; n ) {
     681%               item *** crsr_use = & crsr[ interl[ i ] ]@;
     682%               remove( *** crsr_use );  // removing from either middle or end
     683%               *crsr_use += 1;  // that item is done
     684%       }
     685%       assert( crsr[0] == & iters[ @n / interl_frac@ ] ); // through second's start
     686%       assert( crsr[1] == & iters[ @n@ ] );  // did the rest
     687% }
     688% \end{cfa}
     689%
     690% By using the pair of cursors, the harness avoids branches, which could incur prediction stall times themselves, or prime a branch in the SUT.
     691% This harness avoids telling the hardware what the SUT is about to do.
    646692
    647693The comparator linked-list implementations are:
    648694\begin{description}
    649 \item[lq-list]  The @list@ type of LQ from glibc of GCC-11.
     695\item[std::list]  The @list@ type of g++.
     696\item[lq-list]  The @list@ type of LQ from glibc of gcc.
    650697\item[lq-tailq] The @tailq@ type of the same.
    651 \item[upp-upp]  uC++ provided @uSequence@
     698\item[upp-upp]  \uC provided @uSequence@
    652699\item[cfa-cfa]  \CFA's @dlist@
    653700\end{description}
    654701
    655702
     703\subsection{Experimental Environment}
     704\label{s:ExperimentalEnvironment}
     705
     706The performance experiments are run on:
     707\begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
     708%\item[PC]
     709%with a 64-bit eight-core AMD FX-8370E, with ``taskset'' pinning to core \#6.  The machine has 16 GB of RAM and 8 MB of last-level cache.
     710%\item[ARM]
     711%Gigabyte E252-P31 128-core socket 3.0 GHz, WO memory model
     712\item[AMD]
     713Supermicro AS--1125HS--TNR EPYC 9754 128--core socket, hyper-threading $\times$ 2 sockets (512 processing units) 2.25 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 1024KB L2, 16MB L3, where each L3 cache covers 16 processors.
     714%\item[Intel]
     715%Supermicro SYS-121H-TNR Xeon Gold 6530 32--core, hyper-threading $\times$ 2 sockets (128 processing units) 2.1 GHz, TSO memory model
     716\end{description}
     717The experiments are single threaded and pinned to single core to prevent any OS movement, which might cause cache or NUMA effects perturbing the experiment.
     718
     719The compiler is gcc/g++-14.2.0 running on the Linux v6.8.0-52-generic OS.
     720Switching between the default memory allocators @glibc@ and @llheap@ is done with @LD_PRELOAD@.
     721To prevent eliding certain code patterns, crucial parts of a test are wrapped by the function @pass@
     722\begin{cfa}
     723// prevent eliding, cheaper than volatile
     724static inline void * pass( void * v ) {  __asm__  __volatile__( "" : "+r"(v) );  return v;  }
     725...
     726pass( &remove_first( lst ) );                   // wrap call to prevent elision, insert cannot be elided now
     727\end{cfa}
     728The call to @pass@ can prevent a small number of compiler optimizations but this cost is the same for all lists.
     729
     730
    656731\subsubsection{Result: Coarse comparison of styles}
    657732
    658 This comparison establishes how an intrusive list performs, compared with a wrapped-reference list.
    659 It also establishes the context within which it is meaningful to compare one intrusive list to another.
    660 
    661 %These goals notwithstanding, the effect of the host machine's memory hierarchy is more significant here than linked-list implementation.
     733This comparison establishes how an intrusive list performs compared with a wrapped-reference list.
     734\VRef[Figure]{fig:plot-list-zoomout} presents throughput at various list lengths for a linear and random (shuffled) insert/remove test.
     735Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing difference ends of the access spectrum.
     736In the graphs, all four intrusive lists are plotted with the same symbol;
     737theses symbols clumped on top of each, showing the performance difference among intrusive lists is small in comparison to the wrapped list \see{\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists}.
     738
     739The list lengths start at 10 due to the short insert/remove times of 2--4 ns, for intrusive lists, \vs STL's wrapped-reference list of 15--20 ns.
     740For very short lists, like 4, the experiment time of 4 $\times$ 2.5 ns and experiment overhead (loops) of 2--4 ns, results in an artificial administrative bump at the start of the graph having nothing to do with the insert/remove times.
     741As the list size grows, the administrative overhead for intrusive lists quickly disappears.
     742
     743\begin{figure}
     744  \centering
     745  \subfloat[Linear List Nodes]{\label{f:Linear}
     746    \includegraphics{plot-list-zoomout-noshuf.pdf}
     747  } % subfigure
     748  \\
     749  \subfloat[Shuffled List Nodes]{\label{f:Shuffled}
     750    \includegraphics{plot-list-zoomout-shuf.pdf}
     751  } % subfigure
     752  \caption{Insert/remove duration \vs list length.
     753  Lengths go as large possible without error.
     754  One example operation is shown: stack movement, insert-first polarity and head-mediated access.}
     755  \label{fig:plot-list-zoomout}
     756\end{figure}
     757
     758The large difference in performance between intrusive and wrapped-reference lists results from the dynamic allocation for the wrapped nodes.
     759In effect, this experiment is largely measuring the cost of malloc/free rather than the insert/remove, and is affected by the layout of memory by the allocator.
     760Fundamentally, insert/remove for a doubly linked-list has a basic cost, which is seen by the similar results for the intrusive lists.
     761Hence, the costs for a wrapped-reference list are: allocating/deallocating a wrapped node, copying a external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
     762the dynamic allocation dominates these costs.
     763For example, the experiment was run with both glibc and llheap memory allocators, where performance from llheap reduced the cost from 20 to 16 ns, which is still far from the 2--4 ns for intrusive lists.
     764Unfortunately, there is no way to tease apart the linking and allocation costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage the storage.
     765Hence, dynamic allocation cost is fundamental for wrapped lists.
     766
     767In detail, \VRef[Figure]{f:Linear} shows linear insertion of all the nodes and then linear removal, both in the same direction.
     768For intrusive lists, the nodes are adjacent because they are preallocated in an array.
     769For wrapped lists, the wrapped nodes are still adjacent because the memory allocator happens to use bump allocation for the small fixed-sized wrapped nodes.
     770As a result, these memory layouts result in high spatial and temporal locality for both kinds of lists during the linear array traversal.
     771With address look-ahead, the hardware does an excellent job of managing the multi-level cache.
     772Hence, performance is largely constant for both kinds of lists, until cache and NUMA boundaries are crossed for longer lists and the costs increase consistently for both kinds of lists.
     773
     774In detail, \VRef[Figure]{f:Shuffled} shows shuffle insertion of all the nodes and then linear removal, both in the same direction.
     775As for linear, there are issues with the wrapped list and memory allocation.
     776For intrusive lists, it is possible to link the nodes randomly, so consecutive nodes in memory seldom point at adjacent nodes.
     777For wrapped lists, the placement of wrapped nodes is dictated by the memory allocator, and results in memory layout virtually the same as for linear, \ie the wrapped nodes are laid out consecutively in memory, but each wrapped node points at a randomly located external node.
     778As seen on \VPageref{p:Shuffle}, the random access reads data from external node, which are located in random order.
     779So while the wrapped nodes are accessed linearly, external nodes are touched randomly for both kinds of lists resulting in similar cache events.
     780The slowdown of shuffled occurs as the experiment's length grows from last-level cache into main memory.
     781% Insert and remove operations act on both sides of a link.
     782%Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location.
     783Each time a next item is processed, the memory access is a hop to a randomly far address.
     784The target is unavailable in the cache and a slowdown results.
     785Note, the external-data no-touch assumption is often unrealistic: decisions like,``Should I remove this item?'' need to look at the item.
     786Therefore, under realistic assumptions, both intrusive and wrapped-reference lists suffer similar caching issues for very large lists.
     787% In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a non-intrusive list may be preferred for lists of approximately the cache's size.
     788
     789The takeaway from this experiment is that wrapped-list operations are expensive because memory allocation is expense at this fine-grained level of execution.
     790Hence, when possible, using intrusive links can produce a significant performance gain, even if nodes must be dynamically allocated, because the wrapping allocations are eliminated.
     791Even when space is a consideration, intrusive links may not use any more storage if a node is mostly linked.
     792Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures.
     793
     794% Note, linear access may not be realistic unless dynamic size changes may occur;
     795% if the nodes are known to be adjacent, use an array.
     796
     797% In a wrapped-reference list, list nodes are allocated separately from the items put into the list.
     798% Intrusive beats wrapped at the smaller lengths, and when shuffling is avoided, because intrusive avoids dynamic memory allocation for list nodes.
     799
     800% STL's performance is not affected by element order in memory.
     801%The field of intrusive lists begins with length-1 operations costing around 10 ns and enjoys a ``sweet spot'' in lengths 10--100 of 5--7-ns operations.
     802% This much is also unaffected by element order.
     803% Beyond this point, shuffled-element list performance worsens drastically, losing to STL beyond about half a million elements, and never particularly leveling off.
     804% In the same range, an unshuffled list sees some degradation, but holds onto a 1--2 $\times$ speedup over STL.
     805
     806% The apparent intrusive ``sweet spot,'' particularly its better-than-length-1 speed, is not because of list operations truly running faster.
     807% Rather, the worsening as length decreases reflects the per-operation share of harness overheads incurred at the outer-loop level.
     808% Disabling the harness's ability to drive interleaving, even though the current scenario is using a ``never work in middle'' interleave, made this rise disappear.
     809% Subsequent analyses use length-controlled relative performance when comparing intrusive implementations, making this curiosity disappear.
     810
     811% The remaining big-swing comparison points say more about a computer's memory hierarchy than about linked lists.
     812% The tests in this chapter are only inserting and removing.
     813% They are not operating on any user payload data that is being listed.
     814% The drastic differences at large list lengths reflect differences in link-field storage density and in correlation of link-field order to element order.
     815% These differences are inherent to the two list models.
     816
     817% A wrapped-reference list's separate nodes are allocated right beside each other in this experiment, because no other memory allocation action is happening.
     818% As a result, the interlinked nodes of the STL list are generally referencing their immediate neighbours.
     819% This pattern occurs regardless of user-item shuffling because this test's ``use'' of the user-items' array is limited to storing element addresses. 
     820% This experiment, driving an STL list, is simply not touching the memory that holds the user data.
     821% Because the interlinked nodes, being the only touched memory, are generally adjacent, this case too has high memory locality and stays fast.
     822
     823% But the comparison of unshuffled intrusive with wrapped-reference gives the performance of these two styles, with their the common impediment of overfilling the cache removed.
     824% Intrusive consistently beats wrapped-reference by about 20 ns, at all sizes. 
     825% This difference is appreciable below list length 0.5 M, and enormous below 10 K.
     826
     827
     828\section{Result: Comparing intrusive implementations}
     829\label{s:ComparingIntrusiveImplementations}
     830
     831The preceding result shows that all the intrusive implementations examined have noteworthy performance compared to wrapped lists.
     832This analysis picks the list region 10-150 and zooms-in to differentiate among the different intrusive implementations.
     833The data is selected from the start of \VRef[Figure]{f:Linear}, but the start of \VRef[Figure]{f:Shuffled} would be largely the same.
    662834
    663835\begin{figure}
    664836\centering
    665   \begin{tabular}{c}
    666   \includegraphics{plot-list-zoomout-shuf.pdf} \\
    667   (a) \\
    668   \includegraphics{plot-list-zoomout-noshuf.pdf} \\
    669   (b) \\
    670   \end{tabular}
    671   \caption{Operation duration \vs list length at full spectrum of list lengths.  One example operation is shown: stack movement, insert-first polarity and head-mediated access.  Lengths go as large as completes without error.  Version (a) uses shuffled items, while version (b) links items with their physical neighbours.}
    672   \label{fig:plot-list-zoomout}
    673 \end{figure}
    674 
    675 \VRef[Figure]{fig:plot-list-zoomout} presents the speed measures at various list lengths.
    676 STL's wrapped-reference list begins with operations on a length-1 list costing around 30 ns.
    677 This time grows modetly as list length increases, apart from more drastic worsening at the largest lengths.
    678 STL's performance is not affected by element order in memory.
    679 The field of intrusive lists begins with length-1 operations costing around 10 ns and enjoys a ``sweet spot'' in lengths 10--100 of 5--7-ns operations.
    680 This much is also unaffected by element order.
    681 Beyond this point, shuffled-element list performance worsens drastically, losing to STL beyond about half a million elements, and never particularly leveling off.
    682 In the same range, an unshuffled list sees some degradation, but holds onto a 1--2 $\times$ speedup over STL.
    683 
    684 The apparent intrusive ``sweet spot,'' particularly its better-than-length-1 speed, is not because of list operations truly running faster.
    685 Rather, the worsening as length decreases reflects the per-operation share of harness overheads incurred at the outer-loop level.
    686 Disabling the harness's ability to drive interleaving, even though the current scenario is using a ``never work in middle'' interleave, made this rise disappear.
    687 Subsequent analyses use length-controlled relative performance when comparing intrusive implementations, making this curiosity disappear.
    688 
    689 In a wrapped-reference list, list nodes are allocated separately from the items put into the list.
    690 Intrusive beats wrapped at the smaller lengths, and when shuffling is avoided, because intrusive avoids dynamic memory allocation for list nodes.
    691 
    692 The remaining big-swing comparison points say more about a computer's memory hierarchy than about linked lists.
    693 The tests in this chapter are only inserting and removing.
    694 They are not operating on any user payload data that is being listed.
    695 The drastic differences at large list lengths reflect differences in link-field storage density and in correlation of link-field order to element order.
    696 These differences are inherent to the two list models.
    697 
    698 The slowdown of shuffled intrusive occurs as the experiment's length grows from last-level cache, into main memory.
    699 Insert and remove operations act on both sides of a link.
    700 Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location.
    701 Each time a next item is processed, the memory access is a hop to a randomly far address.
    702 The target is not available in cache and a slowdown results.
    703 
    704 With the unshuffled intrusive list, each link connects to an adjacent location.  So, this case has high memory locality and stays fast.  But the unshuffled assumption is simply not realistic: if you know items are adjacent, you don't need a linked list.
    705 
    706 A wrapped-reference list's separate nodes are allocated right beside each other in this experiment, because no other memory allocation action is happening.
    707 As a result, the interlinked nodes of the STL list are generally referenceing their immediate neighbours.
    708 This pattern occurs regardless of user-item suffling because this test's ``use'' of the user-items' array is limited to storing element addresses. 
    709 This experiment, driving an STL list, is simply not touching the memory that holds the user data.
    710 Because the interlinked nodes, being the only touched memory, are generally adjacent, this case too has high memory locality and stays fast.
    711 But the user-data no-touch assumption is often unrealistic: decisions like,``Should I remove this item?'' need to look at the item.
    712 In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a nonintrusive list may be preferred for lists of approximately the chache's size.
    713 
    714 Therefore, under clearly typical situational assumptions, both intrusive and wrapped-reference lists will suffer similarly from a large list overfilling the memory cache, experiencing degradation like shuffled intrusive shows here.
    715 
    716 But the comparison of unshuffled intrusive with wrapped-reference gives the peformance of these two styles, with their the common impediment of overfilling the cache removed.
    717 Intrusive consistently beats wrapped-reference by about 20 ns, at all sizes. 
    718 This difference is appreciable below list length 0.5 M, and enormous below 10 K.
    719 
    720 
    721 \section{Result: Comparing intrusive implementations}
    722 
    723 The preceding result shows that intrusive implementations have noteworthy performance differences below 150 nodes.
    724 This analysis zooms in on this area and identifies the participants.
    725 
    726 \begin{figure}
    727 \centering
    728   \begin{tabular}{c}
    729   \includegraphics{plot-list-zoomin-abs.pdf} \\
    730   (a) \\
    731   \includegraphics{plot-list-zoomin-rel.pdf} \\
    732   (b) \\
    733   \end{tabular}
     837  \subfloat[Absolute Time]{\label{f:AbsoluteTime}
     838  \includegraphics{plot-list-zoomin-abs.pdf}
     839  } % subfigure
     840  \\
     841  \subfloat[Relative Time]{\label{f:RelativeTime}
     842  \includegraphics{plot-list-zoomin-rel.pdf}
     843  } % subfigure
    734844  \caption{Operation duration \vs list length at small-medium lengths.  One example operation is shown: stack movement, insert-first polarity and head-mediated access.  (a) has absolute times.  (b) has times relative to those of LQ-\lstinline{tailq}.}
    735845  \label{fig:plot-list-zoomin}
    736846\end{figure}
    737847
    738 In \VRef{fig:plot-list-zoomin} part (a) shows exactly this zoom-in.
    739 The same scenario as the coarse comparison is used: a stack, with insertions and removals happening at the end called ``first,'' ``head'' or ``front,'' and all changes occuring through a head-provided insert/remove operation.
     848\VRef[Figure]{fig:plot-list-zoomin} shows the zone from 10--150 blown up, both in absolute and relative time.
     849% The same scenario as the coarse comparison is used: a stack, with insertions and removals happening at the end called ``first,'' ``head'' or ``front,'' and all changes occurring through a head-provided insert/remove operation.
    740850The error bars show fastest and slowest time seen on five trials, and the central point is the mean of the remaining three trials.
    741 For readability, the frameworks are slightly staggered in the horizontal, but all trials near a given size were run at the same size.
    742 
    743 For this particular operation, uC++ fares the worst, followed by \CFA, then LQ's @tailq@.
    744 Its @list@ does the best at smaller lengths but loses its edge above a dozen elements.
    745 
    746 Moving toward being able to consider several scenarios, \VRef{fig:plot-list-zoomin} part (b) shows the same result, adjusted to treat @tailq@ as a benchmark, and expressing all the results relative to it.
     851For readability, the points are slightly staggered at a given horizontal value, where the points might otherwise appear on top of each other.
     852
     853While preparing experiment results, I first tested on my old office PC, AMD FX-8370E Eight-Core, before switching to the large new server for final testing.
     854For this experiment, the results flipped in my favour when running on the server.
     855New CPU architectures are now amazingly good at branch prediction and micro-parallelism in the pipelines.
     856Specifically, on the PC, my \CFA and companion \uC lists are slower than lq-tail and lq-list by 10\% to 20\%.
     857On the server, \CFA and \uC lists are can be fast by up to 100\%.
     858Overall, LQ-tailq does the best at short lengths but loses out above a dozen elements.
     859
     860\VRef[Figure]{f:RelativeTime} shows the percentage difference by treating @tailq@ as the control benchmark, and showing the other list results relative to it.
    747861This change does not affect the who-wins statements, it just removes the ``sweet spot'' bend that the earlier discussion dismissed as incidental.
    748 Runs faster than @tailq@'s are below the zero and slower runs are above; @tailq@'s mean is always zero by definition, but its error bars, representing a single scenario's re-run stability, are still meaningful.
     862Runs faster than @tailq@'s are below the zero and slower runs are above;
     863@tailq@'s mean is always zero by definition, but its error bars, representing a single scenario's re-run stability, are still meaningful.
    749864With this bend straightened out, aggregating across lengths is feasible.
    750865
    751866\begin{figure}
    752867\centering
    753   \begin{tabular}{c}
    754   \includegraphics{plot-list-cmp-exout.pdf} \\
    755   (a) \\
    756   \includegraphics{plot-list-cmp-survey.pdf} \\
    757   (b) \\
    758   \end{tabular}
     868  \subfloat[Supersets]{\label{f:Superset}
     869  \includegraphics{plot-list-cmp-exout.pdf}
     870  } % subfigure
     871  \\
     872  \subfloat[1st Level Slice]{\label{f:1stLevelSlice}
     873  \includegraphics{plot-list-cmp-survey.pdf}
     874  } % subfigure
    759875  \caption{Operation duration ranges across operational scenarios.  (a) has the supersets of the running example operation.  (b) has the first-level slices of the full space of operations.}
    760876  \label{fig:plot-list-cmp-overall}
    761877\end{figure}
    762878
    763 \VRef{fig:plot-list-cmp-overall} introduces the resulting format.
     879\VRef[Figure]{fig:plot-list-cmp-overall} introduces alternative views of the data.
    764880Part (a)'s first column summarizes all the data of \VRef{fig:plot-list-zoomin}-(b).
    765881Its x-axis label, ``stack/insfirst/allhead,'' names the concrete scenario that has been discussed until now.
    766882Moving across the columns, the next three each stretch to include more scenarios on each of the operation dimensions, one at a time.
    767883The second column considers the scenarios $\{\mathrm{stack}\} \times \{\mathrm{insfirst}\} \times \{\mathrm{allhead}, \mathrm{inselem}, \mathrm{remelem}\}$,
    768 while the third stretches polarity and the fourth streches accessor.
     884while the third stretches polarity and the fourth stretches accessor.
    769885Then next three columns each stretch two scenario dimensions and the last column stretches all three.
    770886The \CFA bar in the last column is summarizing 840 test-program runs: 14 list lengths, 2 movements, 2 polarities, 3 accessors and 5 repetitions.
     
    778894
    779895The chosen benchmark of LQ-@tailq@ is not shown in this format because it would be trivial here.
    780 With iter-scenario differences dominating the bar size, and @tailq@'s mean performance defined to be zero in all scenarios, a @tailq@ bar on this plot would only show @tailq@'s re-run stabiity, which is of no comparison value.
     896With iter-scenario differences dominating the bar size, and @tailq@'s mean performance defined to be zero in all scenarios, a @tailq@ bar on this plot would only show @tailq@'s re-run stability, which is of no comparison value.
    781897
    782898The LQ-@list@ implementation does not support all scenarios, only stack movement with insert-first polarity.
     
    805921% \end{figure}
    806922
     923
    807924\section{Result: CFA cost attribution}
    808925
    809 This comparison loosely itemizes the reasons that the \CFA implementation runs 15--20\% slower than LQ.  Each reason provides for safer programming.  For each reaon, a version of the \CFA list was measured that forgoes its safety and regains some performance.  These potential sacrifices are:
     926This comparison loosely itemizes the reasons that the \CFA implementation runs 15--20\% slower than LQ.
     927Each reason provides for safer programming.
     928For each reason, a version of the \CFA list was measured that forgoes its safety and regains some performance.
     929These potential sacrifices are:
    810930\newcommand{\mandhead}{\emph{mand-head}}
    811931\newcommand{\nolisted}{\emph{no-listed}}
     
    814934\item[mand(atory)-head] Removing support for headless lists.
    815935  A specific explanation of why headless support causes a slowdown is not offered.
    816   But it is reasonable for a cost to result from making one pieceof code handle multiple cases; the subset of the \CFA list API that applies to headless lists shares its implementation with headed lists.
     936  But it is reasonable for a cost to result from making one piece of code handle multiple cases; the subset of the \CFA list API that applies to headless lists shares its implementation with headed lists.
    817937  In the \mandhead case, disabling the feature in \CFA means using an older version of the implementation, from before headless support was added.
    818938  In the pre-headless library, trying to form a headless list (instructing, ``Insert loose element B after loose element A,'') is a checked runtime error.
     
    822942        For \lstinline{list}, this usage causes an ``uncaught'' runtime crash.}.
    823943\item[no-listed] Removing support for the @is_listed@ API query.
    824   Along with it goes error checking such as ``When instering an element, it must not already be listed, \ie be referred to from somewhere else.''
     944  Along with it goes error checking such as ``When inserting an element, it must not already be listed, \ie be referred to from somewhere else.''
    825945  These abilities have a cost because, in order to support them, a listed element that is being removed must be written to, to record its change in state.
    826946  In \CFA's representation, this cost is two pointer writes.
     
    835955  Without this termination marking, repeated requests for a next valid item will always provide a positive response; when it should be negative, the indicated next element is garbage data at an address unlikely to trigger a memory error.
    836956  LQ has a well-terminating iteration for listed elements.
    837   In the \noiter case, the slowdown is not inherent; it represents a \CFA optimization opporunity.
     957  In the \noiter case, the slowdown is not inherent; it represents a \CFA optimization opportunity.
    838958\end{description}
    839959\MLB{Ensure benefits are discussed earlier and cross-reference}  % an LQ programmer must know not to ask, ``Who's next?'' about an unlisted element; an LQ programmer cannot write assertions about an item being listed; LQ requiring a head parameter is an opportunity for the user to provide inconsistent data
     
    871991For element removals, \attribParity is the heavy hitter, with \noiter contributing modestly.
    872992
    873 The couterproductive shift outside of element removals is likely due to optimization done in the \attribFull version after implementing headless support, \ie not present in the \mandhead version.
     993The counterproductive shift outside of element removals is likely due to optimization done in the \attribFull version after implementing headless support, \ie not present in the \mandhead version.
    874994This work streamlined both head-based operations (head-based removal being half the work of the element-insertion test).
    875995This improvement could be ported to a \mandhead-style implementation, which would bring down the \attribParity time in these cases.
     
    8801000
    8811001\VRef[Figure]{fig:plot-list-cfa-attrib}-(b) addresses element removal being the overall \CFA slow spot and element removal having a peculiar shape in the (a) analysis.
    882 Here, the \attribParity sacrifice bundle is broken out into its two consituents.
     1002Here, the \attribParity sacrifice bundle is broken out into its two constituents.
    8831003The result is the same regardless of the operation.
    8841004All three individual sacrifices contribute noteworthy improvements (\nolisted slightly less).
     
    8911011The \noiter speed improvement would bring \CFA to +5\% of LQ overall, and from high twenties to high teens, in the worst case of element removal.
    8921012
    893 Utimately, this analysis provides options for a future effort that needs to get the most speed out of the \CFA list.
     1013Ultimately, this analysis provides options for a future effort that needs to get the most speed out of the \CFA list.
    8941014
    8951015
Note: See TracChangeset for help on using the changeset viewer.