Timestamp:
Apr 11, 2026, 6:39:52 PM
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master
Parents:
e8a7b66d
Message:

proofreading list chapter

File:
1 edited

  • doc/theses/mike_brooks_MMath/list.tex

    re8a7b66d r75ba2fa6  
    695695
    696696\subsubsection{Experiment setup}
     697\label{s:ExperimentSetup}
    697698
    698699The experiment driver defines an (intrusive) node type:
     
    844845\VRef[Figure]{fig:plot-list-zoomout} presents throughput at various list lengths for a linear and random (shuffled) insert/remove test.
    845846Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing different ends of the access spectrum.
    846 In the graphs, all four intrusive lists are plotted with the same symbol;
    847 theses symbols clumped on top of each, showing the performance difference among intrusive lists is small in comparison to the wrapped list \see{\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists}.
     847In the graphs, all four intrusive lists (lq-list, lq-tailq, upp-upp, cfa-cfa, see end of \VRef{s:ExperimentSetup}) are plotted with the same symbol;
     848sometimes these symbols clump on top of each other, showing the performance difference among intrusive lists is small in comparison to the wrapped list (std::list).
     849See~\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists.
    848850
    849851The list lengths start at 10 due to the short insert/remove times of 2--4 ns for intrusive lists \vs 15--20 ns for STL's wrapped-reference list.
     
    866868  \\
    867869  &
    868   \subfloat[Shuffled List Nodes, AMD]{\label{f:Shuffled-swift}
     870  \subfloat[Random List Nodes, AMD]{\label{f:Random-swift}
    869871        \hspace*{-0.75in}
    870872        \includegraphics{plot-list-zoomout-shuf-swift.pdf}
    871873  } % subfigure
    872874  &
    873   \subfloat[Shuffled List Nodes, Intel]{\label{f:Shuffled-java}
     875  \subfloat[Random List Nodes, Intel]{\label{f:Random-java}
    874876        \includegraphics{plot-list-zoomout-shuf-java.pdf}
    875877  } % subfigure
     
    877879  \caption{Insert/remove duration \vs list length.
    878880  Lengths go as large as possible without error.
    879   One example operation is shown: stack movement, insert-first polarity and head-mediated access.}
     881  One example operation is shown: stack movement, insert-first polarity and head-mediated access. Lower is better.}
    880882  \label{fig:plot-list-zoomout}
    881883\end{figure}
    882884
    883 The large difference in performance between intrusive and wrapped-reference lists results from the dynamic allocation for the wrapped nodes.
    884 In effect, this experiment is largely measuring the cost of malloc/free rather than the insert/remove, and is affected by the layout of memory by the allocator.
    885 Fundamentally, insert/remove for a doubly linked-list has a basic cost, which is seen by the similar results for the intrusive lists.
    886 Hence, the costs for a wrapped-reference list are: allocating/deallocating a wrapped node, copying a external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
    887 the dynamic allocation dominates these costs.
    888 For example, the experiment was run with both glibc and llheap memory allocators, where performance from llheap reduced the cost from 20 to 16 ns, which is still far from the 2--4 ns for intrusive lists.
    889 Unfortunately, there is no way to tease apart the linking and allocation costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage the storage.
    890 Hence, dynamic allocation cost is fundamental for wrapped lists.
     885The key performance factor between the intrusive and the wrapped-reference lists is the dynamic allocation for the wrapped nodes.
     886Hence, this experiment is largely measuring the cost of @malloc@/\-@free@ rather than insert/remove, and is sensitive to the layout of memory by the allocator.
     887For insert/remove of an intrusive list, the cost is manipulating the link fields, which is seen by the relatively similar results for the different intrusive lists.
     888For insert/remove of a wrapped-reference list, the costs are: dynamically allocating/deallocating a wrapped node, copying an external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
     889the allocation dominates these costs.
     890For example, the experiment was run with both glibc and llheap memory allocators, where llheap reduced the cost from 20 to 16 ns, still far from the 2--4 ns for linking an intrusive node.
     891Unfortunately, there is no way to tease apart the allocation, copying, and linking costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage that storage.
    891892
    892893In detail, \VRef[Figure]{f:Linear-swift}--\subref*{f:Linear-java} shows linear insertion of all the nodes and then linear removal, both in the same direction.
    893 For intrusive lists, the nodes are adjacent because they are preallocated in an array.
    894 For wrapped lists, the wrapped nodes are still adjacent because the memory allocator happens to use bump allocation for the small fixed-sized wrapped nodes.
     894For intrusive lists, the nodes are adjacent in memory from being preallocated in an array.
     895For wrapped lists, the wrapped nodes happen to be adjacent because the memory allocator uses bump allocation during the initial phase of allocation.
    895896These memory layouts result in high spatial and temporal locality for both kinds of lists during the linear array traversal.
    896897With address look-ahead, the hardware does an excellent job of managing the multi-level cache.
    897898Hence, performance is largely constant for both kinds of lists, until L3 cache and NUMA boundaries are crossed for longer lists and the costs increase consistently for both kinds of lists.
    898 
    899 In detail, \VRef[Figure]{f:Shuffled-swift}--\subref*{f:Shuffled-java} shows shuffled insertion and removal of the nodes.
    900 As for linear, there are issues with the wrapped list and memory allocation.
    901 For intrusive lists, it is possible to link the nodes randomly, so consecutive nodes in memory seldom point at adjacent nodes.
    902 For wrapped lists, the placement of wrapped nodes is dictated by the memory allocator, and results in memory layout virtually the same as for linear, \ie the wrapped nodes are laid out consecutively in memory, but each wrapped node points at a randomly located external node.
    903 As seen on \VPageref{p:Shuffle}, the random access reads data from external node, which are located in random order.
    904 So while the wrapped nodes are accessed linearly, external nodes are touched randomly for both kinds of lists resulting in similar cache events.
    905 The slowdown of shuffled occurs as the experiment's length grows from last-level cache into main memory.
     899For example, on AMD (\VRef[Figure]{f:Linear-swift}), there is one NUMA node but many small L3 caches, so performance slows down quickly as multiple L3 caches come into play, and remains constant at that level, except for some anomalies for very large lists.
     900On Intel (\VRef[Figure]{f:Linear-java}), there are four NUMA nodes and four slowdown steps as list-length increases.
     901At each step, the difference between the kinds of lists decreases as the NUMA effect increases.
     902
     903In detail, \VRef[Figure]{f:Random-swift}--\subref*{f:Random-java} shows random insertion and removal of the nodes.
     904As for linear, there is the issue of memory allocation for the wrapped list.
     905As well, the consecutive storage-layout is the same (array and bump allocation).
     906Hence, the difference is the random linking among nodes, resulting in random accesses even though the list is traversed linearly, which produces similar cache events for both kinds of lists.
     907Both \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} show the slowdown of random access as the list-length grows, resulting from stepping out of caches into main memory and crossing NUMA nodes.
    906908% Insert and remove operations act on both sides of a link.
    907909%Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location.
    908 Each time a next item is processed, the memory access is a hop to a randomly far address.
    909 The target is unavailable in the cache and a slowdown results.
    910 Note, the external-data no-touch assumption is often unrealistic: decisions like,``Should I remove this item?'' need to look at the item.
    911 Therefore, under realistic assumptions, both intrusive and wrapped-reference lists suffer similar caching issues for very large lists.
     910As for linear, the Intel (\VRef[Figure]{f:Random-java}) graph shows steps from the four NUMA nodes.
     911Interestingly, after $10^6$ nodes, intrusive lists are slower than wrapped.
     912I did not have time to track down this anomaly, but I speculate it results from the difference in touching the data in the accessed node, as the data and links are together for intrusive and separated for wrapped.
     913For the llheap memory-allocator and the two tested architectures, intrusive lists outperform wrapped lists up to size $10^3$ for both linear and random, and performance begins to converge around $10^6$ nodes as architectural issues begin to dominate.
     914Clearly, the memory allocator and hardware architecture play a large role in the total cost and the crossover points as list-size increases.
    912915% In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a non-intrusive list may be preferred for lists of approximately the cache's size.
    913916
    914917The takeaway from this experiment is that wrapped-list operations are expensive because memory allocation is expensive at this fine-grained level of execution.
    915918Hence, when possible, using intrusive links can produce a significant performance gain, even if nodes must be dynamically allocated, because the wrapping allocations are eliminated.
    916 Even when space is a consideration, intrusive links may not use any more storage if a node is mostly linked.
    917 Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures.
     919Even when space is a consideration, intrusive links may not use more storage if a node is often linked.
     920Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures or their tool-set does not provide them.
    918921
    919922% Note, linear access may not be realistic unless dynamic size changes may occur;
     
    954957\label{s:ComparingIntrusiveImplementations}
    955958
    956 The preceding result shows that all the intrusive implementations examined have noteworthy performance compared to wrapped lists.
    957 This analysis picks the sizes below 150 and zooms in to differentiate among the different intrusive implementations.
    958 The data is selected from the start of \VRef[Figure]{f:Linear}, but the start of \VRef[Figure]{f:Shuffled} would be largely the same.
     959The preceding result shows the intrusive implementations examined have better performance than wrapped lists for small- to medium-sized lists.
     960This analysis picks list sizes below 150 and zooms in to differentiate among the intrusive implementations.
     961The data is selected from the start of \VRef[Figures]{f:Linear-swift}--\subref*{f:Linear-java}, but the start of \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} is largely the same.
    959962
    960963\begin{figure}
     
    982985  } % subfigure
    983986  \end{tabular}
    984   \caption{Operation duration \vs list length at small-medium lengths.  Two example operations are shown: [I] stack movement with head-only access (plots a and b); [VIII] queue movement with element-oriented removal access (plots c and d); both operations use insert-first polarity.}
     987  \caption{Operation duration \vs list length at small-medium lengths.  Two example operations are shown: [I] stack movement with head-only access (plots a and b); [VIII] queue movement with element-oriented removal access (plots c and d); both operations use insert-first polarity. Lower is better.}
    985988  \label{fig:plot-list-zoomin}
    986989\end{figure}
     
    990993The error bars show the fastest and slowest times seen over five trials, and the central point is the mean of the remaining three trials.
    991994For readability, the points are slightly staggered at a given horizontal value, where the points might otherwise appear on top of each other.
    992 
    993 \VRef[Figure]{fig:plot-list-zoomin} has the pair machines, each in a column.
    994 It also has a pair of operating scenarios, each in a row.
    995 The experiment runs twelve operating scenarios; the ones chosen for their variety happen to be items I and VIII from the listing of \VRef[Figure]{fig:plot-list-mchn-szz}.
     995The experiment runs twelve operating scenarios;
     996the ones chosen for their variety are scenarios I and VIII from the listing of \VRef[Figure]{fig:plot-list-mchn-szz}, and their results appear in the rows.
     997As in the previous experiment, each hardware architecture appears in a column.
    996998Note that LQ-list does not support queue operations, so this framework is absent from Operation VIII.
    997999
    9981000At lengths 1 and 2, winner patterns seen at larger sizes generally do not apply.
    999 Indeed, an issue with \CFA giving terribel on queues at length 1 is evident in \VRef[Figure]{f:AbsoluteTime-viii-swift}.
     1001Indeed, an issue with \CFA giving terrible performance on queues at length 1 is evident in \VRef[Figure]{f:AbsoluteTime-viii-swift}.
    10001002This phenomenon is elaborated in \MLB{TODO: xref}.
    10011003For the remainder of this section, these sizes are disregarded.
     
    11971199\label{toc:lst:futwork}
    11981200
    1199 Not discussed in the chapter is an issue in the \CFA type system with Plan-9 @inline@ declarations.
    1200 It implements the trait 'embedded' for the derived-base pair named in the arguments.
    1201 This trait defines a pseudo conversion called backtick-inner, from derived to base (the safe direction).
    1202 Such a real conversion exists for the concrete types, due to our compiler having the p9 language feature.
    1203 But when trying to be polymorphic over the types, there is no way to access this built-in conversion, so my macro stands in until Fangren does more with conversions.
    1204 
     1201Not discussed in the chapter is a \CFA type-system issue with Plan-9 @inline@ declarations, implementing the trait @embedded@ to access the contained @dlist@ link-fields.
     1202This trait defines an implicit conversion from derived to base (the safe direction).
     1203Such a conversion exists implicitly for concrete types using the Plan-9 inheritance feature.
     1204However, this implicit conversion is only partially implemented for polymorphic types, such as @dlist@, which prevents the straightforward list interface shown throughout the chapter.
     1205
     1206My workaround is the macro @P9_EMBEDDED@ placed before an intrusive node is used to declare a list.
     1207\begin{cfa}
     1208struct node {
     1209        int v;
     1210        inline dlink(node);
     1211};
     1212@P9_EMBEDDED( node, dlink(node) );@
     1213dlist( node ) dlist;
     1214\end{cfa}
     1215The macro creates specialized access functions to explicitly extract the required information for the polymorphic @dlist@ type.
     1216These access functions could have been generated implicitly for each intrusive node by adding another compiler pass.
     1217However, this would have been substantial temporary work, when the correct solution is the compiler fix.
     1218Hence, the macro workaround is only a small temporary inconvenience;
     1219otherwise, all the list API shown in this chapter works.
     1220
     1221
     1222\begin{comment}
    12051223An API author has decided that the intended user experience is:
    12061224\begin{itemize}
     
    13561374TODO: deal with: Link fields are system-managed.
    13571375Links in GDB.
    1358 
     1376\end{comment}