Timestamp:
Apr 11, 2026, 6:39:52 PM
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master
Parents:
e8a7b66d
Message:

proofreading list chapter

File:
1 edited

  • doc/theses/mike_brooks_MMath/list.tex

    re8a7b66d r75ba2fa6  
    695695
    696696\subsubsection{Experiment setup}
     697\label{s:ExperimentSetup}
    697698
    698699The experiment driver defines an (intrusive) node type:
     
    844845\VRef[Figure]{fig:plot-list-zoomout} presents throughput at various list lengths for a linear and random (shuffled) insert/remove test.
    845846Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing different ends of the access spectrum.
    846 In the graphs, all four intrusive lists are plotted with the same symbol;
    847 theses symbols clumped on top of each, showing the performance difference among intrusive lists is small in comparison to the wrapped list \see{\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists}.
     847In the graphs, all four intrusive lists (lq-list, lq-tailq, upp-upp, cfa-cfa, see end of \VRef{s:ExperimentSetup}) are plotted with the same symbol;
     848sometimes these symbols clump on top of each other, showing the performance difference among intrusive lists is small in comparison to the wrapped list (std::list).
     849See~\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists.
    848850
    849851The list lengths start at 10 due to the short insert/remove times of 2--4 ns for intrusive lists \vs 15--20 ns for STL's wrapped-reference list.
     
    866868  \\
    867869  &
    868   \subfloat[Shuffled List Nodes, AMD]{\label{f:Shuffled-swift}
     870  \subfloat[Random List Nodes, AMD]{\label{f:Random-swift}
    869871        \hspace*{-0.75in}
    870872        \includegraphics{plot-list-zoomout-shuf-swift.pdf}
    871873  } % subfigure
    872874  &
    873   \subfloat[Shuffled List Nodes, Intel]{\label{f:Shuffled-java}
     875  \subfloat[Random List Nodes, Intel]{\label{f:Random-java}
    874876        \includegraphics{plot-list-zoomout-shuf-java.pdf}
    875877  } % subfigure
     
    877879  \caption{Insert/remove duration \vs list length.
    878880  Lengths go as large as possible without error.
    879   One example operation is shown: stack movement, insert-first polarity and head-mediated access.}
     881  One example operation is shown: stack movement, insert-first polarity and head-mediated access. Lower is better.}
    880882  \label{fig:plot-list-zoomout}
    881883\end{figure}
    882884
    883 The large difference in performance between intrusive and wrapped-reference lists results from the dynamic allocation for the wrapped nodes.
    884 In effect, this experiment is largely measuring the cost of malloc/free rather than the insert/remove, and is affected by the layout of memory by the allocator.
    885 Fundamentally, insert/remove for a doubly linked-list has a basic cost, which is seen by the similar results for the intrusive lists.
    886 Hence, the costs for a wrapped-reference list are: allocating/deallocating a wrapped node, copying a external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
    887 the dynamic allocation dominates these costs.
    888 For example, the experiment was run with both glibc and llheap memory allocators, where performance from llheap reduced the cost from 20 to 16 ns, which is still far from the 2--4 ns for intrusive lists.
    889 Unfortunately, there is no way to tease apart the linking and allocation costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage the storage.
    890 Hence, dynamic allocation cost is fundamental for wrapped lists.
     885The key performance factor between the intrusive and the wrapped-reference lists is the dynamic allocation for the wrapped nodes.
     886Hence, this experiment is largely measuring the cost of @malloc@/\-@free@ rather than insert/remove, and is sensitive to the layout of memory by the allocator.
     887For insert/remove of an intrusive list, the cost is manipulating the link fields, which is seen by the relatively similar results for the different intrusive lists.
     888For insert/remove of a wrapped-reference list, the costs are: dynamically allocating/deallocating a wrapped node, copying an external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
     889the allocation dominates these costs.
     890For example, the experiment was run with both glibc and llheap memory allocators, where llheap reduced the cost from 20 to 16 ns, still far from the 2--4 ns for linking an intrusive node.
     891Unfortunately, there is no way to tease apart the allocation, copying, and linking costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage that storage.
    891892
    892893In detail, \VRef[Figure]{f:Linear-swift}--\subref*{f:Linear-java} shows linear insertion of all the nodes and then linear removal, both in the same direction.
    893 For intrusive lists, the nodes are adjacent because they are preallocated in an array.
    894 For wrapped lists, the wrapped nodes are still adjacent because the memory allocator happens to use bump allocation for the small fixed-sized wrapped nodes.
     894For intrusive lists, the nodes are adjacent in memory from being preallocated in an array.
     895For wrapped lists, the wrapped nodes happen to be adjacent because the memory allocator uses bump allocation during the initial phase of allocation.
    895896These memory layouts result in high spatial and temporal locality for both kinds of lists during the linear array traversal.
    896897With address look-ahead, the hardware does an excellent job of managing the multi-level cache.
    897898Hence, performance is largely constant for both kinds of lists, until L3 cache and NUMA boundaries are crossed for longer lists and the costs increase consistently for both kinds of lists.
    898 
    899 In detail, \VRef[Figure]{f:Shuffled-swift}--\subref*{f:Shuffled-java} shows shuffled insertion and removal of the nodes.
    900 As for linear, there are issues with the wrapped list and memory allocation.
    901 For intrusive lists, it is possible to link the nodes randomly, so consecutive nodes in memory seldom point at adjacent nodes.
    902 For wrapped lists, the placement of wrapped nodes is dictated by the memory allocator, and results in memory layout virtually the same as for linear, \ie the wrapped nodes are laid out consecutively in memory, but each wrapped node points at a randomly located external node.
    903 As seen on \VPageref{p:Shuffle}, the random access reads data from external node, which are located in random order.
    904 So while the wrapped nodes are accessed linearly, external nodes are touched randomly for both kinds of lists resulting in similar cache events.
    905 The slowdown of shuffled occurs as the experiment's length grows from last-level cache into main memory.
     899For example, on AMD (\VRef[Figure]{f:Linear-swift}), there is one NUMA node but many small L3 caches, so performance slows down quickly as multiple L3 caches come into play, and remains constant at that level, except for some anomalies for very large lists.
     900On Intel (\VRef[Figure]{f:Linear-java}), there are four NUMA nodes and four slowdown steps as list-length increases.
     901At each step, the difference between the kinds of lists decreases as the NUMA effect increases.
     902
     903In detail, \VRef[Figure]{f:Random-swift}--\subref*{f:Random-java} shows random insertion and removal of the nodes.
     904As for linear, there is the issue of memory allocation for the wrapped list.
     905As well, the consecutive storage-layout is the same (array and bump allocation).
     906Hence, the difference is the random linking among nodes, resulting in random accesses even though the list is traversed linearly, which produces similar cache events for both kinds of lists.
     907Both \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} show the slowdown of random access as the list-length grows, resulting from stepping out of caches into main memory and crossing NUMA nodes.
    906908% Insert and remove operations act on both sides of a link.
    907909%Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location.
    908 Each time a next item is processed, the memory access is a hop to a randomly far address.
    909 The target is unavailable in the cache and a slowdown results.
    910 Note, the external-data no-touch assumption is often unrealistic: decisions like,``Should I remove this item?'' need to look at the item.
    911 Therefore, under realistic assumptions, both intrusive and wrapped-reference lists suffer similar caching issues for very large lists.
     910As for linear, the Intel (\VRef[Figure]{f:Random-java}) graph shows steps from the four NUMA nodes.
     911Interestingly, after $10^6$ nodes, intrusive lists are slower than wrapped.
     912I did not have time to track down this anomaly, but I speculate it results from the difference in touching the data in the accessed node, as the data and links are together for intrusive and separated for wrapped.
     913For the llheap memory-allocator and the two tested architectures, intrusive lists outperform wrapped lists up to size $10^3$ for both linear and random, and performance begins to converge around $10^6$ nodes as architectural issues begin to dominate.
     914Clearly, the memory allocator and hardware architecture play a large role in the total cost and the crossover points as list-size increases.
    912915% In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a non-intrusive list may be preferred for lists of approximately the cache's size.
    913916
    914917The takeaway from this experiment is that wrapped-list operations are expensive because memory allocation is expensive at this fine-grained level of execution.
    915918Hence, when possible, using intrusive links can produce a significant performance gain, even if nodes must be dynamically allocated, because the wrapping allocations are eliminated.
    916 Even when space is a consideration, intrusive links may not use any more storage if a node is mostly linked.
    917 Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures.
     919Even when space is a consideration, intrusive links may not use more storage if a node is often linked.
     920Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures or their tool-set does not provide them.
    918921
    919922% Note, linear access may not be realistic unless dynamic size changes may occur;
     
    954957\label{s:ComparingIntrusiveImplementations}
    955958
    956 The preceding result shows that all the intrusive implementations examined have noteworthy performance compared to wrapped lists.
    957 This analysis picks the sizes below 150 and zooms in to differentiate among the different intrusive implementations.
    958 The data is selected from the start of \VRef[Figure]{f:Linear}, but the start of \VRef[Figure]{f:Shuffled} would be largely the same.
     959The preceding result shows the intrusive implementations examined have better performance than wrapped lists for small- to medium-sized lists.
     960This analysis picks list sizes below 150 and zooms in to differentiate among the intrusive implementations.
     961The data is selected from the start of \VRef[Figures]{f:Linear-swift}--\subref*{f:Linear-java}, but the start of \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} is largely the same.
    959962
    960963\begin{figure}
     
    982985  } % subfigure
    983986  \end{tabular}
    984   \caption{Operation duration \vs list length at small-medium lengths.  Two example operations are shown: [I] stack movement with head-only access (plots a and b); [VIII] queue movement with element-oriented removal access (plots c and d); both operations use insert-first polarity.}
     987  \caption{Operation duration \vs list length at small-medium lengths.  Two example operations are shown: [I] stack movement with head-only access (plots a and b); [VIII] queue movement with element-oriented removal access (plots c and d); both operations use insert-first polarity. Lower is better.}
    985988  \label{fig:plot-list-zoomin}
    986989\end{figure}
     
    990993The error bars show the fastest and slowest times seen over five trials, and the central point is the mean of the remaining three trials.
    991994For readability, the points are slightly staggered at a given horizontal value, where the points might otherwise appear on top of each other.
    992 
    993 \VRef[Figure]{fig:plot-list-zoomin} has the pair machines, each in a column.
    994 It also has a pair of operating scenarios, each in a row.
    995 The experiment runs twelve operating scenarios; the ones chosen for their variety happen to be items I and VIII from the listing of \VRef[Figure]{fig:plot-list-mchn-szz}.
     995The experiment runs twelve operating scenarios;
     996the ones chosen for their variety are scenarios I and VIII from the listing of \VRef[Figure]{fig:plot-list-mchn-szz}, and their results appear in the rows.
     997As in the previous experiment, each hardware architecture appears in a column.
    996998Note that LQ-list does not support queue operations, so this framework is absent from Operation VIII.
    997999
    9981000At lengths 1 and 2, winner patterns seen at larger sizes generally do not apply.
    999 Indeed, an issue with \CFA giving terribel on queues at length 1 is evident in \VRef[Figure]{f:AbsoluteTime-viii-swift}.
     1001Indeed, an issue with \CFA giving terrible performance on queues at length 1 is evident in \VRef[Figure]{f:AbsoluteTime-viii-swift}.
    10001002This phenomenon is elaborated in \MLB{TODO: xref}.
    10011003For the remainder of this section, these sizes are disregarded.
     
    11971199\label{toc:lst:futwork}
    11981200
    1199 Not discussed in the chapter is an issue in the \CFA type system with Plan-9 @inline@ declarations.
    1200 It implements the trait 'embedded' for the derived-base pair named in the arguments.
    1201 This trait defines a pseudo conversion called backtick-inner, from derived to base (the safe direction).
    1202 Such a real conversion exists for the concrete types, due to our compiler having the p9 language feature.
    1203 But when trying to be polymorphic over the types, there is no way to access this built-in conversion, so my macro stands in until Fangren does more with conversions.
    1204 
     1201Not discussed in the chapter is a \CFA type-system issue with Plan-9 @inline@ declarations, implementing the trait @embedded@ to access the contained @dlist@ link-fields.
     1202This trait defines an implicit conversion from derived to base (the safe direction).
     1203Such a conversion exists implicitly for concrete types using the Plan-9 inheritance feature.
     1204However, this implicit conversion is only partially implemented for polymorphic types, such as @dlist@, which prevents the straightforward list interface shown throughout the chapter.
     1205
     1206My workaround is the macro @P9_EMBEDDED@ placed before an intrusive node is used to declare a list.
     1207\begin{cfa}
     1208struct node {
     1209        int v;
     1210        inline dlink(node);
     1211};
     1212@P9_EMBEDDED( node, dlink(node) );@
     1213dlist( node ) dlist;
     1214\end{cfa}
     1215The macro creates specialized access functions to explicitly extract the required information for the polymorphic @dlist@ type.
     1216These access functions could have been generated implicitly for each intrusive node by adding another compiler pass.
     1217However, this would have been substantial temporary work, when the correct solution is the compiler fix.
     1218Hence, the macro workaround is only a small temporary inconvenience;
     1219otherwise, all the list API shown in this chapter works.
     1220
     1221
     1222\begin{comment}
    12051223An API author has decided that the intended user experience is:
    12061224\begin{itemize}
     
    13561374TODO: deal with: Link fields are system-managed.
    13571375Links in GDB.
    1358 
     1376\end{comment}