Changeset 75ba2fa6 for doc/theses/mike_brooks_MMath/list.tex
- Timestamp: Apr 11, 2026, 6:39:52 PM (12 hours ago)
- Branches: master
- Parents: e8a7b66d
- Files: 1 edited
  - doc/theses/mike_brooks_MMath/list.tex (modified) (9 diffs)
Legend:
- Unmodified (no prefix)
- Added (prefixed with +)
- Removed (prefixed with -)
doc/theses/mike_brooks_MMath/list.tex
re8a7b66d → r75ba2fa6

  \subsubsection{Experiment setup}
+ \label{s:ExperimentSetup}

  The experiment driver defines a (intrusive) node type:
…
  \VRef[Figure]{fig:plot-list-zoomout} presents throughput at various list lengths for a linear and random (shuffled) insert/remove test.
  Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing different ends of the access spectrum.
- In the graphs, all four intrusive lists are plotted with the same symbol;
- these symbols clumped on top of each other, showing the performance difference among intrusive lists is small in comparison to the wrapped list \see{\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists}.
+ In the graphs, all four intrusive lists (lq-list, lq-tailq, upp-upp, cfa-cfa, see end of \VRef{s:ExperimentSetup}) are plotted with the same symbol;
+ sometimes these symbols clump on top of each other, showing the performance difference among intrusive lists is small in comparison to the wrapped list (std::list).
+ See~\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists.

  The list lengths start at 10 due to the short insert/remove times of 2--4 ns for intrusive lists, \vs STL's wrapped-reference list of 15--20 ns.
…
  \\
  &
- \subfloat[Shuffled List Nodes, AMD]{\label{f:Shuffled-swift}
+ \subfloat[Random List Nodes, AMD]{\label{f:Random-swift}
  \hspace*{-0.75in}
  \includegraphics{plot-list-zoomout-shuf-swift.pdf}
  } % subfigure
  &
- \subfloat[Shuffled List Nodes, Intel]{\label{f:Shuffled-java}
+ \subfloat[Random List Nodes, Intel]{\label{f:Random-java}
  \includegraphics{plot-list-zoomout-shuf-java.pdf}
  } % subfigure
…
  \caption{Insert/remove duration \vs list length.
  Lengths go as large as possible without error.
- One example operation is shown: stack movement, insert-first polarity and head-mediated access.}
+ One example operation is shown: stack movement, insert-first polarity and head-mediated access. Lower is better.}
  \label{fig:plot-list-zoomout}
  \end{figure}

- The large difference in performance between intrusive and wrapped-reference lists results from the dynamic allocation for the wrapped nodes.
- In effect, this experiment is largely measuring the cost of malloc/free rather than the insert/remove, and is affected by the layout of memory by the allocator.
- Fundamentally, insert/remove for a doubly linked-list has a basic cost, which is seen by the similar results for the intrusive lists.
- Hence, the costs for a wrapped-reference list are: allocating/deallocating a wrapped node, copying an external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
- the dynamic allocation dominates these costs.
- For example, the experiment was run with both glibc and llheap memory allocators, where performance from llheap reduced the cost from 20 to 16 ns, which is still far from the 2--4 ns for intrusive lists.
- Unfortunately, there is no way to tease apart the linking and allocation costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage the storage.
- Hence, dynamic allocation cost is fundamental for wrapped lists.
+ The key performance factor between the intrusive and the wrapped-reference lists is the dynamic allocation for the wrapped nodes.
+ Hence, this experiment is largely measuring the cost of @malloc@/\-@free@ rather than insert/remove, and is sensitive to the layout of memory by the allocator.
+ For insert/remove of an intrusive list, the cost is manipulating the link fields, which is seen by the relatively similar results for the different intrusive lists.
+ For insert/remove of a wrapped-reference list, the costs are: dynamically allocating/deallocating a wrapped node, copying an external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
+ the allocation dominates these costs.
+ For example, the experiment was run with both glibc and llheap memory allocators, where llheap performance reduced the cost from 20 to 16 ns, still far from the 2--4 ns for linking an intrusive node.
+ Hence, there is no way to tease apart the allocation, copying, and linking costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage that storage.

  In detail, \VRef[Figure]{f:Linear-swift}--\subref*{f:Linear-java} shows linear insertion of all the nodes and then linear removal, both in the same direction.
- For intrusive lists, the nodes are adjacent because they are preallocated in an array.
+ For intrusive lists, the nodes are adjacent in memory from being preallocated in an array.
- For wrapped lists, the wrapped nodes are still adjacent because the memory allocator happens to use bump allocation for the small fixed-sized wrapped nodes.
+ For wrapped lists, the wrapped nodes happen to be adjacent because the memory allocator uses bump allocation during the initial phase of allocation.
  As a result, these memory layouts result in high spatial and temporal locality for both kinds of lists during the linear array traversal.
  With address look-ahead, the hardware does an excellent job of managing the multi-level cache.
  Hence, performance is largely constant for both kinds of lists, until L3 cache and NUMA boundaries are crossed for longer lists and the costs increase consistently for both kinds of lists.
-
- In detail, \VRef[Figure]{f:Shuffled-swift}--\subref*{f:Shuffled-java} shows shuffled insertion and removal of the nodes.
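As an aside from the diff: the cost asymmetry the new text describes can be sketched in plain C. This is an illustrative sketch with hypothetical types, not the thesis's C∀ @dlist@ API; insertion into an intrusive list is pure link manipulation, while a wrapped-reference list pays for a wrapper allocation and a pointer copy on every insert.

```c
#include <stdlib.h>

/* Intrusive node: the links live inside the user's struct, so
   insertion touches only link fields and performs no allocation. */
struct inode {
	int v;
	struct inode *next, *prev;
};

/* Wrapped-reference node (std::list style): the list allocates a
   wrapper that points at the user's element. */
struct wnode {
	void *elem;
	struct wnode *next, *prev;
};

/* Insert node n after node pos in a circular doubly linked list. */
void intrusive_insert_after( struct inode *pos, struct inode *n ) {
	n->prev = pos;  n->next = pos->next;
	pos->next->prev = n;  pos->next = n;
}

/* Same link manipulation, plus the dominating malloc and the copy of
   the external-node pointer into the wrapper. */
struct wnode *wrapped_insert_after( struct wnode *pos, void *elem ) {
	struct wnode *n = malloc( sizeof(struct wnode) );	/* dominating cost */
	if ( ! n ) return NULL;
	n->elem = elem;										/* pointer copy */
	n->prev = pos;  n->next = pos->next;
	pos->next->prev = n;  pos->next = n;
	return n;
}
```

The two insert routines share identical linking code, which is why the diff's new text attributes the 15--20 ns versus 2--4 ns gap almost entirely to the allocation.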
- As for linear, there are issues with the wrapped list and memory allocation.
- For intrusive lists, it is possible to link the nodes randomly, so consecutive nodes in memory seldom point at adjacent nodes.
- For wrapped lists, the placement of wrapped nodes is dictated by the memory allocator, and results in memory layout virtually the same as for linear, \ie the wrapped nodes are laid out consecutively in memory, but each wrapped node points at a randomly located external node.
- As seen on \VPageref{p:Shuffle}, the random access reads data from external nodes, which are located in random order.
- So while the wrapped nodes are accessed linearly, external nodes are touched randomly for both kinds of lists resulting in similar cache events.
- The slowdown of shuffled occurs as the experiment's length grows from last-level cache into main memory.
+ For example, on AMD (\VRef[Figure]{f:Linear-swift}), there is one NUMA node but many small L3 caches, so performance slows down quickly as multiple L3 caches come into play, and remains constant at that level, except for some anomalies for very large lists.
+ On Intel (\VRef[Figure]{f:Linear-java}), there are four NUMA nodes and four slowdown steps as list-length increases.
+ At each step, the difference between the kinds of lists decreases as the NUMA effect increases.
+
+ In detail, \VRef[Figure]{f:Random-swift}--\subref*{f:Random-java} shows random insertion and removal of the nodes.
+ As for linear, there is the issue of memory allocation for the wrapped list.
+ As well, the consecutive storage-layout is the same (array and bump allocation).
+ Hence, the difference is the random linking among nodes, resulting in random accesses, even though the list is traversed linearly, resulting in similar cache events for both kinds of lists.
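A hedged sketch, in plain C rather than the thesis's driver code, of what "random linking among nodes" means for the intrusive experiment: the nodes stay contiguous in memory (array preallocation), but the link order is a shuffled permutation, so following @next@ hops to a randomly far address once the array outgrows the caches. The function and names here are hypothetical illustrations.

```c
#include <stdlib.h>
#include <stddef.h>

struct inode {
	int v;
	struct inode *next;
};

/* Link a contiguous array of n nodes in shuffled order and return the
   head: consecutive nodes in memory seldom point at adjacent nodes,
   so traversal defeats hardware prefetching for large n. */
struct inode *link_shuffled( struct inode a[], size_t n, unsigned seed ) {
	size_t *perm = malloc( n * sizeof(size_t) );
	for ( size_t i = 0; i < n; i += 1 ) perm[i] = i;
	srand( seed );								/* Fisher-Yates shuffle */
	for ( size_t i = n - 1; i > 0; i -= 1 ) {
		size_t j = (size_t)rand() % (i + 1);
		size_t t = perm[i];  perm[i] = perm[j];  perm[j] = t;
	}
	for ( size_t i = 0; i + 1 < n; i += 1 )		/* link in shuffled order */
		a[ perm[i] ].next = &a[ perm[i + 1] ];
	a[ perm[n - 1] ].next = NULL;
	struct inode *head = &a[ perm[0] ];
	free( perm );
	return head;
}
```

For the wrapped list no such step is needed: the allocator lays the wrappers out consecutively, and the randomness instead comes from each wrapper pointing at a randomly located external node.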
+ Both \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} show the slowdown of random access as the list-length grows, resulting from stepping out of caches into main memory and crossing NUMA nodes.
  % Insert and remove operations act on both sides of a link.
  % Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location.
- Each time a next item is processed, the memory access is a hop to a randomly far address.
- The target is unavailable in the cache and a slowdown results.
- Note, the external-data no-touch assumption is often unrealistic: decisions like, ``Should I remove this item?'' need to look at the item.
- Therefore, under realistic assumptions, both intrusive and wrapped-reference lists suffer similar caching issues for very large lists.
+ As for linear, the Intel (\VRef[Figure]{f:Random-java}) graph shows steps from the four NUMA nodes.
+ Interestingly, after $10^6$ nodes, intrusive lists are slower than wrapped.
+ I did not have time to track down this anomaly, but I speculate it results from the difference in touching the data in the accessed node, as the data and links are together for intrusive and separated for wrapped.
+ For the llheap memory-allocator and the two tested architectures, intrusive lists outperform wrapped lists up to size $10^3$ for both linear and random, and performance begins to converge around $10^6$ nodes as architectural issues begin to dominate.
+ Clearly, memory allocator and hardware architecture play a large factor in the total cost and the crossover points as list-size increases.
  % In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a non-intrusive list may be preferred for lists of approximately the cache's size.

  The takeaway from this experiment is that wrapped-list operations are expensive because memory allocation is expensive at this fine-grained level of execution.
  Hence, when possible, using intrusive links can produce a significant performance gain, even if nodes must be dynamically allocated, because the wrapping allocations are eliminated.
- Even when space is a consideration, intrusive links may not use any more storage if a node is mostly linked.
- Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures.
+ Even when space is a consideration, intrusive links may not use more storage if a node is often linked.
+ Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures or their tool-set does not provide them.

  % Note, linear access may not be realistic unless dynamic size changes may occur;
…
  \label{s:ComparingIntrusiveImplementations}

- The preceding result shows that all the intrusive implementations examined have noteworthy performance compared to wrapped lists.
- This analysis picks the sizes below 150 and zooms in to differentiate among the different intrusive implementations.
- The data is selected from the start of \VRef[Figure]{f:Linear}, but the start of \VRef[Figure]{f:Shuffled} would be largely the same.
+ The preceding result shows the intrusive implementations examined have better performance compared to wrapped lists for small to medium sized lists.
+ This analysis picks list sizes below 150 and zooms in to differentiate among the intrusive implementations.
+ The data is selected from the start of \VRef[Figures]{f:Linear-swift}--\subref*{f:Linear-java}, but the start of \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} is largely the same.

  \begin{figure}
…
  } % subfigure
  \end{tabular}
- \caption{Operation duration \vs list length at small-medium lengths. Two example operations are shown: [I] stack movement with head-only access (plots a and b); [VIII] queue movement with element-oriented removal access (plots c and d); both operations use insert-first polarity.}
+ \caption{Operation duration \vs list length at small-medium lengths. Two example operations are shown: [I] stack movement with head-only access (plots a and b); [VIII] queue movement with element-oriented removal access (plots c and d); both operations use insert-first polarity. Lower is better.}
  \label{fig:plot-list-zoomin}
  \end{figure}
…
  The error bars show fastest and slowest time seen on five trials, and the central point is the mean of the remaining three trials.
  For readability, the points are slightly staggered at a given horizontal value, where the points might otherwise appear on top of each other.
-
- \VRef[Figure]{fig:plot-list-zoomin} has the pair machines, each in a column.
- It also has a pair of operating scenarios, each in a row.
- The experiment runs twelve operating scenarios; the ones chosen for their variety happen to be items I and VIII from the listing of \VRef[Figure]{fig:plot-list-mchn-szz}.
+ The experiment runs twelve operating scenarios;
+ the ones chosen for their variety are scenarios I and VIII from the listing of \VRef[Figure]{fig:plot-list-mchn-szz}, and their results appear in the rows.
+ As in the previous experiment, each hardware architecture appears in a column.
  Note that LQ-list does not support queue operations, so this framework is absent from Operation VIII.

  At lengths 1 and 2, winner patterns seen at larger sizes generally do not apply.
- Indeed, an issue with \CFA giving terribel on queues at length 1 is evident in \VRef[Figure]{f:AbsoluteTime-viii-swift}.
+ Indeed, an issue with \CFA giving terrible results on queues at length 1 is evident in \VRef[Figure]{f:AbsoluteTime-viii-swift}.
  This phenomenon is elaborated in \MLB{TODO: xref}.
  For the remainder of this section, these sizes are disregarded.
…
  \label{toc:lst:futwork}

- Not discussed in the chapter is an issue in the \CFA type system with Plan-9 @inline@ declarations.
- It implements the trait 'embedded' for the derived-base pair named in the arguments.
- This trait defines a pseudo conversion called backtick-inner, from derived to base (the safe direction).
- Such a real conversion exists for the concrete types, due to our compiler having the p9 language feature.
- But when trying to be polymorphic over the types, there is no way to access this built-in conversion, so my macro stands in until Fangren does more with conversions.
-
+ Not discussed in the chapter is a \CFA type-system issue with Plan-9 @inline@ declarations, implementing the trait @embedded@ to access the contained @dlist@ link-fields.
+ This trait defines an implicit conversion from derived to base (the safe direction).
+ Such a conversion exists implicitly for concrete types using the Plan-9 inheritance feature.
+ However, this implicit conversion is only partially implemented for polymorphic types, such as @dlist@, which prevents the straightforward list interface shown throughout the chapter.
+
+ My workaround is the macro @P9_EMBEDDED@ placed before an intrusive node is used to declare a list.
+ \begin{cfa}
+ struct node {
+ 	int v;
+ 	inline dlink(node);
+ };
+ @P9_EMBEDDED( node, dlink(node) );@
+ dlist( node ) dlist;
+ \end{cfa}
+ The macro creates specialized access functions to explicitly extract the required information for the polymorphic @dlist@ type.
+ These access functions could have been generated implicitly for each intrusive node by adding another compiler pass.
+ However, this would have been substantial temporary work, when the correct solution is the compiler fix.
+ Hence, the macro workaround is only a small temporary inconvenience;
+ otherwise, all the list API shown in this chapter works.
+
+ \begin{comment}
  An API author has decided that the intended user experience is:
  \begin{itemize}
…
  TODO: deal with: Link fields are system-managed.
  Links in GDB.
  \end{comment}
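For readers outside C∀: the job done by the @P9_EMBEDDED@ macro's access functions, recovering the containing node from its embedded link fields, has a well-known plain-C analogue built on @offsetof@. This is an illustrative sketch under that analogy, not the thesis's implementation:

```c
#include <stddef.h>

/* Embedded link fields, a plain-C stand-in for 'inline dlink(node)'. */
struct dlink {
	struct dlink *next, *prev;
};

/* Recover the containing structure from a pointer to its embedded
   member: the derived-from-base direction that the P9_EMBEDDED access
   functions make available to the polymorphic dlist type. */
#define container_of( ptr, type, member ) \
	((type *)((char *)(ptr) - offsetof( type, member )))

struct node {
	int v;
	struct dlink link;		/* intrusive link fields */
};
```

A compiler with full Plan-9 inheritance support can supply both directions of this conversion implicitly, which is why the macro is described as a temporary workaround rather than part of the design.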