Changeset 4cf8832
- Timestamp:
- Apr 17, 2026, 9:46:46 AM (33 hours ago)
- Branches:
- master
- Parents:
- 99bc47b
- Location:
- doc/theses/mike_brooks_MMath
- Files:
-
- 2 added
- 7 edited
-
list.tex (modified) (14 diffs)
-
pictures/plot-list-short-temp.pdf (added)
-
pictures/plot-list-wide-temp.pdf (added)
-
plots/ListCommon.py (modified) (7 diffs)
-
plots/list-1ord.py (modified) (1 diff)
-
plots/list-zoomout-noshuf-java.py (modified) (1 diff)
-
plots/list-zoomout-noshuf-swift.py (modified) (1 diff)
-
plots/list-zoomout-shuf-java.py (modified) (1 diff)
-
plots/list-zoomout-shuf-swift.py (modified) (1 diff)
-
doc/theses/mike_brooks_MMath/list.tex
r99bc47b r4cf8832 653 653 The goal is to show the \CFA lists are competitive with other designs, but the different list designs may not have equivalent functionality, so it is impossible to select a winner encompassing both functionality and execution performance. 654 654 655 656 \subsection{Add-Remove Performance} 655 \subsection{Experiment Design} 656 657 \begin{figure} 658 \noindent 659 \begin{tabular}{p{1.75in}@{\ }p{4.5in}} 660 Insert-Remove (IR) 661 & The atomic unit of work being measured: one insertion plus one removal (plus all looping/tracking overheads) \\ 662 Use Case 663 & Pattern of add-remove calls. \\ 664 -- Movement & \\ 665 \quad $\ni$ stack 666 & Inserts and removes happen at the same end. \\ 667 \quad $\ni$ queue 668 & Inserts and removes happen at opposite ends. \\ 669 -- Polarity 670 & Which of the two orientations the movement happens in. \\ 671 \quad $\ni$ insert-first 672 & All inserts at front; stack removes at front; queue removes at back. \\ 673 \quad $\ni$ insert-last 674 & All inserts at back; stack removes at back; queue removes at front. \\ 675 -- Accessor 676 & How an insertion position, or removal element, is specified. The same position/element is picked either way. \\ 677 \quad $\ni$ all head 678 & inserts and removes both through the head \\ 679 \quad $\ni$ insert element 680 & insert by element and remove through the head \\ 681 \quad $\ni$ remove element 682 & insert through head and remove by element \\ 683 Physical Context & \\ 684 -- Size (number) & Number of nodes being linked. Unless specified, equals the \emph{length} of the program's sole list. \emph{Width}, rarely used, is the number of lists. \\ 685 -- Size Zone 686 & Contiguous range of sizes, chosen to avoid known anomalies and to sample a brief plateau. Each zone buckets four specific sizes. 
\\ 687 \quad $\ni$ small 688 & lists of 4--16 elements \\ 689 \quad $\ni$ medium 690 & lists of 50--200 elements \\ 691 \quad $\ni$ (other) 692 & Not used for comparing intrusive frameworks. \\ 693 -- machine 694 & Computer running the experiment \\ 695 \quad $\ni$ AMD 696 & smaller cache \\ 697 \quad $\ni$ Intel 698 & bigger cache \\ 699 Framework & A particular linked-list implementation (within its host language) \\ 700 $\ni$ \CC & The @std::list@ type of g++. \\ 701 $\ni$ lq-list & The @list@ type of LQ from glibc of gcc. \\ 702 $\ni$ lq-tailq & The @tailq@ type of the same. \\ 703 $\ni$ \uCpp & \uCpp's @uSequence@ \\ 704 $\ni$ \CFA & \CFA's @dlist@ \\ 705 Explanation being 706 & How independent explanatory variable X is analyzed \\ 707 -- Marginalized 708 & Left alone, allowed to vary, yielding a more absolute measure. Shows the effect that X causes. If all explanations are marginalized, then absolute times are available and a relative time has a peer group that is the entire population. \\ 709 -- Conditioned 710 & Held constant, yielding a more relative measure. Hides the effect that X causes. Conditioning on X creates more, smaller relative-measure peer groups, by isolating each X-domain value. Resulting interpretation is, ``Assuming no change in X.'' \\ 711 \end{tabular} 712 \caption{ 713 Glossary of terms used in the list performance evaluation. 714 } 715 \label{f:ListPerfGlossary} 716 \end{figure} 717 718 719 This section explains how the experiment is built. 720 Many of the following parts define terminology concerning tuning knobs. 721 \VRef[Figure]{f:ListPerfGlossary} provides a consolidated reference. 722 723 724 \subsubsection{Add-Remove Performance} 657 725 \label{s:AddRemovePerformance} 658 726 … … 661 729 Thus, adding and removing an element are the sole primitive actions. 
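To make the primitive actions concrete, the following plain-C sketch shows a minimal intrusive doubly-linked list whose insert/remove are only a handful of pointer assignments. This is an illustrative sketch with hypothetical names (@node@, @insert_first@, @remove_first@), not the \CFA @dlist@ nor any of the measured frameworks.

```c
#include <stddef.h>

// Hypothetical intrusive node: the link fields live inside the element,
// so inserting requires no separate allocation.
struct node {
	struct node * next;
	struct node * prev;
	int payload;                         // user data shares storage with the links
};

struct list {
	struct node * first;
	struct node * last;
};

// Push a node onto the front: a few pointer writes, no allocation.
static void insert_first( struct list * l, struct node * n ) {
	n->prev = NULL;
	n->next = l->first;
	if ( l->first ) l->first->prev = n;
	else l->last = n;                    // list was empty
	l->first = n;
}

// Pop the front node; caller must ensure the list is non-empty.
static struct node * remove_first( struct list * l ) {
	struct node * n = l->first;
	l->first = n->next;
	if ( l->first ) l->first->prev = NULL;
	else l->last = NULL;                 // list became empty
	n->next = n->prev = NULL;
	return n;
}
```

Each call compiles to roughly a dozen instructions, which is why repeated calls must be timed in aggregate.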
662 730 663 Repeated adding and removing is necessary to measure timing because these operations can be as simple as a dozen instructions.731 These instruction sequences may have cases that proceed (in a modern, deep pipeline) without a stall. 664 732 666 734 This experiment takes the position that: 667 735 \begin{itemize}[leftmargin=*] 668 \item The total time to add and remove is relevant, as opposed to having one individual time for adding and a separate time for removing.736 Adds without removes quickly fill memory; 670 removes without adds is meaningless. 671 \item A relevant breakdown ``by operation'' is: 672 \begin{description} 673 \item[movement] 674 Is the add/remove applied to a stack, queue, or something else? 675 In these experiments, strict stack and queue shapes are tested (two movements). 676 \item[polarity] 677 In which direction does the action apply? 678 For a queue, do items flow from first to last or last to first? 679 For a stack, is the first or last end used for adding and removing? 680 In these experiments, both polarities are considered, labelled insert-first and insert-last (two polarities). 681 \item[accessor] 682 Is an add/remove location given by a list head's first/last (@insertFirst@, @removeLast@), or by a reference to an individual element (@insertAfter@, @remove@ of element)? 
683 In these experiments, the (three) accessors are: 684 \begin{itemize} 685 \item 686 inserts and removes both through the head ("all-head") 687 \item 688 insert by element and remove through the head ("insert-element") 689 \item 690 insert through head and remove by element ("remove-element") 691 \end{itemize} 692 \end{description} 693 \item So, an "operating scenario" is a specific selection of movement, polarity and accessor. These experiments run twelve operating scenarios. 738 removing without adding is impossible. 739 \item A relevant breakdown ``by operation'' is, rather, the usage pattern of the add/remove calls. 740 An example pattern choice is adding and removing at the same end, making a stack, or opposite ends, for a queue. 741 Another is pushing on the front by calling @insert_first(lst, e)@ \vs @insert(e, old_first_elm)@; this aspect provides the test's API coverage. 742 \VRef[Section]{s:UseCases} gives the full breakdown. 694 743 \item Speed differences caused by the host machine's memory hierarchy need to be identified and explained, 695 but do not represent advantages of one linked-list implementation over another.744 but do not represent advantages of one framework over another. 696 745 \end{itemize} 697 746 698 747 The experiment used to measure insert/remove cost measures the mean duration of a sequence of additions and removals. 699 Confidence bounds, on this mean, are discussed.700 748 The distribution of speeds experienced by an individual add-remove pair (tail latency) is not discussed. 701 749 Space efficiency is shown only indirectly, by way of caches' impact on speed. 
… … 706 754 \end{itemize} 707 755 708 In the result analysis, where list length is a performance-influencing factor, once ``large'' lengths are dismissed, these zones are identified as representing different patterns: 709 \begin{description} 710 \item[size zone ``small''] lists of 4--16 elements 711 \item[size zone ``medium''] lists of 50--200 elements 712 \end{description} 713 Each zone buckets four specific sizes at which trials are run. 714 715 716 \subsubsection{Experiment setup} 717 \label{s:ExperimentSetup} 756 In all cases, the quantity discussed is the duration of one insert-remove (IR). 757 An IR is the time taken to do one innermost insertion-loop iteration, one innermost removal-loop iteration, and its share of all overheads, amortized. 758 759 Lower IR duration is better. 760 This experiment typically does an IR in 1--10 ns. 761 The short end of this range has durations of single-digit clock-cycle counts. 762 Therefore, the situations that achieve the best times are saturating the instruction pipeline successfully. 763 764 Often, an IR duration value needs to be considered relatively. 765 For example, \VRef[Section]{toc:sweet-sore} asks whether one linked list implementation is more sensitive than another to changing which computer runs the test. 766 A finding might be that a machine change slows implementation A by 10\% and B by 20\%. 767 This finding is not saying that A is faster than B (on either machine). 768 The finding could stand if B started 10\% faster and the machine change levelled them off, if B started slower and got worse, or in myriad other cases. 769 The finding asserts that such distinctions are not what is immediately relevant. 770 The arithmetic that produces the 10\% and 20\% answers is removing the information about which one starts, or ends up, faster. 771 Each implementation's to-machine duration is stated relative to \emph{the same implementation's} from-machine duration. 772 The resulting measure is still about a duration. 
773 The framework with the lower from-machine-relative duration handles the change better. 774 775 776 \subsubsection{Test Program} 777 \label{s:TetProgram} 718 778 719 779 The experiment driver defines an (intrusive) node type: … … 750 810 The loop duration is divided by the counter and this throughput is reported. 751 811 In a scatter-plot, each dot is one throughput, which means insert + remove + harness overhead. 752 The harness overhead is constant when comparing linked-list implementations and keep as small as possible.812 The harness overhead is constant when comparing linked-list frameworks and is kept as small as possible. 753 813 % The remainder of the setup section discusses the choices that affected the harness overhead. 754 814 … … 822 882 % This harness avoids telling the hardware what the SUT is about to do. 823 883 824 The comparator linked-list implementations are: 825 \begin{description} 826 \item[std::list] The @list@ type of g++. 827 \item[lq-list] The @list@ type of LQ from glibc of gcc. 828 \item[lq-tailq] The @tailq@ type of the same. 829 \item[upp-upp] \uCpp's @uSequence@ 830 \item[cfa-cfa] \CFA's @dlist@ 831 \end{description} 832 833 834 \subsubsection{Execution Environment} 835 \label{s:ExperimentalEnvironment} 836 837 The performance experiments are run on: 838 \begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt] 839 %\item[PC] 840 %with a 64-bit eight-core AMD FX-8370E, with ``taskset'' pinning to core \#6. The machine has 16 GB of RAM and 8 MB of last-level cache. 841 %\item[ARM] 842 %Gigabyte E252-P31 128-core socket 3.0 GHz, WO memory model 843 \item[AMD] 844 Supermicro AS--1125HS--TNR EPYC 9754 128--core socket, hyper-threading $\times$ 2 sockets (512 processing units) 2.25 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 1024KB L2, 16MB L3, where each L3 cache covers 1 NUMA node and 8 cores (16 processors). 
845 \item[Intel] 846 Supermicro SYS-121H-TNR Xeon Gold 6530 32--core, hyper-threading $\times$ 2 sockets (128 processing units) 2.1 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 2048KB L2, 160MB L3, where each L3 cache covers 2 NUMA nodes and 32 cores (64 processors). 847 \end{description} 848 The experiments are single-threaded and pinned to a single core to prevent any OS movement, which might cause cache or NUMA effects perturbing the experiment. 849 850 The compiler is gcc/g++-14.2.0 running on the Linux v6.8.0-52-generic OS. 851 Switching between the default memory allocators @glibc@ and @llheap@ is done with @LD_PRELOAD@. 852 To prevent eliding certain code patterns, crucial parts of a test are wrapped by the function @pass@ 853 \begin{cfa} 854 // prevent eliding, cheaper than volatile 855 static inline void * pass( void * v ) { __asm__ __volatile__( "" : "+r"(v) ); return v; } 856 ... 857 pass( &remove_first( lst ) ); // wrap call to prevent elision, insert cannot be elided now 858 \end{cfa} 859 The call to @pass@ can prevent a small number of compiler optimizations but this cost is the same for all lists. 860 861 862 \subsection{Result: Coarse comparison of styles} 863 864 This comparison establishes how an intrusive list performs compared with a wrapped-reference list. 865 \VRef[Figure]{fig:plot-list-zoomout} presents throughput at various list lengths for a linear and random (shuffled) insert/remove test. 866 Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing different ends of the access spectrum. 867 In the graphs, all four intrusive lists (lq-list, lq-tailq, upp-upp, cfa-cfa, see end of \VRef{s:ExperimentSetup}) are plotted with the same symbol; 868 sometimes these symbols clump on top of each other, showing the performance difference among intrusive lists is small in comparison to the wrapped list (std::list). 
869 See~\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists. 870 871 The list lengths start at 10 due to the short insert/remove times of 2--4 ns, for intrusive lists, \vs STL's wrapped-reference list of 15--20 ns. 872 For very short lists, like 4, the experiment time of 4 $\times$ 2.5 ns and experiment overhead (loops) of 2--4 ns results in an artificial administrative bump at the start of the graph having nothing to do with the insert/remove times. 873 As the list size grows, the administrative overhead for intrusive lists quickly disappears. 874 875 \begin{figure} 876 \centering 877 \setlength{\tabcolsep}{0pt} 878 \begin{tabular}{p{0.75in}p{2.75in}p{3in}} 879 & 880 \subfloat[Linear List Nodes, AMD]{\label{f:Linear-swift} 881 \hspace*{-0.75in} 882 \includegraphics{plot-list-zoomout-noshuf-swift.pdf} 883 } % subfigure 884 & 885 \subfloat[Linear List Nodes, Intel]{\label{f:Linear-java} 886 \includegraphics{plot-list-zoomout-noshuf-java.pdf} 887 } % subfigure 888 \\ 889 & 890 \subfloat[Random List Nodes, AMD]{\label{f:Random-swift} 891 \hspace*{-0.75in} 892 \includegraphics{plot-list-zoomout-shuf-swift.pdf} 893 } % subfigure 894 & 895 \subfloat[Random List Nodes, Intel]{\label{f:Random-java} 896 \includegraphics{plot-list-zoomout-shuf-java.pdf} 897 } % subfigure 898 \end{tabular} 899 \caption{Insert/remove duration \vs list length. 900 Lengths go as large as possible without error. 901 One example operation is shown: stack movement, insert-first polarity and head-mediated access. Lower is better.} 902 \label{fig:plot-list-zoomout} 903 \end{figure} 904 905 The key performance factor between the intrusive and the wrapped-reference lists is the dynamic allocation for the wrapped nodes. 906 Hence, this experiment is largely measuring the cost of @malloc@/\-@free@ rather than insert/remove, and is sensitive to the layout of memory by the allocator. 
907 For insert/remove of an intrusive list, the cost is manipulating the link fields, which is seen by the relatively similar results for the different intrusive lists. 908 For insert/remove of a wrapped-reference list, the costs are: dynamically allocating/deallocating a wrapped node, copying an external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list; 909 the allocation dominates these costs. 910 For example, the experiment was run with both glibc and llheap memory allocators, where llheap reduced the cost from 20 to 16 ns, still far from the 2--4 ns for linking an intrusive node. 911 Hence, there is no way to tease apart the allocation, copying, and linking costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage that storage. 912 913 In detail, \VRef[Figure]{f:Linear-swift}--\subref*{f:Linear-java} shows linear insertion of all the nodes and then linear removal, both in the same direction. 914 For intrusive lists, the nodes are adjacent in memory from being preallocated in an array. 915 For wrapped lists, the wrapped nodes happen to be adjacent because the memory allocator uses bump allocation during the initial phase of allocation. 916 As a result, these memory layouts give high spatial and temporal locality for both kinds of lists during the linear array traversal. 917 With address look-ahead, the hardware does an excellent job of managing the multi-level cache. 918 Hence, performance is largely constant for both kinds of lists, until L3 cache and NUMA boundaries are crossed for longer lists and the costs increase consistently for both kinds of lists. 919 For example, on AMD (\VRef[Figure]{f:Linear-swift}), there is one NUMA node but many small L3 caches, so performance slows down quickly as multiple L3 caches come into play, and remains constant at that level, except for some anomalies for very large lists. 
920 On Intel (\VRef[Figure]{f:Linear-java}), there are four NUMA nodes and four slowdown steps as list-length increases. 921 At each step, the difference between the kinds of lists decreases as the NUMA effect increases. 922 923 In detail, \VRef[Figure]{f:Random-swift}--\subref*{f:Random-java} shows random insertion and removal of the nodes. 924 As for linear, there is the issue of memory allocation for the wrapped list. 925 As well, the consecutive storage-layout is the same (array and bump allocation). 926 Hence, the difference is the random linking among nodes, resulting in random accesses, even though the list is traversed linearly, and hence similar cache events for both kinds of lists. 927 Both \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} show the slowdown of random access as the list-length grows resulting from stepping out of caches into main memory and crossing NUMA nodes. 928 % Insert and remove operations act on both sides of a link. 929 %Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location. 930 As for linear, the Intel (\VRef[Figure]{f:Random-java}) graph shows steps from the four NUMA nodes. 931 Interestingly, after $10^6$ nodes, intrusive lists are slower than wrapped. 932 I did not have time to track down this anomaly, but I speculate it results from the difference in touching the data in the accessed node, as the data and links are together for intrusive and separated for wrapped. 933 For the llheap memory-allocator and the two tested architectures, intrusive lists outperform wrapped lists up to size $10^3$ for both linear and random, and performance begins to converge around $10^6$ nodes as architectural issues begin to dominate. 934 Clearly, the memory allocator and hardware architecture play a large role in the total cost and the crossover points as list-size increases. 
935 % In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a non-intrusive list may be preferred for lists of approximately the cache's size. 936 937 The takeaway from this experiment is that wrapped-list operations are expensive because memory allocation is expensive at this fine-grained level of execution. 938 Hence, when possible, using intrusive links can produce a significant performance gain, even if nodes must be dynamically allocated, because the wrapping allocations are eliminated. 939 Even when space is a consideration, intrusive links may not use more storage if a node is often linked. 940 Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures or their tool-set does not provide them. 941 942 % Note, linear access may not be realistic unless dynamic size changes may occur; 943 % if the nodes are known to be adjacent, use an array. 944 945 % In a wrapped-reference list, list nodes are allocated separately from the items put into the list. 946 % Intrusive beats wrapped at the smaller lengths, and when shuffling is avoided, because intrusive avoids dynamic memory allocation for list nodes. 947 948 % STL's performance is not affected by element order in memory. 949 %The field of intrusive lists begins with length-1 operations costing around 10 ns and enjoys a ``sweet spot'' in lengths 10--100 of 5--7-ns operations. 950 % This much is also unaffected by element order. 951 % Beyond this point, shuffled-element list performance worsens drastically, losing to STL beyond about half a million elements, and never particularly leveling off. 952 % In the same range, an unshuffled list sees some degradation, but holds onto a 1--2 $\times$ speedup over STL. 
953 954 % The apparent intrusive ``sweet spot,'' particularly its better-than-length-1 speed, is not because of list operations truly running faster. 955 % Rather, the worsening as length decreases reflects the per-operation share of harness overheads incurred at the outer-loop level. 956 % Disabling the harness's ability to drive interleaving, even though the current scenario is using a ``never work in middle'' interleave, made this rise disappear. 957 % Subsequent analyses use length-controlled relative performance when comparing intrusive implementations, making this curiosity disappear. 958 959 % The remaining big-swing comparison points say more about a computer's memory hierarchy than about linked lists. 960 % The tests in this chapter are only inserting and removing. 961 % They are not operating on any user payload data that is being listed. 962 % The drastic differences at large list lengths reflect differences in link-field storage density and in correlation of link-field order to element order. 963 % These differences are inherent to the two list models. 964 965 % A wrapped-reference list's separate nodes are allocated right beside each other in this experiment, because no other memory allocation action is happening. 966 % As a result, the interlinked nodes of the STL list are generally referencing their immediate neighbours. 967 % This pattern occurs regardless of user-item shuffling because this test's ``use'' of the user-items' array is limited to storing element addresses. 968 % This experiment, driving an STL list, is simply not touching the memory that holds the user data. 969 % Because the interlinked nodes, being the only touched memory, are generally adjacent, this case too has high memory locality and stays fast. 970 971 % But the comparison of unshuffled intrusive with wrapped-reference gives the performance of these two styles, with their the common impediment of overfilling the cache removed. 
972 % Intrusive consistently beats wrapped-reference by about 20 ns, at all sizes. 973 % This difference is appreciable below list length 0.5 M, and enormous below 10 K. 974 975 976 \subsection{Result: Comparing intrusive implementations} 977 \label{s:ComparingIntrusiveImplementations} 978 979 The preceding result shows the intrusive implementations have better performance than the wrapped lists for small- to medium-sized lists. 980 This analysis covers the experiment position taken in \VRef{s:AddRemovePerformance} for movement, polarity, and accessor. 981 \VRef[Figure]{f:ExperimentOperations} shows the experiment operations tested, which results in 12 experiments (I--XII) for comparing intrusive implementations. 982 To preclude hardware interference, only list sizes below 150 are examined to differentiate among the intrusive implementations. 983 The data is selected from the start of \VRef[Figures]{f:Linear-swift}--\subref*{f:Linear-java}, but the start of \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} is largely the same. 
984 985 \begin{figure} 884 885 \subsubsection{Use Cases} 886 \label{s:UseCases} 887 888 \begin{figure} 889 \begin{comment} 986 890 \centering 987 891 \setlength{\tabcolsep}{8pt} … … 1026 930 \small 1027 931 \begin{tabular}{@{}ll@{}} 1028 I: & stack, insert first, I-head / R-head \\1029 II: & stack, insert first, I-list / R-head\\1030 III:& stack, insert first, I-head / R-list \\1031 IV: & stack, insert last, I-head / R-head \\1032 V: & stack, insert last, I-list / R-head\\1033 VI: & stack, insert last, I-head / R-list \\1034 VII:& queue, insert first, I-head / R-head \\1035 VIII:& queue, insert first, I-list / R-head\\1036 IX: & queue, insert first, I-head / R-list \\1037 X: & queue, insert last, I-head / R-head \\1038 XI: & queue, insert last, i I-list / R-head\\1039 XII:& queue, insert last, I-head / R-list \\932 I: & stack, insert first, all head \\ 933 II: & stack, insert first, insert element \\ 934 III:& stack, insert first, remove element \\ 935 IV: & stack, insert last, all head \\ 936 V: & stack, insert last, insert element \\ 937 VI: & stack, insert last, remove element \\ 938 VII:& queue, insert first, all head \\ 939 VIII:& queue, insert first, insert element \\ 940 IX: & queue, insert first, remove element \\ 941 X: & queue, insert last, all head \\ 942 XI: & queue, insert last, insert element \\ 943 XII:& queue, insert last, remove element \\ 1040 944 \end{tabular} 1041 945 \end{tabular} 1042 \caption{Experiment Operations} 946 \end{comment} 947 \setlength{\tabcolsep}{5pt} 948 \small 949 \begin{tabular}{rcccccccccccc} 950 & I & II & III & IV & V & VI & VII & VIII & IX & X & XI & XII \\ 951 Movement & 952 stack & stack & stack & stack & stack & stack & 953 queue & queue & queue & queue & queue & queue \\ 954 Polarity & 955 i-first & i-first & i-first & i-last & i-last & i-last & 956 i-first & i-first & i-first & i-last & i-last & i-last \\ 957 Accessor & 958 all hd & ins-e & rem-e & all hd & ins-e & rem-e & 959 all hd & ins-e & rem-e & all hd & ins-e & 
rem-e 960 \end{tabular} 961 \caption{Experiment use cases, numbered.} 1043 962 \label{f:ExperimentOperations} 1044 963 \end{figure} 964 965 Where \VRef[Figure]{f:ListPerfGlossary} enumerates the specific values, recall the use-case dimensions are: 966 \begin{description} 967 \item[movement ($\times 2$)] 968 In these experiments, strict stack and queue patterns are tested. 969 \item[polarity ($\times 2$)] 970 Obtain one polarity from the other by reversing uses of first/last. 971 \item[accessor ($\times 3$)] 972 An add/remove location is given by a list head's first/last, \vs by a preexisting reference to an individual element. 973 \end{description} 974 975 A use case is a specific selection of movement, polarity and accessor. 976 These experiments run twelve use cases. 977 When a comparison is showing only what can happen when switching among use cases (as opposed to \eg how stacks are different from queues), the numbering scheme of \VRef[Figure]{f:ExperimentOperations} is used. 978 979 With accessor, when an action names its insertion position or removal element, the harness either 980 \begin{itemize} 981 \item defers to the list-head's tracking of first/last (``through the head''), or 982 \item applies its own knowledge of the current pattern, to name a position/element that happens to be first/last (``of known element''). 983 \end{itemize} 984 985 The accessor patterns, at the (\CFA) API level, are: 986 \begin{description} 987 \item[all (through the) head:] Both inserts and removes happen through the list head. The list head operations are @insert_first@, @insert_last@, @remove_first@ and @remove_last@. 988 \item[insert (of known) element] \dots and remove through head: Inserts use @insert_before(e, first)@ or @insert_after(e, last)@, where @e@ is being inserted and @first@/@last@ are element references known by list-independent means. 989 \item[remove (of known) element] \dots and insert through the head: Removes use @remove(e)@, where @e@ is being removed. 
List-independent knowledge establishes that @e@ is first or last, as appropriate. 990 \end{description} 991 992 Comparing all-head with insert-element gives the relative performance of head-mediated \vs element-oriented insertion, because both use the same removal style. 993 Comparing all-head with remove-element gives the relative performance of head-mediated \vs element-oriented removal, because both use the same insertion style. 994 995 \subsubsection{Sizing} 996 997 998 It is true, but perhaps not obvious, that building and destroying long lists is slower than building and destroying short lists. 999 Obviously, indeed, it takes longer to fuse and divide a hundred neighbours than five. 1000 But the key metric in this work, IR, is about a single link--unlink. 1001 So, critically, linking and unlinking a hundred neighbours actually takes \emph{more} than $20\times$ the time for five neighbours. 1002 The main reason is caching; when more neighbours are being manipulated, more memory is being read and written. 1003 1004 But caching success is about more than the amount of memory worked on. 1005 Subtle changes in pattern become butterfly effects. 1006 Aggressive ILP scheduling, which enables short IR times, is the amplifier. 1007 A data dependency, present in one framework but not another, is on the critical path in one situation but not in another. 1008 So, duration's response to size is not a steady worsening as size increases. 1009 Rather, each size-independent configuration often responds to size increases with leaps of worsening. 1010 Occasionally a leap is even followed by a size-run of retrograde response, where a suddenly incurred penalty has a chance to amortize away. 1011 The frameworks tend to leapfrog over each other, at different points, as size increases. 1012 1013 The analysis treats these behaviours as incidental. 1014 It does not try to characterize various exact-size responses. 
1015 Rather, size zones are picked, specific effects inside of a zone are averaged away, and the story at one zone is compared to that at another zone. 1016 1017 To preview, \VRef[Section]{toc:coarse-compre} dismisses ``Large'' sizes (above 150 elements), where the performance story is dominated by the amount of memory touched, inherently, by the choice of intrusive list \vs wrapped, and where one intrusive framework is quite obviously as good as another. 1018 At smaller sizes, comparing one intrusive framework to another makes sense; this comparison occurs in the remaining ``Result'' sections. 1019 Among the ``not Large'' sizes used there, two further zones, Small and Medium, are selected as representatives of what can vary when the scale is changed. 1020 These particular ranges were chosen because each range tends to have one story repeated across its constituent sizes. 1021 If \CFA's duration increases across Small, then the other frameworks' usually do too. 1022 If \CFA is beating \uCpp at the low end of Large, then it usually is at the high end too. 1023 The leapfrogging tends to happen outside of these two ranges. 1024 1025 A spot of poor performance appears in the general results for \CFA at size 1. 1026 Section \MLB{TODO:xref} explores the phenomenon and concludes that it is an anomaly due to a quirky interaction with the testing rig. 1027 To do so, it considers size as either length or width. 1028 Length is the number of elements in a list. 1029 Width is the number of these lists being kept, worked upon in round-robin order. 1030 Outside of \MLB{TODO:xref}, size always means length, and width is 1. 1031 1032 1033 1034 \subsubsection{Execution Environment} 1035 \label{s:ExperimentalEnvironment} 1036 1037 The performance experiments are run on: 1038 \begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt] 1039 %\item[PC] 1040 %with a 64-bit eight-core AMD FX-8370E, with ``taskset'' pinning to core \#6. 
%The machine has 16 GB of RAM and 8 MB of last-level cache. 1041 %\item[ARM] 1042 %Gigabyte E252-P31 128-core socket 3.0 GHz, WO memory model 1043 \item[AMD] 1044 Supermicro AS--1125HS--TNR EPYC 9754 128--core socket, hyper-threading $\times$ 2 sockets (512 processing units) 2.25 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 1024KB L2, 16MB L3, where each L3 cache covers 1 NUMA node and 8 cores (16 processors). 1045 \item[Intel] 1046 Supermicro SYS-121H-TNR Xeon Gold 6530 32--core, hyper-threading $\times$ 2 sockets (128 processing units) 2.1 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 2048KB L2, 160MB L3, where each L3 cache covers 2 NUMA nodes and 32 cores (64 processors). 1047 \end{description} 1048 The experiments are single-threaded and pinned to a single core to prevent any OS movement, which might cause cache or NUMA effects perturbing the experiment. 1049 1050 The compiler is gcc/g++-14.2.0 running on the Linux v6.8.0-52-generic OS. 1051 Switching between the default memory allocators @glibc@ and @llheap@ is done with @LD_PRELOAD@. 1052 To prevent eliding certain code patterns, crucial parts of a test are wrapped by the function @pass@: 1053 \begin{cfa} 1054 // prevent eliding, cheaper than volatile 1055 static inline void * pass( void * v ) { __asm__ __volatile__( "" : "+r"(v) ); return v; } 1056 ... 1057 pass( &remove_first( lst ) ); // wrap call to prevent elision, insert cannot be elided now 1058 \end{cfa} 1059 The call to @pass@ can prevent a small number of compiler optimizations, but this cost is the same for all lists. 1060 1061 1062 The main difference in the machines is their cache structure. 1063 The AMD has smaller caches that are shared less, while the Intel shares larger caches among more processors. 1064 This difference, while an interesting tradeoff for highly concurrent use, is rather one-sided for sequential use, such as this experiment's. 1065 The Intel offers a single processor a bigger cache.
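To make the cache discussion concrete, a quick back-of-envelope script translates the listed cache capacities into approximate list lengths that fit each level. This is an illustrative sketch only, not part of the thesis tooling: the 32-byte node size is an assumption (payload plus two link fields), and the Intel L2 figure is taken as 2048KB.

```python
# Rough capacity check: how many list nodes fit in each cache level.
# Per-core cache sizes follow the machine descriptions above; the
# 32-byte node is an assumed illustrative size.
NODE_BYTES = 32

caches = {
    "AMD":   {"L1d": 32 * 1024, "L2": 1024 * 1024, "L3": 16 * 1024 * 1024},
    "Intel": {"L1d": 32 * 1024, "L2": 2048 * 1024, "L3": 160 * 1024 * 1024},
}

def nodes_fitting(machine):
    """Map each cache level to the number of nodes it can hold."""
    return {level: size // NODE_BYTES for level, size in caches[machine].items()}

print(nodes_fitting("AMD"))  # {'L1d': 1024, 'L2': 32768, 'L3': 524288}
```

Under these assumptions, the ``below 150 elements'' sizes used later sit comfortably inside even L1, so cache-capacity effects there come from the harness's total footprint rather than the list alone.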
1066 1067 \subsubsection{Recap and Master Legend} 1068 1069 There are 12 use cases, which are all combinations of 2 movements, 2 polarities and 3 accessors. 1070 There are 4 physical contexts, which are all combinations of 2 machines and 2 size (length) zones (and 1 width, of value 1). 1071 Each physical context samples 4 specific sizes. 1072 1073 There are 3.25 frameworks. 1074 This accounting considers how LQ-list supports only the movement--polarity combination ``stack, insert first.'' 1075 So LQ-list fills a quarter of the otherwise-orthogonal space. 1076 1077 Use case, physical context and framework are the explanatory factors. 1078 Taking all combinations of the explanatory factors gives 624 individual configurations. 1079 1080 Though there are multiple experimental trials of each configuration (to assure repeatability), the usual measure is mean IR duration among the trials, considered for each of the 624 individual configurations. 1081 1082 All means reported in this analysis are geometric. 1083 1084 \MLB{TODO: add example plots; explain histogram of 624} 1085 1086 1087 \subsection{Result: Coarse comparison of styles} 1088 1089 This comparison establishes how an intrusive list performs compared with a wrapped-reference list. 1090 \VRef[Figure]{fig:plot-list-zoomout} presents insert/remove duration at various list lengths for a linear and random (shuffled) insert/remove test. 1091 Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing different ends of the access spectrum. 1092 In the graphs, all four intrusive lists (lq-list, lq-tailq, upp-upp, cfa-cfa, see end of \VRef{s:Contenders}) are plotted with the same symbol; 1093 sometimes these symbols clump on top of each other, showing the performance difference among intrusive lists is small in comparison to the wrapped list (std::list). 1094 See~\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists.
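The configuration accounting in the recap (12 use cases, 4 physical contexts sampling 4 sizes each, and ``3.25'' frameworks yielding 624 configurations) and the geometric-mean convention can be sanity-checked with a short script. This is an illustrative sketch, not the thesis's plotting code; the factor labels come from the text, while the helper names are invented here.

```python
from itertools import product
from math import prod

# Use cases: 2 movements x 2 polarities x 3 accessors = 12
movements = ["stack", "queue"]
polarities = ["insert-first", "insert-last"]
accessors = ["all head", "insert element", "remove element"]
use_cases = list(product(movements, polarities, accessors))

# Physical contexts: 2 machines x 2 size zones, each sampling 4 sizes
contexts = list(product(["AMD", "Intel"], ["Small", "Medium"]))
sizes_per_zone = 4

# "3.25 frameworks": three full frameworks, plus lq-list, which covers only
# the stack/insert-first quarter of the movement--polarity space
full_frameworks = ["lq-tailq", "upp-upp", "cfa-cfa"]
lq_list_cases = [uc for uc in use_cases if uc[:2] == ("stack", "insert-first")]

configurations = (len(full_frameworks) * len(use_cases) + len(lq_list_cases)) \
                 * len(contexts) * sizes_per_zone
print(configurations)  # (3*12 + 3) = 39 framework/use-case pairs, x 16 context/size cells = 624

def geometric_mean(durations):
    """All means reported in the analysis are geometric."""
    return prod(durations) ** (1.0 / len(durations))
```

The geometric mean is the appropriate aggregate here because IR durations vary multiplicatively across configurations, so one slow outlier cannot swamp the summary the way it would an arithmetic mean.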
1095 1096 The list lengths start at 10 due to the short insert/remove times of 2--4 ns, for intrusive lists, \vs STL's wrapped-reference list of 15--20 ns. 1097 For very short lists, like 4, the experiment time of 4 $\times$ 2.5 ns and experiment overhead (loops) of 2--4 ns, result in an artificial administrative bump at the start of the graph having nothing to do with the insert/remove times. 1098 As the list size grows, the administrative overhead for intrusive lists quickly disappears. 1099 1100 \begin{figure} 1101 \centering 1102 \setlength{\tabcolsep}{0pt} 1103 \begin{tabular}{p{0.75in}p{2.75in}p{3in}} 1104 & 1105 \subfloat[Linear List Nodes, AMD]{\label{f:Linear-swift} 1106 \hspace*{-0.75in} 1107 \includegraphics{plot-list-zoomout-noshuf-swift.pdf} 1108 } % subfigure 1109 & 1110 \subfloat[Linear List Nodes, Intel]{\label{f:Linear-java} 1111 \includegraphics{plot-list-zoomout-noshuf-java.pdf} 1112 } % subfigure 1113 \\ 1114 & 1115 \subfloat[Random List Nodes, AMD]{\label{f:Random-swift} 1116 \hspace*{-0.75in} 1117 \includegraphics{plot-list-zoomout-shuf-swift.pdf} 1118 } % subfigure 1119 & 1120 \subfloat[Random List Nodes, Intel]{\label{f:Random-java} 1121 \includegraphics{plot-list-zoomout-shuf-java.pdf} 1122 } % subfigure 1123 \end{tabular} 1124 \caption{Insert/remove duration \vs list length. 1125 Lengths go as large as possible without error. 1126 One example use case is shown: stack movement, insert-first polarity and head-mediated access. Lower is better.} 1127 \label{fig:plot-list-zoomout} 1128 \end{figure} 1129 1130 The key performance factor between the intrusive and the wrapped-reference lists is the dynamic allocation for the wrapped nodes. 1131 Hence, this experiment is largely measuring the cost of @malloc@/\-@free@ rather than insert/remove, and is sensitive to the layout of memory by the allocator.
For insert/remove of an intrusive list, the cost is manipulating the link fields, as seen in the relatively similar results for the different intrusive lists. 1133 For insert/remove of a wrapped-reference list, the costs are: dynamically allocating/deallocating a wrapped node, copying an external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list; 1134 the allocation dominates these costs. 1135 For example, the experiment was run with both glibc and llheap memory allocators, where llheap reduced the cost from 20 to 16 ns, still far from the 2--4 ns for linking an intrusive node. 1136 Hence, there is no way to tease apart the allocation, copying, and linking costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage that storage. 1137 1138 In detail, \VRef[Figure]{f:Linear-swift}--\subref*{f:Linear-java} shows linear insertion of all the nodes and then linear removal, both in the same direction. 1139 For intrusive lists, the nodes are adjacent in memory from being preallocated in an array. 1140 For wrapped lists, the wrapped nodes happen to be adjacent because the memory allocator uses bump allocation during the initial phase of allocation. 1141 As a result, these memory layouts result in high spatial and temporal locality for both kinds of lists during the linear array traversal. 1142 With address look-ahead, the hardware does an excellent job of managing the multi-level cache. 1143 Hence, performance is largely constant for both kinds of lists, until L3 cache and NUMA boundaries are crossed for longer lists and the costs increase consistently for both kinds of lists. 1144 For example, on AMD (\VRef[Figure]{f:Linear-swift}), there is one NUMA node but many small L3 caches, so performance slows down quickly as multiple L3 caches come into play, and remains constant at that level, except for some anomalies for very large lists.
1145 On Intel (\VRef[Figure]{f:Linear-java}), there are four NUMA nodes and four slowdown steps as list-length increases. 1146 At each step, the difference between the kinds of lists decreases as the NUMA effect increases. 1147 1148 In detail, \VRef[Figure]{f:Random-swift}--\subref*{f:Random-java} shows random insertion and removal of the nodes. 1149 As for linear, there is the issue of memory allocation for the wrapped list. 1150 As well, the consecutive storage-layout is the same (array and bump allocation). 1151 Hence, the difference is the random linking among nodes, resulting in random accesses, even though the list is traversed linearly, resulting in similar cache events for both kinds of lists. 1152 Both \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} show the slowdown of random access as the list-length grows resulting from stepping out of caches into main memory and crossing NUMA nodes. 1153 % Insert and remove operations act on both sides of a link. 1154 %Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location. 1155 As for linear, the Intel (\VRef[Figure]{f:Random-java}) graph shows steps from the four NUMA nodes. 1156 Interestingly, after $10^6$ nodes, intrusive lists are slower than wrapped. 1157 I did not have time to track down this anomaly, but I speculate it results from the difference in touching the data in the accessed node, as the data and links are together for intrusive and separated for wrapped. 1158 For the llheap memory-allocator and the two tested architectures, intrusive lists outperform wrapped lists up to size $10^3$ for both linear and random, and performance begins to converge around $10^6$ nodes as architectural issues begin to dominate. 1159 Clearly, memory allocator and hardware architecture play a large role in the total cost and the crossover points as list-size increases.
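The structural difference driving these costs can be sketched abstractly (in Python here, matching the repository's tooling language; the C/\CFA versions differ in detail but not in shape): an intrusive element carries its own link fields, so insertion is pure link manipulation, while a wrapped-reference list must allocate a separate node per insertion, the analogue of the @malloc@ that dominates the measurements. The class and method names below are invented for illustration.

```python
class Item:
    """User item with an intrusive link field embedded in it."""
    def __init__(self, value):
        self.value = value
        self.next = None            # intrusive link: storage lives in the item


class IntrusiveStack:
    """Insertion/removal is pure link manipulation -- no per-operation allocation."""
    def __init__(self):
        self.head = None

    def insert_first(self, item):
        item.next, self.head = self.head, item

    def remove_first(self):
        item, self.head = self.head, self.head.next
        item.next = None
        return item


class _WrapNode:
    """Separate node allocated on every insertion (the malloc analogue)."""
    __slots__ = ("item", "next")
    def __init__(self, item, nxt):
        self.item, self.next = item, nxt


class WrappedStack:
    """Wrapped-reference style: the item is untouched; a node wraps a reference to it."""
    def __init__(self):
        self.head = None

    def insert_first(self, item):
        self.head = _WrapNode(item, self.head)   # per-insert allocation

    def remove_first(self):
        node, self.head = self.head, self.head.next
        return node.item                          # node becomes garbage (the free analogue)
```

Both stacks round-trip the same items in the same order; only the intrusive one avoids a per-operation allocation, which is why the measured gap largely reflects allocator cost rather than linking cost.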
1160 % In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a non-intrusive list may be preferred for lists of approximately the cache's size. 1161 1162 The takeaway from this experiment is that wrapped-list operations are expensive because memory allocation is expensive at this fine-grained level of execution. 1163 Hence, when possible, using intrusive links can produce a significant performance gain, even if nodes must be dynamically allocated, because the wrapping allocations are eliminated. 1164 Even when space is a consideration, intrusive links may not use more storage if a node is often linked. 1165 Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures or their tool-set does not provide them. 1166 1167 % Note, linear access may not be realistic unless dynamic size changes may occur; 1168 % if the nodes are known to be adjacent, use an array. 1169 1170 % In a wrapped-reference list, list nodes are allocated separately from the items put into the list. 1171 % Intrusive beats wrapped at the smaller lengths, and when shuffling is avoided, because intrusive avoids dynamic memory allocation for list nodes. 1172 1173 % STL's performance is not affected by element order in memory. 1174 %The field of intrusive lists begins with length-1 operations costing around 10 ns and enjoys a ``sweet spot'' in lengths 10--100 of 5--7-ns operations. 1175 % This much is also unaffected by element order. 1176 % Beyond this point, shuffled-element list performance worsens drastically, losing to STL beyond about half a million elements, and never particularly leveling off. 1177 % In the same range, an unshuffled list sees some degradation, but holds onto a 1--2 $\times$ speedup over STL.
1178 1179 % The apparent intrusive ``sweet spot,'' particularly its better-than-length-1 speed, is not because of list operations truly running faster. 1180 % Rather, the worsening as length decreases reflects the per-operation share of harness overheads incurred at the outer-loop level. 1181 % Disabling the harness's ability to drive interleaving, even though the current scenario is using a ``never work in middle'' interleave, made this rise disappear. 1182 % Subsequent analyses use length-controlled relative performance when comparing intrusive implementations, making this curiosity disappear. 1183 1184 % The remaining big-swing comparison points say more about a computer's memory hierarchy than about linked lists. 1185 % The tests in this chapter are only inserting and removing. 1186 % They are not operating on any user payload data that is being listed. 1187 % The drastic differences at large list lengths reflect differences in link-field storage density and in correlation of link-field order to element order. 1188 % These differences are inherent to the two list models. 1189 1190 % A wrapped-reference list's separate nodes are allocated right beside each other in this experiment, because no other memory allocation action is happening. 1191 % As a result, the interlinked nodes of the STL list are generally referencing their immediate neighbours. 1192 % This pattern occurs regardless of user-item shuffling because this test's ``use'' of the user-items' array is limited to storing element addresses. 1193 % This experiment, driving an STL list, is simply not touching the memory that holds the user data. 1194 % Because the interlinked nodes, being the only touched memory, are generally adjacent, this case too has high memory locality and stays fast. 1195 1196 % But the comparison of unshuffled intrusive with wrapped-reference gives the performance of these two styles, with their the common impediment of overfilling the cache removed. 
% Intrusive consistently beats wrapped-reference by about 20 ns, at all sizes. 1198 % This difference is appreciable below list length 0.5 M, and enormous below 10 K. 1199 1200 1201 \subsection{Result: Intrusive Winners and Losers} 1202 \label{s:ComparingIntrusiveImplementations} 1203 1204 The preceding result shows the intrusive frameworks have better performance than the wrapped lists for small- to medium-sized lists. 1205 This analysis covers the experiment position taken in \VRef{s:AddRemovePerformance} for movement, polarity, and accessor. 1206 \VRef[Figure]{f:ExperimentOperations} shows the experiment use cases tested, which results in 12 experiments (I--XII) for comparing intrusive frameworks. 1207 To preclude hardware interference, only list sizes below 150 are examined to differentiate among the intrusive frameworks. 1208 The data is selected from the start of \VRef[Figures]{f:Linear-swift}--\subref*{f:Linear-java}, but the start of \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} is largely the same. 1209 1045 1210 1046 1211 \begin{figure} 1047 1212 \centering 1048 1213 \includegraphics{plot-list-1ord.pdf} 1049 \caption{Histogram of operationdurations, decomposed by all first-order effects.1050 Each of the three breakdowns divides the entire population of test results into its mutually disjoint constituents. Higher in column is better}1214 \caption{Histogram of IR durations, decomposed by all first-order effects. 1215 Each of the three breakdowns divides the entire population of test results into its mutually disjoint constituents. The measure is duration; lower is better.} 1051 1216 \label{fig:plot-list-1ord} 1052 1217 \end{figure} … … 1059 1224 The size effect is more pronounced on the AMD with its smaller L3 cache than it is on the Intel. 1060 1225 (No NUMA effects for these list sizes.)
1061 Specifically, a 20\% standard deviation exists here, between the means four physical-effect categories.1226 Specifically, a 20\% standard deviation exists here, between the means of the four physical-effect categories. 1062 1227 The key takeaway for this comparison is the context it establishes for interpreting the following framework comparisons. 1063 Both the particulars of a themachine's cache design, and a list length's effect on the program's cache friendliness, affect insert/remove speed in the manner illlustrated in this breakdown.1228 Both the particulars of a machine's cache design, and a list length's effect on the program's cache friendliness, affect insert/remove speed in the manner illustrated in this breakdown. 1064 1229 That is, if you are running on an unknown machine, at a scale above anomaly-prone individuals, and below where major LLC caching effects take over the general intrusive-list advantage, but with an unknown relationship to the sizing of your fickle low-level caches, you are likely to experience an unpredictable speed impact on the order of 20\%. 1065 1230 1066 A similar situation comes from \VRef[Figure]{fig:plot-list-1ord}'s second comparison, by operation type.1231 A similar situation comes from \VRef[Figure]{fig:plot-list-1ord}'s second comparison, by use case. 1067 1232 Specific interactions do occur, like framework X doing better on stacks than on queues; a selection of these is addressed in \VRef[Figure]{fig:plot-list-2ord} and discussed shortly.
1068 But they are so irrelevant to the issue of picking a winning framework that it is sufficient here to number the operations opaquely.1069 Whether a given list implementationis suitable for a language's general library succeeds or fails without knowledge of whether your use will have stack or queue movement.1070 So you face another lottery, with a likely win-loss range of the standard deviation of the individual operations' means: 9\%.1233 But they are so irrelevant to the issue of picking a winning framework that it is sufficient here to number the use cases opaquely. 1234 Whether a given list framework is suitable for a language's general library succeeds or fails without knowledge of whether your use will have stack or queue movement. 1235 So you face another lottery, with a likely win-loss range of the standard deviation of the individual use cases' means: 9\%. 1071 1236 1072 1237 This context helps interpret \VRef[Figure]{fig:plot-list-1ord}'s final comparison, by framework. … … 1076 1241 1077 1242 Now, the LQs do indeed beat the UW languages by 15\%, a fact explored further in \MLB{TODO: xref}. 1078 But so too does operation VIII typically beat operationIV by 38\%.1243 But so too does use case VIII typically beat use case IV by 38\%. 1079 1244 As does a small size on the Intel typically beat a medium size on the AMD by 66\%. 1080 1245 Framework choice is simply not where you stand to win or lose the most. 1081 1246 1247 1248 \subsection{Intrusive Sweet and Sore Spots} 1249 1082 1250 \begin{figure} 1083 1251 \centering 1084 1252 \includegraphics{plot-list-2ord.pdf} 1085 \caption{Histogram of operationdurations, illustrating interactions with framework.1253 \caption{Histogram of IR durations, illustrating interactions with framework. 1086 1254 Each distribution shows how its framework reacts to a single other factor being varied across one pair of options. 
1087 1255 Every (binned and mean-contributing) individual data point represents a pair of test setups, one with the criterion set to the option labelled at the top; the other setup uses the bottom option. … … 1094 1262 \VRef[Figure]{fig:plot-list-1ord} stays razor-focused on only first-order effects in order to contextualize a winner/loser framework observation. 1095 1263 But this perspective cannot address questions like, ``Where are \CFA's sore spots?'' 1096 Moreover, the shallow threatment of operations by ordinals said nothing about how stack usage compares with queues'.1264 Moreover, the shallow treatment of use cases by ordinals said nothing about how stack usage compares with queues'. 1097 1265 1098 1266 \VRef[Figure]{fig:plot-list-2ord} provides such answers. … … 1117 1285 The strongest effect is \CFA's aversion to removal by element---certainly an opportunity for improvement. 1118 1286 1287 1288 \subsection{\CFA Tiny-Size Anomaly} 1289 1290 The \CFA list occasionally showed a concerning slowdown at length 1. 1291 The issue, seen in \VRef[Figure]{fig:plot-list-short} (top-left corner), has \CFA taking above 10 ns per IR. 1292 It occurs only for the queue movement, only on the AMD machine, and only for the \CFA framework. 1293 1294 Length-1 performance is an important case. 1295 Lists like those of waiting threads are frequently left empty, with the occasional thread (or few) momentarily joining. 1296 These scenarios must perform well. 1297 1298 A cause of this behaviour was never determined. 1299 Speculation is that \CFA's increased data dependency, a result of the tagging scheme, pairs poorly with the situation implied by queue movement. 1300 The aliasing, at length 1, is: the head's first element is the head's last element. 1301 With stack movement, one of these aliases is used twice, while with queue movement, both are used in alternation. 1302 1303 The breakdowns earlier in the performance assessment work by varying length only.
1304 That is, they see the story down the leftmost column in a triangle. 1305 The insight for contextualizing this issue was to inspect both length and width. 1306 1307 The issue is practically mitigated by noticing that the difficulty fades away as width increases. 1308 This effect is seen both in \VRef[Figure]{fig:plot-list-short}'s easement across the top triangle rows, and, zoomed farther out, in \VRef[Figure]{fig:plot-list-wide}. 1309 1310 Increasing the width matters to the aliasing hypothesis. 1311 In a narrow experiment, one element's insert and remove happen in rapid succession. 1312 So, the two aliases are exercised closer together, making a data hazard (that lacks ideal hardware treatment) stretch the instruction-pipeline schedule more significantly. 1313 Increasing the width adds harness-induced gaps between the uses of each alias, behind which a potential hazard can hide. 1314 1315 In the practical scenario that judges length-1 performance as relevant, width 1 is contrived. 1316 A thread putting itself on an often-empty waiters' list is not doing so on one such list repeatedly, at least not without taking other situation-induced pauses. 1317 1318 Thus, the congestion at low width and length comes from the harness using repetition (in order to obtain a measurable time). 1319 It does not reflect the situation that motivates the legitimate desire for good length-1 performance. 1320 1321 There likely is a real hazard, unique to the \CFA framework, when a queue movement is repeated on a tiny list \emph{without other intervening action}. 1322 Doing so is believed to occur only in contrived situations.
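The width mitigation can be illustrated with a toy harness, a sketch only, using Python deques rather than the real rig: at width $W$, the repetition that times one IR is spread round-robin over $W$ length-1 lists, so a gap separates successive touches of any one list's aliased first/last element. The function name and structure are invented for illustration.

```python
from collections import deque

def run_queue_ir(width, operations):
    """Round-robin queue movement over `width` lists, each held at length 1.

    Each IR inserts at the front and removes at the back of one list, then
    moves to the next list; larger widths add a gap before the same list's
    aliased first/last element is exercised again.
    """
    lists = [deque([None]) for _ in range(width)]
    for i in range(operations):
        lst = lists[i % width]   # round-robin selection across the width
        lst.appendleft(i)        # insert-first ...
        lst.pop()                # ... queue movement: remove at the other end
    return [len(lst) for lst in lists]
```

Width 1 degenerates to hammering a single list back-to-back, the contrived case; any larger width interleaves the lists, modelling the gaps a real waiters'-list workload would have, while every list stays at length 1 throughout.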
1323 1324 1325 \begin{figure} 1326 \centering 1327 \includegraphics[trim={00in, 5.5in, 0in, 0in}, clip, scale=0.8]{plot-list-short-temp.pdf} 1328 \caption{Behaviour at very short lengths.} 1329 \label{fig:plot-list-short} 1330 \end{figure} 1331 1332 \begin{figure} 1333 \centering 1334 \includegraphics[trim={0.25in, 1in, 0.25in, 1in}, clip, scale=0.5]{plot-list-wide-temp.pdf} 1335 \caption{Length-1 anomaly resolving at modest width. Points are for varying widths, at fixed length 1.} 1336 \label{fig:plot-list-wide} 1337 \end{figure} 1338 1119 1339 \begin{comment} 1120 1340 These remarks are mostly about 3ord over 2ord. … … 1123 1343 They illustrate the difficult signal-to-noise ratio that I had to overcome in preparing this data. 1124 1344 They may serve as a reference guiding future \CFA linked-list work by informing on where to target improvements. 1125 Finally, the findings offer the conclusion that \CFA's list offers more cons ostent performance across usage scenarios, than the other lists.1345 Finally, the findings offer the conclusion that \CFA's list offers more consistent performance across usage scenarios, than the other lists. 1126 1346 \end{comment} 1127 1347 … … 1136 1356 \end{comment} 1137 1357 1138 1139 1140 \MLB{ TODO: find a home for these original conclusions: 1141 cfa-upp similarity holde for all halves by movement or polarity; 1142 splitting on accessor, \CFA has a poor result on element removal, LQ-list has a great result on the other accessors, and uC++ is unaffected. } 1143 1358 \begin{comment} 1144 1359 1145 1360 \begin{figure} … … 1175 1390 The error bars show fastest and slowest time seen on five trials, and the central point is the mean of the remaining three trials. 1176 1391 For readability, the points are slightly staggered at a given horizontal value, where the points might otherwise appear on top of each other. 
1177 The experiment runs twelve operating scenarios;1392 The experiment runs twelve use cases; 1178 1393 the ones chosen for their variety are scenarios I and VIII from the listing of \VRef[Figure]{fig:plot-list-mchn-szz}, and their results appear in the rows. 1179 1394 As in the previous experiment, each hardware architecture appears in a column. … … 1218 1433 With this adjustment, absolute duration values (in nonsecods) are lost. 1219 1434 In return, the physical quadrants are re-combined, enabling assessment of the non-physical factors. 1435 \end{comment} 1220 1436 1221 1437 \begin{comment} -
doc/theses/mike_brooks_MMath/plots/ListCommon.py
r99bc47b r4cf8832 68 68 69 69 explanations = ['movement', 'polarity', 'accessor', 70 'NumNodes', 70 'NumNodes', 'Width', 'Length', 71 71 'SizeZone', # note fd: NumNodes -> SizeZone 72 72 'fx', … … 192 192 193 193 def getSingleResults( 194 dsname = 'general',194 dsnames = ['general'], 195 195 machines = allMachines, 196 196 *, … … 202 202 203 203 timings = pd.concat([ 204 getMachineDataset( dsname, m ) 204 getMachineDataset( d, m ) 205 for d in dsnames 205 206 for m in machines ]) 206 207 … … 287 288 288 289 def printManySummary(*, 289 dsname = 'general',290 dsnames = ['general'], 290 291 machines = allMachines, 291 292 metafileCore, … … 302 303 303 304 for op in metadata.itertuples(): 304 timings = getSingleResults(dsname , machines,305 timings = getSingleResults(dsnames, machines, 305 306 fxs=fxs, 306 307 tgtMovement = op.movement, … … 324 325 325 326 def printSingleDetail( 326 dsname = 'general',327 dsnames = ['general'], 327 328 machines = allMachines, 328 329 *, … … 336 337 337 338 338 timings = getSingleResults(dsname , machines,339 timings = getSingleResults(dsnames, machines, 339 340 fxs = fxs, 340 341 tgtMovement = tgtMovement, -
doc/theses/mike_brooks_MMath/plots/list-1ord.py
r99bc47b r4cf8832 13 13 ops = ['movement', 'polarity', 'accessor'] 14 14 fx = ['fx'] 15 bkgnd = ['NumNodes'] # never drilled/marginalized, always conditioned 15 bkgnd = ['NumNodes'] # never drilled/marginalized, always conditioned 16 ignore = [ 'InterleaveFrac', # unused ever and always zero 17 'Width', # unused here and always one 18 'Length' ] # unused here and always =NumNodes 16 19 20 # assure every explanation is classified 17 21 assert( set( explanations ) 18 - set( ['InterleaveFrac'] ) # unused and always zero22 - set( ignore ) 19 23 == 20 24 set(physicals) | set(ops) | set(fx) | set(bkgnd) ) -
doc/theses/mike_brooks_MMath/plots/list-zoomout-noshuf-java.py
r99bc47b r4cf8832 8 8 9 9 printSingleDetail( 10 dsname ='zoomout-noshuf',10 dsnames=['zoomout-noshuf'], 11 11 machines=['java'], 12 12 tgtMovement = 'stack', -
doc/theses/mike_brooks_MMath/plots/list-zoomout-noshuf-swift.py
r99bc47b r4cf8832 8 8 9 9 printSingleDetail( 10 dsname ='zoomout-noshuf',10 dsnames=['zoomout-noshuf'], 11 11 machines=['swift'], 12 12 tgtMovement = 'stack', -
doc/theses/mike_brooks_MMath/plots/list-zoomout-shuf-java.py
r99bc47b r4cf8832 8 8 9 9 printSingleDetail( 10 dsname ='zoomout-shuf',10 dsnames=['zoomout-shuf'], 11 11 machines=['java'], 12 12 tgtMovement = 'stack', -
doc/theses/mike_brooks_MMath/plots/list-zoomout-shuf-swift.py
r99bc47b r4cf8832 8 8 9 9 printSingleDetail( 10 dsname ='zoomout-shuf',10 dsnames=['zoomout-shuf'], 11 11 machines=['swift'], 12 12 tgtMovement = 'stack',