Changeset eeefc0c for doc/theses/mike_brooks_MMath/list.tex
- Timestamp: Apr 26, 2026, 10:51:51 AM (3 days ago)
- Branches: master
- Children: ac9c0ee8
- Parents: 7184906
- File: doc/theses/mike_brooks_MMath/list.tex (modified) (29 diffs)
doc/theses/mike_brooks_MMath/list.tex
r7184906 reeefc0c
The expansion and underlying API are under discussion.
TODO: explain pivot from ``is at done?'' to ``has more?''
Advantages of this change include being able to pass ranges to functions, for example, projecting a numerically regular subsequence of array entries, and being able to use the loop syntax to cover more collection types, such as looping over the keys of a hash table.

When iterating an empty list, the question, ``Is there a further element?'' needs to be posed once, receiving the answer, ``no.''
… …
The goal is to show the \CFA lists are competitive with other designs, but the different list designs may not have equivalent functionality, so it is impossible to select a winner encompassing both functionality and execution performance.


\subsection{Experiment Design}

This section explains how the experiment is built.
Many of the following parts define terminology concerning tuning knobs.
\VRef[Figure]{f:ListPerfGlossary} provides a consolidated reference.

\begin{figure}
… …
-- Movement & \\
\quad $\ni$ stack
& IRs happen at the same end. \\
\quad $\ni$ queue
& IRs happen at opposite ends. \\
-- Polarity
& Which of the two orientations in which the movement happens. \\
… …
& How an insertion position, or removal element, is specified. The same position/element is picked either way.
\\
\quad $\ni$ all head
& IRs both through the head \\
\quad $\ni$ insert element
& insert by element and remove through the head \\
… …

\subsubsection{Add-Remove Performance}
\label{s:AddRemovePerformance}
… …
This experiment takes the position that:
\begin{enumerate}[leftmargin=*]
\item The total time to add and remove is relevant, as opposed to having one time for adding and a separate time for removing.
Adds without removes quickly fill memory;
… …
\item Speed differences caused by the host machine's memory hierarchy need to be identified and explained,
but do not represent advantages of one framework over another.
\end{enumerate}

The experiment used to measure IR cost measures the mean duration of a sequence of additions and removals.
The distribution of speeds experienced by an individual add-remove pair (tail latency) is not discussed.
Space efficiency is shown only indirectly, by way of caches' impact on speed.
… …
In all cases, the quantity discussed is the duration of one insert-remove (IR).
An IR is the time taken to do one innermost insertion-loop iteration, one innermost removal-loop iteration, and its share of all overheads, amortized.
Lower IR is better.
This experiment typically does an IR in 1--10 ns.
The short end of this range has durations of single-digit clock-cycle counts.
… …
Often, an IR duration value needs to be considered relatively.
For example, \VRef{s:SweetSoreSpots} asks whether one linked-list implementation is more sensitive than another to the computer architecture.
A finding might be that a machine slows implementation A by 10\% and B by 20\%.
This finding is not saying that A is faster than B (on either machine).
The finding could stand if B starts faster and then levels off, if B starts slower and gets worse, or in myriad other cases.
The finding asserts that such distinctions are not what is immediately relevant.
The arithmetic producing the percentages removes the information about which implementation starts, or ends up, faster.
Each implementation's to-machine duration is stated relative to \emph{the same implementation's} from-machine duration.
The resulting measure is still about a duration.
… …
}
stopTimer();
reportedDuration = getTimerDuration() / totalOpsDone; // throughput per IR operation
\end{cfa}
To reduce administrative overhead, the $n$ nodes for each experiment list are preallocated in an array (on the stack), which removes dynamic allocations for this storage.
… …
After each round, a counter is incremented by $n$ (for throughput).
Time is measured outside the loop because a large $n$ can overrun the time duration before the @CONTINUE@ flag is tested.
Hence, there is a minimum of one outer (@CONTINUE@) loop iteration for large lists.
The loop duration is divided by the counter and this throughput is reported.
In a scatter-plot, each dot is one throughput, which means insert + remove + harness overhead.
… …
To test list operations, the experiment performs the inserts/removes in different patterns, \eg insert and remove from front, insert from front and remove from back, random insert and remove, \etc.
Unfortunately, the @std::list@ does \emph{not} support direct IR from a node without an iterator, \ie no @erase( node )@, even though the list is doubly-linked.
To eliminate the iterator, a trick is used for random insertions without replacement, which takes advantage of the array nature of the nodes.
The @i@ fields in each node are initialized from @0..n-1@.
… …
Hence, the traversal is the same as the non-random traversal above.
To level the experiments, an explicit access to the random node is inserted after the insertion, @temp.j = 0@, for the wrapped experiment.
Furthermore, it is rare to IR nodes and not access them.

% \emph{Interleaving} allows for movements other than pure stack and queue.
… …
\subsubsection{Use Cases}
\label{s:UseCases}

Where \VRef[Figure]{f:ListPerfGlossary} enumerates the specific values, recall the use-case dimensions are:
\begin{description}
\item[movement ($\times 2$)]
In these experiments, strict stack and queue patterns are tested.
\item[polarity ($\times 2$)]
Obtain one polarity from the other by reversing uses of first/last.
\item[accessor ($\times 3$)]
Giving an add/remove location by a list head's first/last, \vs by a preexisting reference to an individual element.
\end{description}

\begin{figure}
… …
\end{figure}

A use case is a specific selection of movement, polarity and accessor.
These experiments run twelve use cases.
… …
The accessor patterns, at the (\CFA) API level, are:
\begin{description}
\item[all (through the) head:] Both IRs happen through the list head. The list head operations are @insert_first@, @insert_last@, @remove_first@ and @remove_last@.
\begin{sloppypar}
\item[insert (of known) element] \dots and remove through head: Inserts use @insert_before(e, first)@ or @insert_after(e, last)@, where @e@ is being inserted and @first@/@last@ are element references known by list-independent means.
\end{sloppypar}
\item[remove (of known) element] \dots and insert through the head: Removes use @remove(e)@, where @e@ is being removed. List-independent knowledge establishes that @e@ is first or last, as appropriate.
\end{description}
… …
Comparing all-head with remove-element gives the relative performance of head-mediated \vs element-oriented removal, because both use the same insertion style.


\subsubsection{Sizing}

Intuition suggests measuring IR for different sized lists should just be a multiple of the single linking/unlinking of a node.
However, there is a scaling issue as more memory is being read and written, where caching comes into play.
But caching is more than the amount of memory being accessed;
the access pattern is equally important.
Aggressive instruction-level parallelism scheduling, which enables short IR times, is the amplifier, \eg a data dependency is a critical path in one situation but not in another.
Therefore, the duration response to size is not a steady worsening as size increases.
Often, each size-independent configuration responds to size increases in steps of slowdown.
Occasionally a slowdown step is followed by some performance increase, where an incurred penalty begins to amortize away.
Hence, performance results can have interesting jitter as size increases.
The analysis treats these behaviours as incidental.
It does not try to characterize various exact-size responses.
Rather, size zones are picked, specific effects inside of a zone are averaged away, and the story at one zone is compared to that at another.

% It is true, but perhaps not obvious, that buildind and destroying long lists is slower than building and destroying short lists.
% Obviously, indeed, it takes longer to fuse and divide a hundred neighbours than five.
% But the key metric in this work, AII, is about a single link--unlink.
% So, critically, linking and unlinking a hundred neighbours actually takes \emph{more} than $20\times$ the time for five neighbours.
% The main reason is caching; when more neighbours are being manipulated, more memory is being read and written.
%
% But caching success is about more than the amount of memory worked on.
% Subtle changes in pattern become butterfly effects.
% Aggressive ILP scheduling, which enables short AIR times, is the amplifier.
% A data dependency, present in one framework but not another, is critical path in one situation but in not another.
% So, duration's response to size is not a steady worsening as size increases.
% Rather, each size-independent configuration often responds to size increases with leaps of worsening.
% Occasionally a leap is even followed size-run of retrograde response, where a suddenly incurred penalty has a chance to ammortize away.
% The frameworks tend to leapfrog over each other, at different points, as size increases.
%
% The analysis treats these behaviours as incidental.
% It does not try to characterize various exact-size responses.
% Rather, size zones are picked, specific effects inside of a zone are averaged away, and the story at one zone is compared to that at another zone.

\begin{figure}
… …
}
\end{tabular}
\caption{Variety of IR duration \vs list length, at small--medium lengths. Two example use cases are shown: I, stack movement with head-only access (plot a); VIII, queue movement with element-oriented removal access (plot b); both use cases have insert-first polarity.
One example is run on each machine: UC-I on AMD (plot a); UC-VIII on Intel (plot b). Lower is better.}
\label{fig:plot-list-zoomin-abs}
\end{figure}

\VRef[Figure]{fig:plot-list-zoomin-abs} gives two example responses to size.
% The dataset here is a small portion of the overall result and it is premature to attempt conclusions about framework differences from it.
These two example cases show how differently a pair of individual configurations behave.
% Of more immediate significance, they also have a pattern repeated, in all eight of their size responses.
% Note the ``small'' and ``medium'' overlaid boxes, which call out the size zones' definitions.
Outside of an identified box, size response is erratic.
Inside a box, size response is relatively smooth.
Within and among boxes there are identifiable patterns, which occur throughout all the experimental results.
Each individual configuration is tested by five trials, giving the error bars at min and max.
The amount of error here is typical across the configurations.
With a few exceptions, it is modest, so experiments are repeatable.

To preview, \VRef{s:ResultCoarseComparisonStyles} dismisses large sizes (above 150 elements) and wrapped lists, because the performance story is dominated by the amount of memory touched, not by intrusive \vs wrapped lists.
At smaller sizes, \VRef{s:ComparingIntrusiveImplementations} shows differences appear among the intrusive-list implementations.
Among the ``not Large'' sizes ($\le$ 150), two zones, Small and Medium, are selected as representatives of what can vary when the scale is changed.
These particular ranges are chosen because each range tends to have one story repeated across its constituent sizes.
For example, if \CFA's duration increases across Small, then the other frameworks' usually do too, or if \CFA is beating \uCpp at the low end of Medium, then it usually is at the high end too.
% The leapfrogging tends to happen outside of these two ranges.

Finally, on the AMD architecture, \CFA performed poorly at size 1, on queue movements only, and no other framework saw the same effect.
This extreme outlier is not plotted in the graphs.
After exploring the phenomenon in depth (not presented), the conclusion is a quirky interaction between the hardware and the testing harness.
A side experiment (that does not enrich the overall comparisons) saw user-induced gaps of $\approx$10 ns between same-list operations hide the effect completely.
These gaps are realistic because when an item goes on a list, another action comes back to it \emph{later}.
The pattern that the general harness uses, concentrating time-adjacent operations on one list, is useful for measuring the ``small'' size zone, but is contrived from the perspective of a data hazard that only this pattern exposes.
The general comparisons do not see the effect at all, because they use only the Small and Medium zones, with a shortest length of 4.

% A spot of poor performance appears in the general results for \CFA at size 1.
% Section \MLB{TODO:xref} explores the phenomenon and concludes that it is an anomaly due to a quirky interaction with the testing rig.
% To do so, two it considers size as either length or width.
% Length is the number of elements in a list.
% Width is a number of these lists being kept, worked upon in round-robin order.
% Outside of \MLB{TODO:xref}, size always means length, and width is 1.

… …
The call to @pass@ can prevent a small number of compiler optimizations but this cost is the same for all lists.

The main difference in the machines is their cache structure.
The AMD has smaller caches that are shared less, while the Intel shares larger caches among more processors.
This difference, while an interesting tradeoff for highly concurrent use, is rather one-sided for sequential use, such as this experiment.
Specifically, the Intel offers a single processor a bigger cache.


\subsubsection{Recap and Master Legend}

For experiments performed in later sections, there are 12 use cases, which are all combinations of 2 movements, 2 polarities and 3 accessors.
There are 4 physical contexts, which are all combinations of 2 machines and 2 size (length) zones.
Each physical context samples 4 specific sizes.
There are 3.25 frameworks.
This accounting considers how LQ-list supports only the movement--polarity combination ``stack, insert first.''
So LQ-list fills a quarter of the otherwise-orthogonal space.
Use case, physical context and framework are the explanatory factors.
Taking all combinations of the explanatory factors gives 12 $\times$ 4 $\times$ 4 $\times$ 3.25 = 624 individual configurations.

% \[
% \textrm{624 individual configurations} =
% \sum_{\substack{
% \textrm{12 use cases}\\
% \textrm{4 physical contexts}\\
% \textrm{4 specific sizes}\\
% \textrm{3.25 frameworks}
% }}
% \textrm{1 individual configuration}
% \]

\begin{figure}
… …
}
\end{tabular}
\caption{IR duration, transformed for general analysis. The analysis follows the single example setup of \VRef[Figure]{f:zoomin-abs-i-swift}, \ie Use Case I on AMD, where IR is given as absolute duration. Plot (a) transforms the source dataset by conditioning on specific size. Plot (b) takes the results from only the identified size zones, discards their specific-size information, and shows the resulting distribution.
Lower is better.}
\label{fig:plot-list-rel}
\end{figure}

\begin{comment}
32 of these individual configurations, plucked from \VRef[Figure]{f:zoomin-abs-i-swift}, are the subject of \VRef[Figure]{fig:plot-list-rel}, where they are now transformed into the format used for general analysis.
In \subref*{f:zoomin-rel-i-swift}, each of the 56 data points is an individual configuration; the subset within the two boxes has the 32 of interest.
… …
That is, inter-configuration rollups discard the modest trial-repeatability error.
The girth of a histogram's distribution is entirely the inter-configuration variance, of its configurations' expected performance.
\end{comment}

It is impossible to present this large amount of information in graphs.
Therefore, a condensed graphing style is used in subsequent plots.
\VRef[Figure]{fig:plot-list-rel} shows how the condensed graphing style is generated from raw data.
\VRef[Figure]{f:zoomin-rel-i-swift} is formed from the data in \VRef[Figure]{f:zoomin-abs-i-swift}, restructured on the Y-axis using a relative duration.
\VRef[Figure]{f:zoomin-histo-i-swift} shows the interesting data within the two boxes (Small/Medium) and their combination (Both).
This graph plots a vertical histogram for each of the 4 lists.
The light-shaded histogram is the raw data (similar data values overlap), and the dark histogram is the geomean average when there are multiple experiments condensed in a column.
The caption indicates the number of values condensed into this histogram, \eg ``/4'' $\Rightarrow$ 4 data points.
The vertical relationship among the averages gives a quick result for a specific experiment, where lower is better.
The relative duration smooths the results, where smoothness diminishes as size increases.
This smoothness gives nicely separated histograms.
Thus, in the forthcoming comparison plots:

% \begin{itemize}[leftmargin=*]
% \item
% The measure is mean IR among the middle 3 trials out of 5, that occurred for an individual configuration.
% \item
% The number of individual configurations per histogram is stated as ``/N,'' at a relevant granularity.
% \item
% All reported averages are geometric means and all IR duration axes (verticals) are logarithmic.
% \item
% Unless indicated otherwise, all explanatory factors appearing on a plot are marginalized, while those not appearing on the plot are conditioned.
% \end{itemize}


\subsection{Result: Coarse comparison of styles}
\label{s:ResultCoarseComparisonStyles}

This comparison establishes how an intrusive list performs compared with a wrapped-reference list.
\VRef[Figure]{fig:plot-list-zoomout} presents throughput at various list lengths for a linear and random (shuffled) IR test.
Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing different ends of the access spectrum.
In the graphs, all four intrusive lists (lq-list, lq-tailq, \uCpp, \CFA, see Framework in \VRef[Figure]{f:ListPerfGlossary}) are plotted with the same symbol;
sometimes these symbols clump on top of each other, showing the performance difference among intrusive lists is small in comparison to the wrapped list (std::list).
See~\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists.

The list lengths start at 10 due to the short IR times of 2--4 ns for intrusive lists \vs STL's wrapped-reference list of 15--20 ns.
For very short lists, like 4, the experiment time of 4 $\times$ 2.5 ns and experiment overhead (loops) of 2--4 ns results in an artificial administrative bump at the start of the graph having nothing to do with the IR times.
As the list size grows, the administrative overhead for intrusive lists quickly disappears.
… …
\caption{Insert/remove duration \vs list length.
Lengths go as large as possible without error.
One example use case is shown: stack movement, insert-first polarity and head-mediated access.
Lower is better.}
\label{fig:plot-list-zoomout}
\end{figure}

The key performance factor between intrusive and wrapped-reference lists is the dynamic allocation for the wrapped nodes.
Hence, this experiment is largely measuring the cost of @malloc@/\-@free@ rather than IR, and is sensitive to the layout of memory by the allocator.
For intrusive-list IR, the cost is manipulating the link fields, which is seen by the relatively similar results for the different intrusive lists.
For wrapped-reference IR, the costs are: dynamically allocating/deallocating a wrapped node, copying an external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
the allocation dominates these costs.
For example, the experiment was run with both the glibc and llheap memory allocators, where llheap reduced the cost from 20 to 16 ns, still far from the 2--4 ns for linking an intrusive node.
… …
The data is selected from the start of \VRef[Figures]{f:Linear-swift}--\subref*{f:Linear-java}, but the start of \VRef[Figures]{f:Random-swift}--\subref*{f:Random-java} is largely the same.
 \begin{figure}
 \centering
-\includegraphics{plot-list-1ord.pdf} \\
+\includegraphics{plot-list-1ord.pdf}
 \small{\textsuperscript{\textdagger} LQ-@list@ is (/48) by its incomplete (25\%) use case coverage. Its bars are scaled to match.}
 \caption{Histogram of IR durations, decomposed by all first-order effects.
-Each of the three breakdowns divides the entire population of test results into its mutually disjoint constituents. Lower duration is better.
+Each of the three breakdowns divides the entire population of test results into its mutually disjoint constituents. Lower is better.
 }
 \label{fig:plot-list-1ord}
…
 \VRef[Figure]{fig:plot-list-1ord} gives the first-order effects.
-The first breakdown, architecture/size-zone (left), showing the overall performance of all 12 experiment on the two different hardware architectures.
-The relative experiment duration for each experiment is shown as a bar in each column and the black bar in that column shows the average of all 12 experiments.
-By inspection, Intel runs faster than AMD.
-As well, the small zone (lists of 4--16 elements) runs faster than the medium zone (lists of 50--200 elements).
+The first breakdown, architecture/size-zone (left), shows the overall performance of all 12 experiments on the two different hardware architectures for small and medium lists (624 / 4 = 156 experiments per column).
+% The relative experiment duration for each experiment is shown as a bar in each column and the black bar in that column shows the average of all 12 experiments.
+By inspection of the averages, Intel runs faster than AMD.
+Within an architecture, the small zone (lists of 4--16 elements) runs faster than the medium zone (lists of 50--200 elements).
+The overall slower execution on the AMD results from its smaller L3 cache \vs the larger cache on the Intel.
 (No NUMA effects for these list sizes.)
 Specifically, a 20\% standard deviation exists here, between the means of the four physical-effect categories.
-The key takeaway for this comparison is the context it establishes for interpreting the following framework comparisons.
-Both the particulars of a machine's cache design, and a list length's effect on the program's cache friendliness, affect insert/remove speed in the manner illlustrated in this breakdown.
-That is, if you are running on an unknown machine, at a scale above anomaly-prone individuals, and below where major LLC caching effects take over the general intrusive-list advantage, but with an unknown relationship to the sizing of your fickle low-level caches, you are likely to experience an unpredictable speed impact on the order of 20\%.
-
-A similar situation comes from \VRef[Figure]{fig:plot-list-1ord}'s second comparison, by use case.
-Specific interactions do occur, like framework X doing better on stacks than on queues; a selection of these is addressed in \VRef[Figure]{fig:plot-list-2ord} and discussed shortly.
-But they are so irrelevant to the issue of picking a winning framework that it is sufficient here to number the use cases opaquely.
-Whether a given list framework is suitable for a language's general library succeeds or fails without knowledge of whether your use will have stack or queue movement.
-So you face another lottery, with a likely win-loss range of the standard deviation of the individual use cases' means: 9\%.
-
-This context helps interpret \VRef[Figure]{fig:plot-list-1ord}'s final comparison, by framework.
-In this result, \CFA runs similarly to \uCpp and LQ-@list@ runs similarly to @tailq@.
+These hardware effects are accounted for when interpreting the following framework comparisons.
+% The key takeaway for this comparison is the context it establishes for interpreting the following framework comparisons.
+% Both the particulars of a machine's cache design, and a list length's effect on the program's cache friendliness, affect IR speed in the manner illustrated in this breakdown.
+% That is, if you are running on an unknown machine, at a scale above anomaly-prone individuals, and below where major LLC caching effects take over the general intrusive-list advantage, but with an unknown relationship to the sizing of your fickle low-level caches, you are likely to experience an unpredictable speed impact on the order of 20\%.
+
+The second breakdown, use case (middle), shows the overall performance for each of the 12 use cases from \VRef[Figure]{f:ExperimentOperations} (624 / 12 = 52 experiments per column).
+% A similar situation comes from \VRef[Figure]{fig:plot-list-1ord}'s second comparison, by use case.
+While specific differences do occur, like framework X doing better on stacks than on queues, the overall range of the standard deviation of the individual use cases' means is only 9\%, indicating no unusual cases.
+A more detailed analysis occurs in the discussion of \VRef[Figure]{fig:plot-list-2ord}.
+% But they are so irrelevant to the issue of picking a winning framework that it is sufficient here to number the use cases opaquely.
+% Whether a given list framework is suitable for a language's general library succeeds or fails without knowledge of whether your use will have stack or queue movement.
+% So you face another lottery, with a likely win-loss range of the standard deviation of the individual use cases' means: 9\%.
+
+The third breakdown, framework (right), shows the overall performance of the 4 list implementations (624 / 3.25 = 192 experiments per column).
+Here, \CFA runs similarly to \uCpp and LQ-@list@ runs similarly to @tailq@.
 The standard deviation of the frameworks' means is 8\%.
-Framework choice has, therefore, less impact on your speed than the lottery tickets you already hold.
-
-Now, the LQs do indeed beat the UW languages by 15\%, a fact explored further in \MLB{TODO: xref}.
+% Framework choice has, therefore, less impact on your speed than the lottery tickets you already hold.
+Now, \CFA/\uCpp run slower than LQ-@list@/@tailq@ by 15\%, a fact explored further in \VRef{s:SweetSoreSpots}.
 But so too does use case VIII typically beat use case IV by 38\%.
 As does a small size on the Intel typically beat a medium size on the AMD by 66\%.
-Framework choice is simply not where you stand to win or lose the most.
+Hence, architecture and usage patterns have a significant effect, often larger than the choice of framework.
 
 
 \subsection{Result: Sweet and Sore Spots}
+\label{s:SweetSoreSpots}
 
 \begin{figure}
 \centering
 \includegraphics{plot-list-2ord.pdf}\\
-
-
 \small{
 \textsuperscript{\textdagger} LQ-@list@ is absent from Movement and Polarity comparisons because it does not support queue and insert-last, respectively.\\
…
 }
 \caption{Histogram of IR durations, illustrating interactions with framework.
-Each distribution shows how its framework reacts to a single other factor being varied across one pair of options.
-Every (binned and mean-contributing) individual data point represents a pair of test setups, one with the criterion set to the option labelled at the top; the other setup uses the bottom option.
-This point's y-axis score is the ratio of these setups' durations.
-The point lands in a bin closer to the label of the option that performs better.
+Higher favours top option; lower favours bottom option.
 }
 \label{fig:plot-list-2ord}
 \end{figure}
 
-\VRef[Figure]{fig:plot-list-1ord} stays razor-focused on only first-order effects in order to contextualize a winner/loser framework observation.
-But this perspective cannot address questions like, ``Where are \CFA's sore spots?''
-Moreover, the shallow threatment of use cases by ordinals said nothing about how stack usage compares with queues'.
-
-\VRef[Figure]{fig:plot-list-2ord} provides such answers.
-Its size-zone criterion refines the obvious notion that a small size runs faster than a big size; this issue is by how much.
+% \VRef[Figure]{fig:plot-list-1ord} is focused on only first-order effects in order to contextualize a winner/loser framework observation.
+% But this perspective cannot address questions like, ``Where are \CFA's sore spots?''
+% Moreover, the shallow treatment of use cases by ordinals said nothing about how stack usage compares with queues.
+
+\VRef[Figure]{fig:plot-list-2ord} shows how frameworks react to a single other factor being varied across one pair of options.
+Every (binned and mean-contributing) individual data point represents a pair of test setups, one with the criterion set to the option labelled at the top; the other setup uses the bottom option.
+This point's y-axis score is the ratio of these setups' durations.
+The point lands in a bin closer to the label of the option that performs better.
+
+The first breakdown, size zone (left), refines the notion that a small size runs faster than a big size;
+this issue is by how much.
 Indeed, all means favour small and few tails favour medium.
 But the various frameworks do not respond to the different sizes and machines uniformly.
-On the AMD, \CFA and \uCpp have a modest size sensitivity, LQ-tailq's is moderate amd LQ-list seems unaffected.
-On the Intel, \CFA's increases to moderate, while \uCpp is now unaffected, and both LQs have a dramatic response.
-The Intel is more sensitive to size than the AMD.
-
-Turning next to movement and polarity, the responses appear more subdued.
-Note that LQ-list has no represntation in these comparisons because it only supports stacks that push and pop with the first element.
+On the AMD, \CFA and \uCpp have a modest size sensitivity, LQ-tailq's is moderate, and LQ-list seems unaffected.
+On the Intel, \CFA's increases to moderate, while \uCpp is now unaffected, and both LQs have a large effect.
+Hence, the Intel is more sensitive to size than the AMD.
+
+In the second breakdown, movement and polarity (middle), the responses are more subdued.
+Note, LQ-list has no representation in these comparisons because it only supports stacks that push and pop with the first element.
 \CFA is completely stable under movement and polarity changes.
 \uCpp and LQ show modest responses favouring queues and insertion at last.
 
-Finally, with accessor, a \CFA sore spot emerges.
+In the third breakdown, accessor (right), the responses are close, except for \CFA.
 Note the pair of two-way comparisons pulled from the three experiment setups used.
-First, the all-head/insert-element opposition addresses which insertion style is better---by-head (top) and by-element (bottom).
-Then, the all-head/remove-element opposition addresses which removal style is better---by-head (top) and by-element (bottom).
+First, the all-head/insert-element comparison addresses which insertion style is better---by-head (top) and by-element (bottom).
+Then, the all-head/remove-element comparison addresses which removal style is better---by-head (top) and by-element (bottom).
 The LQs favour insertion by head and removal by element.
 \CFA and \uCpp favour both operations by head.
…
 
+\begin{comment}
 \subsection{\CFA Tiny-Size Anomaly}
…
 \label{fig:plot-list-wide}
 \end{figure}
+\end{comment}
 
 \begin{comment}
…
 
 \begin{comment}
-
-
 \VRef[Figure]{fig:plot-list-zoomin} shows the sizes below 150 blown up.
-% The same scenario as the coarse comparison is used: a stack, with insertions and removals happening at the end called ``first,'' ``head'' or ``front,'' and all changes occurring through a head-provided insert/remove operation.
+% The same scenario as the coarse comparison is used: a stack, with insertions and removals happening at the end called ``first,'' ``head'' or ``front,'' and all changes occurring through a head-provided IR operation.
 The error bars show fastest and slowest time seen on five trials, and the central point is the mean of the remaining three trials.
 For readability, the points are slightly staggered at a given horizontal value, where the points might otherwise appear on top of each other.
…
 Ultimately, this analysis provides options for a future effort that needs to get the most speed out of the \CFA list.
-
-
 \end{comment}