Changeset 16a843b for doc/theses/mike_brooks_MMath
- Timestamp:
- Jul 19, 2025, 2:35:56 AM (2 months ago)
- Branches:
- master
- Children:
- 7640ff5, da10157
- Parents:
- 47064914
- Location:
- doc/theses/mike_brooks_MMath
- Files:
-
- 40 added
- 3 edited
doc/theses/mike_brooks_MMath/Makefile
r47064914 r16a843b 95 95 -include $(Plots)/*.d 96 96 97 ${Build}/plot-%.dat: ${Plots}/%.py ${Plots}/common.py ${Plots}/%.py.INPUTS | ${Build} 97 ${Build}/plot-%.dat: ${Plots}/%.py ${Plots}/common.py ${Plots}/ListCommon.py ${Plots}/%.py.INPUTS | ${Build} 98 98 python3 $< > $@ 99 99 … … 102 102 103 103 #-include ${Build}/*.d 104 105 106 # troubleshooting, e.g. `make echo_DEMOS` runs `echo $(DEMOS)` 107 echo_% : 108 @echo '$($(@:echo_%=%))' 109 -
doc/theses/mike_brooks_MMath/list.tex
r47064914 r16a843b 473 473 474 474 475 \section{Assessment} 476 \label{toc:lst:assess} 477 478 \subsection{Add-Remove Performance} 479 480 The fundamental job of a linked-list library is to manage the links that connect users' items. 481 Any link management is an action that causes pair(s) of elements to become, or cease to be, adjacent. 482 Thus, adding and removing an element are the sole primitive actions. 483 484 Repeated adding and removing is necessary to measure timing because these operations can be as simple as a dozen instructions. 485 These instruction sequences may have cases that proceed (in a modern, deep pipeline) without a stall. 486 487 This experiment takes the position that 488 \begin{itemize} 489 \item The total time to add and remove is relevant; an attribution of time spent adding vs.\ removing is not. 490 Any use case for which addition speed matters necessarily has removes paired with adds. 491 Otherwise, the alleged usage would exhaust the amount of work expressible as a main memory full of nodes within a few seconds. 492 \item A relevant breakdown ``by operation'' is, rather, one that considers the structural context of these requests. 493 \begin{description} 494 \item[movement] 495 Is the add/remove order that of a stack, a queue, or something else? 496 \item[polarity] 497 In which direction does the movement's action apply? For a queue, do items flow from first to last or last to first? For a stack, is the first-end or the last-end used for adding and removing? 498 \item[accessor] 499 Is an add/remove location given by a list head's ``first''/``last'', or by a reference to an individual element? 500 \end{description} 501 \item Speed differences caused by the host machine's memory hierarchy need to be identified and explained, 502 but do not represent advantages of one linked-list implementation over another. 503 \end{itemize} 504 505 This experiment measures the mean duration of a list addition and removal. 
506 Confidence bounds, on this mean, are discussed. 507 The distribution of speeds experienced by an individual add-remove pair (tail latency) is not discussed. 508 509 Space efficiency is shown only indirectly, by way of caches' impact on speed. 510 511 %~MONITOR 512 % If able to show cases with CFA doing better, reword. 513 The goal is to show the \CFA library performing comparably to other intrusive libraries, 514 in an experimental context sensitive enough to show also: 515 \begin{itemize} 516 \item intrusive lists performing (majorly) differently from wrapped lists 517 \item a space of (minor) performance differences typical of existing intrusive lists 518 \end{itemize} 519 520 521 \subsubsection{Experiment setup} 522 523 The experiment defines a user's datatype and considers 524 the speed of building, and tearing down, a list of $n$ instances of the user's type. 525 526 The timings are taken with a fixed-duration method based on checks of @clock()@. 527 In a typical 5-sec run, the outer looping checks the clock about 200 times. 528 A number of experimental rounds per clock check is precalculated to be appropriate to the value of $n$. 529 530 \begin{cfa} 531 // simplified harness: CFA implementation, 532 // stack movement, insert-first polarity, head-mediated access 533 size_t totalOpsDone = 0; 534 dlist( item_t ) lst; 535 item_t items[ n ]; 536 startTimer(); 537 while ( SHOULD_CONTINUE ) { 538 for ( i; n ) insert_first( lst, items[i] ); 539 for ( i; n ) remove_first( lst ); 540 totalOpsDone += n; 541 } 542 stopTimer(); 543 reportedDuration = getTimerDuration() / totalOpsDone; 544 \end{cfa} 545 546 One experimental round is, first, a tight loop of inserting $n$ elements into a list, followed by another, to remove these $n$ elements. 547 A counter is incremented by $n$ each round. 548 When the whole experiment is done, the total elapsed time, divided by the final value of the operation counter, 549 is reported as the observed mean operation duration. 
550 In a scatterplot presentation, each dot would be one such reported mean duration. 551 So, ``operation'' really means insert + remove + harness overhead. 552 553 The harness overheads are held constant when comparing linked-list implementations. 554 The remainder of the setup section discusses the choices that affected the harness overhead. 555 556 An \emph{iterators' array} provides support for element-level operations on non-intrusive lists. 557 As elaborated in Section \ref{toc:lst:issue:attach}, 558 wrapped-attachment lists use a distinct type (at a distinct memory location) to represent ``an item that's in the list.'' 559 Operations like insert-after and remove-here consume iterators. 560 In the STL implementation, an iterator is a pointer to a \lstinline{std::_List_node}. 561 For the STL case, the driver obtains an iterator value 562 at the time of adding to the list, and stores the iterator in an array, for consumption by subsequent element-oriented operations. 563 For intrusive-list cases, the driver stores the user object's address in the iterators' array. 564 565 \begin{c++} 566 // further simplified harness (bookkeeping elided): STL implementation, 567 // stack movement, insert-first polarity, element-based remove access 568 list< item_t * > lst; 569 item_t items[ n ]; 570 while ( SHOULD_CONTINUE ) { 571 @list< item_t * >::iterator iters[ n ];@ 572 for ( int i = 0; i < n; i += 1 ) { 573 lst.push_front( & items[i] ); 574 @iters[i]@ = lst.begin(); 575 } 576 for ( int i = 0; i < n; i += 1 ) { 577 lst.erase( @iters[i]@ ); 578 } 579 } 580 \end{c++} 581 582 %~MONITOR 583 % If running insert-random scenarios, revise the assessment 584 585 A \emph{shuffling array} helps control the memory layout of user items. 586 The required control is exercised when choosing the next item to insert. 587 The user items are allocated in a contiguous array. 
588 Without shuffling, the driver's insert phase visits these items in order, producing a list whose adjacency links hop in uniform strides. 589 With shuffling active, the driver's insert phase visits only the shuffling array in order, 590 which applies pseudo-random indirection to the selection of a next-to-insert element from the user-item array. 591 The result is a list whose links travel randomly far. 592 593 \begin{cfa} 594 // harness (bookkeeping and iterators elided): CFA implementation, 595 // stack movement, insert-first polarity, head-mediated access 596 dlist( item_t ) lst; 597 item_t items[ n ]; 598 size_t insert_ord[ n ]; // elided: populate with shuffled [0,n) 599 while ( SHOULD_CONTINUE ) { 600 for ( i; n ) insert_first( lst, items[ @insert_ord[@ i @]@ ] ); 601 for ( i; n ) remove_first( lst ); 602 } 603 \end{cfa} 604 605 \emph{Interleaving} allows for movements other than pure stack and queue. 606 Note that the earlier example of using the iterators' array is still a pure stack: the item selected for @erase(...)@ is always the first. 607 Including a less predictable movement is important because real applications that justify doubly linked lists use such movements. 608 Freedom to remove from arbitrary places (and to insert under more relaxed assumptions) is the defining capability of a doubly linked list. 609 A queue with drop-out is an example of such a movement. 610 A list implementation can show unrepresentative speed under a simple movement, for example, by enjoying unchallenged ``Is first element?'' branch predictions. 611 612 Interleaving brings ``at middle of list'' cases into a stream of add or remove invocations, which would otherwise be exclusively ``at end''. 613 A chosen split, like half middle and half end, populates a boolean array, which is then shuffled. 614 These booleans then direct the action to end-\vs-middle. 
615 616 \begin{cfa} 617 // harness (bookkeeping and shuffling elided): CFA implementation, 618 // stack movement, insert-first polarity, interleaved element-based remove access 619 dlist( item_t ) lst; 620 item_t items[ n ]; 621 @bool interl[ n ];@ // elided: populate with weighted, shuffled [0,1] 622 while ( SHOULD_CONTINUE ) { 623 item_t * iters[ n ]; 624 for ( i; n ) { 625 insert_first( lst, items[i] ); 626 iters[i] = & items[i]; 627 } 628 @item_t ** crsr[ 2 ]@ = { // two cursors into iters 629 & iters[ @0@ ], // at stack-insert-first's removal end 630 & iters[ @n / interl_frac@ ] // in middle 631 }; 632 for ( i; n ) { 633 item_t *** crsr_use = & crsr[ @interl[ i ]@ ]; 634 remove( *** crsr_use ); // removing from either middle or end 635 *crsr_use += 1; // that item is done 636 } 637 assert( crsr[0] == & iters[ @n / interl_frac@ ] ); // through second's start 638 assert( crsr[1] == & iters[ @n@ ] ); // did the rest 639 } 640 \end{cfa} 641 642 By using the pair of cursors, the harness avoids branches, which could incur prediction stall times themselves, or prime a branch in the SUT (system under test). 643 This harness avoids telling the hardware what the SUT is about to do. 644 645 These experiments are single-threaded. They run on a PC with a 64-bit eight-core AMD FX-8370E, with ``taskset'' pinning to core \#6. The machine has 16 GB of RAM and 8 MB of last-level cache. 646 647 The comparator linked-list implementations are: 648 \begin{description} 649 \item[lq-list] The @list@ type of LQ from glibc of GCC-11. 650 \item[lq-tailq] The @tailq@ type of the same. 651 \item[upp-upp] The @uSequence@ type provided by uC++. 652 \item[cfa-cfa] \CFA's @dlist@. 653 \end{description} 654 655 656 \subsubsection{Result: Coarse comparison of styles} 657 658 This comparison establishes how an intrusive list performs, compared with a wrapped-reference list. 659 It also establishes the context within which it is meaningful to compare one intrusive list to another. 
660 661 %These goals notwithstanding, the effect of the host machine's memory hierarchy is more significant here than linked-list implementation. 662 663 \begin{figure} 664 \centering 665 \begin{tabular}{c} 666 \includegraphics{plot-list-zoomout-shuf.pdf} \\ 667 (a) \\ 668 \includegraphics{plot-list-zoomout-noshuf.pdf} \\ 669 (b) \\ 670 \end{tabular} 671 \caption{Operation duration \vs list length across the full spectrum of list lengths. One example operation is shown: stack movement, insert-first polarity and head-mediated access. Lengths go as large as the run completes without error. Version (a) uses shuffled items, while version (b) links items with their physical neighbours.} 672 \label{fig:plot-list-zoomout} 673 \end{figure} 674 675 \VRef[Figure]{fig:plot-list-zoomout} presents the speed measures at various list lengths. 676 STL's wrapped-reference list begins with operations on a length-1 list costing around 30 ns. 677 This time grows modestly as list length increases, apart from more drastic worsening at the largest lengths. 678 STL's performance is not affected by element order in memory. 679 The field of intrusive lists begins with length-1 operations costing around 10 ns and enjoys a ``sweet spot'' in lengths 10--100 of 5--7-ns operations. 680 This much is also unaffected by element order. 681 Beyond this point, shuffled-element list performance worsens drastically, losing to STL beyond about half a million elements, and never particularly leveling off. 682 In the same range, an unshuffled list sees some degradation, but holds onto a 1--2 $\times$ speedup over STL. 683 684 The apparent intrusive ``sweet spot,'' particularly its better-than-length-1 speed, is not because of list operations truly running faster. 685 Rather, the worsening as length decreases reflects the per-operation share of harness overheads incurred at the outer-loop level. 
686 Disabling the harness's ability to drive interleaving, even though the current scenario uses a ``never work in middle'' interleave, made this rise disappear. 687 Subsequent analyses use length-controlled relative performance when comparing intrusive implementations, making this curiosity disappear. 688 689 In a wrapped-reference list, list nodes are allocated separately from the items put into the list. 690 Intrusive beats wrapped at the smaller lengths, and when shuffling is avoided, because intrusive avoids dynamic memory allocation for list nodes. 691 692 The remaining big-swing comparison points say more about a computer's memory hierarchy than about linked lists. 693 The tests in this chapter are only inserting and removing. 694 They are not operating on any user payload data that is being listed. 695 The drastic differences at large list lengths reflect differences in link-field storage density and in correlation of link-field order to element order. 696 These differences are inherent to the two list models. 697 698 The slowdown of shuffled intrusive occurs as the experiment's memory footprint grows out of last-level cache into main memory. 699 Insert and remove operations act on both sides of a link. 700 Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location. 701 Each time a next item is processed, the memory access is a hop to a randomly far address. 702 The target is not available in cache and a slowdown results. 703 704 With the unshuffled intrusive list, each link connects to an adjacent location. So, this case has high memory locality and stays fast. But the unshuffled assumption is simply not realistic: if you know items are adjacent, you don't need a linked list. 
705 706 A wrapped-reference list's separate nodes are allocated right beside each other in this experiment, because no other memory allocation action is happening. 707 As a result, the interlinked nodes of the STL list are generally referencing their immediate neighbours. 708 This pattern occurs regardless of user-item shuffling because this test's ``use'' of the user-items' array is limited to storing element addresses. 709 This experiment, driving an STL list, is simply not touching the memory that holds the user data. 710 Because the interlinked nodes, being the only touched memory, are generally adjacent, this case too has high memory locality and stays fast. 711 But the user-data no-touch assumption is often unrealistic: decisions like, ``Should I remove this item?'' need to look at the item. 712 In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a non-intrusive list may be preferred for lists of approximately the cache's size. 713 714 Therefore, under clearly typical situational assumptions, both intrusive and wrapped-reference lists will suffer similarly from a large list overfilling the memory cache, experiencing degradation like shuffled intrusive shows here. 715 716 But the comparison of unshuffled intrusive with wrapped-reference gives the performance of these two styles, with the common impediment of overfilling the cache removed. 717 Intrusive consistently beats wrapped-reference by about 20 ns, at all sizes. 718 This difference is appreciable below list length 0.5 M, and enormous below 10 K. 719 720 721 \subsubsection{Result: Comparing intrusive implementations} 722 723 The preceding result shows that intrusive implementations have noteworthy performance differences below 150 nodes. 724 This analysis zooms in on this area and identifies the participants. 
725 726 \begin{figure} 727 \centering 728 \begin{tabular}{c} 729 \includegraphics{plot-list-zoomin-abs.pdf} \\ 730 (a) \\ 731 \includegraphics{plot-list-zoomin-rel.pdf} \\ 732 (b) \\ 733 \end{tabular} 734 \caption{Operation duration \vs list length at small-medium lengths. One example operation is shown: stack movement, insert-first polarity and head-mediated access. (a) has absolute times. (b) has times relative to those of LQ-\lstinline{tailq}.} 735 \label{fig:plot-list-zoomin} 736 \end{figure} 737 738 \VRef{fig:plot-list-zoomin} part (a) shows exactly this zoom-in. 739 The same scenario as the coarse comparison is used: a stack, with insertions and removals happening at the end called ``first,'' ``head'' or ``front,'' and all changes occurring through a head-provided insert/remove operation. 740 The error bars show the fastest and slowest times seen on five trials, and the central point is the mean of the remaining three trials. 741 For readability, the frameworks are slightly staggered in the horizontal, but all trials near a given size were run at the same size. 742 743 For this particular operation, uC++ fares the worst, followed by \CFA, then LQ's @tailq@. 744 LQ's @list@ does the best at smaller lengths but loses its edge above a dozen elements. 745 746 Moving toward being able to consider several scenarios, \VRef{fig:plot-list-zoomin} part (b) shows the same result, adjusted to treat @tailq@ as a benchmark, and expressing all the results relative to it. 747 This change does not affect the who-wins statements; it just removes the ``sweet spot'' bend that the earlier discussion dismissed as incidental. 748 Runs faster than @tailq@'s are below zero and slower runs are above; @tailq@'s mean is always zero by definition, but its error bars, representing a single scenario's re-run stability, are still meaningful. 749 With this bend straightened out, aggregating across lengths is feasible. 
750 751 \begin{figure} 752 \centering 753 \begin{tabular}{c} 754 \includegraphics{plot-list-cmp-exout.pdf} \\ 755 (a) \\ 756 \includegraphics{plot-list-cmp-survey.pdf} \\ 757 (b) \\ 758 \end{tabular} 759 \caption{Operation duration ranges across operational scenarios. (a) has the supersets of the running example operation. (b) has the first-level slices of the full space of operations.} 760 \label{fig:plot-list-cmp-overall} 761 \end{figure} 762 763 \VRef{fig:plot-list-cmp-overall} introduces the resulting format. 764 Part (a)'s first column summarizes all the data of \VRef{fig:plot-list-zoomin}-(b). 765 Its x-axis label, ``stack/insfirst/allhead,'' names the concrete scenario that has been discussed until now. 766 Moving across the columns, the next three each stretch to include more scenarios on each of the operation dimensions, one at a time. 767 The second column considers the scenarios $\{\mathrm{stack}\} \times \{\mathrm{insfirst}\} \times \{\mathrm{allhead}, \mathrm{inselem}, \mathrm{remelem}\}$, 768 while the third stretches polarity and the fourth stretches accessor. 769 The next three columns each stretch two scenario dimensions and the last column stretches all three. 770 The \CFA bar in the last column is summarizing 840 test-program runs: 14 list lengths, 2 movements, 2 polarities, 3 accessors and 5 repetitions. 771 772 In the earlier plots of one scenario broken down by length, each data point, with its error bars, represents just 5 repetitions. 773 With a couple of exceptions, this reproducibility error was small. 774 Now, for a \CFA bar, summarizing 70 (first column) to 840 (last column) runs, a bar's height is dominated by the different behaviours of the scenarios and list lengths that it summarizes. 775 Accordingly, the first column's bars are short and the last column's are tall. 776 A box represents the inner 68\% of the durations, while its lines extend to cover 95\%. 777 The symbol on the bar is the mean duration. 
778 779 The chosen benchmark of LQ-@tailq@ is not shown in this format because it would be trivial here. 780 With inter-scenario differences dominating the bar size, and @tailq@'s mean performance defined to be zero in all scenarios, a @tailq@ bar on this plot would only show @tailq@'s re-run stability, which is of no comparison value. 781 782 The LQ-@list@ implementation does not support all scenarios, only stack movement with insert-first polarity. 783 So, its 1\textsuperscript{st}, 3\textsuperscript{rd}, 4\textsuperscript{th} and 7\textsuperscript{th} bars all summarize the same set of points (those with accessor constrained to all-head), as do its 2\textsuperscript{nd}, 5\textsuperscript{th}, 6\textsuperscript{th} and 8\textsuperscript{th} (those with accessor unconstrained). 784 785 Rather than exploring from one scenario out, \VRef{fig:plot-list-cmp-overall}-(b) gives a more systematic breakdown of the entire experimental space. 786 Other than the last grand-total column, each breakdown column shows one value from one operation dimension. 787 788 LQ-@list@'s partial scenario coverage gives missing bars where it does not support the operation. 789 And, again, it gives repetition where all data points occur in several columns' intersection, such as stack/*/* and */insfirst/*. 790 791 In the grand total, and in all halves by movement or polarity, \CFA and uC++ are equivalent, while LQ-@list@ beats them slightly. 792 Splitting on accessor, \CFA has a poor result on element removal, LQ-@list@ has a great result on the other accessors, and uC++ is unaffected. 793 The unseen @tailq@ dominates across every category and beats \CFA and uC++ by 15--20\%. 
794 795 % \begin{figure} 796 % \centering 797 % \begin{tabular}{c} 798 % \includegraphics{plot-list-cmp-intrl-shift.pdf} \\ 799 % (a) \\ 800 % \includegraphics{plot-list-cmp-intrl-outcome.pdf} \\ 801 % (b) \\ 802 % \end{tabular} 803 % \caption{Caption TODO} 804 % \label{fig:plot-list-cmp-intrl} 805 % \end{figure} 806 807 \subsubsection{Result: CFA cost attribution} 808 809 This comparison loosely itemizes the reasons that the \CFA implementation runs 15--20\% slower than LQ. Each reason provides for safer programming. For each reason, a version of the \CFA list was measured that forgoes its safety and regains some performance. These potential sacrifices are: 810 \newcommand{\mandhead}{\emph{mand-head}} 811 \newcommand{\nolisted}{\emph{no-listed}} 812 \newcommand{\noiter}{\emph{no-iter}} 813 \begin{description} 814 \item[mand(atory)-head] Removing support for headless lists. 815 A specific explanation of why headless support causes a slowdown is not offered. 816 But it is reasonable for a cost to result from making one piece of code handle multiple cases; the subset of the \CFA list API that applies to headless lists shares its implementation with headed lists. 817 In the \mandhead case, disabling the feature in \CFA means using an older version of the implementation, from before headless support was added. 818 In the pre-headless library, trying to form a headless list (instructing, ``Insert loose element B after loose element A,'') is a checked runtime error. 819 LQ does not support headless lists\footnote{ 820 Though its documentation does not mention the headless use case, headless use is impossible because, in every list model, one of its insert-before or insert-after routines is unusable. 821 For \lstinline{tailq}, the API requires a head. 822 For \lstinline{list}, this usage causes an ``uncaught'' runtime crash.}. 823 \item[no-listed] Removing support for the @is_listed@ API query. 
824 Along with it goes error checking such as ``When inserting an element, it must not already be listed, \ie be referred to from somewhere else.'' 825 These abilities have a cost because, in order to support them, a listed element that is being removed must be written to, to record its change in state. 826 In \CFA's representation, this cost is two pointer writes. 827 To disable the feature, these writes, and the error checking that consumes their result, are put behind an @#ifdef@. 828 The result is that a removed element sees itself as still having neighbours (though these quasi-neighbours see it differently). 829 This state is how LQ leaves a removed element; LQ does not offer an is-listed query. 830 \item[no-iter(ation)] Removing support for well-terminating iteration. 831 The \CFA list uses bit-manipulation tagging on link pointers (rather than \eg null links) to express, ``No more elements this way.'' 832 This tagging has the cost of submitting a retrieved value to the ALU, and awaiting this operation's completion, before dereferencing a link pointer. 833 In some cases, the is-terminating bit is transferred from one link to another, or has a similar influence on a resulting link value; this logic adds register pressure and more data dependency. 834 To disable the feature, the @#ifdef@-controlled tag manipulation logic compiles in answers like, ``No, that link is not a terminator,'' ``The dereferenceable pointer is the value you read from memory,'' and ``The terminator-marked value you need to write is the pointer you started with.'' 835 Without this termination marking, repeated requests for a next valid item will always provide a positive response; when it should be negative, the indicated next element is garbage data at an address unlikely to trigger a memory error. 836 LQ has a well-terminating iteration for listed elements. 837 In the \noiter case, the slowdown is not inherent; it represents a \CFA optimization opportunity. 
838 \end{description} 839 \MLB{Ensure benefits are discussed earlier and cross-reference} % an LQ programmer must know not to ask, ``Who's next?'' about an unlisted element; an LQ programmer cannot write assertions about an item being listed; LQ requiring a head parameter is an opportunity for the user to provide inconsistent data 840 841 \begin{figure} 842 \centering 843 \begin{tabular}{c} 844 \includegraphics{plot-list-cfa-attrib.pdf} \\ 845 (a) \\ 846 \includegraphics{plot-list-cfa-attrib-remelem.pdf} \\ 847 (b) \\ 848 \end{tabular} 849 \caption{Operation duration ranges for functionality-reduced \CFA list implementations. (a) has the top level slices. (b) has the next level of slicing within the slower element-based removal operation.} 850 \label{fig:plot-list-cfa-attrib} 851 \end{figure} 852 853 \VRef[Figure]{fig:plot-list-cfa-attrib} shows the \CFA list performance with these features, and their combinations, turned on and off. When a series name is one of the three sacrifices above, the series is showing this sacrifice in isolation. These further series names give combinations: 854 \newcommand{\attribFull}{\emph{full}} 855 \newcommand{\attribParity}{\emph{parity}} 856 \newcommand{\attribStrip}{\emph{strip}} 857 \begin{description} 858 \item[full] No sacrifices. Same as measurements presented earlier. 859 \item[parity] \mandhead + \nolisted. Feature parity with LQ. 860 \item[strip] \mandhead + \nolisted + \noiter. All options set to ``faster.'' 861 \end{description} 862 All list implementations are \CFA, possibly stripped. 863 The plot uses the same LQ-relative basis as earlier. 864 So getting to zero means matching LQ's @tailq@. 865 866 \VRef[Figure]{fig:plot-list-cfa-attrib}-(a) summarizes the time attribution across the main operating scenarios. 867 The \attribFull series is repeated from \VRef[Figure]{fig:plot-list-cmp-overall}, part (b), while the series showing feature sacrifices are new. 
868 Going all the way to \attribStrip at least nearly matches LQ in all operating scenarios, beats LQ often, and slightly beats LQ overall. 869 Except within the accessor splits, both sacrifices contribute improvements individually, \noiter helps more than \attribParity, and the total \attribStrip benefit depends on both contributions. 870 When the accessor is not element removal, the \attribParity shift appears to be counterproductive, leaving \noiter to deliver most of the benefit. 871 For element removals, \attribParity is the heavy hitter, with \noiter contributing modestly. 872 873 The counterproductive shift outside of element removals is likely due to an optimization done in the \attribFull version after implementing headless support, \ie not present in the \mandhead version. 874 This work streamlined both head-based operations (head-based removal being half the work of the element-insertion test). 875 This improvement could be ported to a \mandhead-style implementation, which would bring down the \attribParity time in these cases. 876 877 More significantly, missing this optimization affects every \attribParity result because they all use head-based inserts or removes for at least half their operations. 878 It is likely a reason that \attribParity is not delivering as well overall as \noiter. 879 It even represents plausible further improvements in \attribStrip. 880 881 \VRef[Figure]{fig:plot-list-cfa-attrib}-(b) addresses element removal being the overall \CFA slow spot and element removal having a peculiar shape in the (a) analysis. 882 Here, the \attribParity sacrifice bundle is broken out into its two constituents. 883 The result is the same regardless of the operation. 884 All three individual sacrifices contribute noteworthy improvements (\nolisted slightly less). 885 The fullest improvement requires all of them. 886 887 The \noiter feature sacrifice is unpalatable. 
888 But because it is not an inherent slowdown, there may be room to pursue a \noiter-level speed improvement without the \noiter feature sacrifice. 889 The performance crux for \noiter is the pointer-bit tagging scheme. 890 Alternative designs that may offer speedup with acceptable consequences include keeping the tag information in a separate field, and (for 64-bit architectures) keeping it in the high-order byte, \ie using byte- rather than bit-oriented instructions to access it. 891 The \noiter speed improvement would bring \CFA to +5\% of LQ overall, and from high twenties to high teens, in the worst case of element removal. 892 893 Ultimately, this analysis provides options for a future effort that needs to get the most speed out of the \CFA list. 894 895 896 897 475 898 \section{Future Work} 476 899 \label{toc:lst:futwork} -
doc/theses/mike_brooks_MMath/uw-ethesis.tex
r47064914 r16a843b 116 116 \newcommand{\uCpp}{$\mu$\CC} 117 117 \newcommand{\PAB}[1]{{\color{red}PAB: #1}} 118 \newcommand{\MLB}[1]{{\color{red}MLB: #1}} 118 119 119 120 % Hyperlinks make it very easy to navigate an electronic document.