Changeset 834f634


Ignore:
Timestamp:
Mar 11, 2019, 11:06:10 PM (3 years ago)
Author:
Aaron Moss <a3moss@…>
Branches:
aaron-thesis, arm-eh, cleanup-dtors, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr
Children:
e402bbc
Parents:
811466d
Message:

thesis: second draft of ch.6

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/aaron_moss_PhD/phd/experiments.tex

    r811466d r834f634  
    108108All tests were run on a machine with 128~GB of RAM and 64 cores running at 2.2~GHz.
    109109
     110\begin{figure}
     111        \centering
     112        \input{tests-completed}
     113        \caption[Tests completed for each algorithmic variant]{Number of tests completed for each algorithmic variant} \label{tests-completed-fig}
     114\end{figure}
     115
    110116As a matter of experimental practicality, test runs that exceeded 8~GB of peak resident memory usage are excluded from the data set.
    111117This restriction is justifiable by real-world use, as a compiler that is merely slow may be accommodated with patience, but one that uses in excess of 8~GB of RAM may be impossible to run on many currently deployed computer systems.
    1121188~GB of RAM is not typical of the memory usage of the best-peforming two variants, \textsc{bu-dca-bas} and \textsc{bu-dca-per}, which were able to run all 131 test inputs to completion  with maximum memory usage of 70~MB and 78~MB, respectively.
    113119However, this threshold did eliminate a significant number of algorithm-test variants, with the worst-performing variant, \textsc{td-imm-inc}, only completing 62 test inputs within the memory bound.
    114 Full results for tests completed by algorithm variant are presented in Figure~\ref{tests-completed-fig}.
    115 
    116 \begin{figure}
    117 \centering
    118 \input{tests-completed}
    119 \caption[Tests completed for each algorithmic variant]{Number of tests completed for each algorithmic variant} \label{tests-completed-fig}
    120 \end{figure}
    121 
     120Full results for tests completed by algorithm variant are presented in Figure~\ref{tests-completed-fig}.
    122121As can be seen from these results, traversal direction is clearly the dominant variable in memory usage, with the \textsc{bu-*} variants performing better than the \textsc{co-*} variants, which in turn out-perform the \textsc{td-*} variants.
    123 It can also be seen that the incremental inheritance (\textsc{inc}) type environment consistently under-performs the other two environment data structures tested, as any efficiencies from the inheritance mechanism are apparently insufficient to pay for the added complexity of the data structure.
    124 
    125 To provide a more holistic view of performance, I have considered the results from the 56 test inputs which all algorithms are able to complete within the memory bound.
    126 Limiting consideration to these algorithms provides an apples-to-apples comparison between algorithms, as the excluded inputs are harder instances which take more time and memory for the algorithms which are able to solve them.
     122
     123To provide a more holistic view of performance, I have considered the results from the 56 test inputs that all algorithms are able to complete within the memory bound.
     124Limiting consideration to these algorithms provides an apples-to-apples comparison among algorithms, as the excluded inputs are harder instances, which take more time and memory for the algorithms that are able to solve them.
    127125Figures~\ref{avg-peak-mem-fig} and~\ref{avg-runtime-fig} show the mean peak memory and runtime, respectively, of each algorithm over the inputs in this data set.
    128126These averages are not themselves meaningful, but do enable an overall comparison of relative performance of the different variants.
    129 Selecting only these 56 ``easy'' test inputs does bias the average values downward, but has little effect on the relative trends; similar trends can be seen in the graphs of the \textsc{bu-*} algorithms over the 124 (of 131) test inputs which all complete, omitted to save space.
     127Selecting only these 56 ``easy'' test inputs does bias the average values downward, but has little effect on the relative trends; similar trends can be seen in the graphs of the \textsc{bu-*} algorithms over the 124 (of 131) test inputs that all complete, which have been omitted to save space.
    130128
    131129\begin{figure}
     
    154152
    155153It can be seen from these results that that the top-down, immediate assertion-satisfaction (\textsc{td-imm-*}) variants are particularly inefficient, as they check a significant number of assertions without filtering to determine if the arguments can be made to fit.
    156 It is also clear that the bottom-up (\textsc{bu}) traversal order is better than both top-down (\textsc{td}) and the Bilson-style bottom-up-combined ((\textsc{co})) orders.
     154It is also clear that the bottom-up (\textsc{bu}) traversal order is better than both top-down (\textsc{td}) and the Bilson-style bottom-up-combined (\textsc{co}) orders.
    157155While the advantage of \textsc{bu} over \textsc{co} is clear, in that it performs less redundant work if a prefix of a combination fails, the advantage of \textsc{bu} over \textsc{td} provides an answer for an open question from Baker \cite{Baker82}.
    158156I believe that bottom-up is superior because it must only handle each subexpression once to form a list of candidate interpretations, whereas the top-down approach may do similar work repeatedly to resolve a subexpression with a variety of different types, a shortcoming that cannot be fully addressed by the memoization scheme employed in the \textsc{td} algorithm.
    159157
    160 With regard to assertion satisfaction, immediate (\textsc{imm}) satisfaction is an inferior solution, though there is little performance difference between deferred (\textsc{def}) and deferred-cached (\textsc{dca}) for instances which they can both complete; particularly notable is that \textsc{dca} caching scheme does not have a noticeable impact on peak memory usage.
    161 Since the \textsc{dca} algorithm can solve some particularly hard instances which \textsc{def} cannot, it is the recommended approach.
    162 
    163 The \textsc{inc} type environment also often uses upwards of double the memory required by the other variants, in addition to being consistently slower on these easy tests; aside from \textsc{bu-imm-bas} performing worse than \textsc{bu-imm-inc} on average when larger tests are considered, these results hold for the other variants.
     158With regard to assertion satisfaction, immediate (\textsc{imm}) satisfaction is an inferior solution, though there is little performance difference between deferred (\textsc{def}) and deferred-cached (\textsc{dca}) for instances that both can complete; particularly notable is that the \textsc{dca} caching-scheme does not have a noticeable impact on peak memory usage.
     159Since the \textsc{dca} algorithm can solve some particularly hard instances that \textsc{def} cannot, it is the recommended approach.
     160
     161The incremental-inheritance (\textsc{inc}) type environment also often uses upwards of double the memory required by the other variants, in addition to being consistently slower on these easy tests; aside from \textsc{bu-imm-bas} performing worse than \textsc{bu-imm-inc} on average when larger tests are considered, these results hold for the other variants.
     162It is apparent from these results that any efficiencies from the inheritance mechanism are insufficient to pay for the added complexity of the data structure.
    164163Aside from that, the persistent union-find (\textsc{per}) type environment generally performs better than the basic (\textsc{bas}) environment, with similar peak memory usage and an average speedup factor of nearly 2, though the requirements of the \textsc{per} environment for automatic garbage collection and a shared history for combination make retrofitting it into older code difficult.
    165164
    166165\section{Instance Difficulty} \label{instance-expr-sec}
    167166
    168 To characterize the difficulty of expression resolution problem instances, the test suites must be explored at a finer granularity.
     167To characterize the difficulty of expression-resolution problem instances, the test suites must be explored at a finer granularity.
    169168As discussed in Section~\ref{resn-analysis-sec}, a single top-level expression is the fundamental problem instance for resolution, yet the test inputs discussed above are composed of thousands of top-level expressions, like the actual source code they are derived from.
    170 To pull out the effects of these individual problems, I instrumented the resolver prototype to time resolution for each expression, and also report some relevant properties of the expression.
    171 This instrumented resolver was then run on a set of difficult test instances; to limit the data collection task, these runs were limited to the best-performing \textsc{bu-dca-per} algorithm and test inputs which that algorithm took more than 1~s to complete.
    172 
    173 The 13 test inputs thus selected contain 20632 top-level expressions between them, which are separated into order-of-magnitude bins by runtime in Figure~\ref{per-prob-histo-fig}.
    174 As can be seen from this figure, overall runtime is dominated by a few particularly difficult problem instances --- the 60\% of expressions which resolve in under 0.1~ms collectively take less time to resolve than any of the 0.2\% of expressions which take at least 100~ms to resolve.
     169To pull out the effects of these individual problems, the resolver prototype is instrumented to time resolution for each expression, and also to report some relevant properties of the expression.
     170This instrumented resolver is then run on a set of difficult test instances; to limit the data collection task, these runs are restricted to the best-performing \textsc{bu-dca-per} algorithm and test inputs taking more than 1~s to complete.
     171
     172The 13 test inputs thus selected contain 20632 top-level expressions among them, which are separated into order-of-magnitude bins by runtime in Figure~\ref{per-prob-histo-fig}.
     173As can be seen from this figure, overall runtime is dominated by a few particularly difficult problem instances --- the 60\% of expressions that resolve in under 0.1~ms collectively take less time to resolve than any of the 0.2\% of expressions that take at least 100~ms to resolve.
    175174On the other hand, the 46 expressions in that 0.2\% take 38\% of the overall time in this difficult test suite, while the 201 expressions that take between 10 and 100~ms to resolve consume another 30\%.
    176175
     
    181180\end{figure}
    182181
    183 Since the top centile of expression resolution instances requires approximately two-thirds of the resolver's time, optimizing the resolver for specific hard problem instances has proven to be an effective technique for reducing overall runtime.
    184 The data indicates that number of assertions necessary to resolve has the greatest effect on runtime, as seen in
     182Since the top centile of expression-resolution instances requires approximately two-thirds of the resolver's time, optimizing the resolver for specific hard problem instances has proven to be an effective technique for reducing overall runtime.
     183The data indicates that the number of assertions necessary to resolve has the greatest effect on runtime, as seen in
    185184Figure~\ref{per-prob-assns-fig}.
    186185However, since the number of assertions required is only known once resolution is finished, the most-promising pre-resolution metric of difficulty is the nesting depth of the expression; as seen in Figure~\ref{per-prob-depth-fig}, expressions of depth $> 10$ in this data-set are uniformly difficult.
     
    212211
    213212I have integrated most of the algorithmic techniques discussed in this chapter into \CFACC{}.
    214 This integration took place over a period of months while \CFACC{} was under active development on a number of other fronts, so it is not possible to completely isolate the effects of the algorithmic changes, but I believe the algorithmic changes to be the most-significant effects on performance over the study period.
     213This integration took place over a period of months while \CFACC{} was under active development on a number of other fronts, so it is not possible to completely isolate the effects of the algorithmic changes, but I believe the algorithmic changes to have had the most-significant effects on performance over the study period.
    215214To generate this data, representative commits from the \texttt{git} history of the project were checked out and compiled, then run on the same machine used for the resolver prototype experiments discussed in Section~\ref{proto-exp-sec}.
    216 To negate the effects of changes to the \CFA{} standard library on the timing results, 55 test files from the test suite of the oldest \CFA{} variant were compiled with the \texttt{-E} flag to inline their library dependencies, and these inlined files were used to test the remaining \CFACC{} versions.
     215To negate the effects of changes to the \CFA{} standard library on the timing results, 55 test files from the test suite of the oldest \CFA{} variant are compiled with the \texttt{-E} flag to inline their library dependencies, and these inlined files are used to test the remaining \CFACC{} versions.
    217216
    218217I performed two rounds of modification to \CFACC{}; the first round moved from Bilson's original combined-bottom-up algorithm to an un-combined bottom-up algorithm, denoted \textsc{cfa-co} and \textsc{cfa-bu}, respectively.
     
    222221It is possible that the persistent union-find environment could be modified to use a reference-counted pointer internally without changing the entire memory-management framework of \CFACC{}, but such an attempt is left to future work.
    223222
    224 As can be seen in Figures~\ref{cfa-time-fig} and~\ref{cfa-mem-fig}, which show the time and peak memory results for these five versions of \CFACC{}, assertion resolution dominates total resolution cost, with the \textsc{cfa-def} and \textsc{cfa-dca} variants running consistently faster than the others on more expensive test cases; no difference can be seen between these two algorithms' performance, but that result agrees with the prototype experiments in Section~\ref{proto-exp-sec}.
     223As can be seen in Figures~\ref{cfa-time-fig} and~\ref{cfa-mem-fig}, the time and peak memory results for these five versions of \CFACC{} show that assertion resolution dominates total resolution cost, with the \textsc{cfa-def} and \textsc{cfa-dca} variants running consistently faster than the others on more expensive test cases; no difference can be seen between these two algorithms' performance, but which agrees with the prototype experiments in Section~\ref{proto-exp-sec}.
    225224The results from \CFACC{} for \textsc{cfa-co} \textit{vs.}\ \textsc{cfa-bu} do not mirror those from the prototype; I conjecture this is mostly due to the different memory-management schemes and sorts of data required to run type unification and assertion satisfaction calculations, as \CFACC{} performance has proven to be particularly sensitive to the amount of heap allocation performed.
    226225This data also shows a noticeable regression in compiler performance in the eleven months between \textsc{cfa-bu} and \textsc{cfa-imm}, which use the same resolution algorithms; this regression is not due to expression resolution, as no integration work happened in this time, but I am unable to ascertain its actual cause.
     
    252251
    253252As can be seen from the prototype results, per-expression benchmarks, and \CFACC{}, the dominant factor in the cost of \CFA{} expression resolution is assertion satisfaction.
    254 Reducing the number of total number of assertion satisfaction problems solved, as in the deferred satisfaction algorithm, is consistently effective at reducing runtime, and caching results of these satisfaction problems has shown promise in the prototype system.
     253Reducing the total number of assertions satisfied, as in the deferred satisfaction algorithm, is consistently effective at reducing runtime, and caching results of these satisfaction problem instances has shown promise in the prototype system.
    255254The results presented here also demonstrate that a bottom-up approach to expression resolution is superior to top-down, settling an open question from Baker~\cite{Baker82}.
    256255The persistent union-find type environment introduced in Chapter~\ref{env-chap} has also been demonstrated to be a modest performance improvement on the na\"{\i}ve approach.
Note: See TracChangeset for help on using the changeset viewer.