# Changeset 834f634 for doc/theses

Ignore:
Timestamp:
Mar 11, 2019, 11:06:10 PM (4 years ago)
Branches:
aaron-thesis, arm-eh, cleanup-dtors, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
Children:
e402bbc
Parents:
811466d
Message:

thesis: second draft of ch.6

File:
1 edited

### Legend:

Unmodified
 r811466d All tests were run on a machine with 128~GB of RAM and 64 cores running at 2.2~GHz. \begin{figure} \centering \input{tests-completed} \caption[Tests completed for each algorithmic variant]{Number of tests completed for each algorithmic variant} \label{tests-completed-fig} \end{figure} As a matter of experimental practicality, test runs that exceeded 8~GB of peak resident memory usage are excluded from the data set. This restriction is justifiable by real-world use, as a compiler that is merely slow may be accommodated with patience, but one that uses in excess of 8~GB of RAM may be impossible to run on many currently deployed computer systems. 8~GB of RAM is not typical of the memory usage of the best-peforming two variants, \textsc{bu-dca-bas} and \textsc{bu-dca-per}, which were able to run all 131 test inputs to completion  with maximum memory usage of 70~MB and 78~MB, respectively. However, this threshold did eliminate a significant number of algorithm-test variants, with the worst-performing variant, \textsc{td-imm-inc}, only completing 62 test inputs within the memory bound. Full results for tests completed by algorithm variant are presented in Figure~\ref{tests-completed-fig}. \begin{figure} \centering \input{tests-completed} \caption[Tests completed for each algorithmic variant]{Number of tests completed for each algorithmic variant} \label{tests-completed-fig} \end{figure} Full results for tests completed by algorithm variant are presented in Figure~\ref{tests-completed-fig}. As can be seen from these results, traversal direction is clearly the dominant variable in memory usage, with the \textsc{bu-*} variants performing better than the \textsc{co-*} variants, which in turn out-perform the \textsc{td-*} variants. It can also be seen that the incremental inheritance (\textsc{inc}) type environment consistently under-performs the other two environment data structures tested, as any efficiencies from the inheritance mechanism are apparently insufficient to pay for the added complexity of the data structure. To provide a more holistic view of performance, I have considered the results from the 56 test inputs which all algorithms are able to complete within the memory bound. Limiting consideration to these algorithms provides an apples-to-apples comparison between algorithms, as the excluded inputs are harder instances which take more time and memory for the algorithms which are able to solve them. To provide a more holistic view of performance, I have considered the results from the 56 test inputs that all algorithms are able to complete within the memory bound. Limiting consideration to these algorithms provides an apples-to-apples comparison among algorithms, as the excluded inputs are harder instances, which take more time and memory for the algorithms that are able to solve them. Figures~\ref{avg-peak-mem-fig} and~\ref{avg-runtime-fig} show the mean peak memory and runtime, respectively, of each algorithm over the inputs in this data set. These averages are not themselves meaningful, but do enable an overall comparison of relative performance of the different variants. Selecting only these 56 easy'' test inputs does bias the average values downward, but has little effect on the relative trends; similar trends can be seen in the graphs of the \textsc{bu-*} algorithms over the 124 (of 131) test inputs which all complete, omitted to save space. Selecting only these 56 easy'' test inputs does bias the average values downward, but has little effect on the relative trends; similar trends can be seen in the graphs of the \textsc{bu-*} algorithms over the 124 (of 131) test inputs that all complete, which have been omitted to save space. \begin{figure} It can be seen from these results that that the top-down, immediate assertion-satisfaction (\textsc{td-imm-*}) variants are particularly inefficient, as they check a significant number of assertions without filtering to determine if the arguments can be made to fit. It is also clear that the bottom-up (\textsc{bu}) traversal order is better than both top-down (\textsc{td}) and the Bilson-style bottom-up-combined ((\textsc{co})) orders. It is also clear that the bottom-up (\textsc{bu}) traversal order is better than both top-down (\textsc{td}) and the Bilson-style bottom-up-combined (\textsc{co}) orders. While the advantage of \textsc{bu} over \textsc{co} is clear, in that it performs less redundant work if a prefix of a combination fails, the advantage of \textsc{bu} over \textsc{td} provides an answer for an open question from Baker \cite{Baker82}. I believe that bottom-up is superior because it must only handle each subexpression once to form a list of candidate interpretations, whereas the top-down approach may do similar work repeatedly to resolve a subexpression with a variety of different types, a shortcoming that cannot be fully addressed by the memoization scheme employed in the \textsc{td} algorithm. With regard to assertion satisfaction, immediate (\textsc{imm}) satisfaction is an inferior solution, though there is little performance difference between deferred (\textsc{def}) and deferred-cached (\textsc{dca}) for instances which they can both complete; particularly notable is that \textsc{dca} caching scheme does not have a noticeable impact on peak memory usage. Since the \textsc{dca} algorithm can solve some particularly hard instances which \textsc{def} cannot, it is the recommended approach. The \textsc{inc} type environment also often uses upwards of double the memory required by the other variants, in addition to being consistently slower on these easy tests; aside from \textsc{bu-imm-bas} performing worse than \textsc{bu-imm-inc} on average when larger tests are considered, these results hold for the other variants. With regard to assertion satisfaction, immediate (\textsc{imm}) satisfaction is an inferior solution, though there is little performance difference between deferred (\textsc{def}) and deferred-cached (\textsc{dca}) for instances that both can complete; particularly notable is that the \textsc{dca} caching-scheme does not have a noticeable impact on peak memory usage. Since the \textsc{dca} algorithm can solve some particularly hard instances that \textsc{def} cannot, it is the recommended approach. The incremental-inheritance (\textsc{inc}) type environment also often uses upwards of double the memory required by the other variants, in addition to being consistently slower on these easy tests; aside from \textsc{bu-imm-bas} performing worse than \textsc{bu-imm-inc} on average when larger tests are considered, these results hold for the other variants. It is apparent from these results that any efficiencies from the inheritance mechanism are insufficient to pay for the added complexity of the data structure. Aside from that, the persistent union-find (\textsc{per}) type environment generally performs better than the basic (\textsc{bas}) environment, with similar peak memory usage and an average speedup factor of nearly 2, though the requirements of the \textsc{per} environment for automatic garbage collection and a shared history for combination make retrofitting it into older code difficult. \section{Instance Difficulty} \label{instance-expr-sec} To characterize the difficulty of expression resolution problem instances, the test suites must be explored at a finer granularity. To characterize the difficulty of expression-resolution problem instances, the test suites must be explored at a finer granularity. As discussed in Section~\ref{resn-analysis-sec}, a single top-level expression is the fundamental problem instance for resolution, yet the test inputs discussed above are composed of thousands of top-level expressions, like the actual source code they are derived from. To pull out the effects of these individual problems, I instrumented the resolver prototype to time resolution for each expression, and also report some relevant properties of the expression. This instrumented resolver was then run on a set of difficult test instances; to limit the data collection task, these runs were limited to the best-performing \textsc{bu-dca-per} algorithm and test inputs which that algorithm took more than 1~s to complete. The 13 test inputs thus selected contain 20632 top-level expressions between them, which are separated into order-of-magnitude bins by runtime in Figure~\ref{per-prob-histo-fig}. As can be seen from this figure, overall runtime is dominated by a few particularly difficult problem instances --- the 60\% of expressions which resolve in under 0.1~ms collectively take less time to resolve than any of the 0.2\% of expressions which take at least 100~ms to resolve. To pull out the effects of these individual problems, the resolver prototype is instrumented to time resolution for each expression, and also to report some relevant properties of the expression. This instrumented resolver is then run on a set of difficult test instances; to limit the data collection task, these runs are restricted to the best-performing \textsc{bu-dca-per} algorithm and test inputs taking more than 1~s to complete. The 13 test inputs thus selected contain 20632 top-level expressions among them, which are separated into order-of-magnitude bins by runtime in Figure~\ref{per-prob-histo-fig}. As can be seen from this figure, overall runtime is dominated by a few particularly difficult problem instances --- the 60\% of expressions that resolve in under 0.1~ms collectively take less time to resolve than any of the 0.2\% of expressions that take at least 100~ms to resolve. On the other hand, the 46 expressions in that 0.2\% take 38\% of the overall time in this difficult test suite, while the 201 expressions that take between 10 and 100~ms to resolve consume another 30\%. \end{figure} Since the top centile of expression resolution instances requires approximately two-thirds of the resolver's time, optimizing the resolver for specific hard problem instances has proven to be an effective technique for reducing overall runtime. The data indicates that number of assertions necessary to resolve has the greatest effect on runtime, as seen in Since the top centile of expression-resolution instances requires approximately two-thirds of the resolver's time, optimizing the resolver for specific hard problem instances has proven to be an effective technique for reducing overall runtime. The data indicates that the number of assertions necessary to resolve has the greatest effect on runtime, as seen in Figure~\ref{per-prob-assns-fig}. However, since the number of assertions required is only known once resolution is finished, the most-promising pre-resolution metric of difficulty is the nesting depth of the expression; as seen in Figure~\ref{per-prob-depth-fig}, expressions of depth $> 10$ in this data-set are uniformly difficult. I have integrated most of the algorithmic techniques discussed in this chapter into \CFACC{}. This integration took place over a period of months while \CFACC{} was under active development on a number of other fronts, so it is not possible to completely isolate the effects of the algorithmic changes, but I believe the algorithmic changes to be the most-significant effects on performance over the study period. This integration took place over a period of months while \CFACC{} was under active development on a number of other fronts, so it is not possible to completely isolate the effects of the algorithmic changes, but I believe the algorithmic changes to have had the most-significant effects on performance over the study period. To generate this data, representative commits from the \texttt{git} history of the project were checked out and compiled, then run on the same machine used for the resolver prototype experiments discussed in Section~\ref{proto-exp-sec}. To negate the effects of changes to the \CFA{} standard library on the timing results, 55 test files from the test suite of the oldest \CFA{} variant were compiled with the \texttt{-E} flag to inline their library dependencies, and these inlined files were used to test the remaining \CFACC{} versions. To negate the effects of changes to the \CFA{} standard library on the timing results, 55 test files from the test suite of the oldest \CFA{} variant are compiled with the \texttt{-E} flag to inline their library dependencies, and these inlined files are used to test the remaining \CFACC{} versions. I performed two rounds of modification to \CFACC{}; the first round moved from Bilson's original combined-bottom-up algorithm to an un-combined bottom-up algorithm, denoted \textsc{cfa-co} and \textsc{cfa-bu}, respectively. It is possible that the persistent union-find environment could be modified to use a reference-counted pointer internally without changing the entire memory-management framework of \CFACC{}, but such an attempt is left to future work. As can be seen in Figures~\ref{cfa-time-fig} and~\ref{cfa-mem-fig}, which show the time and peak memory results for these five versions of \CFACC{}, assertion resolution dominates total resolution cost, with the \textsc{cfa-def} and \textsc{cfa-dca} variants running consistently faster than the others on more expensive test cases; no difference can be seen between these two algorithms' performance, but that result agrees with the prototype experiments in Section~\ref{proto-exp-sec}. As can be seen in Figures~\ref{cfa-time-fig} and~\ref{cfa-mem-fig}, the time and peak memory results for these five versions of \CFACC{} show that assertion resolution dominates total resolution cost, with the \textsc{cfa-def} and \textsc{cfa-dca} variants running consistently faster than the others on more expensive test cases; no difference can be seen between these two algorithms' performance, but which agrees with the prototype experiments in Section~\ref{proto-exp-sec}. The results from \CFACC{} for \textsc{cfa-co} \textit{vs.}\ \textsc{cfa-bu} do not mirror those from the prototype; I conjecture this is mostly due to the different memory-management schemes and sorts of data required to run type unification and assertion satisfaction calculations, as \CFACC{} performance has proven to be particularly sensitive to the amount of heap allocation performed. This data also shows a noticeable regression in compiler performance in the eleven months between \textsc{cfa-bu} and \textsc{cfa-imm}, which use the same resolution algorithms; this regression is not due to expression resolution, as no integration work happened in this time, but I am unable to ascertain its actual cause. As can be seen from the prototype results, per-expression benchmarks, and \CFACC{}, the dominant factor in the cost of \CFA{} expression resolution is assertion satisfaction. Reducing the number of total number of assertion satisfaction problems solved, as in the deferred satisfaction algorithm, is consistently effective at reducing runtime, and caching results of these satisfaction problems has shown promise in the prototype system. Reducing the total number of assertions satisfied, as in the deferred satisfaction algorithm, is consistently effective at reducing runtime, and caching results of these satisfaction problem instances has shown promise in the prototype system. The results presented here also demonstrate that a bottom-up approach to expression resolution is superior to top-down, settling an open question from Baker~\cite{Baker82}. The persistent union-find type environment introduced in Chapter~\ref{env-chap} has also been demonstrated to be a modest performance improvement on the na\"{\i}ve approach.