- Timestamp:
- Mar 11, 2019, 11:06:10 PM (6 years ago)
- Branches:
- ADT, aaron-thesis, arm-eh, ast-experimental, cleanup-dtors, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
- Children:
- e402bbc
- Parents:
- 811466d
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
doc/theses/aaron_moss_PhD/phd/experiments.tex
r811466d r834f634 108 108 All tests were run on a machine with 128~GB of RAM and 64 cores running at 2.2~GHz. 109 109 110 \begin{figure} 111 \centering 112 \input{tests-completed} 113 \caption[Tests completed for each algorithmic variant]{Number of tests completed for each algorithmic variant} \label{tests-completed-fig} 114 \end{figure} 115 110 116 As a matter of experimental practicality, test runs that exceeded 8~GB of peak resident memory usage are excluded from the data set. 111 117 This restriction is justifiable by real-world use, as a compiler that is merely slow may be accommodated with patience, but one that uses in excess of 8~GB of RAM may be impossible to run on many currently deployed computer systems. 112 118 8~GB of RAM is not typical of the memory usage of the best-peforming two variants, \textsc{bu-dca-bas} and \textsc{bu-dca-per}, which were able to run all 131 test inputs to completion with maximum memory usage of 70~MB and 78~MB, respectively. 113 119 However, this threshold did eliminate a significant number of algorithm-test variants, with the worst-performing variant, \textsc{td-imm-inc}, only completing 62 test inputs within the memory bound. 114 Full results for tests completed by algorithm variant are presented in Figure~\ref{tests-completed-fig}. 115 116 \begin{figure} 117 \centering 118 \input{tests-completed} 119 \caption[Tests completed for each algorithmic variant]{Number of tests completed for each algorithmic variant} \label{tests-completed-fig} 120 \end{figure} 121 120 Full results for tests completed by algorithm variant are presented in Figure~\ref{tests-completed-fig}. 122 121 As can be seen from these results, traversal direction is clearly the dominant variable in memory usage, with the \textsc{bu-*} variants performing better than the \textsc{co-*} variants, which in turn out-perform the \textsc{td-*} variants. 123 It can also be seen that the incremental inheritance (\textsc{inc}) type environment consistently under-performs the other two environment data structures tested, as any efficiencies from the inheritance mechanism are apparently insufficient to pay for the added complexity of the data structure. 124 125 To provide a more holistic view of performance, I have considered the results from the 56 test inputs which all algorithms are able to complete within the memory bound. 126 Limiting consideration to these algorithms provides an apples-to-apples comparison between algorithms, as the excluded inputs are harder instances which take more time and memory for the algorithms which are able to solve them. 122 123 To provide a more holistic view of performance, I have considered the results from the 56 test inputs that all algorithms are able to complete within the memory bound. 124 Limiting consideration to these algorithms provides an apples-to-apples comparison among algorithms, as the excluded inputs are harder instances, which take more time and memory for the algorithms that are able to solve them. 127 125 Figures~\ref{avg-peak-mem-fig} and~\ref{avg-runtime-fig} show the mean peak memory and runtime, respectively, of each algorithm over the inputs in this data set. 128 126 These averages are not themselves meaningful, but do enable an overall comparison of relative performance of the different variants. 129 Selecting only these 56 ``easy'' test inputs does bias the average values downward, but has little effect on the relative trends; similar trends can be seen in the graphs of the \textsc{bu-*} algorithms over the 124 (of 131) test inputs which all complete,omitted to save space.127 Selecting only these 56 ``easy'' test inputs does bias the average values downward, but has little effect on the relative trends; similar trends can be seen in the graphs of the \textsc{bu-*} algorithms over the 124 (of 131) test inputs that all complete, which have been omitted to save space. 130 128 131 129 \begin{figure} … … 154 152 155 153 It can be seen from these results that that the top-down, immediate assertion-satisfaction (\textsc{td-imm-*}) variants are particularly inefficient, as they check a significant number of assertions without filtering to determine if the arguments can be made to fit. 156 It is also clear that the bottom-up (\textsc{bu}) traversal order is better than both top-down (\textsc{td}) and the Bilson-style bottom-up-combined ( (\textsc{co})) orders.154 It is also clear that the bottom-up (\textsc{bu}) traversal order is better than both top-down (\textsc{td}) and the Bilson-style bottom-up-combined (\textsc{co}) orders. 157 155 While the advantage of \textsc{bu} over \textsc{co} is clear, in that it performs less redundant work if a prefix of a combination fails, the advantage of \textsc{bu} over \textsc{td} provides an answer for an open question from Baker \cite{Baker82}. 158 156 I believe that bottom-up is superior because it must only handle each subexpression once to form a list of candidate interpretations, whereas the top-down approach may do similar work repeatedly to resolve a subexpression with a variety of different types, a shortcoming that cannot be fully addressed by the memoization scheme employed in the \textsc{td} algorithm. 159 157 160 With regard to assertion satisfaction, immediate (\textsc{imm}) satisfaction is an inferior solution, though there is little performance difference between deferred (\textsc{def}) and deferred-cached (\textsc{dca}) for instances which they can both complete; particularly notable is that \textsc{dca} caching scheme does not have a noticeable impact on peak memory usage. 161 Since the \textsc{dca} algorithm can solve some particularly hard instances which \textsc{def} cannot, it is the recommended approach. 162 163 The \textsc{inc} type environment also often uses upwards of double the memory required by the other variants, in addition to being consistently slower on these easy tests; aside from \textsc{bu-imm-bas} performing worse than \textsc{bu-imm-inc} on average when larger tests are considered, these results hold for the other variants. 158 With regard to assertion satisfaction, immediate (\textsc{imm}) satisfaction is an inferior solution, though there is little performance difference between deferred (\textsc{def}) and deferred-cached (\textsc{dca}) for instances that both can complete; particularly notable is that the \textsc{dca} caching-scheme does not have a noticeable impact on peak memory usage. 159 Since the \textsc{dca} algorithm can solve some particularly hard instances that \textsc{def} cannot, it is the recommended approach. 160 161 The incremental-inheritance (\textsc{inc}) type environment also often uses upwards of double the memory required by the other variants, in addition to being consistently slower on these easy tests; aside from \textsc{bu-imm-bas} performing worse than \textsc{bu-imm-inc} on average when larger tests are considered, these results hold for the other variants. 162 It is apparent from these results that any efficiencies from the inheritance mechanism are insufficient to pay for the added complexity of the data structure. 164 163 Aside from that, the persistent union-find (\textsc{per}) type environment generally performs better than the basic (\textsc{bas}) environment, with similar peak memory usage and an average speedup factor of nearly 2, though the requirements of the \textsc{per} environment for automatic garbage collection and a shared history for combination make retrofitting it into older code difficult. 165 164 166 165 \section{Instance Difficulty} \label{instance-expr-sec} 167 166 168 To characterize the difficulty of expression 167 To characterize the difficulty of expression-resolution problem instances, the test suites must be explored at a finer granularity. 169 168 As discussed in Section~\ref{resn-analysis-sec}, a single top-level expression is the fundamental problem instance for resolution, yet the test inputs discussed above are composed of thousands of top-level expressions, like the actual source code they are derived from. 170 To pull out the effects of these individual problems, I instrumented the resolver prototype to time resolution for each expression, and also report some relevant properties of the expression.171 This instrumented resolver was then run on a set of difficult test instances; to limit the data collection task, these runs were limited to the best-performing \textsc{bu-dca-per} algorithm and test inputs which that algorithm tookmore than 1~s to complete.172 173 The 13 test inputs thus selected contain 20632 top-level expressions betweenthem, which are separated into order-of-magnitude bins by runtime in Figure~\ref{per-prob-histo-fig}.174 As can be seen from this figure, overall runtime is dominated by a few particularly difficult problem instances --- the 60\% of expressions which resolve in under 0.1~ms collectively take less time to resolve than any of the 0.2\% of expressions whichtake at least 100~ms to resolve.169 To pull out the effects of these individual problems, the resolver prototype is instrumented to time resolution for each expression, and also to report some relevant properties of the expression. 170 This instrumented resolver is then run on a set of difficult test instances; to limit the data collection task, these runs are restricted to the best-performing \textsc{bu-dca-per} algorithm and test inputs taking more than 1~s to complete. 171 172 The 13 test inputs thus selected contain 20632 top-level expressions among them, which are separated into order-of-magnitude bins by runtime in Figure~\ref{per-prob-histo-fig}. 173 As can be seen from this figure, overall runtime is dominated by a few particularly difficult problem instances --- the 60\% of expressions that resolve in under 0.1~ms collectively take less time to resolve than any of the 0.2\% of expressions that take at least 100~ms to resolve. 175 174 On the other hand, the 46 expressions in that 0.2\% take 38\% of the overall time in this difficult test suite, while the 201 expressions that take between 10 and 100~ms to resolve consume another 30\%. 176 175 … … 181 180 \end{figure} 182 181 183 Since the top centile of expression 184 The data indicates that number of assertions necessary to resolve has the greatest effect on runtime, as seen in182 Since the top centile of expression-resolution instances requires approximately two-thirds of the resolver's time, optimizing the resolver for specific hard problem instances has proven to be an effective technique for reducing overall runtime. 183 The data indicates that the number of assertions necessary to resolve has the greatest effect on runtime, as seen in 185 184 Figure~\ref{per-prob-assns-fig}. 186 185 However, since the number of assertions required is only known once resolution is finished, the most-promising pre-resolution metric of difficulty is the nesting depth of the expression; as seen in Figure~\ref{per-prob-depth-fig}, expressions of depth $> 10$ in this data-set are uniformly difficult. … … 212 211 213 212 I have integrated most of the algorithmic techniques discussed in this chapter into \CFACC{}. 214 This integration took place over a period of months while \CFACC{} was under active development on a number of other fronts, so it is not possible to completely isolate the effects of the algorithmic changes, but I believe the algorithmic changes to bethe most-significant effects on performance over the study period.213 This integration took place over a period of months while \CFACC{} was under active development on a number of other fronts, so it is not possible to completely isolate the effects of the algorithmic changes, but I believe the algorithmic changes to have had the most-significant effects on performance over the study period. 215 214 To generate this data, representative commits from the \texttt{git} history of the project were checked out and compiled, then run on the same machine used for the resolver prototype experiments discussed in Section~\ref{proto-exp-sec}. 216 To negate the effects of changes to the \CFA{} standard library on the timing results, 55 test files from the test suite of the oldest \CFA{} variant were compiled with the \texttt{-E} flag to inline their library dependencies, and these inlined files were used to test the remaining \CFACC{} versions.215 To negate the effects of changes to the \CFA{} standard library on the timing results, 55 test files from the test suite of the oldest \CFA{} variant are compiled with the \texttt{-E} flag to inline their library dependencies, and these inlined files are used to test the remaining \CFACC{} versions. 217 216 218 217 I performed two rounds of modification to \CFACC{}; the first round moved from Bilson's original combined-bottom-up algorithm to an un-combined bottom-up algorithm, denoted \textsc{cfa-co} and \textsc{cfa-bu}, respectively. … … 222 221 It is possible that the persistent union-find environment could be modified to use a reference-counted pointer internally without changing the entire memory-management framework of \CFACC{}, but such an attempt is left to future work. 223 222 224 As can be seen in Figures~\ref{cfa-time-fig} and~\ref{cfa-mem-fig}, which show the time and peak memory results for these five versions of \CFACC{}, assertion resolution dominates total resolution cost, with the \textsc{cfa-def} and \textsc{cfa-dca} variants running consistently faster than the others on more expensive test cases; no difference can be seen between these two algorithms' performance, but that resultagrees with the prototype experiments in Section~\ref{proto-exp-sec}.223 As can be seen in Figures~\ref{cfa-time-fig} and~\ref{cfa-mem-fig}, the time and peak memory results for these five versions of \CFACC{} show that assertion resolution dominates total resolution cost, with the \textsc{cfa-def} and \textsc{cfa-dca} variants running consistently faster than the others on more expensive test cases; no difference can be seen between these two algorithms' performance, but which agrees with the prototype experiments in Section~\ref{proto-exp-sec}. 225 224 The results from \CFACC{} for \textsc{cfa-co} \textit{vs.}\ \textsc{cfa-bu} do not mirror those from the prototype; I conjecture this is mostly due to the different memory-management schemes and sorts of data required to run type unification and assertion satisfaction calculations, as \CFACC{} performance has proven to be particularly sensitive to the amount of heap allocation performed. 226 225 This data also shows a noticeable regression in compiler performance in the eleven months between \textsc{cfa-bu} and \textsc{cfa-imm}, which use the same resolution algorithms; this regression is not due to expression resolution, as no integration work happened in this time, but I am unable to ascertain its actual cause. … … 252 251 253 252 As can be seen from the prototype results, per-expression benchmarks, and \CFACC{}, the dominant factor in the cost of \CFA{} expression resolution is assertion satisfaction. 254 Reducing the number of total number of assertion satisfaction problems solved, as in the deferred satisfaction algorithm, is consistently effective at reducing runtime, and caching results of these satisfaction problems has shown promise in the prototype system.253 Reducing the total number of assertions satisfied, as in the deferred satisfaction algorithm, is consistently effective at reducing runtime, and caching results of these satisfaction problem instances has shown promise in the prototype system. 255 254 The results presented here also demonstrate that a bottom-up approach to expression resolution is superior to top-down, settling an open question from Baker~\cite{Baker82}. 256 255 The persistent union-find type environment introduced in Chapter~\ref{env-chap} has also been demonstrated to be a modest performance improvement on the na\"{\i}ve approach.
Note: See TracChangeset
for help on using the changeset viewer.