Context Navigation

source: doc/theses/aaron_moss_PhD/phd/experiments.tex @ 6b6b9ba

ADTarm-ehast-experimentalenumforall-pointer-decayjacob/cs343-translationnew-ast-unique-exprpthread-emulationqualifiedEnum

Last change on this file since 6b6b9ba was f845e80, checked in by Aaron Moss <a3moss@…>, 6 years ago
thesis: apply round 2 revisions and strip change bars
Property mode set to `100644`
File size: 30.4 KB

Line
1	\chapter{Experiments}
2	\label{expr-chap}
3
4	I implemented a prototype system to test the practical effectiveness of the various algorithms described in Chapters~\ref{resolution-chap} and~\ref{env-chap}.
5	This prototype system implements the expression resolution pass of the \CFA{} compiler, \CFACC{}, with a simplified version of the \CFA{} type system and a parser to read in problem instances, and is published online under a permissive licence\footnote{\url{https://github.com/cforall/resolv-proto}}.
6	The resolver prototype allows for quicker iteration on algorithms due to its simpler language model and lack of a requirement to generate runnable code, yet captures enough of the nuances of \CFA{} to have predictive power for the runtime performance of algorithmic variants in \CFACC{} itself.
7
8	\CFACC{} can generate realistic test inputs for the resolver prototype from equivalent \CFA{} code;
9	the generated test inputs currently comprise all \CFA{} code currently in existence\footnote{Though \CFA{} is backwards-compatible with C, the lack of \lstinline{forall} functions and name overloading in C mean that the larger corpus of C code does not provide challenging test instances for \CFACC{}.}, 9,000 lines drawn primarily from the standard library and compiler test suite.
10	This code includes a substantial degree of name overloading for common library functions and a number of fundamental polymorphic abstractions, including iterators and streaming input/output.
11	\CFACC{} is also instrumented to produce a number of code metrics.
12	These metrics were used to construct synthetic test inputs during development of the resolver prototype; these synthetic inputs provided useful design guidance, but the performance results presented in this chapter are based on the more realistic directly-generated inputs.
13
14	\section{Resolver Prototype Features} \label{rp-features-sec}
15
16	The resolver prototype can express most of the \CFA{} features described in Chapter~\ref{cfa-chap}.
17	It supports both monomorphic and polymorphic functions, with type assertions for polymorphic functions.
18	Traits are not explicitly represented, but \CFACC{} inlines traits before the resolver pass, so this is a faithful representation of the existing compiler.
19	The prototype system supports variable declarations as well as function declarations, and has a lexical-scoping scheme and \CFA{}-like overloading rules.
20
21	The type system of the resolver prototype also captures key aspects of the \CFA{} type system.
22	\emph{Concrete types} represent the built-in arithmetic types of \CFA{}, along with the implicit conversions among them.
23	Each concrete type is represented by an integer identifier, and the conversion cost from $x$ to $y$ is $\|y-x\|$, a safe conversion if $y > x$, or an unsafe conversion if $y < x$.
24	This scheme is markedly simpler than the graph of conversion costs in \CFA{} (Figure~\ref{safe-conv-fig}), but captures the essentials of the design.
25	For simplicity, !zero_t! and !one_t!, the types of !0! and !1!, are represented by the type corresponding to !int!.
26	\emph{Named types} are analogues to \CFA{} aggregates, such as structs and unions; aggregate fields are encoded as unary functions from the struct type to the field type, with the function named based on the field name.
27	Named types also support type parameters, and as such can represent generic types as well.
28	Generic named types are used to represent the built-in parameterized types of \CFA{} as well; !T*! is encoded as \texttt{\#\$ptr<T>}.
29	\CFA{} arrays are also represented as pointers, to simulate array-to-pointer decay, while top-level reference types are replaced by their referent to simulate the variety of reference conversions.
30	\emph{Function types} have first-class representation in the prototype as well; \CFA{} function pointers are represented as variables with the appropriate function type, though \CFA{} polymorphic function pointers cannot be represented, as the prototype system stores information about type assertions in function declarations rather than in the function type.
31	\emph{Void} and \emph{tuple types} are also supported in the prototype, to express the multiple-return-value functions in \CFA{}, though varargs functions and !ttype! tuple-typed type variables are absent from the prototype system.
32	The prototype system also does not represent type qualifiers (\eg{} !const!, !volatile!), so all such qualifiers are stripped during conversion to the prototype system.
33
34	The resolver prototype supports three sorts of expressions in its input language.
35	The simplest are \emph{value expressions}, which are expressions declared to be a certain type; these implement literal expressions in \CFA{}, and, already being typed, are passed through the resolver unchanged.
36	The second sort, \emph{name expressions}, represent a variable expression in \CFA{}; these contain the name of a variable or function, and are matched to an appropriate declaration overloading that name.
37	The third input expression, the \emph{function expression}, represents a call to a function, with a name and zero or more argument subexpressions.
38	As is usual in \CFA{}, operators are represented as function calls; however, as mentioned above, the prototype system represents field access expressions !a.f! as function expressions as well.
39
40	The main area for future expansion in the design of the resolver prototype is conversions.
41	Cast expressions are implemented in the output language of the resolver, but cannot be expressed in the input.
42	The only implicit conversions supported are among the arithmetic-like concrete types, which capture most, but not all, of \CFA{}'s built-in implicit conversions\footnote{Notable absences include \lstinline{void*} to other pointer types, or \lstinline{0} to pointer types.}.
43	Future work should include a way to express implicit (and possibly explicit) conversions in the input language, with an investigation of the most efficient way to handle implicit conversions, and potentially a design for user-defined conversions.
44
45	\section{Resolver Prototype Design}
46
47	As discussed above, for speed of development the resolver prototype works over a simplified version of the \CFA{} type system.
48	The build system for the resolver prototype uses a number of conditional compilation flags to switch among algorithm variants while retaining maximally shared code.
49	A distinct executable name is also generated for each algorithmic variant so that distinct variants can be more easily tested against each other.
50
51	The primary architectural difference between the resolver prototype and \CFACC{} is that the prototype system uses a simple mark-and-sweep garbage collector for memory management, while \CFACC{} uses a manual memory-management approach.
52	This architectural difference affects the mutation patterns used by both systems: \CFACC{} frequently makes deep clones of multi-node object graphs to ensure that there is a single ``owner'' for each object which can safely delete it later; the prototype system, by contrast, relies on its garbage collector to handle ownership, and can often copy pointers rather than cloning objects.
53	The resolver prototype thus only needs to clone nodes that it modifies, and can share un-modified children between clones; the tree mutator abstraction in the prototype is designed to take advantage of this property.
54	The key design decision enabling this is that all child nodes are held by !const! pointer, and thus cannot be mutated once they have been stored in a parent node.
55	With minimal programming discipline, it can thus be ensured that any expression is either mutable or shared, but never both; the Dotty research compiler for Scala takes a similar architectural approach \cite{Dotty-github}.
56
57	Given the significantly better performance results from the resolver prototype than \CFACC{} and profiling data showing that memory allocation is a large component of \CFACC{} runtime, I attempted to port this garbage collector to \CFACC{}, but without success.
58	The GC could be used for memory management with few changes to the code-base, but without a substantial re-write to enforce the same ``!const! children'' discipline, \CFACC{} could not take advantage of the potential to share sub-objects; without sharing of sub-objects the GC variant of \CFACC{} must do all the same allocations and deletions and garbage-collector overhead degraded performance unacceptably (though it did fix some known memory leaks introduced by failures of the existing manual memory-management scheme).
59
60	Another minor architectural difference between the prototype system and \CFACC{} is that \CFACC{} makes extensive use of the pointer-based !std::list!, !std::set!, and !std::map! data structures, while the prototype uses the array-based !std::vector! and the hash-based !unordered_! variants of !set! and !map! instead.
61	Porting the prototype to use the pointer-based data structures resulted in modest performance regressions, whereas preliminary results from porting \CFACC{} to use !std::vector! over !std::list! also showed performance regressions, in some cases significant.
62	The relative performance impact of this architectural difference is unclear, and thus excluded from consideration.
63
64	The final difference between \CFACC{} and the resolver prototype is that, as an experiment in language usability, the prototype performs resolution-based rather than unification-based assertion satisfaction, as discussed in Section~\ref{resn-conclusion-sec}.
65	This change enables coding patterns not available in \CFACC{}, \eg{} a more flexible approach to type assertion satisfaction and better handling of functions returning polymorphic type variables that do not exist in the parameter list.
66	The experimental results in Section~\ref{proto-exp-sec} indicate that this choice is not a barrier to a performant resolver.
67	% \TODO{test performance; shouldn't be too hard to change \texttt{resolveAssertions} to use unification}
68
69	\section{Prototype Experiments} \label{proto-exp-sec}
70
71	The primary performance experiments for this thesis are conducted using the resolver prototype on problem instances generated from actual \CFA{} code using the method described in Section~\ref{rp-features-sec}.
72	The prototype is compiled in 24 variants over 3 variables, with variants identified by the hyphen-separated concatenation of their short codes, \eg{} \textsc{bu-imm-bas} for bottom-up traversal, immediate assertion satisfaction, basic type environment.
73	The variables and their values are as follows:
74
75	\begin{description}
76	\item[Traversal direction] The order in which arguments are matched with parameters, as discussed in Section~\ref{arg-parm-matching-sec}.
77	\begin{description}
78	\item[Bottom-up] (\textsc{bu}) Baker-style bottom-up pass, searching for function candidates based on the available argument interpretations.
79	\item[Combined] (\textsc{co}) Bilson-style bottom-up pass, where argument interpretations are combined into a single interpretation for each set of options.
80	\item[Top-down] (\textsc{td}) Cormack-style top-down pass, searching for argument interpretations based on function candidate parameter types. The \textsc{td-*} variants of the resolver prototype implement a caching system to avoid re-computation of the same argument interpretation with the same type.
81	\end{description}
82	\item[Assertion satisfaction] The algorithm for finding satisfying declarations for type assertions, as discussed in Section~\ref{assn-sat-sec}.
83	\begin{description}
84	\item[Immediate] (\textsc{imm}) All assertions are checked for satisfaction immediately upon generating a candidate interpretation. The techniques discussed in Section~\ref{assn-sat-sec} for environment combination and level-by-level consideration of recursive assertions are applied here.
85	\item[Deferred] (\textsc{def}) As in \textsc{imm}, but only checks minimal-cost top-level interpretations after all top-level interpretations have been generated.
86	\item[Deferred Cached] (\textsc{dca}) As in \textsc{def}, but uses the caching optimization discussed in Section~\ref{assn-sat-sec}.
87	\end{description}
88	\item[Type Environment] The type environment data structure used, as discussed in Chapter~\ref{env-chap}.
89	\begin{description}
90	\item[Basic] (\textsc{bas}) Bilson-style type environment with hash-based equivalence class storage, as discussed in Section~\ref{naive-env-sec}.
91	\item[Incremental Inheritance] (\textsc{inc}) Incremental-inheritance variant sharing unmodified common parent information among environments, as discussed in Section~\ref{inc-env-sec}.
92	\item[Persistent union-find] (\textsc{per}) Union-find-based environment, using the persistent variant discussed in Section~\ref{env-persistent-union-find} for backtracking and combination. This variant requires that all pairs of type arguments used as arguments to $combine$ descend from a common root environment; this requirement is incompatible with the caching used in the top-down traversal direction, and thus no \textsc{td-*-per} algorithms are tested.
93	\end{description}
94	\end{description}
95
96	To test the various algorithms, the resolver prototype is compiled using \texttt{g++} 6.5.0 with each of the 24 valid combinations of variables\footnote{Namely, all combinations except \textsc{td-*-per}.}, and then timed running each of the \CFA{}-derived test inputs.
97	Terminal output is suppressed for all tests to avoid confounding factors in the timing results, and all tests are run three times in series, with the median result reported in all cases.
98	The medians are representative data points; considering test cases that took at least 0.2~s to run, the average run was within 2\% of the reported median runtime, and no run diverged by more than 20\% of median runtime or 5.5~s.
99	The memory results are even more consistent, with no run exceeding 2\% difference from median in peak resident set size, and 93\% of tests recording identical peak memory usage within the 1~KB granularity of the measurement software.
100	All tests were run on a machine with 128~GB of RAM and 64 cores running at 2.2~GHz.
101
102	\begin{figure}
103	\centering
104	\input{tests-completed}
105	\caption[Tests completed for each algorithmic variant]{Number of tests completed for each algorithmic variant} \label{tests-completed-fig}
106	\end{figure}
107
108	As a matter of experimental practicality, test runs that exceeded 8~GB of peak resident memory usage are excluded from the data set.
109	This restriction is justifiable by real-world use, as a compiler that is merely slow may be accommodated with patience, but one that uses in excess of 8~GB of RAM may be impossible to run on many currently deployed computer systems.
110	8~GB of RAM is not typical of the memory usage of the best-performing two variants, \textsc{bu-dca-bas} and \textsc{bu-dca-per}, which were able to run all 131 test inputs to completion with maximum memory usage of 70~MB and 78~MB, respectively.
111	However, this threshold did eliminate a significant number of algorithm-test variants, with the worst-performing variant, \textsc{td-imm-inc}, only completing 62 test inputs within the memory bound.
112	Full results for tests completed by algorithm variant are presented in Figure~\ref{tests-completed-fig}.
113	As can be seen from these results, traversal direction is clearly the dominant variable in memory usage, with the \textsc{bu-} variants performing better than the \textsc{co-} variants, which in turn out-perform the \textsc{td-*} variants.
114
115	To provide a more holistic view of performance, I have considered the results from the 56 test inputs that all algorithms are able to complete within the memory bound.
116	Limiting consideration to these algorithms provides an apples-to-apples comparison among algorithms, as the excluded inputs are harder instances, which take more time and memory for the algorithms that are able to solve them.
117	Figures~\ref{avg-peak-mem-fig} and~\ref{avg-runtime-fig} show the mean peak memory and runtime, respectively, of each algorithm over the inputs in this data set.
118	These averages are not themselves meaningful, but do enable an overall comparison of relative performance of the different variants.
119	Selecting only these 56 ``easy'' test inputs does bias the average values downward, but has little effect on the relative trends; similar trends can be seen in the graphs of the \textsc{bu-*} algorithms over the 124 (of 131) test inputs that all complete, which have been omitted to save space.
120
121	\begin{figure}
122	\centering
123	\input{avg-peak-mem}
124	\caption[Average peak memory for each algorithmic variant]{Average peak resident set size for each algorithmic variant over the 56 test inputs all variants complete.} \label{avg-peak-mem-fig}
125	\end{figure}
126
127	\begin{figure}
128	\centering
129	\input{avg-runtime}
130	\caption[Average runtime for each algorithmic variant]{Average runtime for each algorithmic variant over the 56 test inputs all variants complete.} \label{avg-runtime-fig}
131	\end{figure}
132
133	% \begin{figure}
134	% \centering
135	% \input{bu-peak-mem}
136	% \caption[Average peak memory for each \textsc{bu-} variant]{Average peak resident set size for each \textsc{bu-} variant over the 124 test inputs all \textsc{bu-*} variants complete.} \label{bu-peak-mem-fig}
137	% \end{figure}
138
139	% \begin{figure}
140	% \centering
141	% \input{bu-runtime}
142	% \caption[Average runtime for each \textsc{bu-} variant]{Average runtime for each \textsc{bu-} variant over the 124 test inputs all \textsc{bu-*} variants complete.} \label{bu-runtime-fig}
143	% \end{figure}
144
145	It can be seen from these results that the top-down, immediate assertion-satisfaction (\textsc{td-imm-*}) variants are particularly inefficient, as they check a significant number of assertions without filtering to determine if the arguments can be made to fit.
146	It is also clear that the bottom-up (\textsc{bu}) traversal order is better than both top-down (\textsc{td}) and the Bilson-style bottom-up-combined (\textsc{co}) orders.
147	While the advantage of \textsc{bu} over \textsc{co} is clear, in that it performs less redundant work if a prefix of a combination fails, the advantage of \textsc{bu} over \textsc{td} provides an answer for an open question from Baker \cite{Baker82}.
148	I believe that bottom-up is superior because it must only handle each subexpression once to form a list of candidate interpretations, whereas the top-down approach may do similar work repeatedly to resolve a subexpression with a variety of different types, a shortcoming that cannot be fully addressed by the memoization scheme employed in the \textsc{td} algorithm.
149
150	With regard to assertion satisfaction, immediate (\textsc{imm}) satisfaction is an inferior solution, though there is little performance difference between deferred (\textsc{def}) and deferred-cached (\textsc{dca}) for instances that both can complete; particularly notable is that the \textsc{dca} caching-scheme does not have a noticeable impact on peak memory usage.
151	Since the \textsc{dca} algorithm can solve some particularly hard instances that \textsc{def} cannot, it is the recommended approach.
152
153	The incremental-inheritance (\textsc{inc}) type environment also often uses upwards of double the memory required by the other variants, in addition to being consistently slower on these easy tests; aside from \textsc{bu-imm-bas} performing worse than \textsc{bu-imm-inc} on average when larger tests are considered, these results hold for the other variants.
154	It is apparent from these results that any efficiencies from the inheritance mechanism are insufficient to pay for the added complexity of the data structure.
155	Aside from that, the persistent union-find (\textsc{per}) type environment generally performs better than the basic (\textsc{bas}) environment, with similar peak memory usage and an average speedup factor of nearly 2, though the requirements of the \textsc{per} environment for automatic garbage collection and a shared history for combination make retrofitting it into older code difficult.
156
157	\section{Instance Difficulty} \label{instance-expr-sec}
158
159	To characterize the difficulty of expression-resolution problem instances, the test suites must be explored at a finer granularity.
160	As discussed in Section~\ref{resn-analysis-sec}, a single top-level expression is the fundamental problem instance for resolution, yet the test inputs discussed above are composed of thousands of top-level expressions, like the actual source code they are derived from.
161	To pull out the effects of these individual problems, the resolver prototype is instrumented to time resolution for each expression, and also to report some relevant properties of the expression.
162	This instrumented resolver is then run on a set of difficult test instances; to limit the data collection task, these runs are restricted to the best-performing \textsc{bu-dca-per} algorithm and test inputs taking more than 1~s to complete.
163
164	The 13 test inputs thus selected contain 20,632 top-level expressions among them, which are separated into order-of-magnitude bins by runtime in Figure~\ref{per-prob-histo-fig}.
165	As can be seen from this figure, overall runtime is dominated by a few particularly difficult problem instances --- the 60\% of expressions that resolve in under 0.1~ms collectively take less time to resolve than any of the 0.2\% of expressions that take at least 100~ms to resolve.
166	On the other hand, the 46 expressions in that 0.2\% take 38\% of the overall time in this difficult test suite, while the 201 expressions that take between 10 and 100~ms to resolve consume another 30\%.
167
168	\begin{figure}
169	\centering
170	\input{per-prob-histo}
171	\caption[Histogram of top-level expressions]{Histogram of top-level expression resolution runtime, binned by order-of-magnitude. The left series counts the expressions in each bin according to the left axis, while the right series reports the summed runtime of resolution for all expressions in that bin. Note that both y-axes are log-scaled.} \label{per-prob-histo-fig}
172	\end{figure}
173
174	Since the top centile of expression-resolution instances requires approximately two-thirds of the resolver's time, optimizing the resolver for specific hard problem instances has proven to be an effective technique for reducing overall runtime.
175	The data indicates that the number of assertions necessary to resolve has the greatest effect on runtime, as seen in
176	Figure~\ref{per-prob-assns-fig}.
177	However, since the number of assertions required is only known once resolution is finished, the most-promising pre-resolution metric of difficulty is the nesting depth of the expression; as seen in Figure~\ref{per-prob-depth-fig}, expressions of depth $> 10$ in this data-set are uniformly difficult.
178	Figure~\ref{per-prob-subs-fig} presents a similar pattern for number of subexpressions, though given that the expensive tail of problem instances occurs at approximately twice the depth values, it is reasonable to believe that the difficult expressions in question are deeply-nested invocations of binary functions rather than wider but shallowly-nested expressions.
179
180	% TODO statistics to tease out difficulty? Is ANOVA the right keyword?
181	% TODO maybe metrics to sum number of poly-overloads invoked
182
183	\begin{figure}
184	\centering
185	\input{per-prob-assns}
186	\caption[Top-level expression resolution time by number of assertions resolved.]{Top-level expression resolution time by number of assertions resolved. Source input file for each expression listed in legend; note log scales on both axes.} \label{per-prob-assns-fig}
187	\end{figure}
188
189	\begin{figure}
190	\centering
191	\input{per-prob-depth}
192	\caption[Top-level expression resolution time by maximum nesting depth of expression.]{Top-level expression resolution time by maximum nesting depth of expression. Note log scales on both axes.} \label{per-prob-depth-fig}
193	\end{figure}
194
195	\begin{figure}
196	\centering
197	\input{per-prob-subs}
198	\caption[Top-level expression resolution time by number of subexpressions.]{Top-level expression resolution time by number of subexpressions. Note log scales on both axes.} \label{per-prob-subs-fig}
199	\end{figure}
200
201
202	\section{\CFA{} Results} \label{cfa-results-sec}
203
204	I have integrated a number of the algorithmic techniques discussed in this chapter into \CFACC{}.
205	This integration took place over a period of months while \CFACC{} was under active development on a number of other fronts, so it is not possible to completely isolate the effects of the algorithmic changes, but I believe the algorithmic changes to have had the most-significant effects on performance over the study period.
206	To generate this data, representative commits from the \texttt{git} history of the project were checked out and compiled, then run on the same machine used for the resolver prototype experiments discussed in Section~\ref{proto-exp-sec}.
207	To negate the effects of changes to the \CFA{} standard library on the timing results, 55 test files from the test suite of the oldest \CFA{} variant are compiled with the \texttt{-E} flag to inline their library dependencies, and these inlined files are used to test the remaining \CFACC{} versions.
208
209	I performed two rounds of modification to \CFACC{}; the first round moved from Bilson's original combined-bottom-up algorithm to an un-combined bottom-up algorithm, denoted \textsc{cfa-co} and \textsc{cfa-bu}, respectively.
210	A top-down algorithm was not attempted in \CFACC{} due to its poor performance in the prototype.
211	The second round of modifications addressed assertion satisfaction, taking Bilson's original \textsc{cfa-imm} algorithm and modifying it to use the deferred approach \textsc{cfa-def}.
212	Due to time constraints, a deferred-cached assertion satisfaction algorithm for \CFACC{} could not be completed, but both preliminary results from this effort and the averaged prototype results from Section~\ref{proto-exp-sec} indicate that assertion satisfaction caching is not likely to be a fruitful optimization for \CFACC{}.
213	The new environment data structures discussed in Section~\ref{proto-exp-sec} have not been successfully merged into \CFACC{} due to their dependencies on the garbage-collection framework in the prototype; I spent several months modifying \CFACC{} to use similar garbage collection, but due to \CFACC{} not being designed to use such memory management the performance of the modified compiler was non-viable.
214	It is possible that the persistent union-find environment could be modified to use a reference-counted pointer internally without changing the entire memory-management framework of \CFACC{}, but such an attempt is left to future work.
215
216	As can be seen in Figures~\ref{cfa-time-fig}--\ref{cfa-mem-fig}, the time and peak memory results for these five versions of \CFACC{} show that assertion resolution dominates total resolution cost, with the \textsc{cfa-def} variant running consistently faster than the others on more expensive test cases, and the speedup from the deferred approach increasing with the difficulty of the test case.
217	The results from \CFACC{} for \textsc{cfa-co} \vs{} \textsc{cfa-bu} do not mirror those from the prototype; I conjecture this is mostly due to the different memory-management schemes and sorts of data required to run type unification and assertion satisfaction calculations, as \CFACC{} performance has proven to be particularly sensitive to the amount of heap allocation performed.
218	This data also shows a noticeable regression in compiler performance in the eleven months between \textsc{cfa-bu} and \textsc{cfa-imm}, which use the same resolution algorithms; this approximate doubling in runtime is not due to expression resolution, as no integration work happened in this time, but I am unable to ascertain its actual cause.
219	To isolate the effects of the algorithmic changes from this unrelated performance regression, the speedup results in Figure~\ref{cfa-speedup-fig} are shown with respect to the start of each modification round, \textsc{cfa-bu} \vs{} \textsc{cfa-co} and \textsc{cfa-def} \vs{} \textsc{cfa-imm}.
220	It should also be noted with regard to the peak memory results in Figure~\ref{cfa-mem-fig} that the peak memory usage does not always occur during the resolution phase of the compiler.
221
222	\begin{figure}
223	\centering
224	\input{cfa-time}
225	\caption[\CFACC{} runtime against \textsc{cfa-co} baseline.]{\CFACC{} runtime against \textsc{cfa-co} baseline. Note log scales on both axes.} \label{cfa-time-fig}
226	\end{figure}
227
228	\begin{figure}
229	\centering
230	\input{cfa-speedup}
231	\caption[\CFACC{} speedup.]{\CFACC{} speedup against against \textsc{cfa-co} baseline runtime. To isolate the effect of algorithmic changes, \textsc{cfa-bu} speedup is \vs{} \textsc{cfa-co} while \textsc{cfa-def} speedup is \vs{} \textsc{cfa-imm}. The `inter-round' series shows slowdown between \textsc{cfa-bu} and \textsc{cfa-imm}.} \label{cfa-speedup-fig}
232	\end{figure}
233
234	\begin{figure}
235	\centering
236	\input{cfa-mem}
237	\caption[\CFACC{} peak memory usage against \textsc{cfa-co} baseline runtime.]{\CFACC{} peak memory usage against \textsc{cfa-co} baseline runtime. Note log scales on both axes.} \label{cfa-mem-fig}
238	\end{figure}
239
240	% use Jenkins daily build logs to rebuild speedup graph with more data
241	% https://cforall.uwaterloo.ca/jenkins/job/Cforall/job/master/7089/consoleText
242	% - near top of file: Compiler: Architecture: etc.
243	% - if it doesn't have true for Run Benchmark, less info
244	% - toward bottom, full test results
245	% - be aware of machine change: grep for Ruby, Python
246	% some data you collected personally for imm vs. def vs. dca variants in resolv-proto/logs/rp-bu-tec_vs_cfacc.ods
247
248	\section{Conclusion}
249
250	The dominant factor in the cost of \CFA{} expression resolution is assertion satisfaction.
251	Reducing the total number of assertions satisfied, as in the deferred satisfaction algorithm, is consistently effective at reducing runtime, and caching results of these satisfaction problem instances has shown promise in the prototype system.
252	The results presented here also demonstrate that a bottom-up approach to expression resolution is superior to top-down, settling an open question from Baker~\cite{Baker82}.
253	The persistent union-find type environment introduced in Chapter~\ref{env-chap} has also been demonstrated to be a modest performance improvement on the na\"{\i}ve approach.
254
255	Given the consistently strong performance of the \textsc{bu-dca-imm} and \textsc{bu-dca-per} variants of the resolver prototype, the results in this chapter demonstrate that it is possible to develop a \CFA{} compiler with acceptable runtime performance for widespread use, an important and previously unaddressed consideration for the practical viability of the language.
256	However, the less-marked improvement in Section~\ref{cfa-results-sec} from retrofitting these algorithmic changes onto the existing compiler leave the actual development of a performant \CFA{} compiler to future work.
257	Characterization and elimination of the performance deficits in the existing \CFACC{} has proven difficult, though runtime is generally dominated by the expression resolution phase; as such, building a new \CFA{} compiler based on the resolver prototype contributed by this work may prove to be an effective strategy.

Note: See TracBrowser for help on using the repository browser.

Download in other formats: