Changeset d065ded


Ignore:
Timestamp:
Feb 22, 2019, 5:41:56 PM (3 years ago)
Author:
Aaron Moss <a3moss@…>
Branches:
aaron-thesis, arm-eh, cleanup-dtors, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr
Children:
8adcfee
Parents:
b3edf7f5
Message:

thesis: polish Ch.6 to first draft

Location:
doc
Files:
2 edited

Legend:

Unmodified
Added
Removed
  • doc/bibliography/pl.bib

    rb3edf7f5 rd065ded  
    23632363    year        = 1990,
    23642364    pages       = {315-323}
     2365}
     2366
     2367@misc{Dotty-github,
     2368    keywords = {dotty,scala},
     2369    contributer = {a3moss@uwaterloo.ca},
     2370    author = {Martin Odersky},
     2371    title = {Dotty},
     2372    howpublished = {\href{https://github.com/lampepfl/dotty}{https://\-github.com/\-lampepfl/\-dotty}},
     2373    note = {Acessed: 2019-02-22}
    23652374}
    23662375
  • doc/theses/aaron_moss_PhD/phd/experiments.tex

    rb3edf7f5 rd065ded  
    33
    44I have implemented a prototype system to test the practical effectiveness of the various algorithms described in Chapters~\ref{resolution-chap} and~\ref{env-chap}.
    5 This prototype system essentially just implements the expression resolution pass of \CFACC{}, with a simplified version of the \CFA{} type system and a parser to read in problem instances.
    6 The prototype system allows for quicker iteration on algorithms due to its simpler language model and lack of a requirement to generate runnable code, yet captures enough of the nuances of \CFA{} to have some predictive power for the runtime performance of algorithmic variants in \CFACC{} itself.
    7 
    8 There are three sources of problem instances for the resolver prototype.
    9 The first is small, hand-written tests designed to test the expressive power and correctness of the prototype.
    10 These tests are valuable for regression testing, but not time-consuming enough to be useful performance tests.
    11 The second sort of problem instances are procedurally generated according to a set of parameters (distributions of polymorphic versus monomorphic functions, number of function arguments, number of types, \etc{}); the procedural problem generator can be used to explore the behaviour of an algorithm with respect to certain sorts of problem instances by varying the input parameters.
    12 I have implemented a flagged \CFACC{} pass which outputs information which can be used to initialize the procedural generator's parameters to realistic values.
    13 The final sort of problem instances are derived from actual \CFA{} code.
    14 The prototype has a rich enough representation of \CFA{} that actual instances of expression resolution can be expressed with good fidelity, and I have implemented a compiler pass for \CFACC{} which can generate instances from \CFA{} code.
    15 Since at this juncture all development in \CFA{} is done by our research team, I have tested the prototype system on all \CFA{} code currently extant, primarily the standard library and compiler test suite.
     5This prototype system implements the expression resolution pass of the \CFA{} compiler, \CFACC{}, with a simplified version of the \CFA{} type system and a parser to read in problem instances.
     6The resolver prototype allows for quicker iteration on algorithms due to its simpler language model and lack of a requirement to generate runnable code, yet captures enough of the nuances of \CFA{} to have some predictive power for the runtime performance of algorithmic variants in \CFACC{} itself.
     7I have implemented an optional \CFACC{} pass which generates test inputs for the resolver prototype from \CFA{} translation units; since at this juncture all development in \CFA{} is done by our research team, I have tested the prototype system on all \CFA{} code currently extant, primarily the standard library and compiler test suite.
     8
     9% There are three sources of problem instances for the resolver prototype.
     10% The first is small, hand-written tests designed to test the expressive power and correctness of the prototype.
     11% These tests are valuable for regression testing, but not time-consuming enough to be useful performance tests.
     12% The second sort of problem instances are procedurally generated according to a set of parameters (distributions of polymorphic versus monomorphic functions, number of function arguments, number of types, \etc{}); the procedural problem generator can be used to explore the behaviour of an algorithm with respect to certain sorts of problem instances by varying the input parameters.
     13% I have implemented a flagged \CFACC{} pass which outputs information which can be used to initialize the procedural generator's parameters to realistic values.
     14% The final sort of problem instances are derived from actual \CFA{} code.
     15% The prototype has a rich enough representation of \CFA{} that actual instances of expression resolution can be expressed with good fidelity, and I have implemented a compiler pass for \CFACC{} which can generate instances from \CFA{} code.
     16% Since at this juncture all development in \CFA{} is done by our research team, I have tested the prototype system on all \CFA{} code currently extant, primarily the standard library and compiler test suite.
    1617
    1718\section{Resolver Prototype Features} \label{rp-features-sec}
     
    2021It supports both monomorphic and polymorphic functions, with type assertions for polymorphic functions.
    2122Traits are not explicitly represented, but \CFACC{} inlines traits before the resolver pass, so this is a faithful representation of the existing compiler problem.
    22 The prototype system supports variable declarations as well as function declarations, and has a lexical-scoping scheme which obeys \CFA{}-like overloading and overriding rules.
    23 
    24 The type system of the resolver prototype also captures the key aspects of the \CFA{} type system.
     23The prototype system supports variable declarations as well as function declarations, and has a lexical-scoping scheme and \CFA{}-like overloading rules.
     24
     25The type system of the resolver prototype also captures key aspects of the \CFA{} type system.
    2526\emph{Concrete types} represent the built-in arithmetic types of \CFA{}, along with the implicit conversions between them.
    2627Each concrete type is represented by an integer ID, and the conversion cost from $x$ to $y$ is $|y-x|$, a safe conversion if $y > x$, or an unsafe conversion if $y < x$.
    27 This is markedly simpler than the graph of conversion costs in \CFA{}, but captures the essentials of the design well.
     28This is markedly simpler than the graph of conversion costs in \CFA{} (Figure~\ref{safe-conv-graph-fig}), but captures the essentials of the design well.
    2829For simplicity, !zero_t! and !one_t!, the types of !0! and !1!, are represented by the type corresponding to !int!.
    2930\emph{Named types} are analogues to \CFA{} aggregates, such as structs and unions; aggregate fields are encoded as unary functions from the struct type to the field type, named based on the field name.
     
    4445Cast expressions are implemented in the output language of the resolver, but cannot be expressed in the input.
    4546The only implicit conversions supported are between the arithmetic-like concrete types, which captures most, but not all, of \CFA{}'s built-in implicit conversions\footnote{Notable absences include \lstinline{void*} to other pointer types, or \lstinline{0} to pointer types.}.
    46 Future work should include a way to express implicit (and possibly explicit) conversions in the input language, with an investigation of the most efficient way to handle implicit conversions, and potentially design for user-defined conversions.
     47Future work should include a way to express implicit (and possibly explicit) conversions in the input language, with an investigation of the most efficient way to handle implicit conversions, and potentially a design for user-defined conversions.
    4748
    4849\section{Resolver Prototype Design}
    4950
    50 As discussed above, the resolver prototype works over a simplified version of the \CFA{} type system, for speed of development.
    51 The build system for the resolver prototype uses a number of conditional compilation flags to switch between algorithm variants while retaining maximal shared code.
    52 A different executable name is also generated for each algorithmic variant so that distinct variants can be more easily tested against each other.
     51As discussed above, for speed of development the resolver prototype works over a simplified version of the \CFA{} type system.
     52The build system for the resolver prototype uses a number of conditional compilation flags to switch between algorithm variants while retaining maximally shared code.
     53A distinct executable name is also generated for each algorithmic variant so that distinct variants can be more easily tested against each other.
    5354
    5455The primary architectural difference between the resolver prototype and \CFACC{} is that the prototype system uses a simple mark-and-sweep garbage collector for memory management, while \CFACC{} takes a manual memory management approach.
    5556This decision was made for the purpose of faster development iteration, but has proved to be a significant performance benefit as well.
    56 \CFACC{} frequently needs to make deep clones of significant object graphs to ensure memory ownership (followed by eventual deletion of these clones), an unnecessarily time-consuming process.
     57\CFACC{} frequently needs to make deep clones of large object graphs to ensure memory ownership (followed by eventual deletion of these clones), an unnecessarily time-consuming process.
    5758The prototype, on the other hand, only needs to clone modified nodes, and can share identical subsets of the object graph.
    5859The key design decision enabling this is that all subnodes are held by !const! pointer, and thus cannot be mutated once they have been stored in a parent node.
    59 With minimal programming discipline, it can thus be ensured that any expression is either mutable or shared, but never both; the Dotty research compiler for Scala takes a similar architectural approach\cit{}. % this citation would be "personal correspondence"
     60With minimal programming discipline, it can thus be ensured that any expression is either mutable or shared, but never both; the Dotty research compiler for Scala takes a similar architectural approach\cite{Dotty-github}.
    6061The tree mutator abstraction is designed to take advantage of this, only creating new nodes if a node must actually be mutated.
    6162I attempted to port this garbage collector to \CFACC{}, but without success.
    62 The GC could be used for memory management with few changes to the codebase, but without a substantial re-write to enforce the same ``!const! children'' discipline \CFACC{} could not take advantage of the potential to share sub-objects; without sharing of sub-objects the GC variant of \CFACC{} must do all the same allocations and deletions and garbage-collector overhead degraded performance unacceptably (though it did fix some known memory leaks intoduced by failures of the existing manual memory management scheme).
    63 
    64 Another minor architectural difference between \CFACC{} and the prototype system is that \CFACC{} makes extensive use of the pointer-chasing !std::list!, !std::set!, and !std::map! data structures, while the prototype uses the array-based !std::vector! and the hash-based !unordered_! variants of !set! and !map! instead. \TODO{investigate performance difference by testing a resolver prototype variant with List etc. redefined}
     63The GC could be used for memory management with few changes to the code-base, but without a substantial re-write to enforce the same ``!const! children'' discipline \CFACC{} could not take advantage of the potential to share sub-objects; without sharing of sub-objects the GC variant of \CFACC{} must do all the same allocations and deletions and garbage-collector overhead degraded performance unacceptably (though it did fix some known memory leaks introduced by failures of the existing manual memory management scheme).
     64
     65Another minor architectural difference between \CFACC{} and the prototype system is that \CFACC{} makes extensive use of the pointer-chasing !std::list!, !std::set!, and !std::map! data structures, while the prototype uses the array-based !std::vector! and the hash-based !unordered_! variants of !set! and !map! instead.
     66Work is ongoing to port \CFACC{} to use these more efficient data structures.
     67% TODO see how Thierry gets on with this
    6568
    6669The final difference between \CFACC{} and the resolver prototype is that, as an experiment in language usability, the prototype performs resolution-based rather than unification-based assertion satisfaction, as discussed in Section~\ref{resn-conclusion-sec}.
    6770This enables coding patterns not available in \CFACC{}, \eg{} a more flexible approach to type assertion satisfaction and better handling of functions returning polymorphic type variables that do not exist in the parameter list.
    68 \TODO{test performance; shouldn't be too hard to change \texttt{resolveAssertions} to use unification}
     71The experimental results in Section~\ref{proto-exp-sec} indicate that this choice is not a barrier to a performant resolver.
     72% \TODO{test performance; shouldn't be too hard to change \texttt{resolveAssertions} to use unification}
    6973
    7074\section{Prototype Experiments} \label{proto-exp-sec}
     
    7882        \begin{description}
    7983                \item[Bottom-up] (\textsc{bu}) Baker-style bottom-up pass, searching for function candidates based on the available argument interpretations.
    80                 \item[Combined] (\textsc{co}) Bilson-style bottom-up pass, where argument interpretations are combined into a single combination interpretation.
    81                 \item[Top-down] (\textsc{td}) Cormack-style top-down pass, searching for argument interpretations based on function candidate parameter types. The \textsc{td-*} variants of the resolver prototype implement a caching system to avoid recomputation of the same argument interpretation with the same type.
     84                \item[Combined] (\textsc{co}) Bilson-style bottom-up pass, where argument interpretations are combined into a single interpretation for each set of options.
     85                \item[Top-down] (\textsc{td}) Cormack-style top-down pass, searching for argument interpretations based on function candidate parameter types. The \textsc{td-*} variants of the resolver prototype implement a caching system to avoid re-computation of the same argument interpretation with the same type.
    8286        \end{description}
    8387        \item[Assertion satisfaction] The algorithm for finding satisfying declarations for type assertions, as discussed in Section~\ref{assn-sat-sec}.
    8488        \begin{description}
    8589                \item[Immediate] (\textsc{imm}) All assertions are checked for satisfaction immediately upon generating a candidate interpretation. The techniques discussed in Section~\ref{assn-sat-sec} for environment combination and level-by-level consideration of recursive assertions are applied here.
    86                 \item[Deferred] (\textsc{def}) As in \textsc{-imm-}, but waits to check assertions until a top-level interpretation has been generated.
    87                 \item[Cached] (\textsc{dca}) As in \textsc{-def-}, but uses the caching optimization discussed in Section~\ref{assn-sat-sec}.
     90                \item[Deferred] (\textsc{def}) As in \textsc{imm}, but only checks minimal-cost top-level interpretations after all top-level interpretations have been generated.
     91                \item[Deferred Cached] (\textsc{dca}) As in \textsc{def}, but uses the caching optimization discussed in Section~\ref{assn-sat-sec}.
    8892        \end{description}
    8993        \item[Type Environment] The type environment data structure used, as discussed in Chapter~\ref{env-chap}.
     
    9599\end{description}
    96100
    97 To test the various algorithms, the resolver prototype was compiled with each of the 24 valid combinations of variables\footnote{Namely, all combinations except \textsc{td-*-per}.}, and then timed running each of the \CFA{}-derived test inputs.
     101To test the various algorithms, the resolver prototype was compiled using \texttt{g++} 6.5.0 with each of the 24 valid combinations of variables\footnote{Namely, all combinations except \textsc{td-*-per}.}, and then timed running each of the \CFA{}-derived test inputs.
    98102Terminal output was suppressed for all tests to avoid confounding factors in the timing results, and all tests were run three times in series, with the median result reported in all cases.
    99103The medians are representative data points; considering test cases that took at least 0.2~s to run, the average run was within 2\% of the reported median runtime, and no run diverged by more than 20\% of median runtime or 5.5~s.
     
    113117\end{figure}
    114118
    115 As can be seen from these results, traversal direction is clearly the dominant variable in memory usage, with the \textsc{bu-*} variants performing better than the \textsc{co-} variants, which in turn out-perform the \textsc{td-} variants.
    116 It can also be seen that the incremental inheritance (\textsc{inc}) type environment consistently under-performs the other two environment data structures tested.
    117 
    118 To provide a more holistic view of performance, I have considerered the results from the 56 test inputs which all algorithms are able to complete within the memory bound.
     119As can be seen from these results, traversal direction is clearly the dominant variable in memory usage, with the \textsc{bu-*} variants performing better than the \textsc{co-*} variants, which in turn out-perform the \textsc{td-*} variants.
     120It can also be seen that the incremental inheritance (\textsc{inc}) type environment consistently under-performs the other two environment data structures tested, as any efficiencies from the inheritance mechanism are apparently insufficient to pay for the added complexity of the data structure.
     121
     122To provide a more holistic view of performance, I have considered the results from the 56 test inputs which all algorithms are able to complete within the memory bound.
    119123Limiting consideration to these algorithms provides an apples-to-apples comparison between algorithms, as the excluded inputs are harder instances which take more time and memory for the algorithms which are able to solve them.
    120124Figures~\ref{avg-peak-mem-fig} and~\ref{avg-runtime-fig} show the mean peak memory and runtime, respectively, of each algorithm over the inputs in this data set.
     
    146150% \end{figure}
    147151
    148 It can be seen from these results that that top-down immediate-assertion-satisfaction (\textsc{td-imm-*}) are particularly un-performant, as they check a significant number of assertions without filtering to determine if the arguments can be made to fit.
    149 It is clear that the bottom-up (\textsc{bu-*}) traversal order is better than both top-down and the Bilson-style bottom-up-combined orders.
    150 
    151 With regard to assertion satisfaction, immediate (\textsc{*-imm-*}) satisfaction is an inferior solution, though there is little performance difference between deferred (\textsc{*-def-*}) and deferred-cached (\textsc{*-dca-*}) for instances which they can both complete.
     152It can be seen from these results that that the top-down, immediate assertion-satisfaction (\textsc{td-imm-*}) variants are particularly inefficient, as they check a significant number of assertions without filtering to determine if the arguments can be made to fit.
     153It is also clear that the bottom-up (\textsc{bu}) traversal order is better than both top-down (\textsc{td}) and the Bilson-style bottom-up-combined ((\textsc{co})) orders.
     154While the advantage of \textsc{bu} over \textsc{co} is clear, in that it performs less redundant work if a prefix of a combination fails, the advantage of \textsc{bu} over \textsc{td} provides an answer for an open question from Baker \cite{Baker82}.
     155I believe that bottom-up is superior because it must only handle each subexpression once to form a list of candidate interpretations, whereas the top-down approach may do similar work repeatedly to resolve a subexpression with a variety of different types, a shortcoming that cannot be fully addressed by the memoization scheme employed in the \textsc{td} algorithm.
     156
     157With regard to assertion satisfaction, immediate (\textsc{imm}) satisfaction is an inferior solution, though there is little performance difference between deferred (\textsc{def}) and deferred-cached (\textsc{dca}) for instances which they can both complete; particularly notable is that \textsc{dca} caching scheme does not have a noticeable impact on peak memory usage.
    152158Since the \textsc{dca} algorithm can solve some particularly hard instances which \textsc{def} cannot, it is the recommended approach.
    153159
     
    157163\section{Instance Difficulty} \label{instance-expr-sec}
    158164
    159 To characterize the difficulty of expression resolution problem instances, the test suites must be explored at a finer granuarity.
     165To characterize the difficulty of expression resolution problem instances, the test suites must be explored at a finer granularity.
    160166As discussed in Section~\ref{resn-analysis-sec}, a single top-level expression is the fundamental problem instance for resolution, yet the test inputs discussed above are composed of thousands of top-level expressions, like the actual source code they are derived from.
    161167To pull out the effects of these individual problems, I instrumented the resolver prototype to time resolution for each expression, and also report some relevant properties of the expression.
     
    173179
    174180Since the top centile of expression resolution instances requires approximately two-thirds of the resolver's time, optimizing the resolver for specific hard problem instances has proven to be an effective technique for reducing overall runtime.
    175 The data below indicates that number of assertions necessary to resolve has the greatest effect on runtime, as seen in
     181The data indicates that number of assertions necessary to resolve has the greatest effect on runtime, as seen in
    176182Figure~\ref{per-prob-assns-fig}.
    177 However, since the number of assertions required is only known once resolution is finished, the most-promising pre-resolution metric of difficulty is the nesting depth of the expression; as seen in Figure~\ref{per-prob-depth-fig}, expressions of depth $> 10$ in this dataset are uniformly difficult.
     183However, since the number of assertions required is only known once resolution is finished, the most-promising pre-resolution metric of difficulty is the nesting depth of the expression; as seen in Figure~\ref{per-prob-depth-fig}, expressions of depth $> 10$ in this data-set are uniformly difficult.
    178184Figure~\ref{per-prob-subs-fig} presents a similar pattern for number of subexpressions, though given that the expensive tail of problem instances occurs at approximately twice the depth values, it is reasonable to believe that the difficult expressions in question are deeply-nested invocations of binary functions rather than wider but shallowly-nested expressions.
    179185
     
    184190\centering
    185191\input{per-prob-assns}
    186 \caption[Top-level expression resolution time by number of assertions resolved.]{Top-level expression resolution time by number of assertions resolved. Note log scales on both axes.} \label{per-prob-assns-fig}
     192\caption[Top-level expression resolution time by number of assertions resolved.]{Top-level expression resolution time by number of assertions resolved. Source input file for each expression listed in legend; note log scales on both axes.} \label{per-prob-assns-fig}
    187193\end{figure}
    188194
     
    203209
    204210I have integrated most of the algorithmic techniques discussed in this chapter into \CFACC{}.
    205 This integration took place over a period of months while \CFACC{} was under active development on a number of other fronts, so it is not possible to completely isolate the effects of the algorithmic changes, but I have generated some data.
     211This integration took place over a period of months while \CFACC{} was under active development on a number of other fronts, so it is not possible to completely isolate the effects of the algorithmic changes, but I believe the algorithmic changes to be the most-significant effects on performance over the study period.
    206212To generate this data, representative commits from the \texttt{git} history of the project were checked out and compiled, then run on the same machine used for the resolver prototype experiments discussed in Section~\ref{proto-exp-sec}.
    207213To negate the effects of changes to the \CFA{} standard library on the timing results, 55 test files from the test suite of the oldest \CFA{} variant were compiled with the \texttt{-E} flag to inline their library dependencies, and these inlined files were used to test the remaining \CFACC{} versions.
     
    209215I performed two rounds of modification to \CFACC{}; the first round moved from Bilson's original combined-bottom-up algorithm to an un-combined bottom-up algorithm, denoted \textsc{cfa-co} and \textsc{cfa-bu}, respectively.
    210216A top-down algorithm was not attempted in \CFACC{} due to its poor performance in the prototype.
    211 The second round of modifications addressed assertion satisfaction, taking Bilson's original \textsc{cfa-imm} algorithm, and iteratively modifying it, first to use the deferred approach \textsc{cfa-def}, then caching those results in the \text{cfa-dca} algorithm.
    212 The new environment data structures discussed in Section~\ref{proto-exp-sec} have not been successfully merged into \CFACC{} due to their dependencies on the garbage-collection framework in the prototype; I spent several months modifiying \CFACC{} to use similar garbage collection, but due to \CFACC{} not being designed to use such memory management the performance of the modified compiler was non-viable.
     217The second round of modifications addressed assertion satisfaction, taking Bilson's original \textsc{cfa-imm} algorithm, and iteratively modifying it, first to use the deferred approach \textsc{cfa-def}, then caching those results in the \textsc{cfa-dca} algorithm.
     218The new environment data structures discussed in Section~\ref{proto-exp-sec} have not been successfully merged into \CFACC{} due to their dependencies on the garbage-collection framework in the prototype; I spent several months modifying \CFACC{} to use similar garbage collection, but due to \CFACC{} not being designed to use such memory management the performance of the modified compiler was non-viable.
    213219It is possible that the persistent union-find environment could be modified to use a reference-counted pointer internally without changing the entire memory-management framework of \CFACC{}, but such an attempt is left to future work.
    214220
    215 As can be seen in Figures~\ref{cfa-time-fig} and~\ref{cfa-mem-fig}, which show the time and peak memory results for these five versions of \CFACC{} on the same test suite, assertion resolution dominates total resolution cost, with the \textsc{cfa-def} and \textsc{cfa-dca} variants running consistently faster than the others on more expensive test cases.
    216 The results from \CFACC{} do not exactly mirror those from the prototype; I conjecture this is mostly due to the different memory-management schemes and sorts of data required to run type unification and assertion satisfaction calculations, as \CFACC{} performance has proven to be particularly sensitive to the amount of heap allocation performed.
    217 This data also shows a noticable regression in compiler performance in the eleven months between \textsc{cfa-bu} and \textsc{cfa-imm}; this regression is not due to expression resolution, as no integration work happened in this time, but I am unable to ascertain its actual cause.
     221As can be seen in Figures~\ref{cfa-time-fig} and~\ref{cfa-mem-fig}, which show the time and peak memory results for these five versions of \CFACC{}, assertion resolution dominates total resolution cost, with the \textsc{cfa-def} and \textsc{cfa-dca} variants running consistently faster than the others on more expensive test cases; no difference can be seen between these two algorithms' performance, but that result agrees with the prototype experiments in Section~\ref{proto-exp-sec}.
     222The results from \CFACC{} for \textsc{cfa-co} \textit{vs.}\ \textsc{cfa-bu} do not mirror those from the prototype; I conjecture this is mostly due to the different memory-management schemes and sorts of data required to run type unification and assertion satisfaction calculations, as \CFACC{} performance has proven to be particularly sensitive to the amount of heap allocation performed.
     223This data also shows a noticeable regression in compiler performance in the eleven months between \textsc{cfa-bu} and \textsc{cfa-imm}, which use the same resolution algorithms; this regression is not due to expression resolution, as no integration work happened in this time, but I am unable to ascertain its actual cause.
    218224It should also be noted with regard to the peak memory results in Figure~\ref{cfa-mem-fig} that the peak memory usage does not always occur during the resolution phase of the compiler.
    219225
Note: See TracChangeset for help on using the changeset viewer.