source: doc/theses/fangren_yu_COOP_F20/Report.tex @ 101cc3a

ADTarm-ehast-experimentalenumforall-pointer-decayjacob/cs343-translationnew-ast-unique-exprpthread-emulationqualifiedEnum
Last change on this file since 101cc3a was 101cc3a, checked in by Peter A. Buhr <pabuhr@…>, 4 years ago

add references to table of contents

  • Property mode set to 100644
File size: 37.5 KB
Line 
1\documentclass[twoside,12pt]{article}
2
3%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
4
5% Latex packages used in the document.
6\usepackage{fullpage,times,comment}
7\usepackage{epic,eepic}
8\usepackage{upquote}                                                                    % switch curled `'" to straight
9\usepackage{calc}
10\usepackage{varioref}                                                                   % extended references
11\usepackage[labelformat=simple,aboveskip=0pt,farskip=0pt]{subfig}
12\renewcommand{\thesubfigure}{\alph{subfigure})}
13\usepackage[flushmargin]{footmisc}                                              % support label/reference in footnote
14\usepackage{latexsym}                                   % \Box glyph
15\usepackage{mathptmx}                                   % better math font with "times"
16\usepackage[toc]{appendix}                                                              % article does not have appendix
17\usepackage[usenames]{color}
18\input{common}                                          % common CFA document macros
19\usepackage[dvips,plainpages=false,pdfpagelabels,pdfpagemode=UseNone,colorlinks=true,pagebackref=true,linkcolor=blue,citecolor=blue,urlcolor=blue,pagebackref=true,breaklinks=true]{hyperref}
20\usepackage{breakurl}
21\urlstyle{sf}
22
23
24% reduce spacing
25\setlist[itemize]{topsep=5pt,parsep=0pt}% global
26\setlist[enumerate]{topsep=5pt,parsep=0pt}% global
27
28\usepackage[pagewise]{lineno}
29\renewcommand{\linenumberfont}{\scriptsize\sffamily}
30
31% Default underscore is too low and wide. Cannot use lstlisting "literate" as replacing underscore
32% removes it as a variable-name character so keywords in variables are highlighted. MUST APPEAR
33% AFTER HYPERREF.
34\renewcommand{\textunderscore}{\leavevmode\makebox[1.2ex][c]{\rule{1ex}{0.075ex}}}
35\newcommand{\NOTE}{\textbf{NOTE}}
36\newcommand{\TODO}[1]{{\color{Purple}#1}}
37
38\setlength{\topmargin}{-0.45in}                                                 % move running title into header
39\setlength{\headsep}{0.25in}
40
41%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
42
43\CFAStyle                                                                                               % CFA code-style for all languages
44%\lstset{
45%language=C++,moredelim=**[is][\color{red}]{@}{@}               % make C++ the default language
46%}% lstset
47%\lstnewenvironment{C++}[1][]                            % use C++ style
48%{\lstset{language=C++,moredelim=**[is][\color{red}]{@}{@}}\lstset{#1}}{}
49
50%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
51
52\setcounter{secnumdepth}{3}                             % number subsubsections
53\setcounter{tocdepth}{3}                                % subsubsections in table of contents
54\makeindex
55
56%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
57
58\title{\LARGE
59Optimization of \CFA Compiler with Case Studies
60}% title
61
62\author{\LARGE
63Fangren Yu -- University of Waterloo
64}% author
65
66\date{
67\today
68}% date
69
70%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
71
72\begin{document}
73\pagestyle{headings}
74% changed after setting pagestyle
75\renewcommand{\sectionmark}[1]{\markboth{\thesection\quad #1}{\thesection\quad #1}}
76\renewcommand{\subsectionmark}[1]{\markboth{\thesubsection\quad #1}{\thesubsection\quad #1}}
77\pagenumbering{roman}
78\linenumbers                                            % comment out to turn off line numbering
79
80\maketitle
81\pdfbookmark[1]{Contents}{section}
82\tableofcontents
83
84\clearpage
85\thispagestyle{plain}
86\pagenumbering{arabic}
87
88\begin{abstract}
89\end{abstract}
90
91\section{Introduction}
92
93\section{Completed work}
94
95\subsection{Memory model with sharing}
96
97A major rework of the abstract syntax tree (AST) data structure in the compiler is completed as the first step of the project. The majority of work were documented in the reference manual of the compiler~\cite{cfa-cc}. To summarize:
98\begin{itemize}
99\item
100AST nodes (and therefore subtrees) can be shared without copying when reused.
101\item
102Modifications apply the functional programming principle, making copies for local changes without affecting the original data shared by other owners. In-place mutations are permitted as a special case when sharing does not happen. The logic is implemented by reference counting.
103\item
104Memory allocation and freeing are performed automatically using smart pointers.
105\end{itemize}
106The resolver algorithm designed for overload resolution naturally introduces a significant amount of reused intermediate representations, especially in the following two places:
107\begin{itemize}
108\item
109Function overload candidates are computed by combining the argument candidates bottom-up, with many of them being a common term. For example, if $n$ overloads of a function @f@ all take an integer for the first parameter but different types for the second (@f( int, int )@, @f( int, double )@, etc.) the first term is reused $n$ times for each of the generated candidate expressions. This effect is particularly bad for deep expression trees.
110\item
111In the unification algorithm and candidate elimination step, actual types are obtained by substituting the type parameters by their bindings. Let $n$ be the complexity (\ie number of nodes in representation) of the original type, $m$ be the complexity of bound type for parameters, and $k$ be the number of occurrences of type parameters in the original type. If everything needs to be deep-copied, the substitution step takes $O(n+mk)$ time and memory, while using shared nodes it is reduced to $O(n)$ time and $O(k)$ memory.
112\end{itemize}
113One of the worst examples for the old compiler is a long chain of I/O operations
114\begin{cfa}
115sout | 1 | 2 | 3 | 4 | ...
116\end{cfa}
117The pipe operator is overloaded by \CFA I/O library for every primitive type in C language, as well as I/O manipulators defined by the library. In total there are around 50 overloads for the output stream operation. On resolving the $n$-th pipe operator in the sequence, the first term, which is the result of sub-expression containing $n-1$ pipe operators, is reused to resolve every overload. Therefore at least $O(n^2)$ copies of expression nodes are made during resolution, not even counting type unification cost; combined with two large factors from number of overloads of pipe operators, and that the ``output stream type'' in \CFA is a trait with 27 assertions (which adds to complexity of the pipe operator's type) this makes compiling a long output sequence extremely slow. In new AST representation only $O(n)$ copies are required and type of pipe operator is not copied at all.
118
119Reduction in space complexity is especially important, as preliminary profiling result on the old compiler build shows that over half of time spent in expression resolution are on memory allocations.
120 
121
122\subsection{Merged resolver calls}
123
124The pre-resolve phase of compilation, inadequately called ``validate'' in the compiler source code, does more than just simple syntax validation, as it also normalizes input program. Some of them, however, requires type information on expressions and therefore needs to call the resolver before the general resolve phase. There are three notable places where the resolver is invoked:
125\begin{itemize}
126\item
127Attempt to generate default constructor, copy constructor and destructor for user-defined @struct@ types
128\item
129Resolve @with@ statements (the same as in Python, which introduces fields of a structure directly in scope)
130\item
131Resolve @typeof@ expressions (cf. @decltype@ in \CC); note that this step may depend on symbols introduced by @with@ statements.
132\end{itemize}
133Since the compiler codebase is large and the new memory model mostly only benefits expression resolution, the old data structure is still kept, and a conversion pass happens before and after resolve phase. Rewriting every compiler module will take a long time, and whether the new model is correct is still unknown when started, therefore only the resolver is implemented with the new data structure.
134
135Since the constructor calls were one of the most expensive to resolve (reason will be shown in the next section), pre-resolve phase were taking more time after resolver moves to the more efficient new implementation. To better facilitate the new resolver, every step that requires type information are reintegrated as part of resolver.
136
137A by-product of this work is that the reversed dependence of @with@ statement and @typeof@ can now be handled. Previously, the compiler is unable to handle cases such as
138\begin{cfa}
139struct S { int x; };
140S foo();
141typeof( foo() ) s; // type is S
142with (s) { 
143        x; // refers to s.x
144}
145\end{cfa}
146since type of @s@ is still unresolved when handling @with@ expressions. Instead, the new (and correct) approach is to evaluate @typeof@ expressions when the declaration is first seen, and it suffices because of the declaration-before-use rule.
147
148
149\subsection{Special function lookup}
150
151Reducing the number of functions looked up for overload resolution is an effective way to gain performance when there are many overloads but most of them are trivially wrong. In practice, most functions have few (if any) overloads but there are notable exceptions. Most importantly, constructor @?{}@, destructor @^?{}@, and assignment @?=?@ are generated for every user-defined type, and in a large source file there can be hundreds of them. Furthermore, many calls to them are generated for initializing variables and passing arguments. This fact makes them the most overloaded and most called functions.
152
153In an object-oriented programming language, object has methods declared with their types, so a call such as @obj.f()@ only needs to perform lookup in the method table corresponding to type of @obj@. \CFA on the other hand, does not have methods, and all types are open (\ie new operations can be defined on them), so a similar approach will not work in general. However, the ``big 3'' operators have a unique property enforced by the language rules, such that the first parameter must have a reference type. Since \CFA does not have class inheritance, reference type must always match exactly. Therefore, argument-dependent lookup can be implemented for these operators, by using a dedicated symbol table.
154
155The lookup key used for the special functions is the mangled type name of the first parameter, which acts as the @this@ parameter in an object-oriented language. To handle generic types, the type parameters are stripped off, and only the base type is matched. Note that a constructor (destructor, assignment operator) taking arbitrary @this@ argument, for example @forall( dtype T ) void ?{}( T & );@ is not allowed, and it guarantees that if the @this@ type is known, all possible overloads can be found by searching with the given type. In case that the @this@ argument itself is overloaded, it is resolved first and all possible result types are used for lookup.
156
157Note that for the generated expressions, the particular variable for @this@ argument is fully known, without overloads, so the majority of constructor call resolutions only need to check for one given object type. Explicit constructor calls and assignment statements sometimes may require lookup for multiple types. In the extremely rare case that type of @this@ argument is yet unbound, everything will have to be checked, just like without the argument-dependent lookup algorithm; fortunately, this case almost never happens in practice. An example is found in the library function @new@:
158\begin{cfa}
159forall( dtype T | sized( T ), ttype TT | { void ?{}( T &, TT ); } )
160T * new( TT p ) { return &(*malloc()){ p }; }
161\end{cfa}
162as @malloc@ may return a pointer to any type, depending on context.
163
164Interestingly, this particular line of code actually caused another complicated issue, where the unusually massive work of checking every constructor in presence makes the case even worse. Section~\ref{s:TtypeResolutionInfiniteRecursion} presents a detailed analysis for the problem.
165
166The ``callable'' operator @?()@ (cf. @operator()@ in \CC) could also be included in the special operator list, as it is usually only on user-defined types, and the restriction that first argument must be a reference seems reasonable in this case.
167
168
169\subsection{Improvement of function type representation}
170
171Since substituting type parameters with their bound types is one fundamental operation in many parts of resolver algorithm (particularly unification and environment binding), making as few copies of type nodes as possible helps reducing memory complexity. Even with the new memory management model, allocation is still a significant factor of resolver performance. Conceptually, operations on type nodes of AST should be performed in functional programming style, treating the data structure as immutable and only copy when necessary. The in-place mutation is a mere optimization that does not change logic of operations.
172The model was broken on function types by an inappropriate design. Function types require some special treatment due to the existence of assertions. In particular, it must be able to distinguish two different kinds of type parameter usage:
173\begin{cfa}
174forall( dtype T ) void foo( T * t ) {
175        forall( dtype U ) void bar( T * t, U * u ) { ... }
176}
177\end{cfa}
178Here, only @U@ is a free parameter in declaration of @bar@, as it appears in the function's own forall clause; while @T@ is not free.
179
180Moreover, the resolution algorithm also has to distinguish type bindings of multiple calls to the same function, for example with
181\begin{cfa}
182forall( dtype T ) int foo( T x );
183foo( foo( 1.0 ) );
184\end{cfa}
185The inner call has binding (T: double) while the outer call has binding (T: int). Therefore a unique representation of free parameters in each expression is required. This was previously done by creating a copy of the parameter declarations inside function type, and fixing references afterwards. However, fixing references is an inherently deep operation that does not work well with functional programming model, as it must be evaluated eagerly on the entire syntax tree representing the function type.
186
187The revised approach generates a unique ID value for each function call expression instance and represents an occurrence of free parameter type with a pair of generated ID and the original parameter declaration, so that references do not need to be fixed, and a shallow copy of function type is possible.
188
189Note that after the change, all declaration nodes in syntax tree representation maps one-to-one with the actual declarations in the program, and therefore are guaranteed to be unique. Such property can potentially enable more optimizations, and some related ideas are presented after Section~\ref{s:SharedSub-ExpressionCaseUniqueExpressions}.
190
191
192\subsection{Improvement of pruning steps}
193
194A minor improvement for candidate elimination is to skip the step on the function overloads themselves and only perform on results of function application. As function calls are usually by name, the name resolution rule dictates that every function candidate necessarily has a different type; indirect function calls are rare, and when they do appear, they usually will not have many possible interpretations, and those rarely matches exactly in argument type. Since function types have a much more complex representation than data types (with multiple parameters and assertions), checking equality on them also takes longer.
195
196A brief test of this approach shows that the number of function overloads considered in expression resolution increases by a negligible amount of less than 1 percent, while type comparisons in candidate elimination are cut by more than half. Improvement is consistent over all \CFA source files in the test suite.
197
198
199\subsection{Shared sub-expression case: Unique expressions}
200\label{s:SharedSub-ExpressionCaseUniqueExpressions}
201
202Unique expression denotes an expression that must be evaluated only once, to prevent unwanted side effects. It is currently only a compiler artifact, generated on tuple member expression of the form
203\begin{cfa}
204struct S { int a; int b; };
205S s;
206s.[a, b]; // tuple member expression, type is [int, int]
207\end{cfa}
208If the aggregate expression contains function calls, it cannot be evaluated multiple times:
209\begin{cfa}
210S makeS();
211makeS().[a, b]; // this should only make one S
212\end{cfa}
213Before code generation, the above expression is internally represented as
214\begin{cfa}
215[uniqueExpr{makeS()}.a, uniqueExpr{makeS()}.b];
216\end{cfa}
217and the unique expression is expanded to$\footnote{This uses gcc's statement expression syntax, where \lstinline|(\{block\})| evaluates to value of last expression in the block.}$
218\begin{cfa}
219({
220        if ( !_unique_var_evaluated ) {
221                _unique_var_evaluated = true;
222                _unique_var = makeS();
223        }
224        _unique_var;
225})
226\end{cfa}
227at code generation, where @_unique_var@ and @_unique_var_evaluated@ are generated variables whose scope covers all appearances of the same expression.
228
229Note that although the unique expression is only used for tuple expansion now, it is a generally useful construction, and can be seen in other languages, such as Scala's @lazy val@~\cite{Scala}; therefore it could be worthwhile to introduce the unique expression to a broader context in \CFA and even make it directly available to programmers.
230
231In the compiler's visitor pattern, however, this creates a problem where multiple paths to a logically unique expression exist, so it may be modified more than once and become ill-formed; some specific intervention is required to ensure that unique expressions are only visited once. Furthermore, a unique expression appearing in more than one places will be copied on mutation so its representation is no longer unique. Some hacks are required to keep it in sync, and the methods are different when mutating the unique expression instance itself or its underlying expression.
232
233Example when mutating the underlying expression (visit-once guard)
234\begin{cfa}
235void InsertImplicitCalls::previsit( const ast::UniqueExpr * unqExpr ) {
236        if ( visitedIds.count( unqExpr->id ) ) visit_children = false;
237        else visitedIds.insert( unqExpr->id );
238}
239\end{cfa}
240Example when mutating the unique instance itself, which actually creates copies
241\begin{cfa}
242auto mutExpr = mutate( unqExpr ); // internally calls copy when shared
243if ( ! unqMap.count( unqExpr->id ) ) {
244        ...
245} else {
246        mutExpr->expr = unqMap[mutExpr->id]->expr;
247        mutExpr->result = mutExpr->expr->result;
248}
249\end{cfa}
250Such workaround seems difficult to be fit into a common visitor template. This suggests the memory model may need different kinds of nodes to accurately represent the syntax tree.
251
252Together with the fact that declaration nodes are always unique, it is possible that AST nodes can be classified by three different types:
253\begin{itemize}
254\item
255\textbf{Strictly unique} with only one owner (declarations);
256\item
257\textbf{Logically unique} with (possibly) many owners but should not be copied (unique expression example presented here);
258\item
259\textbf{Shared} by functional programming model, which assume immutable data structure and are copied on mutation.
260\end{itemize}
261The boilerplate code can potentially handle these three cases differently.
262
263
264\section{Analysis of resolver algorithm complexity}
265
266The focus of this chapter is to identify and analyze some realistic cases that cause resolver algorithm to have an exponential run time. As previous work has shown [3], the overload resolution problem in \CFA has worst-case exponential complexity; however, only few specific patterns can trigger the exponential complexity in practice. Implementing heuristic-based optimization for those selected cases is helpful to alleviate the problem.
267
268
269\subsection{Unbound return type}
270\label{s:UnboundReturnType}
271
272The interaction of return type overloading and polymorphic functions creates this problem of function calls with unbound return type, and is further complicated by the presence of assertions.
273The prime example of a function with unbound return type is the type-safe version of C @malloc@:
274\begin{cfa}
275// size deduced from type, so no need to provide the size argument
276forall( dtype T | sized( T ) ) T * malloc( void );
277\end{cfa}
278Unbound return type can be problematic in resolver algorithm complexity because a single match of function call with unbound return type may create multiple candidates. In the worst case, consider a function declared to return any @otype@:
279\begin{cfa}
280forall( otype T ) T anyObj( void );
281\end{cfa}
282As the resolver attempts to satisfy the otype constraint on @T@, a single call to @anyObj()@ without the result type known creates at least as many candidates as the number of complete types currently in scope; with generic types it becomes even worse, for example, assuming a declaration of generic pair is available at that point:
283\begin{cfa}
284forall( otype T, otype U ) struct pair { T first; U second; };
285\end{cfa}
286Then an @anyObj()@ call can result in arbitrarily complex types, such as @pair( pair( int,int ), pair( int,int ) )@, and the depth can grow indefinitely until the specified parameter depth limit, thus creating exponentially many candidates. However, the expected types allowed by parent expressions are practically very few, so most of those interpretations are invalid; if the result type is never bound up to top level, by the semantic rules it is ambiguous if there are more than one valid bindings, and resolution can fail fast. It is therefore reasonable to delay resolving assertions on an unbound parameter in return type; however, with the current cost model, such behavior may further cause irregularities in candidate selection, such that the presence of assertions can change the preferred candidate, even when order of expression costs are supposed to stay the same. Detailed analysis of this issue will be presented later, in the correctness part.
287
288
289\subsection[ttype resolution infinite recursion]{\lstinline|ttype| resolution infinite recursion}
290\label{s:TtypeResolutionInfiniteRecursion}
291
292@ttype@ (``tuple type'') is a relatively new addition to the language that attempts to provide type-safe variadic argument semantics. Unlike regular @dtype@ parameters, @ttype@ is only valid in function parameter list, and may only appear once as the type of last parameter. At the call site, a @ttype@ parameter is bound to the tuple type of all remaining function call arguments.
293
294There are two kinds of idiomatic @ttype@ usage: one is to provide flexible argument forwarding, similar to the variadic template in \CC (\lstinline[language=C++]|template<typename... args>|), as shown below in the implementation of @unique_ptr@
295\begin{cfa}
296forall( dtype T )
297struct unique_ptr {
298        T * data;
299};
300forall( dtype T | sized( T ), ttype Args | { void ?{}( T &, Args ); })
301void ?{}( unique_ptr( T ) & this, Args args ) {
302        this.data = new( args );
303}
304\end{cfa}
305the other is to implement structural recursion in the first-rest manner:
306\begin{cfa}
307forall( otype T, ttype Params | { void process( T ); void func( Params ); })
308void func( T arg1, Params p ) {
309        process( arg1 );
310        func( p );
311}
312\end{cfa}
313For the second use case, it is important that the number of parameters in the recursive call go down, since the call site must deduce all assertion candidates, and that is only possible if by just looking at argument types (and not their values), the recursion is known to be completed in a finite number of steps.
314
315In recent experiments, however, some flaw in the type binding rules can lead to the first kind of @ttype@ use case produce an invalid candidate that the resolver enters an infinite loop.
316
317This bug was discovered in an attempt to raise assertion recursive depth limit and one of the library program takes exponentially longer time to compile. The cause of the problem is identified to be the following set of functions.
318File @memory.cfa@ contains
319\begin{cfa}
320#include "memory.hfa"
321#include "stdlib.hfa"
322\end{cfa}
323where file @memory.hfa@ contains the @unique_ptr@ declaration above, and two other similar functions with @ttype@ parameter:
324\begin{cfa}
325forall( dtype T | sized( T ), ttype Args | { void ?{}( T &, Args ); }) {
326        void ?{}( counter_data( T ) & this, Args args );
327        void ?{}( counter_ptr( T ) & this, Args args );
328        void ?{}( unique_ptr( T ) & this, Args args );
329}
330\end{cfa}
331File @stdlib.hfa@ contains
332\begin{cfa}
333forall( dtype T | sized( T ), ttype TT | { void ?{}( T &, TT ); } )
334T * new( TT p ) { return &(*malloc()){ p }; }
335\end{cfa}
336
337In the expression @(*malloc()){p}@, the type of object being constructed is yet unknown, since the return type information is not immediately provided. That caused every constructor to be searched, and while normally a bound @ttype@ cannot be unified with any free parameter, it is possible with another free @ttype@. Therefore in addition to the correct option provided by assertion, 3 wrong options are examined, each of which again requires the same assertion, for an unknown base type T and @ttype@ arguments, and that becomes an infinite loop, until the specified recursion limit and resolution is forced to fail. Moreover, during the recursion steps, number of candidates grows exponentially, since there are always 3 options at each step.
338
339Unfortunately, @ttype@ to @ttype@ binding is necessary, to allow calling the function provided by assertion indirectly.
340\begin{cfa}
341forall( dtype T | sized( T ), ttype Args | { void ?{}( T &, Args ); })
342void ?{}( unique_ptr( T ) & this, Args args ) { this.data = (T * )new( args ); }
343\end{cfa}
344Here the constructor assertion is used for the @new( args )@ call.
345Therefore, it is hard, perhaps impossible, to solve this problem by tweaking the type binding rules. An assertion caching algorithm can help improve this case by detecting cycles in recursion.
346
347Meanwhile, without the caching algorithm implemented, some changes in the \CFA source code are enough to eliminate this problem, at least in the current codebase. Note that the issue only happens with an overloaded variadic function, which rarely appears in practice, since the idiomatic use cases are for argument forwarding and self-recursion. The only overloaded @ttype@ function so far discovered in all of \CFA standard library code is the constructor, and by utilizing the argument-dependent lookup process described in Section~\ref{s:UnboundReturnType}, adding a cast before constructor call gets rid of the issue.
348\begin{cfa}
349T * new( TT p ) { return &(*(T * )malloc()){ p }; }
350\end{cfa}
351
352
353\subsection{Reused assertions in nested generic type}
354
355The following test of deeply nested dynamic generic type reveals that locally caching reused assertions is necessary, rather than just a resolver optimization, because recomputing assertions can result in bloated generated code size:
356\begin{cfa}
357struct nil {};
358forall( otype L, otype R )
359struct cons { L l; R r; };
360
361int main() {
362        #if   N==0
363        nil x;   
364        #elif N==1
365        cons( size_t, nil ) x;
366        #elif N==2
367        cons( size_t, cons( size_t, nil ) ) x;
368        #elif N==3
369        cons( size_t, cons( size_t, cons( size_t, nil ) ) ) x;
370        // similarly for N=4,5,6
371        #endif
372}
373\end{cfa}
374At the declaration of @x@, it is implicitly initialized by generated constructor call, whose signature is given by
375\begin{cfa}
376forall( otype L, otype R ) void ?{}( cons( L, R ) & );
377\end{cfa}
378Note that the @otype@ constraint contains 4 assertions:
379\begin{cfa}
380void ?{}( L & ); // default constructor
381void ?{}( L &, L & ); // copy constructor
382void ^?{}( L & ); // destructor
383L & ?=?( L &, L & ); // assignment
384\end{cfa}
385Now since the right hand side of outermost cons is again a cons, recursive assertions are required. When the compiler cannot cache and reuse already resolved assertions, it becomes a problem, as each of those 4 pending assertions again asks for 4 more assertions one level below. Without any caching, number of resolved assertions grows exponentially, while that is obviously unnecessary since there are only $n+1$ different types involved. Even worse, this causes exponentially many wrapper functions generated later at the codegen step, and results in huge compiled binary.
386
387\begin{table}[h]
388\caption{Compilation results of nested cons test}
389\begin{tabular}{|r|r|r|}
390\hline
391N       & Assertions resolved   & Binary size (KB) \\
392\hline
3933       & 170           & 70    \\
394\hline
3954       & 682           & 219   \\
396\hline
3975       & 2,730         & 829   \\
398\hline
3996       & 10,922        & 3,261 \\
400\hline
401\end{tabular}
402\end{table}
403
404As the local functions are implemented by emitting executable code on the stack~\cite{gcc-nested-func}, it eventually means that compiled code also has exponential run time. This problem has evident practical implications, as nested collection types are frequently used in real production code.
405
406
407\section{Analysis of type system correctness}
408
409In Moss' thesis~\cite[\S~4.1.2,~p.~45]{Moss19}, the author presents the following example:
410\begin{quote}
411The cost model of \CFA precludes a greedy bottom-up resolution pass, as constraints and costs introduced by calls higher in the expression tree can change the interpretation of those lower in the tree, as in the following example:
412\begin{cfa}
413void f( int );
414double g$_1$( int );
415int g$_2$( long );
416f( g( 42 ) );
417\end{cfa}
418Considered independently, @g$_1$( 42 )@ is the cheapest interpretation of @g( 42 )@, with cost (0, 0, 0, 0, 0, 0, 0) since the argument type is an exact match. However, in context, an unsafe conversion is required to downcast the return type of @g1@ to an @int@ suitable for @f@, for a total cost of (1, 0, 0, 0, 0, 0, 0) for @f( g1( 42 ) )@. If @g2@ is chosen, on the other hand, there is a safe upcast from the int type of 42 to @long@, but no cast on the return of @g2@, for a total cost of (0, 0, 1, 0, 0, 0, 0) for @f( g2( 42 ) );@ as this is cheaper, @g2@ is chosen. Due to this design, all valid interpretations of subexpressions must in general be propagated to the top ...
419\end{quote}
420The cost model in the thesis sums up all costs in resolving sub-expressions and chooses the global optimal option. However, the current implementation in @cfa-cc@, based on the original Bilson model~\cite[\S~2.2.4,~p.~32]{Bilson03}, does eager resolution and only allows local optimal candidate selections to propagate upwards:
421\begin{quote}
422From the set of candidates whose parameter and argument types have been unified and whose assertions have been satisfied, those whose sub-expression interpretations have the smallest total cost of conversion are selected ... The total cost of conversion for each of these candidates is then calculated based on the implicit conversions and polymorphism involved in adapting the types of the sub-expression interpretations to the formal parameter types.
423\end{quote}
424With this model, the algorithm picks @g1@ in resolving the @f( g( 42 ) )@ call, which seems to be undesirable.
425
426There are further evidence that shows the Bilson model is fundamentally incorrect, following the discussion of unbound return type in Section~\ref{s:UnboundReturnType}. By the conversion cost specification, a binding from a polymorphic type parameter to a concrete type incurs a polymorphic cost of 1. It remains unspecified \emph{when} the type parameters should become bound. When the parameterized types appear in the function parameters, they can be deduced from the argument type, and there is no ambiguity. In the unbound return case, however, the binding may happen at any stage in expression resolution, therefore it is impossible to define a unique local conversion cost. Note that type binding happens exactly once per parameter in resolving the entire expression, so the global binding cost is unambiguously 1.
427
428As per the current compiler implementation, it does have a notable inconsistency in handling such case. For any unbound parameter that does \emph{not} come with an associated assertion, it remains unbound to the parent expression; for those that does however, they are immediately bound in the assertion resolution step, and concrete result types are used in the parent expressions.
429
430Consider the following example:
431\begin{cfa}
432forall( dtype T ) T * f( void );
433void h( int * );
434\end{cfa}
435The expression @h( f() )@ eventually has a total cost of 1 from binding (T: int), but in the eager resolution model, the cost of 1 may occur either at call to @f@ or at call to @h@, and with the assertion resolution triggering a binding, the local cost of @f()@ is (0 poly, 0 spec) with no assertions, but (1 poly, -1 spec) with an assertion:
436\begin{cfa}
437forall( dtype T | { void g( T * ); } ) T * f( void );
438void g( int * );
439void h( int * );
440\end{cfa}
441and that contradicts the principle that adding assertions should make expression cost lower. Furthermore, the time at which type binding and assertion resolution happens is an implementation detail of the compiler, but not a part of language definition. That means two compliant \CFA compilers, one performing immediate assertion resolution at each step, and one delaying assertion resolution on unbound types, can produce different expression costs and therefore different candidate selection, making the language rule itself partially undefined and therefore unsound. By the above reasoning, the updated cost model using global sum of costs should be accepted as the standard. It also allows the compiler to freely choose when to resolve assertions, as the sum of total costs is independent of that choice; more optimizations regarding assertion resolution can also be implemented.
442
443
444\section{Timing results}
445
446For the timing results presented here, the \CFA compiler is built with gcc 9.3.0, and tested on a server machine running Ubuntu 20.04, 64GB RAM and 32-core 2.2 GHz CPU, results reported by the time command, and using only 8 cores in parallel such that the time is close to the case with 100% CPU utilization on a single thread.
447
448On the most recent build, the \CFA standard library (~1.3 MB of source code) compiles in 4 minutes 47 seconds total processor time (single thread equivalent), with the slowest file taking 13 seconds. The test suite (178 test cases, ~2.2MB of source code) completes within 25 minutes total processor time,\footnote{Including a few runtime tests; total time spent in compilation is approximately 21 minutes.} with the slowest file taking 23 seconds. In contrast, the library build on old compiler takes 85 minutes total, 5 minutes for the slowest file. Full test suite takes too long with old compiler build and is therefore not run, but the slowest test cases take approximately 5 minutes. Overall, the most recent build compared to old build in April 2020, before the project started, is consistently faster by a factor of 20.
449
450Additionally, 6 selected \CFA source files with distinct features from library and test suite are used to test compiler performance after each of the optimizations are implemented. Test files are from the most recent build and run through C preprocessor to eliminate the factor of header file changes. The selected tests are:
451\begin{itemize}
452\item
453@lib/fstream@ (112 KB)\footnote{File sizes are after preprocessing, with no line information (\lstinline|gcc -E -P|).}: implementation of I/O library
454\item
455@lib/mutex@ (166 KB): implementation of concurrency primitive
456\item
457@lib/vector@ (43 KB): container example, similar to \CC vector
458\item
459@lib/stdlib@ (64 KB): type-safe wrapper to @void *@-based C standard library functions
460\item
461@test/ISO2@ (55 KB): application of I/O library
462\item
463@test/thread@ (188 KB): application of threading library
464\end{itemize}
465
466The \CFA compiler builds are picked from git commit history that passed the test suite, and implement the optimizations incrementally:
467\begin{itemize}
468\item
469\#0 is the first working build of new AST data structure
470\item
471\#1 implements special symbol table and argument-dependent lookup
472\item
473\#2 implements late assertion satisfaction
474\item
475\#3 implements revised function type representation
476\item
477\#4 skips pruning on expressions with function type (most recent build)
478\end{itemize}
479The old resolver with no memory sharing and none of the optimizations above is also tested.
480\begin{table}
481\caption{Compile time of selected files by compiler build, in seconds}
482\begin{tabular}{|l|r|r|r|r|r|r|}
483\hline
484Test case               & old   & \#0   & \#1   & \#2   & \#3   & \#4   \\
485\hline
486@lib/fstream@   & 86.4  & 25.2  & 10.8  & 9.5   & 7.8   & 7.1   \\
487\hline
488@lib/mutex@             & 210.4 & 77.4  & 16.7  & 15.1  & 12.6  & 11.7  \\
489\hline
490@lib/vector@    & 17.2  & 8.9   & 3.1   & 2.8   & 2.4   & 2.2   \\
491\hline
492@lib/stdlib@    & 16.6  & 8.3   & 3.2   & 2.9   & 2.6   & 2.4   \\
493\hline
494@test/io2@              & 300.8 & 53.6  & 43.2  & 27.9  & 19.1  & 16.3  \\
495\hline
496@test/thread@   & 210.9 & 73.5  & 17.0  & 15.1  & 12.6  & 11.8  \\
497\hline
498\end{tabular}
499\smallskip
500\newline
501Results are average of 5 runs (3 runs if time exceeds 100 seconds)
502\end{table}
503
504
505\section{Conclusion}
506
507Over the course of 8 months of active research and development in \CFA type system and compiler algorithm, performance of the reference \CFA compiler, cfa-cc, has been greatly improved, allowing mid-sized \CFA programs to be compiled and built reasonably fast. As there are also ongoing efforts in the team on building a standard library, evaluating the runtime performance, and attempting to incorporate \CFA with existing software written in C, this project is especially meaningful for practical purposes.
508
509Analysis conducted in the project were based significantly on heuristics and practical evidence, as the theoretical bounds and average cases for the expression resolution problem differ. This approach was difficult at start to follow, with an unacceptably slow compiler, since running the program through debugger and validation tools (\eg @gdb@, @valgrind@) adds another order of magnitude to run time, which was already in minutes. However, near the end of the project, many significant improvements have already been made and new optimizations can be tested immediately. The positive feedback in development cycle benefits the \CFA team as a whole, more than just for the compiler optimizations.
510
511Some potential issues of the language that may happen frequently in practice have been identified. Due to the time constraint and complex nature of these problems, a handful of them remain unsolved, but some constructive proposals are made. Notably, introducing a local assertion cache in the resolver is a common solution for a few remaining problems, so that should be the focus of work soon.
512
513The \CFA team are planning on a public alpha release of the language as the compiler performance becomes promising, and other parts of the system, such as a standard library, are also being enhanced. Ideally, the remaining problems should be resolved before release, and the solutions will also be integral to drafting a formal specification.
514
515\addcontentsline{toc}{section}{\refname}
516\bibliographystyle{plain}
517\bibliography{pl}
518
519\end{document}
520
521% Local Variables: %
522% tab-width: 4 %
523% fill-column: 100 %
524% compile-command: "make" %
525% End: %
Note: See TracBrowser for help on using the repository browser.