1 | \documentclass[twoside,12pt]{article} |
---|
2 | |
---|
3 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
4 | |
---|
5 | % Latex packages used in the document. |
---|
6 | \usepackage{fullpage,times,comment} |
---|
7 | \usepackage{epic,eepic} |
---|
8 | \usepackage{upquote} % switch curled `'" to straight |
---|
9 | \usepackage{calc} |
---|
10 | \usepackage{varioref} % extended references |
---|
11 | \usepackage[labelformat=simple,aboveskip=0pt,farskip=0pt]{subfig} |
---|
12 | \renewcommand{\thesubfigure}{\alph{subfigure})} |
---|
13 | \usepackage[flushmargin]{footmisc} % support label/reference in footnote |
---|
14 | \usepackage{latexsym} % \Box glyph |
---|
15 | \usepackage{mathptmx} % better math font with "times" |
---|
16 | \usepackage[toc]{appendix} % article does not have appendix |
---|
17 | \usepackage[usenames]{color} |
---|
18 | \input{common} % common CFA document macros |
---|
19 | \usepackage[dvips,plainpages=false,pdfpagelabels,pdfpagemode=UseNone,colorlinks=true,pagebackref=true,linkcolor=blue,citecolor=blue,urlcolor=blue,pagebackref=true,breaklinks=true]{hyperref} |
---|
20 | \usepackage{breakurl} |
---|
21 | \urlstyle{sf} |
---|
22 | |
---|
23 | |
---|
24 | % reduce spacing |
---|
25 | \setlist[itemize]{topsep=5pt,parsep=0pt}% global |
---|
26 | \setlist[enumerate]{topsep=5pt,parsep=0pt}% global |
---|
27 | |
---|
28 | \usepackage[pagewise]{lineno} |
---|
29 | \renewcommand{\linenumberfont}{\scriptsize\sffamily} |
---|
30 | |
---|
31 | % Default underscore is too low and wide. Cannot use lstlisting "literate" as replacing underscore |
---|
32 | % removes it as a variable-name character so keywords in variables are highlighted. MUST APPEAR |
---|
33 | % AFTER HYPERREF. |
---|
34 | \renewcommand{\textunderscore}{\leavevmode\makebox[1.2ex][c]{\rule{1ex}{0.075ex}}} |
---|
35 | \newcommand{\NOTE}{\textbf{NOTE}} |
---|
36 | \newcommand{\TODO}[1]{{\color{Purple}#1}} |
---|
37 | |
---|
38 | \setlength{\topmargin}{-0.45in} % move running title into header |
---|
39 | \setlength{\headsep}{0.25in} |
---|
40 | |
---|
41 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
42 | |
---|
43 | \CFAStyle % CFA code-style for all languages |
---|
44 | %\lstset{ |
---|
45 | %language=C++,moredelim=**[is][\color{red}]{@}{@} % make C++ the default language |
---|
46 | %}% lstset |
---|
47 | %\lstnewenvironment{C++}[1][] % use C++ style |
---|
48 | %{\lstset{language=C++,moredelim=**[is][\color{red}]{@}{@}}\lstset{#1}}{} |
---|
49 | |
---|
50 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
51 | |
---|
52 | \setcounter{secnumdepth}{3} % number subsubsections |
---|
53 | \setcounter{tocdepth}{3} % subsubsections in table of contents |
---|
54 | \makeindex |
---|
55 | |
---|
56 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
57 | |
---|
58 | \title{\LARGE |
---|
59 | Optimization of \CFA Compiler with Case Studies |
---|
60 | }% title |
---|
61 | |
---|
62 | \author{\LARGE |
---|
63 | Fangren Yu -- University of Waterloo |
---|
64 | }% author |
---|
65 | |
---|
66 | \date{ |
---|
67 | \today |
---|
68 | }% date |
---|
69 | |
---|
70 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
71 | |
---|
72 | \begin{document} |
---|
73 | \pagestyle{headings} |
---|
74 | % changed after setting pagestyle |
---|
75 | \renewcommand{\sectionmark}[1]{\markboth{\thesection\quad #1}{\thesection\quad #1}} |
---|
76 | \renewcommand{\subsectionmark}[1]{\markboth{\thesubsection\quad #1}{\thesubsection\quad #1}} |
---|
77 | \pagenumbering{roman} |
---|
78 | %\linenumbers % comment out to turn off line numbering |
---|
79 | |
---|
80 | \maketitle |
---|
81 | \pdfbookmark[1]{Contents}{section} |
---|
82 | |
---|
83 | \thispagestyle{plain} |
---|
84 | \pagenumbering{arabic} |
---|
85 | |
---|
86 | \begin{abstract} |
---|
87 | \CFA is an evolutionary, non-object-oriented extension of the C programming language, featuring a parametric type-system, and is currently under active development. The reference compiler for the \CFA language, @cfa-cc@, has some of its major components dated back to the early 2000s, which are based on inefficient data structures and algorithms. This report introduces improvements targeting the expression resolution algorithm, suggested by a recent prototype experiment on a simplified model, which are implemented in @cfa-cc@ to support the full \CFA language. These optimizations speed up the compiler by a factor of 20 across the existing \CFA codebase, bringing the compilation time of a mid-sized \CFA source file down to the 10-second level. A few problem cases derived from realistic code examples are analyzed in detail, with proposed solutions. This work is a critical step in the \CFA project development to achieve its eventual goal of being used alongside C for large software systems. |
---|
88 | \end{abstract} |
---|
89 | |
---|
90 | \section*{Acknowledgements} |
---|
91 | Thanks Mum. |
---|
92 | |
---|
93 | \clearpage |
---|
94 | \tableofcontents |
---|
95 | |
---|
96 | \section{Introduction} |
---|
97 | |
---|
98 | \CFA language, developed by the Programming Language Group at the University of Waterloo, has a long history, with the initial language design in 1992 by Glen Ditchfield~\cite{Ditchfield92} and the first proof-of-concept compiler built in 2003 by Richard Bilson~\cite{Bilson03}. Many new features have been added to the language over time, but the core of \CFA's type-system --- parametric functions introduced by the @forall@ clause (hence the name of the language) providing parametric overloading --- remains mostly unchanged. |
---|
99 | |
---|
100 | The current \CFA reference compiler, @cfa-cc@, is designed using the visitor pattern~\cite{vistorpattern} over an abstract syntax tree (AST), where multiple passes over the AST modify it for subsequent passes. @cfa-cc@ still includes many parts taken directly from the original Bilson implementation, which served as the starting point for this enhancement work to the type system. Unfortunately, the prior implementation did not provide the efficiency required for the language to be practical: a \CFA source file of approximately 1000 lines of code can take a multiple minutes to compile. The cause of the problem is that the old compiler used inefficient data structures and algorithms for expression resolution, which involved significant copying and redundant work. |
---|
101 | |
---|
102 | This report presents a series of optimizations to the performance-critical parts of the resolver, with a major rework of the compiler data-structures using a functional-programming approach to reduce memory complexity. The improvements were suggested by running the compiler builds with a performance profiler against the \CFA standard-library source-code and a test suite to find the most underperforming components in the compiler algorithm. |
---|
103 | |
---|
104 | The \CFA team endorses a pragmatic philosophy that focuses on practical implications of language design and implementation rather than theoretical limits. In particular, the compiler is designed to be expressive with respect to code reuse while maintaining type safety, but compromise theoretical soundness in extreme corner cases. However, when these corner cases do appear in actual usage, they need to be thoroughly investigated. A case-by-case analysis is presented for several of these corner cases, some of which point to certain weaknesses in the language design with solutions proposed based on experimental results. |
---|
105 | |
---|
106 | \section{AST restructuring} |
---|
107 | |
---|
108 | \subsection{Memory model with sharing} |
---|
109 | |
---|
110 | A major rework of the AST data-structure in the compiler was completed as the first step of the project. The majority of this work is documented in my prior report documenting the compiler reference-manual~\cite{cfa-cc}. To summarize: |
---|
111 | \begin{itemize} |
---|
112 | \item |
---|
113 | AST nodes (and therefore subtrees) can be shared without copying. |
---|
114 | \item |
---|
115 | Modifications are performed using functional-programming principles, making copies for local changes without affecting the original data shared by other owners. In-place mutations are permitted as a special case when there is no sharing. The logic is implemented by reference counting. |
---|
116 | \item |
---|
117 | Memory allocation and freeing are performed automatically using smart pointers~\cite{smartpointers}. |
---|
118 | \end{itemize} |
---|
119 | |
---|
120 | The resolver algorithm, designed for overload resolution, uses a significant amount of reused, and hence copying, for the intermediate representations, especially in the following two places: |
---|
121 | \begin{itemize} |
---|
122 | \item |
---|
123 | Function overload candidates are computed by combining the argument candidates bottom-up, with many being a common term. For example, if $n$ overloads of a function @f@ all take an integer for the first parameter but different types for the second, \eg @f( int, int )@, @f( int, double )@, etc., the first term is copied $n$ times for each of the generated candidate expressions. This copying is particularly bad for deep expression trees. |
---|
124 | \item |
---|
125 | In the unification algorithm and candidate elimination step, actual types are obtained by substituting the type parameters by their bindings. Let $n$ be the complexity (\ie number of nodes in representation) of the original type, $m$ be the complexity of the bound type for parameters, and $k$ be the number of occurrences of type parameters in the original type. If every substitution needs to be deep-copied, these copy step takes $O(n+mk)$ time and memory, while using shared nodes it is reduced to $O(n)$ time and $O(k)$ memory. |
---|
126 | \end{itemize} |
---|
127 | One of the worst examples for the old compiler is a long chain of I/O operations: |
---|
128 | \begin{cfa} |
---|
129 | sout | 1 | 2 | 3 | 4 | ...; // print integer constants |
---|
130 | \end{cfa} |
---|
131 | The pipe operator is overloaded by the \CFA I/O library for every primitive type in the C language, as well as I/O manipulators defined by the library. In total, there are around 50 overloads for the output stream operation. On resolving the $n$-th pipe operator in the sequence, the first term, which is the result of sub-expression containing $n-1$ pipe operators, is reused to resolve every overload. Therefore at least $O(n^2)$ copies of expression nodes are made during resolution, not even counting type unification cost; combined with the two large factors from number of overloads of pipe operators, and that the ``output stream type'' in \CFA is a trait with 27 assertions (which adds to complexity of the pipe operator's type) this makes compiling a long output sequence extremely slow. In the new AST representation, only $O(n)$ copies are required and the type of the pipe operator is not copied at all. |
---|
132 | Reduction in space complexity is especially important, as preliminary profiling results on the old compiler build showed over half of the time spent in expression resolution is on memory allocations. |
---|
133 | |
---|
134 | Since the compiler codebase is large and the new memory model mostly benefits expression resolution, some of the old data structures are still kept, and a conversion pass happens before and after the general resolve phase. Rewriting every compiler module will take longer, and whether the new model is correct was unknown when this project started, therefore only the resolver is currently implemented with the new data structure. |
---|
135 | |
---|
136 | |
---|
137 | \subsection{Merged resolver calls} |
---|
138 | |
---|
139 | The pre-resolve phase of compilation, inappropriately called ``validate'' in the compiler source code, has a number of passes that do more than simple syntax and semantic validation; some passes also normalizes the input program. A few of these passes require type information for expressions, and therefore, need to call the resolver before the general resolve phase. There are three notable places where the resolver is invoked: |
---|
140 | \begin{itemize} |
---|
141 | \item |
---|
142 | Generate default constructor, copy constructor and destructor for user-defined @struct@ types. |
---|
143 | \item |
---|
144 | Resolve @with@ statements (the same as in Pascal~\cite{pascal}), which introduces fields of a structure directly into a scope. |
---|
145 | \item |
---|
146 | Resolve @typeof@ expressions (cf. @decltype@ in \CC); note that this step may depend on symbols introduced by @with@ statements. |
---|
147 | \end{itemize} |
---|
148 | |
---|
149 | Since the constructor calls are one of the most expensive to resolve (reason given in~\VRef{s:SpecialFunctionLookup}), this pre-resolve phase was taking a large amount of time even after the resolver was changed to the more efficient new implementation. The problem is that multiple resolutions repeat a significant amount of work. Therefore, to better facilitate the new resolver, every step that requires type information should be integrated as part of the general resolver phase. |
---|
150 | |
---|
151 | A by-product of this work is that reversed dependence between @with@ statement and @typeof@ can now be handled. Previously, the compiler was unable to handle cases such as: |
---|
152 | \begin{cfa} |
---|
153 | struct S { int x; }; |
---|
154 | S foo(); |
---|
155 | typeof( foo() ) s; // type is S |
---|
156 | with (s) { |
---|
157 | x; // refers to s.x |
---|
158 | } |
---|
159 | \end{cfa} |
---|
160 | since the type of @s@ is unresolved when handling @with@ expressions because the @with@ pass follows the @typeof@ pass (interchanging passes only interchanges the problem). Instead, the new (and correct) approach is to evaluate @typeof@ expressions when the declaration is first seen during resolution, and it suffices because of the declaration-before-use rule. |
---|
161 | |
---|
162 | |
---|
163 | \subsection{Special function lookup} |
---|
164 | \label{s:SpecialFunctionLookup} |
---|
165 | |
---|
166 | Reducing the number of function looked ups for overload resolution is an effective way to gain performance when there are many overloads but most of them are trivially wrong. In practice, most functions have few (if any) overloads but there are notable exceptions. Most importantly, constructor @?{}@, destructor @^?{}@, and assignment @?=?@ are generated for every user-defined type (@struct@ and @union@ in C), and in a large source file there can be hundreds of them. Furthermore, many calls are generated for initializing variables, passing arguments and copying values. This fact makes them the most overloaded and most called functions. |
---|
167 | |
---|
168 | In an object-oriented programming language, the object-method types are scoped within a class, so a call such as @obj.f()@ only needs to perform lookup in the method table corresponding to the type of @obj@. \CFA on the other hand, does not have methods, and all types are open, \ie new operations can be defined on them without inheritance; at best a \CFA type can be constrained by a translation unit. However, the ``big 3'' operators have a unique property enforced by the language rules: the first parameter must be a reference to its associated type, which acts as the @this@ parameter in an object-oriented language. Since \CFA does not have class inheritance, the reference type must always match exactly. Therefore, argument-dependent lookup can be implemented for these operators by using a dedicated, fast symbol-table. |
---|
169 | |
---|
170 | The lookup key for the special functions is the mangled type name of the first parameter. To handle generic types, the type parameters are stripped off, and only the base type is matched. Note a constructor (destructor, assignment operator) may not take an arbitrary @this@ argument, \eg @forall( dtype T ) void ?{}( T & )@, thus guaranteeing that if the @this@ type is known, all possible overloads can be found by searching with this given type. In the case where the @this@ argument itself is overloaded, it is resolved first and all possible result types are used for lookup. |
---|
171 | |
---|
172 | Note that for a generated expression, the particular variable for the @this@ argument is fully known, without overloads, so the majority of constructor-call resolutions only need to check for one given object type. Explicit constructor calls and assignment statements sometimes require lookup for multiple types. In the extremely rare case that the @this@-argument type is unbound, all necessary types are guaranteed to be checked, as for the previous lookup without the argument-dependent lookup; fortunately, this complex case almost never happens in practice. An example is found in the library function @new@: |
---|
173 | \begin{cfa} |
---|
174 | forall( dtype T | sized( T ), ttype TT | { void ?{}( T &, TT ); } ) |
---|
175 | T * new( TT p ) { return &(*malloc()){ p }; } |
---|
176 | \end{cfa} |
---|
177 | as @malloc@ may return a pointer to any type, depending on context. |
---|
178 | |
---|
179 | Interestingly, this particular declaration actually causes another complicated issue, making the complex checking of every constructor even worse. \VRef[Section]{s:TtypeResolutionInfiniteRecursion} presents a detailed analysis of this problem. |
---|
180 | |
---|
181 | The ``callable'' operator @?()@ (cf. @operator()@ in \CC) can also be included in this special operator list, as it is usually only on user-defined types, and the restriction that the first argument must be a reference seems reasonable in this case. |
---|
182 | |
---|
183 | |
---|
184 | \subsection{Improvement of function type representation} |
---|
185 | |
---|
186 | Since substituting type parameters with their bound types is one fundamental operation in many parts of resolver algorithm (particularly unification and environment binding), making as few copies of type nodes as possible helps reducing memory complexity. Even with the new memory management model, allocation is still a significant factor of resolver performance. Conceptually, operations on type nodes of the AST should be performed in functional-programming style, treating the data structure as immutable and only copying when necessary. The in-place mutation is a mere optimization that does not change the logic for operations. |
---|
187 | |
---|
188 | However, the model was broken for function types by an inappropriate design. Function types require special treatment due to the existence of assertions that constrain the types it supports. Specifically, it must be possible to distinguish two different kinds of type parameter usage: |
---|
189 | \begin{cfa} |
---|
190 | forall( dtype T ) void foo( T * t ) { |
---|
191 | forall( dtype U ) void bar( @T@ * t, @U@ * u ) { ... } |
---|
192 | } |
---|
193 | \end{cfa} |
---|
194 | Here, only @U@ is a free parameter in the nested declaration of function @bar@, as @T@ must be bound at the call site when resolving @bar@. |
---|
195 | |
---|
196 | Moreover, the resolution algorithm also has to distinguish type bindings of multiple calls to the same function, \eg: |
---|
197 | \begin{cfa} |
---|
198 | forall( dtype T ) int foo( T x ); |
---|
199 | int i = foo( foo( 1.0 ) ); |
---|
200 | \end{cfa} |
---|
201 | The inner call has binding (T: double) while the outer call has binding (T: int). Therefore a unique representation for the free parameters is required in each expression. This type binding was previously done by creating a copy of the parameter declarations inside the function type and fixing references afterwards. However, fixing references is an inherently deep operation that does not work well with the functional-programming style, as it forces eager evaluation on the entire syntax tree representing the function type. |
---|
202 | |
---|
203 | The revised approach generates a unique ID value for each function call expression instance and represents an occurrence of a free-parameter type with a pair of generated ID and original parameter declaration, so references are unique and a shallow copy of the function type is possible. |
---|
204 | |
---|
205 | Note that after the change, all declaration nodes in the syntax-tree representation now map one-to-one with the actual declarations in the program, and therefore are guaranteed to be unique. This property can potentially enable more optimizations, and some related ideas are presented at the end of \VRef{s:SharedSub-ExpressionCaseUniqueExpressions}. |
---|
206 | |
---|
207 | |
---|
208 | \subsection{Improvement of pruning steps} |
---|
209 | |
---|
210 | A minor improvement for candidate elimination is to skip the step on the function overloads and only check the results of function application. As function calls are usually by name (versus pointers to functions), the name resolution rule dictates that every function candidate necessarily has a different type; indirect function calls are rare, and when they do appear, there are even fewer cases with multiple interpretations, and these rarely match exactly in argument type. Since function types have a much more complex representation (with multiple parameters and assertions) than data types, checking equality on them also takes longer. |
---|
211 | |
---|
212 | A brief test of this approach shows that the number of function overloads considered in expression resolution increases by an amount of less than 1 percent, while type comparisons in candidate elimination are reduced by more than half. This improvement is consistent over all \CFA source files in the test suite. |
---|
213 | |
---|
214 | |
---|
215 | \subsection{Shared sub-expression case: Unique expressions} |
---|
216 | \label{s:SharedSub-ExpressionCaseUniqueExpressions} |
---|
217 | |
---|
218 | Unique expression denotes an expression evaluated only once to prevent unwanted side effects. It is currently only a compiler artifact, generated for tuple-member expression of the form: |
---|
219 | \begin{cfa} |
---|
220 | struct S { int a; int b; }; |
---|
221 | S s; |
---|
222 | s.[a, b]; // tuple member expression, type is [int, int] |
---|
223 | \end{cfa} |
---|
224 | If the aggregate expression is function call, it cannot be evaluated multiple times: |
---|
225 | \begin{cfa} |
---|
226 | S makeS(); |
---|
227 | makeS().[a, b]; // this should only generate a unique S |
---|
228 | \end{cfa} |
---|
229 | Before code generation, the above expression is internally represented as |
---|
230 | \begin{cfa} |
---|
231 | [uniqueExpr{makeS()}.a, uniqueExpr{makeS()}.b]; |
---|
232 | \end{cfa} |
---|
233 | and the unique expression is expanded to$\footnote{This uses gcc's statement expression syntax, where \lstinline|(\{block\})| evaluates to value of last expression in the block.}$ |
---|
234 | \begin{cfa} |
---|
235 | ({ |
---|
236 | if ( !_unique_var_evaluated ) { |
---|
237 | _unique_var_evaluated = true; |
---|
238 | _unique_var = makeS(); |
---|
239 | } |
---|
240 | _unique_var; |
---|
241 | }) |
---|
242 | \end{cfa} |
---|
243 | at code generation, where @_unique_var@ and @_unique_var_evaluated@ are generated variables whose scope covers all appearances of the same expression. |
---|
244 | The conditional check ensures a single call to @makeS()@ even though there are logically multiple calls because of the tuple field expansion. |
---|
245 | |
---|
246 | Note that although the unique expression is only used for tuple expansion now, it is a generally useful construction, and is seen in other programming languages, such as Scala's @lazy val@~\cite{Scala}; therefore it may be worthwhile to introduce the unique expression to a broader context in \CFA and even make it directly available to programmers. |
---|
247 | |
---|
248 | In the compiler's visitor pattern, however, this creates a problem where multiple paths to a logically unique expression exist, so it may be modified more than once and become ill-formed; some specific intervention is required to ensure unique expressions are only visited once. Furthermore, a unique expression appearing in more than one places is copied on mutation so its representation is no longer unique. |
---|
249 | |
---|
250 | Currently, special cases are required to keep everything synchronized, and the methods are different when mutating the unique expression instance itself or its underlying expression: |
---|
251 | \begin{itemize} |
---|
252 | \item |
---|
253 | When mutating the underlying expression (visit-once guard) |
---|
254 | \begin{cfa} |
---|
255 | void InsertImplicitCalls::previsit( const ast::UniqueExpr * unqExpr ) { |
---|
256 | @if ( visitedIds.count( unqExpr->id ) ) visit_children = false;@ |
---|
257 | else visitedIds.insert( unqExpr->id ); |
---|
258 | } |
---|
259 | \end{cfa} |
---|
260 | \item |
---|
261 | When mutating the unique instance itself, which actually creates copies |
---|
262 | \begin{cfa} |
---|
263 | auto mutExpr = mutate( unqExpr ); // internally calls copy when shared |
---|
264 | @if ( ! unqMap.count( unqExpr->id ) ) {@ |
---|
265 | ... |
---|
266 | } else { |
---|
267 | mutExpr->expr = unqMap[mutExpr->id]->expr; |
---|
268 | mutExpr->result = mutExpr->expr->result; |
---|
269 | } |
---|
270 | \end{cfa} |
---|
271 | \end{itemize} |
---|
272 | Such workarounds are difficult to fit into the common visitor pattern, which suggests the memory model may need different kinds of nodes to accurately represent this feature in the AST. |
---|
273 | |
---|
274 | Given that declaration nodes are unique, it is possible for AST nodes to be divided into three different types: |
---|
275 | \begin{itemize} |
---|
276 | \item |
---|
277 | \textbf{Singleton} with only one owner (declarations); |
---|
278 | \item |
---|
279 | \textbf{No-copy} with multiple owners but cannot be copied (unique expression example presented here); |
---|
280 | \item |
---|
281 | \textbf{Copy} by functional-programming style, which assumes immutable data structures that are copied on mutation. |
---|
282 | \end{itemize} |
---|
283 | The boilerplate code can potentially handle these three cases differently. |
---|
284 | |
---|
285 | |
---|
286 | \section{Analysis of resolver algorithm complexity} |
---|
287 | |
---|
288 | The focus of this section is to identify and analyze some realistic cases that cause the resolver algorithm to have an exponential runtime. As previous work has shown~\cite[\S~4.2.1]{Moss19}, the overload resolution problem in \CFA has worst-case exponential complexity; however, only few specific patterns can trigger the exponential complexity in practice. Implementing heuristic-based optimization for those selected cases is helpful to alleviate the problem. |
---|
289 | |
---|
290 | |
---|
291 | \subsection{Unbound return type} |
---|
292 | \label{s:UnboundReturnType} |
---|
293 | |
---|
294 | The interaction of return-type overloading and polymorphic functions creates function calls with unbounded return-type, and is further complicated by the presence of assertions. |
---|
295 | The prime example of a function with unbound return type is the type-safe version of C @malloc@: |
---|
296 | \begin{cfa} |
---|
297 | forall( dtype T | sized( T ) ) |
---|
298 | T * malloc( void ) { return (T *)malloc( sizeof(T) ); } // call C malloc |
---|
299 | int * i = malloc(); // type deduced from left-hand size $\Rightarrow$ no size argument or return cast |
---|
300 | \end{cfa} |
---|
301 | An unbound return-type is problematic in resolver complexity because a single match of a function call with an unbound return type may create multiple candidates. In the worst case, consider a function declared that returns any @otype@ (defined \VPageref{otype}): |
---|
302 | \begin{cfa} |
---|
303 | forall( otype T ) T anyObj( void ); |
---|
304 | \end{cfa} |
---|
305 | As the resolver attempts to satisfy the otype constraint on @T@, a call to @anyObj()@ in an expression, without the result type known, creates at least as many candidates as the number of complete types currently in scope; with generic types it becomes even worse, \eg assuming a declaration of a generic @pair@ is available at that point: |
---|
306 | \begin{cfa} |
---|
307 | forall( otype T, otype U ) struct pair { T first; U second; }; |
---|
308 | \end{cfa} |
---|
309 | Then an @anyObj()@ call can result in arbitrarily complex types, such as @pair( pair( int, int ), pair( int, int ) )@, and the depth can grow indefinitely until a specified parameter-depth limit, thus creating exponentially many candidates. However, the expected types allowed by parent expressions are practically very few, so most of those interpretations are invalid; if the result type is never bound up to the top level, by the semantic rules it is ambiguous if there is more than one valid binding and resolution fails quickly. It is therefore reasonable to delay resolving assertions on an unbound parameter in a return type; however, with the current cost model, such behavior may further cause irregularities in candidate selection, such that the presence of assertions can change the preferred candidate, even when order of expression costs are supposed to stay the same. A detailed analysis of this issue is presented in \VRef{s:AnalysisTypeSystemCorrectness}. |
---|
310 | |
---|
311 | |
---|
312 | \subsection[ttype resolution infinite recursion]{\lstinline|ttype| resolution infinite recursion} |
---|
313 | \label{s:TtypeResolutionInfiniteRecursion} |
---|
314 | |
---|
315 | @ttype@ (``tuple type'') is a relatively new addition to the language that attempts to provide type-safe variadic argument semantics. Unlike regular @dtype@ parameters, @ttype@ is only valid in a function parameter-list, and may only appear once as the last parameter type. At the call site, a @ttype@ parameter is bound to the tuple type of all remaining function-call arguments. |
---|
316 | |
---|
317 | There are two kinds of idiomatic @ttype@ usage: one is to provide flexible argument forwarding, similar to the variadic template in \CC (\lstinline[language=C++]|template<typename... args>|), as shown below in the implementation of @unique_ptr@ |
---|
318 | \begin{cfa} |
---|
319 | forall( dtype T ) |
---|
320 | struct unique_ptr { |
---|
321 | T * data; |
---|
322 | }; |
---|
323 | forall( dtype T | sized( T ), @ttype Args@ | { void ?{}( T &, Args ); }) |
---|
324 | void ?{}( unique_ptr( T ) & this, Args @args@ ) { |
---|
325 | this.data = new( @args@ ); // forward constructor arguments to dynamic allocator |
---|
326 | } |
---|
327 | \end{cfa} |
---|
328 | The other usage is to implement structural recursion in the first-rest pattern: |
---|
329 | \begin{cfa} |
---|
330 | forall( otype T, @ttype Params@ | { void process( T ); void func( Params ); }) |
---|
331 | void func( T arg1, Params p ) { |
---|
332 | process( arg1 ); |
---|
333 | func( @p@ ); // recursive call until base case of one argument |
---|
334 | } |
---|
335 | \end{cfa} |
---|
336 | For the second use case, it is imperative the number of parameters in the recursive call goes down, since the call site must deduce all assertion candidates, and that is only possible if by observation of the argument types (and not their values), the recursion is known to be completed in a finite number of steps. |
---|
337 | |
---|
338 | In recent experiments, however, a flaw in the type-binding rules can lead to the first kind of @ttype@ use case producing an invalid candidate and the resolver enters an infinite loop. |
---|
339 | This bug was discovered in an attempt to raise the assertion recursive-depth limit and one of the library programs took exponentially longer to compile. The cause of the problem is the following set of functions: |
---|
340 | \begin{cfa} |
---|
341 | // unique_ptr declaration from above |
---|
342 | |
---|
343 | forall( dtype T | sized( T ), ttype Args | { void ?{}( T &, Args ); } ) { // distribute forall clause |
---|
344 | void ?{}( counter_data( T ) & this, Args args ); |
---|
345 | void ?{}( counter_ptr( T ) & this, Args args ); |
---|
346 | void ?{}( unique_ptr( T ) & this, Args args ); |
---|
347 | } |
---|
348 | |
---|
349 | forall( dtype T | sized( T ), ttype TT | { void ?{}( T &, TT ); } ) |
---|
350 | T * new( TT p ) { return @&(*malloc()){ p };@ } |
---|
351 | \end{cfa} |
---|
352 | In the expression @(*malloc()){p}@, the type of the object being constructed is unknown, since the return-type information is not immediately available. That causes every constructor to be searched, and while normally a bound @ttype@ cannot be unified with any free parameter, it is possible with another free @ttype@. Therefore, in addition to the correct option provided by the assertion, 3 wrong options are examined, each of which again requires the same assertion, for an unknown base-type @T@ and @ttype@ argument, which becomes an infinite loop until the specified recursion limit and resolution is fails. Moreover, during the recursion steps, the number of candidates grows exponentially, since there are always 3 options at each step. |
---|
353 | |
---|
354 | Unfortunately, @ttype@ to @ttype@ binding is necessary, to allow indirectly calling a function provided in an assertion. |
---|
355 | \begin{cfa} |
---|
356 | forall( dtype T | sized( T ), ttype Args | { @void ?{}( T &, Args );@ }) |
---|
357 | void ?{}( unique_ptr( T ) & this, Args args ) { this.data = (T *)@new( args )@; } // constructor call |
---|
358 | \end{cfa} |
---|
359 | Here the constructor assertion is used by the @new( args )@ call to indirectly call the constructor on the allocated storage. |
---|
360 | Therefore, it is hard, perhaps impossible, to solve this problem by tweaking the type binding rules. An assertion caching algorithm can help improve this case by detecting cycles in recursion. |
---|
361 | |
---|
362 | Meanwhile, without a caching algorithm implemented, some changes in the \CFA source code are enough to eliminate this problem, at least in the current codebase. Note that the issue only happens with an overloaded variadic function, which rarely appears in practice, since the idiomatic use cases are for argument forwarding and self-recursion. The only overloaded @ttype@ function so far discovered in all of \CFA standard library is the constructor, and by utilizing the argument-dependent lookup process described in \VRef{s:UnboundReturnType}, adding a cast before the constructor call removes the issue. |
---|
363 | \begin{cfa} |
---|
364 | T * new( TT p ) { return &(*@(T * )@malloc()){ p }; } |
---|
365 | \end{cfa} |
---|
366 | |
---|
367 | |
---|
368 | \subsection{Reused assertions in nested generic type} |
---|
369 | |
---|
370 | The following test of deeply nested, dynamic generic type reveals that locally caching reused assertions is necessary, rather than just a resolver optimization, because recomputing assertions can result in bloated generated code size: |
---|
371 | \begin{cfa} |
---|
372 | struct nil {}; |
---|
373 | forall( otype L, otype R ) |
---|
374 | struct cons { L l; R r; }; |
---|
375 | |
---|
376 | int main() { |
---|
377 | #if N==0 |
---|
378 | nil @x@; |
---|
379 | #elif N==1 |
---|
380 | cons( size_t, nil ) @x@; |
---|
381 | #elif N==2 |
---|
382 | cons( size_t, cons( size_t, nil ) ) @x@; |
---|
383 | #elif N==3 |
---|
384 | cons( size_t, cons( size_t, cons( size_t, nil ) ) ) @x@; |
---|
385 | // similarly for N=4,5,6 |
---|
386 | #endif |
---|
387 | } |
---|
388 | \end{cfa} |
---|
389 | At the declaration of @x@, it is implicitly initialized by generated constructor call, with signature: |
---|
390 | \begin{cfa} |
---|
391 | forall( otype L, otype R ) void ?{}( cons( L, R ) & ); |
---|
392 | \end{cfa} |
---|
393 | where the @otype@ constraint contains the 4 assertions:\label{otype} |
---|
394 | \begin{cfa} |
---|
395 | void ?{}( L & ); // default constructor |
---|
396 | void ?{}( L &, L & ); // copy constructor |
---|
397 | void ^?{}( L & ); // destructor |
---|
398 | L & ?=?( L &, L & ); // assignment |
---|
399 | \end{cfa} |
---|
400 | Now since the right hand side of outermost cons is again a cons, recursive assertions are required. \VRef[Table]{t:NestedConsTest} shows when the compiler does not cache and reuse already resolved assertions, it becomes a problem, as each of these 4 pending assertions again asks for 4 more assertions one level below. Without caching, the number of resolved assertions grows exponentially, which is unnecessary since there are only $n+1$ different types involved. Even worse, this problem causes exponentially many wrapper functions to be generated at the backend, resulting in a huge binary. As the local functions are implemented by emitting executable code on the stack~\cite{gcc-nested-func}, it means that compiled code also has exponential run time. This problem has practical implications, as nested collection types are frequently used in real production code. |
---|
401 | |
---|
402 | \begin{table}[h] |
---|
403 | \centering |
---|
404 | \caption{Compilation results of nested cons test} |
---|
405 | \label{t:NestedConsTest} |
---|
406 | \begin{tabular}{|r|r|r|} |
---|
407 | \hline |
---|
408 | N & Assertions resolved & Binary size (KB) \\ |
---|
409 | \hline |
---|
410 | 3 & 170 & 70 \\ |
---|
411 | \hline |
---|
412 | 4 & 682 & 219 \\ |
---|
413 | \hline |
---|
414 | 5 & 2,730 & 829 \\ |
---|
415 | \hline |
---|
416 | 6 & 10,922 & 3,261 \\ |
---|
417 | \hline |
---|
418 | \end{tabular} |
---|
419 | \end{table} |
---|
420 | |
---|
421 | |
---|
422 | \section{Analysis of type system correctness} |
---|
423 | \label{s:AnalysisTypeSystemCorrectness} |
---|
424 | |
---|
425 | In Moss' thesis~\cite[\S~4.1.2,~p.~45]{Moss19}, the author presents the following example: |
---|
426 | \begin{quote} |
---|
427 | The cost model of \CFA precludes a greedy bottom-up resolution pass, as constraints and costs introduced by calls higher in the expression tree can change the interpretation of those lower in the tree, as in the following example: |
---|
428 | \begin{cfa} |
---|
429 | void f( int ); |
---|
430 | double g$_1$( int ); |
---|
431 | int g$_2$( long ); |
---|
432 | f( g( 42 ) ); |
---|
433 | \end{cfa} |
---|
434 | Considered independently, @g$_1$( 42 )@ is the cheapest interpretation of @g( 42 )@, with cost (0, 0, 0, 0, 0, 0, 0) since the argument type is an exact match. However, in context, an unsafe conversion is required to downcast the return type of @g1@ to an @int@ suitable for @f@, for a total cost of (1, 0, 0, 0, 0, 0, 0) for @f( g1( 42 ) )@. If @g2@ is chosen, on the other hand, there is a safe upcast from the int type of 42 to @long@, but no cast on the return of @g2@, for a total cost of (0, 0, 1, 0, 0, 0, 0) for @f( g2( 42 ) );@ as this is cheaper, @g2@ is chosen. Due to this design, all valid interpretations of subexpressions must in general be propagated to the top ... |
---|
435 | \end{quote} |
---|
436 | The cost model in the thesis sums up all costs in resolving sub-expressions and chooses the global optimal option. However, the current implementation in @cfa-cc@, based on the original Bilson model~\cite[\S~2.2.4,~p.~32]{Bilson03}, does eager resolution and only allows local optimal candidate selections to propagate upwards: |
---|
437 | \begin{quote} |
---|
438 | From the set of candidates whose parameter and argument types have been unified and whose assertions have been satisfied, those whose sub-expression interpretations have the smallest total cost of conversion are selected ... The total cost of conversion for each of these candidates is then calculated based on the implicit conversions and polymorphism involved in adapting the types of the sub-expression interpretations to the formal parameter types. |
---|
439 | \end{quote} |
---|
440 | With this model, the algorithm picks @g1@ in resolving the @f( g( 42 ) )@ call, which is undesirable. |
---|
441 | |
---|
442 | There is further evidence that shows the Bilson model is fundamentally incorrect, following the discussion of unbound return type in \VRef{s:UnboundReturnType}. By the conversion-cost specification, a binding from a polymorphic type-parameter to a concrete type incurs a polymorphic cost of 1. It remains unspecified \emph{when} the type parameters should become bound. When the parameterized types appear in function parameters, they can be deduced from the argument type, and there is no ambiguity. In the unbound return case, however, the binding may happen at any stage in expression resolution, therefore it is impossible to define a unique local conversion cost. Note that type binding happens exactly once per parameter in resolving the entire expression, so the global binding cost is unambiguously 1. |
---|
443 | |
---|
444 | In the current compiler implementation, there is a notable inconsistency in handling this case. For any unbound parameter that does \emph{not} come with an associated assertion, it remains unbound to the parent expression; for those that do, however, they are immediately bound in the assertion resolution step, and concrete result types are used in the parent expressions. |
---|
445 | Consider the following example: |
---|
446 | \begin{cfa} |
---|
447 | forall( dtype T ) T * f( void ); |
---|
448 | void h( int * ); |
---|
449 | \end{cfa} |
---|
450 | The expression @h( f() )@ eventually has a total cost of 1 from binding (T: int), but in the eager-resolution model, the cost of 1 may occur either at the call to @f@ or at call to @h@, and with the assertion resolution triggering a binding, the local cost of @f()@ is (0 poly, 0 spec) with no assertions, but (1 poly, -1 spec) with an assertion: |
---|
451 | \begin{cfa} |
---|
452 | forall( dtype T | @{ void g( T * ); }@ ) T * f( void ); |
---|
453 | void g( int * ); |
---|
454 | void h( int * ); |
---|
455 | \end{cfa} |
---|
456 | and that contradicts the principle that adding assertions should make expression cost lower. Furthermore, the time at which type binding and assertion resolution happens is an implementation detail of the compiler, not part of the language definition. That means two compliant \CFA compilers, one performing immediate assertion resolution at each step, and one delaying assertion resolution on unbound types, can produce different expression costs and therefore different candidate selection, making the language rule itself partially undefined, and therefore, unsound. By the above reasoning, the updated cost model using global sum of costs should be accepted as the standard. It also allows the compiler to freely choose when to resolve assertions, as the sum of total costs is independent of that choice; more optimizations regarding assertion resolution can also be implemented. |
---|
457 | |
---|
458 | |
---|
459 | \section{Timing results} |
---|
460 | |
---|
461 | For the timing results presented here, the \CFA compiler is built with gcc 9.3.0, and tested on a server machine running Ubuntu 20.04, 64GB RAM and 32-core 2.2 GHz CPU. |
---|
462 | Timing is reported by the @time@ command and an experiment is run using 8 cores, where each core is at 100\% CPU utilization. |
---|
463 | |
---|
464 | On the most recent build, the \CFA standard library ($\approx$1.3 MB of source code) compiles in 4 minutes 47 seconds total processor time (single thread equivalent), with the slowest file taking 13 seconds. The test suite (178 test cases, $\approx$2.2MB of source code) completes within 25 minutes total processor time, |
---|
465 | % PAB: I do not understand this footnote. |
---|
466 | %\footnote{Including a few runtime tests; total time spent in compilation is approximately 21 minutes.} |
---|
467 | with the slowest file taking 23 seconds. In contrast, the library build with the old compiler takes 85 minutes total, 5 minutes for the slowest file. The full test-suite takes too long with old compiler build and is therefore not run, but the slowest test cases take approximately 5 minutes. Overall, the most recent build compared to an old build is consistently faster by a factor of 20. |
---|
468 | |
---|
469 | \begin{table} |
---|
470 | \centering |
---|
471 | \caption{Compile time of selected files by compiler build, in seconds} |
---|
472 | \label{t:SelectedFileByCompilerBuild} |
---|
473 | \begin{tabular}{|l|r|r|r|r|r|r|} |
---|
474 | \hline |
---|
475 | Test case & old & \#0 & \#1 & \#2 & \#3 & \#4 \\ |
---|
476 | \hline |
---|
477 | @lib/fstream@ & 86.4 & 25.2 & 10.8 & 9.5 & 7.8 & 7.1 \\ |
---|
478 | \hline |
---|
479 | @lib/mutex@ & 210.4 & 77.4 & 16.7 & 15.1 & 12.6 & 11.7 \\ |
---|
480 | \hline |
---|
481 | @lib/vector@ & 17.2 & 8.9 & 3.1 & 2.8 & 2.4 & 2.2 \\ |
---|
482 | \hline |
---|
483 | @lib/stdlib@ & 16.6 & 8.3 & 3.2 & 2.9 & 2.6 & 2.4 \\ |
---|
484 | \hline |
---|
485 | @test/io2@ & 300.8 & 53.6 & 43.2 & 27.9 & 19.1 & 16.3 \\ |
---|
486 | \hline |
---|
487 | @test/thread@ & 210.9 & 73.5 & 17.0 & 15.1 & 12.6 & 11.8 \\ |
---|
488 | \hline |
---|
489 | \end{tabular} |
---|
490 | \smallskip |
---|
491 | \newline |
---|
492 | Results are average of 5 runs (3 runs if time exceeds 100 seconds) |
---|
493 | \end{table} |
---|
494 | |
---|
495 | Additionally, 6 selected \CFA source files with distinct features from the library and test suite are used to illustrate the compiler performance change after each of the implemented optimizations. Test files are from the most recent build and run through the C preprocessor to expand header file, perform macro expansions, but no line number information (@gcc -E -P@). |
---|
496 | \VRef[Table]{t:SelectedFileByCompilerBuild} shows the selected tests: |
---|
497 | \begin{itemize} |
---|
498 | \item |
---|
499 | @lib/fstream@ (112 KB) |
---|
500 | \item |
---|
501 | @lib/mutex@ (166 KB): implementation of concurrency primitive |
---|
502 | \item |
---|
503 | @lib/vector@ (43 KB): container example, similar to \CC vector |
---|
504 | \item |
---|
505 | @lib/stdlib@ (64 KB): type-safe wrapper to @void *@-based C standard library functions |
---|
506 | \item |
---|
507 | @test/ISO2@ (55 KB): application of I/O library |
---|
508 | \item |
---|
509 | @test/thread@ (188 KB): application of threading library |
---|
510 | \end{itemize} |
---|
511 | versus \CFA compiler builds picked from the git commit history that implement the optimizations incrementally: |
---|
512 | \begin{itemize} |
---|
513 | \item |
---|
514 | old resolver |
---|
515 | \item |
---|
516 | \#0 is the first working build of the new AST data structure |
---|
517 | \item |
---|
518 | \#1 implements special symbol table and argument-dependent lookup |
---|
519 | \item |
---|
520 | \#2 implements late assertion-satisfaction |
---|
521 | \item |
---|
522 | \#3 implements revised function-type representation |
---|
523 | \item |
---|
524 | \#4 skips pruning on expressions for function types (most recent build) |
---|
525 | \end{itemize} |
---|
526 | Reading left to right for a test shows the benefit of each optimization on the cost of compilation. |
---|
527 | |
---|
528 | \section{Conclusion} |
---|
529 | |
---|
530 | Over the course of 8 months of active research and development of the \CFA type system and compiler algorithms, performance of the reference \CFA compiler, cfa-cc, has been greatly improved. Now, mid-sized \CFA programs are compiled reasonably fast. Currently, there are ongoing efforts by the \CFA team to augment the standard library and evaluate its runtime performance, and incorporate \CFA with existing software written in C; therefore this project is especially meaningful for these practical purposes. |
---|
531 | |
---|
532 | Accomplishing this work was difficult. Analysis conducted in the project is based significantly on heuristics and practical evidence, as the theoretical bounds and average cases for the expression resolution problem differ. As well, the slowness of the initial compiler made attempts to understand why and where problems exist extremely difficult because both debugging and validation tools (\eg @gdb@, @valgrind@, @pref@) further slowed down compilation time. However, by the end of the project, I had found and fixed several significant problems and new optimizations are easier to introduce and test. The reduction in the development cycle benefits the \CFA team as a whole. |
---|
533 | |
---|
534 | Some potential issues of the language, which happen frequently in practice, have been identified. Due to the time constraint and complex nature of these problems, a handful of them remain unsolved, but some constructive proposals are made. Notably, introducing a local assertion cache in the resolver is a reasonable solution for a few remaining problems, so that should be the focus of future work. |
---|
535 | |
---|
536 | The \CFA team are planning on a public alpha release of the language as the compiler performance, given my recent improvements, is now useable. Other parts of the system, such as the standard library, have made significant gains due to the speed up in the development cycle. Ideally, the remaining problems should be resolved before release, and the solutions will also be integral to drafting a formal specification. |
---|
537 | |
---|
538 | \addcontentsline{toc}{section}{\refname} |
---|
539 | \bibliographystyle{plain} |
---|
540 | \bibliography{pl} |
---|
541 | |
---|
542 | \end{document} |
---|
543 | |
---|
544 | % Local Variables: % |
---|
545 | % tab-width: 4 % |
---|
546 | % fill-column: 100 % |
---|
547 | % compile-command: "make" % |
---|
548 | % End: % |
---|