Changeset 7b61ce8 for doc/theses/aaron_moss_PhD/phd/type-environment.tex

Ignore:
Timestamp:
Oct 9, 2018, 5:27:17 PM (4 years ago)
Branches:
aaron-thesis, arm-eh, cleanup-dtors, deferred_resn, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, no_list, persistent-indexer
Children:
1f690b3, eeb0767
Parents:
a29c6e2
Message:

Started type environment thesis chapter (now after expression resolution)

File:
1 edited

Legend:

Unmodified
 ra29c6e2 \label{env-chap} Talk about the type environment data structure. Pull from your presentation. One key data structure for expression resolution is the \emph{type environment}. As discussed in Chapter~\ref{resolution-chap}, being able to efficiently determine which type variables are bound to which concrete types or whether two type environments are compatible is a core requirement of the resolution algorithm. Furthermore, expression resolution involves a search through many related possible solutions, so being able to re-use shared subsets of type environment data and to switch between environments quickly is desirable for performance. In this chapter I discuss and empirically compare a number of type environment data structure variants, including some novel variations on the union-find\cit{} data structure introduced in this thesis. \section{Definitions} For purposes of this chapter, a \emph{type environment} $T$ is a set of \emph{type classes} $\myset{T_1, T_2, \cdots, T_{|T|}}$. Each type class $T_i$ contains a set of \emph{type variables} $\myset{v_{i,1}, v_{i,2}, \cdots, v_{i,|T_i|}}$; note that the sets of variables contained in two distinct classes in the same environment must be disjoint. Each individual type class $T_i$ may also be associated with a \emph{bound}, $b_i$; this bound contains the \emph{bound type} which the variables in the type class are replaced with, but also includes other information in \CFACC{}, including whether type conversions are permissible on the bound type and what sort of type variables are contained in the class (data types, function types, or variadic tuples). \begin{table} \caption[Type environment operation summary]{Summary of type environment operations.} \label{env-op-table} \centering \begin{tabular}{r@{\hskip 0.25em}ll} $find(T, v_{i,j})$ & $\rightarrow T_i | \bot$           & Locate class for variable             \\ $report(T_i)$ & $\rightarrow \{ v_{i,j} \cdots \}$      & List variables for class              \\ $bound(T_i)$ & $\rightarrow b_i | \bot$                         & Get bound for class                   \\ $insert(v_{i,1})$ &                                                                     & New single-variable class             \\ $add(T_i, v_{i,j})$ &                                                           & Add variable to class                 \\ $bind(T_i, b_i)$ &                                                                      & Set or update class bound             \\ $remove(T, T_i)$ &                                                                      & Remove class from environment \\ $unify(T, T_i, T_j)$ & $\rightarrow \top | \bot$        & Combine two type classes              \\ $combine(T, T')$ & $\rightarrow \top | \bot$            & Merge two environments                \\ $save(T)$ & $\rightarrow H$                                                     & Get handle for current state  \\ $backtrack(T, H)$ &                                                                     & Return to handle state \end{tabular} \end{table} Given this basic structure, type environments in \CFACC{} need to support eleven basic operations, summarized in Table~\ref{env-op-table}. The first seven operations are straightforward queries and updates on these data structures: The lookup operation $find(T, v_{i,j})$ produces $T_i$, the type class in $T$ which contains variable $v_{i,j}$, or an invalid sentinel value for no such class. The other two query operations act on type classes, where $report(T_i)$ produces the set $\myset{v_{i,1}, v_{i,2}, \cdots, v_{i,|T_i|}}$ of all type variables in a class $T_i$ and $bound(T_i)$ produces the bound $b_i$ of that class, or a sentinel indicating no bound is set. The update operation $insert(T, v_{i,1})$ creates a new type class $T_i$ in $T$ that contains only the variable $v_{i,1}$ and no bound; due to the disjointness property $v_{i,1}$ cannot belong to any other type class in $T$. The $add(T_i, v_{i,j})$ operation adds a new type variable $v_{i,j}$ to class $T_i$; again, $v_{i,j}$ cannot exist elsewhere in $T$. $bind(T_i, b_i)$ mutates the bound for a type class, setting or updating the current bound. The final basic mutation operation is $remove(T, T_i)$, which removes a class $T_i$ and all its type variables from an environment $T$. The $unify$ operation is the fundamental non-trivial operation a type environment data structure must support. $unify(T, T_i, T_j)$ merges a type class $T_j$ into another $T_i$, producing a failure result and leaving $T$ in an invalid state if this merge fails. It is always possible to unify the type variables of both classes by simply taking the union of both sets; given the disjointness property, no checks for set containment are required, and the variable sets can simply be concatenated if supported by the underlying data structure. $unify$ depends on an internal $unify_bound$ operation which may fail. In \CFACC{}, $unify_bound(b_i, b_j) \rightarrow b'_i|\bot$ checks that the type classes contain the same sort of variable, takes the tighter of the two conversion permissions, and checks if the bound types can be unified. If the bound types cannot be unified (\eg{} !struct A! with !int*!), then $unify_bound$ fails, while other combinations of bound types may result in recursive calls. For instance, unifying !R*! with !S*! for type variables !R! and !S! will result in a call to $unify(T, find($!R!$), find($!S!$))$, while unifying !R*! with !int*! will result in a call to $unify_bound$ on !int! and the bound type of the class containing !R!. As such, a call to $unify(T, T_i, T_j)$ may touch every type class in $T$, not just $T_i$ and $T_j$, collapsing the entirety of $T$ into a single type class in extreme cases. Given the nature of the expression resolution problem as backtracking search, caching and concurrency are both useful tools to decrease runtime. However, both of these approaches may produce multiple distinct descendants of the same initial type environment, which have possibly been mutated in incompatible ways. As such, to effectively employ either concurrency or caching, the type environment data structure must support an efficient method to check if two type environments are compatible and merge them if so. $combine(T,T')$ attempts to merge an environment $T'$ into another environment $T$, producing $\top$ if successful or leaving $T$ in an invalid state and producing $\bot$ otherwise. The invalid state of $T$ on failure is not important, given that a combination failure will result in the resolution algorithm backtracking to a different environment. $combine$ proceeds by calls to $insert$, $add$, and $unify$ as needed, and can be roughly thought of as calling $unify$ on every pair of classes in $T$ that have variables $v'_{i,j}$ and $v'_{i,k}$ in the same class $T'_i$ in $T'$. Like for $unify$, $combine$ can always find a mutually-consistent division of type variables into classes (in the extreme case, all type variables from $T$ and $T'$ in a single type class), but may fail due to inconsistent bounds on merged type classes. Finally, the backtracking access patterns of the compiler can be exploited to reduce memory usage or runtime through use of an appropriately designed data structure. The set of mutations to a type environment across the execution of the resolution algorithm produce an implicit tree of related environments, and the backtracking search typically focuses only on one leaf of the tree at once, or at most a small number of closely-related nodes as arguments to $combine$. As such, the ability to save and restore particular type environment states is useful, and supported by the $save(T) \rightarrow H$ and $backtrack(T, H)$ operations, which produce a handle for the current environment state and mutate an environment back to a previous state, respectively. These operations can be naively implemented by a deep copy of $T$ into $H$ and vice versa, but have more efficient implementations in persistency-aware data structures. % Future work: design multi-threaded version of C&F persistent map --- core idea is some sort of thread-boundary edit node