source: doc/theses/jiada_liang_MMath/background.tex@ d39d8a4

Last change on this file since d39d8a4 was d39d8a4, checked in by Peter A. Buhr <pabuhr@…>, 15 months ago

proofread background chapter

  • Property mode set to 100644
File size: 22.8 KB
Line 
1\chapter{Background}
2
3This chapter covers background material for C enumerations and \CFA features used in later discussion.
4
5
6\section{C}
7
8As mentioned in \VRef{s:Aliasing}, it is common for C programmers to ``believe'' there are three equivalent forms of named constants.
9\begin{clang}
10#define Mon 0
11static const int Mon = 0;
12enum { Mon };
13\end{clang}
14\begin{enumerate}[leftmargin=*]
15\item
16For @#define@, the programmer has to explicitly manage the constant name and value.
17Furthermore, these C preprocessor macro names are outside of the C type-system and can incorrectly change random text in a program.
18\item
19The same explicit management is true for the @const@ declaration, and the @const@ variable cannot appear in constant-expression locations, like @case@ labels, array dimensions,\footnote{
20C allows variable-length array-declarations (VLA), so this case does work, but it fails in \CC, which does not support VLAs, unless it is \lstinline{g++}.} immediate oper\-ands of assembler instructions, and occupies storage.
21\begin{clang}
22$\$$ nm test.o
230000000000000018 r Mon
24\end{clang}
25\item
26Only the @enum@ form is managed by the compiler, is part of the language type-system, works in all C constant-expression locations, and normally does not occupy storage.
27\end{enumerate}
28
29
30\subsection{C \lstinline{const}}
31\label{s:Cconst}
32
33C can simulate the aliasing @const@ declarations \see{\VRef{s:Aliasing}}, with static and dynamic initialization.
34\begin{cquote}
35\begin{tabular}{@{}ll@{}}
36\multicolumn{1}{@{}c}{\textbf{static initialization}} & \multicolumn{1}{c@{}}{\textbf{dynamic initialization}} \\
37\begin{clang}
38static const int one = 0 + 1;
39static const void * NIL = NULL;
40static const double PI = 3.14159;
41static const char Plus = '+';
42static const char * Fred = "Fred";
43static const int Mon = 0, Tue = Mon + 1, Wed = Tue + 1,
44 Thu = Wed + 1, Fri = Thu + 1, Sat = Fri + 1, Sun = Sat + 1;
45\end{clang}
46&
47\begin{clang}
48void foo() {
49 // auto scope only
50 const int r = random() % 100;
51 int va[r];
52}
53\end{clang}
54\end{tabular}
55\end{cquote}
56However, statically initialized identifiers cannot appear in constant-expression contexts, \eg @case@.
57Dynamically initialized identifiers may appear in initialization and array dimensions in @g++@, which allows variable-sized arrays on the stack.
58Again, this form of aliasing is not an enumeration.
59
60
61\subsection{C Enumeration}
62\label{s:CEnumeration}
63
64The C enumeration has the following syntax~\cite[\S~6.7.2.2]{C11}.
65\begin{clang}[identifierstyle=\linespread{0.9}\it]
66$\it enum$-specifier:
67 enum identifier$\(_{opt}\)$ { enumerator-list }
68 enum identifier$\(_{opt}\)$ { enumerator-list , }
69 enum identifier
70enumerator-list:
71 enumerator
72 enumerator-list , enumerator
73enumerator:
74 enumeration-constant
75 enumeration-constant = constant-expression
76\end{clang}
77The terms \emph{enumeration} and \emph{enumerator} used in this work \see{\VRef{s:Terminology}} come from the grammar.
78The C enumeration semantics are discussed using examples.
79
80
81\subsubsection{Type Name}
82\label{s:TypeName}
83
84An \emph{unnamed} enumeration is used to provide aliasing \see{\VRef{s:Aliasing}} exactly like a @const@ declaration in other languages.
85However, it is restricted to integral values.
86\begin{clang}
87enum { Size = 20, Max = 10, MaxPlus10 = Max + 10, @Max10Plus1@, Fred = -7 };
88\end{clang}
89Here, the aliased constants are: 20, 10, 20, 21, and -7.
90Direct initialization is by a compile-time expression generating a constant value.
91Indirect initialization (without initializer, @Max10Plus1@) is called \newterm{auto-initialization}, where enumerators are assigned values from left to right, starting at zero or the next explicitly initialized constant, incrementing by @1@.
92Because multiple independent enumerators can be combined, enumerators with the same values can occur.
93The enumerators are rvalues, so assignment is disallowed.
94Finally, enumerators are \newterm{unscoped}, \ie enumerators declared inside of an @enum@ are visible (projected) outside into the enclosing scope of the @enum@ type.
95For unnamed enumerations, this semantic is required because there is no type name for scoped qualification.
96
97As noted, this kind of aliasing declaration is not an enumeration, even though it is declared using an @enum@ in C.
98While the semantics is misleading, this enumeration form matches with aggregate types:
99\begin{cfa}
100typedef struct @/* unnamed */@ { ... } S;
101struct @/* unnamed */@ { ... } x, y, z; $\C{// questionable}$
102struct S {
103 union @/* unnamed */@ { $\C{// unscoped fields}$
104 int i; double d ; char ch;
105 };
106};
107\end{cfa}
108Hence, C programmers would expect this enumeration form to exist in harmony with the aggregate form.
109
110A \emph{named} enumeration is an enumeration:
111\begin{clang}
112enum @Week@ { Mon, Tue, Wed, Thu@ = 10@, Fri, Sat, Sun };
113\end{clang}
114and adopts the same semantics with respect to direct and auto intialization.
115For example, @Mon@ to @Wed@ are implicitly assigned with constants @0@--@2@, @Thu@ is explicitly set to constant @10@, and @Fri@ to @Sun@ are implicitly assigned with constants @11@--@13@.
116As well, initialization may occur in any order.
117\begin{clang}
118enum Week {
119 Thu@ = 10@, Fri, Sat, Sun,
120 Mon@ = 0@, Tue, Wed@,@ $\C{// terminating comma}$
121};
122\end{clang}
123Note, the comma in the enumerator list can be a terminator or a separator, allowing the list to end with a dangling comma.\footnote{
124A terminating comma appears in other C syntax, \eg the initializer list.}
125This feature allow enumerator lines to be interchanged without moving a comma.
126Named enumerators are also unscoped.
127
128
129\subsubsection{Implementation}
130\label{s:CenumImplementation}
131
132In theory, a C enumeration \emph{variable} is an implementation-defined integral type large enough to hold all enumerator values.
133In practice, C defines @int@~\cite[\S~6.4.4.3]{C11} as the underlying type for enumeration variables, restricting initialization to integral constants, which have type @int@ (unless qualified with a size suffix).
134However, type @int@ is defined as:
135\begin{quote}
136A ``plain'' @int@ object has the natural size suggested by the architecture of the execution environment (large enough to contain any value in the range @INT_MIN@ to @INT_MAX@ as defined in the header @<limits.h>@).~\cite[\S~6.2.5(5)]{C11}
137\end{quote}
138However, @int@ means a 4 bytes on both 32/64-bit architectures, which does not seem like the ``natural'' size for a 64-bit architecture.
139Whereas, @long int@ means 4 bytes on a 32-bit and 8 bytes on 64-bit architectures, and @long long int@ means 8 bytes on both 32/64-bit architectures, where 64-bit operations are simulated on 32-bit architectures.
140\VRef[Figure]{f:gccEnumerationStorageSize} shows both @gcc@ and @clang@ partially ignore this specification and type the integral size of an enumerator based its initialization.
141Hence, initialization in the range @INT_MIN@..@INT_MAX@ results in a 4-byte enumerator, and outside this range the enumerator is 8 bytes.
142Note that @sizeof( typeof( IMin ) ) != sizeof( E )@, making the size of an enumerator different than is containing enumeration type, which seems inconsistent, \eg @sizeof( typeof( 3 ) ) == sizeof( int )@.
143
144\begin{figure}
145\begin{cfa}
146enum E { IMin = INT_MIN, IMax = INT_MAX,
147 ILMin = LONG_MIN, ILMax = LONG_MAX,
148 ILLMin = LLONG_MIN, ILLMax = LLONG_MAX };
149int main() {
150 printf( "%zd %zd\n%zd %zd\n%zd %d %d\n%zd %ld %ld\n%zd %ld %ld\n",
151 sizeof(enum E), sizeof(typeof(IMin)),
152 sizeof(int), sizeof(long int),
153 sizeof(IMin), IMin, IMax,
154 sizeof(ILMin), ILMin, ILMax,
155 sizeof(ILLMin), ILLMin, ILLMax );
156}
1578 4
1584 -2147483648 2147483647
1598 -9223372036854775808 9223372036854775807
1608 -9223372036854775808 9223372036854775807
161\end{cfa}
162\caption{\lstinline{gcc}/\lstinline{clang} Enumeration Storage Size}
163\label{f:gccEnumerationStorageSize}
164\end{figure}
165
166
167\subsubsection{Usage}
168\label{s:Usage}
169
170C proves an implicit \emph{bidirectional} conversion between an enumeration and its integral type, and between two different enumerations.
171\begin{clang}
172enum Week week = Mon; $\C{// week == 0}$
173week = Fri; $\C{// week == 11}$
174int i = Sun; $\C{// implicit conversion to int, i == 13}$
175@week = 10000;@ $\C{// UNDEFINED! implicit conversion to Week}$
176
177enum Season { Spring, Summer, Fall, Winter };
178@week = Winter;@ $\C{// UNDEFINED! implicit conversion to Week}$
179\end{clang}
180While converting an enumerator to its underlying type is useful, the implicit conversion from the base or another enumeration type to an enumeration is a common source of error.
181
182Enumerators can appear in @switch@ and looping statements.
183\begin{cfa}
184enum Week { Mon, Tue, Wed, Thu, Fri, Sat, Sun };
185switch ( week ) {
186 case Mon ... Fri: $\C{// gcc case range}$
187 printf( "weekday\n" );
188 case Sat: case Sun:
189 printf( "weekend\n" );
190}
191for ( enum Week day = Mon; day <= Sun; day += 1 ) { $\C{// step of 1}$
192 printf( "day %d\n", day ); // 0-6
193}
194\end{cfa}
195For iterating using arithmetic to make sense, the enumerator values \emph{must} have a consecutive ordering with a fixed step between values.
196For example, a previous gap introduced by @Thu = 10@, results in iterating over the values 0--13, where values 3--9 are not @Week@ values.
197Note, it is the bidirectional conversion that allows incrementing @day@: @day@ is converted to @int@, integer @1@ is added, and the result is converted back to @Week@ for the assignment to @day@.
198For safety, \CC does not support the bidirectional conversion, and hence, an unsafe cast is necessary to increment @day@: @day = (Week)(day + 1)@.
199
200There is a C idiom to automatically compute the number of enumerators in an enumeration.
201\begin{cfa}
202enum E { A, B, C, D, @N@ }; // N == 4
203for ( enum E e = A; e < @N@; e += 1 ) ...
204\end{cfa}
205Here, serendipitously the auto-incrementing counts the number of enumerators and puts the total into the last enumerator @N@.
206This @N@ is often used as the dimension for an array assocated with the enumeration.
207\begin{cfa}
208E array[@N@];
209for ( enum E e = A; e < N; e += 1 ) {
210 array[e] = e;
211}
212\end{cfa}
213However, for non-consecutive ordering and non-integral typed enumerations, \see{\VRef{f:EumeratorTyping}}, this idiom fails.
214
215This idiom is often used with another C idiom for matching companion information.
216For example, an enumeration may be linked with a companion array of printable strings.
217\begin{cfa}
218enum Integral_Type { chr, schar, uschar, sshort, ushort, sint, usint, ..., NO_OF_ITYPES };
219char * Integral_Name[@NO_OF_ITYPES@] = {
220 "char", "signed char", "unsigned char",
221 "signed short int", "unsigned short int",
222 "signed int", "unsigned int", ...
223};
224enum Integral_Type @integral_type@ = ...
225printf( "%s\n", Integral_Name[@integral_type@] ); // human readable type name
226\end{cfa}
227However, the companion idiom results in the \emph{harmonizing} problem because an update to the enumeration @Integral_Type@ often requires a corresponding update to the companion array \snake{Integral_Name}.
228The requirement to harmonize is at best indicated by a comment before the enumeration.
229This issue is exacerbated if enumeration and companion array are in different translation units.
230
231\bigskip
232While C provides a true enumeration, it is restricted, has unsafe semantics, and does not provide useful/advanced enumeration features found in other programming languages.
233
234
235\section{\CFA}
236
237\CFA in \emph{not} an object-oriented programming-language, \ie functions cannot be nested in aggregate types, and hence, there is no receive notation for calling functions, \eg @obj.method(...)@, where the first argument proceeds the call.
238The following section provide short descriptions of \CFA features mentioned further in the thesis.
239
240
241\subsection{Operator Overloading}
242
243Virtually all programming languages overload the arithmetic operators across the basic types using the number and type of parameters and returns.
244Like \CC, \CFA also allows these operators to be overloaded with user-defined types.
245The syntax for operator names uses the @'?'@ character to denote a parameter, \eg prefix and infix increment operators: @?++@, @++?@, and @?+?@.
246\begin{cfa}
247struct S { int i, j };
248S @?+?@( S op1, S op2 ) { return (S){ op1.i + op2.i, op1.j + op2.j }; }
249S s1, s2;
250s1 = s1 @+@ s2; $\C[1.75in]{// infix call}$
251s1 = @?+?@( s1, s2 ); $\C{// direct call}\CRT$
252\end{cfa}
253The type system examines each call size and selects the best matching overloaded function based on the number and types of the arguments.
254If there are intermixed operands, @2 + 3.5@, the type system attempts (safe) conversions changing the arguments to one or more of the parameter type(s).
255
256
257\subsection{Function Overloading}
258
259Both \CFA and \CC allow function names to be overloaded, as long as their prototypes differ in the number and type of parameters and returns.
260\begin{cfa}
261void f( void ); $\C[1.75in]{// (1): no parameter}$
262void f( char ); $\C{// (2): overloaded on the number and parameter type}$
263void f( int, int ); $\C{// (3): overloaded on the number and parameter type}$
264f( 'A' ); $\C{// select (2)}\CRT$
265\end{cfa}
266In this case, the name @f@ is overloaded depending on the number and parameter types.
267The type system examines each call size and selects the best matching based on the number and types of the arguments.
268Here, there is a perfect match for the call, @f( 'A' )@ with the number and parameter type of function (2).
269
270Ada, Scala, and \CFA type-systems also use the return type in resolving a call.
271\begin{cfa}
272int f( void ); $\C[1.75in]{// (4); overloaded on return type}$
273double f( void ); $\C{// (5); overloaded on return type}$
274int i = f(); $\C{// select (4)}$
275double d = f(); $\C{// select (5)}\CRT$
276\end{cfa}
277
278
279\subsection{Variable Overloading}
280Unlike almost all programming languages, \CFA has variable overloading within a scope, along with shadow overloading in nested scopes.
281\begin{cfa}
282void foo( double d );
283int v; $\C[1.75in]{// (1)}$
284double v; $\C{// (2) variable overloading}$
285foo( v ); $\C{// select (2)}$
286{
287 int v; $\C{// (3) shadow overloading}$
288 double v; $\C{// (4) and variable overloading}$
289 foo( v ); $\C{// select (4)}\CRT$
290}
291\end{cfa}
292The \CFA type system simply treats overloaded variables as an overloaded function returning a value with no parameters.
293
294
295\subsection{Constructor and Destructor}
296
297While \CFA in not object oriented, it adopts many language features commonly used in object-oriented languages;
298these features are independent from object-oriented programming.
299
300All objects in \CFA are initialized by @constructors@ \emph{after} allocation and de-initialized \emph{before} deallocation.
301\CC cannot have constructors for basic-types because they have no aggregate type \lstinline[language=C++]{struct/class} in which to insert a constructor definition.
302Like \CC, \CFA has multiple auto-generated constructors for every type.
303
304The prototype for the constructor/destructor are @void ?{}( T &, ... )@ and @void ^?{}( T &, ... )@, respectively.
305The first parameter is logically, the \lstinline[language=C++]{this} or \lstinline[language=java]{self} in other object-oriented languages, and implicitly passed.
306\begin{cfa}
307struct Employee {
308 char * name;
309 double salary;
310};
311void @?{}@( Employee & this, char * name, double salary ) {
312 this.name = aalloc( sizeof(name) );
313 strcpy( this.name, name );
314 this.salary = salary;
315}
316void @^?{}@( Employee & this ) {
317 free( this.name );
318}
319{
320 Employee name = { "Sara Schmidt", 20.5 };
321} // implicit destructor call
322\end{cfa}
323Both constructor and destructor can be explicitly called.
324\begin{cfa}
325 Employee name = { "Sara Schmidt", 20.5 };
326 ... // use name
327 ^?{}( name ); // de-initialize
328 ?{}( name, "Jack Smith", 10.5 }; // re-initialize
329 ... // use name
330\end{cfa}
331
332
333\subsection{Special Literals}
334
335The C constants @0@ and @1@ have special meaning.
336@0@ is the null pointer and used in conditional expressions, where @if ( p )@ is rewritten @if ( p != 0 )@;
337@1@ is an additive identity in unary operators @++@ and @--@.
338Aware of their significance, \CFA provides a special type @zero_t@ and @one_t@ for custom types.
339\begin{cfa}
340struct S { int i, j; };
341void ?{}( S & this, @zero_t@ ) { this.i = 0; this.j = 0; } // zero_t, no parameter name allowed
342int ?@!=@?( S this, @zero_t@ ) { return this.i != 0 && this.j != 0; }
343S s = @0@;
344if ( s @!= 0@ ) ...
345\end{cfa}
346Similarity, for @one_t@.
347\begin{cfa}
348void ?{}( S & this, @one_t@ ) { this.i = 1; this.j = 1; } // one_t, no parameter name allowed
349S & ?++( S & this, @one_t@ ) { return (S){ this.i++, this.j++ }; }
350\end{cfa}
351
352
353\subsection{Polymorphic Functions}
354
355Polymorphic functions are the functions that apply to all types.
356\CFA provides \newterm{parametric polymorphism} written with the @forall@ clause.
357\begin{cfa}
358@forall( T )@ T identity( T v ) { return v; }
359identity( 42 );
360\end{cfa}
361The @identity@ function accepts a value from any type as an argument and returns that value.
362At the call size, the type parameter @T@ is bounded to @int@ from the argument @42@.
363
364For polymorphic functions to be useful, the @forall@ clause needs \newterm{type assertion}s that restricts the polymorphic types it accepts.
365\begin{cfa}
366forall( T @| { void foo( T ); }@ ) void bar( T t ) { @foo( t );@ }
367struct S { ... } s;
368void foo( struct S );
369bar( s );
370\end{cfa}
371The assertion on @T@ restricts the range of types that can be manipulated by @bar@ to only those that have an implementation of @foo@ with the matching signature, allowing @bar@'s call to @foo@ in its body.
372
373
374\subsection{Trait}
375
376A @forall@ clause can assert many restrictions on multiple types.
377A common practice is to refactor the assertions into a named \newterm{trait}.
378\begin{cfa}
379forall(T) trait @Bird@ {
380 int days_can_fly( T );
381 void fly( T );
382};
383forall( B | @Bird@( B ) )
384void bird_fly( int days_since_born, B bird ) {
385 if ( days_since_born > days_can_fly( bird )) fly( bird );
386}
387struct Robin {} robin;
388int days_can_fly( Robin robin ) { return 23; }
389void fly( Robin robin ) {}
390bird_fly( 23, robin );
391\end{cfa}
392Grouping type assertions into a named trait effectively creates a reusable interface for parametric-polymorphic types.
393
394
395\section{Expression Resolution}
396
397Overloading poses a challenge for all expression-resolution systems.
398Multiple overloaded names give multiple candidates at a call site, and a resolver must pick a \emph{best} match, where ``best'' is defined by a series of heuristics based on safety and programmer intuition/expectation.
399When multiple best matches exist, the resolution is ambiguous.
400
401The \CFA resolver attempts to identity a best candidate based on: first, the number of parameters and types, and second, when no exact match exists, the fewest implicit conversions and polymorphic variables.
402Finding an exact match is not discussed here, because the mechanism is fairly straightforward, even when the search space is large;
403only finding a non-exact match is discussed in detail.
404
405
406\subsection{Conversion Cost}
407\label{s:ConversionCost}
408
409Most programming languages perform some implicit conversions among basic types to facilitate mixed-mode arithmetic;
410otherwise, the program becomes littered with many explicit casts, which is not match programmer expectation.
411C is an aggressive language as it provides conversions among almost all of the basic types, even when the conversion is potentially unsafe or not meaningful, \ie @float@ to @bool@.
412C defines the resolution pattern as ``usual arithmetic conversion''~\cite[\S~6.3.1.8]{C11}, in which C looks for a \newterm{common type} between operands, and converts one or both operands to the common type.
413Loosely defined, a common type is a the smallest type in terms of size of representation that both operands can be converted into without losing their precision, called a \newterm{widening} or \newterm{safe conversion}.
414
415\CFA generalizes ``usual arithmetic conversion'' to \newterm{conversion cost}.
416In the first design by Bilson~\cite{Bilson03}, conversion cost is a 3-tuple, @(unsafe, poly, safe)@ applied between each argument/parameter type, where:
417\begin{enumerate}
418\item
419@unsafe@ is the number of precision losing (\newterm{narrowing} conversions),
420\item
421@poly@ is the number of polymorphic function parameters, and
422\item
423@safe@ is sum of the degree of safe (widening) conversions.
424\end{enumerate}
425Sum of degree is a method to quantify C's integer and floating-point rank.
426Every pair of widening conversion types is assigned a \newterm{distance}, and distance between the two same type is 0.
427For example, the distance from @char@ to @int@ is 2, distance from @int@ to @long@ is 1, and distance from @int@ to @long long int@ is 2.
428This distance does not mirror C's rank system.
429For example, the rank of @char@ and @signed char@ are the same in C, but the distance from @char@ to @signed char@ is assigned 1.
430@safe@ cost is summing all pairs of argument to parameter safe conversion distances.
431Among the three costs in Bilson's model, @unsafe@ is the most significant cost and @safe@ is the least significant, with an implication that \CFA always choose a candidate with the lowest @unsafe@, if possible.
432
433For example, assume the overloaded function @foo@ is called with two @int@ parameter.
434The cost for every overloaded @foo@ has been list along:
435\begin{cfa}
436void foo( char, char ); $\C[2.5in]{// (1) (2, 0, 0)}$
437void foo( char, int ); $\C{// (2) (1, 0, 0)}$
438forall( T, V ) void foo( T, V ); $\C{// (3) (0, 2, 0)}$
439forall( T ) void foo( T, T ); $\C{// (4) (0, 2, 0)}$
440forall( T ) void foo( T, int ); $\C{// (5) (0, 1, 0)}$
441void foo( long long, long ); $\C{// (6) (0, 0, 3)}$
442void foo( long, long ); $\C{// (7) (0, 0, 2)}$
443void foo( int, long ); $\C{// (8) (0, 0, 1)}$
444int i, j;
445foo( i, j ); $\C{// convert j to long and call (8)}\CRT$
446\end{cfa}
447The overloaded instances are ordered from the highest to the lowest cost, and \CFA select the last candidate (8).
448
449In the next iteration of \CFA, Schluntz and Aaron~\cite{Moss18} expanded conversion cost to a 7-tuple with 4 additional categories, @(unsafe, poly, safe, sign, vars, specialization, reference)@, with the following interpretations:
450\begin{enumerate}
451\item @unsafe@ from Bilson
452\item @poly@
453\item @safe@
454\item @sign@ is the number of sign/unsigned variable conversions
455\item @vars@ is the number of polymorphic type variables
456\item @specialization@ is a negative value of the number of type assertions
457\item @reference@ is the number of reference-to-rvalue conversions
458\end{enumerate}
459The extended conversion-cost model looks for candidates that are more specific and less generic.
460@vars@ disambiguates @forall( T, V ) foo( T, V )@ and @forall( T ) void foo( T, T )@, where the extra type parameter @V@ makes is more generic.
461A more generic type means less constraints on its parameter types.
462\CFA favours candidates with more restrictions on polymorphism, so @forall( T ) void foo( T, T )@ has lower cost.
463@specialization@ is an arbitrary count-down value starting at zero.
464For every type assertion in @forall@ clause (no assertions in the above example), \CFA subtracts one from @specialization@.
465More type assertions means more constraints on argument types, making the function less generic.
466
467\CFA defines two special cost values: @zero@ and @infinite@.
468A conversion cost is @zero@ when argument and parameter has an exact match, and a conversion cost is @infinite@ when there is no defined conversion between two types.
469For example, the conversion cost from @int@ to a @struct S@ is @infinite@.
Note: See TracBrowser for help on using the repository browser.