source: doc/theses/jiada_liang_MMath/intro.tex @ a35e342

Last change on this file since a35e342 was a35e342, checked in by JiadaL <j82liang@…>, 3 weeks ago

update the discussion of unit type

  • Property mode set to 100644
File size: 21.7 KB
Line 
1\chapter{Introduction}
2
3All basic types in a programming language have a set of constants (symbols), and these constants represent computable values, \eg integer types have constants @-1@, @17@, @0xff@ representing whole numbers, floating-point types have constants @5.3@, @2.3E-5@, @0xff.ffp0@ representing real numbers, character types have constants @'a'@, @"abc\n"@, \mbox{\lstinline{u8"}\texttt{\guillemotleft{na\"{i}ve}\guillemotright}\lstinline{"}} representing (human readable) text, \etc.
4Constants can be overloaded among types, \eg @0@ is a null pointer for all pointer types, and the value zero for integer and floating-point types.
5(In \CFA, the constants @0@ and @1@ can be overloaded for any type.)
6Higher-level types compose constants from the basic constants.
7\begin{cfa}
8struct S { int i, j, k; } s;
9s = (S){ 1, 2, 3 };                             $\C[2in]{// structure constant}$
10int x[5] = { 1, 2, 3, 4, 5 };   $\C{// array constant}\CRT$
11\end{cfa}
12A constant's symbolic name is dictated by language syntax related to types, \eg @5.@ (double), @5.0f@ (float), @5l@ (long double).
13In general, the representation of a constant's value is \newterm{opaque}, so the internal representation can be chosen arbitrarily, \eg two's complement, IEEE floating-point.
14In theory, there is an infinite set of constant names per type representing an infinite set of values.
15
16It is common in mathematics, engineering, and computer science to alias new constants to existing constants so they have the same value, \eg $\pi$, $\tau$ (2$\pi$), $\phi$ (golden ratio), K(k), M, G, T for powers of 2\footnote{Overloaded with SI powers of 10.} often prefixing bits (b) or bytes (B), \eg Gb, MB, and in general situations, \eg specific times (noon, New Years), cities (Big Apple), flowers (Lily), \etc.
17An alias can bind to another alias, which transitively binds it to the specified constant.
18Multiple aliases can represent the same value, \eg eighth note and quaver, giving synonyms.
19
20Many programming languages capture this important software-engineering capability through a mechanism called \newterm{constant} or \newterm{literal} naming, where a new constant is aliased to an existing constant.
21Its purpose is for readability: replacing constant values in a program with symbolic names that are more meaningful to programmers in the context of the application.
22Thereafter, associating a name to a different value automatically distributes this rebinding, preventing errors.
23Because an aliased name is a constant, it cannot appear in a mutable context, \eg \mbox{$\pi$ \lstinline{= 42}} is meaningless, and a constant has no address, \ie it is an \newterm{rvalue}\footnote{
24The term rvalue defines an expression that can only appear on the right-hand side of an assignment expression.}.
25In theory, there is an infinite set of possible aliasing; in practice, the number of aliasing per program is finite and small.
26
27Aliased constants can form an (ordered) set, \eg days of a week, months of a year, floors of a building (basement, ground, 1st), colours in a rainbow, \etc.
28In this case, the binding between a constant name and value can be implicit, where the values are chosen to support any set operations.
29Many programming languages capture the aliasing and ordering through a mechanism called an \newterm{enumeration}.
30\begin{quote}
31enumerate (verb, transitive).
32To count, ascertain the number of;
33more usually, to mention (a number of things or persons) separately, as if for the purpose of counting;
34to specify as in a list or catalogue.~\cite{OEDenumerate}
35\end{quote}
36Within an enumeration set, the enumeration names (aliases) must be unique, and instances of an enumerated type are \emph{often} restricted to hold only these names.
37
38It is possible to enumerate among set names without having an ordering among the set values.
39For example, the week, the weekdays, the weekend, and every second day of the week.
40\begin{cfa}[morekeywords={in}]
41for ( cursor in Mon, Tue, Wed, Thu, Fri, Sat, Sun } ... $\C[3.75in]{// week}$
42for ( cursor in Mon, Tue, Wed, Thu, Fri } ...   $\C{// weekday}$
43for ( cursor in Sat, Sun } ...                                  $\C{// weekend}$
44for ( cursor in Mon, Wed, Fri, Sun } ...                $\C{// every second day of week}\CRT$
45\end{cfa}
46A set can have a partial or total ordering, making it possible to compare set elements, \eg Monday is before Tuesday and Tuesday is after.
47Ordering allows iterating among the enumeration set using relational operators and advancement, \eg:
48\begin{cfa}
49for ( cursor = Monday; cursor @<=@ Friday; cursor = @succ@( cursor ) ) ... // weekdays
50\end{cfa}
51Here the values for the set names are logically \emph{generated} rather than listing a subset of names.
52
53Hence, the fundamental aspects of an enumeration are:
54\begin{enumerate}
55\item
56\begin{sloppypar}
57It provides a finite set of new constants, which are implicitly or explicitly assigned values that must be appropriate for any set operations, \eg increasing order.
58This aspect differentiates an enumeration from general types with an infinite set of constants.
59\end{sloppypar}
60\item
61The alias names are constants, which follow transitively from their binding to other constants.
62\item
63Defines a type for generating instances (variables).
64\item
65For safety, an enumeration instance should be restricted to hold only its constant names.
66\item
67There is a mechanism for \emph{enumerating} over the enumeration names, where the ordering can be implicit from the type, explicitly listed, or generated arithmetically.
68\end{enumerate}
69
70
71\section{Terminology}
72\label{s:Terminology}
73
74The term \newterm{enumeration} defines a type with a set of new constants, and the term \newterm{enumerator} represents an arbitrary alias name \see{\VRef{s:CEnumeration} for the name derivations}.
75An enumerated type can have three fundamental properties, \newterm{label} (name), \newterm{order} (position), and \newterm{value} (payload).
76\begin{cquote}
77\sf\setlength{\tabcolsep}{3pt}
78\begin{tabular}{rcccccccr}
79\it\color{red}enumeration       & \multicolumn{8}{c}{\it\color{red}enumerators} \\
80$\downarrow$\hspace*{15pt}      & \multicolumn{8}{c}{$\downarrow$}                              \\
81@enum@ Week \{                          & Mon,  & Tue,  & Wed,  & Thu,  & Fri,  & Sat,  & Sun {\color{red}= 42} & \};   \\
82\it\color{red}label                     & Mon   & Tue   & Wed   & Thu   & Fri   & Sat   & Sun           &               \\
83\it\color{red}order                     & 0             & 1             & 2             & 3             & 4             & 5             & 6                     &               \\
84\it\color{red}value                     & 0             & 1             & 2             & 3             & 4             & 5             & {\color{red}42}               &
85\end{tabular}
86\end{cquote}
87Here, the enumeration @Week@ defines the enumerator constants @Mon@, @Tue@, @Wed@, @Thu@, @Fri@, @Sat@, and @Sun@.
88The implicit ordering implies the successor of @Tue@ is @Mon@ and the predecessor of @Tue@ is @Wed@, independent of any associated enumerator values.
89The value is the implicitly/explicitly assigned constant to support any enumeration operations;
90the value may be hidden (opaque) or visible.
91
92Specifying complex ordering is possible:
93\begin{cfa}
94enum E1 { $\color{red}[\(_1\)$ {A, B}, $\color{blue}[\(_2\)$ C $\color{red}]\(_1\)$, {D, E} $\color{blue}]\(_2\)$ }; $\C{// overlapping square brackets}$
95enum E2 { {A, {B, C} }, { {D, E}, F };  $\C{// nesting}$
96\end{cfa}
97For @E1@, there is the partial ordering among @A@, @B@ and @C@, and @C@, @D@ and @E@, but not among @A@, @B@ and @D@, @E@.
98For @E2@, there is the total ordering @A@ $<$ @{B, C}@ $<$ @{D, E}@ $<$ @F@.
99Only flat total-ordering among enumerators is considered in this work.
100
101
102\section{Motivation}
103
104Many programming languages provide an enumeration-like mechanism, which may or may not cover the previous five fundamental enumeration aspects.
105Hence, the term \emph{enumeration} can be confusing and misunderstood.
106Furthermore, some languages conjoin the enumeration with other type features, making it difficult to tease apart which feature is being used.
107This section discusses some language features that are sometimes called an enumeration but do not provide all enumeration aspects.
108
109
110\subsection{Aliasing}
111\label{s:Aliasing}
112
113Some languages provide simple aliasing (renaming).
114\begin{cfa}
115const Size = 20, Pi = 3.14159, Name = "Jane";
116\end{cfa}
117The alias name is logically replaced in the program text by its matching constant.
118It is possible to compare aliases, if the constants allow it, \eg @Size < Pi@, whereas @Pi < Name@ might be disallowed depending on the language.
119
120Aliasing is \emph{not} macro substitution, \eg @#define Size 20@, where a name is replaced by its value \emph{before} compilation, so the name is invisible to the programming language.
121With aliasing, each new name is part of the language, and hence, participates fully, such as name overloading in the type system.
122Aliasing is not an immutable variable.
123\begin{cfa}
124extern @const@ int Size = 20;
125extern void foo( @const@ int @&@ size );
126foo( Size ); // take the address of (reference) Size
127\end{cfa}
128Taking the address of an immutable variable makes it an \newterm{lvalue}, which implies it has storage.
129With separate compilation, it is necessary to choose one translation unit to perform the initialization.
130If aliasing requires storage, its address and initialization are opaque (compiler only), similar to \CC rvalue reference @&&@.
131
132Aliasing does provide readability and automatic resubstitution.
133It also provides simple enumeration properties, but with effort.
134\begin{cfa}
135const Mon = 1, Tue = 2, Wed = 3, Thu = 4, Fri = 5, Sat = 6, Sun = 7;
136\end{cfa}
137Any reordering of the enumerators requires manual renumbering.
138\begin{cfa}
139const @Sun = 1@, Mon = 2, Tue = 3, Wed = 4, Thu = 5, Fri = 6, Sat = 7;
140\end{cfa}
141For these reasons, aliasing is sometimes called an enumeration.
142However, there is no type to create a type-checked instance or iterator cursor, so there is no ability to enumerate.
143Hence, there are multiple enumeration aspects not provided by aliasing, justifying a separate enumeration type in a programming language.
144
145
146\subsection{Algebraic Data Type}
147\label{s:AlgebraicDataType}
148
149An algebraic data type (ADT)\footnote{ADT is overloaded with abstract data type.} is another language feature often linked with enumeration, where an ADT conjoins an arbitrary type, possibly a \lstinline[language=C++]{class} or @union@, and a named constructor.
150For example, in Haskell:
151\begin{haskell}
152data S = S { i::Int, d::Double }                $\C{// structure}$
153data @Foo@ = A Int | B Double | C S             $\C{// ADT, composed of three types}$
154foo = A 3;                                                              $\C{// type Foo is inferred}$
155bar = B 3.5
156baz = C S{ i = 7, d = 7.5 }
157\end{haskell}
158the ADT has three variants (constructors), @A@, @B@, @C@, with associated types @Int@, @Double@, and @S@.
159The constructors create an initialized value of the specific type that is bound to the immutable variables @foo@, @bar@, and @baz@.
160Hence, the ADT @Foo@ is like a union containing values of the associated types, and a constructor name is used to initialize and access the value using dynamic pattern-matching.
161\begin{cquote}
162\setlength{\tabcolsep}{20pt}
163\begin{tabular}{@{}ll@{}}
164\begin{haskell}
165prtfoo val = -- function
166        -- pattern match on constructor
167        case val of
168          @A@ a -> print a
169          @B@ b -> print b
170          @C@ (S i d) -> do
171                print i
172                print d
173\end{haskell}
174&
175\begin{haskell}
176main = do
177        prtfoo foo
178        prtfoo bar
179        prtfoo baz
1803
1813.5
1827
1837.5
184\end{haskell}
185\end{tabular}
186\end{cquote}
187For safety, most languages require all associated types to be listed or a default case with no field accesses.
188
189A less frequent case is multiple constructors with the same type.
190\begin{haskell}
191data Bar = X Int | Y Int | Z Int;
192foo = X 3;
193bar = Y 3;
194baz = Z 5;
195\end{haskell}
196Here, the constructor name gives different meanings to the values in the common \lstinline[language=Haskell]{Int} type, \eg the value @3@ has different interpretations depending on the constructor name in the pattern matching.
197
198Note, the term \newterm{variant} is often associated with ADTs.
199However, there are multiple languages with a @variant@ type that is not an ADT \see{Algol68~\cite{Algol68} or \CC \lstinline{variant}}.
200Here, the type (and possibly the position for equivalent types) is used to discriminate the specific \emph{variant} within the variant instance.
201For example, \VRef[Figure]{f:C++variant} shows the \CC equivalent of the two Haskell ADT types using variant types.
202In these languages, the variant cannot be used to simulate an enumeration.
203Hence, in this work the term variant is not a synonym for ADT.
204
205\begin{figure}
206\begin{c++}
207struct S { char s[32]; };
208variant< int, double, S > vd;
209variant< int, int, int > vs;
210
211// discrimination based on type
212vd = 3;
213if ( holds_alternative<int>(vd) ) cout << "int " << get<int>(vd ) << endl;
214vd = 3.5;
215if ( holds_alternative<double>(vd) ) cout << "double " << get<double>(vd) << endl;
216vd = (S){ "abc" };
217if ( holds_alternative<S>(vd) ) cout << "S.s " << get<S>(vd).s << endl;
218
219// discrimination based on type and position within type
220vs = (variant<int,int,int>){ in_place_index<0>, 12 };
221if ( vs.index() == 0 ) cout << "posn 0 " << get<0>(vs) << endl;
222vs = (variant<int,int,int>){ in_place_index<1>, 4 };
223if ( vs.index() == 1 ) cout << "posn 1 " << get<1>(vs) << endl;
224vs = (variant<int,int,int>){ in_place_index<2>, 5 };
225if ( vs.index() == 2 ) cout << "posn 2 " << get<2>(vs) << endl;
226\end{c++}
227\caption{\CC \lstinline[language=C++]{variant} Discrimination Using RTTI/Position}
228\label{f:C++variant}
229\end{figure}
230
231% https://downloads.haskell.org/ghc/latest/docs/libraries/base-4.19.1.0-179c/GHC-Enum.html
232% https://hackage.haskell.org/package/base-4.19.1.0/docs/GHC-Enum.html
233
234% The association between ADT and enumeration occurs if all the constructors have a unit (empty) type, \eg @struct unit {}@.
235% Note, the unit type is not the same as \lstinline{void}.
236In terms of functional programming linguistics, enumerations often refers to a @unit type@ ADT, where @unit type@ is a type
237that carry no information. The unit type is different than @void@ in C, because @void@ is type has no value, i.e., it is a empty set.
238In constract, unit type has exactly one value, often called a @nil@ value.
239Because of this distinction, it is not possbile to have a variable to have type @void@ or to be assigned with a value @void@.
240In practice, @void@ in C is more like an annotation that nothing is expected in this place. A function takes @void@ as parameter
241is essentially a function that expects no parameter. A function that return @void@ cannot be used as a parameter of a function that expects no
242parameter. Therefore, the following code is illegal in C:
243\begin{cfa}
244void foo( void );
245foo( @foo()@ );         $\C{// void argument does not match with void parameter}$
246\end{cfa}
247
248This is a more notably issue when to use @variant@ to simulate @ADT@: @void@ cannot be used as an empty variant. To solve this problem,
249\CC introduced @monstate@, a type that can be instantiated as a value, but holds no information. There is no standard representation of
250@unit@ in C, and it is often approximated by a user-defined type that has no field, for example:
251\begin{cfa}
252struct Empty {} e;              $\C{// empty type}$
253Empty bar( Empty );
254bar(@bar(e)@);
255\end{cfa}
256@Empty@ is a close approximation to @unit@ if and only if @Empty@ is the only representation of @unit@ in the program. If a program has
257a second type, say @Nothing@, that also tries to resemble @unit@, then @unit@ concepts falls apart.
258
259In the Haskell ADT:
260\begin{haskell}
261data Week = Mon | Tue | Wed | Thu | Fri | Sat | Sun deriving(Enum, Eq, Show)
262\end{haskell}
263the default type for each constructor is the unit type, and deriving from @Enum@ enforces no other associated types. The @Eq@ allows equality comparison, and @Show@ is for printing.
264The nullary constructors for the unit types are numbered left-to-right from $0$ to @maxBound@$- 1$, and provide enumerating operations @succ@, @pred@, @enumFrom@, @enumFromTo@.
265\VRef[Figure]{f:HaskellEnumeration} shows enumeration comparison and iterating (enumerating).
266
267\begin{figure}
268\begin{cquote}
269\setlength{\tabcolsep}{40pt}
270\begin{tabular}{@{}ll@{}}
271\begin{haskell}
272day = Tue
273main = do
274        if day == Tue then
275                print day
276        else
277                putStr "not Tue"
278        print (enumFrom Mon)            $\C[2.25in]{-- week}$
279        print (enumFromTo Mon Fri)      $\C{-- weekday}$
280        print (enumFromTo Sat Sun)      $\C{-- weekend}\CRT$
281\end{haskell}
282&
283\begin{haskell}
284Tue
285[Mon,Tue,Wed,Thu,Fri,Sat,Sun]
286[Mon,Tue,Wed,Thu,Fri]
287[Sat,Sun]
288
289
290
291
292
293\end{haskell}
294\end{tabular}
295\end{cquote}
296\caption{Haskell Enumeration}
297\label{f:HaskellEnumeration}
298\end{figure}
299
300The key observation is the dichotomy between an ADT and enumeration: the ADT uses the associated type resulting in a union-like data structure, and the enumeration does not use the associated type, and hence, is not a union.
301In contrast, an enumeration may be constructed using the ADT mechanism, but it is so restricted it is not an ADT.
302Furthermore, a general ADT cannot be an enumeration because the constructors generate different values making enumerating meaningless.
303While functional programming languages regularly repurpose the ADT type into an enumeration type, this process seems contrived and confusing.
304Hence, there is only a weak equivalence between an enumeration and ADT, justifying a separate enumeration type in a programming language.
305
306
307\section{Contributions}
308
309The goal of this work is to to extend the simple and unsafe enumeration type in the C programming-language into a complex and safe enumeration type in the \CFA programming-language, while maintaining backwards compatibility with C.
310On the surface, enumerations seem like a simple type.
311However, when extended with advanced features, enumerations become complex for both the type system and the runtime implementation.
312
313The contributions of this work are:
314\begin{enumerate}
315\item
316safety: Define a safe enumeration conversion scheme, both for C and \CFA, and replace ad-hoc C idioms with safer software-engineering approaches.
317\item
318overloading: Provide a pattern to overload functions, literals, and variables for polymorphic enumerations using the \CFA type system.
319\item
320scoping: Add a namespace for enumerations and qualified access into the namespace to deal with the naming problem.
321\item
322generalization: Support all language types for enumerators with associated values providing enumeration constants for any type.
323\item
324reuse: Implement subset and containment inheritance for enumerations.
325\item
326control flow: Extend control-flow structures making it safer and easier to enumerate over an enumeration.
327\item
328I/O: Provide input and output of enumerations based on enumerator names.
329\end{enumerate}
330
331
332\begin{comment}
333Date: Wed, 1 May 2024 13:41:58 -0400
334Subject: Re: Enumeration
335To: "Peter A. Buhr" <pabuhr@uwaterloo.ca>
336From: Gregor Richards <gregor.richards@uwaterloo.ca>
337
338I think I have only one comment and one philosophical quibble to make:
339
340Comment: I really can't agree with putting MB in the same category as the
341others. MB is both a quantity and a unit, and the suggestion that MB *is* one
342million evokes the rather disgusting comparison 1MB = 1000km.  Unit types are
343not in the scope of this work.
344
345Philosophical quibble: Pi *is* 3.14159...etc. Monday is not 0; associating
346Monday with 0 is just a consequence of the language. The way this is written
347suggests that the intentional part is subordinate to the implementation detail,
348which seems backwards to me. Calling the number "primary" and the name
349"secondary" feels like you're looking out from inside of the compiler, instead
350of looking at the language from the outside. And, calling secondary values
351without visible primary values "opaque"-which yes, I realize is my own term
352;)-suggests that you insist that the primary value is a part of the design, or
353at least mental model, of the program. Although as a practical matter there is
354some system value associated with the constructor/tag of an ADT, that value is
355not part of the mental model, and so calling it "primary" and calling the name
356"secondary" and "opaque" seems either (a) very odd or (b) very C-biased. Or
357both.
358
359With valediction,
360  - Gregor Richards
361
362
363Date: Thu, 30 May 2024 23:15:23 -0400
364Subject: Re: Meaning?
365To: "Peter A. Buhr" <pabuhr@uwaterloo.ca>
366CC: <ajbeach@uwaterloo.ca>, <j82liang@uwaterloo.ca>
367From: Gregor Richards <gregor.richards@uwaterloo.ca>
368
369I have to disagree with this being agreeing to disagree, since we agree
370here. My core point was that it doesn't matter whether you enumerate over the
371names or the values. This is a distinction without a difference in any case
372that matters. If any of the various ways of looking at it are actually
373different from each other, then that's because the enumeration has failed to be
374an enumeration in some other way, not because of the actual process of
375enumeration. Your flag enum is a 1-to-1 map of names and values, so whether you
376walk through names or walk through values is not an actual distinction. It
377could be distinct in the *order* that it walks through, but that doesn't
378actually matter, it's just a choice that has to be made. Walking through entire
379range of machine values, including ones that aren't part of the enumeration,
380would be bizarre in any case.
381
382Writing these out has crystallized some thoughts, albeit perhaps not in a way
383that's any help to y'all. An enumeration is a set of names; ideally an ordered
384set of names. The state of enumerations in programming languages muddies things
385because they often expose the machine value underlying those names, resulting
386in a possibly ordered set of names and a definitely ordered set of values. And,
387muddying things further, because those underlying values are exposed, enums are
388used in ways that *depend* on the underlying values being exposed, making that
389a part of the definition. But, an enumeration is conceptually just *one* set,
390not both. So much of the difficulty is that you're trying to find a way to make
391a concept that should be a single set agree with an implementation that's two
392sets. If those sets have a 1-to-1 mapping, then who cares, they're just
393aliases. It's the possibility of the map being surjective (having multiple
394names for the same underlying values) that breaks everything. Personally, I
395think that an enum with aliases isn't an enumeration anyway, so who cares about
396the rest; if you're not wearing the gourd as a shoe, then it's not an
397enumeration.
398
399With valediction,
400  - Gregor Richards
401\end{comment}
Note: See TracBrowser for help on using the repository browser.