Context Navigation

source: doc/theses/jiada_liang_MMath/intro.tex @ dcfcf368

Last change on this file since dcfcf368 was dcfcf368, checked in by Peter A. Buhr <pabuhr@…>, 5 weeks ago
final proofread of thesis
Property mode set to `100644`
File size: 22.0 KB

Line
1	\chapter{Introduction}
2
3	All basic types in a programming language have a set of constants (symbols), and these constants represent computable values, \eg integer types have constants @-1@, @17@, @0xff@ representing whole numbers, floating-point types have constants @5.3@, @2.3E-5@, @0xff.ffp0@ representing real numbers, character types have constants @'a'@, @"abc\n"@, \mbox{\lstinline{u8"}\texttt{\guillemotleft{na\"{i}ve}\guillemotright}\lstinline{"}} representing (human readable) text, \etc.
4	Constants can be overloaded among types, \eg @0@ is a null pointer for all pointer types, and the value zero for integer and floating-point types.
5	(In \CFA, the constants @0@ and @1@ can be overloaded for any type.)
6	Higher-level types compose constants from the basic constants.
7	\begin{cfa}
8	struct S { int i, j, k; } s;
9	s = (S){ 1, 2, 3 }; $\C[2in]{// structure constant}$
10	int x[5] = { 1, 2, 3, 4, 5 }; $\C{// array constant}\CRT$
11	\end{cfa}
12	A constant's symbolic name is dictated by language syntax related to types, \eg @5.@ (double), @5.0f@ (float), @5l@ (long double).
13	In general, the representation of a constant's value is \newterm{opaque}, so the internal representation can be chosen arbitrarily, \eg two's complement, IEEE floating-point.
14	In theory, there is an infinite set of constant names per type representing an infinite set of values.
15
16	It is common in mathematics, engineering, and computer science to alias new constants to existing constants so they have the same value, \eg $\pi$, $\tau$ (2$\pi$), $\phi$ (golden ratio), K(k), M, G, T for powers of 2\footnote{Overloaded with SI powers of 10.} often prefixing bits (b) or bytes (B), \eg Gb, MB, and in general situations, \eg specific times (noon, New Years), cities (Big Apple), flowers (Lily), \etc.
17	An alias can bind to another alias, which transitively binds it to the specified constant.
18	Multiple aliases can represent the same value, \eg eighth note and quaver, giving synonyms.
19
20	Many programming languages capture this important software-engineering capability through a mechanism called \newterm{constant} or \newterm{literal} naming, where a new constant is aliased to an existing constant.
21	Its purpose is for readability: replacing constant values in a program with symbolic names that are more meaningful to programmers in the context of the application.
22	Thereafter, associating a name to a different value automatically distributes this rebinding, preventing errors.
23	Because an aliased name is a constant, it cannot appear in a mutable context, \eg \mbox{$\pi$ \lstinline{= 42}} is meaningless, and a constant has no address, \ie it is an \newterm{rvalue}\footnote{
24	The term rvalue defines an expression that can only appear on the right-hand side of an assignment expression.}.
25	In theory, there is an infinite set of possible aliasing; in practice, the number of aliasing per program is finite and small.
26
27	Aliased constants can form an (ordered) set, \eg days of a week, months of a year, floors of a building (basement, ground, 1st), colours in a rainbow, \etc.
28	In this case, the binding between a constant name and value can be implicit, where the values are chosen to support any set operations.
29	Many programming languages capture the aliasing and ordering through a mechanism called an \newterm{enumeration}.
30	\begin{quote}
31	enumerate (verb, transitive).
32	To count, ascertain the number of;
33	more usually, to mention (a number of things or persons) separately, as if for the purpose of counting;
34	to specify as in a list or catalogue.~\cite{OEDenumerate}
35	\end{quote}
36	Within an enumeration set, the enumeration names (aliases) must be unique, and instances of an enumerated type are \emph{often} restricted to hold only these names.
37
38	It is possible to enumerate among set names without having an ordering among the set values.
39	For example, the week, the weekdays, the weekend, and every second day of the week.
40	\begin{cfa}[morekeywords={in}]
41	for ( cursor in Mon, Tue, Wed, Thu, Fri, Sat, Sun ) ... $\C[3.75in]{// week}$
42	for ( cursor in Mon, Tue, Wed, Thu, Fri ) ... $\C{// weekday}$
43	for ( cursor in Sat, Sun ) ... $\C{// weekend}$
44	for ( cursor in Mon, Wed, Fri, Sun ) ... $\C{// every second day of week}\CRT$
45	\end{cfa}
46	A set can have a partial or total ordering, making it possible to compare set elements, \eg Monday is before Tuesday and Tuesday is after.
47	Ordering allows iterating among the enumeration set using relational operators and advancement, \eg:
48	\begin{cfa}
49	for ( cursor = Monday; cursor @<=@ Friday; cursor = @succ@( cursor ) ) ... // weekdays
50	\end{cfa}
51	Here the values for the set names are logically \emph{generated} rather than listing a subset of names.
52
53	Hence, the fundamental aspects of an enumeration are:
54	\begin{enumerate}
55	\item
56	\begin{sloppypar}
57	It provides a finite set of new constants, which are implicitly or explicitly assigned values that must be appropriate for any set operations, \eg increasing order.
58	This aspect differentiates an enumeration from general types with an infinite set of constants.
59	\end{sloppypar}
60	\item
61	The alias names are constants, which follow transitively from their binding to other constants.
62	\item
63	Defines a type for generating instances (variables).
64	\item
65	For safety, an enumeration instance should be restricted to hold only its constant names.
66	\item
67	There is a mechanism for \emph{enumerating} over the enumeration names, where the ordering can be implicit from the type, explicitly listed, or generated arithmetically.
68	\end{enumerate}
69
70
71	\section{Terminology}
72	\label{s:Terminology}
73
74	The term \newterm{enumeration} defines a type with a set of new constants, and the term \newterm{enumerator} represents an arbitrary alias name \see{\VRef{s:CEnumeration} for the name derivations}.
75	An enumerated type can have the following properties: \newterm{label} (name), \newterm{order} (position), and \newterm{value} (payload).
76	\begin{cquote}
77	\sf\setlength{\tabcolsep}{3pt}
78	\begin{tabular}{rcccccccr}
79	\it\color{red}enumeration & \multicolumn{8}{c}{\it\color{red}enumerators} \\
80	$\downarrow$\hspace*{15pt} & \multicolumn{8}{c}{$\downarrow$} \\
81	@enum@ Week \{ & Mon, & Tue, & Wed, & Thu, & Fri, & Sat, & Sun {\color{red}= 42} & \}; \\
82	\it\color{red}label & Mon & Tue & Wed & Thu & Fri & Sat & Sun & \\
83	\it\color{red}order & 0 & 1 & 2 & 3 & 4 & 5 & 6 & \\
84	\it\color{red}value & 0 & 1 & 2 & 3 & 4 & 5 & {\color{red}42} &
85	\end{tabular}
86	\end{cquote}
87	Here, the enumeration @Week@ defines the enumerator constants @Mon@, @Tue@, @Wed@, @Thu@, @Fri@, @Sat@, and @Sun@.
88	The implicit ordering implies the successor of @Tue@ is @Mon@ and the predecessor of @Tue@ is @Wed@, independent of any associated enumerator values.
89	The value is the implicitly/explicitly assigned constant to support any enumeration operations;
90	the value may be hidden (opaque) or visible.
91
92	Specifying complex ordering is possible:
93	\begin{cfa}
94	enum E1 { $\color{red}[\(_1\)$ {A, B}, $\color{blue}[\(_2\)$ C $\color{red}]\(_1\)$, {D, E} $\color{blue}]\(_2\)$ }; $\C{// overlapping square brackets}$
95	enum E2 { {A, {B, C} }, { {D, E}, F }; $\C{// nesting}$
96	\end{cfa}
97	For @E1@, there is the partial ordering among @A@, @B@ and @C@, and @C@, @D@ and @E@, but not among @A@, @B@ and @D@, @E@.
98	For @E2@, there is the total ordering @A@ $<$ @{B, C}@ $<$ @{D, E}@ $<$ @F@.
99	Only flat total-ordering among enumerators is considered in this work.
100
101
102	\section{Motivation}
103
104	Many programming languages provide an enumeration-like mechanism, which may or may not cover the previous five fundamental enumeration aspects.
105	Hence, the term \emph{enumeration} can be confusing and misunderstood.
106	Furthermore, some languages conjoin the enumeration with other type features, making it difficult to tease apart which feature is being used.
107	This section discusses some language features that are sometimes called an enumeration but do not provide all enumeration aspects.
108
109
110	\subsection{Aliasing}
111	\label{s:Aliasing}
112
113	Some languages provide simple aliasing (renaming).
114	\begin{cfa}
115	const Size = 20, Pi = 3.14159, Name = "Jane";
116	\end{cfa}
117	The alias name is logically replaced in the program text by its matching constant.
118	It is possible to compare aliases, if the constants allow it, \eg @Size < Pi@, whereas @Pi < Name@ might be disallowed depending on the language.
119
120	Aliasing is \emph{not} macro substitution, \eg @#define Size 20@, where a name is replaced by its value \emph{before} compilation, so the name is invisible to the programming language.
121	With aliasing, each new name is part of the language, and hence, participates fully, such as name overloading in the type system.
122	Aliasing is not an immutable variable.
123	\begin{cfa}
124	extern @const@ int Size = 20;
125	extern void foo( @const@ int @&@ size );
126	foo( Size ); // take the address of (reference) Size
127	\end{cfa}
128	Taking the address of an immutable variable makes it an \newterm{lvalue}, which implies it has storage.
129	With separate compilation, it is necessary to choose one translation unit to perform the initialization.
130	If aliasing requires storage, its address and initialization are opaque (compiler only), similar to \CC rvalue reference @&&@.
131
132	Aliasing does provide readability and automatic resubstitution.
133	It also provides simple enumeration properties, but with effort.
134	\begin{cfa}
135	const Mon = 1, Tue = 2, Wed = 3, Thu = 4, Fri = 5, Sat = 6, Sun = 7;
136	\end{cfa}
137	Any reordering of the enumerators requires manual renumbering.
138	\begin{cfa}
139	const @Sun = 1@, Mon = 2, Tue = 3, Wed = 4, Thu = 5, Fri = 6, Sat = 7;
140	\end{cfa}
141	For these reasons, aliasing is sometimes called an enumeration.
142	However, there is no type to create a type-checked instance or iterator cursor, so there is no ability to enumerate.
143	Hence, there are multiple enumeration aspects not provided by aliasing, justifying a separate enumeration type in a programming language.
144
145
146	\subsection{Algebraic Data Type}
147	\label{s:AlgebraicDataType}
148
149	An algebraic data type (ADT)\footnote{ADT is overloaded with abstract data type.} is another language feature often linked with enumeration, where an ADT conjoins an arbitrary type, possibly a \lstinline[language=C++]{class} or @union@, and a named constructor.
150	For example, in Haskell:
151	\begin{haskell}
152	data S = S { i::Int, d::Double } $\C{// structure}$
153	data @Foo@ = A Int \| B Double \| C S $\C{// ADT, composed of three types}$
154	foo = A 3; $\C{// type Foo is inferred}$
155	bar = B 3.5
156	baz = C S{ i = 7, d = 7.5 }
157	\end{haskell}
158	the ADT has three variants (constructors), @A@, @B@, @C@, with associated types @Int@, @Double@, and @S@.
159	The constructors create an initialized value of the specific type that is bound to the immutable variables @foo@, @bar@, and @baz@.
160	Hence, the ADT @Foo@ is like a union containing values of the associated types, and a constructor name is used to initialize and access the value using dynamic pattern-matching.
161	\begin{cquote}
162	\setlength{\tabcolsep}{20pt}
163	\begin{tabular}{@{}ll@{}}
164	\begin{haskell}
165	prtfoo val = -- function
166	-- pattern match on constructor
167	case val of
168	@A@ a -> print a
169	@B@ b -> print b
170	@C@ (S i d) -> do
171	print i
172	print d
173	\end{haskell}
174	&
175	\begin{haskell}
176	main = do
177	prtfoo foo
178	prtfoo bar
179	prtfoo baz
180	3
181	3.5
182	7
183	7.5
184	\end{haskell}
185	\end{tabular}
186	\end{cquote}
187	For safety, most languages require all associated types to be listed or a default case with no field accesses.
188
189	A less frequent case is multiple constructors with the same type.
190	\begin{haskell}
191	data Bar = X Int \| Y Int \| Z Int;
192	foo = X 3;
193	bar = Y 3;
194	baz = Z 5;
195	\end{haskell}
196	Here, the constructor name gives different meanings to the values in the common \lstinline[language=Haskell]{Int} type, \eg the value @3@ has different interpretations depending on the constructor name in the pattern matching.
197
198	Note, the term \newterm{variant} is often associated with ADTs.
199	However, there are multiple languages with a @variant@ type that is not an ADT \see{Algol68~\cite{Algol68} or \CC \lstinline{variant}}.
200	Here, the type (and possibly the position for equivalent types) is used to discriminate the specific \emph{variant} within the variant instance.
201	For example, \VRef[Figure]{f:C++variant} shows the \CC equivalent of the two Haskell ADT types using variant types.
202	In these languages, the variant cannot be used to simulate an enumeration.
203	Hence, in this work the term variant is not a synonym for ADT.
204
205	\begin{figure}
206	\begin{c++}
207	struct S { char s[32]; };
208	variant< int, double, S > vd;
209	variant< int, int, int > vs;
210
211	// discrimination based on type
212	vd = 3;
213	if ( holds_alternative<int>(vd) ) cout << "int " << get<int>(vd ) << endl;
214	vd = 3.5;
215	if ( holds_alternative<double>(vd) ) cout << "double " << get<double>(vd) << endl;
216	vd = (S){ "abc" };
217	if ( holds_alternative<S>(vd) ) cout << "S.s " << get<S>(vd).s << endl;
218
219	// discrimination based on type and position within type
220	vs = (variant<int,int,int>){ in_place_index<0>, 12 };
221	if ( vs.index() == 0 ) cout << "posn 0 " << get<0>(vs) << endl;
222	vs = (variant<int,int,int>){ in_place_index<1>, 4 };
223	if ( vs.index() == 1 ) cout << "posn 1 " << get<1>(vs) << endl;
224	vs = (variant<int,int,int>){ in_place_index<2>, 5 };
225	if ( vs.index() == 2 ) cout << "posn 2 " << get<2>(vs) << endl;
226	\end{c++}
227	\caption{\CC \lstinline[language=C++]{variant} Discrimination Using RTTI/Position}
228	\label{f:C++variant}
229	\end{figure}
230
231	% https://downloads.haskell.org/ghc/latest/docs/libraries/base-4.19.1.0-179c/GHC-Enum.html
232	% https://hackage.haskell.org/package/base-4.19.1.0/docs/GHC-Enum.html
233
234	% The association between ADT and enumeration occurs if all the constructors have a unit (empty) type, \eg @struct unit {}@.
235	% Note, the unit type is not the same as \lstinline{void}.
236	In terms of functional programming linguistics, enumerations often refer to a @unit type@ ADT, which is a set with the @nil@ value carrying no information.
237	The unit type is different from type @void@ in C, because @void@ has no value, which is an empty set.
238	Hence, @void@ is a C annotation that nothing is expected in this place.
239	For example, a function that takes a @void@ parameter and returns a @void@ is a function that expects no parameters and returns nothing.
240	\begin{cfa}
241	void foo( void );
242	foo(); $\C{// no arguments and no result}$
243	\end{cfa}
244	Because of this distinction, it is impossible to have a variable of type @void@, to assign a @void@ value, or have a function taking and returning multiple @void@s.
245	\begin{cfa}
246	void v; $\C{// disallowed}$
247	v = void;
248	[ void, void ] bar( void, void );
249	\end{cfa}
250	Programming languages often use an empty parameter list to imply no value and no return type for empty return.
251	\begin{cfa}
252	[] bar(); $\C{// \CFA empty/empty prototype}$
253	\end{cfa}
254	However, C is saddled with an empty parameter list meaning a list of unknown type parameters, \ie @var_arg@, which is changed to @void@ in \CC/\CFA.
255	As a result, a function that returns @void@ cannot be used as a parameter of a function that expects no parameter.
256	\begin{cfa}
257	void foo( void );
258	foo( @foo()@ ); $\C{// void argument does not match with void parameter}$
259	\end{cfa}
260
261	This issue arose when simulating an ADT using a \CC @variant@: @void@ cannot be used as an empty variant.
262	To solve this problem, \CC introduced @std::monstate@~\cite{C++monstate}, a type that can be instantiated as a value but holds no information.
263	A similar approximation in C is to define a @struct@ type with no fields.
264	\begin{cfa}
265	struct Unit {} e; $\C{// empty type}$
266	Unit bar( Unit );
267	bar( @bar( e )@ );
268	\end{cfa}
269	Because @std::monostate@ and @Unit@ are user-defined types versus part of the type system, they are only an approximation to @unit@ because other @unit@ types can be defined.
270
271	In the Haskell ADT:
272	\begin{haskell}
273	data Week = Mon \| Tue \| Wed \| Thu \| Fri \| Sat \| Sun deriving(Enum, Eq, Show)
274	\end{haskell}
275	the default type for each constructor is the unit type, and deriving from @Enum@ enforces no other associated types. The @Eq@ allows equality comparison, and @Show@ is for printing.
276	The nullary constructors for the unit types are numbered left-to-right from $0$ to @maxBound@$- 1$, and provide enumerating operations @succ@, @pred@, @enumFrom@, @enumFromTo@.
277	\VRef[Figure]{f:HaskellEnumeration} shows enumeration comparison and iterating (enumerating).
278
279	\begin{figure}
280	\begin{cquote}
281	\setlength{\tabcolsep}{40pt}
282	\begin{tabular}{@{}ll@{}}
283	\begin{haskell}
284	day = Tue
285	main = do
286	if day == Tue then
287	print day
288	else
289	putStr "not Tue"
290	print (enumFrom Mon) $\C[2.25in]{-- week}$
291	print (enumFromTo Mon Fri) $\C{-- weekday}$
292	print (enumFromTo Sat Sun) $\C{-- weekend}\CRT$
293	\end{haskell}
294	&
295	\begin{haskell}
296	Tue
297	[Mon,Tue,Wed,Thu,Fri,Sat,Sun]
298	[Mon,Tue,Wed,Thu,Fri]
299	[Sat,Sun]
300
301
302
303
304
305	\end{haskell}
306	\end{tabular}
307	\end{cquote}
308	\caption{Haskell Enumeration}
309	\label{f:HaskellEnumeration}
310	\end{figure}
311
312	The key observation is the dichotomy between an ADT and enumeration: the ADT uses the associated type resulting in a union-like data structure, and the enumeration does not use the associated type, and hence, is not a union.
313	In contrast, an enumeration may be constructed using the ADT mechanism, but it is so restricted it is not an ADT.
314	Furthermore, a general ADT cannot be an enumeration because the constructors generate different values making enumerating meaningless.
315	While functional programming languages regularly repurpose the ADT type into an enumeration type, this process seems contrived and confusing.
316	Hence, there is only a weak equivalence between an enumeration and ADT, justifying a separate enumeration type in a programming language.
317
318
319	\section{Contributions}
320
321	The goal of this work is to to extend the simple and unsafe enumeration type in the C programming-language into a complex and safe enumeration type in the \CFA programming-language, while maintaining backwards compatibility with C.
322	On the surface, enumerations seem like a simple type.
323	However, when extended with advanced features, enumerations become complex for both the type system and the runtime implementation.
324
325	The contributions of this work are:
326	\begin{enumerate}
327	\item
328	safety: Define a safe enumeration conversion scheme, both for C and \CFA, and replace ad-hoc C idioms with safer software-engineering approaches.
329	\item
330	overloading: Provide a pattern to overload functions, literals, and variables for polymorphic enumerations using the \CFA type system.
331	\item
332	scoping: Add a namespace for enumerations and qualified access into the namespace to deal with the naming problem.
333	\item
334	generalization: Support all language types for enumerators with associated values providing enumeration constants for any type.
335	\item
336	reuse: Implement subset and containment inheritance for enumerations.
337	\item
338	control flow: Extend control-flow structures making it safer and easier to enumerate over an enumeration.
339	\item
340	I/O: Provide input and output of enumerations based on enumerator names.
341	\end{enumerate}
342
343
344	\begin{comment}
345	Date: Wed, 1 May 2024 13:41:58 -0400
346	Subject: Re: Enumeration
347	To: "Peter A. Buhr" <pabuhr@uwaterloo.ca>
348	From: Gregor Richards <gregor.richards@uwaterloo.ca>
349
350	I think I have only one comment and one philosophical quibble to make:
351
352	Comment: I really can't agree with putting MB in the same category as the
353	others. MB is both a quantity and a unit, and the suggestion that MB is one
354	million evokes the rather disgusting comparison 1MB = 1000km. Unit types are
355	not in the scope of this work.
356
357	Philosophical quibble: Pi is 3.14159...etc. Monday is not 0; associating
358	Monday with 0 is just a consequence of the language. The way this is written
359	suggests that the intentional part is subordinate to the implementation detail,
360	which seems backwards to me. Calling the number "primary" and the name
361	"secondary" feels like you're looking out from inside of the compiler, instead
362	of looking at the language from the outside. And, calling secondary values
363	without visible primary values "opaque"-which yes, I realize is my own term
364	;)-suggests that you insist that the primary value is a part of the design, or
365	at least mental model, of the program. Although as a practical matter there is
366	some system value associated with the constructor/tag of an ADT, that value is
367	not part of the mental model, and so calling it "primary" and calling the name
368	"secondary" and "opaque" seems either (a) very odd or (b) very C-biased. Or
369	both.
370
371	With valediction,
372	- Gregor Richards
373
374
375	Date: Thu, 30 May 2024 23:15:23 -0400
376	Subject: Re: Meaning?
377	To: "Peter A. Buhr" <pabuhr@uwaterloo.ca>
378	CC: <ajbeach@uwaterloo.ca>, <j82liang@uwaterloo.ca>
379	From: Gregor Richards <gregor.richards@uwaterloo.ca>
380
381	I have to disagree with this being agreeing to disagree, since we agree
382	here. My core point was that it doesn't matter whether you enumerate over the
383	names or the values. This is a distinction without a difference in any case
384	that matters. If any of the various ways of looking at it are actually
385	different from each other, then that's because the enumeration has failed to be
386	an enumeration in some other way, not because of the actual process of
387	enumeration. Your flag enum is a 1-to-1 map of names and values, so whether you
388	walk through names or walk through values is not an actual distinction. It
389	could be distinct in the order that it walks through, but that doesn't
390	actually matter, it's just a choice that has to be made. Walking through entire
391	range of machine values, including ones that aren't part of the enumeration,
392	would be bizarre in any case.
393
394	Writing these out has crystallized some thoughts, albeit perhaps not in a way
395	that's any help to y'all. An enumeration is a set of names; ideally an ordered
396	set of names. The state of enumerations in programming languages muddies things
397	because they often expose the machine value underlying those names, resulting
398	in a possibly ordered set of names and a definitely ordered set of values. And,
399	muddying things further, because those underlying values are exposed, enums are
400	used in ways that depend on the underlying values being exposed, making that
401	a part of the definition. But, an enumeration is conceptually just one set,
402	not both. So much of the difficulty is that you're trying to find a way to make
403	a concept that should be a single set agree with an implementation that's two
404	sets. If those sets have a 1-to-1 mapping, then who cares, they're just
405	aliases. It's the possibility of the map being surjective (having multiple
406	names for the same underlying values) that breaks everything. Personally, I
407	think that an enum with aliases isn't an enumeration anyway, so who cares about
408	the rest; if you're not wearing the gourd as a shoe, then it's not an
409	enumeration.
410
411	With valediction,
412	- Gregor Richards
413	\end{comment}

Note: See TracBrowser for help on using the repository browser.

Download in other formats: