Context Navigation

source: doc/theses/mike_brooks_MMath/background.tex @ 0775468

Last change on this file since 0775468 was 0775468, checked in by Peter A. Buhr <pabuhr@…>, 6 weeks ago
proofreading background section Linked Lists
Property mode set to `100644`
File size: 45.1 KB

Line
1	\chapter{Background}
2
3	Since this work builds on C, it is necessary to explain the C mechanisms and their shortcomings for array, linked list, and string.
4
5
6	\section{Array}
7
8	At the start, the C programming language made a significant design mistake.
9	\begin{quote}
10	In C, there is a strong relationship between pointers and arrays, strong enough that pointers and arrays really should be treated simultaneously.
11	Any operation which can be achieved by array subscripting can also be done with pointers.~\cite[p.~93]{C:old}
12	\end{quote}
13	Accessing any storage requires pointer arithmetic, even if it is just base-displacement addressing in an instruction.
14	The conjoining of pointers and arrays could also be applied to structures, where a pointer references a structure field like an array element.
15	Finally, while subscripting involves pointer arithmetic (as does field references @x.y.z@), it is very complex for multi-dimensional arrays and requires array descriptors to know stride lengths along dimensions.
16	Many C errors result from performing pointer arithmetic instead of using subscripting.
17	Some C textbooks erroneously teach pointer arithmetic suggesting it is faster than subscripting.
18
19	C semantics want a programmer to \emph{believe} an array variable is a ``pointer to its first element.''
20	This desire becomes apparent by a detailed inspection of an array declaration.
21	\lstinput{34-34}{bkgd-carray-arrty.c}
22	The inspection begins by using @sizeof@ to provide definite program semantics for the intuition of an expression's type.
23	\lstinput{35-36}{bkgd-carray-arrty.c}
24	Now consider the sizes of expressions derived from @ar@, modified by adding ``pointer to'' and ``first element'' (and including unnecessary parentheses to avoid confusion about precedence).
25	\lstinput{37-40}{bkgd-carray-arrty.c}
26	Given the size of @float@ is 4, the size of @ar@ with 10 floats being 40 bytes is common reasoning for C programmers.
27	Equally, C programmers know the size of a \emph{pointer} to the first array element is 8 (or 4 depending on the addressing architecture).
28	% Now, set aside for a moment the claim that this first assertion is giving information about a type.
29	Clearly, an array and a pointer to its first element are different things.
30
31	In fact, the idea that there is such a thing as a pointer to an array may be surprising and it is not the same thing as a pointer to the first element.
32	\lstinput{42-45}{bkgd-carray-arrty.c}
33	The first assignment gets
34	\begin{cfa}
35	warning: assignment to `float ()[10]' from incompatible pointer type `float '
36	\end{cfa}
37	and the second assignment gets the opposite.
38
39	The inspection now refutes any suggestion that @sizeof@ is informing about allocation rather than type information.
40	Note, @sizeof@ has two forms, one operating on an expression and the other on a type.
41	Using the type form yields the same results as the prior expression form.
42	\lstinput{46-49}{bkgd-carray-arrty.c}
43	The results are also the same when there is \emph{no allocation} using a pointer-to-array type.
44	\lstinput{51-57}{bkgd-carray-arrty.c}
45	Hence, in all cases, @sizeof@ is informing about type information.
46
47	So, thinking of an array as a pointer to its first element is too simplistic an analogue and it is not backed up by the type system.
48	This misguided analogue works for a single-dimension array but there is no advantage other than possibly teaching beginning programmers about basic runtime array-access.
49
50	Continuing, a short form for declaring array variables exists using length information provided implicitly by an initializer.
51	\lstinput{59-62}{bkgd-carray-arrty.c}
52	The compiler counts the number of initializer elements and uses this value as the first dimension.
53	Unfortunately, the implicit element counting does not extend to dimensions beyond the first.
54	\lstinput{64-67}{bkgd-carray-arrty.c}
55
56	My contribution is recognizing:
57	\begin{itemize}
58	\item There is value in using a type that knows its size.
59	\item The type pointer to (first) element does not.
60	\item C \emph{has} a type that knows the whole picture: array, e.g. @T[10]@.
61	\item This type has all the usual derived forms, which also know the whole picture.
62	A usefully noteworthy example is pointer to array, e.g. @T (*)[10]@.\footnote{
63	The parenthesis are necessary because subscript has higher priority than pointer in C declarations.
64	(Subscript also has higher priority than dereference in C expressions.)}
65	\end{itemize}
66
67
68	\section{Reading Declarations}
69
70	A significant area of confusion for reading C declarations results from embedding a declared variable in a declaration, mimicking the way the variable is used in executable statements.
71	\begin{cquote}
72	\begin{tabular}{@{}ll@{}}
73	\multicolumn{1}{@{}c}{\textbf{Array}} & \multicolumn{1}{c@{}}{\textbf{Function Pointer}} \\
74	\begin{cfa}
75	int @(*@ar@)[@5@]@; // definition
76	... @(*@ar@)[@3@]@ += 1; // usage
77	\end{cfa}
78	&
79	\begin{cfa}
80	int @(*@f@())[@5@]@ { ... }; // definition
81	... @(*@f@())[@3@]@ += 1; // usage
82	\end{cfa}
83	\end{tabular}
84	\end{cquote}
85	Essentially, the type is wrapped around the name in successive layers (like an \Index{onion}).
86	While attempting to make the two contexts consistent is a laudable goal, it has not worked out in practice, even though Dennis Richie believed otherwise:
87	\begin{quote}
88	In spite of its difficulties, I believe that the C's approach to declarations remains plausible, and am comfortable with it; it is a useful unifying principle.~\cite[p.~12]{Ritchie93}
89	\end{quote}
90	After all, reading a C array type is easy: just read it from the inside out, and know when to look left and when to look right!
91
92	\CFA provides its own type, variable and routine declarations, using a simpler syntax.
93	The new declarations place qualifiers to the left of the base type, while C declarations place qualifiers to the right of the base type.
94	The qualifiers have the same meaning in \CFA as in C.
95	Then, a \CFA declaration is read left to right, where a function return type is enclosed in brackets @[@\,@]@.
96	\begin{cquote}
97	\begin{tabular}{@{}l@{\hspace{3em}}ll@{}}
98	\multicolumn{1}{c@{\hspace{3em}}}{\textbf{C}} & \multicolumn{1}{c}{\textbf{\CFA}} & \multicolumn{1}{c}{\textbf{read left to right}} \\
99	\begin{cfa}
100	int @*@ x1 @[5]@;
101	int @(*@x2@)[5]@;
102	int @(*@f( int p )@)[5]@;
103	\end{cfa}
104	&
105	\begin{cfa}
106	@[5] *@ int x1;
107	@* [5]@ int x2;
108	@[ * [5] int ]@ f( int p );
109	\end{cfa}
110	&
111	\begin{cfa}
112	// array of 5 pointers to int
113	// pointer to array of 5 int
114	// function returning pointer to array of 5 ints
115	\end{cfa}
116	\\
117	& &
118	\LstCommentStyle{//\ \ \ and taking an int argument}
119	\end{tabular}
120	\end{cquote}
121	As declaration complexity increases, it becomes corresponding difficult to read and understand the C declaration form.
122	Note, writing declarations left to right is common in other programming languages, where the function return-type is often placed after the parameter declarations.
123
124	\VRef[Table]{bkgd:ar:usr:avp} introduces the many layers of the C and \CFA array story, where the \CFA story is discussion in \VRef{XXX}.
125	The \CFA-thesis column shows the new array declaration form, which is my contributed improvements for safety and ergonomics.
126	The table shows there are multiple yet equivalent forms for the array types under discussion, and subsequent discussion shows interactions with orthogonal (but easily confused) language features.
127	Each row of the table shows alternate syntactic forms.
128	The simplest occurrences of types distinguished in the preceding discussion are marked with $\triangleright$.
129	Removing the declared variable @x@, gives the type used for variable, structure field, cast or error messages \PAB{(though note Section TODO points out that some types cannot be casted to)}.
130	Unfortunately, parameter declarations \PAB{(section TODO)} have more syntactic forms and rules.
131
132	\begin{table}
133	\centering
134	\caption{Syntactic Reference for Array vs Pointer. Includes interaction with \lstinline{const}ness.}
135	\label{bkgd:ar:usr:avp}
136	\begin{tabular}{ll\|l\|l\|l}
137	& Description & \multicolumn{1}{c\|}{C} & \multicolumn{1}{c\|}{\CFA} & \multicolumn{1}{c}{\CFA-thesis} \\
138	\hline
139	$\triangleright$ & value & @T x;@ & @T x;@ & \\
140	\hline
141	& immutable value & @const T x;@ & @const T x;@ & \\
142	& & @T const x;@ & @T const x;@ & \\
143	\hline \hline
144	$\triangleright$ & pointer to value & @T * x;@ & @* T x;@ & \\
145	\hline
146	& immutable ptr. to val. & @T * const x;@ & @const * T x;@ & \\
147	\hline
148	& ptr. to immutable val. & @const T * x;@ & @* const T x;@ & \\
149	& & @T const * x;@ & @* T const x;@ & \\
150	\hline \hline
151	$\triangleright$ & array of value & @T x[10];@ & @[10] T x@ & @array(T, 10) x@ \\
152	\hline
153	& ar.\ of immutable val. & @const T x[10];@ & @[10] const T x@ & @const array(T, 10) x@ \\
154	& & @T const x[10];@ & @[10] T const x@ & @array(T, 10) const x@ \\
155	\hline
156	& ar.\ of ptr.\ to value & @T * x[10];@ & @[10] * T x@ & @array(T *, 10) x@ \\
157	& & & & @array(* T, 10) x@ \\
158	\hline
159	& ar.\ of imm. ptr.\ to val. & @T * const x[10];@ & @[10] const * T x@ & @array(* const T, 10) x@ \\
160	& & & & @array(const * T, 10) x@ \\
161	\hline
162	& ar.\ of ptr.\ to imm. val. & @const T * x[10];@ & @[10] * const T x@ & @array(const T *, 10) x@ \\
163	& & @T const * x[10];@ & @[10] * T const x@ & @array(* const T, 10) x@ \\
164	\hline \hline
165	$\triangleright$ & ptr.\ to ar.\ of value & @T (x)[10];@ & @ [10] T x@ & @* array(T, 10) x@ \\
166	\hline
167	& imm. ptr.\ to ar.\ of val. & @T (* const x)[10];@ & @const * [10] T x@ & @const * array(T, 10) x@ \\
168	\hline
169	& ptr.\ to ar.\ of imm. val. & @const T (x)[10];@ & @ [10] const T x@ & @* const array(T, 10) x@ \\
170	& & @T const (x)[10];@ & @ [10] T const x@ & @* array(T, 10) const x@ \\
171	\hline
172	& ptr.\ to ar.\ of ptr.\ to val. & @T (x)[10];@ & @* [10] * T x@ & @* array(T *, 10) x@ \\
173	& & & & @* array(* T, 10) x@ \\
174	\hline
175	\end{tabular}
176	\end{table}
177
178	TODO: Address these parked unfortunate syntaxes
179	\begin{itemize}
180	\item static
181	\item star as dimension
182	\item under pointer decay: @int p1[const 3]@ being @int const *p1@
183	\end{itemize}
184
185
186	\subsection{Arrays decay and pointers diffract}
187
188	The last section established the difference between these four types:
189	\lstinput{3-6}{bkgd-carray-decay.c}
190	But the expression used for obtaining the pointer to the first element is pedantic.
191	The root of all C programmer experience with arrays is the shortcut
192	\lstinput{8-8}{bkgd-carray-decay.c}
193	which reproduces @pa0@, in type and value:
194	\lstinput{9-9}{bkgd-carray-decay.c}
195	The validity of this initialization is unsettling, in the context of the facts established in the last section.
196	Notably, it initializes name @pa0x@ from expression @ar@, when they are not of the same type:
197	\lstinput{10-10}{bkgd-carray-decay.c}
198
199	So, C provides an implicit conversion from @float[10]@ to @float *@.
200	\begin{quote}
201	Except when it is the operand of the @sizeof@ operator, or the unary @&@ operator, or is a string literal used to
202	initialize an array an expression that has type ``array of \emph{type}'' is converted to an expression with type
203	``pointer to \emph{type}'' that points to the initial element of the array object~\cite[\S~6.3.2.1.3]{C11}
204	\end{quote}
205	This phenomenon is the famous ``pointer decay,'' which is a decay of an array-typed expression into a pointer-typed one.
206	It is worthy to note that the list of exception cases does not feature the occurrence of @ar@ in @ar[i]@.
207	Thus, subscripting happens on pointers not arrays.
208
209	Subscripting proceeds first with pointer decay, if needed. Next, \cite[\S~6.5.2.1.2]{C11} explains that @ar[i]@ is treated as if it were @(*((a)+(i)))@.
210	\cite[\S~6.5.6.8]{C11} explains that the addition, of a pointer with an integer type, is defined only when the pointer refers to an element that is in an array, with a meaning of ``@i@ elements away from,'' which is valid if @ar@ is big enough and @i@ is small enough.
211	Finally, \cite[\S~6.5.3.2.4]{C11} explains that the @*@ operator's result is the referenced element.
212	Taken together, these rules illustrate that @ar[i]@ and @i[a]@ mean the same thing!
213
214	Subscripting a pointer when the target is standard-inappropriate is still practically well-defined.
215	While the standard affords a C compiler freedom about the meaning of an out-of-bound access,
216	or of subscripting a pointer that does not refer to an array element at all,
217	the fact that C is famously both generally high-performance, and specifically not bound-checked,
218	leads to an expectation that the runtime handling is uniform across legal and illegal accesses.
219	Moreover, consider the common pattern of subscripting on a @malloc@ result:
220	\begin{cfa}
221	float * fs = malloc( 10 * sizeof(float) );
222	fs[5] = 3.14;
223	\end{cfa}
224	The @malloc@ behaviour is specified as returning a pointer to ``space for an object whose size is'' as requested (\cite[\S~7.22.3.4.2]{C11}).
225	But \emph{nothing} more is said about this pointer value, specifically that its referent might \emph{be} an array allowing subscripting.
226
227	Under this assumption, a pointer being subscripted (or added to, then dereferenced) by any value (positive, zero, or negative), gives a view of the program's entire address space, centred around the @p@ address, divided into adjacent @sizeof(p)@ chunks, each potentially (re)interpreted as @typeof(p)@.
228	I call this phenomenon ``array diffraction,'' which is a diffraction of a single-element pointer into the assumption that its target is in the middle of an array whose size is unlimited in both directions.
229	No pointer is exempt from array diffraction.
230	No array shows its elements without pointer decay.
231
232	A further pointer--array confusion, closely related to decay, occurs in parameter declarations.
233	\cite[\S~6.7.6.3.7]{C11} explains that when an array type is written for a parameter,
234	the parameter's type becomes a type that can be summarized as the array-decayed type.
235	The respective handling of the following two parameter spellings shows that the array-spelled one is really, like the other, a pointer.
236	\lstinput{12-16}{bkgd-carray-decay.c}
237	As the @sizeof(x)@ meaning changed, compared with when run on a similarly-spelled local variable declaration,
238	@gcc@ also gives this code the warning for the first assertion:
239	\begin{cfa}
240	warning: 'sizeof' on array function parameter 'x' will return size of 'float *'
241	\end{cfa}
242	The caller of such a function is left with the reality that a pointer parameter is a pointer, no matter how it is spelled:
243	\lstinput{18-21}{bkgd-carray-decay.c}
244	This fragment gives no warnings.
245
246	The shortened parameter syntax @T x[]@ is a further way to spell ``pointer.''
247	Note the opposite meaning of this spelling now, compared with its use in local variable declarations.
248	This point of confusion is illustrated in:
249	\lstinput{23-30}{bkgd-carray-decay.c}
250	The basic two meanings, with a syntactic difference helping to distinguish,
251	are illustrated in the declarations of @ca@ vs.\ @cp@,
252	whose subsequent @edit@ calls behave differently.
253	The syntax-caused confusion is in the comparison of the first and last lines,
254	both of which use a literal to initialize an object declared with spelling @T x[]@.
255	But these initialized declarations get opposite meanings,
256	depending on whether the object is a local variable or a parameter.
257
258	In summary, when a function is written with an array-typed parameter,
259	\begin{itemize}
260	\item an appearance of passing an array by value is always an incorrect understanding
261	\item a dimension value, if any is present, is ignored
262	\item pointer decay is forced at the call site and the callee sees the parameter having the decayed type
263	\end{itemize}
264
265	Pointer decay does not affect pointer-to-array types, because these are already pointers, not arrays.
266	As a result, a function with a pointer-to-array parameter sees the parameter exactly as the caller does:
267	\lstinput{32-42}{bkgd-carray-decay.c}
268	\VRef[Table]{bkgd:ar:usr:decay-parm} gives the reference for the decay phenomenon seen in parameter declarations.
269
270	\begin{table}
271	\caption{Syntactic Reference for Decay during Parameter-Passing.
272	Includes interaction with \lstinline{const}ness, where ``immutable'' refers to a restriction on the callee's ability.}
273	\label{bkgd:ar:usr:decay-parm}
274	\centering
275	\begin{tabular}{llllll}
276	& Description & Type & Parameter Declaration & \CFA \\
277	\hline
278	& & & @T * x,@ & @* T x,@ \\
279	$\triangleright$ & pointer to value & @T *@ & @T x[10],@ & @[10] T x,@ \\
280	& & & @T x[],@ & @[] T x,@ \\
281	\hline
282	& & & @T * const x,@ & @const * T x@ \\
283	& immutable ptr.\ to val. & @T * const@ & @T x[const 10],@ & @[const 10] T x,@ \\
284	& & & @T x[const],@ & @[const] T x,@\\
285	\hline
286	& & & @const T * x,@ & @ * const T x,@ \\
287	& & & @T const * x,@ & @ * T const x,@ \\
288	& ptr.\ to immutable val. & @const T *@ & @const T x[10],@ & @[10] const T x,@ \\
289	& & @T const *@ & @T const x[10],@ & @[10] T const x,@ \\
290	& & & @const T x[],@ & @[] const T x,@ \\
291	& & & @T const x[],@ & @[] T const x,@ \\
292	\hline \hline
293	& & & @T (x)[10],@ & @ [10] T x,@ \\
294	$\triangleright$ & ptr.\ to ar.\ of val. & @T(*)[10]@ & @T x[3][10],@ & @[3][10] T x,@ \\
295	& & & @T x[][10],@ & @[][10] T x,@ \\
296	\hline
297	& & & @T x,@ & @ T x,@ \\
298	& ptr.\ to ptr.\ to val. & @T *@ & @T x[10],@ & @[10] * T x,@ \\
299	& & & @T * x[],@ & @[] * T x,@ \\
300	\hline
301	& ptr.\ to ptr.\ to imm.\ val. & @const char *@ & @const char argv[],@ & @[] * const char argv,@ \\
302	& & & \emph{others elided} & \emph{others elided} \\
303	\hline
304	\end{tabular}
305	\end{table}
306
307
308	\subsection{Lengths may vary, checking does not}
309
310	When the desired number of elements is unknown at compile time, a variable-length array is a solution:
311	\begin{cfa}
312	int main( int argc, const char * argv[] ) {
313	assert( argc == 2 );
314	size_t n = atol( argv[1] );
315	assert( 0 < n );
316	float ar[n];
317	float b[10];
318	// ... discussion continues here
319	}
320	\end{cfa}
321	This arrangement allocates @n@ elements on the @main@ stack frame for @ar@, called a \newterm{variable length array} (VLA), as well as 10 elements in the same stack frame for @b@.
322	The variable-sized allocation of @ar@ is provided by the @alloca@ routine, which bumps the stack pointer.
323	Note, the C standard supports VLAs~\cite[\S~6.7.6.2.4]{C11} as a conditional feature, but the \CC standard does not;
324	both @gcc@ and @g++@ support VLAs.
325	As well, there is misinformation about VLAs, \eg VLAs cause stack failures or are inefficient.
326	VLAs exist as far back as Algol W~\cite[\S~5.2]{AlgolW} and are a sound and efficient data type.
327
328	For high-performance applications, the stack size can be fixed and small (coroutines or user-level threads).
329	Here, VLAs can overflow the stack, so a heap allocation is used.
330	\begin{cfa}
331	float * ax1 = malloc( sizeof( float[n] ) );
332	float * ax2 = malloc( n * sizeof( float ) );
333	float * bx1 = malloc( sizeof( float[1000000] ) );
334	float * bx2 = malloc( 1000000 * sizeof( float ) );
335	\end{cfa}
336
337	Parameter dependency
338
339	Checking is best-effort / unsound
340
341	Limited special handling to get the dimension value checked (static)
342
343
344	\subsection{Dynamically sized, multidimensional arrays}
345
346	In C and \CC, ``multidimensional array'' means ``array of arrays.'' Other meanings are discussed in TODO.
347
348	Just as an array's element type can be @float@, so can it be @float[10]@.
349
350	While any of @float@, @float[10]@ and @float()[10]@ are easy to tell apart from @float@, telling them apart from each other may need occasional reference back to TODO intro section.
351	The sentence derived by wrapping each type in @-[3]@ follows.
352
353	While any of @float[3]@, @float[3][10]@ and @float()[3][10]@ are easy to tell apart from @float[3]@,
354	telling them apart from each other is what it takes to know what ``array of arrays'' really means.
355
356	Pointer decay affects the outermost array only
357
358	TODO: unfortunate syntactic reference with these cases:
359
360	\begin{itemize}
361	\item ar. of ar. of val (be sure about ordering of dimensions when the declaration is dropped)
362	\item ptr. to ar. of ar. of val
363	\end{itemize}
364
365
366	\subsection{Arrays are (but) almost values}
367
368	Has size; can point to
369
370	Can't cast to
371
372	Can't pass as value
373
374	Can initialize
375
376	Can wrap in aggregate
377
378	Can't assign
379
380
381	\subsection{Returning an array is (but) almost possible}
382
383
384	\subsection{The pointer-to-array type has been noticed before}
385
386	\subsection{Multi-Dimensional}
387
388	As in the last section, we inspect the declaration ...
389	\lstinput{16-18}{bkgd-carray-mdim.c}
390	The significant axis of deriving expressions from @ar@ is now ``itself,'' ``first element'' or ``first grand-element (meaning, first element of first element).''
391	\lstinput{20-44}{bkgd-carray-mdim.c}
392
393
394	\section{Linked List}
395
396	Linked-lists are blocks of storage connected using one or more pointers.
397	The storage block is logically divided into data (user payload) and links (list pointers), where the links are the only component used by the list structure.
398	Since the data is opaque, list structures are often polymorphic over the data, which is often homogeneous.
399
400	Linking is used to build data structures, which are a group of nodes, containing data and links, organized in a particular format, with specific operations peculiar to that format, \eg queue, tree, hash table, \etc.
401	Because a node's existence is independent of the data structure that organizes it, all nodes are manipulated by address not value;
402	hence, all data structure routines take and return pointers to nodes and not the nodes themselves.
403
404
405	\begin{comment}
406	\subsection{Linked-List Packages}
407
408	C only supports type-eraser polymorphism, with no help from the type system.
409	This approach is used in the @queue@ library providing macros that define and operate on four types of data structures: singly-linked lists, singly-linked tail queues, lists, and tail queues.
410	These linked structures are \newterm{intrusive list}, where the link fields are defined (intrude) with data fields.
411	\begin{cfa}
412	struct DS {
413	// link fields, intrustive
414	// data fields
415	}
416	\end{cfa}
417
418	\uCpp~\cite{uC++} is a concurrent extension of \CC, and provides a basic set of intrusive lists, where the link fields are defined with the data fields using inheritance.
419	\begin{cfa}
420	struct DS : public uColable {
421	// implicit link fields
422	// data fields
423	}
424	\end{cfa}
425
426	Intrusive nodes eliminate the need to dynamically allocate/deallocate the link fields when a node is added/removed to/from a data-structure.
427	Reducing dynamic allocation is important in concurrent programming because the heap is a shared resource with the potential for high contention.
428	The two formats are one link field, which form a \Index{collection}, and two link fields, which form a \Index{sequence}.
429	\begin{center}
430	%\input{DSLNodes}
431	\end{center}
432	@uStack@ and @uQueue@ are collections and @uSequence@ is a sequence.
433	To get the appropriate link fields associated with a user node, it must be a public descendant of @uColable@\index{uColable@@uColable@} or @uSeqable@\index{uSeqable@@uSeqable@}, respectively, e.g.:
434	%[
435	class stacknode : public uColable { ... }
436	class queuenode : public uColable { ... }
437	class seqnode : public uSeqable { ... }
438	%]
439	A node inheriting from @uSeqable@ can appear in a sequence/collection but a node inherting from @uColable@ can only appear in a collection.
440	Along with providing the appropriate link fields, the types @uColable@ and @uSeqable@ also provide one member routine:
441	%[
442	bool listed() const;
443	%]
444	which returns @true@ if the node is an element of any collection or sequence and @false@ otherwise.
445
446	Finally, no header files are necessary to access the uC DSL.
447
448	Some uC DSL restrictions are:
449	\begin{itemize}
450	\item
451	None of the member routines are virtual in any of the data structures for efficiency reasons.
452	Therefore, pointers to data structures must be used with care or incorrect member routines may be invoked.
453	\end{itemize}
454	\end{comment}
455
456
457	\subsection{Design Issues}
458	\label{toc:lst:issue}
459
460	This section introduces the design space for linked lists that target \emph{system programmers}.
461	Within this restricted target, all design-issue discussions assume the following invariants.
462	Alternatives to the assumptions are discussed under Future Work (Section~\ref{toc:lst:futwork}).
463	\begin{itemize}
464	\item A doubly-linked list is being designed.
465	Generally, the discussed issues apply similarly for singly-linked lists.
466	Circular \vs ordered linking is discussed under List identity (Section~\ref{toc:lst:issue:ident}).
467	\item Link fields are system-managed.
468	The user works with the system-provided API to query and modify list membership.
469	The system has freedom over how to represent these links.
470	\item The user data must provide storage for the list link-fields.
471	Hence, a list node is \emph{statically} defined as data and links \vs a node that is \emph{dynamically} constructed from data and links \see{\VRef{toc:lst:issue:attach}}.
472	\end{itemize}
473
474
475	\subsection{Preexisting Linked-List Libraries}
476
477	Two preexisting linked-list libraries are used throughout, to show examples of the concepts being defined,
478	and further libraries are introduced as needed.
479	\begin{enumerate}
480	\item Linux Queue library\cite{lst:linuxq} (LQ) of @<sys/queue.h>@.
481	\item \CC Standard Template Library's (STL)\footnote{The term STL is contentious as some people prefer the term standard library.} @std::list@\cite{lst:stl}
482	\end{enumerate}
483	A general comparison of libraries' abilities is given under Related Work (Section~\ref{toc:lst:relwork}).
484
485	For the discussion, assume the fictional type @req@ (request) is the user's payload in examples.
486	As well, the list library is helping the user manage (organize) requests, \eg a request can be work on the level of handling a network arrival event or scheduling a thread.
487
488
489	\subsection{Link attachment: Intrusive vs.\ Wrapped}
490	\label{toc:lst:issue:attach}
491
492	Link attachment deals with the question:
493	Where are the libraries' inter-element link fields stored, in relation to the user's payload data fields?
494	Figure~\ref{fig:lst-issues-attach} shows three basic styles.
495	The \newterm{intrusive} style places the link fields inside the payload structure.
496	The two \newterm{wrapped} styles place the payload inside a generic library-provided structure that then defines the link fields.
497	Library LQ is intrusive; STL is wrapped.
498	The wrapped style further distinguishes between wrapping a reference and wrapping a value, \eg @list<req *>@ or @list<req>@.
499	(For this discussion, @list<req &>@ is similar to @list<req *>@.)
500	This difference is one of user style, not framework capability.
501
502	\begin{comment}
503	\begin{figure}
504	\begin{tabularx}{\textwidth}{Y\|Y\|Y}
505	\lstinput[language=C]{20-39}{lst-issues-intrusive.run.c}
506	&\lstinputlisting[language=C++]{20-39}{lst-issues-wrapped-byref.run.cpp}
507	&\lstinputlisting[language=C++]{20-39}{lst-issues-wrapped-emplaced.run.cpp}
508	\\ & &
509	\\
510	\includegraphics[page=1]{lst-issues-attach.pdf}
511	&
512	\includegraphics[page=2]{lst-issues-attach.pdf}
513	&
514	\includegraphics[page=3]{lst-issues-attach.pdf}
515	\\ & &
516	\\
517	(a) & (b) & (c)
518	\end{tabularx}
519	\caption{
520	Three styles of link attachment: (a)~intrusive, (b)~wrapped reference, and (c)~wrapped value.
521	The diagrams show the memory layouts that result after the code runs, eliding the head object \lstinline{reqs};
522	head objects are discussed in Section~\ref{toc:lst:issue:ident}.
523	In (a), the field \lstinline{req.x} names a list direction;
524	these are discussed in Section~\ref{toc:lst:issue:simultaneity}.
525	In (b) and (c), the type \lstinline{node} represents a system-internal type,
526	which is \lstinline{std::_List_node} in the GNU implementation.
527	(TODO: cite? found in /usr/include/c++/7/bits/stl\_list.h )
528	}
529	\label{fig:lst-issues-attach}
530	\end{figure}
531	\end{comment}
532
533	\begin{figure}
534	\centering
535	\newsavebox{\myboxA} % used with subfigure
536	\newsavebox{\myboxB}
537	\newsavebox{\myboxC}
538
539	\begin{lrbox}{\myboxA}
540	\begin{tabular}{@{}l@{}}
541	\lstinput[language=C]{20-35}{lst-issues-intrusive.run.c} \\
542	\includegraphics[page=1]{lst-issues-attach.pdf}
543	\end{tabular}
544	\end{lrbox}
545
546	\begin{lrbox}{\myboxB}
547	\begin{tabular}{@{}l@{}}
548	\lstinput[language=C++]{20-35}{lst-issues-wrapped-byref.run.cpp} \\
549	\includegraphics[page=2]{lst-issues-attach.pdf}
550	\end{tabular}
551	\end{lrbox}
552
553	\begin{lrbox}{\myboxC}
554	\begin{tabular}{@{}l@{}}
555	\lstinput[language=C++]{20-35}{lst-issues-wrapped-emplaced.run.cpp} \\
556	\includegraphics[page=3]{lst-issues-attach.pdf}
557	\end{tabular}
558	\end{lrbox}
559
560	\subfloat[Intrusive]{\label{f:Intrusive}\usebox\myboxA}
561	\hspace{6pt}
562	\vrule
563	\hspace{6pt}
564	\subfloat[Wrapped reference]{\label{f:WrappedRef}\usebox\myboxB}
565	\hspace{6pt}
566	\vrule
567	\hspace{6pt}
568	\subfloat[Wrapped value]{\label{f:WrappedValue}\usebox\myboxC}
569
570	\caption{
571	Three styles of link attachment:
572	% \protect\subref{f:Intrusive}~intrusive, \protect\subref{f:WrappedRef}~wrapped reference, and \protect\subref*{f:WrappedValue}~wrapped value.
573	The diagrams show the memory layouts that result after the code runs, eliding the head object \lstinline{reqs};
574	head objects are discussed in Section~\ref{toc:lst:issue:ident}.
575	In \protect\subref*{f:Intrusive}, the field \lstinline{req.d} names a list direction;
576	these are discussed in Section~\ref{toc:lst:issue:simultaneity}.
577	In \protect\subref{f:WrappedRef} and \protect\subref{f:WrappedValue}, the type \lstinline{node} represents a
578	library-internal type, which is \lstinline{std::_List_node} in the GNU implementation
579	\see{\lstinline{/usr/include/c++/X/bits/stl_list.h}, where \lstinline{X} is the \lstinline{g++} version number}.
580	}
581	\label{fig:lst-issues-attach}
582	\end{figure}
583
584	Each diagrammed example is using the fewest dynamic allocations for its respective style:
585	in \subref{f:Intrusive}, here are no dynamic allocations, in \subref{f:WrappedRef} only the linked fields are dynamically allocated, and in \subref*{f:WrappedValue} the copied data and linked fields are dynamically allocated.
586	The advantage of intrusive attachment is the control in memory layout and storage placement.
587	Both wrapped attachment styles have independent storage layout and imply library-induced heap allocations, with lifetime that matches the item's membership in the list.
588	In all three cases, a @req@ object can enter and leave a list many times.
589	However, in \subref*{f:Intrusive} a @req@ can only be on one list at a time, unless there are separate link-fields for each simultaneous list.
590	In \subref*{f:WrappedRef}, a @req@ can appear multiple times on the same or different lists simultaneously, but since @req@ is shared via the pointer, care must be taken if updating data also occurs simultaneously, \eg concurrency.
591	In \subref*{f:WrappedValue}, the @req@ is copied, which increases storage usage, but allows independent simultaneous changes;
592	however, knowing which of the @req@ object is the ``true'' object becomes complex.
593	\see*{\VRef{toc:lst:issue:simultaneity} for further discussion.}
594
595	The implementation of @LIST_ENTRY@ uses a trick to find the links and the node containing the links.
596	The macro @LIST_INSERT_HEAD(&reqs, &r2, d);@ takes the list header, a pointer to the node, and the offset of the link fields in the node.
597	One of the fields generated by @LIST_ENTRY@ is a pointer to the node, which is set to the node address, \eg @r2@.
598	Hence, the offset to the link fields provides an access to the entire node, \ie the node points at itself.
599	For list traversal, @LIST_FOREACH(cur, &reqs_pri, by_pri)@, there is the node cursor, the list, and the offset of the link fields within the node.
600	The traversal actually moves from link fields to link fields within a node and sets the node cursor from the pointer within the link fields back to the node.
601
602	A further aspect of layout control is allowing the user to explicitly specify link fields controlling attributes and placement within the @req@ object.
603	LQ allows this ability through the @LIST_ENTRY@ macro\footnote{It is possible to have multiple named linked fields allowing a node to appear on multiple lists simultaneously.};
604	supplying the link fields by inheritance makes them implicit and relies on compiler placement, such as the start or end of @req@.
605	An example of an explicit attribute is cache alignment of the link fields in conjunction with other @req@ fields, improving locality and/or avoiding false sharing.
606	Wrapped reference has no control over the link fields, but the seperate data allows some control;
607	wrapped value has no control over data or links.
608
609	Another subtle advantage of intrusive arrangement is that a reference to a user-level item (@req@) is sufficient to navigate or manage the item's membership.
610	In LQ, \subref*{f:Intrusive}, a @req@ pointer is the right argument type for operations @LIST_NEXT@ or @LIST_REMOVE@;
611	there is no distinguishing a @req@ from ``a @req@ in a list.''
612	The same is not true of STL, \subref{f:WrappedRef} or \subref{f:WrappedValue}.
613	There, the analogous operations work on a parameter of type @list<T>::iterator@;
614	they are @iterator::operator++()@, @iterator::operator*()@, and @list::erase(iterator)@.
615	There is no mapping from @req &@ to @list<req>::iterator@, except for linear search.
616
617	The advantage of wrapped attachment is the abstraction of a data item from its list membership(s).
618	In the wrapped style, the @req@ type can come from a library that serves many independent uses,
619	which generally have no need for listing.
620	Then, a novel use can put @req@ in list, without requiring any upstream change in the @req@ library.
621	In intrusive attachment, the ability to be listed must be planned during the definition of @req@.
622
623	Finally, for wrapper reference a single node can appear at multiple places in the same list or different list, which might be useful in certain read-only cases.
624	For intrusive and wrapper value, a node must be duplicated to appear at multiple locations, presenting additional cost.
625	This scenario becomes difficult to imagine when the nodes are written because three link styles have issues.
626
627	\begin{figure}
628	\lstinput[language=C++]{100-117}{lst-issues-attach-reduction.hpp}
629	\lstinput[language=C++]{150-150}{lst-issues-attach-reduction.hpp}
630	\caption{
631	Reduction of wrapped attachment to intrusive attachment.
632	Illustrated by pseudocode implementation of an STL-compatible API fragment
633	using LQ as the underlying implementation.
634	The gap that makes it pseudocode is that
635	the LQ C macros do not expand to valid C++ when instantiated with template parameters---there is no \lstinline{struct El}.
636	When using a custom-patched version of LQ to work around this issue,
637	the programs of Figure~\ref{f:WrappedRef} and \protect\subref*{f:WrappedValue} work with this shim in place of real STL.
638	Their executions lead to the same memory layouts.
639	}
640	\label{fig:lst-issues-attach-reduction}
641	\end{figure}
642
643	Wrapped attachment has a straightforward reduction to intrusive attachment, illustrated in Figure~\ref{fig:lst-issues-attach-reduction}.
644	This shim layer performs the implicit dynamic allocations that pure intrusion avoids.
645	But there is no reduction going the other way.
646	No shimming can cancel the allocations to which wrapped membership commits.
647
648	So intrusion is a lower-level listing primitive.
649	And so, the system design choice is not between forcing users to use intrusion or wrapping.
650	The choice is whether or not to provide access to an allocation-free layer of functionality.
651	A wrapped-primitive library like STL forces users to incur the costs of wrapping, whether or not they access its benefits.
652	An intrusive-primitive library like LQ lets users choose when to make this tradeoff.
653
654
655	\subsection{Simultaneity: Single vs.\ Multi-Static vs.\ Dynamic}
656	\label{toc:lst:issue:simultaneity}
657
658	\begin{figure}
659	\parbox[t]{3.5in} {
660	\lstinput[language=C++]{20-60}{lst-issues-multi-static.run.c}
661	}\parbox[t]{20in} {
662	~\\
663	\includegraphics[page=1]{lst-issues-direct.pdf} \\
664	~\\
665	\hspace*{1.5in}\includegraphics[page=2]{lst-issues-direct.pdf}
666	}
667	\caption{
668	Example of simultaneity using LQ lists.
669	The zoomed-out diagram (right/top) shows the complete multi-linked data structure.
670	This structure can navigate all requests in priority order, and navigate among requests with a common request value.
671	The zoomed-in diagram (right/bottom) shows how the link fields connect the nodes on different lists.
672	}
673	\label{fig:lst-issues-multi-static}
674	\end{figure}
675
676	\newterm{Simultaneity} deals with the question:
677	In how many different lists can a node be stored, at the same time?
678	Figure~\ref{fig:lst-issues-multi-static} shows an example that can traverse all requests in priority order (field @pri@) or navigate among requests with the same request value (field @rqr@).
679	Each of ``by priority'' and ``by common request value'' is a separate list.
680	For example, there is a single priority-list linked in order [1, 2, 2, 3, 3, 4], where nodes may have the same priority, and there are three common request-value lists combining requests with the same values: [42, 42], [17, 17, 17], and [99], giving four head nodes one for each list.
681	The example shows a list can encompass all the nodes (by-priority) or only a subset of the nodes (three request-value lists).
682
683	As stated, the limitation of intrusive attachment is knowing apriori how many groups of links are needed for the maximum number of simultaneous lists.
684	Thus, the intrusive LQ example supports multiple, but statically many, link lists.
685	Note, it is possible to reuse links for different purposes, \eg if a list in linked one at one time and another way at another time, and these times do not overlap, the two different linkings can use the same link fields.
686	This feature is used in the \CFA runtime where a thread node may be on a blocked or running list, both never on both simultaneously.
687
688	Now consider the STL in the wrapped-reference arrangement of Figure~\ref{f:WrappedRef}.
689	Here it is possible to construct the same simultaneity by creating multiple STL lists, each pointing at the appropriate nodes.
690	Each group of intrusive links become the links for each separate STL list.
691	The upside is the unlimited number of a lists a node can be associated with simultaneously, any number of STL lists can be created dynamically.
692	The downside is the dynamic allocation of the link nodes and manging multiple lists.
693	Note, it might be possible to wrap the multiple lists in another type to hide this implementation issue.
694
695	Now consider the STL in the wrapped-value arrangement of Figure~\ref{f:WrappedValue}.
696	Again, it is possible to construct the same simultaneity by creating multiple STL lists, each copying the appropriate nodes, where the intrusive links become the links for each separate STL list.
697	The upside is the same as for wrapped-reference arrangement with an unlimited number of a list bindings.
698	The downside is the dynamic allocation and significant storage usage due to copying.
699	As well, it is unclear how node updates work in this scenario, without some notation of ultimately merging node information.
700
701	% https://www.geeksforgeeks.org/introduction-to-multi-linked-list/ -- example of building a bespoke multi-linked list out of STL primitives (providing indication that STL doesn't offer one); offers dynamic directionality by embedding `vector<struct node*> pointers;`
702
703	% When allowing multiple static directions, frameworks differ in their ergonomics for
704	% the typical case: when the user needs only one direction, vs.\ the atypical case, when the user needs several.
705	% LQ's ergonomics are well-suited to the uncommon case of multiple list directions.
706	% Its intrusion declaration and insertion operation both use a mandatory explicit parameter naming the direction.
707	% This decision works well in Figure~\ref{fig:lst-issues-multi-static}, where the names @by_pri@ and @by_rqr@ work well,
708	% but it clutters Figure~\ref{f:Intrusive}, where a contrived name must be invented and used.
709	% The example uses @x@; @reqs@ would be a more readily ignored choice. \PAB{wording?}
710
711	\uCpp offers an intrusive list that makes the opposite ergonomic choice. TODO: elaborate on inheritance for first direction and acrobatics for subsequent directions.
712
713	STL may seem to have similar ergonomics to LQ, but in fact, the current ergonomic distinction is not applicable there,
714	where one static direction is enough to achieve multiple dynamic directions.
715	Note that all options in Figure~\ref{fig:lst-issues-attach} have a \emph{variable} named @refs@
716	just as both Figure~\ref{fig:lst-issues-multi-static} and Figure~(TODO~new) have \emph{variables} with names including @pri@ vs @rqr@.
717	But only the intrusive model has this naming showing up within the definition of a structure.
718	This lack of named parts of a structure lets Figure~\ref{fig:lst-issues-attach} \subref{f:WrappedRef} and \subref{f:WrappedValue}, just like \uCpp,
719	insert into a list without mentioning a part's name, while only version \subref*{f:Intrusive} has to mention @x@ at this step.
720	LQ demands this same extraneous part-naming when removing, iterating, and even asking for a neighbour.
721	At issue in this distinction is whether an API that offers multiple static directions (and so requires these to be named differently)
722	allows the sole direction (when several are not wanted) to be \emph{implicit}.
723	\uCpp allows it, LQ does not, and STL does not have this question as applicable.
724
725
726	\subsection{User integration: Preprocessed vs.\ Type-System Mediated}
727
728	% example of poor error message due to LQ's preprocessed integration
729	% programs/lst-issues-multi-static.run.c:46:1: error: expected identifier or '(' before 'do'
730	% 46 \| LIST_INSERT_HEAD(&reqs_rtr_42, &r42b, by_rqr);
731	% \| ^~~~~~~~~~~~~~~~
732	%
733	% ... not a wonderful example; it was a missing semicolon on the preceding line; but at least real
734
735
736	\subsection{List identity: Headed vs.\ Ad-hoc}
737	\label{toc:lst:issue:ident}
738
739	All examples so far have used distinct user-facing types:
740	an item found in a list (type @req@, of variables like @r1@), and
741	a list (type @reql@ or @list<req>@, of variables like @reqs@ or @reqs_rqr_42@).
742	\see{Figure~\ref{fig:lst-issues-attach} and Figure~\ref{fig:lst-issues-multi-static}}
743	The latter type is a head, and these examples are of headed lists.
744
745	A bespoke ``pointer to next @req@'' implementation often omits the latter type.
746	Such a list model is ad-hoc.
747
748	In headed thinking, there are length-zero lists (heads with no elements), and an element can be listed or not listed.
749	In ad-hoc thinking, there are no length-zero lists and every element belongs to a list of length at least one.
750	\PAB{Create a figure for this.}
751
752	By omitting the head, elements can enter into an adjacency relationship,
753	without requiring that someone allocate room for the head of the possibly-resulting list,
754	or being able to find a correct existing head.
755
756	A head defines one or more element roles, among elements that share a transitive adjacency.
757	``First'' and ``last'' are element roles.
758	One moment's ``first'' need not be the next moment's.
759
760	There is a cost to maintaining these roles.
761	A runtime component of this cost is evident in LQ's offering the choice of type generators @LIST@ vs.~@TAILQ@.
762	Its @LIST@ maintains a ``first,'' but not a ``last;'' its @TAILQ@ maintains both roles.
763	(Both types are doubly linked and an analogous choice is available for singly linked.)
764
765	TODO: finish making this point
766
767	See WIP in lst-issues-adhoc-.ignore..
768
769	The code-complexity component of the cost ...
770
771	Ability to offer heads is good. Point: Does maintaining a head mean that the user has to provide more state when manipulating the list? Requiring the user to do so is bad, because the user may have lots of "list" typed variables in scope, and detecting that the user passed the wrong one requires testing all the listing edge cases.
772
773	\subsection{End treatment: Uniform }
774
775
776	\section{String}
777
778	A string is a logical sequence of symbols, where the form of the symbols can vary significantly: 7/8-bit characters (ASCII/Latin-1), or 2/4/8-byte (UNICODE) characters/symbols or variable length (UTF-8/16/32) characters.
779	A string can be read left-to-right, right-to-left, top-to-bottom, and have stacked elements (Arabic).
780
781	An integer character constant is a sequence of one or more multibyte characters enclosed in single-quotes, as in @'x'@.
782	A wide character constant is the same, except prefixed by the letter @L@, @u@, or @U@.
783	Except for escape sequences, the elements of the sequence are any members of the source character set;
784	they are mapped in an implementation-defined manner to members of the execution character set.
785
786	A C character-string literal is a sequence of zero or more multibyte characters enclosed in double-quotes, as in @"xyz"@.
787	A UTF-8 string literal is the same, except prefixed by @u8@.
788	A wide string literal is the same, except prefixed by the letter @L@, @u@, or @U@.
789
790	For UTF-8 string literals, the array elements have type @char@, and are initialized with the characters of the multibyte character sequence, as encoded in UTF-8.
791	For wide string literals prefixed by the letter @L@, the array elements have type @wchar_t@ and are initialized with the sequence of wide characters corresponding to the multibyte character sequence, as defined by the @mbstowcs@ function with an implementation-defined current locale.
792	For wide string literals prefixed by the letter @u@ or @U@, the array elements have type @char16_t@ or @char32_t@, respectively, and are initialized with the sequence of wide characters corresponding to the multibyte character sequence, as defined by successive calls to the @mbrtoc16@, or @mbrtoc32@ function as appropriate for its type, with an implementation-defined current locale.
793	The value of a string literal containing a multibyte character or escape sequence not represented in the execution character set is implementation-defined.
794
795
796	Another bad C design decision is to have null-terminated strings rather than maintaining a separate string length.
797	\begin{quote}
798	Technically, a string is an array whose elements are single characters.
799	The compiler automatically places the null character @\0@ at the end of each such string, so programs can conveniently find the end.
800	This representation means that there is no real limit to how long a string can be, but programs have to scan one completely to determine its length.
801	\end{quote}

Note: See TracBrowser for help on using the repository browser.

Download in other formats: