Context Navigation

source: doc/theses/mike_brooks_MMath/background.tex @ c721105

Last change on this file since c721105 was c721105, checked in by Peter A. Buhr <pabuhr@…>, 6 weeks ago
proofreading changes
Property mode set to `100644`
File size: 45.5 KB

Line
1	\chapter{Background}
2
3	Since this work builds on C, it is necessary to explain the C mechanisms and their shortcomings for array, linked list, and string.
4
5
6	\section{Array}
7
8	At the start, the C programming language made a significant design mistake.
9	\begin{quote}
10	In C, there is a strong relationship between pointers and arrays, strong enough that pointers and arrays really should be treated simultaneously.
11	Any operation which can be achieved by array subscripting can also be done with pointers.~\cite[p.~93]{C:old}
12	\end{quote}
13	Accessing any storage requires pointer arithmetic, even if it is just base-displacement addressing in an instruction.
14	The conjoining of pointers and arrays could also be applied to structures, where a pointer references a structure field like an array element.
15	Finally, while subscripting involves pointer arithmetic (as does field references @x.y.z@), the computation is very complex for multi-dimensional arrays and requires array descriptors to know stride lengths along dimensions.
16	Many C errors result from performing pointer arithmetic instead of using subscripting;
17	some C textbooks teach pointer arithmetic erroneously suggesting it is faster than subscripting.
18	A sound and efficient C program does not require explicit pointer arithmetic.
19
20	C semantics want a programmer to \emph{believe} an array variable is a ``pointer to its first element.''
21	This desire becomes apparent by a detailed inspection of an array declaration.
22	\lstinput{34-34}{bkgd-carray-arrty.c}
23	The inspection begins by using @sizeof@ to provide definite program semantics for the intuition of an expression's type.
24	\lstinput{35-36}{bkgd-carray-arrty.c}
25	Now consider the sizes of expressions derived from @ar@, modified by adding ``pointer to'' and ``first element'' (and including unnecessary parentheses to avoid confusion about precedence).
26	\lstinput{37-40}{bkgd-carray-arrty.c}
27	Given the size of @float@ is 4, the size of @ar@ with 10 floats being 40 bytes is common reasoning for C programmers.
28	Equally, C programmers know the size of a \emph{pointer} to the first array element is 8 (or 4 depending on the addressing architecture).
29	% Now, set aside for a moment the claim that this first assertion is giving information about a type.
30	Clearly, an array and a pointer to its first element are different.
31
32	In fact, the idea that there is such a thing as a pointer to an array may be surprising and it is not the same thing as a pointer to the first element.
33	\lstinput{42-45}{bkgd-carray-arrty.c}
34	The first assignment gets
35	\begin{cfa}
36	warning: assignment to `float ()[10]' from incompatible pointer type `float '
37	\end{cfa}
38	and the second assignment gets the opposite.
39
40	The inspection now refutes any suggestion that @sizeof@ is informing about allocation rather than type information.
41	Note, @sizeof@ has two forms, one operating on an expression and the other on a type.
42	Using the type form yields the same results as the prior expression form.
43	\lstinput{46-49}{bkgd-carray-arrty.c}
44	The results are also the same when there is \emph{no allocation} using a pointer-to-array type.
45	\lstinput{51-57}{bkgd-carray-arrty.c}
46	Hence, in all cases, @sizeof@ is informing about type information.
47
48	So, thinking of an array as a pointer to its first element is too simplistic an analogue and it is not backed up by the type system.
49	This misguided analogue works for a single-dimension array but there is no advantage other than possibly teaching beginning programmers about basic runtime array-access.
50
51	Continuing, a short form for declaring array variables exists using length information provided implicitly by an initializer.
52	\lstinput{59-62}{bkgd-carray-arrty.c}
53	The compiler counts the number of initializer elements and uses this value as the first dimension.
54	Unfortunately, the implicit element counting does not extend to dimensions beyond the first.
55	\lstinput{64-67}{bkgd-carray-arrty.c}
56
57	My contribution is recognizing:
58	\begin{itemize}
59	\item There is value in using a type that knows its size.
60	\item The type pointer to (first) element does not.
61	\item C \emph{has} a type that knows the whole picture: array, e.g. @T[10]@.
62	\item This type has all the usual derived forms, which also know the whole picture.
63	A usefully noteworthy example is pointer to array, e.g. @T (*)[10]@.\footnote{
64	The parenthesis are necessary because subscript has higher priority than pointer in C declarations.
65	(Subscript also has higher priority than dereference in C expressions.)}
66	\end{itemize}
67
68
69	\section{Reading Declarations}
70
71	A significant area of confusion for reading C declarations results from embedding a declared variable in a declaration, mimicking the way the variable is used in executable statements.
72	\begin{cquote}
73	\begin{tabular}{@{}ll@{}}
74	\multicolumn{1}{@{}c}{\textbf{Array}} & \multicolumn{1}{c@{}}{\textbf{Function Pointer}} \\
75	\begin{cfa}
76	int @(*@ar@)[@5@]@; // definition
77	... @(*@ar@)[@3@]@ += 1; // usage
78	\end{cfa}
79	&
80	\begin{cfa}
81	int @(*@f@())[@5@]@ { ... }; // definition
82	... @(*@f@())[@3@]@ += 1; // usage
83	\end{cfa}
84	\end{tabular}
85	\end{cquote}
86	Essentially, the type is wrapped around the name in successive layers (like an \Index{onion}).
87	While attempting to make the two contexts consistent is a laudable goal, it has not worked out in practice, even though Dennis Richie believed otherwise:
88	\begin{quote}
89	In spite of its difficulties, I believe that the C's approach to declarations remains plausible, and am comfortable with it; it is a useful unifying principle.~\cite[p.~12]{Ritchie93}
90	\end{quote}
91	After all, reading a C array type is easy: just read it from the inside out, and know when to look left and when to look right!
92
93	\CFA provides its own type, variable and routine declarations, using a simpler syntax.
94	The new declarations place qualifiers to the left of the base type, while C declarations place qualifiers to the right of the base type.
95	The qualifiers have the same syntax and semantics in \CFA as in C.
96	Then, a \CFA declaration is read left to right, where a function return type is enclosed in brackets @[@\,@]@.
97	\begin{cquote}
98	\begin{tabular}{@{}l@{\hspace{3em}}ll@{}}
99	\multicolumn{1}{c@{\hspace{3em}}}{\textbf{C}} & \multicolumn{1}{c}{\textbf{\CFA}} & \multicolumn{1}{c}{\textbf{read left to right}} \\
100	\begin{cfa}
101	int @*@ x1 @[5]@;
102	int @(*@x2@)[5]@;
103	int @(*@f( int p )@)[5]@;
104	\end{cfa}
105	&
106	\begin{cfa}
107	@[5] *@ int x1;
108	@* [5]@ int x2;
109	@[ * [5] int ]@ f( int p );
110	\end{cfa}
111	&
112	\begin{cfa}
113	// array of 5 pointers to int
114	// pointer to array of 5 int
115	// function returning pointer to array of 5 ints
116	\end{cfa}
117	\\
118	& &
119	\LstCommentStyle{//\ \ \ and taking an int argument}
120	\end{tabular}
121	\end{cquote}
122	As declaration size increases, it becomes corresponding difficult to read and understand the C declaration form, whereas reading and understanding a \CFA declaration has linear complexity as the declaration size increases.
123	Note, writing declarations left to right is common in other programming languages, where the function return-type is often placed after the parameter declarations.
124
125	\VRef[Table]{bkgd:ar:usr:avp} introduces the many layers of the C and \CFA array story, where the \CFA story is discussion in \VRef[Chapter]{c:Array}.
126	The \CFA-thesis column shows the new array declaration form, which is my contributed improvements for safety and ergonomics.
127	The table shows there are multiple yet equivalent forms for the array types under discussion, and subsequent discussion shows interactions with orthogonal (but easily confused) language features.
128	Each row of the table shows alternate syntactic forms.
129	The simplest occurrences of types distinguished in the preceding discussion are marked with $\triangleright$.
130	Removing the declared variable @x@, gives the type used for variable, structure field, cast or error messages \PAB{(though note Section TODO points out that some types cannot be casted to)}.
131	Unfortunately, parameter declarations \PAB{(section TODO)} have more syntactic forms and rules.
132
133	\begin{table}
134	\centering
135	\caption{Syntactic Reference for Array vs Pointer. Includes interaction with \lstinline{const}ness.}
136	\label{bkgd:ar:usr:avp}
137	\begin{tabular}{ll\|l\|l\|l}
138	& Description & \multicolumn{1}{c\|}{C} & \multicolumn{1}{c\|}{\CFA} & \multicolumn{1}{c}{\CFA-thesis} \\
139	\hline
140	$\triangleright$ & value & @T x;@ & @T x;@ & \\
141	\hline
142	& immutable value & @const T x;@ & @const T x;@ & \\
143	& & @T const x;@ & @T const x;@ & \\
144	\hline \hline
145	$\triangleright$ & pointer to value & @T * x;@ & @* T x;@ & \\
146	\hline
147	& immutable ptr. to val. & @T * const x;@ & @const * T x;@ & \\
148	\hline
149	& ptr. to immutable val. & @const T * x;@ & @* const T x;@ & \\
150	& & @T const * x;@ & @* T const x;@ & \\
151	\hline \hline
152	$\triangleright$ & array of value & @T x[10];@ & @[10] T x@ & @array(T, 10) x@ \\
153	\hline
154	& ar.\ of immutable val. & @const T x[10];@ & @[10] const T x@ & @const array(T, 10) x@ \\
155	& & @T const x[10];@ & @[10] T const x@ & @array(T, 10) const x@ \\
156	\hline
157	& ar.\ of ptr.\ to value & @T * x[10];@ & @[10] * T x@ & @array(T *, 10) x@ \\
158	& & & & @array(* T, 10) x@ \\
159	\hline
160	& ar.\ of imm. ptr.\ to val. & @T * const x[10];@ & @[10] const * T x@ & @array(* const T, 10) x@ \\
161	& & & & @array(const * T, 10) x@ \\
162	\hline
163	& ar.\ of ptr.\ to imm. val. & @const T * x[10];@ & @[10] * const T x@ & @array(const T *, 10) x@ \\
164	& & @T const * x[10];@ & @[10] * T const x@ & @array(* const T, 10) x@ \\
165	\hline \hline
166	$\triangleright$ & ptr.\ to ar.\ of value & @T (x)[10];@ & @ [10] T x@ & @* array(T, 10) x@ \\
167	\hline
168	& imm. ptr.\ to ar.\ of val. & @T (* const x)[10];@ & @const * [10] T x@ & @const * array(T, 10) x@ \\
169	\hline
170	& ptr.\ to ar.\ of imm. val. & @const T (x)[10];@ & @ [10] const T x@ & @* const array(T, 10) x@ \\
171	& & @T const (x)[10];@ & @ [10] T const x@ & @* array(T, 10) const x@ \\
172	\hline
173	& ptr.\ to ar.\ of ptr.\ to val. & @T (x)[10];@ & @* [10] * T x@ & @* array(T *, 10) x@ \\
174	& & & & @* array(* T, 10) x@ \\
175	\hline
176	\end{tabular}
177	\end{table}
178
179	TODO: Address these parked unfortunate syntaxes
180	\begin{itemize}
181	\item static
182	\item star as dimension
183	\item under pointer decay: @int p1[const 3]@ being @int const *p1@
184	\end{itemize}
185
186
187	\subsection{Arrays decay and pointers diffract}
188
189	The last section established the difference between these four types:
190	\lstinput{3-6}{bkgd-carray-decay.c}
191	But the expression used for obtaining the pointer to the first element is pedantic.
192	The root of all C programmer experience with arrays is the shortcut
193	\lstinput{8-8}{bkgd-carray-decay.c}
194	which reproduces @pa0@, in type and value:
195	\lstinput{9-9}{bkgd-carray-decay.c}
196	The validity of this initialization is unsettling, in the context of the facts established in the last section.
197	Notably, it initializes name @pa0x@ from expression @ar@, when they are not of the same type:
198	\lstinput{10-10}{bkgd-carray-decay.c}
199
200	So, C provides an implicit conversion from @float[10]@ to @float *@.
201	\begin{quote}
202	Except when it is the operand of the @sizeof@ operator, or the unary @&@ operator, or is a string literal used to
203	initialize an array an expression that has type ``array of \emph{type}'' is converted to an expression with type
204	``pointer to \emph{type}'' that points to the initial element of the array object~\cite[\S~6.3.2.1.3]{C11}
205	\end{quote}
206	This phenomenon is the famous \newterm{pointer decay}, which is a decay of an array-typed expression into a pointer-typed one.
207	It is worthy to note that the list of exception cases does not feature the occurrence of @ar@ in @ar[i]@.
208	Thus, subscripting happens on pointers not arrays.
209
210	Subscripting proceeds first with pointer decay, if needed. Next, \cite[\S~6.5.2.1.2]{C11} explains that @ar[i]@ is treated as if it were @(*((a)+(i)))@.
211	\cite[\S~6.5.6.8]{C11} explains that the addition, of a pointer with an integer type, is defined only when the pointer refers to an element that is in an array, with a meaning of ``@i@ elements away from,'' which is valid if @ar@ is big enough and @i@ is small enough.
212	Finally, \cite[\S~6.5.3.2.4]{C11} explains that the @*@ operator's result is the referenced element.
213	Taken together, these rules illustrate that @ar[i]@ and @i[a]@ mean the same thing!
214
215	Subscripting a pointer when the target is standard inappropriate is still practically well-defined.
216	While the standard affords a C compiler freedom about the meaning of an out-of-bound access, or of subscripting a pointer that does not refer to an array element at all,
217	the fact that C is famously both generally high-performance, and specifically not bound-checked, leads to an expectation that the runtime handling is uniform across legal and illegal accesses.
218	Moreover, consider the common pattern of subscripting on a @malloc@ result:
219	\begin{cfa}
220	float * fs = malloc( 10 * sizeof(float) );
221	fs[5] = 3.14;
222	\end{cfa}
223	The @malloc@ behaviour is specified as returning a pointer to ``space for an object whose size is'' as requested (\cite[\S~7.22.3.4.2]{C11}).
224	But \emph{nothing} more is said about this pointer value, specifically that its referent might \emph{be} an array allowing subscripting.
225
226	Under this assumption, a pointer being subscripted (or added to, then dereferenced) by any value (positive, zero, or negative), gives a view of the program's entire address space, centred around the @p@ address, divided into adjacent @sizeof(p)@ chunks, each potentially (re)interpreted as @typeof(p)@.
227	I call this phenomenon \emph{array diffraction}, which is a diffraction of a single-element pointer into the assumption that its target is in the middle of an array whose size is unlimited in both directions.
228	No pointer is exempt from array diffraction.
229	No array shows its elements without pointer decay.
230
231	A further pointer--array confusion, closely related to decay, occurs in parameter declarations.
232	\cite[\S~6.7.6.3.7]{C11} explains that when an array type is written for a parameter,
233	the parameter's type becomes a type that can be summarized as the array-decayed type.
234	The respective handling of the following two parameter spellings shows that the array-spelled one is really, like the other, a pointer.
235	\lstinput{12-16}{bkgd-carray-decay.c}
236	As the @sizeof(x)@ meaning changed, compared with when run on a similarly-spelled local variable declaration,
237	@gcc@ also gives this code the warning for the first assertion:
238	\begin{cfa}
239	warning: 'sizeof' on array function parameter 'x' will return size of 'float *'
240	\end{cfa}
241	The caller of such a function is left with the reality that a pointer parameter is a pointer, no matter how it is spelled:
242	\lstinput{18-21}{bkgd-carray-decay.c}
243	This fragment gives the warning for the first argument, in the second call.
244	\begin{cfa}
245	warning: 'f' accessing 40 bytes in a region of size 4
246	\end{cfa}
247
248	The shortened parameter syntax @T x[]@ is a further way to spell ``pointer.''
249	Note the opposite meaning of this spelling now, compared with its use in local variable declarations.
250	This point of confusion is illustrated in:
251	\lstinput{23-30}{bkgd-carray-decay.c}
252	Note, \CC gives a warning for the initialization of @cp@.
253	\begin{cfa}
254	warning: ISO C++ forbids converting a string constant to 'char*'
255	\end{cfa}
256	and C gives a warning at the call of @edit@, if @const@ is added to the declaration of @cp@.
257	\begin{cfa}
258	warning: passing argument 1 of 'edit' discards 'const' qualifier from pointer target type
259	\end{cfa}
260	The basic two meanings, with a syntactic difference helping to distinguish,
261	are illustrated in the declarations of @ca@ \vs @cp@,
262	whose subsequent @edit@ calls behave differently.
263	The syntax-caused confusion is in the comparison of the first and last lines,
264	both of which use a literal to initialize an object declared with spelling @T x[]@.
265	But these initialized declarations get opposite meanings,
266	depending on whether the object is a local variable or a parameter.
267
268	In summary, when a function is written with an array-typed parameter,
269	\begin{itemize}
270	\item an appearance of passing an array by value is always an incorrect understanding
271	\item a dimension value, if any is present, is ignored
272	\item pointer decay is forced at the call site and the callee sees the parameter having the decayed type
273	\end{itemize}
274
275	Pointer decay does not affect pointer-to-array types, because these are already pointers, not arrays.
276	As a result, a function with a pointer-to-array parameter sees the parameter exactly as the caller does:
277	\lstinput{32-42}{bkgd-carray-decay.c}
278	\VRef[Table]{bkgd:ar:usr:decay-parm} gives the reference for the decay phenomenon seen in parameter declarations.
279
280	\begin{table}
281	\caption{Syntactic Reference for Decay during Parameter-Passing.
282	Includes interaction with \lstinline{const}ness, where ``immutable'' refers to a restriction on the callee's ability.}
283	\label{bkgd:ar:usr:decay-parm}
284	\centering
285	\begin{tabular}{llllll}
286	& Description & Type & Parameter Declaration & \CFA \\
287	\hline
288	& & & @T * x,@ & @* T x,@ \\
289	$\triangleright$ & pointer to value & @T *@ & @T x[10],@ & @[10] T x,@ \\
290	& & & @T x[],@ & @[] T x,@ \\
291	\hline
292	& & & @T * const x,@ & @const * T x@ \\
293	& immutable ptr.\ to val. & @T * const@ & @T x[const 10],@ & @[const 10] T x,@ \\
294	& & & @T x[const],@ & @[const] T x,@\\
295	\hline
296	& & & @const T * x,@ & @ * const T x,@ \\
297	& & & @T const * x,@ & @ * T const x,@ \\
298	& ptr.\ to immutable val. & @const T *@ & @const T x[10],@ & @[10] const T x,@ \\
299	& & @T const *@ & @T const x[10],@ & @[10] T const x,@ \\
300	& & & @const T x[],@ & @[] const T x,@ \\
301	& & & @T const x[],@ & @[] T const x,@ \\
302	\hline \hline
303	& & & @T (x)[10],@ & @ [10] T x,@ \\
304	$\triangleright$ & ptr.\ to ar.\ of val. & @T(*)[10]@ & @T x[3][10],@ & @[3][10] T x,@ \\
305	& & & @T x[][10],@ & @[][10] T x,@ \\
306	\hline
307	& & & @T x,@ & @ T x,@ \\
308	& ptr.\ to ptr.\ to val. & @T *@ & @T x[10],@ & @[10] * T x,@ \\
309	& & & @T * x[],@ & @[] * T x,@ \\
310	\hline
311	& ptr.\ to ptr.\ to imm.\ val. & @const char *@ & @const char argv[],@ & @[] * const char argv,@ \\
312	& & & \emph{others elided} & \emph{others elided} \\
313	\hline
314	\end{tabular}
315	\end{table}
316
317
318	\subsection{Multi-Dimensional}
319
320	As in the last section, multi-dimensional array declarations are examined.
321	\lstinput{16-18}{bkgd-carray-mdim.c}
322	The significant axis of deriving expressions from @ar@ is now ``itself,'' ``first element'' or ``first grand-element (meaning, first element of first element).''
323	\lstinput{20-44}{bkgd-carray-mdim.c}
324
325
326	\subsection{Lengths may vary, checking does not}
327
328	When the desired number of elements is unknown at compile time, a variable-length array is a solution:
329	\begin{cfa}
330	int main( int argc, const char * argv[] ) {
331	assert( argc == 2 );
332	size_t n = atol( argv[1] );
333	assert( 0 < n );
334	float ar[n];
335	float b[10];
336	// ... discussion continues here
337	}
338	\end{cfa}
339	This arrangement allocates @n@ elements on the @main@ stack frame for @ar@, called a \newterm{variable length array} (VLA), as well as 10 elements in the same stack frame for @b@.
340	The variable-sized allocation of @ar@ is provided by the @alloca@ routine, which bumps the stack pointer.
341	Note, the C standard supports VLAs~\cite[\S~6.7.6.2.4]{C11} as a conditional feature, but the \CC standard does not;
342	both @gcc@ and @g++@ support VLAs.
343	As well, there is misinformation about VLAs, \eg the stack size is limited (small), or VLAs cause stack failures or are inefficient.
344	VLAs exist as far back as Algol W~\cite[\S~5.2]{AlgolW} and are a sound and efficient data type.
345
346	For high-performance applications, the stack size can be fixed and small (coroutines or user-level threads).
347	Here, VLAs can overflow the stack without appropriately sizing the stack, so a heap allocation is used.
348	\begin{cfa}
349	float * ax1 = malloc( sizeof( float[n] ) );
350	float * ax2 = malloc( n * sizeof( float ) ); $\C{// arrays}$
351	float * bx1 = malloc( sizeof( float[1000000] ) );
352	float * bx2 = malloc( 1000000 * sizeof( float ) );
353	\end{cfa}
354
355	Parameter dependency
356
357	Checking is best-effort / unsound
358
359	Limited special handling to get the dimension value checked (static)
360
361
362	\subsection{Dynamically sized, multidimensional arrays}
363
364	In C and \CC, ``multidimensional array'' means ``array of arrays.'' Other meanings are discussed in TODO.
365
366	Just as an array's element type can be @float@, so can it be @float[10]@.
367
368	While any of @float@, @float[10]@ and @float()[10]@ are easy to tell apart from @float@, telling them apart from each other may need occasional reference back to TODO intro section.
369	The sentence derived by wrapping each type in @-[3]@ follows.
370
371	While any of @float[3]@, @float[3][10]@ and @float()[3][10]@ are easy to tell apart from @float[3]@,
372	telling them apart from each other is what it takes to know what ``array of arrays'' really means.
373
374	Pointer decay affects the outermost array only
375
376	TODO: unfortunate syntactic reference with these cases:
377
378	\begin{itemize}
379	\item ar. of ar. of val (be sure about ordering of dimensions when the declaration is dropped)
380	\item ptr. to ar. of ar. of val
381	\end{itemize}
382
383
384	\subsection{Arrays are (but) almost values}
385
386	Has size; can point to
387
388	Can't cast to
389
390	Can't pass as value
391
392	Can initialize
393
394	Can wrap in aggregate
395
396	Can't assign
397
398
399	\subsection{Returning an array is (but) almost possible}
400
401
402	\subsection{The pointer-to-array type has been noticed before}
403
404
405	\section{Linked List}
406
407	Linked-lists are blocks of storage connected using one or more pointers.
408	The storage block is logically divided into data (user payload) and links (list pointers), where the links are the only component used by the list structure.
409	Since the data is opaque, list structures are often polymorphic over the data, which is often homogeneous.
410
411	Storage linking is used to build data structures, which are a group of nodes, containing data and links, organized in a particular format, with specific operations peculiar to that format, \eg queue, tree, hash table, \etc.
412	Because a node's existence is independent of the data structure that organizes it, all nodes are manipulated by address not value;
413	hence, all data structure routines take and return pointers to nodes and not the nodes themselves.
414
415
416	\subsection{Design Issues}
417	\label{toc:lst:issue}
418
419	This section introduces the design space for linked lists that target \emph{system programmers}.
420	Within this restricted target, all design-issue discussions assume the following invariants.
421	Alternatives to the assumptions are discussed under Future Work (Section~\ref{toc:lst:futwork}).
422	\begin{itemize}
423	\item A doubly-linked list is being designed.
424	Generally, the discussed issues apply similarly for singly-linked lists.
425	Circular \vs ordered linking is discussed under List identity (Section~\ref{toc:lst:issue:ident}).
426	\item Link fields are system-managed.
427	The user works with the system-provided API to query and modify list membership.
428	The system has freedom over how to represent these links.
429	\item The user data must provide storage for the list link-fields.
430	Hence, a list node is \emph{statically} defined as data and links \vs a node that is \emph{dynamically} constructed from data and links \see{\VRef{toc:lst:issue:attach}}.
431	\end{itemize}
432
433
434	\subsection{Preexisting Linked-List Libraries}
435
436	Two preexisting linked-list libraries are used throughout, to show examples of the concepts being defined,
437	and further libraries are introduced as needed.
438	\begin{enumerate}
439	\item Linux Queue library\cite{lst:linuxq} (LQ) of @<sys/queue.h>@.
440	\item \CC Standard Template Library's (STL)\footnote{The term STL is contentious as some people prefer the term standard library.} @std::list@\cite{lst:stl}
441	\end{enumerate}
442	%A general comparison of libraries' abilities is given under Related Work (Section~\ref{toc:lst:relwork}).
443	For the discussion, assume the fictional type @req@ (request) is the user's payload in examples.
444	As well, the list library is helping the user manage (organize) requests, \eg a request can be work on the level of handling a network arrival event or scheduling a thread.
445
446
447	\subsection{Link attachment: Intrusive vs.\ Wrapped}
448	\label{toc:lst:issue:attach}
449
450	Link attachment deals with the question:
451	Where are the libraries' inter-element link fields stored, in relation to the user's payload data fields?
452	\VRef[Figure]{fig:lst-issues-attach} shows three basic styles.
453	\VRef[Figure]{f:Intrusive} shows the \newterm{intrusive} style, placing the link fields inside the payload structure.
454	\VRef[Figures]{f:WrappedRef} and \subref*{f:WrappedValue} show the two \newterm{wrapped} styles, which place the payload inside a generic library-provided structure that then defines the link fields.
455	The wrapped style distinguishes between wrapping a reference and wrapping a value, \eg @list<req *>@ or @list<req>@.
456	(For this discussion, @list<req &>@ is similar to @list<req *>@.)
457	This difference is one of user style, not framework capability.
458	Library LQ is intrusive; STL is wrapped with reference and value.
459
460	\begin{comment}
461	\begin{figure}
462	\begin{tabularx}{\textwidth}{Y\|Y\|Y}
463	\lstinput[language=C]{20-39}{lst-issues-intrusive.run.c}
464	&\lstinputlisting[language=C++]{20-39}{lst-issues-wrapped-byref.run.cpp}
465	&\lstinputlisting[language=C++]{20-39}{lst-issues-wrapped-emplaced.run.cpp}
466	\\ & &
467	\\
468	\includegraphics[page=1]{lst-issues-attach.pdf}
469	&
470	\includegraphics[page=2]{lst-issues-attach.pdf}
471	&
472	\includegraphics[page=3]{lst-issues-attach.pdf}
473	\\ & &
474	\\
475	(a) & (b) & (c)
476	\end{tabularx}
477	\caption{
478	Three styles of link attachment: (a)~intrusive, (b)~wrapped reference, and (c)~wrapped value.
479	The diagrams show the memory layouts that result after the code runs, eliding the head object \lstinline{reqs};
480	head objects are discussed in Section~\ref{toc:lst:issue:ident}.
481	In (a), the field \lstinline{req.x} names a list direction;
482	these are discussed in Section~\ref{toc:lst:issue:simultaneity}.
483	In (b) and (c), the type \lstinline{node} represents a system-internal type,
484	which is \lstinline{std::_List_node} in the GNU implementation.
485	(TODO: cite? found in /usr/include/c++/7/bits/stl\_list.h )
486	}
487	\label{fig:lst-issues-attach}
488	\end{figure}
489	\end{comment}
490
491	\begin{figure}
492	\centering
493	\newsavebox{\myboxA} % used with subfigure
494	\newsavebox{\myboxB}
495	\newsavebox{\myboxC}
496
497	\begin{lrbox}{\myboxA}
498	\begin{tabular}{@{}l@{}}
499	\lstinput[language=C]{20-35}{lst-issues-intrusive.run.c} \\
500	\includegraphics[page=1]{lst-issues-attach.pdf}
501	\end{tabular}
502	\end{lrbox}
503
504	\begin{lrbox}{\myboxB}
505	\begin{tabular}{@{}l@{}}
506	\lstinput[language=C++]{20-35}{lst-issues-wrapped-byref.run.cpp} \\
507	\includegraphics[page=2]{lst-issues-attach.pdf}
508	\end{tabular}
509	\end{lrbox}
510
511	\begin{lrbox}{\myboxC}
512	\begin{tabular}{@{}l@{}}
513	\lstinput[language=C++]{20-35}{lst-issues-wrapped-emplaced.run.cpp} \\
514	\includegraphics[page=3]{lst-issues-attach.pdf}
515	\end{tabular}
516	\end{lrbox}
517
518	\subfloat[Intrusive]{\label{f:Intrusive}\usebox\myboxA}
519	\hspace{6pt}
520	\vrule
521	\hspace{6pt}
522	\subfloat[Wrapped reference]{\label{f:WrappedRef}\usebox\myboxB}
523	\hspace{6pt}
524	\vrule
525	\hspace{6pt}
526	\subfloat[Wrapped value]{\label{f:WrappedValue}\usebox\myboxC}
527
528	\caption{
529	Three styles of link attachment:
530	% \protect\subref{f:Intrusive}~intrusive, \protect\subref{f:WrappedRef}~wrapped reference, and \protect\subref*{f:WrappedValue}~wrapped value.
531	The diagrams show the memory layouts that result after the code runs, eliding the head object \lstinline{reqs};
532	head objects are discussed in Section~\ref{toc:lst:issue:ident}.
533	In \protect\subref*{f:Intrusive}, the field \lstinline{req.d} names a list direction;
534	these are discussed in Section~\ref{toc:lst:issue:simultaneity}.
535	In \protect\subref{f:WrappedRef} and \protect\subref{f:WrappedValue}, the type \lstinline{node} represents a
536	library-internal type, which is \lstinline{std::_List_node} in the GNU implementation
537	\see{\lstinline{/usr/include/c++/X/bits/stl_list.h}, where \lstinline{X} is the \lstinline{g++} version number}.
538	}
539	\label{fig:lst-issues-attach}
540	\end{figure}
541
542	Each diagrammed example is using the fewest dynamic allocations for its respective style:
543	in intrusive, here is no dynamic allocation, in wrapped reference only the linked fields are dynamically allocated, and in wrapped value the copied data and linked fields are dynamically allocated.
544	The advantage of intrusive is the control in memory layout and storage placement.
545	Both wrapped styles have independent storage layout and imply library-induced heap allocations, with lifetime that matches the item's membership in the list.
546	In all three cases, a @req@ object can enter and leave a list many times.
547	However, in intrusive a @req@ can only be on one list at a time, unless there are separate link-fields for each simultaneous list.
548	In wrapped reference, a @req@ can appear multiple times on the same or different lists simultaneously, but since @req@ is shared via the pointer, care must be taken if updating data also occurs simultaneously, \eg concurrency.
549	In wrapped value, the @req@ is copied, which increases storage usage, but allows independent simultaneous changes;
550	however, knowing which of the @req@ object is the ``true'' object becomes complex.
551	\see*{\VRef{toc:lst:issue:simultaneity} for further discussion.}
552
553	The implementation of @LIST_ENTRY@ uses a trick to find the links and the node containing the links.
554	The macro @LIST_INSERT_HEAD(&reqs, &r2, d);@ takes the list header, a pointer to the node, and the offset of the link fields in the node.
555	One of the fields generated by @LIST_ENTRY@ is a pointer to the node, which is set to the node address, \eg @r2@.
556	Hence, the offset to the link fields provides an access to the entire node, \ie the node points at itself.
557	For list traversal, @LIST_FOREACH(cur, &reqs_pri, by_pri)@, there is the node cursor, the list, and the offset of the link fields within the node.
558	The traversal actually moves from link fields to link fields within a node and sets the node cursor from the pointer within the link fields back to the node.
559
560	A further aspect of layout control is allowing the user to explicitly specify link fields controlling attributes and placement within the @req@ object.
561	LQ allows this ability through the @LIST_ENTRY@ macro;\footnote{It is possible to have multiple named linked fields allowing a node to appear on multiple lists simultaneously.}
562	supplying the link fields by inheritance makes them implicit and relies on compiler placement, such as the start or end of @req@.
563	An example of an explicit attribute is cache alignment of the link fields in conjunction with other @req@ fields, improving locality and/or avoiding false sharing.
564	Wrapped reference has no control over the link fields, but the separate data allows some control;
565	wrapped value has no control over data or links.
566
567	Another subtle advantage of intrusive arrangement is that a reference to a user-level item (@req@) is sufficient to navigate or manage the item's membership.
568	In LQ, the intrusive @req@ pointer is the right argument type for operations @LIST_NEXT@ or @LIST_REMOVE@;
569	there is no distinguishing a @req@ from ``a @req@ in a list.''
570	The same is not true of STL, wrapped reference or value.
571	There, the analogous operations, @iterator::operator++()@, @iterator::operator*()@, and @list::erase(iterator)@, work on a parameter of type @list<T>::iterator@;
572	There is no mapping from @req &@ to @list<req>::iterator@. %, for linear search.
573
574	The advantage of wrapped is the abstraction of a data item from its list membership(s).
575	In the wrapped style, the @req@ type can come from a library that serves many independent uses,
576	which generally have no need for listing.
577	Then, a novel use can put a @req@ in a list, without requiring any upstream change in the @req@ library.
578	In intrusive, the ability to be listed must be planned during the definition of @req@.
579
580	\begin{figure}
581	\lstinput[language=C++]{100-117}{lst-issues-attach-reduction.hpp}
582	\lstinput[language=C++]{150-150}{lst-issues-attach-reduction.hpp}
583	\caption{
584	Simulation of wrapped using intrusive.
585	Illustrated by pseudocode implementation of an STL-compatible API fragment using LQ as the underlying implementation.
586	The gap that makes it pseudocode is that
587	the LQ C macros do not expand to valid C++ when instantiated with template parameters---there is no \lstinline{struct El}.
588	When using a custom-patched version of LQ to work around this issue,
589	the programs of Figure~\ref{f:WrappedRef} and wrapped value work with this shim in place of real STL.
590	Their executions lead to the same memory layouts.
591	}
592	\label{fig:lst-issues-attach-reduction}
593	\end{figure}
594
595	It is possible to simulate wrapped using intrusive, illustrated in Figure~\ref{fig:lst-issues-attach-reduction}.
596	This shim layer performs the implicit dynamic allocations that pure intrusion avoids.
597	But there is no reduction going the other way.
598	No shimming can cancel the allocations to which wrapped membership commits.
599
600	Because intrusion is a lower-level listing primitive, the system design choice is not between forcing users to use intrusion or wrapping.
601	The choice is whether or not to provide access to an allocation-free layer of functionality.
602	An intrusive-primitive library like LQ lets users choose when to make this tradeoff.
603	A wrapped-primitive library like STL forces users to incur the costs of wrapping, whether or not they access its benefits.
604
605
606	\subsection{Simultaneity: Single vs.\ Multi-Static vs.\ Dynamic}
607	\label{toc:lst:issue:simultaneity}
608
609	\begin{figure}
610	\parbox[t]{3.5in} {
611	\lstinput[language=C++]{20-60}{lst-issues-multi-static.run.c}
612	}\parbox[t]{20in} {
613	~\\
614	\includegraphics[page=1]{lst-issues-direct.pdf} \\
615	~\\
616	\hspace*{1.5in}\includegraphics[page=2]{lst-issues-direct.pdf}
617	}
618	\caption{
619	Example of simultaneity using LQ lists.
620	The zoomed-out diagram (right/top) shows the complete multi-linked data structure.
621	This structure can navigate all requests in priority order ({\color{blue}blue}), and navigate among requests with a common request value ({\color{orange}orange}).
622	The zoomed-in diagram (right/bottom) shows how the link fields connect the nodes on different lists.
623	}
624	\label{fig:lst-issues-multi-static}
625	\end{figure}
626
627	\newterm{Simultaneity} deals with the question:
628	In how many different lists can a node be stored, at the same time?
629	Figure~\ref{fig:lst-issues-multi-static} shows an example that can traverse all requests in priority order (field @pri@) or navigate among requests with the same request value (field @rqr@).
630	Each of ``by priority'' and ``by common request value'' is a separate list.
631	For example, there is a single priority-list linked in order [1, 2, 2, 3, 3, 4], where nodes may have the same priority, and there are three common request-value lists combining requests with the same values: [42, 42], [17, 17, 17], and [99], giving four head nodes one for each list.
632	The example shows a list can encompass all the nodes (by-priority) or only a subset of the nodes (three request-value lists).
633
634	As stated, the limitation of intrusive is knowing apriori how many groups of links are needed for the maximum number of simultaneous lists.
635	Thus, the intrusive LQ example supports multiple, but statically many, link lists.
636	Note, it is possible to reuse links for different purposes, \eg if a list in linked one way at one time and another way at another time, and these times do not overlap, the two different linkings can use the same link fields.
637	This feature is used in the \CFA runtime where a thread node may be on a blocked or running list, both never on both simultaneously.
638
639	Now consider the STL in the wrapped-reference arrangement of Figure~\ref{f:WrappedRef}.
640	Here it is possible to construct the same simultaneity by creating multiple STL lists, each pointing at the appropriate nodes.
641	Each group of intrusive links become the links for each separate STL list.
642	The upside is the unlimited number of a lists a node can be associated with simultaneously, any number of STL lists can be created dynamically.
643	The downside is the dynamic allocation of the link nodes and managing multiple lists.
644	Note, it might be possible to wrap the multiple lists in another type to hide this implementation issue.
645
646	Now consider the STL in the wrapped-value arrangement of Figure~\ref{f:WrappedValue}.
647	Again, it is possible to construct the same simultaneity by creating multiple STL lists, each copying the appropriate nodes, where the intrusive links become the links for each separate STL list.
648	The upside is the same as for wrapped-reference arrangement with an unlimited number of a list bindings.
649	The downside is the dynamic allocation and significant storage usage due to node copying.
650	As well, it is unclear how node updates work in this scenario, without some notation of ultimately merging node information.
651
652	% https://www.geeksforgeeks.org/introduction-to-multi-linked-list/ -- example of building a bespoke multi-linked list out of STL primitives (providing indication that STL doesn't offer one); offers dynamic directionality by embedding `vector<struct node*> pointers;`
653
654	% When allowing multiple static directions, frameworks differ in their ergonomics for
655	% the typical case: when the user needs only one direction, vs.\ the atypical case, when the user needs several.
656	% LQ's ergonomics are well-suited to the uncommon case of multiple list directions.
657	% Its intrusion declaration and insertion operation both use a mandatory explicit parameter naming the direction.
658	% This decision works well in Figure~\ref{fig:lst-issues-multi-static}, where the names @by_pri@ and @by_rqr@ work well,
659	% but it clutters Figure~\ref{f:Intrusive}, where a contrived name must be invented and used.
660	% The example uses @x@; @reqs@ would be a more readily ignored choice. \PAB{wording?}
661
662	\uCpp is a concurrent extension of \CC, which provides a basic set of intrusive lists~\cite[appx.~F]{uC++}, where the link fields are defined with the data fields using inheritance.
663	\begin{cquote}
664	\setlength{\tabcolsep}{15pt}
665	\begin{tabular}{@{}ll@{}}
666	\multicolumn{1}{c}{singly-linked list} & \multicolumn{1}{c}{doubly-linked list} \\
667	\begin{c++}
668	struct Node : public uColable {
669	int i; // data
670	Node( int i ) : i{ i } {}
671	};
672	\end{c++}
673	&
674	\begin{c++}
675	struct Node : public uSeqable {
676	int i; // data
677	Node( int i ) : i{ i } {}
678	};
679	\end{c++}
680	\end{tabular}
681	\end{cquote}
682	A node can be placed in the following data structures depending on its link fields: @uStack@ and @uQueue@ (singly linked), and @uSequence@ (doubly linked).
683	A node inheriting from @uSeqable@ can appear in a singly or doubly linked structure.
684	Structure operations implicitly know the link-field location through the inheritance.
685	\begin{c++}
686	uStack<Node> stack;
687	Node node;
688	stack.push( node ); // link fields at beginning of node
689	\end{c++}
690
691	Simultaneity cannot be done with multiple inheritance, because there is no mechanism to either know the order of inheritance fields or name each inheritance.
692	Instead, a special type is require that contains the link fields and points at the node.
693	\begin{cquote}
694	\setlength{\tabcolsep}{10pt}
695	\begin{tabular}{@{}ll@{}}
696	\begin{c++}
697	struct NodeDL : public uSeqable {
698	@Node & node;@ // node pointer
699	NodeDL( Node & node ) : node( node ) {}
700	Node & get() const { return node; }
701	};
702	\end{c++}
703	&
704	\begin{c++}
705	struct Node : public uColable {
706	int i; // data
707	@NodeDL nodeseq;@ // embedded intrusive links
708	Node( int i ) : i{ i }, @nodeseq{ this }@ {}
709	};
710	\end{c++}
711	\end{tabular}
712	\end{cquote}
713	This node can now be inserted into a doubly-linked list through the embedded intrusive links.
714	\begin{c++}
715	uSequence<NodeDL> sequence;
716	sequence.add_front( node.nodeseq ); $\C{// link fields in embedded type}$
717	NodeDL nodedl = sequence.remove( node.nodeseq );
718	int i = nodedl.get().i; $\C{// indirection to node}$
719	\end{c++}
720	Hence, the \uCpp approach optimizes one set of intrusive links through the \CC inheritance mechanism, and falls back onto the LQ approach of explicit declarations for additional intrusive links.
721	However, \uCpp cannot apply the LQ trick for finding the links and node.
722
723	The major ergonomic difference among the approaches is naming and name usage.
724	The intrusive model requires naming each set of intrusive links, \eg @by_pri@ and @by_rqr@ in \VRef[Figure]{fig:lst-issues-multi-static}.
725	\uCpp cheats by using inheritance for the first intrusive links, after which explicit naming of intrusive links is required.
726	Furthermore, these link names must be used in all list operations, including iterating, whereas wrapped reference and value hide the list names in the implicit dynamically-allocated node-structure.
727	At issue is whether an API for simultaneity can support one list (when several are not wanted) to be \emph{implicit}.
728	\uCpp allows it, LQ does not, and the STL does not have this question.
729
730
731	\subsection{User integration: Preprocessed vs.\ Type-System Mediated}
732
733	% example of poor error message due to LQ's preprocessed integration
734	% programs/lst-issues-multi-static.run.c:46:1: error: expected identifier or '(' before 'do'
735	% 46 \| LIST_INSERT_HEAD(&reqs_rtr_42, &r42b, by_rqr);
736	% \| ^~~~~~~~~~~~~~~~
737	%
738	% ... not a wonderful example; it was a missing semicolon on the preceding line; but at least real
739
740
741	\subsection{List identity: Headed vs.\ Ad-hoc}
742	\label{toc:lst:issue:ident}
743
744	All examples so far have used distinct user-facing types:
745	an item found in a list (type @req@, of variables like @r1@), and
746	a list (type @reql@ or @list<req>@, of variables like @reqs@ or @reqs_rqr_42@).
747	\see{Figure~\ref{fig:lst-issues-attach} and Figure~\ref{fig:lst-issues-multi-static}}
748	The latter type is a head, and these examples are of headed lists.
749
750	A bespoke ``pointer to next @req@'' implementation often omits the latter type.
751	Such a list model is ad-hoc.
752
753	In headed thinking, there are length-zero lists (heads with no elements), and an element can be listed or not listed.
754	In ad-hoc thinking, there are no length-zero lists and every element belongs to a list of length at least one.
755	\PAB{Create a figure for this.}
756
757	By omitting the head, elements can enter into an adjacency relationship,
758	without requiring that someone allocate room for the head of the possibly-resulting list,
759	or being able to find a correct existing head.
760
761	A head defines one or more element roles, among elements that share a transitive adjacency.
762	``First'' and ``last'' are element roles.
763	One moment's ``first'' need not be the next moment's.
764
765	There is a cost to maintaining these roles.
766	A runtime component of this cost is evident in LQ's offering the choice of type generators @LIST@ vs.~@TAILQ@.
767	Its @LIST@ maintains a ``first,'' but not a ``last;'' its @TAILQ@ maintains both roles.
768	(Both types are doubly linked and an analogous choice is available for singly linked.)
769
770	TODO: finish making this point
771
772	See WIP in lst-issues-adhoc-.ignore..
773
774	The code-complexity component of the cost ...
775
776	Ability to offer heads is good. Point: Does maintaining a head mean that the user has to provide more state when manipulating the list? Requiring the user to do so is bad, because the user may have lots of "list" typed variables in scope, and detecting that the user passed the wrong one requires testing all the listing edge cases.
777
778	\subsection{End treatment: Uniform }
779
780
781	\section{String}
782
783	A string is a sequence of symbols, where the form of a symbol can vary significantly: 7/8-bit characters (ASCII/Latin-1), or 2/4/8-byte (UNICODE) characters/symbols or variable length (UTF-8/16/32) characters.
784	A string can be read left-to-right, right-to-left, top-to-bottom, and have stacked elements (Arabic).
785
786	A C character constant is an ASCII/Latin-1 character enclosed in single-quotes, \eg @'x'@, @'@\textsterling@'@.
787	A wide C character constant is the same, except prefixed by the letter @L@, @u@, or @U@, \eg @u'\u25A0'@ (black square), where the @\u@ identifies a universal character name.
788	A character can be formed from an escape sequence, which expresses a non-typable character @'\f'@ form feed, a delimiter character @'\''@ embedded single quote, or a raw character @'\xa3'@ \textsterling.
789
790	A C character string is zero or more regular, wide, or escape characters enclosed in double-quotes @"xyz\n"@.
791	The kind of characters in the string is denoted by a prefix: UTF-8 characters are prefixed by @u8@, wide characters are prefixed by @L@, @u@, or @U@.
792
793	For UTF-8 string literals, the array elements have type @char@ and are initialized with the characters of the multibyte character sequences, \eg @u8"\xe1\x90\x87"@ (Canadian syllabics Y-Cree OO).
794	For wide string literals prefixed by the letter @L@, the array elements have type @wchar_t@ and are initialized with the wide characters corresponding to the multibyte character sequence, \eg @L"abc@$\mu$@"@ and are read/printed using @wsanf@/@wprintf@.
795	The value of a wide-character is implementation-defined, usually a UTF-16 character.
796	For wide string literals prefixed by the letter @u@ or @U@, the array elements have type @char16_t@ or @char32_t@, respectively, and are initialized with wide characters corresponding to the multibyte character sequence, \eg @u"abc@$\mu$@"@, @U"abc@$\mu$@"@.
797	The value of a @"u"@ character is an UTF-16 character;
798	the value of a @"U"@ character is an UTF-32 character.
799	The value of a string literal containing a multibyte character or escape sequence not represented in the execution character set is implementation-defined.
800
801	C strings are null-terminated rather than maintaining a separate string length.
802	\begin{quote}
803	Technically, a string is an array whose elements are single characters.
804	The compiler automatically places the null character @\0@ at the end of each such string, so programs can conveniently find the end.
805	This representation means that there is no real limit to how long a string can be, but programs have to scan one completely to determine its length.
806	\end{quote}
807	Unfortunately, this design decision is both unsafe and inefficient.
808	It is common error in C to forget the space in a character array for the terminator or overwrite the terminator, resulting in array overruns in string operations.
809	The need to repeatedly scan an entire string to determine its length can result in significant cost, as it is impossible to cache the length in many cases.
810
811	C strings are fixed size because arrays are used for the implementation.
812	However, string manipulation commonly results in dynamically-sized temporary and final string values.
813	As a result, storage management for C strings is a nightmare, quickly resulting in array overruns and incorrect results.
814
815	Collectively, these design decisions make working with strings in C, awkward, time consuming, and unsafe.
816	While there are companion string routines that take the maximum lengths of strings to prevent array overruns, that means the semantics of the operation can fail because strings are truncated.
817	Suffice it to say, C is not a go-to language for string applications, which is why \CC introduced the @string@ type.

Note: See TracBrowser for help on using the repository browser.

Download in other formats: