Context Navigation

source: doc/theses/mike_brooks_MMath/list.tex@ 81e1984b

Visit:

Last change on this file since 81e1984b was 39ffa5e, checked in by Peter A. Buhr <pabuhr@…>, 7 weeks ago
fix macro name for printing uC++
Property mode set to `100644`
File size: 59.3 KB

Line
1	\chapter{Linked List}
2
3	This chapter presents my work on designing and building a linked-list library for \CFA.
4	Due to time limitations and the needs expressed by the \CFA runtime developers, I focussed on providing a doubly-linked list, and its bidirectionally iterators for traversal.
5	Simpler data-structures, like stack and queue, can be built from the doubly-linked mechanism with only a slight storage/performance cost because of the unused link field.
6	Reducing to data-structures with a single link follows directly from the more complex doubly-links and its iterators.
7
8
9	\section{Plan-9 Inheritance}
10	\label{s:Plan9Inheritance}
11
12	This chapter uses a form of inheritance from the Plan-9 C dialect~\cite[\S~3.3]{Thompson90new}, which is supported by @gcc@ and @clang@ using @-fplan9-extensions@.
13	\CFA has its own variation of the Plan-9 mechanism, where the nested type is denoted using @inline@.
14	\begin{cfa}
15	union U { int x; double y; char z; };
16	struct W { int i; double j; char k; };
17	struct S {
18	@inline@ struct W; $\C{// extended Plan-9 inheritance}$
19	unsigned int tag;
20	@inline@ U; $\C{// extended Plan-9 inheritance}$
21	} s;
22	\end{cfa}
23	Inline inheritance is containment, where the inlined field is unnamed and the type's internal fields are hoisted into the containing structure.
24	Hence, the field names must be unique, unlike \CC nested types, but any inlined type-names are at a nested scope level, unlike aggregate nesting in C.
25	Note, the position of the containment is normally unimportant, unless there is some form of memory or @union@ overlay.
26	Finally, the inheritance declaration of @U@ is not prefixed with @union@.
27	Like \CC, \CFA allows optional prefixing of type names with their kind, \eg @struct@, @union@, and @enum@, unless there is ambiguity with variable names in the same scope.
28
29	\VRef[Figure]{f:Plan9Polymorphism} shows the key polymorphic feature of Plan-9 inheritance: implicit conversion of values and pointers for nested types.
30	In the example, there are implicit conversions from @S@ to @U@ and @S@ to @W@, extracting the appropriate value or pointer for the substructure.
31	\VRef[Figure]{f:DiamondInheritancePattern} shows complex multiple inheritance patterns are possible, like the \newterm{diamond pattern}~\cite[\S~6.1]{Stroustrup89}\cite[\S~4]{Cargill91}.
32	Currently, \CFA type-system does not support @virtual@ inheritance.
33
34	\begin{figure}
35	\begin{cfa}
36	void f( U, U * ); $\C{// value, pointer}$
37	void g( W, W * ); $\C{// value, pointer}$
38	U u, * up; W w, * wp; S s, * sp;
39	u = s; up = sp; $\C{// value, pointer}$
40	w = s; wp = sp; $\C{// value, pointer}$
41	f( s, &s ); $\C{// value, pointer}$
42	g( s, &s ); $\C{// value, pointer}$
43	\end{cfa}
44	\caption{Plan-9 Polymorphism}
45	\label{f:Plan9Polymorphism}
46	\end{figure}
47
48	\begin{figure}
49	\setlength{\tabcolsep}{10pt}
50	\begin{tabular}{ll@{}}
51	\multicolumn{1}{c}{\CC} & \multicolumn{1}{c}{\CFA} \\
52	\begin{c++}
53	struct B { int b; };
54	struct L : @public B@ { int l, w; };
55	struct R : @public B@ { int r, w; };
56	struct T : @public L, R@ { int t; };
57
58	T t = { { { 5 }, 3, 4 }, { { 8 }, 6, 7 }, 3 };
59	t.t = 42;
60	t.l = 42;
61	t.r = 42;
62	((L &)t).b = 42; // disambiguate
63	\end{c++}
64	&
65	\begin{cfa}
66	struct B { int b; };
67	struct L { @inline B;@ int l, w; };
68	struct R { @inline B;@ int r, w; };
69	struct T { @inline L; inline R;@ int t; };
70
71	T t = { { { 5 }, 3, 4 }, { { 8 }, 6, 7 }, 3 };
72	t.t = 42;
73	t.l = 42;
74	t.r = 42;
75	((L &)t).b = 42; // disambiguate, proposed solution t.L.b = 42;
76	\end{cfa}
77	\end{tabular}
78	\caption{Diamond Non-Virtual Inheritance Pattern}
79	\label{f:DiamondInheritancePattern}
80	\end{figure}
81
82
83	\section{Features}
84
85	The following features directed this project, where the goal is high-performance list operations required by \CFA runtime components, like the threading library.
86
87
88	\subsection{Core Design Issues}
89
90	The doubly-linked list attaches links intrusively, supports multiple link directions, integrates with user code via the type system, treats its ends uniformly, and identifies a list using an explicit head.
91	This design covers system and data management issues stated in \VRef{toc:lst:issue}.
92
93	\VRef[Figure]{fig:lst-features-intro} continues the running @req@ example from \VRef[Figure]{fig:lst-issues-attach} using the \CFA list @dlist@.
94	The \CFA link attachment is intrusive so the resulting memory layout is per user node, as for the LQ version of \VRef[Figure]{f:Intrusive}.
95	The \CFA framework provides generic type @dlink( T )@ (not to be confused with @dlist@) for the two link fields (front and back).
96	A user inserts the links into the @req@ structure via \CFA inline-inheritance \see{\VRef{s:Plan9Inheritance}}.
97	Lists leverage the automatic conversion of a pointer to anonymous inline field for assignments and function calls.
98	Therefore, a reference to a @req@ is implicitly convertible to @dlink@ in both contexts.
99	The links in @dlist@ point at another embedded @dlist@ node, know the offsets of all links (data is abstract but accessible), and any field-offset arithmetic or link-value changes are safe and abstract.
100
101	\begin{figure}
102	\lstinput{20-30}{lst-features-intro.run.cfa}
103	\caption[Multiple link directions in \CFA list library]{
104	Demonstration of the running \lstinline{req} example, done using the \CFA list library.
105	This example is equivalent to the three approaches in \VRef[Figure]{fig:lst-issues-attach}.
106	}
107	\label{fig:lst-features-intro}
108	\end{figure}
109
110	\VRef[Figure]{f:dlistOutline} shows the outline for types @dlink@ and @dlist@.
111	Note, the first @forall@ clause is distributed across all the declaration in its containing block, eliminating repetition on each declaration.
112	The second nested @forall@ on @dlist@ is also distributed and adds an optional second type parameter, @tLinks@, denoting the linking axis \see{\VRef{s:Axis}}, \ie the kind of list this node can appear on.
113
114	\begin{figure}
115	\begin{cfa}
116	forall( tE & ) { $\C{// distributed}$
117	struct @dlink@ { ... }; $\C{// abstract type}$
118	static inline void ?{}( dlink( tE ) & this ); $\C{// constructor}$
119
120	forall( tLinks & = dlink( tE ) ) { $\C{// distributed, default type for axis}$
121	struct @dlist@ { ... }; $\C{// abstract type}$
122	static inline void ?{}( dlist( tE, tLinks ) & this ); $\C{// constructor}$
123	}
124	}
125	\end{cfa}
126	\caption{\lstinline{dlisk} / \lstinline{dlist} Outline}
127	\label{f:dlistOutline}
128	\end{figure}
129
130	\VRef[Figure]{fig:lst-features-multidir} shows how the \CFA library supports multi-inline links, so a node can be on one or more lists simultaneously.
131	The declaration of @req@ has two inline-inheriting @dlink@ occurrences.
132	The first of these gives a type named @req.by_pri@, @req@ inherits from it, and it inherits from @dlink@.
133	The second line @req.by_rqr@ is similar to @req.by_pri@.
134	Thus, there is a diamond, non-virtual, inheritance from @req@ to @dlink@, with @by_pri@ and @by_rqr@ being the mid-level types \see{\VRef[Figure]{f:DiamondInheritancePattern}}.
135
136	The declarations of the list-head objects, @reqs_pri@, @reqs_rqr_42@, @reqs_rqr_17@, and @reqs_rqr_99@, bind which link nodes in @req@ are used by this list.
137	Hence, the type of the variable @reqs_pri@, @dlist(req, req.by_pri)@, means operations called on @reqs_pri@ implicitly select (disambiguate) the correct @dlink@s, \eg the calls @insert_first(reqs_pri, ...)@ imply, ``here, we are working by priority.''
138	As in \VRef[Figure]{fig:lst-issues-multi-static}, three lists are constructed, a priority list containing all nodes, a list with only nodes containing the value 42, and a list with only nodes containing the value 17.
139
140	\begin{figure}
141	\centering
142	\begin{tabular}{@{}ll@{}}
143	\begin{tabular}{@{}l@{}}
144	\lstinput{20-31}{lst-features-multidir.run.cfa} \\
145	\lstinput{43-71}{lst-features-multidir.run.cfa}
146	\end{tabular}
147	&
148	\lstinput[language=C++]{20-60}{lst-issues-multi-static.run.c}
149	\end{tabular}
150
151	\caption{
152	Demonstration of multiple static link directions done in the \CFA list library.
153	The right example is from \VRef[Figure]{fig:lst-issues-multi-static}.
154	The left \CFA example does the same job.
155	}
156	\label{fig:lst-features-multidir}
157	\end{figure}
158
159	The list library also supports the common case of single directionality more naturally than LQ.
160	Returning to \VRef[Figure]{fig:lst-features-intro}, the single-direction list has no contrived name for the link direction as it uses the default type in the definition of @dlist@;
161	in contrast, the LQ list in \VRef[Figure]{f:Intrusive} adds the unnecessary field name @d@.
162	In \CFA, a single direction list sets up a single inheritance with @dlink@, and the default list axis is to itself.
163
164	When operating on a list with several directions and operations that do not take the list head, the list axis can be ambiguous.
165	For example, a call like @insert_after( r1, r2 )@ does not have enough information to know which axes to select implicitly.
166	Is @r2@ supposed to be the next-priority request after @r1@, or is @r2@ supposed to join the same-requester list of @r1@?
167	As such, the \CFA type-system gives an ambiguity error for this call.
168	There are multiple ways to resolve the ambiguity.
169	The simplest is an explicit cast on each call to select the specific axis, \eg @insert_after( (by_pri)r1, r2 )@.
170	However, multiple explicit casts are tedious and error-prone.
171	To mitigate this issue, the list library provides a hook for applying the \CFA language's scoping and priority rules.
172	\begin{cfa}
173	with ( DLINK_VIA( req, req.pri ) ) insert_after( r1, r2 );
174	\end{cfa}
175	Here, the @with@ statement opens the scope of the object type for the expression;
176	hence, the @DLINK_VIA@ result causes one of the list directions to become a more attractive candidate to \CFA's overload resolution.
177	This boost can be applied across multiple statements in a block or an entire function body.
178	\begin{cquote}
179	\setlength{\tabcolsep}{15pt}
180	\begin{tabular}{@{}ll@{}}
181	\begin{cfa}
182	@with( DLINK_VIA( req, req.pri ) ) {@
183	... insert_after( r1, r2 ); ...
184	@}@
185	\end{cfa}
186	&
187	\begin{cfa}
188	void f() @with( DLINK_VIA( req, req.pri ) ) {@
189	... insert_after( r1, r2 ); ...
190	@}@
191	\end{cfa}
192	\end{tabular}
193	\end{cquote}
194	Within the @with@, the code acts as if there is only one list direction.
195
196	Unlike the \CC template container-types, the \CFA library works completely within the type system;
197	both @dlink@ and @dlist@ are ordinary types, not language macros.
198	There is no textual expansion other than header-included static-inline function for performance.
199	Hence, errors in user code are reported only with mention of the library's declarations.
200	Finally, the library is separately compiled from the usage code, modulo inlining.
201
202	The \CFA library works in headed and headless modes.
203	\PAB{TODO: elaborate.}
204
205
206	\section{List API}
207
208	\VRef[Figure]{f:ListAPI} shows the API for the doubly-link list operations, where each is explained.
209	\begin{itemize}[leftmargin=*]
210	\item
211	@isListed@ returns true if the node is an element of a list and false otherwise.
212	\item
213	@isEmpty@ returns true if the list has no nodes and false otherwise.
214	\item
215	@first@ returns a reference to the first node of the list without removing it or @0p@ if the list is empty.
216	\item
217	@last@ returns a reference to the last node of the list without removing it or @0p@ if the list is empty.
218	\item
219	@insert_before@ adds a node before a specified node \see{\lstinline{insert_last} for insertion at the end}\footnote{
220	Some list packages allow \lstinline{0p} (\lstinline{nullptr}) for the before/after node implying insert/remove at the start/end of the list, respectively.
221	However, this inserts an \lstinline{if} statement in the fastpath of a potentially commonly used list operation.}
222	\item
223	@insert_after@ adds a node after a specified node \see{\lstinline{insert_first} for insertion at the start}\footnotemark[\value{footnote}].
224	\item
225	@remove@ removes a specified node from the list (any location) and returns a reference to the node.
226	\item
227	@iter@ create an iterator for the list.
228	\item
229	@recede@ returns true if the iterator cursor is advanced to the previous (predecessor, towards first) node before the prior cursor node and false otherwise.
230	\item
231	@advance@ returns true if the iterator cursor is advanced to the next (successor, towards last) node after the prior cursor node and false otherwise.
232	\item
233	@isFirst@ returns true if the node is the first node in the list and false otherwise.
234	\item
235	@isLast@ returns true if the node is the last node in the list and false otherwise.
236	\item
237	@pred@ returns a reference to the previous (predecessor, towards first) node before a specified node or @0p@ if a specified node is the first node in the list.
238	\item
239	@next@ returns a reference to the next (successor, towards last) node after a specified node or @0p@ if a specified node is the last node in the list.
240	\item
241	@insert_first@ adds a node to the start of the list so it becomes the first node and returns a reference to the node for cascading.
242	\item
243	@insert_last@ adds a node to the end of the list so it becomes the last node and returns returns a reference to the node for cascading.
244	\item
245	@remove_first@ removes the first node and returns a reference to it or @0p@ if the list is empty.
246	\item
247	@remove_last@ removes the last node and returns a reference to it or @0p@ if the list is empty.
248	\item
249	@transfer@ transfers all nodes from the @from@ list to the end of the @to@ list; the @from@ list is empty after the transfer.
250	\item
251	@split@ transfers the @from@ list up to node to the end of the @to@ list; the @from@ list becomes the list after the node.
252	The node must be in the @from@ list.
253	\end{itemize}
254
255	\begin{figure}
256	\begin{cfa}
257	E & isListed( E & node );
258	E & isEmpty( dlist( E ) & list );
259	E & first( dlist( E ) & list );
260	E & last( dlist( E ) & list );
261	E & insert_before( E & before, E & node );
262	E & insert_after( E & after, E & node );
263	E & remove( E & node );
264	E & iter( dlist( E ) & list );
265	bool advance( E && refx );
266	bool recede( E && refx );
267	bool isFirst( E & node );
268	bool isLast( E & node );
269	E & prev( E & node );
270	E & next( E & node );
271	E & insert_first( dlist( E ) & list, E & node );
272	E & insert_last( dlist( E ) & list, E & node );
273	E & remove_first( dlist( E ) & list );
274	E & remove_last( dlist( E ) & list );
275	void transfer( dlist( E ) & to, dlist( E ) & from ) {
276	void split( dlist( E ) & to, dlist( E ) & from, E & node ) {
277	\end{cfa}
278	\caption{\CFA List API}
279	\label{f:ListAPI}
280	\end{figure}
281
282
283	\subsection{Iteration}
284
285	It is possible to iterate through a list manually or using a set of standard macros.
286	\VRef[Figure]{f:IteratorTemple} shows the iterator template, managing a list of nodes, used throughout the following iterator examples.
287	Each example assumes its loop body prints the value in the current node.
288
289	\begin{figure}
290	\begin{cfa}
291	#include <fstream.hfa>
292	#include <list.hfa>
293
294	struct node {
295	int v;
296	inline dlink(node);
297	};
298	int main() {
299	dlist(node) list;
300	node n1 = { 1 }, n2 = { 2 }, n3 = { 3 }, n4 = { 4 };
301	insert( list, n1 ); insert( list, n2 ); insert( list, n3 ); insert( list, n4 );
302	sout \| nlOff;
303	for ( ... ) @sout \| it.v \| ","; sout \| nl;@ // iterator examples in text
304	remove( n1 ); remove( n2 ); remove( n3 ); remove( n4 );
305	}
306	\end{cfa}
307	\caption{Iterator Temple}
308	\label{f:IteratorTemple}
309	\end{figure}
310
311	The manual method is low level but allows complete control of the iteration.
312	The list cursor (index) can be either a pointer or a reference to a node in the list.
313	The choice depends on how the programmer wants to access the fields: @it->f@ or @it.f@.
314	The following examples use a reference because the loop body manipulates the node values rather than the list pointers.
315	The end of iteration is denoted by the loop cursor returning @0p@.
316
317	\noindent
318	Iterating forward and reverse through the entire list.
319	\begin{cquote}
320	\setlength{\tabcolsep}{15pt}
321	\begin{tabular}{@{}l\|l@{}}
322	\begin{cfa}
323	for ( node & it = @first@( list ); &it /* != 0p */ ; &it = &@next@( it ) ...
324	for ( node & it = @last@( list ); ⁢ &it = &@prev@( it ) ) ...
325	\end{cfa}
326	&
327	\begin{cfa}
328	1, 2, 3, 4,
329	4, 3, 2, 1,
330	\end{cfa}
331	\end{tabular}
332	\end{cquote}
333	Iterating forward and reverse from a starting node through the remaining list.
334	\begin{cquote}
335	\setlength{\tabcolsep}{15pt}
336	\begin{tabular}{@{}l\|l@{}}
337	\begin{cfa}
338	for ( node & it = @n2@; ⁢ &it = &@next@( it ) ) ...
339	for ( node & it = @n3@; ⁢ &it = &@prev@( it ) ) ...
340	\end{cfa}
341	&
342	\begin{cfa}
343	2, 3, 4,
344	3, 2, 1,
345	\end{cfa}
346	\end{tabular}
347	\end{cquote}
348	Iterating forward and reverse from a starting node to an ending node through the contained list.
349	\begin{cquote}
350	\setlength{\tabcolsep}{15pt}
351	\begin{tabular}{@{}l\|l@{}}
352	\begin{cfa}
353	for ( node & it = @n2@; &it @!= &n4@; &it = &@next@( it ) ) ...
354	for ( node & it = @n4@; &it @!= &n2@; &it = &@prev@( it ) ) ...
355	\end{cfa}
356	&
357	\begin{cfa}
358	2, 3,
359	4, 3,
360	\end{cfa}
361	\end{tabular}
362	\end{cquote}
363	Iterating forward and reverse through the entire list using the shorthand start at the list head and pick a direction.
364	In this case, @advance@ and @recede@ return a boolean, like \CC @while ( cin >> i )@.
365	\begin{cquote}
366	\setlength{\tabcolsep}{15pt}
367	\begin{tabular}{@{}l\|l@{}}
368	\begin{cfa}
369	for ( node & it = @iter@( list ); @advance@( it ); ) ...
370	for ( node & it = @iter@( list ); @recede@( it ); ) ...
371	\end{cfa}
372	&
373	\begin{cfa}
374	1, 2, 3, 4,
375	4, 3, 2, 1,
376	\end{cfa}
377	\end{tabular}
378	\end{cquote}
379	Finally, there are convenience macros that look like @foreach@ in other programming languages.
380	Iterating forward and reverse through the entire list.
381	\begin{cquote}
382	\setlength{\tabcolsep}{15pt}
383	\begin{tabular}{@{}l\|l@{}}
384	\begin{cfa}
385	FOREACH( list, it ) ...
386	FOREACH_REV( list, it ) ...
387	\end{cfa}
388	&
389	\begin{cfa}
390	1, 2, 3, 4,
391	4, 3, 2, 1,
392	\end{cfa}
393	\end{tabular}
394	\end{cquote}
395	Iterating forward and reverse through the entire list or until a predicate is triggered.
396	\begin{cquote}
397	\setlength{\tabcolsep}{15pt}
398	\begin{tabular}{@{}l\|l@{}}
399	\begin{cfa}
400	FOREACH_COND( list, it, @it.v == 3@ ) ...
401	FOREACH_REV_COND( list, it, @it.v == 1@ ) ...
402	\end{cfa}
403	&
404	\begin{cfa}
405	1, 2,
406	4, 3, 2,
407	\end{cfa}
408	\end{tabular}
409	\end{cquote}
410	Macros are not ideal, so future work is to provide a language-level @foreach@ statement, like \CC.
411	Finally, a predicate can be added to any of the manual iteration loops.
412	\begin{cquote}
413	\setlength{\tabcolsep}{15pt}
414	\begin{tabular}{@{}l\|l@{}}
415	\begin{cfa}
416	for ( node & it = first( list ); &it @&& !(it.v == 3)@; &it = &next( it ) ) ...
417	for ( node & it = last( list ); &it @&& !(it.v == 1)@; &it = &prev( it ) ) ...
418	for ( node & it = iter( list ); advance( it ) @&& !(it.v == 3)@; ) ...
419	for ( node & it = iter( list ); recede( it ) @&& !(it.v == 1)@; ) ...
420	\end{cfa}
421	&
422	\begin{cfa}
423	1, 2,
424	4, 3, 2,
425	1, 2,
426	4, 3, 2,
427	\end{cfa}
428	\end{tabular}
429	\end{cquote}
430
431	\begin{comment}
432	Many languages offer an iterator interface for collections, and a corresponding for-each loop syntax for consuming the items through implicit interface calls.
433	\CFA does not yet have a general-purpose form of such a feature, though it has a form that addresses some use cases.
434	This section shows why the incumbent \CFA pattern does not work for linked lists and gives the alternative now offered by the linked-list library.
435	Chapter 5 [TODO: deal with optimism here] presents a design that satisfies both uses and accommodates even more complex collections.
436
437	The current \CFA extensible loop syntax is:
438	\begin{cfa}
439	for( elem; end )
440	for( elem; begin ~ end )
441	for( elem; begin ~ end ~ step )
442	\end{cfa}
443	Many derived forms of @begin ~ end@ exist, but are used for defining numeric ranges, so they are excluded from the linked-list discussion.
444	These three forms are rely on the iterative trait:
445	\begin{cfa}
446	forall( T ) trait Iterate {
447	void ?{}( T & t, zero_t );
448	int ?<?( T t1, T t2 );
449	int ?<=?( T t1, T t2 );
450	int ?>?( T t1, T t2 );
451	int ?>=?( T t1, T t2 );
452	T ?+=?( T & t1, T t2 );
453	T ?+=?( T & t, one_t );
454	T ?-=?( T & t1, T t2 );
455	T ?-=?( T & t, one_t );
456	}
457	\end{cfa}
458	where @zero_t@ and @one_t@ are constructors for the constants 0 and 1.
459	The simple loops above are abbreviates for:
460	\begin{cfa}
461	for( typeof(end) elem = @0@; elem @<@ end; elem @+=@ @1@ )
462	for( typeof(begin) elem = begin; elem @<@ end; elem @+=@ @1@ )
463	for( typeof(begin) elem = @0@; elem @<@ end; elem @+=@ @step@ )
464	\end{cfa}
465	which use a subset of the trait operations.
466	The shortened loop works well for iterating a number of times or through an array.
467	\begin{cfa}
468	for ( 20 ) // 20 iterations
469	for ( i: 1 ~= 21 ~ 2 ) // odd numbers
470	for ( i; n ) total += a[i]; // subscripts
471	\end{cfa}
472	which is similar to other languages, like JavaScript.
473	\begin{cfa}
474	for ( i in a ) total += a[i];
475	\end{cfa}
476	Albeit with different mechanisms for expressing the array's length.
477	It might be possible to take the \CC iterator:
478	\begin{c++}
479	for ( list<int>::iterator it=mylist.begin(); it != mylist.end(); ++it )
480	\end{c++}
481	and convert it to the \CFA form
482	\begin{cfa}
483	for ( it; begin() ~= end() )
484	\end{cfa}
485	by having a list operator @<=@ that just looks for equality, and @+=@ that moves to the next node, \etc.
486
487	However, the list usage is contrived, because a list does use its data values for relational comparison, only links for equality comparison.
488	Hence, the focus of a list iterator's stopping condition is fundamentally different.
489	So, iteration of a linked list via the existing loop syntax is to ask whether this syntax can also do double-duty for iterating values.
490	That is, to be an analog of JavaScript's @for..of@ syntax:
491	\begin{cfa}
492	for ( e of a ) total += e;
493	\end{cfa}
494
495	The \CFA team will likely implement an extension of this functionality that moves the @~@ syntax from being part of the loop, to being a first-class operator (with associated multi-pace operators for the elided derived forms).
496	With this change, both @begin ~ end@ and @end@ (in context of the latter ``two-place for'' expression) parse as \emph{ranges}, and the loop syntax becomes, simply:
497	\begin{cfa}
498	for( elem; rangeExpr )
499	\end{cfa}
500	The expansion and underlying API are under discussion.
501	TODO: explain pivot from ``is at done?'' to ``has more?''
502	Advantages of this change include being able to pass ranges to functions, for example, projecting a numerically regular subsequence of array entries, and being able to use the loop syntax to cover more collection types, such as looping over the keys of a hashtable.
503
504	When iterating an empty list, the question, ``Is there a further element?'' needs to be posed once, receiving the answer, ``no.''
505	When iterating an $n$-item list, the same question gets $n$ ``yes'' answers (one for each element), plus one ``no'' answer, once there are no more elements; the question is posed $n+1$ times.
506
507	When iterating an empty list, the question, ``What is the value of the current element?'' is never posed, nor is the command, ``Move to the next element,'' issued. When iterating an $n$-item list, each happens $n$ times.
508
509	So, asking about the existence of an element happens once more than retrieving an element's value and advancing the position.
510
511	Many iteration APIs deal with this fact by splitting these steps across different functions, and relying on the user's knowledge of iterator state to know when to call each. In Java, the function @hasNext@ should be called $n+1$ times and @next@ should be called $n$ times (doing the double duty of advancing the iteration and returning a value). In \CC, the jobs are split among the three actions, @it != end@, @it++@ and @*it@, the latter two being called one time more than the first.
512
513	TODO: deal with simultaneous axes: @DLINK_VIA@ just works
514
515	TODO: deal with spontaneous simultaneity, like a single-axis req, put into an array: which ``axis'' is @&req++@ navigating: array-adjacency vs link dereference. It should sick according to how you got it in the first place: navigating dlist(req, req.pri) vs navigating array(req, 42). (prob. future work)
516	\end{comment}
517
518
519	\section{Implementation}
520
521	\VRef[Figure]{fig:lst-impl-links} continues the running @req@ example, now showing the \CFA list library's internal representation.
522	The @dlink@ structure contains exactly two pointers: @next@ and @prev@, which are opaque.
523	Even though the user-facing list model is linear, the CFA library implements all listing as circular.
524	This choice helps achieve uniform end treatment and \PAB{TODO finish summarizing benefit}.
525	A link pointer targets a neighbouring @dlink@ structure, rather than a neighbouring @req@.
526	(Recall, the running example has the user putting a @dlink@ within a @req@.)
527
528	\begin{figure}
529	\centering
530	\includegraphics[width=\textwidth]{lst-impl-links.pdf}
531	\caption{
532	\CFA list library representations for the cases under discussion.
533	}
534	\label{fig:lst-impl-links}
535	\end{figure}
536
537	Circular link-pointers (dashed lines) are tagged internally in the pointer to indicate linear endpoints.
538	Links among neighbour nodes are not tagged.
539	Iteration reports ``has more elements'' when accessing untagged links, and ``no more elements'' when accessing tagged links.
540	Hence, the tags are set on the links that a user cannot navigate.
541
542	In a headed list, the list head (@dlist(req)@) acts as an extra element in the implementation-level circularly-linked list.
543	The content of a @dlist@ header is a (private) @dlink@, with the @next@ pointer to the first element, and the @prev@ pointer to the last element.
544	Since the head wraps a @dlink@, as does @req@, and since a link-pointer targets a @dlink@, the resulting cycle is among @dlink@ structures, situated inside of header/node.
545	An untagged pointer points within a @req@, while a tagged pointer points within a list head.
546	In a headless list, the circular backing list is only among @dlink@s within @req@s.
547
548	No distinction is made between an unlisted node (top left and middle) under a headed model and a singleton list under a headless model.
549	Both are represented as an item referring to itself, with both tags set.
550
551
552	\section{Assessment}
553	\label{toc:lst:assess}
554
555	This section examines the performance of the discussed list implementations.
556	The goal is to show the \CFA lists are competitive with other designs, but the different list designs may not have equivalent functionality, so it is impossible to select a winner encompassing both functionality and execution performance.
557
558
559	\subsection{Add-Remove Performance}
560
561	The fundamental job of a linked-list library is to manage the links that connect nodes.
562	Any link management is an action that causes pair(s) of elements to become, or cease to be, adjacent in the list.
563	Thus, adding and removing an element are the sole primitive actions.
564
565	Repeated adding and removing is necessary to measure timing because these operations can be as simple as a dozen instructions.
566	These instruction sequences may have cases that proceed (in a modern, deep pipeline) without a stall.
567
568	This experiment takes the position that:
569	\begin{itemize}
570	\item The total time to add and remove is relevant \vs individual times for add and remove.
571	Adds without removes quickly fill memory;
572	removes without adds is meaningless.
573	\item A relevant breakdown ``by operation'' is:
574	\begin{description}
575	\item[movement]
576	Is the add/remove applied to a stack, queue, or something else?
577	\item[polarity]
578	In which direction does the action apply?
579	For a queue, do items flow from first to last or last to first?
580	For a stack, is the first or last end used for adding and removing?
581	\item[accessor]
582	Is an add/remove location given by a list head's ``first''/``last'', or by a reference to an individual element?
583	\end{description}
584	\item Speed differences caused by the host machine's memory hierarchy need to be identified and explained,
585	but do not represent advantages of one linked-list implementation over another.
586	\end{itemize}
587
588	The experiment used to measure insert/remove cost measures the mean duration of a sequence of additions and removals.
589	Confidence bounds, on this mean, are discussed.
590	The distribution of speeds experienced by an individual add-remove pair (tail latency) is not discussed.
591	Space efficiency is shown only indirectly, by way of caches' impact on speed.
592	The experiment is sensitive enough to show:
593	\begin{itemize}
594	\item intrusive lists performing (majorly) differently than wrapped lists,
595	\item a space of (minor) performance differences among the intrusive lists.
596	\end{itemize}
597
598
599	\subsubsection{Experiment setup}
600
601	The experiment driver defines a (intrusive) node type:
602	\begin{cfa}
603	struct Node {
604	int i, j, k; // fields
605	// possible intrusive links
606	};
607	\end{cfa}
608	and considers the speed of building and tearing down a list of $n$ instances of it.
609	% A number of experimental rounds per clock check is precalculated to be appropriate to the value of $n$.
610	\begin{cfa}
611	// simplified harness: CFA implementation,
612	// stack movement, insert-first polarity, head-mediated access
613	size_t totalOpsDone = 0;
614	dlist( node_t ) lst;
615	node_t nodes[ n ]; $\C{// preallocated list storage}$
616	startTimer();
617	while ( CONTINUE ) { $\C{// \(\approx\) 20 second duration}$
618	for ( i; n ) insert_first( lst, nodes[i] ); $\C{// build up}$
619	for ( i; n ) remove_first( lst ); $\C{// tear down}$
620	totalOpsDone += n;
621	}
622	stopTimer();
623	reportedDuration = getTimerDuration() / totalOpsDone; // throughput per insert/remove operation
624	\end{cfa}
625	To reduce administrative overhead, the $n$ nodes for each experiment list are preallocated in an array (on the stack), which removes dynamic allocations for this storage.
626	These nodes contain an intrusive pointer for intrusive lists, or a pointer to it is stored in a dynamically-allocated internal-node for a wrapper list.
627	Copying the node for wrapped lists skews the results with administration costs.
628	The list insertion/removal operations are repeated for a typical 20+ second duration.
629	After each round, a counter is incremented by $n$ (for throughput).
630	Time is measured outside the loop because a large $n$ can overrun the time duration before the @CONTINUE@ flag is tested.
631	The loop duration is divided by the counter and this throughput is reported.
632	In a scatter-plot, each dot is one throughput, which means insert + remove + harness overhead.
633	The harness overhead is constant when comparing linked-list implementations and keep as small as possible.
634	% The remainder of the setup section discusses the choices that affected the harness overhead.
635
636	To test list operations, the experiment performs the inserts/removes in different patterns, \eg insert and remove from front, insert from front and remove from back, random insert and remove, \etc.
637	Unfortunately, the @std::list@ does \emph{not} support direct insert/remove from a node without an iterator, \ie no @erase( node )@, even though the list is doubly-linked.
638	To eliminate this additional cost in the harness, a trick is used for random insertions without replacement.
639	The @i@ fields in each node are initialized from @0..n-1@, these @i@ values are then shuffled in the nodes, and the @i@ value is used to represent an indirection to that node for insertion, so the nodes are inserted in random order, and hence, removed in th same random order.
640	$\label{p:Shuffle}$
641	\begin{c++}
642	for ( i; n ) @nodes[i].i = i@; $\C{// indirection}$
643	shuffle( nodes, n ); $\C{// random shuffle indirects within nodes}$
644
645	while ( CONTINUE ) {
646	for ( i; n ) insert_first( lst, nodes[ @nodes[i].i@ ] ); $\C{// build up}$
647	for ( i; n ) pass( &remove_first( lst ) ); $\C{// tear down}$
648	totalOpsDone += n;
649	}
650	\end{c++}
651	This approach works across intrusive and wrapped lists.
652
653	% \emph{Interleaving} allows for movements other than pure stack and queue.
654	% Note that the earlier example of using the iterators' array is still a pure stack: the item selected for @erase(...)@ is always the first.
655	% Including a less predictable movement is important because real applications that justify doubly linked lists use them.
656	% Freedom to remove from arbitrary places (and to insert under more relaxed assumptions) is the characteristic function of a doubly linked list.
657	% A queue with drop-out is an example of such a movement.
658	% A list implementation can show unrepresentative speed under a simple movement, for example, by enjoying unchallenged ``Is first element?'' branch predictions.
659
660	% Interleaving brings ``at middle of list'' cases into a stream of add or remove invocations, which would otherwise be exclusively ``at end''.
661	% A chosen split, like half middle and half end, populates a boolean array, which is then shuffled.
662	% These booleans then direct the action to end-\vs-middle.
663	%
664	% \begin{cfa}
665	% // harness (bookkeeping and shuffling elided): CFA implementation,
666	% // stack movement, insert-first polarity, interleaved element-based remove access
667	% dlist( item_t ) lst;
668	% item_t items[ n ];
669	% @bool interl[ n ];@ // elided: populate with weighted, shuffled [0,1]
670	% while ( CONTINUE ) {
671	% item_t * iters[ n ];
672	% for ( i; n ) {
673	% insert_first( items[i] );
674	% iters[i] = & items[i];
675	% }
676	% @item_t ** crsr[ 2 ]@ = { // two cursors into iters
677	% & iters[ @0@ ], // at stack-insert-first's removal end
678	% & iters[ @n / interl_frac@ ] // in middle
679	% };
680	% for ( i; n ) {
681	% item *** crsr_use = & crsr[ interl[ i ] ]@;
682	% remove( *** crsr_use ); // removing from either middle or end
683	% *crsr_use += 1; // that item is done
684	% }
685	% assert( crsr[0] == & iters[ @n / interl_frac@ ] ); // through second's start
686	% assert( crsr[1] == & iters[ @n@ ] ); // did the rest
687	% }
688	% \end{cfa}
689	%
690	% By using the pair of cursors, the harness avoids branches, which could incur prediction stall times themselves, or prime a branch in the SUT.
691	% This harness avoids telling the hardware what the SUT is about to do.
692
693	The comparator linked-list implementations are:
694	\begin{description}
695	\item[std::list] The @list@ type of g++.
696	\item[lq-list] The @list@ type of LQ from glibc of gcc.
697	\item[lq-tailq] The @tailq@ type of the same.
698	\item[upp-upp] \uCpp provided @uSequence@
699	\item[cfa-cfa] \CFA's @dlist@
700	\end{description}
701
702
703	\subsection{Experimental Environment}
704	\label{s:ExperimentalEnvironment}
705
706	The performance experiments are run on:
707	\begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
708	%\item[PC]
709	%with a 64-bit eight-core AMD FX-8370E, with ``taskset'' pinning to core \#6. The machine has 16 GB of RAM and 8 MB of last-level cache.
710	%\item[ARM]
711	%Gigabyte E252-P31 128-core socket 3.0 GHz, WO memory model
712	\item[AMD]
713	Supermicro AS--1125HS--TNR EPYC 9754 128--core socket, hyper-threading $\times$ 2 sockets (512 processing units) 2.25 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 1024KB L2, 16MB L3, where each L3 cache covers 16 processors.
714	%\item[Intel]
715	%Supermicro SYS-121H-TNR Xeon Gold 6530 32--core, hyper-threading $\times$ 2 sockets (128 processing units) 2.1 GHz, TSO memory model
716	\end{description}
717	The experiments are single threaded and pinned to single core to prevent any OS movement, which might cause cache or NUMA effects perturbing the experiment.
718
719	The compiler is gcc/g++-14.2.0 running on the Linux v6.8.0-52-generic OS.
720	Switching between the default memory allocators @glibc@ and @llheap@ is done with @LD_PRELOAD@.
721	To prevent eliding certain code patterns, crucial parts of a test are wrapped by the function @pass@
722	\begin{cfa}
723	// prevent eliding, cheaper than volatile
724	static inline void * pass( void * v ) { __asm__ __volatile__( "" : "+r"(v) ); return v; }
725	...
726	pass( &remove_first( lst ) ); // wrap call to prevent elision, insert cannot be elided now
727	\end{cfa}
728	The call to @pass@ can prevent a small number of compiler optimizations but this cost is the same for all lists.
729
730
731	\subsubsection{Result: Coarse comparison of styles}
732
733	This comparison establishes how an intrusive list performs compared with a wrapped-reference list.
734	\VRef[Figure]{fig:plot-list-zoomout} presents throughput at various list lengths for a linear and random (shuffled) insert/remove test.
735	Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing difference ends of the access spectrum.
736	In the graphs, all four intrusive lists are plotted with the same symbol;
737	theses symbols clumped on top of each, showing the performance difference among intrusive lists is small in comparison to the wrapped list \see{\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists}.
738
739	The list lengths start at 10 due to the short insert/remove times of 2--4 ns, for intrusive lists, \vs STL's wrapped-reference list of 15--20 ns.
740	For very short lists, like 4, the experiment time of 4 $\times$ 2.5 ns and experiment overhead (loops) of 2--4 ns, results in an artificial administrative bump at the start of the graph having nothing to do with the insert/remove times.
741	As the list size grows, the administrative overhead for intrusive lists quickly disappears.
742
743	\begin{figure}
744	\centering
745	\subfloat[Linear List Nodes]{\label{f:Linear}
746	\includegraphics{plot-list-zoomout-noshuf.pdf}
747	} % subfigure
748	\\
749	\subfloat[Shuffled List Nodes]{\label{f:Shuffled}
750	\includegraphics{plot-list-zoomout-shuf.pdf}
751	} % subfigure
752	\caption{Insert/remove duration \vs list length.
753	Lengths go as large possible without error.
754	One example operation is shown: stack movement, insert-first polarity and head-mediated access.}
755	\label{fig:plot-list-zoomout}
756	\end{figure}
757
758	The large difference in performance between intrusive and wrapped-reference lists results from the dynamic allocation for the wrapped nodes.
759	In effect, this experiment is largely measuring the cost of malloc/free rather than the insert/remove, and is affected by the layout of memory by the allocator.
760	Fundamentally, insert/remove for a doubly linked-list has a basic cost, which is seen by the similar results for the intrusive lists.
761	Hence, the costs for a wrapped-reference list are: allocating/deallocating a wrapped node, copying a external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
762	the dynamic allocation dominates these costs.
763	For example, the experiment was run with both glibc and llheap memory allocators, where performance from llheap reduced the cost from 20 to 16 ns, which is still far from the 2--4 ns for intrusive lists.
764	Unfortunately, there is no way to tease apart the linking and allocation costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage the storage.
765	Hence, dynamic allocation cost is fundamental for wrapped lists.
766
767	In detail, \VRef[Figure]{f:Linear} shows linear insertion of all the nodes and then linear removal, both in the same direction.
768	For intrusive lists, the nodes are adjacent because they are preallocated in an array.
769	For wrapped lists, the wrapped nodes are still adjacent because the memory allocator happens to use bump allocation for the small fixed-sized wrapped nodes.
770	As a result, these memory layouts result in high spatial and temporal locality for both kinds of lists during the linear array traversal.
771	With address look-ahead, the hardware does an excellent job of managing the multi-level cache.
772	Hence, performance is largely constant for both kinds of lists, until cache and NUMA boundaries are crossed for longer lists and the costs increase consistently for both kinds of lists.
773
774	In detail, \VRef[Figure]{f:Shuffled} shows shuffle insertion of all the nodes and then linear removal, both in the same direction.
775	As for linear, there are issues with the wrapped list and memory allocation.
776	For intrusive lists, it is possible to link the nodes randomly, so consecutive nodes in memory seldom point at adjacent nodes.
777	For wrapped lists, the placement of wrapped nodes is dictated by the memory allocator, and results in memory layout virtually the same as for linear, \ie the wrapped nodes are laid out consecutively in memory, but each wrapped node points at a randomly located external node.
778	As seen on \VPageref{p:Shuffle}, the random access reads data from external node, which are located in random order.
779	So while the wrapped nodes are accessed linearly, external nodes are touched randomly for both kinds of lists resulting in similar cache events.
780	The slowdown of shuffled occurs as the experiment's length grows from last-level cache into main memory.
781	% Insert and remove operations act on both sides of a link.
782	%Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location.
783	Each time a next item is processed, the memory access is a hop to a randomly far address.
784	The target is unavailable in the cache and a slowdown results.
785	Note, the external-data no-touch assumption is often unrealistic: decisions like,``Should I remove this item?'' need to look at the item.
786	Therefore, under realistic assumptions, both intrusive and wrapped-reference lists suffer similar caching issues for very large lists.
787	% In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a non-intrusive list may be preferred for lists of approximately the cache's size.
788
789	The takeaway from this experiment is that wrapped-list operations are expensive because memory allocation is expense at this fine-grained level of execution.
790	Hence, when possible, using intrusive links can produce a significant performance gain, even if nodes must be dynamically allocated, because the wrapping allocations are eliminated.
791	Even when space is a consideration, intrusive links may not use any more storage if a node is mostly linked.
792	Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures.
793
794	% Note, linear access may not be realistic unless dynamic size changes may occur;
795	% if the nodes are known to be adjacent, use an array.
796
797	% In a wrapped-reference list, list nodes are allocated separately from the items put into the list.
798	% Intrusive beats wrapped at the smaller lengths, and when shuffling is avoided, because intrusive avoids dynamic memory allocation for list nodes.
799
800	% STL's performance is not affected by element order in memory.
801	%The field of intrusive lists begins with length-1 operations costing around 10 ns and enjoys a ``sweet spot'' in lengths 10--100 of 5--7-ns operations.
802	% This much is also unaffected by element order.
803	% Beyond this point, shuffled-element list performance worsens drastically, losing to STL beyond about half a million elements, and never particularly leveling off.
804	% In the same range, an unshuffled list sees some degradation, but holds onto a 1--2 $\times$ speedup over STL.
805
806	% The apparent intrusive ``sweet spot,'' particularly its better-than-length-1 speed, is not because of list operations truly running faster.
807	% Rather, the worsening as length decreases reflects the per-operation share of harness overheads incurred at the outer-loop level.
808	% Disabling the harness's ability to drive interleaving, even though the current scenario is using a ``never work in middle'' interleave, made this rise disappear.
809	% Subsequent analyses use length-controlled relative performance when comparing intrusive implementations, making this curiosity disappear.
810
811	% The remaining big-swing comparison points say more about a computer's memory hierarchy than about linked lists.
812	% The tests in this chapter are only inserting and removing.
813	% They are not operating on any user payload data that is being listed.
814	% The drastic differences at large list lengths reflect differences in link-field storage density and in correlation of link-field order to element order.
815	% These differences are inherent to the two list models.
816
817	% A wrapped-reference list's separate nodes are allocated right beside each other in this experiment, because no other memory allocation action is happening.
818	% As a result, the interlinked nodes of the STL list are generally referencing their immediate neighbours.
819	% This pattern occurs regardless of user-item shuffling because this test's ``use'' of the user-items' array is limited to storing element addresses.
820	% This experiment, driving an STL list, is simply not touching the memory that holds the user data.
821	% Because the interlinked nodes, being the only touched memory, are generally adjacent, this case too has high memory locality and stays fast.
822
823	% But the comparison of unshuffled intrusive with wrapped-reference gives the performance of these two styles, with their the common impediment of overfilling the cache removed.
824	% Intrusive consistently beats wrapped-reference by about 20 ns, at all sizes.
825	% This difference is appreciable below list length 0.5 M, and enormous below 10 K.
826
827
828	\section{Result: Comparing intrusive implementations}
829	\label{s:ComparingIntrusiveImplementations}
830
831	The preceding result shows that all the intrusive implementations examined have noteworthy performance compared to wrapped lists.
832	This analysis picks the list region 10-150 and zooms-in to differentiate among the different intrusive implementations.
833	The data is selected from the start of \VRef[Figure]{f:Linear}, but the start of \VRef[Figure]{f:Shuffled} would be largely the same.
834
835	\begin{figure}
836	\centering
837	\subfloat[Absolute Time]{\label{f:AbsoluteTime}
838	\includegraphics{plot-list-zoomin-abs.pdf}
839	} % subfigure
840	\\
841	\subfloat[Relative Time]{\label{f:RelativeTime}
842	\includegraphics{plot-list-zoomin-rel.pdf}
843	} % subfigure
844	\caption{Operation duration \vs list length at small-medium lengths. One example operation is shown: stack movement, insert-first polarity and head-mediated access. (a) has absolute times. (b) has times relative to those of LQ-\lstinline{tailq}.}
845	\label{fig:plot-list-zoomin}
846	\end{figure}
847
848	\VRef[Figure]{fig:plot-list-zoomin} shows the zone from 10--150 blown up, both in absolute and relative time.
849	% The same scenario as the coarse comparison is used: a stack, with insertions and removals happening at the end called ``first,'' ``head'' or ``front,'' and all changes occurring through a head-provided insert/remove operation.
850	The error bars show fastest and slowest time seen on five trials, and the central point is the mean of the remaining three trials.
851	For readability, the points are slightly staggered at a given horizontal value, where the points might otherwise appear on top of each other.
852
853	While preparing experiment results, I first tested on my old office PC, AMD FX-8370E Eight-Core, before switching to the large new server for final testing.
854	For this experiment, the results flipped in my favour when running on the server.
855	New CPU architectures are now amazingly good at branch prediction and micro-parallelism in the pipelines.
856	Specifically, on the PC, my \CFA and companion \uCpp lists are slower than lq-tail and lq-list by 10\% to 20\%.
857	On the server, \CFA and \uCpp lists are can be fast by up to 100\%.
858	Overall, LQ-tailq does the best at short lengths but loses out above a dozen elements.
859
860	\VRef[Figure]{f:RelativeTime} shows the percentage difference by treating @tailq@ as the control benchmark, and showing the other list results relative to it.
861	This change does not affect the who-wins statements, it just removes the ``sweet spot'' bend that the earlier discussion dismissed as incidental.
862	Runs faster than @tailq@'s are below the zero and slower runs are above;
863	@tailq@'s mean is always zero by definition, but its error bars, representing a single scenario's re-run stability, are still meaningful.
864	With this bend straightened out, aggregating across lengths is feasible.
865
866	\begin{figure}
867	\centering
868	\subfloat[Supersets]{\label{f:Superset}
869	\includegraphics{plot-list-cmp-exout.pdf}
870	} % subfigure
871	\\
872	\subfloat[1st Level Slice]{\label{f:1stLevelSlice}
873	\includegraphics{plot-list-cmp-survey.pdf}
874	} % subfigure
875	\caption{Operation duration ranges across operational scenarios. (a) has the supersets of the running example operation. (b) has the first-level slices of the full space of operations.}
876	\label{fig:plot-list-cmp-overall}
877	\end{figure}
878
879	\VRef[Figure]{fig:plot-list-cmp-overall} introduces alternative views of the data.
880	Part (a)'s first column summarizes all the data of \VRef{fig:plot-list-zoomin}-(b).
881	Its x-axis label, ``stack/insfirst/allhead,'' names the concrete scenario that has been discussed until now.
882	Moving across the columns, the next three each stretch to include more scenarios on each of the operation dimensions, one at a time.
883	The second column considers the scenarios $\{\mathrm{stack}\} \times \{\mathrm{insfirst}\} \times \{\mathrm{allhead}, \mathrm{inselem}, \mathrm{remelem}\}$,
884	while the third stretches polarity and the fourth stretches accessor.
885	Then next three columns each stretch two scenario dimensions and the last column stretches all three.
886	The \CFA bar in the last column is summarizing 840 test-program runs: 14 list lengths, 2 movements, 2 polarities, 3 accessors and 5 repetitions.
887
888	In the earlier plots of one scenario broken down by length, each data point, with its error bars, represents just 5 repetitions.
889	With a couple exceptions, this reproducibility error was small.
890	Now, for a \CFA bar, summarizing 70 (first column) to 840 (last column) runs, a bar's height is dominated by the different behaviours of the scenarios and list length that it summarizes.
891	Accordingly, the first column's bars are short and last one's are tall.
892	A box represents the inner 68\% of the durations, while its lines extend to cover 95\%.
893	The symbol on the bar is the mean duration.
894
895	The chosen benchmark of LQ-@tailq@ is not shown in this format because it would be trivial here.
896	With iter-scenario differences dominating the bar size, and @tailq@'s mean performance defined to be zero in all scenarios, a @tailq@ bar on this plot would only show @tailq@'s re-run stability, which is of no comparison value.
897
898	The LQ-@list@ implementation does not support all scenarios, only stack movement with insert-first polarity.
899	So, its 1, 3, 4 and 7\textsuperscript{th} bars all summarize the same set of points (those with accessor constrained to all-head), as do its 2, 5, 6 and 8\textsuperscript{th} (those with accessor unconstrained).
900
901	Rather than exploring from one scenario out, \VRef{fig:plot-list-cmp-overall}-(b) gives a more systematic breakdown of the entire experimental space.
902	Other than the last grand-total column, each breakdown column shows one value from one operation dimension.
903
904	LQ-@list@'s partial scenario coverage gives missing bars where it does not support the operation.
905	And, again, it gives repetition where all data points occur in several columns' intersection, such as stack// and /insfirst/.
906
907	In the grand total, and in all halves by movement or polarity, \CFA and uC++ are equivalent, while LQ-@list@ beats them slightly.
908	Splitting on accessor, \CFA has a poor result on element removal, LQ-@list@ has a great result on the other accessors, and uC++ is unaffected.
909	The unseen @tailq@ dominates across every category and beats \CFA and uC++ by 15--20\%.
910
911	% \begin{figure}
912	% \centering
913	% \begin{tabular}{c}
914	% \includegraphics{plot-list-cmp-intrl-shift.pdf} \\
915	% (a) \\
916	% \includegraphics{plot-list-cmp-intrl-outcome.pdf} \\
917	% (b) \\
918	% \end{tabular}
919	% \caption{Caption TODO}
920	% \label{fig:plot-list-cmp-intrl}
921	% \end{figure}
922
923
924	\section{Result: CFA cost attribution}
925
926	This comparison loosely itemizes the reasons that the \CFA implementation runs 15--20\% slower than LQ.
927	Each reason provides for safer programming.
928	For each reason, a version of the \CFA list was measured that forgoes its safety and regains some performance.
929	These potential sacrifices are:
930	\newcommand{\mandhead}{\emph{mand-head}}
931	\newcommand{\nolisted}{\emph{no-listed}}
932	\newcommand{\noiter}{\emph{no-iter}}
933	\begin{description}
934	\item[mand(atory)-head] Removing support for headless lists.
935	A specific explanation of why headless support causes a slowdown is not offered.
936	But it is reasonable for a cost to result from making one piece of code handle multiple cases; the subset of the \CFA list API that applies to headless lists shares its implementation with headed lists.
937	In the \mandhead case, disabling the feature in \CFA means using an older version of the implementation, from before headless support was added.
938	In the pre-headless library, trying to form a headless list (instructing, ``Insert loose element B after loose element A,'') is a checked runtime error.
939	LQ does not support headless lists\footnote{
940	Though its documentation does not mention the headless use case, this fact is due to one of its insert-before or insert-after routines being unusable in every list model.
941	For \lstinline{tailq}, the API requires a head.
942	For \lstinline{list}, this usage causes an ``uncaught'' runtime crash.}.
943	\item[no-listed] Removing support for the @is_listed@ API query.
944	Along with it goes error checking such as ``When inserting an element, it must not already be listed, \ie be referred to from somewhere else.''
945	These abilities have a cost because, in order to support them, a listed element that is being removed must be written to, to record its change in state.
946	In \CFA's representation, this cost is two pointer writes.
947	To disable the feature, these writes, and the error checking that consumes their result, are put behind an @#ifdef@.
948	The result is that a removed element sees itself as still having neighbours (though these quasi-neighbours see it differently).
949	This state is how LQ leaves a removed element; LQ does not offer an is-listed query.
950	\item[no-iter(ation)] Removing support for well-terminating iteration.
951	The \CFA list uses bit-manipulation tagging on link poiters (rather than \eg null links) to express, ``No more elements this way.''
952	This tagging has the cost of submitting a retrieved value to the ALU, and awaiting this operation's completion, before dereferencing a link pointer.
953	In some cases, the is-terminating bit is transferred from one link to another, or has a similar influence on a resulting link value; this logic adds register pressure and more data dependency.
954	To disable the feature, the @#ifdef@-controlled tag manipulation logic compiles in answers like, ``No, that link is not a terminator,'' ``The dereferenceable pointer is the value you read from memory,'' and ``The terminator-marked value you need to write is the pointer you started with.''
955	Without this termination marking, repeated requests for a next valid item will always provide a positive response; when it should be negative, the indicated next element is garbage data at an address unlikely to trigger a memory error.
956	LQ has a well-terminating iteration for listed elements.
957	In the \noiter case, the slowdown is not inherent; it represents a \CFA optimization opportunity.
958	\end{description}
959	\MLB{Ensure benefits are discussed earlier and cross-reference} % an LQ programmer must know not to ask, ``Who's next?'' about an unlisted element; an LQ programmer cannot write assertions about an item being listed; LQ requiring a head parameter is an opportunity for the user to provide inconsistent data
960
961	\begin{figure}
962	\centering
963	\begin{tabular}{c}
964	\includegraphics{plot-list-cfa-attrib.pdf} \\
965	(a) \\
966	\includegraphics{plot-list-cfa-attrib-remelem.pdf} \\
967	(b) \\
968	\end{tabular}
969	\caption{Operation duration ranges for functionality-reduced \CFA list implementations. (a) has the top level slices. (b) has the next level of slicing within the slower element-based removal operation.}
970	\label{fig:plot-list-cfa-attrib}
971	\end{figure}
972
973	\VRef[Figure]{fig:plot-list-cfa-attrib} shows the \CFA list performance with these features, and their combinations, turned on and off. When a series name is one of the three sacrifices above, the series is showing this sacrifice in isolation. These further series names give combinations:
974	\newcommand{\attribFull}{\emph{full}}
975	\newcommand{\attribParity}{\emph{parity}}
976	\newcommand{\attribStrip}{\emph{strip}}
977	\begin{description}
978	\item[full] No sacrifices. Same as measurements presented earlier.
979	\item[parity] \mandhead + \nolisted. Feature parity with LQ.
980	\item[strip] \mandhead + \nolisted + \noiter. All options set to ``faster.''
981	\end{description}
982	All list implementations are \CFA, possibly stripped.
983	The plot uses the same LQ-relative basis as earlier.
984	So getting to zero means matching LQ's @tailq@.
985
986	\VRef[Figure]{fig:plot-list-cfa-attrib}-(a) summarizes the time attribution across the main operating scenarios.
987	The \attribFull series is repeated from \VRef[Figure]{fig:plot-list-cmp-overall}, part (b), while the series showing feature sacrifices are new.
988	Going all the way to \attribStrip at least nearly matches LQ in all operating scenarios, beats LQ often, and slightly beats LQ overall.
989	Except within the accessor splits, both sacrifices contribute improvements individually, \noiter helps more than \attribParity, and the total \attribStrip benefit depends on both contributions.
990	When the accessor is not element removal, the \attribParity shift appears to be counterproductive, leaving \noiter to deliver most of the benefit.
991	For element removals, \attribParity is the heavy hitter, with \noiter contributing modestly.
992
993	The counterproductive shift outside of element removals is likely due to optimization done in the \attribFull version after implementing headless support, \ie not present in the \mandhead version.
994	This work streamlined both head-based operations (head-based removal being half the work of the element-insertion test).
995	This improvement could be ported to a \mandhead-style implementation, which would bring down the \attribParity time in these cases.
996
997	More significantly, missing this optimization affects every \attribParity result because they all use head-based inserts or removes for at least half their operations.
998	It is likely a reason that \attribParity is not delivering as well overall as \noiter.
999	It even represents plausible further improvements in \attribStrip.
1000
1001	\VRef[Figure]{fig:plot-list-cfa-attrib}-(b) addresses element removal being the overall \CFA slow spot and element removal having a peculiar shape in the (a) analysis.
1002	Here, the \attribParity sacrifice bundle is broken out into its two constituents.
1003	The result is the same regardless of the operation.
1004	All three individual sacrifices contribute noteworthy improvements (\nolisted slightly less).
1005	The fullest improvement requires all of them.
1006
1007	The \noiter feature sacrifice is unpalatable.
1008	But because it is not an inherent slowdown, there may be room to pursue a \noiter-level speed improvement without the \noiter feature sacrifice.
1009	The performance crux for \noiter is the pointer-bit tagging scheme.
1010	Alternative designs that may offer speedup with acceptable consequences include keeping the tag information in a separate field, and (for 64-bit architectures) keeping it in the high-order byte \ie using byte- rather than bit-oriented instructions to access it.
1011	The \noiter speed improvement would bring \CFA to +5\% of LQ overall, and from high twenties to high teens, in the worst case of element removal.
1012
1013	Ultimately, this analysis provides options for a future effort that needs to get the most speed out of the \CFA list.
1014
1015
1016
1017
1018	\section{Future Work}
1019	\label{toc:lst:futwork}
1020
1021	The \CFA list examples elide the \lstinline{P9_EMBEDDED} annotations that (TODO: xref P9E future work) proposes to obviate.
1022	Thus, these examples illustrate a to-be state, free of what is to be historic clutter.
1023	The elided portions are immaterial to the discussion and the examples work with the annotations provided.
1024	The \CFA test suite (TODO:cite?) includes equivalent demonstrations, with the annotations included.
1025	\begin{cfa}
1026	struct mary {
1027	float anotherdatum;
1028	inline dlink(mary);
1029	};
1030	struct fred {
1031	float adatum;
1032	inline struct mine { inline dlink(fred); };
1033	inline struct yours { inline dlink(fred); };
1034	};
1035	\end{cfa}
1036	like in the thesis examples. You have to say
1037	\begin{cfa}
1038	struct mary {
1039	float anotherdatum;
1040	inline dlink(mary);
1041	};
1042	P9_EMBEDDED(mary, dlink(mary))
1043	struct fred {
1044	float adatum;
1045	inline struct mine { inline dlink(fred); };
1046	inline struct yours { inline dlink(fred); };
1047	};
1048	P9_EMBEDDED(fred, fred.mine)
1049	P9_EMBEDDED(fred, fred.yours)
1050	P9_EMBEDDED(fred.mine, dlink(fred))
1051	P9_EMBEDDED(fred.yours, dlink(fred))
1052	\end{cfa}
1053	like in tests/list/dlist-insert-remove.
1054	Future work should autogen those @P9_EMBEDDED@ declarations whenever it sees a plan-9 declaration.
1055	The exact scheme chosen should harmonize with general user-defined conversions.
1056
1057	Today's P9 scheme is: mary gets a function `inner returning this as dlink(mary).
1058	Fred gets four of them in a diamond.
1059	They're defined so that `inner is transitive; i.e. fred has two further ambiguous overloads mapping fred to dlink(fred).
1060	The scheme allows the dlist functions to give the assertion, "we work on any T that gives a `inner to dlink(T)."
1061
1062
1063	TODO: deal with: A doubly linked list is being designed.
1064
1065	TODO: deal with: Link fields are system-managed.
1066	Links in GDB.
1067

Note: See TracBrowser for help on using the repository browser.

Download in other formats: