Changes in / [81e1984b:58a4cde]


Ignore:
Files:
16 edited

Legend:

Unmodified
Added
Removed
  • doc/proposals/modules-alvin/proposal.md

    r81e1984b r58a4cde  
    11<!--
    2 I fleshed out some problems with strong module theory after working through some examples and writing some code, and got a better idea of how to actually implement my vision. The cynic in me says that I've simply solved many hard problems by slapping "compile-time reflection" on it, but my formalism shows that works. I'm quite happy with the result - its focus on making it possible to export individual top-level declarations provides a really nice logical interface, and many of its concepts are used in other programming languages, so the strategy seems at least reasonable.
    3 There are still some aspects that need to be added: initialization order, handling forward declarations of symbols defined outside of modules, handling the full C spec instead of restricting top-level statements, and addressing the additional features that Cforall provides on top of C. In my modules proposal v5 I'll rearrange some of the sections and rewrite some parts in order to make it easier to follow (some sections are outdated, as my ideas changed slightly as I wrote this proposal). This will allow me to better address some of these missing aspects.
     2I fleshed out some problems with strong module theory after working through some examples and writing some code, and got a better idea of how to actually implement my vision. The cynic in me says that I've simply solved many hard problems by slapping "compile-time reflection" on it (which is by no means a simple thing to engineer), but it does solve a number of glaring issues with my previous attempts. Additionally, it takes many concepts from other programming languages, so the strategy seems at least reasonable.
     3Now, the biggest hole in my proposal is that I do not address the issue of initialization order (I'm kind of putting this one off, because I think I can handle this separately). Additionally, I still need to present "walkthrough examples" and present a formalism.
    44
    55Motivation
     
    1717    Forall polymorphism
    1818Walkthrough
    19     Symbols of a module
    20     How importing modules use symbols
    2119Formalism
    22     Layout of the codebase
    23     Individual module information
    24     Resolving names of exported declarations
    25     Providing additional information for importers
    26     Resolving names for compilation
    2720-->
    2821
     
    7770How do we actually implement this "oracle"? After analyzing the alternatives, option 2 feels the most "natural", despite its inherent complexity.
    78711. We can perform whole-program analysis to analyze all type-related information, similar to how Rust does it. This allows for maximum expressiveness since this gives us full visibility into the entire program, and can rely on the analyzer to automatically resolve circularly dependent modules (eg. module `A` imports module `B`, which in turn imports module `A`, but in a way that is still resolvable). However, this breaks the principle of separate compilation by accessing the entire program, and raises questions such as "what gets reanalyzed if `A` changes?"
    79 2. We can also extract type information into a separate generated module interface. This aligns closer to the principle of separate compilation, though it still requires special analysis to resolve circularly dependent modules (eg. module `A` imports module `B`, which in turn imports module `A`, but in a way that is still resolvable. In order to avoid a circular import error, this requires only importing the symbols from `B` that `A` needs). A criticism is that this does not really resolve the transitive dependency issue; in a certain sense, it's offloading the problem to a compile-time reflection mechanism. This level of compile-time reflection is also non-trivial, poentially requiring significant re-engineering and validation of an existing C compiler in order to implement. Despite these concerns, this strategy seems to align best with general intuition when analyzing an existing codebase.
     722. We can also extract type information into a separate generated module interface. This aligns closer to the principle of separate compilation, though it still requires special analysis to resolve circularly dependent modules (eg. module `A` imports module `B`, which in turn imports module `A`, but in a way that is still resolvable. In order to avoid a circular import error, this requires only importing the symbols from `B` that `A` needs). A criticism is that this does not really resolve the transitive dependency issue; in a certain sense, it's offloading the problem to a compile-time reflection mechanism. This level of compile-time reflection is also non-trivial, poentially requiring significant re-engineering and validation of an existing C compiler in order to implement. Despite these concerns, this strategy seems to align best with general intuitition when analyzing an existing codebase.
    8073
    8174### C macros are not exportable
     
    216209### Forall polymorphism
    217210
    218 The forall keyword is an addition to Cforall to support polymorphism, with polymorphic functions using vtables and a single implementation. If a module exports a forall statement, the module owns the polymorphic function implementations, while the polymorphic function declarations are exported (if these were declared inline, the definition could be exported, similar to C++20 modules). Polymorphic types are instantiated from the caller's side, so their definitions are exported. This may present problems, but currently I am not familiar enough with Cforall to judge.
     211The forall keyword is an addition to Cforall to support polymorphism, with polymorphic functions using vtables and a single implementation. If a module exports a forall statement, the module owns the polymorphic function implementations, while the polymorphic function declarations are exported (if these were declared inline, the definition could be exported, similar to C++20 modules). Polymorphic types are instantiated from the caller's side, so their definitions are exported. This does not present the same issues as those in the "handling unboxed types" section, as non-polymorphic function declarations cannot inadvertently capture these types (thus, they cannot "leak" outside of a module).
    219212
    220213An interesting case is to consider if Cforall could be updated to perform specialization (multiple implementations for a single function) in addition to the single implementation strategy. This would be similar to Rust's `impl` vs `dyn` traits. To support this, the module system would need to be updated, as we would want the multiple implementations to exist within the module that owns the forall statement.
     
    222215## Walkthrough
    223216
    224 A module holds a collection of symbols to be exported. What we're concerned with is how the importing module can use these symbols. In order to answer this, we will analyze:
    225 * What are the kinds of symbols and what does the importing module see?
    226 * How does the importing module disambiguate between symbols?
    227 
    228 Our guiding principle behind our modules is that, as much as is possible, **every symbol definition should have full control over who can see it.** This means that you cannot declare a function or struct that is defined in another module unless that module exports it and you import said module (of course, it may be accessible using `extern "C"`). We will see certain cases where types and inline functions break this abstraction somewhat, but we try to limit this "information leakage" as much as we can. A benefit of this is that our inductive proof usually only needs to reason using the previous step, as every symbol is logically isolated from each other.
    229 
    230 ### Symbols of a module
    231 
    232 Modules in our modules are either types, functions or variables. While types can be defined in the same line as a variable (eg. `struct {int i;} name = {5};`), we will make the simplifying assumption right now that this does not happen (in the example, both `name` and the anonymous struct would be exported together).
    233 
    234 #### Types
    235 
    236 There are 3 kinds of types: builtins (eg. `int`), user-defined (ie. `struct`) and aliases (ie. `typedef`). We defer discussion on pointers, references and arrays until afterwards.
    237 
    238 Built-ins are provided by the compiler to every module, so we don't need to consider them. We can simplify aliases as a `struct` with one field, which has an implicit field access.
    239 
    240 So `struct` is really the only type we need to concern ourselves with. Based on our guiding principle, if we export a `struct` that has a `struct` field which is not itself, then it should not export any information related to that field's type. The importer will be able to access all of its fields, but the `struct` field is essentially an opaque type (if a module also wants access to that, it needs to import that `struct` too).
    241 
    242 Unfortunately, this isn't technically possible. First, we have unboxed types (boxed types are implemented as a pointer to the actual object). This means we need to know the size and alignment of every field, so we know how large the struct is and the offsets of each field. Second, when a user initializes a variable with a field, they need to be able to write down the type name (eg. `struct Exported e; struct Field f = e.f1;`). So we need to know the size, alignment and name of a field (in the previous example, move/copy constructors may also need to be considered, but in C we simply copy bytes). A similar thing happens when returning types from functions (eg. `struct Unexported foo();`), which we will discuss later.
    243 
    244 How do we obtain this size/alignment/name information? We use compile-time reflection to crawl import modules as needed. Note that this needs to handle circular imports, as it should only fail if we encounter a type name ambiguity (eg. we import module `A` and `B` which both export `struct S`, and we need a `struct S`) or the type has infinite size (eg. `struct A {struct B b;}; struct B {struct A a; int i;};`). This means the number of modules that need to be analyzed in order to determine the size/alignment of a `struct` is only bounded by the size of the codebase, but in practice this should not happen often.
    245 
    246 Pointers and references are boxed values, so size/alignment information of the underlying type is not gathered. However, we still require modules to import the symbol in order to handle name disambiguation. Arrays may be parameterized by an integer whose value must be immediately available. This is key, otherwise we can end up with cases such as `struct S {struct T t[sizeof(struct U)];};`, which may require fixpoint analysis. In our case, we can simply leverage compile-time reflection in the same manner as we do with field structs.
    247 
    248 #### Functions
    249 
    250 Functions have the form `returnType name(parameterType, ...numberOfParameters) {...body}`. In order for importers to use this function, we need to provide the size, alignment and name of any types in its declaration. We use compile-time reflection to resolve the details (see Types section for details).
    251 
    252 If we are exporting an inline function, from the compiler's perspective the function body is also exported. This may also use compile-time reflection in order to resolve the symbols within the function body (which is done within the context of the containing module, not the importing module).
    253 
    254 Cforall introduces a keyword called `forall`, which can be used to create polymorphic functions. At the moment, these functions do not specialize - there is only one instance of this function, and callers need to pass a vtable alongside any polymorphic type parameters, similar to how the `dyn` trait in Rust works. This means that trait information must be provided to the importing module, which unfortunately breaks our guiding principle somewhat. Within a `forall` block we can also define polymorphic types, which can be used as the return value of a polymorphic function. In that case we need size/align/name information, and the first two are currently handled by calling a "layout function" at runtime (this could possibly be moved to happening at compile-time, but isn't at the moment). As such, we need provide the layout function to the importing module, which unfortunately breaks our guiding principle too. So polymorphic functions can be made to work with our system, but they break our guiding principles to some extent.
    255 
    256 #### Variables
    257 
    258 Variables have the form `variableType name = expression;`. We make the simplifying assumption that only one name is defined at a time (ie. no `int a, b, c;`). An importing module needs to provide the size, alignment and name of `variableType` in its declaration to any importing modules. We use compile-time reflection to resolve the details (see Types section for details).
    259 
    260 If the variable is actually a polymorphic type (see Function section for details), I'm actually not sure how this works. The layout function can only be called at runtime, but you need the information in order to allocate the appropriate amount of space in the `.data` section. More analysis is needed here.
    261 
    262 ### How importing modules use symbols
    263 
    264 The previous section notes compile-time reflection as the solution to many of our challenges with modules. Here we will explain how symbol names are resolved to the correct modules, allowing us to define how our compile-time reflection crawls through modules in order to determine the information it requires.
    265 
    266 #### Gathering symbols
    267 
    268 First, we scan the top-level declarations of a module, gathering a list of imported modules and top-level symbol names defined within the module.
    269 
    270 ```
    271 // module data/graph/node
    272 module;
    273 import edge;
    274 import ../minheap;
    275 import stdlib;
    276 const int maxEdges = 10;
    277 export struct Node {
    278     int id;
    279     int value;
    280     struct Edge edges[maxEdges];
    281     struct MinHeap someField;
    282 };
    283 // Shadows imported MinHeap
    284 struct MinHeap {int capacity; int size; int *data;};
    285 struct MinHeap initHeap(int capacity) {
    286     // Assumes malloc doesn't return error code
    287     return {capacity, 0, (int*)malloc(sizeof(int) * capacity)};
    288 }
    289 void destroyHeap(struct MinHeap* h) {
    290     free(h->data);
    291     h.capacity = 0;
    292     h.size = 0;
    293 }
    294 ```
    295 
    296 We first look for the import relative to the current directory - if that fails we look at libraries. So our set of imports here are: `data/graph/edge`, `data/minheap` and `stdlib` (library module).
    297 
    298 The top-level symbol names are: `maxEdges`, `Node`, `MinHeap`, `initHeap` and `destroyHeap`.
    299 
    300 #### Crawling other modules
    301 
    302 Continuing with our example, we do the same top-level analysis with the imported modules.
    303 
    304 If `data/minheap` has `export import minheap/functions;` defined, then it is as if `data/graph/node` has `import ../minheap/functions;` defined. This can keep going - if `data/minheap/functions` has `export import`, it is also added to `data/graph/node`. After all `export import` have been resolved, we can look at the top-level declarations of every imported module in order to determine what symbols are visible to `data/graph/node`.
    305 
    306 Now, we can bind each symbol within `data/graph/node` to the correct module. The details of how multiple symbols are disambiguated are described in the Overload resolution section.
    307 
    308 In addition to name disambiguation, some symbols need additional information in order to be useable by importers. For example, size/alignment information of types, function bodies for inline functions, and trait information for polymorphic functions. This information is obtained by performing the above steps, but on the importing module instead of `data/graph/node`.
    309 
    310 This recursive task raises the problem of circular imports: What if we recurse back to `data/graph/node` (or any module that creates a cycle)? As long as we are analyzing different symbols inside the circularly imported module, this should not pose a problem. This leaves us with handling the problem where we circle back to the same symbol. For size/alignment analysis, coming back to the same type means that said type contains itself, which for our purposes is not resolvable/infinite size. If an inline function calls other inline functions that mutually recurse with itself, we can either produce a forward declaration of the function within the underlying C implementation (Cforall compiles down to C), or make mutual recursion be a regular function call. The first option is ideal, but if this is not supported in the C spec, we can fall back to the second option. For trait information, we could emit a warning, but if a trait includes itself, that can be safely ignored (a trait is a collection of conditions, it contains itself by definition). Since we can handle all of our circular problems, our system is well-defined here.
    311 
    312 #### Overload resolution
    313 
    314 A key aspect about our modules is that imported symbols are used in the same way as they are defined - no namespacing needed (eg. `struct Edge` in the provided example). While this raises the potential for name collisions, we argue that with the right compiler tools and the ability to disambiguate between different modules (eg. `A::a` means "symbol `a` within module `A`"), this can be a reasonable tradeoff (unfortunately, much of the tooling here is out of scope for this proposal).
    315 
    316 The reason behind this decision comes from 2 places: First, if we add namespacing to imports, then when we have an imported function that returns a type defined in another module, we would be forced to write the module name (the alternatives don't fare much better in terms of brevity and readability). Second, a core feature of Cforall is its advanced overloading system (eg. allowing multiple variables with the same name but different types to coexist). Adding namespaces would likely limit our ability to research and take advantage of this feature.
    317 
    318 This means we have both symbols defined within our module and imported symbols participating in overload resolution. This requires extending our existing overload resolution algorithm to handle multiple types with the same name (currently we handle multiple functions and variables with the same name, but not types). This also requires refactoring the overload resolution algorithm to hold the full name of a symbol (symbol name + module name). Finally, if overload resolution favours two options equally, it should favour symbols defined within the given module (eg. `struct MinHeap` in the provided example). Note that this will only happen if the alternative is strictly better, otherwise it is ambiguous.
    319 
    320217## Formalism
    321218
    322 Taking the Walkthrough section a step further, we will provide a rough sketch of a formal argument that our module system is well-defined.
    323 
    324 We'll first describe the layout of our module system by providing a rough description of the file structure and grammar we are working with. Then we describe what we can grab from each module file separately. By combining this information with those from other modules, we can determine which module's symbol is being referred to in each exported declaration. Note that for some symbols to be properly used by importers, additional information like size/alignment information will be necessary, which we will show is possible to obtain. Finally, with this information, we can resolve every symbol within a function body or expression, thereby completing the compilation process.
    325 
    326 ### Layout of the codebase
    327 
    328 In order to provide a formal argument that our module system is sound, we need to first describe the format of the codebase we are trying to analyze.
    329 
    330 The file system is set up so that imports are relative to the current directory, or found in a library. For example, a module file `data/graph.cfa` trying to access `data/graph/node.cfa` would write `import graph/node;`. If such a file doesn't exist, we will look for `graph/node` within the provided libraries (starting from the root directory of each library).
    331 
    332 Module files have `module;` as their first statement. Import statements are written as `import filePath;`, where the file path can be escaped with double-quotes if necessary. Only top-level statements can be exported, which is done by placing `export` at the beginning (eg. `export const int i = 3;`). The available top-level statements are:
    333 * Types: `struct` and `typedef` statements. Only one `struct` can be defined per top-level statement (eg. no `typedef struct {int i;} MyStruct;`).
    334 * Functions: Format is `returnType name(parameterType ...numberOfParameters) {...functionBody}`.
    335 * Variables: Format is `variableType name = initializerExpression;`. Initializer may be omitted. Only one variable can be declared per top-level statement (eg. no `int a, b;`).
    336 
    337 The syntax of function bodies and initializer expressions can mostly remain unrestricted, as compilation is outside of the module system's purview (we provide the symbols and sufficient information on how to use them to importers). Some aspects we need to note are:
    338 * No user-defined copy/move constructors: In order to handle `func(s.f1)` (where `void func(struct Field);`), we need to be able to create a copy of `struct Field` without knowing its fields (in the example, the importer knows `func` and the type of `s`, but does not need to know the details of `struct Field`). Having user-defined copy/move constructors can be done, but we opt to avoid it here.
    339 * `::` operator: Sometimes we need to disambiguate which module a symbol belongs to. `struct InBothModules foo();` can either use `struct InBothModules;` from `import a;` or `import b;`. Trying to resolve this by analyzing the function body of `foo()` may be possible, but would likely be very challenging to implement. To resolve this, we can write `struct a::inBothModules foo();` instead.
    340 
    341 ### Individual module information
    342 
    343 For every module file, we can collect the import list and the top-level symbols, without needing to consult any other files. The top-level symbols include types, functions and variables.
    344 
    345 Top-level statements, at least the way we have it set up here, avoid any context-sensitive sections (eg. `MyType * a;` can be multiplication or a variable). This means we have no issues parsing all symbols at the top-level (function bodies and initializer expressions are context-sensitive, but we don't need to deal with them here). This means that for each module file, we can grab an import list, a type list, a function list and a (global) variable list, as well as determine which are exported.
    346 
    347 ### Resolving names of exported declarations
    348 
    349 Now we wish to resolve every symbol in any module's top-level declaration (this does not include the function bodies or initializer expressions). In order to do this, we first collect a module's full import list, then all of its imported symbols, before finally resolving our symbols.
    350 
    351 In order to determine a module's full import list, we first look at the individual module information of each imported module. We initially look for the imported module relative to the current directory of the module that the import statement is in - if it does not exist, it will look starting at the root directory of any provided libraries (if we can't find the module, we error out). If any imported module has `export import moduleName;` (or some other module name), then we look at those too. If we end up importing any modules we have already imported (or the module we are getting the full import list of), we can ignore those. Since there are a finite number of modules in a codebase (plus any provided libraries) and we don't analyze a module twice, this search will terminate with a full import list.
    352 
    353 To collect all the imported symbols, we take every module in the full import list and we grab all of the top-level statements that have `export` as their first keyword. This gives us a imported type list, a imported function list and a full variable list.
    354 
    355 Now that we have a list of imported symbols as well as the module's own, we can start resolving symbols in the top-level declarations. Since our grammar is not context-sensitive outside of function bodies and initializer expressions, there is no ambiguity determine whether a type or variable is needed. To resolve names, we first check to see if it uses the `::` operator - if so, then we look using the same strategy when looking for imported modules (if the module isn't in the full import list, we error out). Then we look in the module's own top-level symbols to see if we have a type/variable that matches the symbol's name (this means a module's symbols shadow its imported symbols). If we don't find a match, then we look in the list of imported symbols. If the list of imported symbols has more than one match, we error out due to ambiguity (it may be possible to use overload resolution here, but for now we'll prompt the user to use a `::` operator).
    356 
    357 Something to note about this setup is that forward declarations are neither necessary nor allowed, because all visible symbols are known before attempting to resolve names.
    358 
    359 ### Providing additional information for importers
    360 
    361 Unfortunately, in order to use some of these top-level symbols, simply resolving the names within top-level declarations (ie. not including function bodies or initializer expressions) is not enough for an importer to be able to use it. For example, if a function returns a `struct`, we need to at least know the size/ alignment of the `struct` in order to use it.
    362 
    363 Type names are found within the definitions of other types, the return and parameters of a function, and the type of a variable. In order to fully specify a type, we must at least know the size/alignment of its fields. To use a function, we also need to know size/alignment of return and parameter types. To use a variable, the size/alignment of its type is not technically required, but to keep its semantics consistent with that of field access, we require it (eg. `struct Unimported x = s.f1; struct Unimported x1 = importedVariable;`).
    364 
    365 For functions and variables, we first resolve the type (for types, we move to the next step). Then we look at the fields of that type and resolve those types (pointers don't need to be resolved because pointers have a defined size/alignment). We keep recursing until we resolve everything or find a cycle (if we have a cycle, we error out), and this terminates because there are a finite number of symbols. Additionally, types may be arrays, which are parameterized by an integer (if we resolve to something that is not an integer, we error out). The C spec requires that this integer's value must be constant and immediately calculable, which means `char padding[64-sizeof(int)]` is fine but not `char padding[64-sizeof(struct U)];` (this functionality can be supported and may be useful for cache-aware programming, but we will not support this for now). This means we can fully resolve everything within a well-defined type, and therefore collect its size/alignment.
    366 
    367 While the importing module doesn't change its behaviour whether a function is inline or not, the compiler must see the function body, which means it must be included in the file sent to the compiler. In order to resolve the inline function body, then from the perspective of the module that contains the inline function, we must collect the full import list, then all of the imported symbols. If any of those imported symbols are inline functions, then we need to keep recursing. If we find a mututally recursive inline function, we would emit a forward declaration (eg. `inline float a();`). Note that the module that owns the inline function must use `extern` (eg. `extern inline float a() {return 5;}`) in order to ensure a definition is emitted. Since there are a finite number of modules, this will terminate. Note that if an imported module contains an inline function, we do not provide the function definitions of its non-inline functions - this means the compiler cannot inadvertently expand the definition of a function it is not meant to have access to.
    368 
    369 Cforall introduces the `forall` keyword, which allows creating polymorphic functions and types. I am not completely familiar with the internal implementation of these (hence it isn't actually a part of this formalism), but I know they use a single implementation, with callers passing in a vtable for each concrete instance of a polymorphic type (similar to Rust's `dyn` trait). This means that we need to provide the traits of each polymorphic type in a polymorphic function in order for an importer to use it. This analysis follows a similar logic to the size/alignment of types, except that we're working to define a trait rather than a type. As such, if a trait mutually includes another trait, we ignore the circular reference. I am unsure if resolving traits requires resolving more than the symbol names of each type - if necessary we use the same technique as size/alignment analysis.
    370 
    371 ### Resolving names for compilation
    372 
    373 Now that we have every top-level symbol visible to a module ready to be used, we can proceed to resolve every symbol in each function body or initializer expression. This functionality is actually the domain of the overload resolution system rather than the module system, but the module system adds additional features which need to be handled by the overload resolution system.
    374 
    375 First, name resolution needs to be augmented to also store the containing module name. We also need to support the `::` operator, used when specifying a specific module name. This doesn't change how the overload resolution algorithm would work, but we need the containing module name in order to generate the full symbol name in the generated C code.
    376 
    377 Additionally, we have the fact that the imports' symbols are now available to participate in overload resolution. In this formalism, our module's symbols have priority over the imports' if they are deemed equal by the overload resolver, and we error out if we have to choose between two imported types with the same name. This allows us to handle the 2 special cases that our module system introduces to the overload resolver.
  • doc/theses/mike_brooks_MMath/background.tex

    r81e1984b r58a4cde  
    895895                head objects are discussed in \VRef{toc:lst:issue:ident}.
    896896                In (a), the field \lstinline{req.x} names a list direction;
    897                 these are discussed in \VRef{s:Axis}.
     897                these are discussed in \VRef{toc:lst:issue:simultaneity}.
    898898                In (b) and (c), the type \lstinline{node} represents a system-internal type,
    899899                which is \lstinline{std::_List_node} in the GNU implementation.
     
    947947                head objects are discussed in \VRef{toc:lst:issue:ident}.
    948948                In \protect\subref*{f:Intrusive}, the field \lstinline{req.d} names a list direction;
    949                 these are discussed in \VRef{s:Axis}.
     949                these are discussed in \VRef{toc:lst:issue:simultaneity}.
    950950                In \protect\subref*{f:WrappedRef} and \protect\subref*{f:WrappedValue}, the type \lstinline{node} represents a
    951951                library-internal type, which is \lstinline{std::_List_node} in the GNU implementation
     
    964964In wrapped value, the @req@ is copied, which increases storage usage, but allows independent simultaneous changes;
    965965however, knowing which of the @req@ object is the \emph{true} object becomes complex.
    966 \see*{\VRef{s:Axis} for further discussion.}
     966\see*{\VRef{toc:lst:issue:simultaneity} for further discussion.}
    967967
    968968The implementation of @LIST_ENTRY@ uses a trick to find the links and the node containing the links.
     
    10211021
    10221022
    1023 \subsection{Axis: Single \vs Multi-Static \vs Dynamic}
    1024 \label{s:Axis}
     1023\subsection{Simultaneity: Single \vs Multi-Static \vs Dynamic}
     1024\label{toc:lst:issue:simultaneity}
    10251025
    10261026\begin{figure}
     
    10421042\end{figure}
    10431043
    1044 \newterm{Axis} deals with the question:
     1044\newterm{Simultaneity} deals with the question:
    10451045In how many different lists can a node be stored, at the same time?
    10461046\VRef[Figure]{fig:lst-issues-multi-static} shows an example that can traverse all requests in priority order (field @pri@) or navigate among requests with the same request value (field @rqr@).
     
    11071107\end{c++}
    11081108
    1109 Axis cannot be done with multiple inheritance, because there is no mechanism to either know the order of inheritance fields or name each inheritance.
     1109Simultaneity cannot be done with multiple inheritance, because there is no mechanism to either know the order of inheritance fields or name each inheritance.
    11101110Instead, a special type is require that contains the link fields and points at the node.
    11111111\begin{cquote}
     
    11801180\begin{figure}
    11811181        \centering
    1182         \includegraphics[width=\textwidth]{lst-issues-ident.pdf}
     1182        \includegraphics{lst-issues-ident.pdf}
    11831183        \caption{
    11841184                Comparison of headed and ad-hoc list identities, for various list lengths.
     
    12201220\begin{figure}
    12211221        \centering
    1222         \includegraphics[width=0.55\textwidth]{lst-issues-end.pdf}
     1222        \includegraphics{lst-issues-end.pdf}
    12231223        \caption{
    12241224                LQ sub-object-level representation of links and ends.
     
    12621262
    12631263For UTF-8 string literals, the array elements have type @char@ and are initialized with the characters of the multi-byte character sequences, \eg @u8"\xe1\x90\x87"@ (Canadian syllabics Y-Cree OO).
    1264 For wide string literals prefixed by the letter @L@, the array elements have type @wchar_t@ and are initialized with the wide characters corresponding to the multi-byte character sequence, \eg @L"abc@$\mu$@"@ and are read/printed using @wscanf@/@wprintf@.
     1264For wide string literals prefixed by the letter @L@, the array elements have type @wchar_t@ and are initialized with the wide characters corresponding to the multi-byte character sequence, \eg @L"abc@$\mu$@"@ and are read/printed using @wsanf@/@wprintf@.
    12651265The value of a wide-character is implementation-defined, usually a UTF-16 character.
    12661266For wide string literals prefixed by the letter @u@ or @U@, the array elements have type @char16_t@ or @char32_t@, respectively, and are initialized with wide characters corresponding to the multi-byte character sequence, \eg @u"abc@$\mu$@"@, @U"abc@$\mu$@"@.
  • doc/theses/mike_brooks_MMath/list.tex

    r81e1984b r58a4cde  
    77
    88
    9 \section{Plan-9 Inheritance}
    10 \label{s:Plan9Inheritance}
    11 
    12 This chapter uses a form of inheritance from the Plan-9 C dialect~\cite[\S~3.3]{Thompson90new}, which is supported by @gcc@ and @clang@ using @-fplan9-extensions@.
    13 \CFA has its own variation of the Plan-9 mechanism, where the nested type is denoted using @inline@.
    14 \begin{cfa}
    15 union U {  int x;  double y;  char z; };
    16 struct W { int i;  double j;  char k; };
    17 struct S {
    18         @inline@ struct W;  $\C{// extended Plan-9 inheritance}$
    19         unsigned int tag;
    20         @inline@ U;                     $\C{// extended Plan-9 inheritance}$
    21 } s;
    22 \end{cfa}
    23 Inline inheritance is containment, where the inlined field is unnamed and the type's internal fields are hoisted into the containing structure.
    24 Hence, the field names must be unique, unlike \CC nested types, but any inlined type-names are at a nested scope level, unlike aggregate nesting in C.
    25 Note, the position of the containment is normally unimportant, unless there is some form of memory or @union@ overlay.
    26 Finally, the inheritance declaration of @U@ is not prefixed with @union@.
    27 Like \CC, \CFA allows optional prefixing of type names with their kind, \eg @struct@, @union@, and @enum@, unless there is ambiguity with variable names in the same scope.
    28 
    29 \VRef[Figure]{f:Plan9Polymorphism} shows the key polymorphic feature of Plan-9 inheritance: implicit conversion of values and pointers for nested types.
    30 In the example, there are implicit conversions from @S@ to @U@ and @S@ to @W@, extracting the appropriate value or pointer for the substructure.
    31 \VRef[Figure]{f:DiamondInheritancePattern} shows complex multiple inheritance patterns are possible, like the \newterm{diamond pattern}~\cite[\S~6.1]{Stroustrup89}\cite[\S~4]{Cargill91}.
    32 Currently, \CFA type-system does not support @virtual@ inheritance.
    33 
    34 \begin{figure}
    35 \begin{cfa}
    36 void f( U, U * ); $\C{// value, pointer}$
    37 void g( W, W * ); $\C{// value, pointer}$
    38 U u, * up;   W w, * wp;   S s, * sp;
    39 u = s;   up = sp; $\C{// value, pointer}$
    40 w = s;   wp = sp; $\C{// value, pointer}$
    41 f( s, &s ); $\C{// value, pointer}$
    42 g( s, &s ); $\C{// value, pointer}$
    43 \end{cfa}
    44 \caption{Plan-9 Polymorphism}
    45 \label{f:Plan9Polymorphism}
    46 \end{figure}
    47 
    48 \begin{figure}
    49 \setlength{\tabcolsep}{10pt}
    50 \begin{tabular}{ll@{}}
    51 \multicolumn{1}{c}{\CC} & \multicolumn{1}{c}{\CFA}      \\
    52 \begin{c++}
    53 struct B { int b; };
    54 struct L : @public B@ { int l, w; };
    55 struct R : @public B@ { int r, w; };
    56 struct T : @public L, R@ { int t; };
    57 
    58 T t = { { { 5 }, 3, 4 }, { { 8 }, 6, 7 }, 3 };
    59 t.t = 42;
    60 t.l = 42;
    61 t.r = 42;
    62 ((L &)t).b = 42;  // disambiguate
    63 \end{c++}
    64 &
    65 \begin{cfa}
    66 struct B { int b; };
    67 struct L { @inline B;@ int l, w; };
    68 struct R { @inline B;@ int r, w; };
    69 struct T { @inline L; inline R;@ int t; };
    70 
    71 T t = { { { 5 }, 3, 4 }, { { 8 }, 6, 7 }, 3 };
    72 t.t = 42;
    73 t.l = 42;
    74 t.r = 42;
    75 ((L &)t).b = 42;  // disambiguate, proposed solution t.L.b = 42;
    76 \end{cfa}
    77 \end{tabular}
    78 \caption{Diamond Non-Virtual Inheritance Pattern}
    79 \label{f:DiamondInheritancePattern}
    80 \end{figure}
    81 
    82 
    839\section{Features}
    8410
     
    9117This design covers system and data management issues stated in \VRef{toc:lst:issue}.
    9218
    93 \VRef[Figure]{fig:lst-features-intro} continues the running @req@ example from \VRef[Figure]{fig:lst-issues-attach} using the \CFA list @dlist@.
     19\VRef[Figure]{fig:lst-features-intro} continues the running @req@ example from \VRef[Figure]{fig:lst-issues-attach} using the \CFA list.
    9420The \CFA link attachment is intrusive so the resulting memory layout is per user node, as for the LQ version of \VRef[Figure]{f:Intrusive}.
    95 The \CFA framework provides generic type @dlink( T )@ (not to be confused with @dlist@) for the two link fields (front and back).
    96 A user inserts the links into the @req@ structure via \CFA inline-inheritance \see{\VRef{s:Plan9Inheritance}}.
    97 Lists leverage the automatic conversion of a pointer to anonymous inline field for assignments and function calls.
    98 Therefore, a reference to a @req@ is implicitly convertible to @dlink@ in both contexts.
    99 The links in @dlist@ point at another embedded @dlist@ node, know the offsets of all links (data is abstract but accessible), and any field-offset arithmetic or link-value changes are safe and abstract.
     21The \CFA framework provides generic type @dlink( T, T )@ for the two link fields (front and back).
     22A user inserts the links into the @req@ structure via \CFA inline-inheritance from the Plan-9 C dialect~\cite[\S~3.3]{Thompson90new}.
     23Inline inheritance is containment, where the inlined field is unnamed but the type's internal fields are hoisted into the containing structure.
     24Hence, the field names must be unique, unlike \CC nested types, but the type names are at a nested scope level, unlike aggregate nesting in C.
     25Note, the position of the containment is normally unimportant, unless there is some form of memory or @union@ overlay.
     26The key feature of inlined inheritance is that a pointer to the containing structure is automatically converted to a pointer to any anonymous inline field for assignments and function calls, providing containment inheritance with implicit subtyping.
     27Therefore, a reference to a @req@ is implicitly convertible to @dlink@ in assignments and function calls.
     28% These links have a nontrivial, user-specified location within the @req@ structure;
     29% this convention encapsulates the implied pointer arithmetic safely.
     30The links in @dlist@ point at (links) in the containing node, know the offsets of all links (data is abstract), and any field-offset arithmetic or link-value changes are safe and abstract.
    10031
    10132\begin{figure}
     
    10839\end{figure}
    10940
    110 \VRef[Figure]{f:dlistOutline} shows the outline for types @dlink@ and @dlist@.
    111 Note, the first @forall@ clause is distributed across all the declaration in its containing block, eliminating repetition on each declaration.
    112 The second nested @forall@ on @dlist@ is also distributed and adds an optional second type parameter, @tLinks@, denoting the linking axis \see{\VRef{s:Axis}}, \ie the kind of list this node can appear on.
    113 
    114 \begin{figure}
    115 \begin{cfa}
    116 forall( tE & ) {                                        $\C{// distributed}$
    117         struct @dlink@ { ...  };                $\C{// abstract type}$
    118         static inline void ?{}( dlink( tE ) & this );  $\C{// constructor}$
    119 
    120         forall( tLinks & = dlink( tE ) ) { $\C{// distributed, default type for axis}$
    121                 struct @dlist@ { ... };         $\C{// abstract type}$
    122                 static inline void ?{}( dlist( tE, tLinks ) & this ); $\C{// constructor}$
    123         }
    124 }
    125 \end{cfa}
    126 \caption{\lstinline{dlisk} / \lstinline{dlist} Outline}
    127 \label{f:dlistOutline}
    128 \end{figure}
    129 
    13041\VRef[Figure]{fig:lst-features-multidir} shows how the \CFA library supports multi-inline links, so a node can be on one or more lists simultaneously.
    13142The declaration of @req@ has two inline-inheriting @dlink@ occurrences.
    13243The first of these gives a type named @req.by_pri@, @req@ inherits from it, and it inherits from @dlink@.
    13344The second line @req.by_rqr@ is similar to @req.by_pri@.
    134 Thus, there is a diamond, non-virtual, inheritance from @req@ to @dlink@, with @by_pri@ and @by_rqr@ being the mid-level types \see{\VRef[Figure]{f:DiamondInheritancePattern}}.
    135 
    136 The declarations of the list-head objects, @reqs_pri@, @reqs_rqr_42@, @reqs_rqr_17@, and @reqs_rqr_99@, bind which link nodes in @req@ are used by this list.
    137 Hence, the type of the variable @reqs_pri@, @dlist(req, req.by_pri)@, means operations called on @reqs_pri@ implicitly select (disambiguate) the correct @dlink@s, \eg the calls @insert_first(reqs_pri, ...)@ imply, ``here, we are working by priority.''
     45Thus, there is a diamond, non-virtual, inheritance from @req@ to @dlink@, with @by_pri@ and @by_rqr@ being the mid-level types.
     46
     47Disambiguation occurs in the declarations of the list-head objects: @reqs_pri_global@, @reqs_rqr_42@, @reqs_rqr_17@, and @reqs_rqr_99@.
     48The type of the variable @reqs_pri_global@ is @dlist(req, req.by_pri)@, meaning operations called on @reqs_pri_global@ are implicitly disambiguated.
     49In the example, the calls @insert_first(reqs_pri_global, ...)@ imply, ``here, we are working by priority.''
    13850As in \VRef[Figure]{fig:lst-issues-multi-static}, three lists are constructed, a priority list containing all nodes, a list with only nodes containing the value 42, and a list with only nodes containing the value 17.
    13951
     
    15769\end{figure}
    15870
    159 The list library also supports the common case of single directionality more naturally than LQ. 
    160 Returning to \VRef[Figure]{fig:lst-features-intro}, the single-direction list has no contrived name for the link direction as it uses the default type in the definition of @dlist@;
    161 in contrast, the LQ list in \VRef[Figure]{f:Intrusive} adds the unnecessary field name @d@.
    162 In \CFA, a single direction list sets up a single inheritance with @dlink@, and the default list axis is to itself.
    163 
    164 When operating on a list with several directions and operations that do not take the list head, the list axis can be ambiguous.
    165 For example, a call like @insert_after( r1, r2 )@ does not have enough information to know which axes to select implicitly.
     71The \CFA library also supports the common case, of single directionality, more naturally than LQ. 
     72\VRef[Figure]{fig:lst-features-intro} shows a single-direction list done with no contrived name for the link direction,
     73where \VRef[Figure]{f:Intrusive} adds the unnecessary field name, @d@.
     74In \CFA, a user doing a single direction (\VRef[Figure]{fig:lst-features-intro}) sets up a simple inheritance with @dlink@, and declares a list head to have the simpler type @dlist( T )@.
     75In contrast, (\VRef[Figure]{fig:lst-features-multidir}) sets up a diamond inheritance with @dlink@, and declares a list head to have the more-informed type @dlist( T, DIR )@.
     76
     77The directionality issue also has an advanced corner-case that needs treatment.
     78When working with multiple directions, calls like @insert_first@ benefit from implicit direction disambiguation;
     79however, other calls like @insert_after@ still require explicit disambiguation, \eg the call
     80\begin{cfa}
     81insert_after(r1, r2);
     82\end{cfa}
     83does not have enough information to clarify which of a request's simultaneous list directions is intended.
    16684Is @r2@ supposed to be the next-priority request after @r1@, or is @r2@ supposed to join the same-requester list of @r1@?
    167 As such, the \CFA type-system gives an ambiguity error for this call.
    168 There are multiple ways to resolve the ambiguity.
    169 The simplest is an explicit cast on each call to select the specific axis, \eg @insert_after( (by_pri)r1, r2 )@.
    170 However, multiple explicit casts are tedious and error-prone.
    171 To mitigate this issue, the list library provides a hook for applying the \CFA language's scoping and priority rules.
    172 \begin{cfa}
    173 with ( DLINK_VIA(  req, req.pri ) ) insert_after( r1, r2 );
     85As such, the \CFA compiler gives an ambiguity error for this call.
     86To resolve the ambiguity, the list library provides a hook for applying the \CFA language's scoping and priority rules.
     87It applies as:
     88\begin{cfa}
     89with ( DLINK_VIA(req, req.pri) ) insert_after(r1, r2);
    17490\end{cfa}
    17591Here, the @with@ statement opens the scope of the object type for the expression;
    17692hence, the @DLINK_VIA@ result causes one of the list directions to become a more attractive candidate to \CFA's overload resolution.
    177 This boost can be applied across multiple statements in a block or an entire function body.
     93This boost applies within the scope of the following statement, but could also be a custom block or an entire function body.
    17894\begin{cquote}
    17995\setlength{\tabcolsep}{15pt}
    18096\begin{tabular}{@{}ll@{}}
    18197\begin{cfa}
    182 @with( DLINK_VIA( req, req.pri ) ) {@
    183         ...  insert_after( r1, r2 );  ...
    184 @}@
     98void f() @with( DLINK_VIA(req, req.pri) )@ {
     99        ...
     100
     101        insert_after(r1, r2);
     102
     103        ...
     104}
    185105\end{cfa}
    186106&
    187107\begin{cfa}
    188 void f() @with( DLINK_VIA( req, req.pri ) ) {@
    189         ... insert_after( r1, r2 ); ...
    190 @}@
     108void f() {
     109        ...
     110        @with( DLINK_VIA(req, req.pri) )@ {
     111                ...  insert_after(r1, r2);  ...
     112        }
     113        ...
     114}
    191115\end{cfa}
    192116\end{tabular}
    193117\end{cquote}
    194 Within the @with@, the code acts as if there is only one list direction.
    195 
    196 Unlike the \CC template container-types, the \CFA library works completely within the type system;
    197 both @dlink@ and @dlist@ are ordinary types, not language macros.
     118By using a larger scope, a user can put code within that acts as if there is only one list direction.
     119This boost is needed only when operating on a list with several directions, using operations that do not take the list head.
     120Otherwise, the sole applicable list direction \emph{just works}.
     121
     122Unlike \CC templates container-types, the \CFA library works completely within the type system;
     123both @dlink@ and @dlist@ are ordinary types.
    198124There is no textual expansion other than header-included static-inline function for performance.
    199 Hence, errors in user code are reported only with mention of the library's declarations.
    200 Finally, the library is separately compiled from the usage code, modulo inlining.
    201 
    202 The \CFA library works in headed and headless modes.
    203 \PAB{TODO: elaborate.}
     125Errors in user code are reported only with mention of the library's declarations.
     126Finally, the library is separately compiled from the usage code.
     127
     128The \CFA library works in headed and headless modes.  TODO: elaborate.
    204129
    205130
     
    217142@last@ returns a reference to the last node of the list without removing it or @0p@ if the list is empty.
    218143\item
    219 @insert_before@ adds a node before a specified node \see{\lstinline{insert_last} for insertion at the end}\footnote{
    220 Some list packages allow \lstinline{0p} (\lstinline{nullptr}) for the before/after node implying insert/remove at the start/end of the list, respectively.
    221 However, this inserts an \lstinline{if} statement in the fastpath of a potentially commonly used list operation.}
    222 \item
    223 @insert_after@ adds a node after a specified node \see{\lstinline{insert_first} for insertion at the start}\footnotemark[\value{footnote}].
    224 \item
    225 @remove@ removes a specified node from the list (any location) and returns a reference to the node.
     144@insert_before@ adds a node before the specified node and returns the added node for cascading.
     145\item
     146@insert_after@ adds a node after the specified node and returns the added node for cascading.
     147\item
     148@remove@ removes the specified node from the list (any location) and returns a reference to the node.
    226149\item
    227150@iter@ create an iterator for the list.
     
    235158@isLast@ returns true if the node is the last node in the list and false otherwise.
    236159\item
    237 @pred@ returns a reference to the previous (predecessor, towards first) node before a specified node or @0p@ if a specified node is the first node in the list.
    238 \item
    239 @next@ returns a reference to the next (successor, towards last) node after a specified node or @0p@ if a specified node is the last node in the list.
     160@pred@ returns a reference to the previous (predecessor, towards first) node before the specified node or @0p@ if the specified node is the first node in the list.
     161\item
     162@next@ returns a reference to the next (successor, towards last) node after the specified node or @0p@ if the specified node is the last node in the list.
    240163\item
    241164@insert_first@ adds a node to the start of the list so it becomes the first node and returns a reference to the node for cascading.
     
    313236The choice depends on how the programmer wants to access the fields: @it->f@ or @it.f@.
    314237The following examples use a reference because the loop body manipulates the node values rather than the list pointers.
    315 The end of iteration is denoted by the loop cursor returning @0p@.
     238The end of iteration is denoted by the loop cursor having the null pointer, denoted @0p@ in \CFA.
    316239
    317240\noindent
     
    520443
    521444\VRef[Figure]{fig:lst-impl-links} continues the running @req@ example, now showing the \CFA list library's internal representation.
    522 The @dlink@ structure contains exactly two pointers: @next@ and @prev@, which are opaque.
     445The @dlink@ structure contains exactly two pointers: @next@ and @prev@, which are opaque to a user.
    523446Even though the user-facing list model is linear, the CFA library implements all listing as circular.
    524 This choice helps achieve uniform end treatment and \PAB{TODO finish summarizing benefit}.
     447This choice helps achieve uniform end treatment and TODO finish summarizing benefit.
    525448A link pointer targets a neighbouring @dlink@ structure, rather than a neighbouring @req@.
    526449(Recall, the running example has the user putting a @dlink@ within a @req@.)
     
    528451\begin{figure}
    529452        \centering
    530         \includegraphics[width=\textwidth]{lst-impl-links.pdf}
     453        \includegraphics{lst-impl-links.pdf}
    531454        \caption{
    532455                \CFA list library representations for the cases under discussion.
     
    535458\end{figure}
    536459
    537 Circular link-pointers (dashed lines) are tagged internally in the pointer to indicate linear endpoints.
     460System-added link pointers (dashed lines) are internally tagged to indicate linear endpoints that are the circular pointers.
    538461Links among neighbour nodes are not tagged.
    539 Iteration reports ``has more elements'' when accessing untagged links, and ``no more elements'' when accessing tagged links.
    540 Hence, the tags are set on the links that a user cannot navigate.
     462Iteration reports ``has more elements'' when crossing natural links, and ``no more elements'' upon reaching a tagged link.
    541463
    542464In a headed list, the list head (@dlist(req)@) acts as an extra element in the implementation-level circularly-linked list.
    543 The content of a @dlist@ header is a (private) @dlink@, with the @next@ pointer to the first element, and the @prev@ pointer to the last element.
    544 Since the head wraps a @dlink@, as does @req@, and since a link-pointer targets a @dlink@, the resulting cycle is among @dlink@ structures, situated inside of header/node.
     465The content of a @dlist@ is a (private) @dlink@, with the @next@ pointer to the first element, and the @prev@ pointer to the last element.
     466Since the head wraps a @dlink@, as does a @req@ does too, and since a link-pointer targets a @dlink@, the resulting cycle is among @dlink@ structures, situated inside of header/node.
    545467An untagged pointer points within a @req@, while a tagged pointer points within a list head.
    546468In a headless list, the circular backing list is only among @dlink@s within @req@s.
    547 
    548 No distinction is made between an unlisted node (top left and middle) under a headed model and a singleton list under a headless model.
     469The tags are set on the links that a user cannot navigate.
     470
     471No distinction is made between an unlisted item under a headed model and a singleton list under a headless model.
    549472Both are represented as an item referring to itself, with both tags set.
    550473
     
    553476\label{toc:lst:assess}
    554477
    555 This section examines the performance of the discussed list implementations.
    556 The goal is to show the \CFA lists are competitive with other designs, but the different list designs may not have equivalent functionality, so it is impossible to select a winner encompassing both functionality and execution performance.
    557 
    558 
    559478\subsection{Add-Remove Performance}
    560479
    561 The fundamental job of a linked-list library is to manage the links that connect nodes.
    562 Any link management is an action that causes pair(s) of elements to become, or cease to be, adjacent in the list.
     480The fundamental job of a linked-list library is to manage the links that connect users' items.
     481Any link management is an action that causes pair(s) of elements to become, or cease to be, adjacent.
    563482Thus, adding and removing an element are the sole primitive actions.
    564483
     
    566485These instruction sequences may have cases that proceed (in a modern, deep pipeline) without a stall.
    567486
    568 This experiment takes the position that:
     487This experiment takes the position that
    569488\begin{itemize}
    570     \item The total time to add and remove is relevant \vs individual times for add and remove.
    571           Adds without removes quickly fill memory;
    572                   removes without adds is meaningless.
    573     \item A relevant breakdown ``by operation'' is:
     489    \item The total time to add and remove is relevant; an attribution of time spent adding vs.\ removing is not.
     490          Any use case for which addition speed matters necessarily has removes paired with adds.
     491                  For otherwise, the alleged usage would exhaust the amount of work expressable as a main-memory full of nodes within a few seconds.
     492    \item A relevant breakdown ``by operation'' is, rather, one that considers the structural context of these requests.
    574493                \begin{description}
    575494                \item[movement]
    576           Is the add/remove applied to a stack, queue, or something else?
     495          Is the add/remove order that of a stack, a queue, or something else?
    577496                \item[polarity]
    578                   In which direction does the action apply?
    579                   For a queue, do items flow from first to last or last to first?
    580                   For a stack, is the first or last end used for adding and removing?
     497                  In which direction does the movement's action apply?  For a queue, do items flow from first to last or last to first?  For a stack, is the first-end or the last-end used for adding and removing?
    581498                \item[accessor]
    582499          Is an add/remove location given by a list head's ``first''/``last'', or by a reference to an individual element?
     
    586503\end{itemize}
    587504
    588 The experiment used to measure insert/remove cost measures the mean duration of a sequence of additions and removals.
     505This experiment measures the mean duration of a list addition and removal.
    589506Confidence bounds, on this mean, are discussed.
    590507The distribution of speeds experienced by an individual add-remove pair (tail latency) is not discussed.
     508
    591509Space efficiency is shown only indirectly, by way of caches' impact on speed.
    592 The experiment is sensitive enough to show:
     510
     511%~MONITOR
     512% If able to show cases with CFA doing better, reword.
     513The goal is to show the \CFA library performing comparably to other intrusive libraries,
     514in an experimental context sensitive enough to show also:
    593515\begin{itemize}
    594     \item intrusive lists performing (majorly) differently than wrapped lists,
    595     \item a space of (minor) performance differences among the intrusive lists.
     516    \item intrusive lists performing (majorly) differently than wrapped lists
     517    \item a space of (minor) performance differences typical of existing intrusive lists
    596518\end{itemize}
    597519
     
    599521\subsubsection{Experiment setup}
    600522
    601 The experiment driver defines a (intrusive) node type:
    602 \begin{cfa}
    603 struct Node {
    604         int i, j, k;  // fields
    605         // possible intrusive links
    606 };
    607 \end{cfa}
    608 and considers the speed of building and tearing down a list of $n$ instances of it.
    609 % A number of experimental rounds per clock check is precalculated to be appropriate to the value of $n$.
     523The experiment defines a user's datatype and considers
     524the speed of building, and tearing down, a list of $n$ instances of the user's type.
     525
     526The timings are taken with a fixed-duration method based on checks @clock()@.
     527In a typical 5-sec run, the outer looping checks the clock about 200 times.
     528A number of experimental rounds per clock check is precalculated to be appropriate to the value of $n$.
     529
    610530\begin{cfa}
    611531// simplified harness: CFA implementation,
    612532// stack movement, insert-first polarity, head-mediated access
    613533size_t totalOpsDone = 0;
    614 dlist( node_t ) lst;
    615 node_t nodes[ n ];                                              $\C{// preallocated list storage}$
     534dlist( item_t ) lst;
     535item_t items[ n ];
    616536startTimer();
    617 while ( CONTINUE ) {                                    $\C{// \(\approx\) 20 second duration}$
    618         for ( i; n ) insert_first( lst, nodes[i] );  $\C{// build up}$
    619         for ( i; n ) remove_first( lst );       $\C{// tear down}$
     537while ( SHOULD_CONTINUE ) {
     538        for ( i; n ) insert_first( lst, items[i] );
     539        for ( i; n ) remove_first( lst );
    620540        totalOpsDone += n;
    621541}
    622542stopTimer();
    623 reportedDuration = getTimerDuration() / totalOpsDone; // throughput per insert/remove operation
    624 \end{cfa}
    625 To reduce administrative overhead, the $n$ nodes for each experiment list are preallocated in an array (on the stack), which removes dynamic allocations for this storage.
    626 These nodes contain an intrusive pointer for intrusive lists, or a pointer to it is stored in a dynamically-allocated internal-node for a wrapper list.
    627 Copying the node for wrapped lists skews the results with administration costs.
    628 The list insertion/removal operations are repeated for a typical 20+ second duration.
    629 After each round, a counter is incremented by $n$ (for throughput).
    630 Time is measured outside the loop because a large $n$ can overrun the time duration before the @CONTINUE@ flag is tested.
    631 The loop duration is divided by the counter and this throughput is reported.
    632 In a scatter-plot, each dot is one throughput, which means insert + remove + harness overhead.
    633 The harness overhead is constant when comparing linked-list implementations and keep as small as possible.
    634 % The remainder of the setup section discusses the choices that affected the harness overhead.
    635 
    636 To test list operations, the experiment performs the inserts/removes in different patterns, \eg insert and remove from front, insert from front and remove from back, random insert and remove, \etc.
    637 Unfortunately, the @std::list@ does \emph{not} support direct insert/remove from a node without an iterator, \ie no @erase( node )@, even though the list is doubly-linked.
    638 To eliminate this additional cost in the harness, a trick is used for random insertions without replacement.
    639 The @i@ fields in each node are initialized from @0..n-1@, these @i@ values are then shuffled in the nodes, and the @i@ value is used to represent an indirection to that node for insertion, so the nodes are inserted in random order, and hence, removed in th same random order.
    640 $\label{p:Shuffle}$
     543reportedDuration = getTimerDuration() / totalOpsDone;
     544\end{cfa}
     545
     546One experimental round is, first, a tight loop of inserting $n$ elements into a list, followed by another, to remove these $n$ elements.
     547A counter is incremented by $n$ each round.
     548When the whole experiment is done, the total elapsed time, divided by final value of the operation counter,
     549is reported as the observed mean operation duration.
     550In a scatterplot presentation, each dot would be one such reported mean duration.
     551So, ``operation'' really means insert + remove + harness overhead.
     552
     553The harness overheads are held constant when comparing linked-list implementations.
     554The remainder of the setup section discusses the choices that affected the harness overhead.
     555
     556An \emph{iterators' array} provides support for element-level operations on non-intrusive lists.
     557As elaborated in Section \ref{toc:lst:issue:attach},
     558wrapped-attachment lists use a distinct type (at a distinct memory location) to represent ``an item that's in the list.''
     559Operations like insert-after and remove-here consume iterators.
     560In the STL implementation, an iterator is a pointer to a \lstinline{std::_List_node}.
     561For the STL case, the driver obtains an iterator value
     562at the time of adding to the list, and stores the iterator in an array, for consumption by subsequent element-oriented operations.
     563For intrusive-list cases, the driver stores the user object's address in the iterators' array.
     564
    641565\begin{c++}
    642         for ( i; n ) @nodes[i].i = i@;          $\C{// indirection}$
    643         shuffle( nodes, n );                            $\C{// random shuffle indirects within nodes}$
    644 
    645         while ( CONTINUE ) {
    646                 for ( i; n ) insert_first( lst, nodes[ @nodes[i].i@ ] ); $\C{// build up}$
    647                 for ( i; n ) pass( &remove_first( lst ) );      $\C{// tear down}$
    648                 totalOpsDone += n;
     566// further simplified harness (bookkeeping elided): STL implementation,
     567// stack movement, insert-first polarity, element-based remove access
     568list< item_t * > lst;
     569item_t items[ n ];
     570while ( SHOULD_CONTINUE ) {
     571        @list< item_t * >::iterator iters[ n ];@
     572        for ( int i = 0; i < n; i += 1 ) {
     573                lst.push_front( & items[i] );
     574                @iters[i]@ = lst.begin();
    649575        }
     576        for ( int i = 0; i < n; i += 1 ) {
     577                lst.erase( @iters[i]@ );
     578        }
     579}
    650580\end{c++}
    651 This approach works across intrusive and wrapped lists.
    652 
    653 % \emph{Interleaving} allows for movements other than pure stack and queue.
    654 % Note that the earlier example of using the iterators' array is still a pure stack: the item selected for @erase(...)@ is always the first.
    655 % Including a less predictable movement is important because real applications that justify doubly linked lists use them.
    656 % Freedom to remove from arbitrary places (and to insert under more relaxed assumptions) is the characteristic function of a doubly linked list.
    657 % A queue with drop-out is an example of such a movement.
    658 % A list implementation can show unrepresentative speed under a simple movement, for example, by enjoying unchallenged ``Is first element?'' branch predictions.
    659 
    660 % Interleaving brings ``at middle of list'' cases into a stream of add or remove invocations, which would otherwise be exclusively ``at end''.
    661 % A chosen split, like half middle and half end, populates a boolean array, which is then shuffled.
    662 % These booleans then direct the action to end-\vs-middle.
    663 %
    664 % \begin{cfa}
    665 % // harness (bookkeeping and shuffling elided): CFA implementation,
    666 % // stack movement, insert-first polarity, interleaved element-based remove access
    667 % dlist( item_t ) lst;
    668 % item_t items[ n ];
    669 % @bool interl[ n ];@  // elided: populate with weighted, shuffled [0,1]
    670 % while ( CONTINUE ) {
    671 %       item_t * iters[ n ];
    672 %       for ( i; n ) {
    673 %               insert_first( items[i] );
    674 %               iters[i] = & items[i];
    675 %       }
    676 %       @item_t ** crsr[ 2 ]@ = { // two cursors into iters
    677 %               & iters[ @0@ ], // at stack-insert-first's removal end
    678 %               & iters[ @n / interl_frac@ ]  // in middle
    679 %       };
    680 %       for ( i; n ) {
    681 %               item *** crsr_use = & crsr[ interl[ i ] ]@;
    682 %               remove( *** crsr_use );  // removing from either middle or end
    683 %               *crsr_use += 1;  // that item is done
    684 %       }
    685 %       assert( crsr[0] == & iters[ @n / interl_frac@ ] ); // through second's start
    686 %       assert( crsr[1] == & iters[ @n@ ] );  // did the rest
    687 % }
    688 % \end{cfa}
    689 %
    690 % By using the pair of cursors, the harness avoids branches, which could incur prediction stall times themselves, or prime a branch in the SUT.
    691 % This harness avoids telling the hardware what the SUT is about to do.
     581
     582%~MONITOR
     583% If running insert-random scenarios, revise the assessment
     584
     585A \emph{shuffling array} helps control the memory layout of user items.
     586The control required is when choosing a next item to insert.
     587The user items are allocated in a contiguous array.
     588Without shuffling, the driver's insert phase visits these items in order, producing a list whose adjavency links hop uniform strides.
     589With shuffling active, the driver's insert phase visits only the shuffling array in order,
     590which applies pseudo-random indirection to the selection of a next-to-insert element from the user-item array.
     591The result is a list whose links travel randomly far.
     592
     593\begin{cfa}
     594// harness (bookkeeping and iterators elided): CFA implementation,
     595// stack movement, insert-first polarity, head-mediated access
     596dlist( item_t ) lst;
     597item_t items[ n ];
     598size_t insert_ord[ n ];  // elided: populate with shuffled [0,n)
     599while ( SHOULD_CONTINUE ) {
     600        for ( i; n ) insert_first( lst, items[ @insert_ord[@ i @]@ ] );
     601        for ( i; n ) remove_first( lst );
     602}
     603\end{cfa}
     604
     605\emph{Interleaving} allows for movements other than pure stack and queue.
     606Note that the earlier example of using the iterators' array is still a pure stack: the item selected for @erase(...)@ is always the first.
     607Including a less predictable movement is important because real applications that justify doubly linked lists use them.
     608Freedom to remove from arbitrary places (and to insert under more relaxed assumptions) is the characteristic function of a doubly linked list.
     609A queue with drop-out is an example of such a movement.
     610A list implementation can show unrepresentative speed under a simple movement, for example, by enjoying unchallenged ``Is first element?'' branch predictions.
     611
     612Interleaving brings ``at middle of list'' cases into a stream of add or remove invocations, which would otherwise be exclusively ``at end''.
     613A chosen split, like half middle and half end, populates a boolean array, which is then shuffled.
     614These booleans then direct the action to end-\vs-middle.
     615
     616\begin{cfa}
     617// harness (bookkeeping and shuffling elided): CFA implementation,
     618// stack movement, insert-first polarity, interleaved element-based remove access
     619dlist( item_t ) lst;
     620item_t items[ n ];
     621@bool interl[ n ];@  // elided: populate with weighted, shuffled [0,1]
     622while ( SHOULD_CONTINUE ) {
     623        item_t * iters[ n ];
     624        for ( i; n ) {
     625                insert_first( items[i] );
     626                iters[i] = & items[i];
     627        }
     628        @item_t ** crsr[ 2 ]@ = { // two cursors into iters
     629                & iters[ @0@ ], // at stack-insert-first's removal end
     630                & iters[ @n / interl_frac@ ]  // in middle
     631        };
     632        for ( i; n ) {
     633                item *** crsr_use = & crsr[ interl[ i ] ]@;
     634                remove( *** crsr_use );  // removing from either middle or end
     635                *crsr_use += 1;  // that item is done
     636        }
     637        assert( crsr[0] == & iters[ @n / interl_frac@ ] ); // through second's start
     638        assert( crsr[1] == & iters[ @n@ ] );  // did the rest
     639}
     640\end{cfa}
     641
     642By using the pair of cursors, the harness avoids branches, which could incur prediction stall times themselves, or prime a branch in the SUT.
     643This harness avoids telling the hardware what the SUT is about to do.
     644
     645These experiments are single threaded.  They run on a PC with a 64-bit eight-core AMD FX-8370E, with ``taskset'' pinning to core \#6.  The machine has 16 GB of RAM and 8 MB of last-level cache.
    692646
    693647The comparator linked-list implementations are:
    694648\begin{description}
    695 \item[std::list]  The @list@ type of g++.
    696 \item[lq-list]  The @list@ type of LQ from glibc of gcc.
     649\item[lq-list]  The @list@ type of LQ from glibc of GCC-11.
    697650\item[lq-tailq] The @tailq@ type of the same.
    698 \item[upp-upp]  \uCpp provided @uSequence@
     651\item[upp-upp]  uC++ provided @uSequence@
    699652\item[cfa-cfa]  \CFA's @dlist@
    700653\end{description}
    701654
    702655
    703 \subsection{Experimental Environment}
    704 \label{s:ExperimentalEnvironment}
    705 
    706 The performance experiments are run on:
    707 \begin{description}[leftmargin=*,topsep=3pt,itemsep=2pt,parsep=0pt]
    708 %\item[PC]
    709 %with a 64-bit eight-core AMD FX-8370E, with ``taskset'' pinning to core \#6.  The machine has 16 GB of RAM and 8 MB of last-level cache.
    710 %\item[ARM]
    711 %Gigabyte E252-P31 128-core socket 3.0 GHz, WO memory model
    712 \item[AMD]
    713 Supermicro AS--1125HS--TNR EPYC 9754 128--core socket, hyper-threading $\times$ 2 sockets (512 processing units) 2.25 GHz, TSO memory model, with cache structure 32KB L1i/L1d, 1024KB L2, 16MB L3, where each L3 cache covers 16 processors.
    714 %\item[Intel]
    715 %Supermicro SYS-121H-TNR Xeon Gold 6530 32--core, hyper-threading $\times$ 2 sockets (128 processing units) 2.1 GHz, TSO memory model
    716 \end{description}
    717 The experiments are single threaded and pinned to single core to prevent any OS movement, which might cause cache or NUMA effects perturbing the experiment.
    718 
    719 The compiler is gcc/g++-14.2.0 running on the Linux v6.8.0-52-generic OS.
    720 Switching between the default memory allocators @glibc@ and @llheap@ is done with @LD_PRELOAD@.
    721 To prevent eliding certain code patterns, crucial parts of a test are wrapped by the function @pass@
    722 \begin{cfa}
    723 // prevent eliding, cheaper than volatile
    724 static inline void * pass( void * v ) {  __asm__  __volatile__( "" : "+r"(v) );  return v;  }
    725 ...
    726 pass( &remove_first( lst ) );                   // wrap call to prevent elision, insert cannot be elided now
    727 \end{cfa}
    728 The call to @pass@ can prevent a small number of compiler optimizations but this cost is the same for all lists.
    729 
    730 
    731656\subsubsection{Result: Coarse comparison of styles}
    732657
    733 This comparison establishes how an intrusive list performs compared with a wrapped-reference list.
    734 \VRef[Figure]{fig:plot-list-zoomout} presents throughput at various list lengths for a linear and random (shuffled) insert/remove test.
    735 Other kinds of scans were made, but the results are similar in many cases, so it is sufficient to discuss these two scans, representing difference ends of the access spectrum.
    736 In the graphs, all four intrusive lists are plotted with the same symbol;
    737 theses symbols clumped on top of each, showing the performance difference among intrusive lists is small in comparison to the wrapped list \see{\VRef{s:ComparingIntrusiveImplementations} for details among intrusive lists}.
    738 
    739 The list lengths start at 10 due to the short insert/remove times of 2--4 ns, for intrusive lists, \vs STL's wrapped-reference list of 15--20 ns.
    740 For very short lists, like 4, the experiment time of 4 $\times$ 2.5 ns and experiment overhead (loops) of 2--4 ns, results in an artificial administrative bump at the start of the graph having nothing to do with the insert/remove times.
    741 As the list size grows, the administrative overhead for intrusive lists quickly disappears.
     658This comparison establishes how an intrusive list performs, compared with a wrapped-reference list.
     659It also establishes the context within which it is meaningful to compare one intrusive list to another.
     660
     661%These goals notwithstanding, the effect of the host machine's memory hierarchy is more significant here than linked-list implementation.
    742662
    743663\begin{figure}
    744   \centering
    745   \subfloat[Linear List Nodes]{\label{f:Linear}
    746     \includegraphics{plot-list-zoomout-noshuf.pdf}
    747   } % subfigure
    748   \\
    749   \subfloat[Shuffled List Nodes]{\label{f:Shuffled}
    750     \includegraphics{plot-list-zoomout-shuf.pdf}
    751   } % subfigure
    752   \caption{Insert/remove duration \vs list length.
    753   Lengths go as large possible without error.
    754   One example operation is shown: stack movement, insert-first polarity and head-mediated access.}
     664\centering
     665  \begin{tabular}{c}
     666  \includegraphics{plot-list-zoomout-shuf.pdf} \\
     667  (a) \\
     668  \includegraphics{plot-list-zoomout-noshuf.pdf} \\
     669  (b) \\
     670  \end{tabular}
     671  \caption{Operation duration \vs list length at full spectrum of list lengths.  One example operation is shown: stack movement, insert-first polarity and head-mediated access.  Lengths go as large as completes without error.  Version (a) uses shuffled items, while version (b) links items with their physical neighbours.}
    755672  \label{fig:plot-list-zoomout}
    756673\end{figure}
    757674
    758 The large difference in performance between intrusive and wrapped-reference lists results from the dynamic allocation for the wrapped nodes.
    759 In effect, this experiment is largely measuring the cost of malloc/free rather than the insert/remove, and is affected by the layout of memory by the allocator.
    760 Fundamentally, insert/remove for a doubly linked-list has a basic cost, which is seen by the similar results for the intrusive lists.
    761 Hence, the costs for a wrapped-reference list are: allocating/deallocating a wrapped node, copying a external-node pointer into the wrapped node for insertion, and linking the wrapped node to/from the list;
    762 the dynamic allocation dominates these costs.
    763 For example, the experiment was run with both glibc and llheap memory allocators, where performance from llheap reduced the cost from 20 to 16 ns, which is still far from the 2--4 ns for intrusive lists.
    764 Unfortunately, there is no way to tease apart the linking and allocation costs for wrapped lists, as there is no way to preallocate the list nodes without writing a mini-allocator to manage the storage.
    765 Hence, dynamic allocation cost is fundamental for wrapped lists.
    766 
    767 In detail, \VRef[Figure]{f:Linear} shows linear insertion of all the nodes and then linear removal, both in the same direction.
    768 For intrusive lists, the nodes are adjacent because they are preallocated in an array.
    769 For wrapped lists, the wrapped nodes are still adjacent because the memory allocator happens to use bump allocation for the small fixed-sized wrapped nodes.
    770 As a result, these memory layouts result in high spatial and temporal locality for both kinds of lists during the linear array traversal.
    771 With address look-ahead, the hardware does an excellent job of managing the multi-level cache.
    772 Hence, performance is largely constant for both kinds of lists, until cache and NUMA boundaries are crossed for longer lists and the costs increase consistently for both kinds of lists.
    773 
    774 In detail, \VRef[Figure]{f:Shuffled} shows shuffle insertion of all the nodes and then linear removal, both in the same direction.
    775 As for linear, there are issues with the wrapped list and memory allocation.
    776 For intrusive lists, it is possible to link the nodes randomly, so consecutive nodes in memory seldom point at adjacent nodes.
    777 For wrapped lists, the placement of wrapped nodes is dictated by the memory allocator, and results in memory layout virtually the same as for linear, \ie the wrapped nodes are laid out consecutively in memory, but each wrapped node points at a randomly located external node.
    778 As seen on \VPageref{p:Shuffle}, the random access reads data from external node, which are located in random order.
    779 So while the wrapped nodes are accessed linearly, external nodes are touched randomly for both kinds of lists resulting in similar cache events.
    780 The slowdown of shuffled occurs as the experiment's length grows from last-level cache into main memory.
    781 % Insert and remove operations act on both sides of a link.
    782 %Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location.
     675\VRef[Figure]{fig:plot-list-zoomout} presents the speed measures at various list lengths.
     676STL's wrapped-reference list begins with operations on a length-1 list costing around 30 ns.
     677This time grows modetly as list length increases, apart from more drastic worsening at the largest lengths.
     678STL's performance is not affected by element order in memory.
     679The field of intrusive lists begins with length-1 operations costing around 10 ns and enjoys a ``sweet spot'' in lengths 10--100 of 5--7-ns operations.
     680This much is also unaffected by element order.
     681Beyond this point, shuffled-element list performance worsens drastically, losing to STL beyond about half a million elements, and never particularly leveling off.
     682In the same range, an unshuffled list sees some degradation, but holds onto a 1--2 $\times$ speedup over STL.
     683
     684The apparent intrusive ``sweet spot,'' particularly its better-than-length-1 speed, is not because of list operations truly running faster.
     685Rather, the worsening as length decreases reflects the per-operation share of harness overheads incurred at the outer-loop level.
     686Disabling the harness's ability to drive interleaving, even though the current scenario is using a ``never work in middle'' interleave, made this rise disappear.
     687Subsequent analyses use length-controlled relative performance when comparing intrusive implementations, making this curiosity disappear.
     688
     689In a wrapped-reference list, list nodes are allocated separately from the items put into the list.
     690Intrusive beats wrapped at the smaller lengths, and when shuffling is avoided, because intrusive avoids dynamic memory allocation for list nodes.
     691
     692The remaining big-swing comparison points say more about a computer's memory hierarchy than about linked lists.
     693The tests in this chapter are only inserting and removing.
     694They are not operating on any user payload data that is being listed.
     695The drastic differences at large list lengths reflect differences in link-field storage density and in correlation of link-field order to element order.
     696These differences are inherent to the two list models.
     697
     698The slowdown of shuffled intrusive occurs as the experiment's length grows from last-level cache, into main memory.
     699Insert and remove operations act on both sides of a link.
     700Both a next unlisted item to insert (found in the items' array, seen through the shuffling array), and a next listed item to remove (found by traversing list links), introduce a new user-item location.
    783701Each time a next item is processed, the memory access is a hop to a randomly far address.
    784 The target is unavailable in the cache and a slowdown results.
    785 Note, the external-data no-touch assumption is often unrealistic: decisions like,``Should I remove this item?'' need to look at the item.
    786 Therefore, under realistic assumptions, both intrusive and wrapped-reference lists suffer similar caching issues for very large lists.
    787 % In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a non-intrusive list may be preferred for lists of approximately the cache's size.
    788 
    789 The takeaway from this experiment is that wrapped-list operations are expensive because memory allocation is expense at this fine-grained level of execution.
    790 Hence, when possible, using intrusive links can produce a significant performance gain, even if nodes must be dynamically allocated, because the wrapping allocations are eliminated.
    791 Even when space is a consideration, intrusive links may not use any more storage if a node is mostly linked.
    792 Unfortunately, many programmers are unaware of intrusive lists for dynamically-sized data-structures.
    793 
    794 % Note, linear access may not be realistic unless dynamic size changes may occur;
    795 % if the nodes are known to be adjacent, use an array.
    796 
    797 % In a wrapped-reference list, list nodes are allocated separately from the items put into the list.
    798 % Intrusive beats wrapped at the smaller lengths, and when shuffling is avoided, because intrusive avoids dynamic memory allocation for list nodes.
    799 
    800 % STL's performance is not affected by element order in memory.
    801 %The field of intrusive lists begins with length-1 operations costing around 10 ns and enjoys a ``sweet spot'' in lengths 10--100 of 5--7-ns operations.
    802 % This much is also unaffected by element order.
    803 % Beyond this point, shuffled-element list performance worsens drastically, losing to STL beyond about half a million elements, and never particularly leveling off.
    804 % In the same range, an unshuffled list sees some degradation, but holds onto a 1--2 $\times$ speedup over STL.
    805 
    806 % The apparent intrusive ``sweet spot,'' particularly its better-than-length-1 speed, is not because of list operations truly running faster.
    807 % Rather, the worsening as length decreases reflects the per-operation share of harness overheads incurred at the outer-loop level.
    808 % Disabling the harness's ability to drive interleaving, even though the current scenario is using a ``never work in middle'' interleave, made this rise disappear.
    809 % Subsequent analyses use length-controlled relative performance when comparing intrusive implementations, making this curiosity disappear.
    810 
    811 % The remaining big-swing comparison points say more about a computer's memory hierarchy than about linked lists.
    812 % The tests in this chapter are only inserting and removing.
    813 % They are not operating on any user payload data that is being listed.
    814 % The drastic differences at large list lengths reflect differences in link-field storage density and in correlation of link-field order to element order.
    815 % These differences are inherent to the two list models.
    816 
    817 % A wrapped-reference list's separate nodes are allocated right beside each other in this experiment, because no other memory allocation action is happening.
    818 % As a result, the interlinked nodes of the STL list are generally referencing their immediate neighbours.
    819 % This pattern occurs regardless of user-item shuffling because this test's ``use'' of the user-items' array is limited to storing element addresses. 
    820 % This experiment, driving an STL list, is simply not touching the memory that holds the user data.
    821 % Because the interlinked nodes, being the only touched memory, are generally adjacent, this case too has high memory locality and stays fast.
    822 
    823 % But the comparison of unshuffled intrusive with wrapped-reference gives the performance of these two styles, with their the common impediment of overfilling the cache removed.
    824 % Intrusive consistently beats wrapped-reference by about 20 ns, at all sizes. 
    825 % This difference is appreciable below list length 0.5 M, and enormous below 10 K.
     702The target is not available in cache and a slowdown results.
     703
     704With the unshuffled intrusive list, each link connects to an adjacent location.  So, this case has high memory locality and stays fast.  But the unshuffled assumption is simply not realistic: if you know items are adjacent, you don't need a linked list.
     705
     706A wrapped-reference list's separate nodes are allocated right beside each other in this experiment, because no other memory allocation action is happening.
     707As a result, the interlinked nodes of the STL list are generally referenceing their immediate neighbours.
     708This pattern occurs regardless of user-item suffling because this test's ``use'' of the user-items' array is limited to storing element addresses. 
     709This experiment, driving an STL list, is simply not touching the memory that holds the user data.
     710Because the interlinked nodes, being the only touched memory, are generally adjacent, this case too has high memory locality and stays fast.
     711But the user-data no-touch assumption is often unrealistic: decisions like,``Should I remove this item?'' need to look at the item.
     712In an odd scenario where this intuition is incorrect, and where furthermore the program's total use of the memory allocator is sufficiently limited to yield approximately adjacent allocations for successive list insertions, a nonintrusive list may be preferred for lists of approximately the chache's size.
     713
     714Therefore, under clearly typical situational assumptions, both intrusive and wrapped-reference lists will suffer similarly from a large list overfilling the memory cache, experiencing degradation like shuffled intrusive shows here.
     715
     716But the comparison of unshuffled intrusive with wrapped-reference gives the peformance of these two styles, with their the common impediment of overfilling the cache removed.
     717Intrusive consistently beats wrapped-reference by about 20 ns, at all sizes. 
     718This difference is appreciable below list length 0.5 M, and enormous below 10 K.
    826719
    827720
    828721\section{Result: Comparing intrusive implementations}
    829 \label{s:ComparingIntrusiveImplementations}
    830 
    831 The preceding result shows that all the intrusive implementations examined have noteworthy performance compared to wrapped lists.
    832 This analysis picks the list region 10-150 and zooms-in to differentiate among the different intrusive implementations.
    833 The data is selected from the start of \VRef[Figure]{f:Linear}, but the start of \VRef[Figure]{f:Shuffled} would be largely the same.
     722
     723The preceding result shows that intrusive implementations have noteworthy performance differences below 150 nodes.
     724This analysis zooms in on this area and identifies the participants.
    834725
    835726\begin{figure}
    836727\centering
    837   \subfloat[Absolute Time]{\label{f:AbsoluteTime}
    838   \includegraphics{plot-list-zoomin-abs.pdf}
    839   } % subfigure
    840   \\
    841   \subfloat[Relative Time]{\label{f:RelativeTime}
    842   \includegraphics{plot-list-zoomin-rel.pdf}
    843   } % subfigure
     728  \begin{tabular}{c}
     729  \includegraphics{plot-list-zoomin-abs.pdf} \\
     730  (a) \\
     731  \includegraphics{plot-list-zoomin-rel.pdf} \\
     732  (b) \\
     733  \end{tabular}
    844734  \caption{Operation duration \vs list length at small-medium lengths.  One example operation is shown: stack movement, insert-first polarity and head-mediated access.  (a) has absolute times.  (b) has times relative to those of LQ-\lstinline{tailq}.}
    845735  \label{fig:plot-list-zoomin}
    846736\end{figure}
    847737
    848 \VRef[Figure]{fig:plot-list-zoomin} shows the zone from 10--150 blown up, both in absolute and relative time.
    849 % The same scenario as the coarse comparison is used: a stack, with insertions and removals happening at the end called ``first,'' ``head'' or ``front,'' and all changes occurring through a head-provided insert/remove operation.
     738In \VRef{fig:plot-list-zoomin} part (a) shows exactly this zoom-in.
     739The same scenario as the coarse comparison is used: a stack, with insertions and removals happening at the end called ``first,'' ``head'' or ``front,'' and all changes occuring through a head-provided insert/remove operation.
    850740The error bars show fastest and slowest time seen on five trials, and the central point is the mean of the remaining three trials.
    851 For readability, the points are slightly staggered at a given horizontal value, where the points might otherwise appear on top of each other.
    852 
    853 While preparing experiment results, I first tested on my old office PC, AMD FX-8370E Eight-Core, before switching to the large new server for final testing.
    854 For this experiment, the results flipped in my favour when running on the server.
    855 New CPU architectures are now amazingly good at branch prediction and micro-parallelism in the pipelines.
    856 Specifically, on the PC, my \CFA and companion \uCpp lists are slower than lq-tail and lq-list by 10\% to 20\%.
    857 On the server, \CFA and \uCpp lists are can be fast by up to 100\%.
    858 Overall, LQ-tailq does the best at short lengths but loses out above a dozen elements.
    859 
    860 \VRef[Figure]{f:RelativeTime} shows the percentage difference by treating @tailq@ as the control benchmark, and showing the other list results relative to it.
     741For readability, the frameworks are slightly staggered in the horizontal, but all trials near a given size were run at the same size.
     742
     743For this particular operation, uC++ fares the worst, followed by \CFA, then LQ's @tailq@.
     744Its @list@ does the best at smaller lengths but loses its edge above a dozen elements.
     745
     746Moving toward being able to consider several scenarios, \VRef{fig:plot-list-zoomin} part (b) shows the same result, adjusted to treat @tailq@ as a benchmark, and expressing all the results relative to it.
    861747This change does not affect the who-wins statements, it just removes the ``sweet spot'' bend that the earlier discussion dismissed as incidental.
    862 Runs faster than @tailq@'s are below the zero and slower runs are above;
    863 @tailq@'s mean is always zero by definition, but its error bars, representing a single scenario's re-run stability, are still meaningful.
     748Runs faster than @tailq@'s are below the zero and slower runs are above; @tailq@'s mean is always zero by definition, but its error bars, representing a single scenario's re-run stability, are still meaningful.
    864749With this bend straightened out, aggregating across lengths is feasible.
    865750
    866751\begin{figure}
    867752\centering
    868   \subfloat[Supersets]{\label{f:Superset}
    869   \includegraphics{plot-list-cmp-exout.pdf}
    870   } % subfigure
    871   \\
    872   \subfloat[1st Level Slice]{\label{f:1stLevelSlice}
    873   \includegraphics{plot-list-cmp-survey.pdf}
    874   } % subfigure
     753  \begin{tabular}{c}
     754  \includegraphics{plot-list-cmp-exout.pdf} \\
     755  (a) \\
     756  \includegraphics{plot-list-cmp-survey.pdf} \\
     757  (b) \\
     758  \end{tabular}
    875759  \caption{Operation duration ranges across operational scenarios.  (a) has the supersets of the running example operation.  (b) has the first-level slices of the full space of operations.}
    876760  \label{fig:plot-list-cmp-overall}
    877761\end{figure}
    878762
    879 \VRef[Figure]{fig:plot-list-cmp-overall} introduces alternative views of the data.
     763\VRef{fig:plot-list-cmp-overall} introduces the resulting format.
    880764Part (a)'s first column summarizes all the data of \VRef{fig:plot-list-zoomin}-(b).
    881765Its x-axis label, ``stack/insfirst/allhead,'' names the concrete scenario that has been discussed until now.
    882766Moving across the columns, the next three each stretch to include more scenarios on each of the operation dimensions, one at a time.
    883767The second column considers the scenarios $\{\mathrm{stack}\} \times \{\mathrm{insfirst}\} \times \{\mathrm{allhead}, \mathrm{inselem}, \mathrm{remelem}\}$,
    884 while the third stretches polarity and the fourth stretches accessor.
     768while the third stretches polarity and the fourth streches accessor.
    885769Then next three columns each stretch two scenario dimensions and the last column stretches all three.
    886770The \CFA bar in the last column is summarizing 840 test-program runs: 14 list lengths, 2 movements, 2 polarities, 3 accessors and 5 repetitions.
     
    894778
    895779The chosen benchmark of LQ-@tailq@ is not shown in this format because it would be trivial here.
    896 With iter-scenario differences dominating the bar size, and @tailq@'s mean performance defined to be zero in all scenarios, a @tailq@ bar on this plot would only show @tailq@'s re-run stability, which is of no comparison value.
     780With iter-scenario differences dominating the bar size, and @tailq@'s mean performance defined to be zero in all scenarios, a @tailq@ bar on this plot would only show @tailq@'s re-run stabiity, which is of no comparison value.
    897781
    898782The LQ-@list@ implementation does not support all scenarios, only stack movement with insert-first polarity.
     
    921805% \end{figure}
    922806
    923 
    924807\section{Result: CFA cost attribution}
    925808
    926 This comparison loosely itemizes the reasons that the \CFA implementation runs 15--20\% slower than LQ.
    927 Each reason provides for safer programming.
    928 For each reason, a version of the \CFA list was measured that forgoes its safety and regains some performance.
    929 These potential sacrifices are:
     809This comparison loosely itemizes the reasons that the \CFA implementation runs 15--20\% slower than LQ.  Each reason provides for safer programming.  For each reaon, a version of the \CFA list was measured that forgoes its safety and regains some performance.  These potential sacrifices are:
    930810\newcommand{\mandhead}{\emph{mand-head}}
    931811\newcommand{\nolisted}{\emph{no-listed}}
     
    934814\item[mand(atory)-head] Removing support for headless lists.
    935815  A specific explanation of why headless support causes a slowdown is not offered.
    936   But it is reasonable for a cost to result from making one piece of code handle multiple cases; the subset of the \CFA list API that applies to headless lists shares its implementation with headed lists.
     816  But it is reasonable for a cost to result from making one pieceof code handle multiple cases; the subset of the \CFA list API that applies to headless lists shares its implementation with headed lists.
    937817  In the \mandhead case, disabling the feature in \CFA means using an older version of the implementation, from before headless support was added.
    938818  In the pre-headless library, trying to form a headless list (instructing, ``Insert loose element B after loose element A,'') is a checked runtime error.
     
    942822        For \lstinline{list}, this usage causes an ``uncaught'' runtime crash.}.
    943823\item[no-listed] Removing support for the @is_listed@ API query.
    944   Along with it goes error checking such as ``When inserting an element, it must not already be listed, \ie be referred to from somewhere else.''
     824  Along with it goes error checking such as ``When instering an element, it must not already be listed, \ie be referred to from somewhere else.''
    945825  These abilities have a cost because, in order to support them, a listed element that is being removed must be written to, to record its change in state.
    946826  In \CFA's representation, this cost is two pointer writes.
     
    955835  Without this termination marking, repeated requests for a next valid item will always provide a positive response; when it should be negative, the indicated next element is garbage data at an address unlikely to trigger a memory error.
    956836  LQ has a well-terminating iteration for listed elements.
    957   In the \noiter case, the slowdown is not inherent; it represents a \CFA optimization opportunity.
     837  In the \noiter case, the slowdown is not inherent; it represents a \CFA optimization opporunity.
    958838\end{description}
    959839\MLB{Ensure benefits are discussed earlier and cross-reference}  % an LQ programmer must know not to ask, ``Who's next?'' about an unlisted element; an LQ programmer cannot write assertions about an item being listed; LQ requiring a head parameter is an opportunity for the user to provide inconsistent data
     
    991871For element removals, \attribParity is the heavy hitter, with \noiter contributing modestly.
    992872
    993 The counterproductive shift outside of element removals is likely due to optimization done in the \attribFull version after implementing headless support, \ie not present in the \mandhead version.
     873The couterproductive shift outside of element removals is likely due to optimization done in the \attribFull version after implementing headless support, \ie not present in the \mandhead version.
    994874This work streamlined both head-based operations (head-based removal being half the work of the element-insertion test).
    995875This improvement could be ported to a \mandhead-style implementation, which would bring down the \attribParity time in these cases.
     
    1000880
    1001881\VRef[Figure]{fig:plot-list-cfa-attrib}-(b) addresses element removal being the overall \CFA slow spot and element removal having a peculiar shape in the (a) analysis.
    1002 Here, the \attribParity sacrifice bundle is broken out into its two constituents.
     882Here, the \attribParity sacrifice bundle is broken out into its two consituents.
    1003883The result is the same regardless of the operation.
    1004884All three individual sacrifices contribute noteworthy improvements (\nolisted slightly less).
     
    1011891The \noiter speed improvement would bring \CFA to +5\% of LQ overall, and from high twenties to high teens, in the worst case of element removal.
    1012892
    1013 Ultimately, this analysis provides options for a future effort that needs to get the most speed out of the \CFA list.
     893Utimately, this analysis provides options for a future effort that needs to get the most speed out of the \CFA list.
    1014894
    1015895
  • doc/theses/mike_brooks_MMath/plots/list-cfa-attrib-remelem.gp

    r81e1984b r58a4cde  
    3232   "-30%%" 0.769230769, \
    3333)
    34 set ylabel "Duration (tailq-Relative)" offset -1.0,0
     34set ylabel "Duration (tailq-Relative)"
    3535
    3636# Draw axis-like line at speedup=0% (y=1.0)
  • doc/theses/mike_brooks_MMath/plots/list-cfa-attrib.gp

    r81e1984b r58a4cde  
    2525   "-30%%" 0.769230769, \
    2626)
    27 set ylabel "Duration (tailq-Relative)" offset -1.0,0
     27set ylabel "Duration (tailq-Relative)"
    2828
    2929# Draw axis-like line at speedup=0% (y=1.0)
  • doc/theses/mike_brooks_MMath/plots/list-cmp-exout.gp

    r81e1984b r58a4cde  
    2525   "-30%%" 0.769230769, \
    2626)
    27 set ylabel "Duration (tailq-Relative)" offset -1.0,0
     27set ylabel "Duration (tailq-Relative)"
    2828
    2929# Draw axis-like line at speedup=0% (y=1.0)
  • doc/theses/mike_brooks_MMath/plots/list-cmp-survey.gp

    r81e1984b r58a4cde  
    2525   "-30%%" 0.769230769, \
    2626)
    27 set ylabel "Duration (tailq-Relative)" offset -1.0,0
     27set ylabel "Duration (tailq-Relative)"
    2828
    2929# Draw axis-like line at speedup=0% (y=1.0)
  • doc/theses/mike_brooks_MMath/plots/list-zoomin-abs.gp

    r81e1984b r58a4cde  
    1515set xrange [0.75:128];
    1616set xlabel "List length (item count)" offset 2,0
    17 set ylabel "Duration (ns)" offset -1.0,0
     17set ylabel "Duration (ns)"
    1818set errorbars 2.0
    1919
  • doc/theses/mike_brooks_MMath/plots/list-zoomin-rel.gp

    r81e1984b r58a4cde  
    2424set xrange [0.75:128];
    2525set xlabel "List length (item count)" offset 2,0
    26 set ylabel "Duration (tailq-Relative)" offset -1.0,0
     26set ylabel "Duration (tailq-Relative)"
    2727set errorbars 2.0
    2828
  • doc/theses/mike_brooks_MMath/plots/list-zoomout-noshuf.gp

    r81e1984b r58a4cde  
    1010set key top left
    1111set logscale x
    12 set xrange [10:*]
    13 #set logscale y
    14 #set yrange [32:*];
     12set logscale y
     13set yrange [4:200];
    1514set xlabel "List length (item count)" offset 2,0
    1615set ylabel "Duration (ns)"
  • doc/theses/mike_brooks_MMath/plots/list-zoomout-shuf.gp

    r81e1984b r58a4cde  
    1010set key top left
    1111set logscale x
    12 set xrange [10:*]
    13 #set logscale y
    14 #set yrange [32:200]
     12set logscale y
     13set yrange [4:200];
    1514set xlabel "List length (item count)" offset 2,0
    1615set ylabel "Duration (ns)"
  • doc/theses/mike_brooks_MMath/programs/lst-features-intro.run.cfa

    r81e1984b r58a4cde  
    2020struct req {
    2121        int pri, rqr;
    22         @inline dlink(req);@    // containment inheritance, fields hoisted into structure
     22        inline dlink(req);              // containment inheritance, fields hoisted into structure
    2323};
    2424
  • doc/theses/mike_brooks_MMath/programs/lst-features-multidir.run.cfa

    r81e1984b r58a4cde  
    4242
    4343
    44 dlist(req, req.by_pri) reqs_pri;
     44dlist(req, req.by_pri) reqs_pri_global;
    4545dlist(req, req.by_rqr) reqs_rqr_42;
    4646dlist(req, req.by_rqr) reqs_rqr_17;
     
    5555        r99a = {3, 99};
    5656
    57 insert_first(reqs_pri, r17c);
    58 insert_first(reqs_pri, r99a);
    59 insert_first(reqs_pri, r17b);
    60 insert_first(reqs_pri, r42b);
    61 insert_first(reqs_pri, r17a);
    62 insert_first(reqs_pri, r42a);
     57insert_first(reqs_pri_global, r17c);
     58insert_first(reqs_pri_global, r99a);
     59insert_first(reqs_pri_global, r17b);
     60insert_first(reqs_pri_global, r42b);
     61insert_first(reqs_pri_global, r17a);
     62insert_first(reqs_pri_global, r42a);
    6363
    6464insert_first(reqs_rqr_42, r42b);
     
    7979
    8080with(DLINK_VIA(req, req.by_pri)) {
    81         while( req & cur = iter( reqs_pri ); advance( cur ) )
     81        while( req & cur = iter( reqs_pri_global ); advance( cur ) )
    8282                printf("{%d %d} ", cur.pri, cur.rqr);
    8383        printf("| ");
  • doc/theses/mike_brooks_MMath/uw-ethesis.bib

    r81e1984b r58a4cde  
    3131
    3232@misc{cxx:raii-abi,
    33     key         = {Itanium},
    3433    title       = {Itanium C++ ABI},
    3534    howpublished= {\url{https://itanium-cxx-abi.github.io/cxx-abi/abi.html}},
  • doc/theses/mike_brooks_MMath/uw-ethesis.tex

    r81e1984b r58a4cde  
    9090\newcommand{\cmark}{\ding{51}}
    9191\newcommand{\xmark}{\ding{55}}
     92\usepackage[labelformat=simple,aboveskip=0pt,farskip=0pt,font=normalsize]{subfig}
    9293\usepackage{multirow}
    93 \usepackage[labelformat=simple,aboveskip=0pt,farskip=0pt,font=normalsize]{subfig}
    9494\renewcommand\thesubfigure{(\alph{subfigure})}
    9595
     
    115115
    116116\newcommand{\uCpp}{$\mu$\CC}
    117 \newcommand{\PAB}[1]{{\color{magenta}PAB: #1}}
     117\newcommand{\PAB}[1]{{\color{red}PAB: #1}}
    118118\newcommand{\MLB}[1]{{\color{red}MLB: #1}}
    119119
     
    156156\urlstyle{sf}
    157157
    158 \setcounter{secnumdepth}{4}     % number subsubsection
    159 \setcounter{tocdepth}{4} % subsubsection in TOC
    160 
    161158%\usepackage[automake,toc,abbreviations]{glossaries-extra} % Exception to the rule of hyperref being the last add-on package
    162159% If glossaries-extra is not in your LaTeX distribution, get it from CTAN (http://ctan.org/pkg/glossaries-extra),
     
    182179\setlength{\evensidemargin}{0.125in} % Adds 1/8 in. to binding side of all
    183180% even-numbered pages when the "twoside" printing option is selected
    184 %\setlength{\oddsidemargin}{0.125in} % Adds 1/8 in. to the left of all pages when "oneside" printing is selected, and to the left of all odd-numbered pages when "twoside" printing is selected
    185 %\setlength{\textwidth}{6.375in} % assuming US letter paper (8.5 in. x 11 in.) and side margins as above
     181\setlength{\oddsidemargin}{0.125in} % Adds 1/8 in. to the left of all pages when "oneside" printing is selected, and to the left of all odd-numbered pages when "twoside" printing is selected
     182\setlength{\textwidth}{6.375in} % assuming US letter paper (8.5 in. x 11 in.) and side margins as above
    186183\raggedbottom
    187184
  • libcfa/src/collections/list.hfa

    r81e1984b r58a4cde  
    1010// Created On       : Wed Apr 22 18:00:00 2020
    1111// Last Modified By : Peter A. Buhr
    12 // Last Modified On : Thu Aug  7 10:46:08 2025
    13 // Update Count     : 75
     12// Last Modified On : Thu Apr 24 18:12:59 2025
     13// Update Count     : 72
    1414//
    1515
     
    154154
    155155
     156
     157
     158
    156159forall( tE & ) {
    157160        struct dlink{
    158                 tE * next, * prev;
     161                tE * next;
     162                tE * prev;
    159163        };
    160164
     
    165169        }
    166170
    167         forall( tLinks & = dlink( tE ) ) {
    168                 struct dlist {
    169                         inline dlink( tE );
    170                 };
    171 
    172                 forall( | embedded( tE, tLinks, dlink( tE ) ) ) {
    173                         static inline tE * $get_list_origin_addr( dlist( tE, tLinks ) & list ) {
    174                                 dlink( tE ) & link_from_null = (*(tE *)0p)`inner;
    175                                 ptrdiff_t link_offset = (ptrdiff_t)&link_from_null;
    176                                 size_t origin_addr = ((size_t)&list) - link_offset;
    177                                 size_t preResult = ORIGIN_TAG_SET( origin_addr );
    178                                 return (tE *)preResult;
    179                         }
    180 
    181                         static inline void ?{}( dlist( tE, tLinks ) & this ) {
    182                                 tE * listOrigin = $get_list_origin_addr( this );
    183                                 ((dlink( tE ) &)this){ listOrigin, listOrigin };
    184                         }
     171        forall( tLinks & = dlink( tE ) )
     172        struct dlist {
     173                inline dlink( tE );
     174        };
     175
     176        forall( tLinks & | embedded( tE, tLinks, dlink( tE ) ) ) {
     177                static inline tE * $get_list_origin_addr( dlist( tE, tLinks ) & list ) {
     178                        dlink( tE ) & link_from_null = (*(tE *)0p)`inner;
     179                        ptrdiff_t link_offset = (ptrdiff_t)&link_from_null;
     180                        size_t origin_addr = ((size_t)&list) - link_offset;
     181                        size_t preResult = ORIGIN_TAG_SET( origin_addr );
     182                        return (tE *)preResult;
     183                }
     184
     185                static inline void ?{}( dlist( tE, tLinks ) & this ) {
     186                        tE * listOrigin = $get_list_origin_addr( this );
     187                        ((dlink( tE ) &)this){ listOrigin, listOrigin };
    185188                }
    186189        }
Note: See TracChangeset for help on using the changeset viewer.