| 1 | \chapter{Introduction}
 | 
|---|
| 2 | 
 | 
|---|
| 3 | All modern programming languages provide three high-level containers (collections): array, linked-list, and string.
 | 
|---|
| 4 | Often array is part of the programming language, while linked-list is built from (recursive) pointer types, and string from a combination of array and linked-list.
 | 
|---|
| 5 | For all three types, languages supply varying degrees of high-level mechanism for manipulating these objects at the bulk level and at the component level, such as array copy, slicing and iterating using subscripting.
 | 
|---|
| 6 | 
 | 
|---|
| 7 | This work looks at extending these three foundational container types in the programming language \CFA, which is a new dialect of the C programming language.
 | 
|---|
| 8 | A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition.
 | 
|---|
| 9 | This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base.
 | 
|---|
| 10 | An equally important goal is balancing good performance with safety.
 | 
|---|
| 11 | 
 | 
|---|
| 12 | 
 | 
|---|
| 13 | \section{Array}
 | 
|---|
| 14 | 
 | 
|---|
| 15 | An array provides a homogeneous container with $O(1)$ access to elements using subscripting.
 | 
|---|
| 16 | Higher-level operations like array slicing (single or multidimensional) may have significantly higher cost, but provides a better programmer experience.
 | 
|---|
| 17 | The array size can be static, dynamic but fixed after creation, or dynamic and variable after creation.
 | 
|---|
| 18 | For static and dynamic-fixed, an array can be stack allocated, while dynamic-variable requires the heap.
 | 
|---|
| 19 | Because array layout has contiguous components, subscripting is a computation (some form of pointer arithmetic).
 | 
|---|
| 20 | 
 | 
|---|
| 21 | C provides an array as a language feature.
 | 
|---|
| 22 | 
 | 
|---|
| 23 | \section{Linked list}
 | 
|---|
| 24 | 
 | 
|---|
| 25 | A linked-list provides a homogeneous container often with $O(log N)$ or $O(N)$ access to elements using successor and predecessor operations that normally involve pointer chasing.
 | 
|---|
| 26 | Subscripting by value is sometimes available, \eg hash table.
 | 
|---|
| 27 | Linked types are normally dynamically sized by adding/removing nodes using link fields internal or external to the elements (nodes).
 | 
|---|
| 28 | If a programming language allows pointer to stack storage, linked-list types can be allocated on the stack;
 | 
|---|
| 29 | otherwise, elements are heap allocated with explicitly/implicitly managed.
 | 
|---|
| 30 | 
 | 
|---|
| 31 | C does not provide a linked-list, either as an obvious commonly-accepted library, nor as an outright language feature.
 | 
|---|
| 32 | C programmers commonly build linked-list behaviours into their bespoke data structures, directly using its language feature of pointers.
 | 
|---|
| 33 | 
 | 
|---|
| 34 | \section{String}
 | 
|---|
| 35 | 
 | 
|---|
| 36 | A string provides a dynamic array of homogeneous elements, where the elements are (often) some form of human-readable characters.
 | 
|---|
| 37 | What differentiates a string from other types in that many of its operations work on groups of elements for scanning and changing, \eg @index@ and @substr@;
 | 
|---|
| 38 | subscripting individual elements is usually available, too.
 | 
|---|
| 39 | Therefore, the cost of a string operation is usually less important than the power of the operation to accomplish complex text manipulation, \eg search, analysing, composing, and decomposing.
 | 
|---|
| 40 | The dynamic nature of a string means storage is normally heap allocated but often implicitly managed, even in unmanaged languages.
 | 
|---|
| 41 | In some cases, string management is separate from heap management, \ie strings roll their own heap.
 | 
|---|
| 42 | 
 | 
|---|
| 43 | The C string type is a user-managed allocation of the language-provided character array,
 | 
|---|
| 44 | plus the convention of marking its (variable) length by placing the 0-valued control character at the end (``null-terminated'').
 | 
|---|
| 45 | The C standard library includes many operations for working on this representation.
 | 
|---|
| 46 | 
 | 
|---|
| 47 | \section{Iterator}
 | 
|---|
| 48 | 
 | 
|---|
| 49 | As a side issue to working with complex structured types is iterating through them.
 | 
|---|
| 50 | Some thought has been given to \emph{general} versus specific iteration capabilities as part of of this work, but the general iteration work is only a sketch for others as future work.
 | 
|---|
| 51 | Nevertheless, sufficed work was done to write out the ideas that developed and how they should apply in the main context of this work. 
 | 
|---|
| 52 | 
 | 
|---|
| 53 | 
 | 
|---|
| 54 | \section{Motivation}
 | 
|---|
| 55 | 
 | 
|---|
| 56 | The primary motivation for this work is two fold:
 | 
|---|
| 57 | \begin{enumerate}[leftmargin=*]
 | 
|---|
| 58 | \item
 | 
|---|
| 59 | These three aspects of C are difficult to understand, teach, and get right because they are correspondingly low level.
 | 
|---|
| 60 | Providing higher-level, feature-rich versions of these containers in \CFA is a major component of the primary goal.
 | 
|---|
| 61 | \item
 | 
|---|
| 62 | These three aspects of C cause the greatest safety issues because there are few or no safe guards when a programmer misunderstands or misuses these features~\cite{Elliott18, Blache19, Ruef19, Oorschot23}.
 | 
|---|
| 63 | Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occur in C resulting from errors using these facilities (memory errors), providing the major hacker attack-vectors.
 | 
|---|
| 64 | \end{enumerate}
 | 
|---|
| 65 | Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to move away from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these older languages.
 | 
|---|
| 66 | Hardening these three types goes a long way to make the majority of C programs safer.
 | 
|---|
| 67 | 
 | 
|---|
| 68 | 
 | 
|---|
| 69 | While multiple new languages purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the new runtime uses garage collection.
 | 
|---|
| 70 | Furthermore, these languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication.
 | 
|---|
| 71 | Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning \CC.
 | 
|---|
| 72 | Hence, rewriting and retraining costs for these languages can be prohibitive for companies with a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia).
 | 
|---|
| 73 | 
 | 
|---|
| 74 | 
 | 
|---|
| 75 | \subsection{C?}
 | 
|---|
| 76 | 
 | 
|---|
| 77 | Like many established programming languages, C has a standards committee and multiple ANSI/\-ISO language manuals~\cite{C99,C11,C18,C23}.
 | 
|---|
| 78 | However, most programming languages are only partially explained by their standard's manual(s).
 | 
|---|
| 79 | When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}.
 | 
|---|
| 80 | Often other C compilers must mimic @gcc@ because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features.
 | 
|---|
| 81 | While some key aspects of C need to be explained and understood by quoting from the language reference manual, to illustrate actual program semantics, this thesis uses constructs a program whose behaviour exercises a particular point and then confirms the behaviour by running the program.
 | 
|---|
| 82 | These example programs show
 | 
|---|
| 83 | \begin{itemize}[leftmargin=*]
 | 
|---|
| 84 |         \item if the compiler accepts or rejects certain syntax,
 | 
|---|
| 85 |         \item prints output to buttress a behavioural claim,
 | 
|---|
| 86 |         \item or executes without triggering any embedded assertions testing pre/post-assertions or invariants.
 | 
|---|
| 87 | \end{itemize}
 | 
|---|
| 88 | This work has been tested across @gcc@ versions 8--14 and clang versions 10--14 running on ARM, AMD, and Intel architectures.
 | 
|---|
| 89 | Any discovered anomalies among compilers or versions is discussed.
 | 
|---|
| 90 | In all case, it is never clear whether the \emph{truth} lies in the compiler(s) or the C standard.
 | 
|---|
| 91 | 
 | 
|---|
| 92 | 
 | 
|---|
| 93 | \section{Contributions}
 | 
|---|
| 94 | 
 | 
|---|
| 95 | This work has produced significant syntactic and semantic improvements to C's arrays, linked-lists and string types.
 | 
|---|
| 96 | As well, a strong plan for general iteration has been sketched out.
 | 
|---|
| 97 | 
 | 
|---|
| 98 | This thesis mainly describes improvements made to the source code of the \CFA compiler and runtime system,
 | 
|---|
| 99 | available at TODO.
 | 
|---|
| 100 | 
 | 
|---|
| 101 | This work often leverages preexisting    xxxxxxxxxx
 | 
|---|
| 102 | 
 | 
|---|
| 103 | \subsection{Linked list}
 | 
|---|
| 104 | 
 | 
|---|
| 105 | The linked list is a new runtime library, available as @#include <dlist.hfa>@.
 | 
|---|
| 106 | The library offers a novel combination of the preexisting \CFA features of:
 | 
|---|
| 107 | references, inline inheritance, polymorphism and declaration scoping.
 | 
|---|
| 108 | A programmer's experience of the list datatype within the type system is novel.
 | 
|---|
| 109 | The list's runtime representation follows the established design pattern of intrusion.
 | 
|---|
| 110 | This thesis includes a performance evaluation that shows
 | 
|---|
| 111 | the list performing comparably with other C-family intrusive lists,
 | 
|---|
| 112 | and notably better than the \CC standard list.
 | 
|---|
| 113 | 
 | 
|---|
| 114 | \subsection{Array}
 | 
|---|
| 115 | 
 | 
|---|
| 116 | This work's array improvements are
 | 
|---|
| 117 | \begin{enumerate}[leftmargin=*]
 | 
|---|
| 118 | \item
 | 
|---|
| 119 | subtle changes to the typing rules for the C array,
 | 
|---|
| 120 | \item
 | 
|---|
| 121 | a new \CFA language construct for a type-system managed array length, and
 | 
|---|
| 122 | \item
 | 
|---|
| 123 | a new standard-library array type, available as @#include <array.hfa>@.
 | 
|---|
| 124 | \end{enumerate}
 | 
|---|
| 125 | The new array type, enabled by the earlier features, defines an array with
 | 
|---|
| 126 | guaranteed runtime bound checks (often optimizer-removable) and
 | 
|---|
| 127 | implicit (guaranteed accurate) inter-function length communication.
 | 
|---|
| 128 | Leveraging the preexisting \CFA type system to achieve
 | 
|---|
| 129 | this length-related type checking is an original contribution.
 | 
|---|
| 130 | 
 | 
|---|
| 131 | Furthermore, the new array incorporates multidimensional capabilities
 | 
|---|
| 132 | typical of scientific/machine-learning languages,
 | 
|---|
| 133 | made to coexist with the C ``raw-memory-aware'' tradition in a novel way.
 | 
|---|
| 134 | 
 | 
|---|
| 135 | \subsection{String}
 | 
|---|
| 136 | 
 | 
|---|
| 137 | The string is a new runtime library, available as @#include <string.hfa>@.
 | 
|---|
| 138 | The library offers a type whose basic usage is comparable to the \CC @string@ type,
 | 
|---|
| 139 | including analogous coexistence with raw character pointers.
 | 
|---|
| 140 | Its implementation, however, follows different princliples,
 | 
|---|
| 141 | enabling programs to work with strings by value, without incurring excessive copying.
 | 
|---|
| 142 | Mutability is retained.
 | 
|---|
| 143 | Substrings are supported, including the ability for overlapping ranges to share edits transparently.
 | 
|---|
| 144 | Ultimately, this implementation is a case of strings rolling their own heap.
 | 
|---|
| 145 | 
 | 
|---|
| 146 | The string library includes writing and reading strings via the preexsiting \CFA I/O stream library.
 | 
|---|
| 147 | Enabling transparent reading (of unknown length into a managed allocation) included revision of
 | 
|---|
| 148 | the stream libarary's existing handling of character arrays.
 | 
|---|
| 149 | 
 | 
|---|
| 150 | \subsection{Iterator}
 | 
|---|
| 151 | 
 | 
|---|
| 152 | Some prototyping, available in the \CFA standard library,
 | 
|---|
| 153 | addresses the infamous \CC bug source of iterator invalidation.
 | 
|---|
| 154 | A family of iterator styles is presented, with cheap iterators that resist being misused,
 | 
|---|
| 155 | plus full-featured iterators that observe modifications without becoming invalid.
 | 
|---|
| 156 | 
 | 
|---|
| 157 | Further design within this thesis presents a home for Liskov-style iterators in \CFA.
 | 
|---|
| 158 | This design extends a preexisting proposal to adapt the \CFA for-each style loop to be more user-pluggable.
 | 
|---|
| 159 | It also builds upon preexisting \CFA coroutines.
 | 
|---|
| 160 | It simplifies the work a programmer must do to leverage the suspended-state abstraction.
 | 
|---|