[27f1055] | 1 | \chapter{Introduction}
|
---|
| 2 |
|
---|
[16915b1] | 3 | All modern programming languages provide three high-level containers (collections): array, linked-list, and string.
|
---|
| 4 | Often array is part of the programming language, while linked-list is built from (recursive) pointer types, and string from a combination of array and linked-list.
|
---|
[7d415b4] | 5 | For all three types, languages supply varying degrees of high-level mechanism for manipulating these objects at the bulk level and at the component level, such as array copy, slicing and iterating using subscripting.
|
---|
[16915b1] | 6 |
|
---|
| 7 | This work looks at extending these three foundational container types in the programming language \CFA, which is a new dialect of the C programming language.
|
---|
| 8 | A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition.
|
---|
| 9 | This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base.
|
---|
[7d415b4] | 10 | An equally important goal is balancing good performance with safety.
|
---|
[bdc8591] | 11 |
|
---|
[19a2890] | 12 |
|
---|
[bdc8591] | 13 | \section{Array}
|
---|
| 14 |
|
---|
[16915b1] | 15 | An array provides a homogeneous container with $O(1)$ access to elements using subscripting.
|
---|
[7d415b4] | 16 | Higher-level operations like array slicing (single or multidimensional) may have significantly higher cost, but provides a better programmer experience.
|
---|
[bdc8591] | 17 | The array size can be static, dynamic but fixed after creation, or dynamic and variable after creation.
|
---|
| 18 | For static and dynamic-fixed, an array can be stack allocated, while dynamic-variable requires the heap.
|
---|
[16915b1] | 19 | Because array layout has contiguous components, subscripting is a computation (some form of pointer arithmetic).
|
---|
[bdc8591] | 20 |
|
---|
[02ea981] | 21 | C provides an array as a language feature.
|
---|
[bdc8591] | 22 |
|
---|
[1e110bf] | 23 | \section{Linked list}
|
---|
[bdc8591] | 24 |
|
---|
[02ea981] | 25 | A linked-list provides a homogeneous container often with $O(log N)$ or $O(N)$ access to elements using successor and predecessor operations that normally involve pointer chasing.
|
---|
[051aec4] | 26 | Subscripting by value is sometimes available, \eg hash table.
|
---|
| 27 | Linked types are normally dynamically sized by adding/removing nodes using link fields internal or external to the elements (nodes).
|
---|
| 28 | If a programming language allows pointer to stack storage, linked-list types can be allocated on the stack;
|
---|
[16915b1] | 29 | otherwise, elements are heap allocated with explicitly/implicitly managed.
|
---|
[bdc8591] | 30 |
|
---|
[02ea981] | 31 | C does not provide a linked-list, either as an obvious commonly-accepted library, nor as an outright language feature.
|
---|
| 32 | C programmers commonly build linked-list behaviours into their bespoke data structures, directly using its language feature of pointers.
|
---|
[bdc8591] | 33 |
|
---|
| 34 | \section{String}
|
---|
| 35 |
|
---|
[7d415b4] | 36 | A string provides a dynamic array of homogeneous elements, where the elements are (often) some form of human-readable characters.
|
---|
| 37 | What differentiates a string from other types in that many of its operations work on groups of elements for scanning and changing, \eg @index@ and @substr@;
|
---|
| 38 | subscripting individual elements is usually available, too.
|
---|
| 39 | Therefore, the cost of a string operation is usually less important than the power of the operation to accomplish complex text manipulation, \eg search, analysing, composing, and decomposing.
|
---|
[c721105] | 40 | The dynamic nature of a string means storage is normally heap allocated but often implicitly managed, even in unmanaged languages.
|
---|
[7d415b4] | 41 | In some cases, string management is separate from heap management, \ie strings roll their own heap.
|
---|
| 42 |
|
---|
[02ea981] | 43 | The C string type is a user-managed allocation of the language-provided character array,
|
---|
| 44 | plus the convention of marking its (variable) length by placing the 0-valued control character at the end (``null-terminated'').
|
---|
| 45 | The C standard library includes many operations for working on this representation.
|
---|
[7d415b4] | 46 |
|
---|
| 47 | \section{Iterator}
|
---|
| 48 |
|
---|
| 49 | As a side issue to working with complex structured types is iterating through them.
|
---|
| 50 | Some thought has been given to \emph{general} versus specific iteration capabilities as part of of this work, but the general iteration work is only a sketch for others as future work.
|
---|
| 51 | Nevertheless, sufficed work was done to write out the ideas that developed and how they should apply in the main context of this work.
|
---|
[bdc8591] | 52 |
|
---|
| 53 |
|
---|
[6a8c773] | 54 | \section{Motivation}
|
---|
[bdc8591] | 55 |
|
---|
[16915b1] | 56 | The primary motivation for this work is two fold:
|
---|
| 57 | \begin{enumerate}[leftmargin=*]
|
---|
| 58 | \item
|
---|
[7d415b4] | 59 | These three aspects of C are difficult to understand, teach, and get right because they are correspondingly low level.
|
---|
| 60 | Providing higher-level, feature-rich versions of these containers in \CFA is a major component of the primary goal.
|
---|
[16915b1] | 61 | \item
|
---|
| 62 | These three aspects of C cause the greatest safety issues because there are few or no safe guards when a programmer misunderstands or misuses these features~\cite{Elliott18, Blache19, Ruef19, Oorschot23}.
|
---|
| 63 | Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occur in C resulting from errors using these facilities (memory errors), providing the major hacker attack-vectors.
|
---|
| 64 | \end{enumerate}
|
---|
| 65 | Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to move away from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these older languages.
|
---|
| 66 | Hardening these three types goes a long way to make the majority of C programs safer.
|
---|
| 67 |
|
---|
| 68 |
|
---|
| 69 | While multiple new languages purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the new runtime uses garage collection.
|
---|
| 70 | Furthermore, these languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication.
|
---|
| 71 | Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning \CC.
|
---|
| 72 | Hence, rewriting and retraining costs for these languages can be prohibitive for companies with a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia).
|
---|
[bdc8591] | 73 |
|
---|
| 74 |
|
---|
[6a8c773] | 75 | \subsection{C?}
|
---|
[bdc8591] | 76 |
|
---|
[051aec4] | 77 | Like many established programming languages, C has a standards committee and multiple ANSI/\-ISO language manuals~\cite{C99,C11,C18,C23}.
|
---|
[7d415b4] | 78 | However, most programming languages are only partially explained by their standard's manual(s).
|
---|
[6a8c773] | 79 | When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}.
|
---|
[16915b1] | 80 | Often other C compilers must mimic @gcc@ because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features.
|
---|
[7d415b4] | 81 | While some key aspects of C need to be explained and understood by quoting from the language reference manual, to illustrate actual program semantics, this thesis uses constructs a program whose behaviour exercises a particular point and then confirms the behaviour by running the program.
|
---|
[6a8c773] | 82 | These example programs show
|
---|
[16915b1] | 83 | \begin{itemize}[leftmargin=*]
|
---|
| 84 | \item if the compiler accepts or rejects certain syntax,
|
---|
[7d415b4] | 85 | \item prints output to buttress a behavioural claim,
|
---|
[16915b1] | 86 | \item or executes without triggering any embedded assertions testing pre/post-assertions or invariants.
|
---|
[bdc8591] | 87 | \end{itemize}
|
---|
[7d415b4] | 88 | This work has been tested across @gcc@ versions 8--14 and clang versions 10--14 running on ARM, AMD, and Intel architectures.
|
---|
[6a8c773] | 89 | Any discovered anomalies among compilers or versions is discussed.
|
---|
[7d415b4] | 90 | In all case, it is never clear whether the \emph{truth} lies in the compiler(s) or the C standard.
|
---|
[27f1055] | 91 |
|
---|
| 92 |
|
---|
| 93 | \section{Contributions}
|
---|
[bdc8591] | 94 |
|
---|
[7d415b4] | 95 | This work has produced significant syntactic and semantic improvements to C's arrays, linked-lists and string types.
|
---|
| 96 | As well, a strong plan for general iteration has been sketched out.
|
---|
| 97 |
|
---|
[02ea981] | 98 | This thesis mainly describes improvements made to the source code of the \CFA compiler and runtime system,
|
---|
| 99 | available at TODO.
|
---|
| 100 |
|
---|
| 101 | This work often leverages preexisting xxxxxxxxxx
|
---|
| 102 |
|
---|
[1e110bf] | 103 | \subsection{Linked list}
|
---|
[bdc8591] | 104 |
|
---|
[02ea981] | 105 | The linked list is a new runtime library, available as @#include <dlist.hfa>@.
|
---|
| 106 | The library offers a novel combination of the preexisting \CFA features of:
|
---|
| 107 | references, inline inheritance, polymorphism and declaration scoping.
|
---|
| 108 | A programmer's experience of the list datatype within the type system is novel.
|
---|
| 109 | The list's runtime representation follows the established design pattern of intrusion.
|
---|
| 110 | This thesis includes a performance evaluation that shows
|
---|
| 111 | the list performing comparably with other C-family intrusive lists,
|
---|
| 112 | and notably better than the \CC standard list.
|
---|
| 113 |
|
---|
[bdc8591] | 114 | \subsection{Array}
|
---|
| 115 |
|
---|
[02ea981] | 116 | This work's array improvements are
|
---|
| 117 | \begin{enumerate}[leftmargin=*]
|
---|
| 118 | \item
|
---|
| 119 | subtle changes to the typing rules for the C array,
|
---|
| 120 | \item
|
---|
| 121 | a new \CFA language construct for a type-system managed array length, and
|
---|
| 122 | \item
|
---|
| 123 | a new standard-library array type, available as @#include <array.hfa>@.
|
---|
| 124 | \end{enumerate}
|
---|
| 125 | The new array type, enabled by the earlier features, defines an array with
|
---|
| 126 | guaranteed runtime bound checks (often optimizer-removable) and
|
---|
| 127 | implicit (guaranteed accurate) inter-function length communication.
|
---|
| 128 | Leveraging the preexisting \CFA type system to achieve
|
---|
| 129 | this length-related type checking is an original contribution.
|
---|
| 130 |
|
---|
| 131 | Furthermore, the new array incorporates multidimensional capabilities
|
---|
| 132 | typical of scientific/machine-learning languages,
|
---|
| 133 | made to coexist with the C ``raw-memory-aware'' tradition in a novel way.
|
---|
| 134 |
|
---|
[bdc8591] | 135 | \subsection{String}
|
---|
[c721105] | 136 |
|
---|
[02ea981] | 137 | The string is a new runtime library, available as @#include <string.hfa>@.
|
---|
| 138 | The library offers a type whose basic usage is comparable to the \CC @string@ type,
|
---|
| 139 | including analogous coexistence with raw character pointers.
|
---|
| 140 | Its implementation, however, follows different princliples,
|
---|
| 141 | enabling programs to work with strings by value, without incurring excessive copying.
|
---|
| 142 | Mutability is retained.
|
---|
| 143 | Substrings are supported, including the ability for overlapping ranges to share edits transparently.
|
---|
| 144 | Ultimately, this implementation is a case of strings rolling their own heap.
|
---|
| 145 |
|
---|
| 146 | The string library includes writing and reading strings via the preexsiting \CFA I/O stream library.
|
---|
| 147 | Enabling transparent reading (of unknown length into a managed allocation) included revision of
|
---|
| 148 | the stream libarary's existing handling of character arrays.
|
---|
| 149 |
|
---|
[c721105] | 150 | \subsection{Iterator}
|
---|
[02ea981] | 151 |
|
---|
| 152 | Some prototyping, available in the \CFA standard library,
|
---|
| 153 | addresses the infamous \CC bug source of iterator invalidation.
|
---|
| 154 | A family of iterator styles is presented, with cheap iterators that resist being misused,
|
---|
| 155 | plus full-featured iterators that observe modifications without becoming invalid.
|
---|
| 156 |
|
---|
| 157 | Further design within this thesis presents a home for Liskov-style iterators in \CFA.
|
---|
| 158 | This design extends a preexisting proposal to adapt the \CFA for-each style loop to be more user-pluggable.
|
---|
| 159 | It also builds upon preexisting \CFA coroutines.
|
---|
| 160 | It simplifies the work a programmer must do to leverage the suspended-state abstraction.
|
---|