1 | \chapter{Introduction}
|
---|
2 |
|
---|
3 | All modern programming languages provide three high-level containers (collections): array, linked-list, and string.
|
---|
4 | Often array is part of the programming language, while linked-list is built from (recursive) pointer types, and string from a combination of array and linked-list.
|
---|
5 | For all three types, languages supply varying degrees of high-level mechanism for manipulating these objects at the bulk level and at the component level, such as array copy, slicing and iterating using subscripting.
|
---|
6 |
|
---|
7 | This work looks at extending these three foundational container types in the programming language \CFA, which is a new dialect of the C programming language.
|
---|
8 | A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition.
|
---|
9 | This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base.
|
---|
10 | An equally important goal is balancing good performance with safety.
|
---|
11 |
|
---|
12 |
|
---|
13 | \section{Array}
|
---|
14 |
|
---|
15 | An array provides a homogeneous container with $O(1)$ access to elements using subscripting.
|
---|
16 | Higher-level operations like array slicing (single or multidimensional) may have significantly higher cost, but provides a better programmer experience.
|
---|
17 | The array size can be static, dynamic but fixed after creation, or dynamic and variable after creation.
|
---|
18 | For static and dynamic-fixed, an array can be stack allocated, while dynamic-variable requires the heap.
|
---|
19 | Because array layout has contiguous components, subscripting is a computation (some form of pointer arithmetic).
|
---|
20 |
|
---|
21 | C provides an array as a language feature.
|
---|
22 |
|
---|
23 | \section{Linked list}
|
---|
24 |
|
---|
25 | A linked-list provides a homogeneous container often with $O(log N)$ or $O(N)$ access to elements using successor and predecessor operations that normally involve pointer chasing.
|
---|
26 | Subscripting by value is sometimes available, \eg hash table.
|
---|
27 | Linked types are normally dynamically sized by adding/removing nodes using link fields internal or external to the elements (nodes).
|
---|
28 | If a programming language allows pointer to stack storage, linked-list types can be allocated on the stack;
|
---|
29 | otherwise, elements are heap allocated with explicitly/implicitly managed.
|
---|
30 |
|
---|
31 | C does not provide a linked-list, either as an obvious commonly-accepted library, nor as an outright language feature.
|
---|
32 | C programmers commonly build linked-list behaviours into their bespoke data structures, directly using its language feature of pointers.
|
---|
33 |
|
---|
34 | \section{String}
|
---|
35 |
|
---|
36 | A string provides a dynamic array of homogeneous elements, where the elements are (often) some form of human-readable characters.
|
---|
37 | What differentiates a string from other types in that many of its operations work on groups of elements for scanning and changing, \eg @index@ and @substr@;
|
---|
38 | subscripting individual elements is usually available, too.
|
---|
39 | Therefore, the cost of a string operation is usually less important than the power of the operation to accomplish complex text manipulation, \eg search, analysing, composing, and decomposing.
|
---|
40 | The dynamic nature of a string means storage is normally heap allocated but often implicitly managed, even in unmanaged languages.
|
---|
41 | In some cases, string management is separate from heap management, \ie strings roll their own heap.
|
---|
42 |
|
---|
43 | The C string type is a user-managed allocation of the language-provided character array,
|
---|
44 | plus the convention of marking its (variable) length by placing the 0-valued control character at the end (``null-terminated'').
|
---|
45 | The C standard library includes many operations for working on this representation.
|
---|
46 |
|
---|
47 | \section{Iterator}
|
---|
48 |
|
---|
49 | As a side issue to working with complex structured types is iterating through them.
|
---|
50 | Some thought has been given to \emph{general} versus specific iteration capabilities as part of of this work, but the general iteration work is only a sketch for others as future work.
|
---|
51 | Nevertheless, sufficed work was done to write out the ideas that developed and how they should apply in the main context of this work.
|
---|
52 |
|
---|
53 |
|
---|
54 | \section{Motivation}
|
---|
55 |
|
---|
56 | The primary motivation for this work is two fold:
|
---|
57 | \begin{enumerate}[leftmargin=*]
|
---|
58 | \item
|
---|
59 | These three aspects of C are difficult to understand, teach, and get right because they are correspondingly low level.
|
---|
60 | Providing higher-level, feature-rich versions of these containers in \CFA is a major component of the primary goal.
|
---|
61 | \item
|
---|
62 | These three aspects of C cause the greatest safety issues because there are few or no safe guards when a programmer misunderstands or misuses these features~\cite{Elliott18, Blache19, Ruef19, Oorschot23}.
|
---|
63 | Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occur in C resulting from errors using these facilities (memory errors), providing the major hacker attack-vectors.
|
---|
64 | \end{enumerate}
|
---|
65 | Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to move away from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these older languages.
|
---|
66 | Hardening these three types goes a long way to make the majority of C programs safer.
|
---|
67 |
|
---|
68 |
|
---|
69 | While multiple new languages purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the new runtime uses garage collection.
|
---|
70 | Furthermore, these languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication.
|
---|
71 | Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning \CC.
|
---|
72 | Hence, rewriting and retraining costs for these languages can be prohibitive for companies with a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia).
|
---|
73 |
|
---|
74 |
|
---|
75 | \subsection{C?}
|
---|
76 |
|
---|
77 | Like many established programming languages, C has a standards committee and multiple ANSI/\-ISO language manuals~\cite{C99,C11,C18,C23}.
|
---|
78 | However, most programming languages are only partially explained by their standard's manual(s).
|
---|
79 | When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}.
|
---|
80 | Often other C compilers must mimic @gcc@ because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features.
|
---|
81 | While some key aspects of C need to be explained and understood by quoting from the language reference manual, to illustrate actual program semantics, this thesis uses constructs a program whose behaviour exercises a particular point and then confirms the behaviour by running the program.
|
---|
82 | These example programs show
|
---|
83 | \begin{itemize}[leftmargin=*]
|
---|
84 | \item if the compiler accepts or rejects certain syntax,
|
---|
85 | \item prints output to buttress a behavioural claim,
|
---|
86 | \item or executes without triggering any embedded assertions testing pre/post-assertions or invariants.
|
---|
87 | \end{itemize}
|
---|
88 | This work has been tested across @gcc@ versions 8--14 and clang versions 10--14 running on ARM, AMD, and Intel architectures.
|
---|
89 | Any discovered anomalies among compilers or versions is discussed.
|
---|
90 | In all case, it is never clear whether the \emph{truth} lies in the compiler(s) or the C standard.
|
---|
91 |
|
---|
92 |
|
---|
93 | \section{Contributions}
|
---|
94 |
|
---|
95 | This work has produced significant syntactic and semantic improvements to C's arrays, linked-lists and string types.
|
---|
96 | As well, a strong plan for general iteration has been sketched out.
|
---|
97 |
|
---|
98 | This thesis mainly describes improvements made to the source code of the \CFA compiler and runtime system,
|
---|
99 | available at TODO.
|
---|
100 |
|
---|
101 | This work often leverages preexisting xxxxxxxxxx
|
---|
102 |
|
---|
103 | \subsection{Linked list}
|
---|
104 |
|
---|
105 | The linked list is a new runtime library, available as @#include <dlist.hfa>@.
|
---|
106 | The library offers a novel combination of the preexisting \CFA features of:
|
---|
107 | references, inline inheritance, polymorphism and declaration scoping.
|
---|
108 | A programmer's experience of the list datatype within the type system is novel.
|
---|
109 | The list's runtime representation follows the established design pattern of intrusion.
|
---|
110 | This thesis includes a performance evaluation that shows
|
---|
111 | the list performing comparably with other C-family intrusive lists,
|
---|
112 | and notably better than the \CC standard list.
|
---|
113 |
|
---|
114 | \subsection{Array}
|
---|
115 |
|
---|
116 | This work's array improvements are
|
---|
117 | \begin{enumerate}[leftmargin=*]
|
---|
118 | \item
|
---|
119 | subtle changes to the typing rules for the C array,
|
---|
120 | \item
|
---|
121 | a new \CFA language construct for a type-system managed array length, and
|
---|
122 | \item
|
---|
123 | a new standard-library array type, available as @#include <array.hfa>@.
|
---|
124 | \end{enumerate}
|
---|
125 | The new array type, enabled by the earlier features, defines an array with
|
---|
126 | guaranteed runtime bound checks (often optimizer-removable) and
|
---|
127 | implicit (guaranteed accurate) inter-function length communication.
|
---|
128 | Leveraging the preexisting \CFA type system to achieve
|
---|
129 | this length-related type checking is an original contribution.
|
---|
130 |
|
---|
131 | Furthermore, the new array incorporates multidimensional capabilities
|
---|
132 | typical of scientific/machine-learning languages,
|
---|
133 | made to coexist with the C ``raw-memory-aware'' tradition in a novel way.
|
---|
134 |
|
---|
135 | \subsection{String}
|
---|
136 |
|
---|
137 | The string is a new runtime library, available as @#include <string.hfa>@.
|
---|
138 | The library offers a type whose basic usage is comparable to the \CC @string@ type,
|
---|
139 | including analogous coexistence with raw character pointers.
|
---|
140 | Its implementation, however, follows different princliples,
|
---|
141 | enabling programs to work with strings by value, without incurring excessive copying.
|
---|
142 | Mutability is retained.
|
---|
143 | Substrings are supported, including the ability for overlapping ranges to share edits transparently.
|
---|
144 | Ultimately, this implementation is a case of strings rolling their own heap.
|
---|
145 |
|
---|
146 | The string library includes writing and reading strings via the preexsiting \CFA I/O stream library.
|
---|
147 | Enabling transparent reading (of unknown length into a managed allocation) included revision of
|
---|
148 | the stream libarary's existing handling of character arrays.
|
---|
149 |
|
---|
150 | \subsection{Iterator}
|
---|
151 |
|
---|
152 | Some prototyping, available in the \CFA standard library,
|
---|
153 | addresses the infamous \CC bug source of iterator invalidation.
|
---|
154 | A family of iterator styles is presented, with cheap iterators that resist being misused,
|
---|
155 | plus full-featured iterators that observe modifications without becoming invalid.
|
---|
156 |
|
---|
157 | Further design within this thesis presents a home for Liskov-style iterators in \CFA.
|
---|
158 | This design extends a preexisting proposal to adapt the \CFA for-each style loop to be more user-pluggable.
|
---|
159 | It also builds upon preexisting \CFA coroutines.
|
---|
160 | It simplifies the work a programmer must do to leverage the suspended-state abstraction.
|
---|