source: doc/theses/mike_brooks_MMath/intro.tex@ 5faa3a5

Last change on this file since 5faa3a5 was 5faa3a5, checked in by Peter A. Buhr <pabuhr@…>, 20 hours ago

spelling corrections, and final proofreading

  • Property mode set to 100644
File size: 17.6 KB
Line 
1\chapter{Introduction}
2
3All modern programming languages provide at least these three high-level collections (containers): array, linked-list, and string.
4Often, the array is part of the programming language, while linked lists are built from (recursive) pointer types, and strings from arrays and/or linked lists.
5For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component levels, such as copying, slicing, extracting, iterating among, adding and removing elements.
6
7Unfortunately, memory misuse under C-idiomatic programming causes a significant number of errors~\cite{Oorschot23}.
8Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occurring in C result from errors using these facilities (memory errors).
9For operating system and browser vendors, who heavily use systems languages, 60\%--70\% of reported software vulnerabilities involved memory errors~\cite{Kehrer23}.
10For Microsoft, 70\% of vulnerabilities addressed via security updates between 2006--2018 are memory safety issues~\cite[slide 10]{Miller19}.
11In a study of software exploits in the U.S. National Vulnerability Database over 2013--2017, the top reported vulnerability is buffer errors (in the sense of misused array bounds), among 19 vulnerability categories~\cite{Cifuentes19}.
12
13While these figures address memory errors, not limited to the three primary collections, the authors' explanations of memory errors suggest improving on C's language/library options for working with the three collections would be beneficial.
14Every one of the safety-statistic sources above includes array bounds in its first characterization of unsafe programming-language features.
15Of the four studies that deal exclusively with memory safety (excluding~\cite{Cifuentes19}), all put @malloc@--@free@ upfront as well.
16The specific survey of C features, \cite{Oorschot23}, also places blame on:
17\begin{itemize}
18\item
19the null-terminated text idiom, with its support/mandate in the C standard library and
20\item
21the unprotected nature of C pointers.
22\end{itemize}
23
24An effort at translating C code to the safe subset of the now-famously safe language Rust~\cite{Emre2022} shows pointers as the stand-out difficult aspect to analyze.
25It demonstrates the difficulty of automatically determining pointer-referent lifetimes in that its success claim is to reduce the occurrence rate of unsafe dereference operations to about once per 100 lines of code.
26
27\begin{comment}
28It demonstrates the difficulty of analyzing C pointers for safety.
29It first puts a corpus of C programs through a naive Rust translation, which ``succeeds'' by leaving many uses of Rust's \lstinline{unsafe} marker.
30Then, it analyzes these uses and presents a technique for reducing the uses that hinge exclusively on referenced object lifetime.
31Pointer dereference accounts for 80\% of the cases that the naive translation did not make safe (Table 3.2, p.~33).
32Of these, 10\% are eligible for the work's own safety improvement technique, while this filtering leaves 75\% with no single cause of unsafety determined (Table 3.5, p.~39).
33The presented technique succeeds at making 75\% of the eligible initially unsafe dereferences safe, by inferring the necessary lifetime annotations (Table 4.2, p.~83).
34This result leaves, per 1000 LOC, 3.3 unremovable unsafe dereferences that are understood to evade lifetime analysis, among 220 gross unsafe dereferences (Tables 3.5, 4.2 and 3.1, p.~27).
35% 3.3 = 1398/413,428 * 1000
36% 220 ~= (99,101 [@tab3.5] - 9631 [@tab4.2] ) / 413,428 [@tab 3.1] * 1000
37\end{comment}
38
39A tool for analyzing C code's use of linked data structures (including, \eg skip list and binary tree) found variations of the linked-list shape to be present in most of the programs in its corpus~\cite{White2016}.
40
41So, C's array unsafety is infamous, its string pattern necessitates explicit storage management (also infamous), and user-written linked lists are a recognizable player in the arena of (infamously) unsafe raw pointer uses.
42Therefore, hardening the three primary collections goes a long way to making the majority of C programs safer, eliminating major hacker attack vectors.
43
44My work looks at extending these foundational collection types in the programming language \CFA, which is a new dialect of the C programming language.
45A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition.
46This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base.
47An equally important goal is balancing good performance with safety.
48
49The thesis describes improvements I made to the \CFA language design, both syntax and semantics, to support the collection features, and the source code created within the \CFA compiler, libraries, and runtime system to implement these features.
50This work leverages preexisting work within the compiler's type and runtime systems generated by prior graduate students working on the \CFA project.
51
52
53\section{Array}
54
55An array provides a homogeneous collection with $O(1)$ access to elements using subscripting.
56Higher-level operations like array slicing (single or multidimensional) may have significantly higher cost, but provide a better programmer experience.
57The array size can be static, dynamic but fixed after creation, or dynamic and variable after creation.
58For static and dynamic-fixed, an array can be stack allocated, while dynamic-variable requires the heap.
59Because array layout has contiguous components, subscripting is a computation, \ie some form of pointer arithmetic.
60Not included in this requirement are pseudo-arrays with arbitrary keys, \eg hash table subscripted by a string key (see below).
61
62C provides a simple array type as a language feature.
63However, it adopts the controversial position of treating pointer and array as twins, leading to multiple problems.
64Hence, the way that arrays are typically passed around in a program removes the information necessary to do bound checks.
65
66
67\section{Linked List}
68
69A linked-list provides a homogeneous collection often with $O(n)$ access to elements using successor and predecessor operations that involve pointer chasing.
70It is the most foundational case of the linked data structures, which include trees and hash tables.
71Among the linked structures, the list's defining characteristic is maintaining a user-explicit total order (chain shape).
72Element search can be by position (as for array) or by value; it is sometimes presented as a subscript.
73In a more complex structure, the linked shape may be system managed (user implicit) by being a function of a designated ``key'' datum, \eg often for a hash table.
74Though such schemes often give better-than-$O(n)$ lookup, the linked list's user-explicit shape limits what the system can provide, to fast step/iteration for user-``nearby'' data, and slow exhaustive search/seek otherwise.
75Linked types are dynamically sized by adding and removing nodes using link fields internal or external to the list elements (nodes).
76If a programming language allows pointers to stack storage, linked nodes can be allocated on the stack and connected with stack addresses;
77otherwise, elements are heap allocated with internal or external link fields.
78
79
80
81C provides a few simple, library data-structures (@glibc@):
82\begin{itemize}
83\item
84singly-linked lists, singly-linked tail queues, lists, and tail queues (@<sys/queue.h>@) \see{\VRef{s:PreexistingLinked-ListLibraries}}
85\item
86hash search table consisting of a key (string) with associated data (@<search.h>@)
87\end{itemize}
88Because these collection libraries can be restrictive or awkward to use, C programmers often build bespoke linked data-structures, which further increases the potential for memory errors.
89
90
91\section{String}
92
93A string provides a dynamic array of homogeneous elements, where the elements are (often) some form of human-readable characters.
94What differentiates a string from other types is that many of its operations work on groups of elements for scanning and changing, \eg @index@ and @substr@.
95While subscripting individual elements is usually available, working at the character level is considered poor practice, \ie underutilizing the powerful string operations.
96Therefore, the cost of a string operation is usually less important than the power of the operation to accomplish complex text manipulation, \eg search, analyzing, composing, and decomposing string text.
97The dynamic nature of a string means storage is normally heap allocated but often implicitly managed, even in unmanaged languages.
98In many cases, string management is separate from heap management, \ie string libraries roll their own heaps.
99
100The C string type is just a character array and requires user storage management to handle varying size.
101C adopts a terminating sentinel character, @'\0'@, to mark the end of a variable-length character array, as opposed to keeping an independent length field.
102Hence, the sentinel character is excluded as an element within a string and determining the string length is an $O(N)$ operation.
103The C standard library includes a number of high-level operations for working with this representation.
104Most of these operations are awkward to use and highly error prone.
105
106
107\begin{comment}
108\section{Iterator}
109
110As a side issue to complex structured-types is iterating through them.
111Some thought has been given to \emph{general} versus specific iteration capabilities as part of of this work.
112However, the general iteration work is only a sketch for others as future work.
113Nevertheless, sufficed work was done to write out the ideas that developed and how they should apply in the main context of this work.
114\end{comment}
115
116
117\section{Motivation}
118
119The primary motivation for this work is twofold:
120\begin{enumerate}[leftmargin=*]
121\item
122The three primary collection types in C are difficult to understand, teach, and get right because they are too low level.
123Providing higher-level, feature-rich versions of these collections in \CFA is a major component of the primary goal.
124The result is a simplified programming experience, which increases productivity and maintainability.
125\item
126The new collection types must be as correct and safe to use as those in other modern programming languages, to elide the concerns of industry, government, and military~\cite{ONCD}.
127Prior approaches focus on out-of-bound array accesses using a model-based approach (ASCET) in embedded systems (\eg cars)~\cite{Blache19}, and general and null-terminated string arrays using a \CC template syntax (Checked C)~\cite{Elliott18,Ruef19}.
128Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to \emph{move away} from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these languages.
129Fixing these vulnerabilities negates this need, allowing C and its ecosystem to continue into the future.
130\end{enumerate}
131
132While new languages, \eg Go, Rust, Swift, purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the runtime uses garbage collection~\cite{Brinker26}.
133Even assuming automated conversion of C programs to other safe languages, who will maintain this massive new code-base?
134Furthermore, new languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication.
135Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, overloading, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning new \CC features and idioms.
136Finally, projects claiming to be modern C replacements (Cyclone~\cite{Grossman06}, Polymorphic C~\cite{Smith98}, GObject~\cite{GObject}) do not retain C backwards compatibility in syntax, programming model, or semantic compatibility;
137these languages are equivalent to switching to a different programming language.
138Hence, rewriting and retraining costs for other languages are prohibitive when companies have a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia).
139
140
141\section{Which C?}
142
143Like many established programming languages, C has a standards committee and multiple ANSI/\-ISO language manuals~\cite{C99,C11,C18,C23}.
144However, most programming languages are only partially explained by their standard's manual(s).
145When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}, because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features.
146Some key aspects of C need to be explained and understood by quoting from the language reference manual.
147However, to illustrate actual program semantics, this thesis constructs multiple small programs whose behaviour exercises a particular point and then confirms the behaviour by running the program using multiple @gcc@ compilers.
148These example programs show
149\begin{itemize}[itemsep=0pt]
150 \item whether the compiler accepts or rejects certain syntax,
151 \item printed output to buttress a behavioural claim, or
152 \item passed/failed assertions testing pre/post-conditions or invariants.
153\end{itemize}
154These programs are tested across @gcc@ versions 8--14 and clang versions 10--14 running on ARM, AMD, and Intel architectures.
155Any discovered anomalies among compilers, versions, or architectures are discussed.
156In general, it is never clear whether the \emph{truth} lies in the C standard or the compiler(s), which may be true for other programming languages.
157
158
159\section{Contributions}
160
161Overall, this work has produced significant syntactic and semantic improvements to C's arrays, linked-lists and string types.
162% As well, a strong plan for general iteration has been sketched out.
163The following are my detailed contributions, where performance and safety are always the motivating factors.
164
165\subsection{Array}
166
167The improvements to C arrays are:
168\begin{enumerate}[leftmargin=*]
169\item
170Introduce a small number of subtle changes to the typing rules for the C array, while still achieving significant backwards compatibility.
171\item
172Create a new polymorphic mechanism in the \CFA @forall@ clause to specify array dimension values, similar to a \CC non-typed template parameter.
173\item
174Construct a new standard-library array type, available through @#include <array.hfa>@.
175\end{enumerate}
176The new array type accesses prior \CFA type-system features, delivering guaranteed runtime bound checks (often optimizer-removable) and implicit (guaranteed accurate) inter-function length communication.
177Extending the preexisting \CFA type system to achieve this length-related type checking is an original contribution.
178Furthermore, the new array incorporates multidimensional capabilities typical of scientific/machine-learning languages, made to coexist with the C raw-memory-aware tradition, in a novel way.
179The thesis includes a performance-relevant assessment of generated code that shows the new \CFA array simultaneously delivering the performance of a C array and the safety of a \CC vector.
180
181
182\subsection{Linked List}
183
184The linked list is a new runtime library, available through @#include <list.hfa>@.
185The library offers a novel combination of the preexisting \CFA features: references, inline inheritance, polymorphism, and declaration scoping.
186A programmer's experience using the list data types within the type system is novel.
187The list's runtime representation follows the established design pattern of intrusive link fields for performance reasons, especially in concurrent programs.
188The thesis includes a performance evaluation that shows the list performing comparably with other C-family intrusive lists, and notably better than the \CC standard list.
189
190
191\subsection{String}
192
193A new string type and its runtime library are available through @#include <string.hfa>@.
194The library offers a type where basic usage is comparable to the \CC @string@ type, including analogous coexistence with raw-character pointers.
195Its implementation, however, follows different principles, enabling programs to work with strings as simple values, without incurring excessive copying.
196Mutability is retained.
197Substrings are supported, including the option for overlapping ranges to share edits transparently, or act as diverging snapshots.
198The implementation allows a set of strings to use an optimized, shared, custom heap, or to use the regular heap, on strings that are passed among threads.
199
200The string library includes reading and writing strings via the preexisting \CFA I/O stream library.
201Enabling transparent reading of unknown length strings using managed allocation required revision of the stream library's existing handling of character arrays.
202All reasonable stream manipulators are supported, \eg width, justification (left/right), printing characters as their numeric values, and complex input scanning to isolate strings in the input stream.
203
204
205\begin{comment}
206\subsection{Iterator}
207
208Some advanced iterator prototyping now exists, available in the \CFA standard library.
209Specifically, a family of iterator styles is provided, with cheap iterators that are robust to being misused, plus full-featured iterators that observe structural modifications without becoming invalid.
210Hence, the infamous \CC bug of iterator invalidation, where structural changes cause legitimate operations to fail, \eg iterator over a list and deleting each element.
211
212Further design provides a home for Liskov-style iterators (stackless-coroutine) in \CFA.
213This design extends a preexisting proposal to adapt the \CFA (fixed) for-each loop to be more user-pluggable, and builds upon preexisting \CFA coroutines.
214Overall, it simplifies the work a programmer must do to leverage the suspended-state abstraction during iteration.
215\end{comment}
Note: See TracBrowser for help on using the repository browser.