source: doc/theses/mike_brooks_MMath/intro.tex@ 4cf8832

Last change on this file since 4cf8832 was c979afa, checked in by Peter A. Buhr <pabuhr@…>, 7 weeks ago

final proofread of introduction chapter

  • Property mode set to 100644
File size: 16.7 KB
Line 
1\chapter{Introduction}
2
3All modern programming languages provide at least these three high-level containers (collections): array, linked-list, and string.
4Often, the array is part of the programming language, while linked lists are built from (recursive) pointer types, and strings from arrays and/or linked lists.
5For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component levels, such as copying, slicing, extracting, and iterating among elements.
6
7Unfortunately, memory misuse under C-idiomatic programming causes a significant number of errors~\cite{Oorschot23}.
8Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occurring in C result from errors using these facilities (memory errors).
9For operating system and browser vendors, who heavily use systems languages, 60\%--70\% of reported software vulnerabilities involved memory errors~\cite{Kehrer23}.
10For Microsoft, 70\% of vulnerabilities addressed via security updates between 2006--2018 are memory safety issues~\cite[slide 10]{Miller19}.
11In a study of software exploits in the U.S. National Vulnerability Database over 2013--2017, the top reported vulnerability is buffer errors (in the sense of misused array bounds), among 19 vulnerability categories~\cite{Cifuentes19}.
12
13While these figures address memory errors, not limited to the three primary collections, the authors' explanations of memory errors suggest improving on C's language/library options for working with the three collections would be beneficial.
14Every one of the safety-statistic sources above includes array bounds in its first characterization of unsafe programming-language features.
15Of the four studies that deal exclusively with memory safety (excluding~\cite{Cifuentes19}), all put @malloc@--@free@ upfront as well.
16The specific survey of C features, \cite{Oorschot23}, also places blame on:
17\begin{itemize}
18\item
19the null-terminated text idiom, with its support/mandate in the C standard library and
20\item
21the unprotected nature of C pointers.
22\end{itemize}
23
24An effort at translating C to safe Rust~\cite{Emre2022} has pointers as the stand-out difficult aspect to analyze.
25It demonstrates the difficulty of analyzing C pointers for safety.
26It first puts a corpus of C programs through a naive Rust translation, which ``succeeds'' by leaving many uses of Rust's \lstinline{unsafe} marker.
27Then, it analyzes these uses and presents a technique for reducing the uses that hinge exclusively on referenced object lifetime.
28Pointer dereference accounts for 80\% of the cases that the na\"{i}ve translation did not make safe (Table 3.2, p.~33).
29Of these, 10\% are eligible for the work's own safety improvement technique, while this filtering leaves 75\% with no single cause of unsafety determined (Table 3.5, p.~39).
30The presented technique succeeds at making 75\% of the eligible initially unsafe dereferences safe, by inferring the necessary lifetime annotations (Table 4.2, p.~83).
31This result leaves, per 1000 LOC, 3.3 unremovable unsafe dereferences that are understood to evade lifetime analysis, among 220 gross unsafe dereferences (Tables 3.5, 4.2 and 3.1, p.~27).
32% 3.3 = 1398/413,428 * 1000
33% 220 ~= (99,101 [@tab3.5] - 9631 [@tab4.2] ) / 413,428 [@tab 3.1] * 1000
34
35A tool for analyzing C code's use of linked data structures (including, \eg skip list and binary tree) found variations of the linked-list shape to be present in most of the programs in its corpus~\cite{White2016}.
36So, C's array unsafety is infamous, its string pattern necessitates explicit storage management (also infamous), and user-written linked lists are a recognizable player in the arena of (infamously) unsafe raw pointer uses.
37Therefore, hardening the three primary collections goes a long way to making the majority of C programs safer, eliminating major hacker attack vectors.
38
39This work looks at extending these foundational container types in the programming language \CFA, which is a new dialect of the C programming language.
40A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition.
41This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base.
42An equally important goal is balancing good performance with safety.
43
44The thesis describes improvements made to the \CFA language design, both syntax and semantics, to support the container features, and the source code created within the \CFA compiler, libraries, and runtime system to implement these features.
45This work leverages preexisting work within the compiler's type and runtime systems generated by prior graduate students working on the \CFA project.
46
47
48\section{Array}
49
50An array provides a homogeneous container with $O(1)$ access to elements using subscripting.
51Higher-level operations like array slicing (single or multidimensional) may have significantly higher cost, but provide a better programmer experience.
52The array size can be static, dynamic but fixed after creation, or dynamic and variable after creation.
53For static and dynamic-fixed, an array can be stack allocated, while dynamic-variable requires the heap.
54Because array layout has contiguous components, subscripting is a computation, \ie some form of pointer arithmetic.
55Not included in this requirement are pseudo-arrays with arbitrary keys, \eg hash table subscripted by a string key (see below).
56
57C provides a simple array type as a language feature.
58However, it adopts the controversial position of treating pointer and array as twins, leading to multiple problems.
59Hence, the way that arrays are typically passed around in a program removes the information necessary to do bound checks.
60
61
62\section{Linked List}
63
64A linked-list provides a homogeneous container often with $O(log N)$ or $O(N)$ access to elements using successor and predecessor operations that normally involve pointer chasing.
65Subscripting by value (rather than position or location as for array) is sometimes available, \eg hash table.
66Linked types are dynamically sized by adding and removing nodes using link fields internal or external to the list elements (nodes).
67If a programming language allows pointers to stack storage, linked-list nodes can be allocated on the stack and connected with stack addresses;
68otherwise, elements are heap allocated with internal or external link fields.
69
70C provides a few simple, library data-structures (@glibc@):
71\begin{itemize}
72\item
73singly-linked lists, singly-linked tail queues, lists, and tail queues (@<sys/queue.h>@) \see{\VRef{s:PreexistingLinked-ListLibraries}}
74\item
75hash search table consisting of a key (string) with associated data (@<search.h>@)
76\end{itemize}
77Because these container libraries can be restrictive or awkward to use, C programmers often build bespoke linked data-structures, which further increases the potential for memory errors.
78
79
80\section{String}
81
82A string provides a dynamic array of homogeneous elements, where the elements are (often) some form of human-readable characters.
83What differentiates a string from other types in that many of its operations work on groups of elements for scanning and changing, \eg @index@ and @substr@.
84While subscripting individual elements is usually available, working at the character level is considered poor practise, \ie underutilizing the powerful string operations.
85Therefore, the cost of a string operation is usually less important than the power of the operation to accomplish complex text manipulation, \eg search, analysing, composing, and decomposing string text.
86The dynamic nature of a string means storage is normally heap allocated but often implicitly managed, even in unmanaged languages.
87In many cases, string management is separate from heap management, \ie strings roll-their-own heap.
88
89The C string type is just a character array and requires user storage-management to handle varying size.
90C adopts a terminating sentinel character, @'\0'@, to mark the end of a variable-length character-array versus a preceding length field.
91Hence, the sentinel character is excluded as an element within a string and determining the string length is an $O(N)$ operation.
92The C standard library includes a number of high-level operations for working with this representation.
93Most of these operations are awkward to use and highly error prone.
94
95
96\begin{comment}
97\section{Iterator}
98
99As a side issue to complex structured-types is iterating through them.
100Some thought has been given to \emph{general} versus specific iteration capabilities as part of of this work.
101However, the general iteration work is only a sketch for others as future work.
102Nevertheless, sufficed work was done to write out the ideas that developed and how they should apply in the main context of this work.
103\end{comment}
104
105
106\section{Motivation}
107
108The primary motivation for this work is two fold:
109\begin{enumerate}[leftmargin=*]
110\item
111The three primary container types in C are difficult to understand, teach, and get right because they are too low level.
112Providing higher-level, feature-rich versions of these containers in \CFA is a major component of the primary goal.
113The result is a simplified programming experience, which increases productivity and maintainability.
114\item
115The new container types must be as correct and safe to use as those in other modern programming languages, to elide the concerns of industry, government, and military~\cite{ONCD}.
116Prior approaches focus on out-of-bound array accesses using a model-based approach (ASCET) in embedded systems (\eg cars)~\cite{Blache19}, and general and null-terminated string arrays using a \CC template syntax (Checked C)~\cite{Elliott18,Ruef19}.
117Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to \emph{move away} from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these languages.
118Fixing these vulnerabilities negates this need, allowing C and its ecosystem to continue into the future.
119\end{enumerate}
120
121While new languages, \eg Go, Rust, Swift, purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the runtime uses garbage collection~\cite{Brinker26}.
122Even assuming automated conversion of C programs to other safe languages, who will maintain this massive new code-base?
123Furthermore, new languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication.
124Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, overloading, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning new \CC features and idioms.
125Finally, projects claiming to be modern C replacements (Cyclone~\cite{Grossman06}, Polymorphic C~\cite{Smith98}, GObject~\cite{GObject}) do not retain C backwards compatibility in syntax, programing model, or semantic compatibility;
126these languages are equivalent to switching to a different programming language.
127Hence, rewriting and retraining costs for other languages are prohibitive when companies have a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia).
128
129
130\section{C?}
131
132Like many established programming languages, C has a standards committee and multiple ANSI/\-ISO language manuals~\cite{C99,C11,C18,C23}.
133However, most programming languages are only partially explained by their standard's manual(s).
134When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}, because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features.
135Some key aspects of C need to be explained and understood by quoting from the language reference manual.
136However, to illustrate actual program semantics, this thesis constructs multiple small programs whose behaviour exercises a particular point and then confirms the behaviour by running the program using multiple @gcc@ compilers.
137These example programs show
138\begin{itemize}[itemsep=0pt]
139 \item if the compiler accepts or rejects certain syntax,
140 \item prints output to buttress a behavioural claim,
141 \item or fails to trigger embedded assertions testing pre/post-assertions or invariants.
142\end{itemize}
143These programs are tested across @gcc@ versions 8--14 and clang versions 10--14 running on ARM, AMD, and Intel architectures.
144Any discovered anomalies among compilers, versions, or architectures are discussed.
145In general, it is never clear whether the \emph{truth} lies in the C standard or the compiler(s), which may be true for other programming languages.
146
147
148\section{Contributions}
149
150Overall, this work has produced significant syntactic and semantic improvements to C's arrays, linked-lists and string types.
151% As well, a strong plan for general iteration has been sketched out.
152The following are the detailed contributions, where performance and safety are always the motivating factors.
153
154\subsection{Array}
155
156The improvements to C arrays are:
157\begin{enumerate}[leftmargin=*]
158\item
159Introduce a small number of subtle changes to the typing rules for the C array, while still achieving significant backwards compatibility.
160\item
161Create a new polymorphic mechanism in the \CFA @forall@ clause to specify array dimension values, similar to a fixed-typed parameter in a \CC \lstinline[language=C++]{template}.
162\item
163Construct a new standard-library array-type, available through @#include <array.hfa>@.
164\end{enumerate}
165The new array type, enabled by prior features, defines an array with guaranteed runtime bound checks (often optimizer-removable) and implicit (guaranteed accurate) inter-function length communication.
166Leveraging the preexisting \CFA type-system to achieve this length-related type checking is an original contribution.
167Furthermore, the new array incorporates multidimensional capabilities typical of scientific/machine-learning languages, made to coexist with the C raw-memory-aware tradition in a novel way.
168The thesis includes a performance evaluation that shows \CFA arrays perform comparably with C arrays in many programming cases.
169
170
171\subsection{Linked List}
172
173The linked list is a new runtime library, available through @#include <dlist.hfa>@.
174The library offers a novel combination of the preexisting \CFA features: references, inline inheritance, polymorphism, and declaration scoping.
175A programmer's experience using the list data-types within the type system is novel.
176The list's runtime representation follows the established design pattern of intrusive link-fields for performance reasons, especially in concurrent programs.
177The thesis includes a performance evaluation that shows the list performing comparably with other C-family intrusive lists, and notably better than the \CC standard list.
178
179
180\subsection{String}
181
182A new string type and its runtime library are available through @#include <string.hfa>@.
183The library offers a type where basic usage is comparable to the \CC @string@ type, including analogous coexistence with raw-character pointers.
184Its implementation, however, follows different principles, enabling programs to work with strings by value, without incurring excessive copying.
185Mutability is retained.
186Substrings are supported, including the ability for overlapping ranges to share edits transparently.
187The implementation allows a set of strings to use an optimized, shared, custom heap or to use the regular heap for thread safety.
188
189The string library includes reading and writing strings via the preexisting \CFA I/O stream library.
190Enabling transparent reading of unknown length strings using managed allocation required revision of the stream library's existing handling of character arrays.
191All reasonable stream manipulators are supported, \eg width, justification (left/right), printing characters as their numeric values, and complex input scanning to isolate strings in the input stream.
192
193
194\begin{comment}
195\subsection{Iterator}
196
197Some advanced iterator prototyping now exists, available in the \CFA standard library.
198Specifically, a family of iterator styles is provided, with cheap iterators that are robust to being misused, plus full-featured iterators that observe structural modifications without becoming invalid.
199Hence, the infamous \CC bug of iterator invalidation, where structural changes cause legitimate operations to fail, \eg iterator over a list and deleting each element.
200
201Further design provides a home for Liskov-style iterators (stackless-coroutine) in \CFA.
202This design extends a preexisting proposal to adapt the \CFA (fixed) for-each loop to be more user-pluggable, and builds upon preexisting \CFA coroutines.
203Overall, it simplifies the work a programmer must do to leverage the suspended-state abstraction during iteration.
204\end{comment}
Note: See TracBrowser for help on using the repository browser.