source: doc/theses/mike_brooks_MMath/intro.tex@ 037dc57

Last change on this file since 037dc57 was 74accc6, checked in by Michael Brooks <mlbrooks@…>, 6 days ago

proof-read through background chapter

  • Property mode set to 100644
File size: 17.3 KB
Line 
1\chapter{Introduction}
2
3All modern programming languages provide at least these three high-level containers (collections): array, linked-list, and string.
4Often, the array is part of the programming language, while linked lists are built from (recursive) pointer types, and strings from arrays and/or linked lists.
5For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component levels, such as copying, slicing, extracting, iterating among, adding and removing elements.
6
7Unfortunately, memory misuse under C-idiomatic programming causes a significant number of errors~\cite{Oorschot23}.
8Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occurring in C result from errors using these facilities (memory errors).
9For operating system and browser vendors, who heavily use systems languages, 60\%--70\% of reported software vulnerabilities involved memory errors~\cite{Kehrer23}.
10For Microsoft, 70\% of vulnerabilities addressed via security updates between 2006--2018 are memory safety issues~\cite[slide 10]{Miller19}.
11In a study of software exploits in the U.S. National Vulnerability Database over 2013--2017, the top reported vulnerability is buffer errors (in the sense of misused array bounds), among 19 vulnerability categories~\cite{Cifuentes19}.
12
13While these figures address memory errors, not limited to the three primary collections, the authors' explanations of memory errors suggest improving on C's language/library options for working with the three collections would be beneficial.
14Every one of the safety-statistic sources above includes array bounds in its first characterization of unsafe programming-language features.
15Of the four studies that deal exclusively with memory safety (excluding~\cite{Cifuentes19}), all put @malloc@--@free@ upfront as well.
16The specific survey of C features, \cite{Oorschot23}, also places blame on:
17\begin{itemize}
18\item
19the null-terminated text idiom, with its support/mandate in the C standard library and
20\item
21the unprotected nature of C pointers.
22\end{itemize}
23
24An effort at translating C to safe Rust~\cite{Emre2022} has pointers as the stand-out difficult aspect to analyze.
25It demonstrates the difficulty of analyzing C pointers for safety.
26It first puts a corpus of C programs through a naive Rust translation, which ``succeeds'' by leaving many uses of Rust's \lstinline{unsafe} marker.
27Then, it analyzes these uses and presents a technique for reducing the uses that hinge exclusively on referenced object lifetime.
28Pointer dereference accounts for 80\% of the cases that the naive translation did not make safe (Table 3.2, p.~33).
29Of these, 10\% are eligible for the work's own safety improvement technique, while this filtering leaves 75\% with no single cause of unsafety determined (Table 3.5, p.~39).
30The presented technique succeeds at making 75\% of the eligible initially unsafe dereferences safe, by inferring the necessary lifetime annotations (Table 4.2, p.~83).
31This result leaves, per 1000 LOC, 3.3 unremovable unsafe dereferences that are understood to evade lifetime analysis, among 220 gross unsafe dereferences (Tables 3.5, 4.2 and 3.1, p.~27).
32% 3.3 = 1398/413,428 * 1000
33% 220 ~= (99,101 [@tab3.5] - 9631 [@tab4.2] ) / 413,428 [@tab 3.1] * 1000
34
35
36A tool for analyzing C code's use of linked data structures (including, \eg skip list and binary tree) found variations of the linked-list shape to be present in most of the programs in its corpus~\cite{White2016}.
37
38So, C's array unsafety is infamous, its string pattern necessitates explicit storage management (also infamous), and user-written linked lists are a recognizable player in the arena of (infamously) unsafe raw pointer uses.
39Therefore, hardening the three primary collections goes a long way to making the majority of C programs safer, eliminating major hacker attack vectors.
40
41This work looks at extending these foundational container types in the programming language \CFA, which is a new dialect of the C programming language.
42A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition.
43This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base.
44An equally important goal is balancing good performance with safety.
45
46The thesis describes improvements made to the \CFA language design, both syntax and semantics, to support the container features, and the source code created within the \CFA compiler, libraries, and runtime system to implement these features.
47This work leverages preexisting work within the compiler's type and runtime systems generated by prior graduate students working on the \CFA project.
48
49
50\section{Array}
51
52An array provides a homogeneous container with $O(1)$ access to elements using subscripting.
53Higher-level operations like array slicing (single or multidimensional) may have significantly higher cost, but provide a better programmer experience.
54The array size can be static, dynamic but fixed after creation, or dynamic and variable after creation.
55For static and dynamic-fixed, an array can be stack allocated, while dynamic-variable requires the heap.
56Because array layout has contiguous components, subscripting is a computation, \ie some form of pointer arithmetic.
57Not included in this requirement are pseudo-arrays with arbitrary keys, \eg hash table subscripted by a string key (see below).
58
59C provides a simple array type as a language feature.
60However, it adopts the controversial position of treating pointer and array as twins, leading to multiple problems.
61Hence, the way that arrays are typically passed around in a program removes the information necessary to do bound checks.
62
63
64\section{Linked List}
65
66A linked-list provides a homogeneous container often with $O(n)$ access to elements using successor and predecessor operations that involve pointer chasing.
67It is the most foundational case of the linked data structures, which include trees and hash tables.
68Among the linked structures, the list's defining characteristic is maintaining a user-explicit total order (chain shape).
69Element search can be by position (as for array) or by value; it is sometimes presented as a subscript.
70In a more complex structure, the linked shape may be system managed (user impicit) by being a function of a designated ``key'' datum, \eg often for a hash table.
71Though such schemes often give better-than-$O(n)$ lookup, the linked list's user-explicit shape limits what the system can provide, to fast step/iteration for user-``nearby'' data, and slow exhaustive search/seek otherwise.
72Linked types are dynamically sized by adding and removing nodes using link fields internal or external to the list elements (nodes).
73If a programming language allows pointers to stack storage, linked nodes can be allocated on the stack and connected with stack addresses;
74otherwise, elements are heap allocated with internal or external link fields.
75
76
77
78C provides a few simple, library data-structures (@glibc@):
79\begin{itemize}
80\item
81singly-linked lists, singly-linked tail queues, lists, and tail queues (@<sys/queue.h>@) \see{\VRef{s:PreexistingLinked-ListLibraries}}
82\item
83hash search table consisting of a key (string) with associated data (@<search.h>@)
84\end{itemize}
85Because these container libraries can be restrictive or awkward to use, C programmers often build bespoke linked data-structures, which further increases the potential for memory errors.
86
87
88\section{String}
89
90A string provides a dynamic array of homogeneous elements, where the elements are (often) some form of human-readable characters.
91What differentiates a string from other types is that many of its operations work on groups of elements for scanning and changing, \eg @index@ and @substr@.
92While subscripting individual elements is usually available, working at the character level is considered poor practice, \ie underutilizing the powerful string operations.
93Therefore, the cost of a string operation is usually less important than the power of the operation to accomplish complex text manipulation, \eg search, analyzing, composing, and decomposing string text.
94The dynamic nature of a string means storage is normally heap allocated but often implicitly managed, even in unmanaged languages.
95In many cases, string management is separate from heap management, \ie string libraries roll their own heaps.
96
97The C string type is just a character array and requires user storage management to handle varying size.
98C adopts a terminating sentinel character, @'\0'@, to mark the end of a variable-length character array, as opposed to keeping an independent length field.
99Hence, the sentinel character is excluded as an element within a string and determining the string length is an $O(N)$ operation.
100The C standard library includes a number of high-level operations for working with this representation.
101Most of these operations are awkward to use and highly error prone.
102
103
104\begin{comment}
105\section{Iterator}
106
107As a side issue to complex structured-types is iterating through them.
108Some thought has been given to \emph{general} versus specific iteration capabilities as part of of this work.
109However, the general iteration work is only a sketch for others as future work.
110Nevertheless, sufficed work was done to write out the ideas that developed and how they should apply in the main context of this work.
111\end{comment}
112
113
114\section{Motivation}
115
116The primary motivation for this work is twofold:
117\begin{enumerate}[leftmargin=*]
118\item
119The three primary container types in C are difficult to understand, teach, and get right because they are too low level.
120Providing higher-level, feature-rich versions of these containers in \CFA is a major component of the primary goal.
121The result is a simplified programming experience, which increases productivity and maintainability.
122\item
123The new container types must be as correct and safe to use as those in other modern programming languages, to elide the concerns of industry, government, and military~\cite{ONCD}.
124Prior approaches focus on out-of-bound array accesses using a model-based approach (ASCET) in embedded systems (\eg cars)~\cite{Blache19}, and general and null-terminated string arrays using a \CC template syntax (Checked C)~\cite{Elliott18,Ruef19}.
125Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to \emph{move away} from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these languages.
126Fixing these vulnerabilities negates this need, allowing C and its ecosystem to continue into the future.
127\end{enumerate}
128
129While new languages, \eg Go, Rust, Swift, purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the runtime uses garbage collection~\cite{Brinker26}.
130Even assuming automated conversion of C programs to other safe languages, who will maintain this massive new code-base?
131Furthermore, new languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication.
132Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, overloading, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning new \CC features and idioms.
133Finally, projects claiming to be modern C replacements (Cyclone~\cite{Grossman06}, Polymorphic C~\cite{Smith98}, GObject~\cite{GObject}) do not retain C backwards compatibility in syntax, programming model, or semantic compatibility;
134these languages are equivalent to switching to a different programming language.
135Hence, rewriting and retraining costs for other languages are prohibitive when companies have a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia).
136
137
138\section{Which C?}
139
140Like many established programming languages, C has a standards committee and multiple ANSI/\-ISO language manuals~\cite{C99,C11,C18,C23}.
141However, most programming languages are only partially explained by their standard's manual(s).
142When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}, because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features.
143Some key aspects of C need to be explained and understood by quoting from the language reference manual.
144However, to illustrate actual program semantics, this thesis constructs multiple small programs whose behaviour exercises a particular point and then confirms the behaviour by running the program using multiple @gcc@ compilers.
145These example programs show
146\begin{itemize}[itemsep=0pt]
147 \item whether the compiler accepts or rejects certain syntax,
148 \item printed output to buttress a behavioural claim, or
149 \item passed/failed assertions testing pre/post-conditions or invariants.
150\end{itemize}
151These programs are tested across @gcc@ versions 8--14 and clang versions 10--14 running on ARM, AMD, and Intel architectures.
152Any discovered anomalies among compilers, versions, or architectures are discussed.
153In general, it is never clear whether the \emph{truth} lies in the C standard or the compiler(s), which may be true for other programming languages.
154
155
156\section{Contributions}
157
158Overall, this work has produced significant syntactic and semantic improvements to C's arrays, linked-lists and string types.
159% As well, a strong plan for general iteration has been sketched out.
160The following are the detailed contributions, where performance and safety are always the motivating factors.
161
162\subsection{Array}
163
164The improvements to C arrays are:
165\begin{enumerate}[leftmargin=*]
166\item
167Introduce a small number of subtle changes to the typing rules for the C array, while still achieving significant backwards compatibility.
168\item
169Create a new polymorphic mechanism in the \CFA @forall@ clause to specify array dimension values, similar to a \CC non-typed template parameter.
170\item
171Construct a new standard-library array type, available through @#include <array.hfa>@.
172\end{enumerate}
173The new array type accesses prior \CFA type-system features, delivering guaranteed runtime bound checks (often optimizer-removable) and implicit (guaranteed accurate) inter-function length communication.
174Extending the preexisting \CFA type system to achieve this length-related type checking is an original contribution.
175Furthermore, the new array incorporates multidimensional capabilities typical of scientific/machine-learning languages, made to coexist with the C raw-memory-aware tradition, in a novel way.
176The thesis includes a performance-relevant assessment of generated code that shows the new \CFA array simultaneously delivering the performance of a C array and the safety of a \CC vector.
177
178
179\subsection{Linked List}
180
181The linked list is a new runtime library, available through @#include <list.hfa>@.
182The library offers a novel combination of the preexisting \CFA features: references, inline inheritance, polymorphism, and declaration scoping.
183A programmer's experience using the list data types within the type system is novel.
184The list's runtime representation follows the established design pattern of intrusive link fields for performance reasons, especially in concurrent programs.
185The thesis includes a performance evaluation that shows the list performing comparably with other C-family intrusive lists, and notably better than the \CC standard list.
186
187
188\subsection{String}
189
190A new string type and its runtime library are available through @#include <string.hfa>@.
191The library offers a type where basic usage is comparable to the \CC @string@ type, including analogous coexistence with raw-character pointers.
192Its implementation, however, follows different principles, enabling programs to work with strings as simple values, without incurring excessive copying.
193Mutability is retained.
194Substrings are supported, including the option for overlapping ranges to share edits transparently, or act as diverging snapshots.
195The implementation allows a set of strings to use an optimized, shared, custom heap, or to use the regular heap, on strings that are passed among threads.
196
197The string library includes reading and writing strings via the preexisting \CFA I/O stream library.
198Enabling transparent reading of unknown length strings using managed allocation required revision of the stream library's existing handling of character arrays.
199All reasonable stream manipulators are supported, \eg width, justification (left/right), printing characters as their numeric values, and complex input scanning to isolate strings in the input stream.
200
201
202\begin{comment}
203\subsection{Iterator}
204
205Some advanced iterator prototyping now exists, available in the \CFA standard library.
206Specifically, a family of iterator styles is provided, with cheap iterators that are robust to being misused, plus full-featured iterators that observe structural modifications without becoming invalid.
207Hence, the infamous \CC bug of iterator invalidation, where structural changes cause legitimate operations to fail, \eg iterator over a list and deleting each element.
208
209Further design provides a home for Liskov-style iterators (stackless-coroutine) in \CFA.
210This design extends a preexisting proposal to adapt the \CFA (fixed) for-each loop to be more user-pluggable, and builds upon preexisting \CFA coroutines.
211Overall, it simplifies the work a programmer must do to leverage the suspended-state abstraction during iteration.
212\end{comment}
Note: See TracBrowser for help on using the repository browser.