Context Navigation

intro.tex@ 5faa3a5

Visit:

Last change on this file since 5faa3a5 was 5faa3a5, checked in by Peter A. Buhr <pabuhr@…>, 20 hours ago
spelling corrections, and final proofreading
Property mode set to `100644`
File size: 17.6 KB

Line
1	\chapter{Introduction}
2
3	All modern programming languages provide at least these three high-level collections (containers): array, linked-list, and string.
4	Often, the array is part of the programming language, while linked lists are built from (recursive) pointer types, and strings from arrays and/or linked lists.
5	For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component levels, such as copying, slicing, extracting, iterating among, adding and removing elements.
6
7	Unfortunately, memory misuse under C-idiomatic programming causes a significant number of errors~\cite{Oorschot23}.
8	Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occurring in C result from errors using these facilities (memory errors).
9	For operating system and browser vendors, who heavily use systems languages, 60\%--70\% of reported software vulnerabilities involved memory errors~\cite{Kehrer23}.
10	For Microsoft, 70\% of vulnerabilities addressed via security updates between 2006--2018 are memory safety issues~\cite[slide 10]{Miller19}.
11	In a study of software exploits in the U.S. National Vulnerability Database over 2013--2017, the top reported vulnerability is buffer errors (in the sense of misused array bounds), among 19 vulnerability categories~\cite{Cifuentes19}.
12
13	While these figures address memory errors, not limited to the three primary collections, the authors' explanations of memory errors suggest improving on C's language/library options for working with the three collections would be beneficial.
14	Every one of the safety-statistic sources above includes array bounds in its first characterization of unsafe programming-language features.
15	Of the four studies that deal exclusively with memory safety (excluding~\cite{Cifuentes19}), all put @malloc@--@free@ upfront as well.
16	The specific survey of C features, \cite{Oorschot23}, also places blame on:
17	\begin{itemize}
18	\item
19	the null-terminated text idiom, with its support/mandate in the C standard library and
20	\item
21	the unprotected nature of C pointers.
22	\end{itemize}
23
24	An effort at translating C code to the safe subset of the now-famously safe language Rust~\cite{Emre2022} shows pointers as the stand-out difficult aspect to analyze.
25	It demonstrates the difficulty of automatically determining pointer-referent lifetimes in that its success claim is to reduce the occurrence rate of unsafe dereference operations to about once per 100 lines of code.
26
27	\begin{comment}
28	It demonstrates the difficulty of analyzing C pointers for safety.
29	It first puts a corpus of C programs through a naive Rust translation, which ``succeeds'' by leaving many uses of Rust's \lstinline{unsafe} marker.
30	Then, it analyzes these uses and presents a technique for reducing the uses that hinge exclusively on referenced object lifetime.
31	Pointer dereference accounts for 80\% of the cases that the naive translation did not make safe (Table 3.2, p.~33).
32	Of these, 10\% are eligible for the work's own safety improvement technique, while this filtering leaves 75\% with no single cause of unsafety determined (Table 3.5, p.~39).
33	The presented technique succeeds at making 75\% of the eligible initially unsafe dereferences safe, by inferring the necessary lifetime annotations (Table 4.2, p.~83).
34	This result leaves, per 1000 LOC, 3.3 unremovable unsafe dereferences that are understood to evade lifetime analysis, among 220 gross unsafe dereferences (Tables 3.5, 4.2 and 3.1, p.~27).
35	% 3.3 = 1398/413,428 * 1000
36	% 220 ~= (99,101 [@tab3.5] - 9631 [@tab4.2] ) / 413,428 [@tab 3.1] * 1000
37	\end{comment}
38
39	A tool for analyzing C code's use of linked data structures (including, \eg skip list and binary tree) found variations of the linked-list shape to be present in most of the programs in its corpus~\cite{White2016}.
40
41	So, C's array unsafety is infamous, its string pattern necessitates explicit storage management (also infamous), and user-written linked lists are a recognizable player in the arena of (infamously) unsafe raw pointer uses.
42	Therefore, hardening the three primary collections goes a long way to making the majority of C programs safer, eliminating major hacker attack vectors.
43
44	My work looks at extending these foundational collection types in the programming language \CFA, which is a new dialect of the C programming language.
45	A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition.
46	This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base.
47	An equally important goal is balancing good performance with safety.
48
49	The thesis describes improvements I made to the \CFA language design, both syntax and semantics, to support the collection features, and the source code created within the \CFA compiler, libraries, and runtime system to implement these features.
50	This work leverages preexisting work within the compiler's type and runtime systems generated by prior graduate students working on the \CFA project.
51
52
53	\section{Array}
54
55	An array provides a homogeneous collection with $O(1)$ access to elements using subscripting.
56	Higher-level operations like array slicing (single or multidimensional) may have significantly higher cost, but provide a better programmer experience.
57	The array size can be static, dynamic but fixed after creation, or dynamic and variable after creation.
58	For static and dynamic-fixed, an array can be stack allocated, while dynamic-variable requires the heap.
59	Because array layout has contiguous components, subscripting is a computation, \ie some form of pointer arithmetic.
60	Not included in this requirement are pseudo-arrays with arbitrary keys, \eg hash table subscripted by a string key (see below).
61
62	C provides a simple array type as a language feature.
63	However, it adopts the controversial position of treating pointer and array as twins, leading to multiple problems.
64	Hence, the way that arrays are typically passed around in a program removes the information necessary to do bound checks.
65
66
67	\section{Linked List}
68
69	A linked-list provides a homogeneous collection often with $O(n)$ access to elements using successor and predecessor operations that involve pointer chasing.
70	It is the most foundational case of the linked data structures, which include trees and hash tables.
71	Among the linked structures, the list's defining characteristic is maintaining a user-explicit total order (chain shape).
72	Element search can be by position (as for array) or by value; it is sometimes presented as a subscript.
73	In a more complex structure, the linked shape may be system managed (user implicit) by being a function of a designated ``key'' datum, \eg often for a hash table.
74	Though such schemes often give better-than-$O(n)$ lookup, the linked list's user-explicit shape limits what the system can provide, to fast step/iteration for user-``nearby'' data, and slow exhaustive search/seek otherwise.
75	Linked types are dynamically sized by adding and removing nodes using link fields internal or external to the list elements (nodes).
76	If a programming language allows pointers to stack storage, linked nodes can be allocated on the stack and connected with stack addresses;
77	otherwise, elements are heap allocated with internal or external link fields.
78
79
80
81	C provides a few simple, library data-structures (@glibc@):
82	\begin{itemize}
83	\item
84	singly-linked lists, singly-linked tail queues, lists, and tail queues (@<sys/queue.h>@) \see{\VRef{s:PreexistingLinked-ListLibraries}}
85	\item
86	hash search table consisting of a key (string) with associated data (@<search.h>@)
87	\end{itemize}
88	Because these collection libraries can be restrictive or awkward to use, C programmers often build bespoke linked data-structures, which further increases the potential for memory errors.
89
90
91	\section{String}
92
93	A string provides a dynamic array of homogeneous elements, where the elements are (often) some form of human-readable characters.
94	What differentiates a string from other types is that many of its operations work on groups of elements for scanning and changing, \eg @index@ and @substr@.
95	While subscripting individual elements is usually available, working at the character level is considered poor practice, \ie underutilizing the powerful string operations.
96	Therefore, the cost of a string operation is usually less important than the power of the operation to accomplish complex text manipulation, \eg search, analyzing, composing, and decomposing string text.
97	The dynamic nature of a string means storage is normally heap allocated but often implicitly managed, even in unmanaged languages.
98	In many cases, string management is separate from heap management, \ie string libraries roll their own heaps.
99
100	The C string type is just a character array and requires user storage management to handle varying size.
101	C adopts a terminating sentinel character, @'\0'@, to mark the end of a variable-length character array, as opposed to keeping an independent length field.
102	Hence, the sentinel character is excluded as an element within a string and determining the string length is an $O(N)$ operation.
103	The C standard library includes a number of high-level operations for working with this representation.
104	Most of these operations are awkward to use and highly error prone.
105
106
107	\begin{comment}
108	\section{Iterator}
109
110	As a side issue to complex structured-types is iterating through them.
111	Some thought has been given to \emph{general} versus specific iteration capabilities as part of of this work.
112	However, the general iteration work is only a sketch for others as future work.
113	Nevertheless, sufficed work was done to write out the ideas that developed and how they should apply in the main context of this work.
114	\end{comment}
115
116
117	\section{Motivation}
118
119	The primary motivation for this work is twofold:
120	\begin{enumerate}[leftmargin=*]
121	\item
122	The three primary collection types in C are difficult to understand, teach, and get right because they are too low level.
123	Providing higher-level, feature-rich versions of these collections in \CFA is a major component of the primary goal.
124	The result is a simplified programming experience, which increases productivity and maintainability.
125	\item
126	The new collection types must be as correct and safe to use as those in other modern programming languages, to elide the concerns of industry, government, and military~\cite{ONCD}.
127	Prior approaches focus on out-of-bound array accesses using a model-based approach (ASCET) in embedded systems (\eg cars)~\cite{Blache19}, and general and null-terminated string arrays using a \CC template syntax (Checked C)~\cite{Elliott18,Ruef19}.
128	Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to \emph{move away} from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these languages.
129	Fixing these vulnerabilities negates this need, allowing C and its ecosystem to continue into the future.
130	\end{enumerate}
131
132	While new languages, \eg Go, Rust, Swift, purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the runtime uses garbage collection~\cite{Brinker26}.
133	Even assuming automated conversion of C programs to other safe languages, who will maintain this massive new code-base?
134	Furthermore, new languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication.
135	Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, overloading, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning new \CC features and idioms.
136	Finally, projects claiming to be modern C replacements (Cyclone~\cite{Grossman06}, Polymorphic C~\cite{Smith98}, GObject~\cite{GObject}) do not retain C backwards compatibility in syntax, programming model, or semantic compatibility;
137	these languages are equivalent to switching to a different programming language.
138	Hence, rewriting and retraining costs for other languages are prohibitive when companies have a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia).
139
140
141	\section{Which C?}
142
143	Like many established programming languages, C has a standards committee and multiple ANSI/\-ISO language manuals~\cite{C99,C11,C18,C23}.
144	However, most programming languages are only partially explained by their standard's manual(s).
145	When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}, because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features.
146	Some key aspects of C need to be explained and understood by quoting from the language reference manual.
147	However, to illustrate actual program semantics, this thesis constructs multiple small programs whose behaviour exercises a particular point and then confirms the behaviour by running the program using multiple @gcc@ compilers.
148	These example programs show
149	\begin{itemize}[itemsep=0pt]
150	\item whether the compiler accepts or rejects certain syntax,
151	\item printed output to buttress a behavioural claim, or
152	\item passed/failed assertions testing pre/post-conditions or invariants.
153	\end{itemize}
154	These programs are tested across @gcc@ versions 8--14 and clang versions 10--14 running on ARM, AMD, and Intel architectures.
155	Any discovered anomalies among compilers, versions, or architectures are discussed.
156	In general, it is never clear whether the \emph{truth} lies in the C standard or the compiler(s), which may be true for other programming languages.
157
158
159	\section{Contributions}
160
161	Overall, this work has produced significant syntactic and semantic improvements to C's arrays, linked-lists and string types.
162	% As well, a strong plan for general iteration has been sketched out.
163	The following are my detailed contributions, where performance and safety are always the motivating factors.
164
165	\subsection{Array}
166
167	The improvements to C arrays are:
168	\begin{enumerate}[leftmargin=*]
169	\item
170	Introduce a small number of subtle changes to the typing rules for the C array, while still achieving significant backwards compatibility.
171	\item
172	Create a new polymorphic mechanism in the \CFA @forall@ clause to specify array dimension values, similar to a \CC non-typed template parameter.
173	\item
174	Construct a new standard-library array type, available through @#include <array.hfa>@.
175	\end{enumerate}
176	The new array type accesses prior \CFA type-system features, delivering guaranteed runtime bound checks (often optimizer-removable) and implicit (guaranteed accurate) inter-function length communication.
177	Extending the preexisting \CFA type system to achieve this length-related type checking is an original contribution.
178	Furthermore, the new array incorporates multidimensional capabilities typical of scientific/machine-learning languages, made to coexist with the C raw-memory-aware tradition, in a novel way.
179	The thesis includes a performance-relevant assessment of generated code that shows the new \CFA array simultaneously delivering the performance of a C array and the safety of a \CC vector.
180
181
182	\subsection{Linked List}
183
184	The linked list is a new runtime library, available through @#include <list.hfa>@.
185	The library offers a novel combination of the preexisting \CFA features: references, inline inheritance, polymorphism, and declaration scoping.
186	A programmer's experience using the list data types within the type system is novel.
187	The list's runtime representation follows the established design pattern of intrusive link fields for performance reasons, especially in concurrent programs.
188	The thesis includes a performance evaluation that shows the list performing comparably with other C-family intrusive lists, and notably better than the \CC standard list.
189
190
191	\subsection{String}
192
193	A new string type and its runtime library are available through @#include <string.hfa>@.
194	The library offers a type where basic usage is comparable to the \CC @string@ type, including analogous coexistence with raw-character pointers.
195	Its implementation, however, follows different principles, enabling programs to work with strings as simple values, without incurring excessive copying.
196	Mutability is retained.
197	Substrings are supported, including the option for overlapping ranges to share edits transparently, or act as diverging snapshots.
198	The implementation allows a set of strings to use an optimized, shared, custom heap, or to use the regular heap, on strings that are passed among threads.
199
200	The string library includes reading and writing strings via the preexisting \CFA I/O stream library.
201	Enabling transparent reading of unknown length strings using managed allocation required revision of the stream library's existing handling of character arrays.
202	All reasonable stream manipulators are supported, \eg width, justification (left/right), printing characters as their numeric values, and complex input scanning to isolate strings in the input stream.
203
204
205	\begin{comment}
206	\subsection{Iterator}
207
208	Some advanced iterator prototyping now exists, available in the \CFA standard library.
209	Specifically, a family of iterator styles is provided, with cheap iterators that are robust to being misused, plus full-featured iterators that observe structural modifications without becoming invalid.
210	Hence, the infamous \CC bug of iterator invalidation, where structural changes cause legitimate operations to fail, \eg iterator over a list and deleting each element.
211
212	Further design provides a home for Liskov-style iterators (stackless-coroutine) in \CFA.
213	This design extends a preexisting proposal to adapt the \CFA (fixed) for-each loop to be more user-pluggable, and builds upon preexisting \CFA coroutines.
214	Overall, it simplifies the work a programmer must do to leverage the suspended-state abstraction during iteration.
215	\end{comment}

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: doc/theses/mike_brooks_MMath/intro.tex@ 5faa3a5

Download in other formats: