source: doc/theses/mike_brooks_MMath/intro.tex@ c2f6b79

Last change on this file since c2f6b79 was 331f59a, checked in by Peter A. Buhr <pabuhr@…>, 4 weeks ago

comment out discussion of advanced iterators, fix citation name

  • Property mode set to 100644
File size: 13.1 KB
Line 
1\chapter{Introduction}
2
3All modern programming languages provide three high-level containers (collections): array, linked-list, and string.
4Often array is part of the programming language, while linked-list is built from (recursive) pointer types, and string from a combination of array and linked-list.
5For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component level, such as copying, slicing, extracting, and iterating among elements.
6
7Unfortunately, these three aspects of C cause a significant number of memory errors~\cite{Oorschot23}.
8For operating system and browser vendors, who heavily use systems languages, 60\%--70\% of reported software vulnerabilities involved memory errors~\cite{Kehrer23}.
9For Microsoft, 70\% of vulnerabilities addressed via security updates between 2006--2018 are memory safety issues~\cite[slide 10]{Miller19}.
10In a study of software exploits in the U.S. National Vulnerability Database over 2013--2017, the top reported vulnerability is (memory) buffer errors, among 19 vulnerability categories~\cite{Cifuentes19}.
11Therefore, hardening these three C types goes a long way to make the majority of C programs safer.
12
13This work looks at extending these three foundational container types in the programming language \CFA, which is a new dialect of the C programming language.
14A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition.
15This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base.
16An equally important goal is balancing good performance with safety.
17
18The thesis describes improvements made to the \CFA language design, both syntax and semantics, to support the container features, and the source code created within the \CFA compiler, libraries, and runtime system to implement these features.
19This work leverages preexisting work within the compiler's type and runtime systems generated by prior graduate students working on the \CFA project.
20
21
22\section{Array}
23
24An array provides a homogeneous container with $O(1)$ access to elements using subscripting.
25Higher-level operations like array slicing (single or multidimensional) may have significantly higher cost, but provide a better programmer experience.
26The array size can be static, dynamic but fixed after creation, or dynamic and variable after creation.
27For static and dynamic-fixed, an array can be stack allocated, while dynamic-variable requires the heap.
28Because array layout has contiguous components, subscripting is a computation (some form of pointer arithmetic).
29
30C provides a simple array type as a language feature.
31However, it adopts the controversial language position of treating pointer and array as duals, leading to multiple problems.
32
33
34\section{Linked List}
35
36A linked-list provides a homogeneous container often with $O(log N)$ or $O(N)$ access to elements using successor and predecessor operations that normally involve pointer chasing.
37Subscripting by value (rather than position or location as for array) is sometimes available, \eg hash table.
38Linked types are normally dynamically sized by adding and removing nodes using link fields internal or external to the elements (nodes).
39If a programming language allows pointers to stack storage, linked-list nodes can be allocated on the stack and connected with stack addresses (pointers);
40otherwise, elements are heap allocated with internal or external link fields.
41
42C provides a few simple, polymorphic, library data-structures (@glibc@):
43\begin{itemize}
44\item
45singly-linked lists, singly-linked tail queues, lists, and tail queues (@<sys/queue.h>@) \see{\VRef{s:PreexistingLinked-ListLibraries}}
46\item
47hash search table consisting of a key (string) with associated data (@<search.h>@)
48\end{itemize}
49Because these libraries are simple, awkward to use, and unsafe, C programmers commonly build bespoke linked data-structures.
50
51
52\section{String}
53
54A string provides a dynamic array of homogeneous elements, where the elements are (often) some form of human-readable characters.
55What differentiates a string from other types in that many of its operations work on groups of elements for scanning and changing, \eg @index@ and @substr@.
56While subscripting individual elements is usually available, working at the individual character level is considered poor practise, \ie underutilizing the powerful string operations.
57Therefore, the cost of a string operation is usually less important than the power of the operation to accomplish complex text manipulation, \eg search, analysing, composing, and decomposing string text.
58The dynamic nature of a string means storage is normally heap allocated but often implicitly managed, even in unmanaged languages.
59In some cases, string management is separate from heap management, \ie strings roll their own heap.
60
61The C string type is just a character array and requires user storage-management to handle varying size.
62The character array uses the convention of marking its (variable) array length by placing the 0-valued control character at the end (null-terminated).
63The C standard library includes a number of high-level operations for working with this representation.
64Most of these operations are awkward and error prone.
65
66
67\begin{comment}
68\section{Iterator}
69
70As a side issue to complex structured-types is iterating through them.
71Some thought has been given to \emph{general} versus specific iteration capabilities as part of of this work.
72However, the general iteration work is only a sketch for others as future work.
73Nevertheless, sufficed work was done to write out the ideas that developed and how they should apply in the main context of this work.
74\end{comment}
75
76
77\section{Motivation}
78
79The primary motivation for this work is two fold:
80\begin{enumerate}[leftmargin=*]
81\item
82These three aspects of C are difficult to understand, teach, and get right because they are correspondingly low level.
83Providing higher-level, feature-rich versions of these containers in \CFA is a major component of the primary goal.
84\item
85These three aspects of C cause the greatest safety issues because there are few or no safe guards when a programmer misunderstands or misuses these features~\cite{Elliott18, Blache19, Ruef19, Oorschot23}.
86Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occurring in C result from errors using these facilities (memory errors), providing the major hacker attack-vectors.
87\end{enumerate}
88Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to move away from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these older languages.
89Hardening these three types goes a long way to make the majority of C programs safer.
90
91
92While multiple new languages purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the new runtime uses garage collection.
93Furthermore, new languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication.
94Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, overloading, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning new \CC.
95Hence, rewriting and retraining costs for these languages can be prohibitive for companies with a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia).
96
97
98\section{C?}
99
100Like many established programming languages, C has a standards committee and multiple ANSI/\-ISO language manuals~\cite{C99,C11,C18,C23}.
101However, most programming languages are only partially explained by their standard's manual(s).
102When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}.
103Often other C compilers \emph{must} mimic @gcc@ because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features.
104Some key aspects of C need to be explained and understood by quoting from the language reference manual.
105However, to illustrate actual program semantics, this thesis constructs multiple small programs whose behaviour exercises a particular point and then confirms the behaviour by running the program using multiple @gcc@ compilers.
106These example programs show
107\begin{itemize}
108 \item if the compiler accepts or rejects certain syntax,
109 \item prints output to buttress a behavioural claim,
110 \item or fails to trigger embedded assertions testing pre/post-assertions or invariants.
111\end{itemize}
112These programs are tested across @gcc@ versions 8--14 and clang versions 10--14 running on ARM, AMD, and Intel architectures.
113Any discovered anomalies among compilers, versions, or architectures is discussed.
114In general, it is never clear whether the \emph{truth} lies in the C standard or the compiler(s), which may be true for other programming languages.
115
116
117\section{Contributions}
118
119Overall, this work has produced significant syntactic and semantic improvements to C's arrays, linked-lists and string types.
120As well, a strong plan for general iteration has been sketched out.
121The following are the detailed contributions, where performance and safety were always the motivating factors.
122
123\subsection{Array}
124
125This work's array improvements are:
126\begin{enumerate}[leftmargin=*]
127\item
128Introduce a small number of subtle changes to the typing rules for the C array, while still achieving significant backwards compatibility
129\item
130Create a new polymorphic mechanism in the \CFA @forall@ clause to specify array dimension values, similar to a fixed-typed parameter in a \CC \lstinline[language=C++]{template}.
131\item
132Construct a new standard-library array-type, available through @#include <array.hfa>@.
133\end{enumerate}
134The new array type, enabled by prior features, defines an array with guaranteed runtime bound checks (often optimizer-removable) and implicit (guaranteed accurate) inter-function length communication.
135Leveraging the preexisting \CFA type-system to achieve this length-related type checking is an original contribution.
136Furthermore, the new array incorporates multidimensional capabilities typical of scientific/machine-learning languages, made to coexist with the C raw-memory-aware tradition in a novel way.
137The thesis includes a performance evaluation that shows \CFA arrays perform comparably with C arrays in many programming cases.
138
139
140\subsection{Linked List}
141
142The linked list is a new runtime library, available through @#include <dlist.hfa>@.
143The library offers a novel combination of the preexisting \CFA features: references, inline inheritance, polymorphism, and declaration scoping.
144A programmer's experience using the list data-types within the type system is novel.
145The list's runtime representation follows the established design pattern of intrusive link-fields for performance reasons, especially in concurrent programs.
146The thesis includes a performance evaluation that shows the list performing comparably with other C-family intrusive lists, and notably better than the \CC standard list.
147
148
149\subsection{String}
150
151The new string type and its runtime library are available through @#include <string.hfa>@.
152The library offers a type where basic usage is comparable to the \CC @string@ type, including analogous coexistence with raw-character pointers.
153Its implementation, however, follows different principles, enabling programs to work with strings by value, without incurring excessive copying.
154Mutability is retained.
155Substrings are supported, including the ability for overlapping ranges to share edits transparently.
156The implementation allows a set of strings to use an optimized, shared, custom heap or to use the regular heap for thread safety.
157
158The string library includes reading and writing strings via the preexisting \CFA I/O stream library.
159Enabling transparent reading of unknown length strings using managed allocation required revision of the stream library's existing handling of character arrays.
160All reasonable stream manipulators are supported, \eg width, justification (left/right), printing characters as their numeric values, and complex input scanning to isolate strings in the input stream.
161
162
163\begin{comment}
164\subsection{Iterator}
165
166Some advanced iterator prototyping now exists, available in the \CFA standard library.
167Specifically, a family of iterator styles is provided, with cheap iterators that are robust to being misused, plus full-featured iterators that observe structural modifications without becoming invalid.
168Hence, the infamous \CC bug of iterator invalidation, where structural changes cause legitimate operations to fail, \eg iterator over a list and deleting each element.
169
170Further design provides a home for Liskov-style iterators (stackless-coroutine) in \CFA.
171This design extends a preexisting proposal to adapt the \CFA (fixed) for-each loop to be more user-pluggable, and builds upon preexisting \CFA coroutines.
172Overall, it simplifies the work a programmer must do to leverage the suspended-state abstraction during iteration.
173\end{comment}
Note: See TracBrowser for help on using the repository browser.