source: doc/theses/mike_brooks_MMath/intro.tex@ d3942b9

Last change on this file since d3942b9 was 02ea981, checked in by Michael Brooks <mlbrooks@…>, 11 months ago

Thesis introduction contributions, first try.

  • Property mode set to 100644
File size: 10.4 KB
RevLine 
[27f1055]1\chapter{Introduction}
2
[16915b1]3All modern programming languages provide three high-level containers (collections): array, linked-list, and string.
4Often array is part of the programming language, while linked-list is built from (recursive) pointer types, and string from a combination of array and linked-list.
[7d415b4]5For all three types, languages supply varying degrees of high-level mechanism for manipulating these objects at the bulk level and at the component level, such as array copy, slicing and iterating using subscripting.
[16915b1]6
7This work looks at extending these three foundational container types in the programming language \CFA, which is a new dialect of the C programming language.
8A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition.
9This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base.
[7d415b4]10An equally important goal is balancing good performance with safety.
[bdc8591]11
[19a2890]12
[bdc8591]13\section{Array}
14
[16915b1]15An array provides a homogeneous container with $O(1)$ access to elements using subscripting.
[7d415b4]16Higher-level operations like array slicing (single or multidimensional) may have significantly higher cost, but provides a better programmer experience.
[bdc8591]17The array size can be static, dynamic but fixed after creation, or dynamic and variable after creation.
18For static and dynamic-fixed, an array can be stack allocated, while dynamic-variable requires the heap.
[16915b1]19Because array layout has contiguous components, subscripting is a computation (some form of pointer arithmetic).
[bdc8591]20
[02ea981]21C provides an array as a language feature.
[bdc8591]22
[1e110bf]23\section{Linked list}
[bdc8591]24
[02ea981]25A linked-list provides a homogeneous container often with $O(log N)$ or $O(N)$ access to elements using successor and predecessor operations that normally involve pointer chasing.
[051aec4]26Subscripting by value is sometimes available, \eg hash table.
27Linked types are normally dynamically sized by adding/removing nodes using link fields internal or external to the elements (nodes).
28If a programming language allows pointer to stack storage, linked-list types can be allocated on the stack;
[16915b1]29otherwise, elements are heap allocated with explicitly/implicitly managed.
[bdc8591]30
[02ea981]31C does not provide a linked-list, either as an obvious commonly-accepted library, nor as an outright language feature.
32C programmers commonly build linked-list behaviours into their bespoke data structures, directly using its language feature of pointers.
[bdc8591]33
34\section{String}
35
[7d415b4]36A string provides a dynamic array of homogeneous elements, where the elements are (often) some form of human-readable characters.
37What differentiates a string from other types in that many of its operations work on groups of elements for scanning and changing, \eg @index@ and @substr@;
38subscripting individual elements is usually available, too.
39Therefore, the cost of a string operation is usually less important than the power of the operation to accomplish complex text manipulation, \eg search, analysing, composing, and decomposing.
[c721105]40The dynamic nature of a string means storage is normally heap allocated but often implicitly managed, even in unmanaged languages.
[7d415b4]41In some cases, string management is separate from heap management, \ie strings roll their own heap.
42
[02ea981]43The C string type is a user-managed allocation of the language-provided character array,
44plus the convention of marking its (variable) length by placing the 0-valued control character at the end (``null-terminated'').
45The C standard library includes many operations for working on this representation.
[7d415b4]46
47\section{Iterator}
48
49As a side issue to working with complex structured types is iterating through them.
50Some thought has been given to \emph{general} versus specific iteration capabilities as part of of this work, but the general iteration work is only a sketch for others as future work.
51Nevertheless, sufficed work was done to write out the ideas that developed and how they should apply in the main context of this work.
[bdc8591]52
53
[6a8c773]54\section{Motivation}
[bdc8591]55
[16915b1]56The primary motivation for this work is two fold:
57\begin{enumerate}[leftmargin=*]
58\item
[7d415b4]59These three aspects of C are difficult to understand, teach, and get right because they are correspondingly low level.
60Providing higher-level, feature-rich versions of these containers in \CFA is a major component of the primary goal.
[16915b1]61\item
62These three aspects of C cause the greatest safety issues because there are few or no safe guards when a programmer misunderstands or misuses these features~\cite{Elliott18, Blache19, Ruef19, Oorschot23}.
63Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occur in C resulting from errors using these facilities (memory errors), providing the major hacker attack-vectors.
64\end{enumerate}
65Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to move away from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these older languages.
66Hardening these three types goes a long way to make the majority of C programs safer.
67
68
69While multiple new languages purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the new runtime uses garage collection.
70Furthermore, these languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication.
71Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning \CC.
72Hence, rewriting and retraining costs for these languages can be prohibitive for companies with a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia).
[bdc8591]73
74
[6a8c773]75\subsection{C?}
[bdc8591]76
[051aec4]77Like many established programming languages, C has a standards committee and multiple ANSI/\-ISO language manuals~\cite{C99,C11,C18,C23}.
[7d415b4]78However, most programming languages are only partially explained by their standard's manual(s).
[6a8c773]79When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}.
[16915b1]80Often other C compilers must mimic @gcc@ because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features.
[7d415b4]81While some key aspects of C need to be explained and understood by quoting from the language reference manual, to illustrate actual program semantics, this thesis uses constructs a program whose behaviour exercises a particular point and then confirms the behaviour by running the program.
[6a8c773]82These example programs show
[16915b1]83\begin{itemize}[leftmargin=*]
84 \item if the compiler accepts or rejects certain syntax,
[7d415b4]85 \item prints output to buttress a behavioural claim,
[16915b1]86 \item or executes without triggering any embedded assertions testing pre/post-assertions or invariants.
[bdc8591]87\end{itemize}
[7d415b4]88This work has been tested across @gcc@ versions 8--14 and clang versions 10--14 running on ARM, AMD, and Intel architectures.
[6a8c773]89Any discovered anomalies among compilers or versions is discussed.
[7d415b4]90In all case, it is never clear whether the \emph{truth} lies in the compiler(s) or the C standard.
[27f1055]91
92
93\section{Contributions}
[bdc8591]94
[7d415b4]95This work has produced significant syntactic and semantic improvements to C's arrays, linked-lists and string types.
96As well, a strong plan for general iteration has been sketched out.
97
[02ea981]98This thesis mainly describes improvements made to the source code of the \CFA compiler and runtime system,
99available at TODO.
100
101This work often leverages preexisting xxxxxxxxxx
102
[1e110bf]103\subsection{Linked list}
[bdc8591]104
[02ea981]105The linked list is a new runtime library, available as @#include <dlist.hfa>@.
106The library offers a novel combination of the preexisting \CFA features of:
107references, inline inheritance, polymorphism and declaration scoping.
108A programmer's experience of the list datatype within the type system is novel.
109The list's runtime representation follows the established design pattern of intrusion.
110This thesis includes a performance evaluation that shows
111the list performing comparably with other C-family intrusive lists,
112and notably better than the \CC standard list.
113
[bdc8591]114\subsection{Array}
115
[02ea981]116This work's array improvements are
117\begin{enumerate}[leftmargin=*]
118\item
119subtle changes to the typing rules for the C array,
120\item
121a new \CFA language construct for a type-system managed array length, and
122\item
123a new standard-library array type, available as @#include <array.hfa>@.
124\end{enumerate}
125The new array type, enabled by the earlier features, defines an array with
126guaranteed runtime bound checks (often optimizer-removable) and
127implicit (guaranteed accurate) inter-function length communication.
128Leveraging the preexisting \CFA type system to achieve
129this length-related type checking is an original contribution.
130
131Furthermore, the new array incorporates multidimensional capabilities
132typical of scientific/machine-learning languages,
133made to coexist with the C ``raw-memory-aware'' tradition in a novel way.
134
[bdc8591]135\subsection{String}
[c721105]136
[02ea981]137The string is a new runtime library, available as @#include <string.hfa>@.
138The library offers a type whose basic usage is comparable to the \CC @string@ type,
139including analogous coexistence with raw character pointers.
140Its implementation, however, follows different princliples,
141enabling programs to work with strings by value, without incurring excessive copying.
142Mutability is retained.
143Substrings are supported, including the ability for overlapping ranges to share edits transparently.
144Ultimately, this implementation is a case of strings rolling their own heap.
145
146The string library includes writing and reading strings via the preexsiting \CFA I/O stream library.
147Enabling transparent reading (of unknown length into a managed allocation) included revision of
148the stream libarary's existing handling of character arrays.
149
[c721105]150\subsection{Iterator}
[02ea981]151
152Some prototyping, available in the \CFA standard library,
153addresses the infamous \CC bug source of iterator invalidation.
154A family of iterator styles is presented, with cheap iterators that resist being misused,
155plus full-featured iterators that observe modifications without becoming invalid.
156
157Further design within this thesis presents a home for Liskov-style iterators in \CFA.
158This design extends a preexisting proposal to adapt the \CFA for-each style loop to be more user-pluggable.
159It also builds upon preexisting \CFA coroutines.
160It simplifies the work a programmer must do to leverage the suspended-state abstraction.
Note: See TracBrowser for help on using the repository browser.