source: doc/theses/mike_brooks_MMath/intro.tex @ db19e1d

Last change on this file since db19e1d was 16915b1, checked in by Peter A. Buhr <pabuhr@…>, 3 months ago

proofread intro chapter and add citation

  • Property mode set to 100644
File size: 6.0 KB
Line 
1\chapter{Introduction}
2
3All modern programming languages provide three high-level containers (collections): array, linked-list, and string.
4Often array is part of the programming language, while linked-list is built from (recursive) pointer types, and string from a combination of array and linked-list.
5For all three types, languages supply varying degrees of high-level mechanism for manipulating these objects at the bulk level and at the component level, such as array copy, slicing and iterating.
6
7This work looks at extending these three foundational container types in the programming language \CFA, which is a new dialect of the C programming language.
8A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition.
9This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base.
10An additional goal is balancing good performance with safety.
11
12
13\section{Array}
14
15An array provides a homogeneous container with $O(1)$ access to elements using subscripting.
16The array size can be static, dynamic but fixed after creation, or dynamic and variable after creation.
17For static and dynamic-fixed, an array can be stack allocated, while dynamic-variable requires the heap.
18Because array layout has contiguous components, subscripting is a computation (some form of pointer arithmetic).
19
20
21\section{Linked list}
22
23A linked-list provides a homogeneous container often with $O(log N)$/$O(N)$ access to elements using successor and predecessor operations that normally involve pointer chasing.
24Subscripting by value is sometimes available, \eg hash table.
25Linked types are normally dynamically sized by adding/removing nodes using link fields internal or external to the elements (nodes).
26If a programming language allows pointer to stack storage, linked-list types can be allocated on the stack;
27otherwise, elements are heap allocated with explicitly/implicitly managed.
28
29
30\section{String}
31
32A string provides a dynamic array of homogeneous elements, where the elements are often human-readable characters.
33What differentiates a string from other types in that its operations work on blocks of elements for scanning and changing, \eg @index@ and @substr@.
34Subscripting individual elements is often available.
35Therefore, the cost of string operations is less important than the power of the operations to accomplish complex text manipulation, \eg search, analysing, composing, and decomposing.
36The dynamic nature of a string means storage is normally heap allocated but often implicitly managed, even in unmanaged languages.
37Often string management is separate from heap management, \ie strings roll their own heap.
38
39
40\section{Motivation}
41
42The primary motivation for this work is two fold:
43\begin{enumerate}[leftmargin=*]
44\item
45These three aspects of C are extremely difficult to understand, teach, and get right because they are correspondingly extremely low level.
46Providing higher-level versions of these containers in \CFA is a major component of the primary goal.
47\item
48These three aspects of C cause the greatest safety issues because there are few or no safe guards when a programmer misunderstands or misuses these features~\cite{Elliott18, Blache19, Ruef19, Oorschot23}.
49Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occur in C resulting from errors using these facilities (memory errors), providing the major hacker attack-vectors.
50\end{enumerate}
51Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to move away from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these older languages.
52Hardening these three types goes a long way to make the majority of C programs safer.
53
54
55While multiple new languages purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the new runtime uses garage collection.
56Furthermore, these languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication.
57Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning \CC.
58Hence, rewriting and retraining costs for these languages can be prohibitive for companies with a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia).
59
60
61\subsection{C?}
62
63Like many established programming languages, C has a standards committee and multiple ANSI/\-ISO language manuals~\cite{C99,C11,C18,C23}.
64However, most programming languages are only partially explained by standard's manuals.
65When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}.
66Often other C compilers must mimic @gcc@ because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features.
67While some key aspects of C need to be explained by quoting from the language reference manual, to illustrate definite program semantics, my approach in this thesis is to devise a program, whose behaviour exercises a point at issue, and shows its behaviour.
68These example programs show
69\begin{itemize}[leftmargin=*]
70        \item if the compiler accepts or rejects certain syntax,
71        \item prints output to buttress a claim of behaviour,
72        \item or executes without triggering any embedded assertions testing pre/post-assertions or invariants.
73\end{itemize}
74This work has been tested across @gcc@ versions 8--12 and clang version 10 running on ARM, AMD, and Intel architectures.
75Any discovered anomalies among compilers or versions is discussed.
76In all case, it is never clear whether the \emph{truth} lies in the compiler or the C standard.
77
78
79\section{Contributions}
80
81\subsection{Linked list}
82
83\subsection{Array}
84
85\subsection{String}
86
87\subsection{Iterator}
Note: See TracBrowser for help on using the repository browser.