[27f1055] | 1 | \chapter{Introduction} |
---|
| 2 | |
---|
[16915b1] | 3 | All modern programming languages provide three high-level containers (collections): array, linked-list, and string. |
---|
| 4 | Often array is part of the programming language, while linked-list is built from (recursive) pointer types, and string from a combination of array and linked-list. |
---|
| 5 | For all three types, languages supply varying degrees of high-level mechanism for manipulating these objects at the bulk level and at the component level, such as array copy, slicing and iterating. |
---|
| 6 | |
---|
| 7 | This work looks at extending these three foundational container types in the programming language \CFA, which is a new dialect of the C programming language. |
---|
| 8 | A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition. |
---|
| 9 | This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base. |
---|
| 10 | An additional goal is balancing good performance with safety. |
---|
[bdc8591] | 11 | |
---|
[19a2890] | 12 | |
---|
[bdc8591] | 13 | \section{Array} |
---|
| 14 | |
---|
[16915b1] | 15 | An array provides a homogeneous container with $O(1)$ access to elements using subscripting. |
---|
[bdc8591] | 16 | The array size can be static, dynamic but fixed after creation, or dynamic and variable after creation. |
---|
| 17 | For static and dynamic-fixed, an array can be stack allocated, while dynamic-variable requires the heap. |
---|
[16915b1] | 18 | Because array layout has contiguous components, subscripting is a computation (some form of pointer arithmetic). |
---|
[bdc8591] | 19 | |
---|
| 20 | |
---|
[1e110bf] | 21 | \section{Linked list} |
---|
[bdc8591] | 22 | |
---|
[16915b1] | 23 | A linked-list provides a homogeneous container often with $O(log N)$/$O(N)$ access to elements using successor and predecessor operations that normally involve pointer chasing. |
---|
[051aec4] | 24 | Subscripting by value is sometimes available, \eg hash table. |
---|
| 25 | Linked types are normally dynamically sized by adding/removing nodes using link fields internal or external to the elements (nodes). |
---|
| 26 | If a programming language allows pointer to stack storage, linked-list types can be allocated on the stack; |
---|
[16915b1] | 27 | otherwise, elements are heap allocated with explicitly/implicitly managed. |
---|
[bdc8591] | 28 | |
---|
| 29 | |
---|
| 30 | \section{String} |
---|
| 31 | |
---|
[c721105] | 32 | A string provides a dynamic array of homogeneous elements, where the elements are often human-readable characters. |
---|
[16915b1] | 33 | What differentiates a string from other types in that its operations work on blocks of elements for scanning and changing, \eg @index@ and @substr@. |
---|
[c721105] | 34 | Subscripting individual elements is often available. |
---|
[16915b1] | 35 | Therefore, the cost of string operations is less important than the power of the operations to accomplish complex text manipulation, \eg search, analysing, composing, and decomposing. |
---|
[c721105] | 36 | The dynamic nature of a string means storage is normally heap allocated but often implicitly managed, even in unmanaged languages. |
---|
[16915b1] | 37 | Often string management is separate from heap management, \ie strings roll their own heap. |
---|
[bdc8591] | 38 | |
---|
| 39 | |
---|
[6a8c773] | 40 | \section{Motivation} |
---|
[bdc8591] | 41 | |
---|
[16915b1] | 42 | The primary motivation for this work is two fold: |
---|
| 43 | \begin{enumerate}[leftmargin=*] |
---|
| 44 | \item |
---|
| 45 | These three aspects of C are extremely difficult to understand, teach, and get right because they are correspondingly extremely low level. |
---|
| 46 | Providing higher-level versions of these containers in \CFA is a major component of the primary goal. |
---|
| 47 | \item |
---|
| 48 | These three aspects of C cause the greatest safety issues because there are few or no safe guards when a programmer misunderstands or misuses these features~\cite{Elliott18, Blache19, Ruef19, Oorschot23}. |
---|
| 49 | Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occur in C resulting from errors using these facilities (memory errors), providing the major hacker attack-vectors. |
---|
| 50 | \end{enumerate} |
---|
| 51 | Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to move away from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these older languages. |
---|
| 52 | Hardening these three types goes a long way to make the majority of C programs safer. |
---|
| 53 | |
---|
| 54 | |
---|
| 55 | While multiple new languages purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the new runtime uses garage collection. |
---|
| 56 | Furthermore, these languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication. |
---|
| 57 | Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning \CC. |
---|
| 58 | Hence, rewriting and retraining costs for these languages can be prohibitive for companies with a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia). |
---|
[bdc8591] | 59 | |
---|
| 60 | |
---|
[6a8c773] | 61 | \subsection{C?} |
---|
[bdc8591] | 62 | |
---|
[051aec4] | 63 | Like many established programming languages, C has a standards committee and multiple ANSI/\-ISO language manuals~\cite{C99,C11,C18,C23}. |
---|
| 64 | However, most programming languages are only partially explained by standard's manuals. |
---|
[6a8c773] | 65 | When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}. |
---|
[16915b1] | 66 | Often other C compilers must mimic @gcc@ because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features. |
---|
| 67 | While some key aspects of C need to be explained by quoting from the language reference manual, to illustrate definite program semantics, my approach in this thesis is to devise a program, whose behaviour exercises a point at issue, and shows its behaviour. |
---|
[6a8c773] | 68 | These example programs show |
---|
[16915b1] | 69 | \begin{itemize}[leftmargin=*] |
---|
| 70 | \item if the compiler accepts or rejects certain syntax, |
---|
[051aec4] | 71 | \item prints output to buttress a claim of behaviour, |
---|
[16915b1] | 72 | \item or executes without triggering any embedded assertions testing pre/post-assertions or invariants. |
---|
[bdc8591] | 73 | \end{itemize} |
---|
[6a8c773] | 74 | This work has been tested across @gcc@ versions 8--12 and clang version 10 running on ARM, AMD, and Intel architectures. |
---|
| 75 | Any discovered anomalies among compilers or versions is discussed. |
---|
[16915b1] | 76 | In all case, it is never clear whether the \emph{truth} lies in the compiler or the C standard. |
---|
[27f1055] | 77 | |
---|
| 78 | |
---|
| 79 | \section{Contributions} |
---|
[bdc8591] | 80 | |
---|
[1e110bf] | 81 | \subsection{Linked list} |
---|
[bdc8591] | 82 | |
---|
| 83 | \subsection{Array} |
---|
| 84 | |
---|
| 85 | \subsection{String} |
---|
[c721105] | 86 | |
---|
| 87 | \subsection{Iterator} |
---|