source: doc/theses/mike_brooks_MMath/intro.tex@ 7d415b4

Last change on this file since 7d415b4 was 7d415b4, checked in by Peter A. Buhr <pabuhr@…>, 14 months ago

more proofreading of intro chapter

  • Property mode set to 100644
File size: 6.8 KB
RevLine 
[27f1055]1\chapter{Introduction}
2
[16915b1]3All modern programming languages provide three high-level containers (collections): array, linked-list, and string.
4Often array is part of the programming language, while linked-list is built from (recursive) pointer types, and string from a combination of array and linked-list.
[7d415b4]5For all three types, languages supply varying degrees of high-level mechanism for manipulating these objects at the bulk level and at the component level, such as array copy, slicing and iterating using subscripting.
[16915b1]6
7This work looks at extending these three foundational container types in the programming language \CFA, which is a new dialect of the C programming language.
8A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition.
9This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base.
[7d415b4]10An equally important goal is balancing good performance with safety.
[bdc8591]11
[19a2890]12
[bdc8591]13\section{Array}
14
[16915b1]15An array provides a homogeneous container with $O(1)$ access to elements using subscripting.
[7d415b4]16Higher-level operations like array slicing (single or multidimensional) may have significantly higher cost, but provides a better programmer experience.
[bdc8591]17The array size can be static, dynamic but fixed after creation, or dynamic and variable after creation.
18For static and dynamic-fixed, an array can be stack allocated, while dynamic-variable requires the heap.
[16915b1]19Because array layout has contiguous components, subscripting is a computation (some form of pointer arithmetic).
[bdc8591]20
21
[1e110bf]22\section{Linked list}
[bdc8591]23
[16915b1]24A linked-list provides a homogeneous container often with $O(log N)$/$O(N)$ access to elements using successor and predecessor operations that normally involve pointer chasing.
[051aec4]25Subscripting by value is sometimes available, \eg hash table.
26Linked types are normally dynamically sized by adding/removing nodes using link fields internal or external to the elements (nodes).
27If a programming language allows pointer to stack storage, linked-list types can be allocated on the stack;
[16915b1]28otherwise, elements are heap allocated with explicitly/implicitly managed.
[bdc8591]29
30
31\section{String}
32
[7d415b4]33A string provides a dynamic array of homogeneous elements, where the elements are (often) some form of human-readable characters.
34What differentiates a string from other types in that many of its operations work on groups of elements for scanning and changing, \eg @index@ and @substr@;
35subscripting individual elements is usually available, too.
36Therefore, the cost of a string operation is usually less important than the power of the operation to accomplish complex text manipulation, \eg search, analysing, composing, and decomposing.
[c721105]37The dynamic nature of a string means storage is normally heap allocated but often implicitly managed, even in unmanaged languages.
[7d415b4]38In some cases, string management is separate from heap management, \ie strings roll their own heap.
39
40
41\section{Iterator}
42
43As a side issue to working with complex structured types is iterating through them.
44Some thought has been given to \emph{general} versus specific iteration capabilities as part of of this work, but the general iteration work is only a sketch for others as future work.
45Nevertheless, sufficed work was done to write out the ideas that developed and how they should apply in the main context of this work.
[bdc8591]46
47
[6a8c773]48\section{Motivation}
[bdc8591]49
[16915b1]50The primary motivation for this work is two fold:
51\begin{enumerate}[leftmargin=*]
52\item
[7d415b4]53These three aspects of C are difficult to understand, teach, and get right because they are correspondingly low level.
54Providing higher-level, feature-rich versions of these containers in \CFA is a major component of the primary goal.
[16915b1]55\item
56These three aspects of C cause the greatest safety issues because there are few or no safe guards when a programmer misunderstands or misuses these features~\cite{Elliott18, Blache19, Ruef19, Oorschot23}.
57Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occur in C resulting from errors using these facilities (memory errors), providing the major hacker attack-vectors.
58\end{enumerate}
59Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to move away from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these older languages.
60Hardening these three types goes a long way to make the majority of C programs safer.
61
62
63While multiple new languages purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the new runtime uses garage collection.
64Furthermore, these languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication.
65Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning \CC.
66Hence, rewriting and retraining costs for these languages can be prohibitive for companies with a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia).
[bdc8591]67
68
[6a8c773]69\subsection{C?}
[bdc8591]70
[051aec4]71Like many established programming languages, C has a standards committee and multiple ANSI/\-ISO language manuals~\cite{C99,C11,C18,C23}.
[7d415b4]72However, most programming languages are only partially explained by their standard's manual(s).
[6a8c773]73When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}.
[16915b1]74Often other C compilers must mimic @gcc@ because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features.
[7d415b4]75While some key aspects of C need to be explained and understood by quoting from the language reference manual, to illustrate actual program semantics, this thesis uses constructs a program whose behaviour exercises a particular point and then confirms the behaviour by running the program.
[6a8c773]76These example programs show
[16915b1]77\begin{itemize}[leftmargin=*]
78 \item if the compiler accepts or rejects certain syntax,
[7d415b4]79 \item prints output to buttress a behavioural claim,
[16915b1]80 \item or executes without triggering any embedded assertions testing pre/post-assertions or invariants.
[bdc8591]81\end{itemize}
[7d415b4]82This work has been tested across @gcc@ versions 8--14 and clang versions 10--14 running on ARM, AMD, and Intel architectures.
[6a8c773]83Any discovered anomalies among compilers or versions is discussed.
[7d415b4]84In all case, it is never clear whether the \emph{truth} lies in the compiler(s) or the C standard.
[27f1055]85
86
87\section{Contributions}
[bdc8591]88
[7d415b4]89This work has produced significant syntactic and semantic improvements to C's arrays, linked-lists and string types.
90As well, a strong plan for general iteration has been sketched out.
91
[1e110bf]92\subsection{Linked list}
[bdc8591]93
94\subsection{Array}
95
96\subsection{String}
[c721105]97
98\subsection{Iterator}
Note: See TracBrowser for help on using the repository browser.