Changeset 79ec8c3 for doc/theses/mike_brooks_MMath
- Timestamp:
- Dec 6, 2025, 5:02:29 PM (39 hours ago)
- Branches:
- master
- Parents:
- 9c8afc7
- File:
-
- 1 edited
-
doc/theses/mike_brooks_MMath/intro.tex (modified) (9 diffs)
Legend:
- Unmodified
- Added
- Removed
-
doc/theses/mike_brooks_MMath/intro.tex
r9c8afc7 r79ec8c3 2 2 3 3 All modern programming languages provide three high-level containers (collections): array, linked-list, and string. 4 Often array is part of the programming language, while linked-list is built from (recursive) pointer types, and stringfrom a combination of array and linked-list.4 Often array is part of the programming language, while linked-lists are built from (recursive) pointer types, and strings from a combination of array and linked-list. 5 5 For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component level, such as copying, slicing, extracting, and iterating among elements. 6 6 7 7 Unfortunately, these three aspects of C cause a significant number of memory errors~\cite{Oorschot23}. 8 Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occurring in C result from errors using these facilities (memory errors). 8 9 For operating system and browser vendors, who heavily use systems languages, 60\%--70\% of reported software vulnerabilities involved memory errors~\cite{Kehrer23}. 9 10 For Microsoft, 70\% of vulnerabilities addressed via security updates between 2006--2018 are memory safety issues~\cite[slide 10]{Miller19}. 10 11 In a study of software exploits in the U.S. National Vulnerability Database over 2013--2017, the top reported vulnerability is (memory) buffer errors, among 19 vulnerability categories~\cite{Cifuentes19}. 11 Therefore, hardening these three C types goes a long way to make the majority of C programs safer .12 Therefore, hardening these three C types goes a long way to make the majority of C programs safer and eliminating major hacker attack-vectors. 12 13 13 14 This work looks at extending these three foundational container types in the programming language \CFA, which is a new dialect of the C programming language. … … 26 27 The array size can be static, dynamic but fixed after creation, or dynamic and variable after creation. 27 28 For static and dynamic-fixed, an array can be stack allocated, while dynamic-variable requires the heap. 28 Because array layout has contiguous components, subscripting is a computation (some form of pointer arithmetic).29 Because array layout has contiguous components, subscripting is a computation, \ie some form of pointer arithmetic. 29 30 30 31 C provides a simple array type as a language feature. 31 However, it adopts the controversial languageposition of treating pointer and array as duals, leading to multiple problems.32 However, it adopts the controversial position of treating pointer and array as duals, leading to multiple problems. 32 33 33 34 … … 36 37 A linked-list provides a homogeneous container often with $O(log N)$ or $O(N)$ access to elements using successor and predecessor operations that normally involve pointer chasing. 37 38 Subscripting by value (rather than position or location as for array) is sometimes available, \eg hash table. 38 Linked types are normally dynamically sized by adding and removing nodes using link fields internal or external to theelements (nodes).39 If a programming language allows pointers to stack storage, linked-list nodes can be allocated on the stack and connected with stack addresses (pointers);39 Linked types are dynamically sized by adding and removing nodes using link fields internal or external to the list elements (nodes). 40 If a programming language allows pointers to stack storage, linked-list nodes can be allocated on the stack and connected with stack addresses; 40 41 otherwise, elements are heap allocated with internal or external link fields. 41 42 … … 47 48 hash search table consisting of a key (string) with associated data (@<search.h>@) 48 49 \end{itemize} 49 Because these libraries are simple, awkward to use, and unsafe, C programmers commonly build bespoke linked data-structures.50 Because these container libraries can be restrictive, awkward to use, and unsafe, C programmers often build bespoke linked data-structures, which further increases the potential for memory errors. 50 51 51 52 … … 54 55 A string provides a dynamic array of homogeneous elements, where the elements are (often) some form of human-readable characters. 55 56 What differentiates a string from other types in that many of its operations work on groups of elements for scanning and changing, \eg @index@ and @substr@. 56 While subscripting individual elements is usually available, working at the individualcharacter level is considered poor practise, \ie underutilizing the powerful string operations.57 While subscripting individual elements is usually available, working at the character level is considered poor practise, \ie underutilizing the powerful string operations. 57 58 Therefore, the cost of a string operation is usually less important than the power of the operation to accomplish complex text manipulation, \eg search, analysing, composing, and decomposing string text. 58 59 The dynamic nature of a string means storage is normally heap allocated but often implicitly managed, even in unmanaged languages. 59 In somecases, string management is separate from heap management, \ie strings roll their own heap.60 In many cases, string management is separate from heap management, \ie strings roll their own heap. 60 61 61 62 The C string type is just a character array and requires user storage-management to handle varying size. 62 The character array uses the convention of marking its (variable) array length by placing the 0-valued control character at the end (null-terminated). 63 C adopts a terminating sentinel character, @'\0'@, to mark the end of a variable-length character-array versus a preceding length field. 64 Hence, the sentinel character is excluded from a string and determining the string length is an $O(N)$ operation. 63 65 The C standard library includes a number of high-level operations for working with this representation. 64 Most of these operations are awkward and error prone.66 Most of these operations are awkward to use and error prone. 65 67 66 68 … … 80 82 \begin{enumerate}[leftmargin=*] 81 83 \item 82 The se three aspects of C are difficult to understand, teach, and get right because they are correspondinglylow level.84 The three primary container types in C are difficult to understand, teach, and get right because they are too low level. 83 85 Providing higher-level, feature-rich versions of these containers in \CFA is a major component of the primary goal. 86 The result is a simplify programming experience, which increases productivity and maintainability. 84 87 \item 85 These three aspects of C cause the greatest safety issues because there are few or no safe guards when a programmer misunderstands or misuses these features~\cite{Elliott18, Blache19, Ruef19, Oorschot23}. 86 Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occurring in C result from errors using these facilities (memory errors), providing the major hacker attack-vectors. 88 The new container types must be as correct and safe to use as those in other modern programming languages, which has been shown to a primary concern of industry, government, and military~\cite{ONCD}. 89 Prior approaches focus on out-of-bound array accesses using a model-based approach (ASCET) in embedded systems (\eg cars)~\cite{Blache19}, and general and null-terminated string arrays using a \CC template syntax (Checked C)~\cite{Elliott18,Ruef19}. 90 Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to \emph{move away} from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these languages. 91 Fixing these vulnerabilities negates this need, allowing C and its ecosystem to continue into the future. 87 92 \end{enumerate} 88 Both White House~\cite{WhiteHouse24} and DARPA~\cite{DARPA24} recently released a recommendation to move away from C and \CC, because of cybersecurity threats exploiting vulnerabilities in these older languages.89 Hardening these three types goes a long way to make the majority of C programs safer.90 93 91 92 While multiple new languages purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the new runtime uses garage collection. 94 While new languages, \eg Go, Rust, Swift, purport to be systems languages replacing C, the reality is that rewriting massive C code-bases is impractical and a non-starter if the runtime uses garbage collection. 95 Even assuming automated conversion of C programs to other safe languages, who will maintain this massive new code-base? 93 96 Furthermore, new languages must still interact with the underlying C operating system through fragile, type-unsafe, interlanguage-communication. 94 Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, overloading, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning new \CC. 95 Hence, rewriting and retraining costs for these languages can be prohibitive for companies with a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia). 97 Switching to \CC is equally impractical as its complex and interdependent type-system (\eg objects, overloading, inheritance, templates) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning new \CC features and idioms. 98 Finally, projects claiming to be modern C replacements (Cyclone~\cite{Grossman06}, Polymorphic C~\cite{Smith98}, GObject~\cite{GObject}) do not retain C backwards compatibility in syntax, programing model, or semantic compatibility; 99 these languages are equivalent to switching to a different programming language. 100 Hence, rewriting and retraining costs for other languages are prohibitive when companies have a large C software-base (Google, Apple, Microsoft, Amazon, AMD, Nvidia). 96 101 97 102 … … 100 105 Like many established programming languages, C has a standards committee and multiple ANSI/\-ISO language manuals~\cite{C99,C11,C18,C23}. 101 106 However, most programming languages are only partially explained by their standard's manual(s). 102 When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}. 103 Often other C compilers \emph{must} mimic @gcc@ because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features. 107 When it comes to explaining how C works, the definitive source is the @gcc@ compiler, which is mimicked by other C compilers, such as Clang~\cite{clang}, because a large part of the C library (runtime) system (@glibc@ on Linux) contains @gcc@ features. 104 108 Some key aspects of C need to be explained and understood by quoting from the language reference manual. 105 109 However, to illustrate actual program semantics, this thesis constructs multiple small programs whose behaviour exercises a particular point and then confirms the behaviour by running the program using multiple @gcc@ compilers. 106 110 These example programs show 107 \begin{itemize} 111 \begin{itemize}[itemsep=0pt] 108 112 \item if the compiler accepts or rejects certain syntax, 109 113 \item prints output to buttress a behavioural claim, … … 111 115 \end{itemize} 112 116 These programs are tested across @gcc@ versions 8--14 and clang versions 10--14 running on ARM, AMD, and Intel architectures. 113 Any discovered anomalies among compilers, versions, or architectures isdiscussed.117 Any discovered anomalies among compilers, versions, or architectures are discussed. 114 118 In general, it is never clear whether the \emph{truth} lies in the C standard or the compiler(s), which may be true for other programming languages. 115 119 … … 118 122 119 123 Overall, this work has produced significant syntactic and semantic improvements to C's arrays, linked-lists and string types. 120 As well, a strong plan for general iteration has been sketched out.124 % As well, a strong plan for general iteration has been sketched out. 121 125 The following are the detailed contributions, where performance and safety were always the motivating factors. 122 126 123 127 \subsection{Array} 124 128 125 Th is work's array improvements are:129 The improvements to C arrays are: 126 130 \begin{enumerate}[leftmargin=*] 127 131 \item 128 Introduce a small number of subtle changes to the typing rules for the C array, while still achieving significant backwards compatibility 132 Introduce a small number of subtle changes to the typing rules for the C array, while still achieving significant backwards compatibility. 129 133 \item 130 134 Create a new polymorphic mechanism in the \CFA @forall@ clause to specify array dimension values, similar to a fixed-typed parameter in a \CC \lstinline[language=C++]{template}.
Note:
See TracChangeset
for help on using the changeset viewer.