- Timestamp:
- Dec 12, 2025, 12:18:17 PM (4 weeks ago)
- Branches:
- master
- Children:
- 67748f9
- Parents:
- f2b74e3
- Location:
- doc/theses/mike_brooks_MMath
- Files:
-
- 4 edited
-
background.tex (modified) (2 diffs)
-
intro.tex (modified) (4 diffs)
-
uw-ethesis-frontpgs.tex (modified) (1 diff)
-
uw-ethesis.bib (modified) (3 diffs)
Legend:
- Unmodified
- Added
- Removed
-
doc/theses/mike_brooks_MMath/background.tex
rf2b74e3 rfe6047c 620 620 In fact, the outermost type constructor (syntactically first dimension) is really the one that determines the parameter flavour. 621 621 622 \PAB{TODO: add examples of mycode/arrr/bugs/c-dependent/x.cfa:v5102,5103, 623 which are shocking how much C ignores.} 622 C ignores length information given in parameter declarations entirely, when determining a function's type. 623 For example, it accepts this pair of declarations 624 \begin{cfa} 625 void f( int, float[][42] ); $\C{// array of len-42 arrays}$ 626 void f( int n, float[][n] ); $\C{// array of VLAs}$ 627 \end{cfa} 628 as a repeat, without an error about conflicting types for @f@. 629 Yet, entirely different stride calculations would occur in a function body whose parameters were declared in each of the two styles. 624 630 625 631 \begin{figure} … … 668 674 An array parameter declaration can specify the outermost dimension with a dimension value, @[10]@ (which is ignored), an empty dimension list, @[ ]@, or a pointer, @*@, as seen in \VRef[Figure]{f:ArParmEquivDecl}. 669 675 The rationale for rejecting the first invalid row follows shortly, while the second invalid row is nonsense, included to complete the pattern; its syntax hints at what the final row actually achieves. 670 Note, in the leftmost style, the typechecker ignores the actual value even ina dynamic expression.676 Note, in the leftmost style, the typechecker ignores the actual value, even for a dynamic expression. 671 677 \begin{cfa} 672 678 int N; -
doc/theses/mike_brooks_MMath/intro.tex
rf2b74e3 rfe6047c 1 1 \chapter{Introduction} 2 2 3 All modern programming languages provide three high-level containers (collections): array, linked-list, and string.4 Often array is part of the programming language, while linked-lists are built from (recursive) pointer types, and strings from a combination of array and linked-list.5 For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component level , such as copying, slicing, extracting, and iterating among elements.3 All modern programming languages provide at least these three high-level containers (collections): array, linked-list, and string. 4 Often, the array is part of the programming language, while linked lists are built from (recursive) pointer types, and strings from arrays and/or linked lists. 5 For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component levels, such as copying, slicing, extracting, and iterating among elements. 6 6 7 Unfortunately, these three aspects of C cause a significant number of memoryerrors~\cite{Oorschot23}.7 Unfortunately, memory misuse under C-idiomatic programming causes a significant number of errors~\cite{Oorschot23}. 8 8 Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occurring in C result from errors using these facilities (memory errors). 9 9 For operating system and browser vendors, who heavily use systems languages, 60\%--70\% of reported software vulnerabilities involved memory errors~\cite{Kehrer23}. 10 10 For Microsoft, 70\% of vulnerabilities addressed via security updates between 2006--2018 are memory safety issues~\cite[slide 10]{Miller19}. 11 In a study of software exploits in the U.S. National Vulnerability Database over 2013--2017, the top reported vulnerability is (memory) buffer errors, among 19 vulnerability categories~\cite{Cifuentes19}. 12 Therefore, hardening these three C types goes a long way to make the majority of C programs safer and eliminating major hacker attack-vectors. 11 In a study of software exploits in the U.S. National Vulnerability Database over 2013--2017, the top reported vulnerability is buffer errors (in the sense of misused array bounds), among 19 vulnerability categories~\cite{Cifuentes19}. 13 12 14 This work looks at extending these three foundational container types in the programming language \CFA, which is a new dialect of the C programming language. 13 While these figures address memory errors, defined without limit to the three primary collections, these authors' explanations of memory errors, among other C analysis efforts, suggest improving on C's language/library options for working with these three collections would be widely beneficial. 14 Every one of the safety-statistic sources above includes array bounds in its first characterization of unsafe programming-language features. 15 Of the four studies that deal exclusively with memory safety (excluding~\cite{Cifuentes19}), all put @malloc@--@free@ upfront as well. 16 The one that specifically surveys C features, \cite{Oorschot23}, also has similarly prominent blame on both of: the null-terminated text idiom, with its support/mandate in the C standard library; the unprotected nature of C pointers. 17 An effort at translating C to safe Rust~\cite{Emre2022} has pointers as the stand-out difficult aspect to analyze.\footnote{ 18 \cite{Emre2022} demonstrates the difficulty of analyzing C pointers for safety (while tacking some of this difficult problem). 19 It first puts a corpus of C programs through a naive Rust translation, which ``succeeds'' by leaving many uses of Rust's \lstinline{unsafe} marker. 20 Then, it analyzes these uses and presents a technique for reducing those uses that hinge exclusively on referenced object lifetime. 21 Pointer dereference accounts for 80\% of the cases that the naive translation did not make safe (Table 3.2, p33). 22 Of these, 10\% are eligible for the work's own safety improvement technique, while this filtering leaves 75\% with no single cause of unsafety determined (Table 3.5, p39). 23 The presented technique succeeds at making 75\% of the eligible initially unsafe dereferences safe, by inferring the necessary lifetime annotations (Table 4.2, p83). 24 This result leaves, per 1000 LOC, 3.3 unremovable unsafe dereferences that are understood to evade lifetime analysis, among 220 gross unsafe dereferences (Tables 3.5, 4.2 and 3.1, p27). 25 % 3.3 = 1398/413,428 * 1000 26 % 220 ~= (99,101 [@tab3.5] - 9631 [@tab4.2] ) / 413,428 [@tab 3.1] * 1000 27 } 28 A tool for analyzing C code's use of linked data structures (including, \eg skip list and binary tree) found variations of the linked-list shape to be present in most of the programs in its corpus~\cite{White2016}. 29 So, C's array unsafety is infamous, its string pattern necessitates explicit storage management (also infamous), and user-written linked lists are an attrictively recognizable player in the arena of (infamously) unsafe raw pointer uses. 30 Therefore, hardening the three primary collections goes a long way to making the majority of C programs safer, and eliminating major hacker attack vectors. 31 32 This work looks at extending these foundational container types in the programming language \CFA, which is a new dialect of the C programming language. 15 33 A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition. 16 34 This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base. … … 30 48 31 49 C provides a simple array type as a language feature. 32 However, it adopts the controversial position of treating pointer and array as duals, leading to multiple problems. 50 However, it adopts the controversial position of treating pointer and array as twins, leading to multiple problems. 51 The way that arrays are typicaly passed around a program removes the information necessary to do bound checks. 33 52 34 53 … … 41 60 otherwise, elements are heap allocated with internal or external link fields. 42 61 43 C provides a few simple, polymorphic,library data-structures (@glibc@):62 C provides a few simple, library data-structures (@glibc@): 44 63 \begin{itemize} 45 64 \item … … 48 67 hash search table consisting of a key (string) with associated data (@<search.h>@) 49 68 \end{itemize} 50 Because these container libraries can be restrictive , awkward to use, and unsafe, C programmers often build bespoke linked data-structures, which further increases the potential for memory errors.69 Because these container libraries can be restrictive or awkward to use, C programmers often build bespoke linked data-structures, which further increases the potential for memory errors. 51 70 52 71 -
doc/theses/mike_brooks_MMath/uw-ethesis-frontpgs.tex
rf2b74e3 rfe6047c 131 131 \CFA strives to fix mistakes in C, chief among them, safety. 132 132 This thesis presents a significant step forward in \CFA's goal to remove unsafe pointer operations. 133 The thesis presents improvements to the \CFA language design, both syntax and semantics,to support advanced container features.134 These features are implemented across the \CFA compiler , libraries, and runtime system.135 The results maintain another \CFA goal of remaining 99\% backwards compatiblewith C.136 This thesis leverages preexisting work within the compiler's type and runtime systems generated by prior students working on the \CFA project.137 138 All modern programming languages provide three high-level containers (collections): array, linked-list, and string.139 Often array is part of the programming language, while linked-lists are built from (recursive) pointer types, and strings from a combination of array and linked-list.140 For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component level , such as copying, slicing, extracting, and iterating among elements.141 Unfortunately, t hese three aspects of C cause 60\%--70\% of the reported software vulnerabilities involving memory errors, and 70\%--80\% of hacker attack-vectors target these types.133 It describes improvements to the \CFA language design to support advanced container features. 134 These features are implemented across the \CFA compiler and runtime libraries. 135 The results maintain another \CFA goal of offering strong backwards compatibility with C. 136 This work leverages preexisting \CFA contributiongs of prior students working on the \CFA project, particularly through novel applications of the compiler's type system. 137 138 All modern programming languages provide at least these three high-level containers (collections): array, linked-list, and string. 139 Often, the array is part of the programming language, while linked lists are built from (recursive) pointer types, and strings from arrays and/or linked lists. 140 For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component levels, such as copying, slicing, extracting, and iterating among elements. 141 Unfortunately, typical solutions for the the key types in C cause 60\%--70\% of the reported software vulnerabilities involving memory errors; 70\%--80\% of hacker attack vectors target these types. 142 142 Therefore, hardening these three C types goes a long way to make the majority of C programs safer. 143 143 144 144 Specifically, an array utility is provided that tracks length internally, relieving the user of managing explicit length parameters and stopping buffer-overrun errors. 145 145 This feature requires augmenting the \CFA type system, making array length available at compile and runtime. 146 A linked-list utility is provided, which obviates many explicit recursive pointers by catering directly to system-programming uses (intrusive lists) for which a library solution is often dismissed.146 A linked-list utility is provided, which obviates many user-managed recursive pointers by catering directly to system-programming uses (intrusive linking, ad-hoc listing) for which a library solution is often dismissed. 147 147 Finally, a string utility is provided with implicit memory management of text in a specialized heap, relieving error-prone buffer management, including overrun, and providing a copy-on-write speed boost. 148 148 For all three utilities, performance is argued to be on-par with, and occasionally surpassing relevant comparators. -
doc/theses/mike_brooks_MMath/uw-ethesis.bib
rf2b74e3 rfe6047c 94 94 } 95 95 96 % ------------------------------------------------- 97 % Safety 98 96 99 @article{Blache19, 97 100 author = {Gunter Blache}, … … 122 125 } 123 126 127 @phdthesis{Emre2022, 128 author = "Mehmet Emre", 129 title = "Translating C to Safe Rust: Reasoning about Pointer Types and Lifetimes", 130 school = "UC Santa Barbara", 131 year = "2022" 132 } 133 134 @inproceedings{White2016, 135 author = {White, David H. and Rupprecht, Thomas and L\"{u}ttgen, Gerald}, 136 title = {DSI: an evidence-based approach to identify dynamic data structures in C programs}, 137 year = {2016}, 138 isbn = {9781450343909}, 139 publisher = {Association for Computing Machinery}, 140 address = {New York, NY, USA}, 141 url = {https://doi.org/10.1145/2931037.2931071}, 142 doi = {10.1145/2931037.2931071}, 143 booktitle = {Proceedings of the 25th International Symposium on Software Testing and Analysis}, 144 pages = {259–269}, 145 numpages = {11}, 146 keywords = {Data structure identification, dynamic data structures, pointer programs, program comprehension}, 147 location = {Saarbr\"{u}cken, Germany}, 148 series = {ISSTA 2016} 149 } 150 151 % ----------------------------------------------- 152 % Misc 153 124 154 @misc{RVO20, 125 155 contributer = {pabuhr@plg}, … … 149 179 howpublished= {\url{https://c-faq.com/decl/spiral.anderson.html}}, 150 180 } 181
Note:
See TracChangeset
for help on using the changeset viewer.