Changeset fe6047c for doc


Ignore:
Timestamp:
Dec 12, 2025, 12:18:17 PM (4 weeks ago)
Author:
Michael Brooks <mlbrooks@…>
Branches:
master
Children:
67748f9
Parents:
f2b74e3
Message:

Edits in early sections.

Includes explicating the connection between unsafety stats and the three collection types.

Location:
doc/theses/mike_brooks_MMath
Files:
4 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/mike_brooks_MMath/background.tex

    rf2b74e3 rfe6047c  
    620620In fact, the outermost type constructor (syntactically first dimension) is really the one that determines the parameter flavour.
    621621
    622 \PAB{TODO: add examples of mycode/arrr/bugs/c-dependent/x.cfa:v5102,5103,
    623 which are shocking how much C ignores.}
     622C ignores length information given in parameter declarations entirely, when determining a function's type.
     623For example, it accepts this pair of declarations
     624\begin{cfa}
     625    void f( int, float[][42] );  $\C{// array of len-42 arrays}$
     626    void f( int n, float[][n]  );   $\C{// array of VLAs}$
     627\end{cfa}
     628as a repeat, without an error about conflicting types for @f@.
     629Yet, entirely different stride calculations would occur in a function body whose parameters were declared in each of the two styles.
    624630
    625631\begin{figure}
     
    668674An array parameter declaration can specify the outermost dimension with a dimension value, @[10]@ (which is ignored), an empty dimension list, @[ ]@, or a pointer, @*@, as seen in \VRef[Figure]{f:ArParmEquivDecl}.
    669675The rationale for rejecting the first invalid row follows shortly, while the second invalid row is nonsense, included to complete the pattern; its syntax hints at what the final row actually achieves.
    670 Note, in the leftmost style, the typechecker ignores the actual value even in a dynamic expression.
     676Note, in the leftmost style, the typechecker ignores the actual value, even for a dynamic expression.
    671677\begin{cfa}
    672678int N;
  • doc/theses/mike_brooks_MMath/intro.tex

    rf2b74e3 rfe6047c  
    11\chapter{Introduction}
    22
    3 All modern programming languages provide three high-level containers (collections): array, linked-list, and string.
    4 Often array is part of the programming language, while linked-lists are built from (recursive) pointer types, and strings from a combination of array and linked-list.
    5 For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component level, such as copying, slicing, extracting, and iterating among elements.
     3All modern programming languages provide at least these three high-level containers (collections): array, linked-list, and string.
     4Often, the array is part of the programming language, while linked lists are built from (recursive) pointer types, and strings from arrays and/or linked lists.
     5For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component levels, such as copying, slicing, extracting, and iterating among elements.
    66
    7 Unfortunately, these three aspects of C cause a significant number of memory errors~\cite{Oorschot23}.
     7Unfortunately, memory misuse under C-idiomatic programming causes a significant number of errors~\cite{Oorschot23}.
    88Estimates suggest 50\%~\cite{Mendio24} of total reported open-source vulnerabilities occurring in C result from errors using these facilities (memory errors).
    99For operating system and browser vendors, who heavily use systems languages, 60\%--70\% of reported software vulnerabilities involved memory errors~\cite{Kehrer23}.
    1010For Microsoft, 70\% of vulnerabilities addressed via security updates between 2006--2018 are memory safety issues~\cite[slide 10]{Miller19}.
    11 In a study of software exploits in the U.S. National Vulnerability Database over 2013--2017, the top reported vulnerability is (memory) buffer errors, among 19 vulnerability categories~\cite{Cifuentes19}.
    12 Therefore, hardening these three C types goes a long way to make the majority of C programs safer and eliminating major hacker attack-vectors.
     11In a study of software exploits in the U.S. National Vulnerability Database over 2013--2017, the top reported vulnerability is buffer errors (in the sense of misused array bounds), among 19 vulnerability categories~\cite{Cifuentes19}.
    1312
    14 This work looks at extending these three foundational container types in the programming language \CFA, which is a new dialect of the C programming language.
     13While these figures address memory errors, defined without limit to the three primary collections, these authors' explanations of memory errors, among other C analysis efforts, suggest improving on C's language/library options for working with these three collections would be widely beneficial.
     14Every one of the safety-statistic sources above includes array bounds in its first characterization of unsafe programming-language features.
     15Of the four studies that deal exclusively with memory safety (excluding~\cite{Cifuentes19}), all put @malloc@--@free@ upfront as well.
     16The one that specifically surveys C features, \cite{Oorschot23}, also has similarly prominent blame on both of: the null-terminated text idiom, with its support/mandate in the C standard library; the unprotected nature of C pointers.
     17An effort at translating C to safe Rust~\cite{Emre2022} has pointers as the stand-out difficult aspect to analyze.\footnote{
     18        \cite{Emre2022} demonstrates the difficulty of analyzing C pointers for safety (while tacking some of this difficult problem).
     19        It first puts a corpus of C programs through a naive Rust translation, which ``succeeds'' by leaving many uses of Rust's \lstinline{unsafe} marker.
     20        Then, it analyzes these uses and presents a technique for reducing those uses that hinge exclusively on referenced object lifetime.
     21        Pointer dereference accounts for 80\% of the cases that the naive translation did not make safe (Table 3.2, p33).
     22        Of these, 10\% are eligible for the work's own safety improvement technique, while this filtering leaves 75\% with no single cause of unsafety determined (Table 3.5, p39).
     23        The presented technique succeeds at making 75\% of the eligible initially unsafe dereferences safe, by inferring the necessary lifetime annotations (Table 4.2, p83).
     24        This result leaves, per 1000 LOC, 3.3 unremovable unsafe dereferences that are understood to evade lifetime analysis, among 220 gross unsafe dereferences (Tables 3.5, 4.2 and 3.1, p27).
     25        % 3.3 = 1398/413,428 * 1000
     26        % 220 ~= (99,101 [@tab3.5] - 9631 [@tab4.2] ) / 413,428 [@tab 3.1] * 1000
     27}
     28A tool for analyzing C code's use of linked data structures (including, \eg skip list and binary tree) found variations of the linked-list shape to be present in most of the programs in its corpus~\cite{White2016}.
     29So, C's array unsafety is infamous, its string pattern necessitates explicit storage management (also infamous), and user-written linked lists are an attrictively recognizable player in the arena of (infamously) unsafe raw pointer uses.
     30Therefore, hardening the three primary collections goes a long way to making the majority of C programs safer, and eliminating major hacker attack vectors.
     31
     32This work looks at extending these foundational container types in the programming language \CFA, which is a new dialect of the C programming language.
    1533A primary goal of \CFA~\cite{Cforall} is 99\% backward compatibility with C, while maintaining a look and feel that matches with C programmer experience and intuition.
    1634This goal requires ``thinking inside the box'' to engineer new features that ``work and play'' with C and its massive legacy code-base.
     
    3048
    3149C provides a simple array type as a language feature.
    32 However, it adopts the controversial position of treating pointer and array as duals, leading to multiple problems.
     50However, it adopts the controversial position of treating pointer and array as twins, leading to multiple problems.
     51The way that arrays are typicaly passed around a program removes the information necessary to do bound checks.
    3352
    3453
     
    4160otherwise, elements are heap allocated with internal or external link fields.
    4261
    43 C provides a few simple, polymorphic, library data-structures (@glibc@):
     62C provides a few simple, library data-structures (@glibc@):
    4463\begin{itemize}
    4564\item
     
    4867hash search table consisting of a key (string) with associated data (@<search.h>@)
    4968\end{itemize}
    50 Because these container libraries can be restrictive, awkward to use, and unsafe, C programmers often build bespoke linked data-structures, which further increases the potential for memory errors.
     69Because these container libraries can be restrictive or awkward to use, C programmers often build bespoke linked data-structures, which further increases the potential for memory errors.
    5170
    5271
  • doc/theses/mike_brooks_MMath/uw-ethesis-frontpgs.tex

    rf2b74e3 rfe6047c  
    131131\CFA strives to fix mistakes in C, chief among them, safety.
    132132This thesis presents a significant step forward in \CFA's goal to remove unsafe pointer operations.
    133 The thesis presents improvements to the \CFA language design, both syntax and semantics, to support advanced container features.
    134 These features are implemented across the \CFA compiler, libraries, and runtime system.
    135 The results maintain another \CFA goal of remaining 99\% backwards compatible with C.
    136 This thesis leverages preexisting work within the compiler's type and runtime systems generated by prior students working on the \CFA project.
    137 
    138 All modern programming languages provide three high-level containers (collections): array, linked-list, and string.
    139 Often array is part of the programming language, while linked-lists are built from (recursive) pointer types, and strings from a combination of array and linked-list.
    140 For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component level, such as copying, slicing, extracting, and iterating among elements.
    141 Unfortunately, these three aspects of C cause 60\%--70\% of the reported software vulnerabilities involving memory errors, and 70\%--80\% of hacker attack-vectors target these types.
     133It describes improvements to the \CFA language design to support advanced container features.
     134These features are implemented across the \CFA compiler and runtime libraries.
     135The results maintain another \CFA goal of offering strong backwards compatibility with C.
     136This work leverages preexisting \CFA contributiongs of prior students working on the \CFA project, particularly through novel applications of the compiler's type system.
     137
     138All modern programming languages provide at least these three high-level containers (collections): array, linked-list, and string.
     139Often, the array is part of the programming language, while linked lists are built from (recursive) pointer types, and strings from arrays and/or linked lists.
     140For all three types, languages and/or their libraries supply varying degrees of high-level mechanisms for manipulating these objects at the bulk and component levels, such as copying, slicing, extracting, and iterating among elements.
     141Unfortunately, typical solutions for the the key types in C cause 60\%--70\% of the reported software vulnerabilities involving memory errors; 70\%--80\% of hacker attack vectors target these types.
    142142Therefore, hardening these three C types goes a long way to make the majority of C programs safer.
    143143
    144144Specifically, an array utility is provided that tracks length internally, relieving the user of managing explicit length parameters and stopping buffer-overrun errors.
    145145This feature requires augmenting the \CFA type system, making array length available at compile and runtime.
    146 A linked-list utility is provided, which obviates many explicit recursive pointers by catering directly to system-programming uses (intrusive lists) for which a library solution is often dismissed.
     146A linked-list utility is provided, which obviates many user-managed recursive pointers by catering directly to system-programming uses (intrusive linking, ad-hoc listing) for which a library solution is often dismissed.
    147147Finally, a string utility is provided with implicit memory management of text in a specialized heap, relieving error-prone buffer management, including overrun, and providing a copy-on-write speed boost.
    148148For all three utilities, performance is argued to be on-par with, and occasionally surpassing relevant comparators.
  • doc/theses/mike_brooks_MMath/uw-ethesis.bib

    rf2b74e3 rfe6047c  
    9494}
    9595
     96% -------------------------------------------------
     97% Safety
     98
    9699@article{Blache19,
    97100    author      = {Gunter Blache},
     
    122125}
    123126
     127@phdthesis{Emre2022,
     128    author  = "Mehmet Emre",
     129    title   = "Translating C to Safe Rust: Reasoning about Pointer Types and Lifetimes",
     130    school  = "UC Santa Barbara",
     131    year    = "2022"
     132}
     133
     134@inproceedings{White2016,
     135    author = {White, David H. and Rupprecht, Thomas and L\"{u}ttgen, Gerald},
     136    title = {DSI: an evidence-based approach to identify dynamic data structures in C programs},
     137    year = {2016},
     138    isbn = {9781450343909},
     139    publisher = {Association for Computing Machinery},
     140    address = {New York, NY, USA},
     141    url = {https://doi.org/10.1145/2931037.2931071},
     142    doi = {10.1145/2931037.2931071},
     143    booktitle = {Proceedings of the 25th International Symposium on Software Testing and Analysis},
     144    pages = {259–269},
     145    numpages = {11},
     146    keywords = {Data structure identification, dynamic data structures, pointer programs, program comprehension},
     147    location = {Saarbr\"{u}cken, Germany},
     148    series = {ISSTA 2016}
     149}
     150
     151% -----------------------------------------------
     152% Misc
     153
    124154@misc{RVO20,
    125155    contributer = {pabuhr@plg},
     
    149179    howpublished= {\url{https://c-faq.com/decl/spiral.anderson.html}},
    150180}
     181
Note: See TracChangeset for help on using the changeset viewer.