Ignore:
Timestamp:
Aug 31, 2022, 3:02:53 PM (2 years ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, ast-experimental, master, pthread-emulation
Children:
7e5da64
Parents:
38e4c5c (diff), 67a1c67 (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc

Location:
doc/theses/thierry_delisle_PhD/thesis
Files:
3 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/thierry_delisle_PhD/thesis/local.bib

    r38e4c5c r9d67a6d  
    555555  title = {Mach Scheduling and Thread Interfaces - Kernel Programming Guide},
    556556  organization = {Apple Inc.},
    557   note = {\href{https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html}{https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html}}
     557  note = {\href{https://developer.apple.com/library/archive/documentation/Darwin/Conceptual/KernelProgramming/scheduler/scheduler.html}{https://\-developer.apple.com/\-library/archive/\-documentation/\-Darwin/\-Conceptual/\-KernelProgramming/\-scheduler/\-scheduler.html}}
    558558}
    559559
  • doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex

    r38e4c5c r9d67a6d  
    11\chapter{Conclusion}\label{conclusion}
    22
    3 \Gls{uthrding} is popular.
    4 It makes sense for \CFA to use it.
     3Building the \CFA runtime has been an extremely challenging project.
     4The work was divided between high-level concurrency design and a user-level threading runtime (Masters thesis), and low-level support of the user-level runtime using OS kernel-threading and its (multiple) I/O subsystems (Ph.D. thesis).
     5Because I am the main developer for both components of this project, there is strong continuity across the design and implementation.
     6This continuity provides a consistent approach to advanced control-flow and concurrency, with easier development, management and maintenance of the runtime in the future.
    57
    6 \todo{Obivously fix the above}
     8I believed my Masters work would provide the background to make the Ph.D work reasonably straightforward.
     9What I discovered is that interacting with kernel locking, threading, and I/O in the UNIX (Linux) operating system is extremely difficult.
     10There are multiple concurrency aspects in UNIX that are poorly designed, not only for user-level threading but also for kernel-level threading.
     11Basically, UNIX-based OSs are not very concurrency friendly.
     12To be fair, many of these concurrency aspects were designed 30-40 years ago, when there were few multi-processor computers and concurrency knowledge was just developing.
     13It is unfortunately so little has changed in the intervening years.
    714
    8 An important aspect of this approach to threading is how threads are scheduled.
    9 As \CFA aims to increase productivity and safety of C while maintaining its performance, so to should the threading runtime achieve these goals.
    10 For scheduling, productivity and safety manifest in removing pitfalls in the efficient usage of the threading runtime.
    11 This thesis contributes to this goal by presenting a low-latency scheduler that offers improved starvation prevention compared to other state-of-the-art schedulers.
    12 It presents a core algorithm (Chapter~\ref{core}) that provides increased fairness through helping (Section~\ref{heling}) as well as optimizations which virtually remove the cost of this fairness (Section~\ref{relaxedtimes}).
    13 Building upon the fundamental scheduling algorithm, an implementation of user-level \io blocking is presented (Chapter~\ref{io}) which achieves the same performance and fairness balance as the scheduler itself.
    14 From these core algorithms, and a low-latency idle-sleep mechanism is presented (Chapter~\ref{practice}) which allows the \CFA runtime to stay viable for workloads that do not consistently saturate the system.
     15Also, my decision to use @io_uring@ was both a positive and negative.
     16The positive is that @io_uring@ supports the panoply of I/O mechanisms in UNIX;
     17hence, the \CFA runtime uses one I/O mechanism to provide non-blocking I/O, rather than using @select@ to handle TTY I/O, @epoll@ to handle network I/O, and some unknown to handle disk I/O.
     18It is unclear to me how I would have merged all these different I/O mechanisms into a coherent scheduling implementation.
     19The negative is that @io_uring@ is new and developing.
     20As a result, there is limited documentation, few places to find usage examples, and multiple errors that required workarounds.
     21Given what I now know about @io_uring@, I would say it is insufficiently coupled with the Linux kernel to properly handle non-blocking I/O.
     22Specifically, spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel.
     23Nonblocking I/O should not be handled in this way.
     24
     25% While \gls{uthrding} is an old idea, going back to the first multi-processor computers, it was largely set aside in the 1990s because of the complexity in making it work well between applications and the operating system.
     26% Unfortunately,, a large amount of that complexity still exists, making \gls{uthrding} a difficult task for library and programming-language implementers.
     27% For example, the introduction of thread-local storage and its usage in many C libraries causes the serially reusable problem~\cite{SeriallyReusable} for all \gls{uthrding} implementers.
     28% Specifically, if a \gls{kthrd} is preempted, it always restarts with the same thread-local storage;
     29% when a user thread is preempted, it can be restarted on another \gls{kthrd}, accessing the new thread-local storage, or worse, the previous thread-local storage.
     30% The latter case causes failures when an operation using the thread-local storage is assumed to be atomic at the kernel-threading level.
     31% If library implementers always used the pthreads interface to access threads, locks, and thread-local storage, language runtimes can interpose a \gls{uthrding} version of pthreads, switching the kind of threads, locks, and storage, and hence, make it safe.
     32% However, C libraries are currently filled with direct declarations of thread-local storage and low-level atomic instructions.
     33% In essence, every library developer is inventing their own threading mechanisms to solve their unique problem, independent from any standardize approaches.
     34% This state of affairs explains why concurrency is such a mess.
     35%
     36% To address the concurrency mess, new programming languages integrate threading into the language rather than using the operating-system supplied interface.
     37% The reason is that many sequential code-optimizations invalid correctly written concurrent programs.
     38% While providing safe concurrent-code generation is essential, the underlying language runtime is still free to implement threading using kernel or user/kernel threading, \eg Java versus Go.
     39% In either case, the language/runtime manages all the details to simplify concurrency and increase safety.
     40
     41\section{Goals}
     42
     43The underlying goal of this thesis is scheduling the complex hardware components that make up a computer to provide good utilization and fairness.
     44However, direct hardware scheduling is only possible in the OS.
     45Instead, this thesis is performing arms-length application scheduling of the hardware components through a complex set of OS interfaces that indirectly manipulate the hardware components.
     46Couple that with the OS running multiple applications with its own goals for scheduling among them.
     47Hence, my goal became the battle of two schedulers.
     48
     49This work focusses on efficient and fair scheduling of the multiple CPUs, which are ubiquitous on all modern computers.
     50The levels of indirection to the CPUs are:
     51\begin{itemize}
     52\item
     53The \CFA presentation of concurrency through multiple high-level language constructs.
     54\item
     55The OS presentation of concurrency through multiple kernel threads within an application.
     56\item
     57The OS and library presentation of disk and network I/O, and many secondary library routines that directly and indirectly use these mechanisms.
     58\end{itemize}
     59The key aspect of all of these mechanisms is that control flow can block, which is the enemy of the scheduler.
     60Fundamentally, scheduling needs to understand all the mechanisms used by threads that affect their state changes.
     61
     62Interestingly, there is another major hardware component that affects threading: memory.
     63How memory is organized, and when it is acquired and released, has a significant affect on thread scheduling.
     64To this end, I worked closely with another graduate student, Mubeen Zulfiqar, in the development of a new memory allocator for \CFA.
     65(See Mubeen's thesis~\cite{Zulfiqar22} for a discussion of how threading is managed in the \CFA memory-allocator.)
     66
     67% An important aspect of this approach to threading is how threads are scheduled.
     68As \CFA aims to increase productivity and safety of C, while maintaining its performance, this places a huge burden on the \CFA runtime to achieve these goals.
     69Productivity and safety manifest in removing scheduling pitfalls in the efficient usage of the threading runtime.
     70Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs.
     71
     72This thesis achieves its stated contributes by presenting:
     73\begin{enumerate}[leftmargin=*]
     74\item
     75A scalable low-latency scheduler that offers improved starvation prevention (progress guarantee) compared to other state-of-the-art schedulers, including NUMA awareness.
     76\item
     77The scheduler demonstrates a core algorithm that provides increased fairness through helping, as well as optimizations which virtually remove the cost of this fairness.
     78\item
     79An implementation of user-level \io blocking is incorporated into the scheduler, which achieves the same performance and fairness balance as the scheduler itself.
     80\item
     81These core algorithms are further extended with a low-latency idle-sleep mechanism, which allows the \CFA runtime to stay viable for workloads that do not consistently saturate the system.
     82\end{enumerate}
     83Finally, the complete scheduler is fairly simple with low-cost execution, meaning the total cost of scheduling during thread state changes is low.
    1584
    1685\section{Future Work}
    17 While the \CFA runtime achieves a better compromise in term of performance and fairness than other schedulers, I do believe that further improvements could be made to reduce even further the number of cases where performance deteriorates.
    18 Furthermore, I believe that achieve performance and starvation freedom simultaneously is generally a challenge even outside of scheduling algorithms.
     86While the \CFA runtime achieves a better compromise, in term of performance and fairness, than other schedulers, I believe further improvements can be made to reduce or eliminate the few cases where performance does deteriorate.
     87Fundamentally, achieving performance and starvation freedom will always be opposing goals even outside of scheduling algorithms.
    1988
    2089\subsection{Idle Sleep}
    21 A difficult challenge that was not fully address in this thesis is idle-sleep.
    22 While a correct and somewhat low-cost idle-sleep mechanism was presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system.
     90A difficult challenge, not fully address in this thesis, is idle-sleep.
     91While a correct and somewhat low-cost idle-sleep mechanism is presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system.
    2392The idle sleep mechanism could therefore benefit from a reduction of spurious cases of sleeping.
    2493Furthermore, this thesis did not present any heuristic for when \procs should be put to sleep and when \procs should be woken up.
    25 It is especially worth noting that relaxed timestamps and topology aware helping lead to notable improvements in performance.
    26 Neither of these techniques were used for the idle sleep mechanism.
     94While relaxed timestamps and topology awareness made a notable improvements in performance, neither of these techniques are used for the idle-sleep mechanism.
    2795
    28 There are opportunities where these techniques could be use:
     96Here are opportunities where these techniques could be use:
     97\begin{itemize}
     98\item
    2999The mechanism uses a hand-shake between notification and sleep to ensure that no \at is missed.
    30 The correctness of that hand-shake is cirtical when the last \proc goes to sleep but could be relaxed when several \procs are awake.
     100\item
     101The correctness of that hand-shake is critical when the last \proc goes to sleep but could be relaxed when several \procs are awake.
     102\item
    31103Furthermore, organizing the sleeping \procs as a LIDO stack makes sense to keep cold \procs as cold as possible, but it might be more appropriate to attempt to keep cold CPU sockets instead.
    32 
    33 However, using these techniques could require significant investigation.
    34 For example, keeping a CPU socket cold might be appropriate for power consumption reasons but can affect overall memory bandwith.
    35 The balance between these is not necessarily obvious.
     104\end{itemize}
     105However, using these techniques would require significant investigation.
     106For example, keeping a CPU socket cold might be appropriate for power consumption reasons but can affect overall memory bandwidth.
     107The balance between these approaches is not obvious.
    36108
    37109\subsection{Hardware}
    38 One challenge that needed to be overcome for this thesis was that the modern x86-64 has very few tools to implement fairness.
    39 \Glspl{proc} attempting to help eachother inherently cause cache-coherence traffic.
     110One challenge that needed to be overcome for this thesis is that the modern x86-64 processors has very few tools to implement fairness.
     111\Glspl{proc} attempting to help each other inherently cause cache-coherence traffic.
    40112However, as mentioned in Section~\ref{helping}, relaxed requirements mean this traffic is not necessarily productive.
    41113In cases like this one, there is an opportunity to improve performance by extending the hardware.
    42114
    43 Many different extensions would be suitable here.
    44 For example, when attempting to read remote timestamps when deciding to whether or not to help, it could be useful to allow cancelling the remote read if it will lead to significant latency.
    45 If the latency is due to a recent cache invalidation, it is unlikely that the timestamp is old and that helping will be needed.
     115Many different extensions are suitable here.
     116For example, when attempting to read remote timestamps for helping, it would be useful to allow cancelling the remote read if it leads to significant latency.
     117If the latency is due to a recent cache invalidation, it is unlikely the timestamp is old and that helping is needed.
    46118As such, simply moving on without the result is likely to be acceptable.
    47 Another option would be to attempt to read multiple memory addresses and only wait for \emph{one of} these reads to retire.
    48 This would have a similar effect, where cache-lines with more traffic would be waited on less often.
    49 In both of these examples, some care would probably be needed to make sure that the reads to an address \emph{sometimes} retire.
     119Another option would be to read multiple memory addresses and only wait for \emph{one of} these reads to retire.
     120This approach has a similar effect, where cache-lines with more traffic would be waited on less often.
     121In both of these examples, some care is needed to ensure that reads to an address \emph{sometime} retire.
    50122
    51 Note that this is similar to the feature \newterm{Hardware Transactional Memory}~\cite{HTM}, which allows groups of instructions to be aborted and rolled-back if they encounter memory conflicts when being retired.
     123Note, this idea is similar to \newterm{Hardware Transactional Memory}~\cite{HTM}, which allows groups of instructions to be aborted and rolled-back if they encounter memory conflicts when being retired.
    52124However, I believe this feature is generally aimed at large groups of instructions.
    53 A more fine-grained approach may be more amenable to carefully picking which aspects of an algorithm require exact correctness and which do not.
     125A more fine-grained approach may be more amenable by carefully picking which aspects of an algorithm require exact correctness and which do not.
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

    r38e4c5c r9d67a6d  
    373373
    374374Supporting this case in the webserver would require creating more \procs or creating a dedicated thread-pool.
    375 However, I felt this kind of modification moves to far away from my goal of evaluating the \CFA runtime, \ie it begins writing another runtime system;
     375However, I felt this kind of modification moves too far away from my goal of evaluating the \CFA runtime, \ie it begins writing another runtime system;
    376376hence, I decided to forgo experiments on low-memory performance.
Note: See TracChangeset for help on using the changeset viewer.