Changeset 0fec6c1


Timestamp: Sep 5, 2022, 9:41:11 AM
Author: Peter A. Buhr <pabuhr@…>
Branches: ADT, ast-experimental, master, pthread-emulation
Children: 1fcbce7, 83cb754
Parents: 4dba1da
Message: proofread conclusion chapter
Location: doc
Files: 3 edited

  • doc/bibliography/pl.bib

r4dba1da r0fec6c1
    3757     3757       series      = {Innovative Technology},
    3758     3758       year        = 1991,
             3759 + }
             3760 +
             3761 + @mastersthesis{Zulfiqar22,
             3762 +     keywords    = {Cforall, memory allocation, threading},
             3763 +     contributer = {pabuhr@plg},
             3764 +     author      = {Mubeen Zulfiqar},
             3765 +     title       = {High-Performance Concurrent Memory Allocation},
             3766 +     school      = {School of Computer Science, University of Waterloo},
             3767 +     year        = 2022,
             3768 +     address     = {Waterloo, Ontario, Canada, N2L 3G1},
             3769 +     note        = {\href{https://uwspace.uwaterloo.ca/handle/10012/18329}{https://\-uwspace.uwaterloo.ca/\-handle/\-10012/18329}},
    3759     3770   }
    3760     3771
  • doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex

r4dba1da r0fec6c1
       5        5   Because I am the main developer for both components of this project, there is strong continuity across the design and implementation.
       6        6   This continuity provides a consistent approach to advanced control-flow and concurrency, with easier development, management and maintenance of the runtime in the future.
                7 + I believed my Masters work would provide the background to make the Ph.D work reasonably straightforward.
                8 + However, I discovered two significant challenges.
       7        9
       8          - I believed my Masters work would provide the background to make the Ph.D work reasonably straightforward.
       9          - However, in doing so I discovered two expected challenges.
      10          - First, while modern symmetric multiprocessing CPU have significant performance penalties for communicating across cores.
      11          - This makes implementing fair schedulers notably more difficult, since fairness generally requires \procs to be aware of each other's progress.
      12          - This challenge is made even harder when comparing against MQMS schedulers (see Section\ref{sched}) which have very little inter-\proc communication.
      13          - This is particularly true of state-of-the-art work-stealing schedulers, which can have virtually no inter-\proc communication in some common workloads.
      14          - This means that when adding fairness to work-stealing schedulers, extreme care must be taken to hide the communication costs so performance does not suffer.
      15          - Second, the kernel locking, threading, and I/O in the Linux operating system offers very little flexibility of use.
               10 + First, modern symmetric multiprocessing CPU have significant performance penalties for communication (often cache related).
               11 + A SQMS scheduler (see Section~\ref{sched}), with its \proc-shared ready-queue, has perfect load-balancing but poor affinity resulting in high communication across \procs.
               12 + A MQMS scheduler, with its \proc-specific ready-queues, has poor load-balancing but perfect affinity often resulting in significantly reduced communication.
               13 + However, implementing fairness for an MQMS scheduler is difficult, since fairness requires \procs to be aware of each other's ready-queue progress, \ie communicated knowledge.
               14 + % This challenge is made harder when comparing against MQMS schedulers (see Section\ref{sched}) which have very little inter-\proc communication.
               15 + For balanced workloads with little or no data sharing (embarrassingly parallel), an MQMS scheduler is near optimal, \eg state-of-the-art work-stealing schedulers.
               16 + For these kinds of fair workloads, adding fairness must be low-cost to hide the communication costs needed for global ready-queue progress or performance suffers.
               17 +
               18 + Second, the kernel locking, threading, and I/O in the Linux operating system offers very little flexibility, and are not designed to facilitate user-level threading.
      16       19   There are multiple concurrency aspects in Linux that require carefully following a strict procedure in order to achieve acceptable performance.
      17       20   To be fair, many of these concurrency aspects were designed 30-40 years ago, when there were few multi-processor computers and concurrency knowledge was just developing.
       …        …
      21       24   The positive is that @io_uring@ supports the panoply of I/O mechanisms in Linux;
      22       25   hence, the \CFA runtime uses one I/O mechanism to provide non-blocking I/O, rather than using @select@ to handle TTY I/O, @epoll@ to handle network I/O, and managing a thread pool to handle disk I/O.
      23          - Merging all these different I/O mechanisms into a coherent scheduling implementation would require a much more work than what is present in this thesis, as well as detailed knowledge of the I/O mechanisms in Linux.
               26 + Merging all these different I/O mechanisms into a coherent scheduling implementation would require much more work than what is present in this thesis, as well as a detailed knowledge of multiple I/O mechanisms.
      24       27   The negative is that @io_uring@ is new and developing.
      25       28   As a result, there is limited documentation, few places to find usage examples, and multiple errors that required workarounds.
               29 +
      26       30   Given what I now know about @io_uring@, I would say it is insufficiently coupled with the Linux kernel to properly handle non-blocking I/O.
      27          - It does not seem to reach deep into the Kernel's handling of \io, and as such it must contend with the same realities that users of epoll must contend with.
               31 + It does not seem to reach deep into the kernel's handling of \io, and as such it must contend with the same realities that users of @epoll@ must contend with.
      28       32   Specifically, in cases where @O_NONBLOCK@ behaves as desired, operations must still be retried.
      29          - To preserve the illusion of asynchronicity, this requires delegating operations to kernel threads.
      30          - This is also true of cases where @O_NONBLOCK@ does not prevent blocking.
               33 + To preserve the illusion of asynchronicity requires delegating these operations to kernel threads.
               34 + This requirement is also true of cases where @O_NONBLOCK@ does not prevent blocking.
      31       35   Spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel, and managing these threads adds significant burden to the system.
      32       36   Nonblocking I/O should not be handled in this way.
       …        …
      43       47   The OS and library presentation of disk and network I/O, and many secondary library routines that directly and indirectly use these mechanisms.
      44       48   \end{itemize}
      45          - The key aspect of all of these mechanisms is that control flow can block, which immidiately hinders any level above from making scheduling decision as a result.
               49 + The key aspect of all of these mechanisms is that control flow can block, which immediately hinders any level above from making scheduling decision as a result.
      46       50   Fundamentally, scheduling needs to understand all the mechanisms used by threads that affect their state changes.
      47       51
       …        …
      49       53   However, direct hardware scheduling is only possible in the OS.
      50       54   Instead, this thesis is performing arms-length application scheduling of the hardware components through a set of OS interfaces that indirectly manipulate the hardware components.
      51          - This can quickly lead to tensions if the OS interface was built with different use cases in mind.
               55 + This can quickly lead to tensions when the OS interface has different use cases in mind.
      52       56
      53       57   As \CFA aims to increase productivity and safety of C, while maintaining its performance, this places a huge burden on the \CFA runtime to achieve these goals.
       …        …
      66       70   These core algorithms are further extended with a low-latency idle-sleep mechanism, which allows the \CFA runtime to stay viable for workloads that do not consistently saturate the system.
      67       71   \end{enumerate}
      68          - Finally, the complete scheduler is fairly simple with low-cost execution, meaning the total cost of scheduling during thread state changes is low.
               72 + Finally, the complete scheduler is fairly simple with low-cost execution, meaning the total cost of scheduling during thread state-changes is low.
      69       73
      70       74   \section{Future Work}
       …        …
      84       88   The mechanism uses a hand-shake between notification and sleep to ensure that no \at is missed.
      85       89   \item
      86          - The correctness of that hand-shake is critical when the last \proc goes to sleep but could be relaxed when several \procs are awake.
               90 + The hand-shake correctness is critical when the last \proc goes to sleep but could be relaxed when several \procs are awake.
      87       91   \item
      88       92   Furthermore, organizing the sleeping \procs as a LIFO stack makes sense to keep cold \procs as cold as possible, but it might be more appropriate to attempt to keep cold CPU sockets instead.
       …        …
      91       95   For example, keeping a CPU socket cold might be appropriate for power consumption reasons but can affect overall memory bandwidth.
      92       96   The balance between these approaches is not obvious.
               97 + I am aware there is a host of low-power research that could be tapped here.
      93       98
      94       99   \subsection{Hardware}
       …        …
     102      107   If the latency is due to a recent cache invalidation, it is unlikely the timestamp is old and that helping is needed.
     103      108   As such, simply moving on without the result is likely to be acceptable.
     104          - Another option would be to read multiple memory addresses and only wait for \emph{one of} these reads to retire.
     105          - This approach has a similar effect, where cache-lines with more traffic would be waited on less often.
              109 + Another option is to read multiple memory addresses and only wait for \emph{one of} these reads to retire.
              110 + This approach has a similar effect, where cache lines with more traffic are on less often.
     106      111   In both of these examples, some care is needed to ensure that reads to an address \emph{sometime} retire.
     107      112
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

r4dba1da r0fec6c1
     182      182   NGINX is an high-performance, \emph{full-service}, event-driven webserver.
     183      183   It can handle both static and dynamic web content, as well as serve as a reverse proxy and a load balancer~\cite{reese2008nginx}.
     184          - This wealth of features comes with a variety of potential configuration, dictating both available features and performance.
              184 + This wealth of capabilities comes with a variety of potential configurations, dictating available features and performance.
     185      185   The NGINX server runs a master process that performs operations such as reading configuration files, binding to ports, and controlling worker processes.
     186          - When running as a static webserver, uses an event driven architecture to server incoming requests.
     187          - Incoming connections are assigned a \emph{statckless} HTTP state-machine and worker processes can potentially handle thousands of these state machines.
     188          - For the following experiment, NGINX was configured to use epoll to listen for events on these state machines and have each worker process independently accept new connections.
     189          - Because of the realities of Linux, see Subsection~\ref{ononblock}, NGINX also maintains a pool of auxilary threads to handle block \io.
     190          - The configuration can be used to set the number of worker processes desired, as well as the size of the auxilary pool.
     191          - However, for the following experiments NGINX was configured to let the master process decided the appropriate number of threads.
              186 + When running as a static webserver, it uses an event-driven architecture to service incoming requests.
              187 + Incoming connections are assigned a \emph{stackless} HTTP state-machine and worker processes can handle thousands of these state machines.
              188 + For the following experiment, NGINX is configured to use @epoll@ to listen for events on these state machines and have each worker process independently accept new connections.
              189 + Because of the realities of Linux, see Subsection~\ref{ononblock}, NGINX also maintains a pool of auxiliary threads to handle blocking \io.
              190 + The configuration can set the number of worker processes desired, as well as the size of the auxiliary pool.
              191 + However, for the following experiments, NGINX is configured to let the master process decided the appropriate number of threads.
     192      192
     193      193