Timestamp:
Aug 5, 2022, 4:18:02 PM (4 years ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, ast-experimental, master, pthread-emulation, stuck-waitfor-destruct
Children:
62c5a55
Parents:
511a9368
Message:

Filled in several citations and did some of the todos

Location:
doc/theses/thierry_delisle_PhD/thesis/text
Files:
6 edited

  • doc/theses/thierry_delisle_PhD/thesis/text/core.tex

    r511a9368 r8040286  
    1515For threading, a simple and common execution mental-model is the ``Ideal multi-tasking CPU'':
    1616
    17 \begin{displayquote}[Linux CFS\cit{https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt}]
     17\begin{displayquote}[Linux CFS\cite{MAN:linux/cfs}]
    1818        {[The]} ``Ideal multi-tasking CPU'' is a (non-existent  :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed.  For example: if there are 2 tasks running, then it runs each at 50\% physical power --- i.e., actually in parallel.
    1919        \label{q:LinuxCFS}
     
    183183This suggests the following approach:
    184184
    185 \subsection{Dynamic Entropy}\cit{https://xkcd.com/2318/}
     185\subsection{Dynamic Entropy}\cite{xkcd:dynamicentropy}
    186186The Relaxed-FIFO approach can be made to handle the case of mostly empty subqueues by tweaking the \glsxtrlong{prng}.
    187187The \glsxtrshort{prng} state can be seen as containing a list of all the future subqueues that will be accessed.
    188188While this concept is not particularly useful on its own, the consequence is that if the \glsxtrshort{prng} algorithm can be run \emph{backwards}, then the state also contains a list of all the subqueues that were accessed.
    189 Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators\cit{https://en.wikipedia.org/wiki/Linear\_congruential\_generator} support running the algorithm backwards while offering good quality and performance.
     189Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators\cite{wiki:lcg} support running the algorithm backwards while offering good quality and performance.
    190190This particular \glsxtrshort{prng} can be used as follows:
    191191\begin{itemize}
     
    220220        \input{base.pstex_t}
    221221        \caption[Base \CFA design]{Base \CFA design \smallskip\newline A pool of subqueues offers the sharding, two per \glspl{proc}.
    222         Each \gls{proc} can access all of the subqueues. 
     222        Each \gls{proc} can access all of the subqueues.
    223223        Each \at is timestamped when enqueued.}
    224224        \label{fig:base}
     
    245245\end{figure}
    246246
    247 A simple solution to this problem is to use an exponential moving average\cit{https://en.wikipedia.org/wiki/Moving\_average\#Exponential\_moving\_average} (MA) instead of a raw timestamps, shown in Figure~\ref{fig:base-ma}.
     247A simple solution to this problem is to use an exponential moving average\cite{wiki:ma} (MA) instead of raw timestamps, shown in Figure~\ref{fig:base-ma}.
    248248Note, this is more complex because the \at at the head of a subqueue is still waiting, so its wait time has not ended.
    249249Therefore, the exponential moving average is actually an exponential moving average of how long each dequeued \at has waited.
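To make the moving-average bookkeeping concrete, the following is a minimal C sketch, updated only when an item is dequeued so that, as noted above, only completed waits feed the average; the names and the smoothing constant ALPHA are illustrative assumptions, not the \CFA implementation.

// Sketch: exponential moving average of dequeue wait times (illustrative assumptions only).
#include <stdint.h>

#define ALPHA 0.125   // assumed weight given to the newest sample

struct subqueue_stats {
	double avg_wait;   // moving average of *completed* waits, in the timestamp's unit
};

// Called when an item enqueued at time `enq_ts` is dequeued at time `now`.
static inline void on_dequeue( struct subqueue_stats * s, uint64_t enq_ts, uint64_t now ) {
	double wait = (double)(now - enq_ts);
	s->avg_wait = ALPHA * wait + (1.0 - ALPHA) * s->avg_wait;
}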
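Similarly, the bidirectional \glsxtrshort{prng} from the Dynamic Entropy hunk above can be sketched with a 64-bit Linear Congruential Generator: any odd multiplier is invertible modulo 2^64, so the generator can be stepped both forwards and backwards. The constants below are Knuth's MMIX values, chosen purely for illustration and not necessarily those used by \CFA.

// Sketch: a reversible 64-bit LCG (Knuth's MMIX constants, for illustration only).
#include <stdint.h>

#define LCG_A 6364136223846793005ULL   // odd multiplier, hence invertible modulo 2^64
#define LCG_C 1442695040888963407ULL

static uint64_t inv64( uint64_t a ) {            // modular inverse of an odd a, modulo 2^64
	uint64_t x = a;                              // correct to 3 bits for any odd a
	for ( int i = 0; i < 5; i += 1 ) x *= 2 - a * x;   // each Newton step doubles the correct bits
	return x;
}

static uint64_t prng_next( uint64_t * state ) {  // forward: produces the next subqueue index
	return *state = LCG_A * *state + LCG_C;
}

static uint64_t prng_prev( uint64_t * state ) {  // backward: undoes the last prng_next
	return *state = ( *state - LCG_C ) * inv64( LCG_A );
}

Running prng_prev repeatedly recovers, in reverse order, exactly the values that prng_next produced, which is the property the subsection relies on.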
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

    r511a9368 r8040286  
    1010
    1111\section{Memcached}
    12 Memcached~\cit{memcached} is an in memory key-value store that is used in many production environments, \eg \cit{Berk Atikoglu et al., Workload Analysis of a Large-Scale Key-Value Store,
    13 SIGMETRICS 2012}.
    14 This also server also has the notable added benefit that there exists a full-featured front-end for performance testing called @mutilate@~\cit{mutilate}.
     12Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}.
     13This server also has the notable added benefit that there exists a full-featured front-end for performance testing, called @mutilate@~\cite{GITHUB:mutilate}.
    1514Experimenting on memcached allows for a simple test of the \CFA runtime as a whole: it exercises the scheduler, the idle-sleep mechanism, as well as the \io subsystem for sockets.
    1615This experiment does not exercise the \io subsystem with regard to disk operations.
     
    9897Most of the implementation is fairly straightforward; however, the inclusion of file \io introduces a new challenge that had to be hacked around.
    9998
    100 Normally, webservers use @sendfile@\cit{sendfile} to send files over the socket.
    101 @io_uring@ does not support @sendfile@, it supports @splice@\cit{splice} instead, which is strictly more powerful.
     99Normally, webservers use @sendfile@\cite{MAN:sendfile} to send files over the socket.
     100@io_uring@ does not support @sendfile@, it supports @splice@\cite{splice} instead, which is strictly more powerful.
    102101However, because of how Linux implements file \io, see Subsection~\ref{ononblock}, @io_uring@'s implementation must delegate calls to @splice@ to worker threads inside the kernel.
    103102As of Linux 5.13, @io_uring@ caps the number of these worker threads to @RLIMIT_NPROC@ and therefore, when tens of thousands of splice requests are made, it can create tens of thousands of \glspl{kthrd}.
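To make the @splice@ work-around concrete, below is a minimal (hypothetical, error handling omitted) liburing sketch of sending a file over a socket; because one end of @splice@ must be a pipe, the transfer takes two chained requests, and it is precisely these requests that the kernel hands off to its worker threads.

// Sketch: "sendfile" via two chained splice requests through a pipe (hypothetical, no error handling).
#include <liburing.h>
#include <unistd.h>

// Assumes `len` fits in the pipe buffer; a real server loops until everything is sent.
static void queue_send_file( struct io_uring * ring, int file_fd, int sock_fd, unsigned len ) {
	int pipefd[2];
	pipe( pipefd );

	struct io_uring_sqe * sqe = io_uring_get_sqe( ring );
	io_uring_prep_splice( sqe, file_fd, 0, pipefd[1], -1, len, 0 );   // file -> pipe
	sqe->flags |= IOSQE_IO_LINK;     // only start the next request once this one completes

	sqe = io_uring_get_sqe( ring );
	io_uring_prep_splice( sqe, pipefd[0], -1, sock_fd, -1, len, 0 );  // pipe -> socket

	io_uring_submit( ring );
	// ... reap both completions with io_uring_wait_cqe(), then close both pipe ends ...
}

Each such splice request is the kind of operation @io_uring@ delegates to a kernel worker thread, which is why the @RLIMIT_NPROC@ cap matters at scale.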
     
    108107When the saturation point of the server is attained, latency will increase and inevitably some client connections will time out.
    109108As these clients close their connections, the server must close these sockets without delay so the OS can reclaim the resources used by these connections.
    110 Indeed, until they are closed on the server end, the connection will linger in the CLOSE-WAIT tcp state~\cit{RFC793} and the tcp buffers will be preserved.
     109Indeed, until they are closed on the server end, the connection will linger in the CLOSE-WAIT TCP state~\cite{rfc:tcp} and the TCP buffers will be preserved.
    111110However, this poses a problem using blocking @sendfile@ calls.
    112111The calls can block if they do not have sufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.
  • doc/theses/thierry_delisle_PhD/thesis/text/existing.tex

    r511a9368 r8040286  
    1414
    1515\section{Naming Convention}
    16 Scheduling has been studied by various communities concentrating on different incarnation of the same problems. 
    17 As a result, there are no standard naming conventions for scheduling that is respected across these communities. 
     16Scheduling has been studied by various communities concentrating on different incarnations of the same problem.
     17As a result, there is no standard naming convention for scheduling that is respected across these communities.
    1818This document uses the term \newterm{\Gls{at}} to refer to the abstract objects being scheduled and the term \newterm{\Gls{proc}} to refer to the concrete objects executing these \ats.
    1919
     
    2828\section{Dynamic Scheduling}
    2929\newterm{Dynamic schedulers} determine \ats dependencies and costs during scheduling, if at all.
    30 Hence, unlike static scheduling, \ats dependencies are conditional and detected at runtime. 
     30Hence, unlike static scheduling, \ats dependencies are conditional and detected at runtime.
    3131This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, including suspending or halting an \at that dynamically detects unfulfilled dependencies.
    3232Furthermore, each \at has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled.
     
    5151Most common operating systems use some variant on priorities with overlaps and dynamic priority adjustments.
    5252For example, Microsoft Windows uses a pair of priorities
    53 \cit{https://docs.microsoft.com/en-us/windows/win32/procthread/scheduling-priorities,https://docs.microsoft.com/en-us/windows/win32/taskschd/taskschedulerschema-priority-settingstype-element}, one specified by users out of ten possible options and one adjusted by the system.
     53\cite{win:priority}, one specified by users out of ten possible options and one adjusted by the system.
    5454
    5555\subsection{Uninformed and Self-Informed Dynamic Schedulers}
     
    137137The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests.
    138138
    139 \todo{load balancing}
     139Chapter 1, Section ``Processes, Threads, and Jobs'' of~\cite{russinovich2009windows} discusses the scheduling policy in more depth.
     140Multicore scheduling is based on a combination of priorities and preferred \procs.
     141Each \at is assigned an \newterm{ideal} \proc using a round-robin policy.
     142\Ats are distributed among the \procs according to their priority, preferring to match \ats to their ideal \proc and then to the last \proc they ran on.
     143This is similar to a variation of work stealing, where the stealing \proc restores the \at to its original \proc after running it, but with priorities added into the mix.
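For reference, the ``ideal'' \proc and the user-specified half of the priority pair are also visible to application code through the Win32 API; the snippet below is a small, hypothetical illustration, not taken from the cited source.

// Sketch: the user-visible knobs behind the behaviour described above (illustrative only).
#include <windows.h>

void hint_current_thread( DWORD ideal_cpu ) {
	HANDLE self = GetCurrentThread();
	// A hint only: the scheduler prefers, rather than pins, this processor.
	SetThreadIdealProcessor( self, ideal_cpu );
	// The user-specified half of the priority pair; the system adjusts the other half.
	SetThreadPriority( self, THREAD_PRIORITY_ABOVE_NORMAL );
}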
    140144
    141145\paragraph{Apple OS X}
     
    156160\paragraph{Go}\label{GoSafePoint}
    157161Go's scheduler uses a randomized work-stealing algorithm that has a global run-queue (\emph{GRQ}) and each processor (\emph{P}) has both a fixed-size run-queue (\emph{LRQ}) and a high-priority next ``chair'' holding a single element~\cite{GITHUB:go,YTUBE:go}.
    158 Preemption is present, but only at safe-points,~\cit{https://go.dev/src/runtime/preempt.go} which are inserted detection code at various frequent access boundaries.
     162Preemption is present, but only at safe-points~\cite{go:safepoints}, which are detection code inserted at various frequently-accessed boundaries.
    159163
    160164The algorithm is as follows:
     
    199203
    200204\paragraph{Grand Central Dispatch}
    201 An Apple\cit{Official GCD source} API that offers task parallelism~\cite{wiki:taskparallel}.
     205An Apple\cite{apple:gcd} API that offers task parallelism~\cite{wiki:taskparallel}.
    202206Its distinctive aspect is multiple ``Dispatch Queues'', some of which are created by programmers.
    203207Each queue has its own local ordering guarantees, \eg \ats on queue $A$ are executed in \emph{FIFO} order.
    204208
    205 \todo{load balancing and scheduling}
    206 
    207 % http://web.archive.org/web/20090920043909/http://images.apple.com/macosx/technology/docs/GrandCentral_TB_brief_20090903.pdf
    208 
    209 In terms of semantics, the Dispatch Queues seem to be very similar to Intel\textregistered ~TBB @execute()@ and predecessor semantics.
     209While the documentation only gives limited insight into the scheduling and load-balancing approach, \cite{apple:gcd2} suggests a fairly classic approach:
     210each \proc has a queue of \newterm{blocks} to run, effectively \ats, and drains its queue in \glsxtrshort{fifo} order.
     211They seem to add the concept of dependent queues with clear ordering, where executing a block ends up scheduling more blocks.
     212In terms of semantics, these Dispatch Queues seem to be very similar to Intel\textregistered ~TBB @execute()@ and predecessor semantics.
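As a point of reference, a minimal (hypothetical) use of a serial Dispatch Queue in plain C looks as follows; @dispatch_async_f@ is the function-pointer variant of the block-based @dispatch_async@.

// Sketch: submitting work to a serial Dispatch Queue in plain C (illustrative only).
#include <dispatch/dispatch.h>
#include <stdio.h>

static void work( void * context ) {
	printf( "running block %ld\n", (long)context );
}

int main( void ) {
	// A serial queue drains its blocks in FIFO order, matching the description above.
	dispatch_queue_t q = dispatch_queue_create( "example.queue", DISPATCH_QUEUE_SERIAL );
	for ( long i = 0; i < 3; i += 1 ) dispatch_async_f( q, (void *)i, work );
	dispatch_sync_f( q, (void *)3, work );   // synchronous submission doubles as a barrier here
	dispatch_release( q );
	return 0;
}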
    210213
    211214\paragraph{LibFibre}
  • doc/theses/thierry_delisle_PhD/thesis/text/io.tex

    r511a9368 r8040286  
    141141In the worst case, where all \glspl{thrd} are consistently blocking on \io, it devolves into 1-to-1 threading.
    142142However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \glspl{thrd} are ready to run.
    143 This approach is used by languages like Go\cit{Go}, frameworks like libuv\cit{libuv}, and web servers like Apache~\cite{apache} and Nginx~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
     143This approach is used by languages like Go\cite{GITHUB:go}, frameworks like libuv\cite{libuv}, and web servers like Apache~\cite{apache} and Nginx~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
    144144This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms.
    145145This is in contrast to C, which has a very limited standard api for \io; \eg, the C standard library has no networking.
     
    148148These options effectively fall into two broad camps: waiting for \io to be ready versus waiting for \io to complete.
    149149All operating systems that support asynchronous \io must offer an interface along one of these lines, but the details vary drastically.
    150 For example, Free BSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of use improvements, while Windows (Win32)~\cit{https://docs.microsoft.com/en-us/windows/win32/fileio/synchronous-and-asynchronous-i-o} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.
     150For example, FreeBSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality-of-use improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.
    151151
    152152For this project, I selected @io_uring@, in large part because of its generality.
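For readers unfamiliar with the interface, a minimal (hypothetical, error handling omitted) completion-style read with liburing looks roughly as follows; the operation is submitted and its completion reaped separately, rather than waiting for the file descriptor to become ready.

// Sketch: one completion-based read with liburing (hypothetical, no error handling).
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main( void ) {
	struct io_uring ring;
	io_uring_queue_init( 8, &ring, 0 );                    // small submission/completion rings

	int fd = open( "/etc/hostname", O_RDONLY );
	char buf[256];

	struct io_uring_sqe * sqe = io_uring_get_sqe( &ring );
	io_uring_prep_read( sqe, fd, buf, sizeof(buf), 0 );    // describe the operation
	io_uring_submit( &ring );                              // hand it to the kernel

	struct io_uring_cqe * cqe;
	io_uring_wait_cqe( &ring, &cqe );                      // wait for *completion*, not readiness
	printf( "read %d bytes\n", cqe->res );
	io_uring_cqe_seen( &ring, cqe );

	io_uring_queue_exit( &ring );
	return 0;
}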
  • doc/theses/thierry_delisle_PhD/thesis/text/practice.tex

    r511a9368 r8040286  
    6060Achieving this goal requires each reader to have its own memory to mark as locked and unlocked.
    6161The read acquire possibly waits for a writer to finish the critical section and then acquires a reader's local spinlock.
    62 The write acquire acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks.
     62The write acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks.
    6363Acquiring all the local read locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer.
    64 
    6564Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock.
    6665The lock is nonblocking, so both readers and writers spin while the lock is held.
    67 \todo{finish explanation}
     66This very wide sharding strategy means that readers have very good locality, since they only ever need to access two memory locations.
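The following is only a rough, generic sketch of such a wide-sharded readers-writer lock; the actual lock is the one in Figure~\ref{f:SpecializedReadersWriterLock}, and the names, padding, and fixed reader count here are illustrative assumptions.

// Rough sketch of a wide-sharded readers-writer lock (illustrative; not the thesis code).
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_READERS 128                                    // assumed fixed number of reader slots

struct reader_slot { _Atomic bool lock; char pad[63]; };   // one cache line per reader

static struct reader_slot slots[MAX_READERS];
static _Atomic bool writer_lock;

void read_lock( int id ) {                                 // `id` indexes this reader's private slot
	for (;;) {
		while ( atomic_load( &writer_lock ) ) {}           // possibly wait for a writer to finish
		bool expected = false;
		if ( atomic_compare_exchange_weak( &slots[id].lock, &expected, true ) ) return;
	}
}
void read_unlock( int id ) { atomic_store( &slots[id].lock, false ); }

void write_lock( void ) {
	bool expected = false;                                 // global lock: writer/writer exclusion
	while ( ! atomic_compare_exchange_weak( &writer_lock, &expected, true ) ) expected = false;
	for ( int i = 0; i < MAX_READERS; i += 1 ) {           // then acquire every local reader lock
		expected = false;
		while ( ! atomic_compare_exchange_weak( &slots[i].lock, &expected, true ) ) expected = false;
	}
}
void write_unlock( void ) {
	for ( int i = 0; i < MAX_READERS; i += 1 ) atomic_store( &slots[i].lock, false );
	atomic_store( &writer_lock, false );
}

In this sketch a reader touches only the global writer flag and its own slot, which is the two-location locality claim made above.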
    6867
    6968\begin{figure}
     
    138137
    139138\subsection{Event FDs}
    140 Another interesting approach is to use an event file descriptor\cit{eventfd}.
     139Another interesting approach is to use an event file descriptor\cite{eventfd}.
    141140This Linux feature is a file descriptor that behaves like \io, \ie, uses @read@ and @write@, but also behaves like a semaphore.
    142141Indeed, all reads and writes must use word-sized values, \ie 64 or 32 bits.
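A minimal (hypothetical) illustration of the semaphore-like behaviour of an event file descriptor:

// Sketch: an eventfd used like a semaphore (error handling omitted).
#include <sys/eventfd.h>
#include <stdint.h>
#include <unistd.h>
#include <stdio.h>

int main( void ) {
	int efd = eventfd( 0, 0 );              // internal 64-bit counter starts at 0

	uint64_t post = 2;
	write( efd, &post, sizeof(post) );      // adds 2 to the counter, waking any blocked reader

	uint64_t value;
	read( efd, &value, sizeof(value) );     // without EFD_SEMAPHORE, returns and resets the whole count
	printf( "woke with count %llu\n", (unsigned long long)value );   // prints 2

	close( efd );
	return 0;
}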
     
    218217\end{figure}
    219218
    220 The next optimization is to avoid the latency of the event @fd@, which can be done by adding what is effectively a binary benaphore\cit{benaphore} in front of the event @fd@.
     219The next optimization is to avoid the latency of the event @fd@, which can be done by adding what is effectively a binary benaphore\cite{schillings1996engineering} in front of the event @fd@.
    221220The benaphore over the event @fd@ logically provides a three-state flag to avoid unnecessary system calls, where the states are expressed explicitly in Figure~\ref{fig:idle:state}.
    222221A \proc begins its idle sleep by adding itself to the idle list before searching for an \at.
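The following is only a generic sketch of the benaphore idea, gating an event @fd@ behind an atomic counter so the common case avoids a system call; it is not the \CFA idle-sleep code, and the three states in Figure~\ref{fig:idle:state} are the thesis's own.

// Generic benaphore sketch over an eventfd (illustrative; not the \CFA idle-sleep code).
#include <stdatomic.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

static _Atomic long count = 0;   // tokens available; negative means waiters are (or will be) blocked
static int efd;

void benaphore_init( void ) {
	efd = eventfd( 0, EFD_SEMAPHORE );             // EFD_SEMAPHORE: each read consumes exactly one post
}

void benaphore_wait( void ) {
	if ( atomic_fetch_sub( &count, 1 ) < 1 ) {     // no token available: fall back to the kernel
		uint64_t v;
		read( efd, &v, sizeof(v) );                // blocks until a matching post arrives
	}
}

void benaphore_post( void ) {
	if ( atomic_fetch_add( &count, 1 ) < 0 ) {     // someone is (or will be) blocked
		uint64_t one = 1;
		write( efd, &one, sizeof(one) );           // only now pay for the system call
	}
}

In the common, uncontended case both operations complete with a single atomic instruction and never touch the event @fd@.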
  • doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex

    r511a9368 r8040286  
    6262Only UNIX @man@ pages identify whether or not a library function is thread safe, and hence may block on a pthreads lock or system call; therefore, interoperability with UNIX library functions is a challenge for an M:N threading model.
    6363
    64 Languages like Go and Java, which have strict interoperability with C\cit{JNI, GoLang with C}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur.
     64Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur.
    6565
    6666As mentioned in Section~\ref{intro}, \CFA is binary compatible with C and, as such, must support all C library functions. Furthermore, interoperability can happen at the function-call level, inline code, or C and \CFA translation units linked together. This fine-grained interoperability between C and \CFA has two consequences: