# source:doc/proposals/concurrency/text/internals.tex@b3ffb61

aaron-thesisarm-ehcleanup-dtorsdeferred_resndemanglerjacob/cs343-translationjenkins-sandboxnew-astnew-ast-unique-exprnew-envno_listpersistent-indexerresolv-newwith_gc
Last change on this file since b3ffb61 was b3ffb61, checked in by Thierry Delisle <tdelisle@…>, 4 years ago

• Property mode set to 100644
File size: 23.7 KB
Line
1
2\chapter{Behind the scene}
3There are several challenges specific to \CFA when implementing concurrency. These challenges are a direct result of \gls{bulk-acq} and loose object-definitions. These two constraints are the root cause of most design decisions in the implementation. Furthermore, to avoid contention from dynamically allocating memory in a concurrent environment, the internal-scheduling design is (almost) entirely free of mallocs. This is to avoid the chicken and egg problem \cite{Chicken} of having a memory allocator that relies on the threading system and a threading system that relies on the runtime. This extra goal, means that memory management is a constant concern in the design of the system.
4
5The main memory concern for concurrency is queues. All blocking operations are made by parking threads onto queues. The queue design needs to be intrusive\cite{IntrusiveData} to avoid the need for memory allocation, which entails that all the nodes need specific fields to keep track of all needed information. Since many concurrency operations can use an unbound amount of memory (depending on \gls{bulk-acq}), statically defining information in the intrusive fields of threads is insufficient. The only variable sized container that does not require memory allocation is the callstack, which is heavily used in the implementation of internal scheduling. Particularly variable length arrays, which are used extensively.
6
7Since stack allocation is based around scope, the first step of the implementation is to identify the scopes that are available to store the information, and which of these can have a variable length. The threads and the condition both allow a fixed amount of memory to be stored, while mutex-routines and the actual blocking call allow for an unbound amount (though the later is preferable in terms of performance).
8
9Note that since the major contributions of this thesis are extending monitor semantics to \gls{bulk-acq} and loose object definitions, any challenges that are not resulting of these characteristiques of \CFA are considered as solved problems and therefore not discussed further.
10
11% ======================================================================
12% ======================================================================
13\section{Mutex routines}
14% ======================================================================
15% ======================================================================
16
17The first step towards the monitor implementation is simple mutex-routines using monitors. In the single monitor case, this is done using the entry/exit procedure highlighted in listing \ref{lst:entry1}. This entry/exit procedure does not actually have to be extended to support multiple monitors, indeed it is sufficient to enter/leave monitors one-by-one as long as the order is correct to prevent deadlocks\cite{Havender68}. In \CFA, ordering of monitor relies on memory ordering, this is sufficient because all objects are guaranteed to have distinct non-overlaping memory layouts and mutual-exclusion for a monitor is only defined for its lifetime, meaning that destroying a monitor while it is acquired is undefined behavior. When a mutex call is made, the concerned monitors are agregated into a variable-length pointer array and sorted based on pointer values. This array presists for the entire duration of the mutual-exclusion and its ordering reused extensively.
18\begin{figure}
19\begin{multicols}{2}
20Entry
21\begin{pseudo}
22if monitor is free
23        enter
25        continue
26else
27        block
28increment recursions
29\end{pseudo}
30\columnbreak
31Exit
32\begin{pseudo}
33decrement recursion
34if recursion == 0
35        if entry queue not empty
37\end{pseudo}
38\end{multicols}
39\caption{Initial entry and exit routine for monitors}
40\label{lst:entry1}
41\end{figure}
42
43\subsection{ Details: Interaction with polymorphism}
44Depending on the choice of semantics for when monitor locks are acquired, interaction between monitors and \CFA's concept of polymorphism can be more complex to support. However, it is shown that entry-point locking solves most of the issues.
45
46First of all, interaction between \code{otype} polymorphism and monitors is impossible since monitors do not support copying. Therefore, the main question is how to support \code{dtype} polymorphism. It is important to present the difference between the two acquiring options : callsite and entry-point locking, i.e. acquiring the monitors before making a mutex routine call or as the first operation of the mutex routine-call. For example:
47\begin{figure}[H]
48\begin{center}
49\begin{tabular}{|c|c|c|}
50Mutex & \gls{callsite-locking} & \gls{entry-point-locking} \\
51call & pseudo-code & pseudo-code \\
52\hline
53\begin{cfacode}[tabsize=3]
54void foo(monitor& mutex a){
55
56        //Do Work
57        //...
58
59}
60
61void main() {
62        monitor a;
63
64        foo(a);
65
66}
67\end{cfacode} & \begin{pseudo}[tabsize=3]
68foo(& a) {
69
70        //Do Work
71        //...
72
73}
74
75main() {
76        monitor a;
77        acquire(a);
78        foo(a);
79        release(a);
80}
81\end{pseudo} & \begin{pseudo}[tabsize=3]
82foo(& a) {
83        acquire(a);
84        //Do Work
85        //...
86        release(a);
87}
88
89main() {
90        monitor a;
91
92        foo(a);
93
94}
95\end{pseudo}
96\end{tabular}
97\end{center}
98\caption{Call-site vs entry-point locking for mutex calls}
99\label{fig:locking-site}
100\end{figure}
101
102Note the \code{mutex} keyword relies on the type system, which means that in cases where a generic monitor routine is desired, writing the mutex routine is possible with the proper trait, for example:
103\begin{cfacode}
104//Incorrect: T may not be monitor
105forall(dtype T)
106void foo(T * mutex t);
107
108//Correct: this function only works on monitors (any monitor)
109forall(dtype T | is_monitor(T))
110void bar(T * mutex t));
111\end{cfacode}
112
113Both entry-point and callsite locking are feasible implementations. The current \CFA implementations uses entry-point locking because it requires less work when using \gls{raii}, effectively transferring the burden of implementation to object construction/destruction. The same could be said of callsite locking, the difference being that the later does not necessarily have an existing scope that matches exactly the scope of the mutual exclusion, i.e.: the function body. Furthermore, entry-point locking requires less code generation since any useful routine is called at least as often as it is define, there can be only one entry-point but many callsites.
114
115% ======================================================================
116% ======================================================================
118% ======================================================================
119% ======================================================================
120
121Figure \ref{fig:system1} shows a high-level picture if the \CFA runtime system in regards to concurrency. Each component of the picture is explained in details in the fllowing sections.
122
123\begin{figure}
124\begin{center}
125{\resizebox{\textwidth}{!}{\input{system.pstex_t}}}
126\end{center}
127\caption{Overview of the entire system}
128\label{fig:system1}
129\end{figure}
130
131\subsection{Context Switching}
132As mentionned in section \ref{coroutine}, coroutines are a stepping stone for implementing threading. This is because they share the same mechanism for context-switching between different stacks. To improve performance and simplicity, context-switching is implemented using the following assumption: all context-switches happen inside a specific function call. This assumption means that the context-switch only has to copy the callee-saved registers onto the stack and then switch the stack registers with the ones of the target coroutine/thread. Note that the instruction pointer can be left untouched since the context-switch is always inside the same function. Threads however do not context-switch between each other directly. They context-switch to the scheduler. This method is called a 2-step context-switch and has the advantage of having a clear distinction between user code and the kernel where scheduling and other system operation happen. Obiously, this has the cost of doubling the context-switch cost because threads must context-switch to an intermediate stack. However, the performance of the 2-step context-switch is still superior to a \code{pthread_yield}(see section \ref{results}). additionally, for users in need for optimal performance, it is important to note that having a 2-step context-switch as the default does not prevent \CFA from offering a 1-step context-switch to use manually (or as part of monitors). This option is not currently present in \CFA but the changes required to add it are strictly additive.
133
134\subsection{Processors}
135Parallelism in \CFA is built around using processors to specify how much parallelism is desired. \CFA processors are object wrappers around kernel threads, specifically pthreads in the current implementation of \CFA. Indeed, any parallelism must go through operating-system librairies. However, \glspl{uthread} are still the main source of concurrency, processors are simply the underlying source of parallelism. Indeed, processor \glspl{kthread} simply fetch a \glspl{uthread} from the scheduler and run, they are effectively executers for user-threads. The main benefit of this approach is that it offers a well defined boundary between kernel code and user code, for example, kernel thread quiescing, scheduling and interrupt handling. Processors internally use coroutines to take advantage of the existing context-switching semantics.
136
137\subsection{Stack management}
138One of the challenges of this system is to reduce the footprint as much as possible. Specifically, all pthreads created also have a stack created with them, which should be used as much as possible. Normally, coroutines also create there own stack to run on, however, in the case of the coroutines used for processors, these coroutines run directly on the kernel thread stack, effectively stealing the processor stack. The exception to this rule is the Main Processor, i.e. the initial kernel thread that is given to any program. In order to respect user expectations, the stack of the initial kernel thread, the main stack of the program, is used by the main user thread rather than the main processor.
139
140\subsection{Preemption} \label{preemption}
141Finally, an important aspect for any complete threading system is preemption. As mentionned in chapter \ref{basics}, preemption introduces an extra degree of uncertainty, which enables users to have multiple threads interleave transparently, rather than having to cooperate among threads for proper scheduling and CPU distribution. Indeed, preemption is desireable because it adds a degree of isolation among threads. In a fully cooperative system, any thread that runs into a long loop can starve other threads, while in a preemptive system starvation can still occur but it does not rely on every thread having to yield or block on a regular basis, which reduces significantly a programmer burden. Obviously, preemption is not optimal for every workload, however any preemptive system can become a cooperative system by making the time-slices extremely large. Which is why \CFA uses a preemptive threading system.
142
143Preemption in \CFA is based on kernel timers, which are used to run a discrete-event simulation. Every processor keeps track of the current time and registers an expiration time with the preemption system. When the preemption system receives a change in preemption, it sorts these expiration times in a list and sets a kernel timer for the closest one, effectively stepping between preemption events on each signals sent by the timer. These timers use the linux signal {\tt SIGALRM}, which is delivered to the process rather than the kernel-thread. This results in an implementation problem,because when delivering signals to a process, the kernel documentation states that the signal can be delivered to any kernel thread for which the signal is not blocked i.e. :
144\begin{quote}
145A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked. If more than one of the threads has the signal unblocked, then the kernel chooses an arbitrary thread to which to deliver the signal.
146SIGNAL(7) - Linux Programmer's Manual
147\end{quote}
148For the sake of simplicity and in order to prevent the case of having two threads receiving alarms simultaneously, \CFA programs block the {\tt SIGALRM} signal on every thread except one. Now because of how involontary context-switches are handled, the kernel thread handling {\tt SIGALRM} cannot also be a processor thread.
149
150Involuntary context-switching is done by sending signal {\tt SIGUSER1} to the corresponding processor and having the thread yield from inside the signal handler. Effectively context-switching away from the signal-handler back to the kernel and the signal-handler frame is eventually unwound when the thread is scheduled again. This approach means that a signal-handler can start on one kernel thread and terminate on a second kernel thread (but the same user thread). It is important to note that signal-handlers save and restore signal masks because user-thread migration can cause signal mask to migrate from one kernel thread to another. This behaviour is only a problem if all kernel threads among which a user thread can migrate differ in terms of signal masks\footnote{Sadly, official POSIX documentation is silent on what distiguishes async-signal-safe'' functions from other functions}. However, since the kernel thread hanlding preemption requires a different signal mask, executing user threads on the kernel alarm thread can cause deadlocks. For this reason, the alarm thread is on a tight loop around a system call to \code{sigwaitinfo}, requiring very little CPU time for preemption. One final detail about the alarm thread is how to wake it when additional communication is required (e.g., on thread termination). This unblocking is also done using {\tt SIGALRM}, but sent throught the \code{pthread_sigqueue}. Indeed, \code{sigwait} can differentiate signals sent from \code{pthread_sigqueue} from signals sent from alarms or the kernel.
151
152\subsection{Scheduler}
153Finally, an aspect that was not mentionned yet is the scheduling algorithm. Currently, the \CFA scheduler uses a single ready queue for all processors, which is the simplest approach to scheduling. Further discussion on scheduling is present in section \label{futur:sched}.
154
155% ======================================================================
156% ======================================================================
157\section{Internal scheduling} \label{impl:intsched}
158% ======================================================================
159% ======================================================================
160The following figure is the traditional illustration of a monitor (repeated from page~\pageref{fig:monitor} for convenience) :
161
162\begin{figure}[H]
163\begin{center}
164{\resizebox{0.4\textwidth}{!}{\input{monitor}}}
165\end{center}
167\label{fig:monitor}
168\end{figure}
169
170This picture has several components, the two most important being the entry-queue and the AS-stack. The entry-queue is an (almost) FIFO list where threads waiting to enter are parked, while the acceptor-signalor (AS) stack is a FILO list used for threads that have been signalled or otherwise marked as running next.
171
172For \CFA, this picture does not have support for blocking multiple monitors on a single condition. To support \gls{bulk-acq} two changes to this picture are required. First, it is non longer helpful to attach the condition to a single monitor. Secondly, the thread waiting on the conditions has to be seperated multiple monitors, which yields :
173
174\begin{figure}[H]
175\begin{center}
176{\resizebox{0.8\textwidth}{!}{\input{int_monitor}}}
177\end{center}
178\caption{Illustration of \CFA monitor}
179\label{fig:monitor_cfa}
180\end{figure}
181
182This picture and the proper entry and leave algorithms is the fundamental implementation of internal scheduling (see listing \ref{lst:entry2}). Note that when threads are moved from the condition to the AS-stack, it splits the thread into to pieces. The thread is woken up when all the pieces have moved from the AS-stacks to the active thread seat. In this picture, the threads are split into halves but this is only because there are two monitors in this picture. For a specific signaling operation every monitor needs a piece of thread on its AS-stack.
183
184\begin{figure}[b]
185\begin{multicols}{2}
186Entry
187\begin{pseudo}
188if monitor is free
189        enter
191        continue
192else
193        block
194increment recursion
195
196\end{pseudo}
197\columnbreak
198Exit
199\begin{pseudo}
200decrement recursion
201if recursion == 0
202        if signal_stack not empty
206
207        if entry queue not empty
209\end{pseudo}
210\end{multicols}
211\caption{Entry and exit routine for monitors with internal scheduling}
212\label{lst:entry2}
213\end{figure}
214
215Some important things to notice about the exit routine. The solution discussed in \ref{intsched} can be seen in the exit routine of listing \ref{lst:entry2}. Basically, the solution boils down to having a seperate data structure for the condition queue and the AS-stack, and unconditionally transferring ownership of the monitors but only unblocking the thread when the last monitor has transferred ownership. This solution is deadlock safe as well as preventing any potential barging. The data structure used for the AS-stack are reused extensively for external scheduling, but in the case of internal scheduling, the data is allocated using variable-length arrays on the callstack of the \code{wait} and \code{signal_block} routines.
216
217\begin{figure}[H]
218\begin{center}
219{\resizebox{0.8\textwidth}{!}{\input{monitor_structs.pstex_t}}}
220\end{center}
221\caption{Data structures involved in internal/external scheduling}
222\label{fig:structs}
223\end{figure}
224
225Figure \ref{fig:structs} shows a high level representation of these data-structures. The main idea behind them is that, while figure \ref{fig:monitor_cfa} is a nice illustration in theory, in practice breaking a threads into multiple pieces to put unto intrusive stacks does not make sense. The \code{condition node} is the data structure that is queued into a condition variable and, when signaled, the condition queue is popped and each \code{condition criterion} are moved to the AS-stack. Once all the criterion have be popped from their respective AS-stacks, the thread is woken-up, which is what is shown in listing \ref{lst:entry2}.
226
227% ======================================================================
228% ======================================================================
229\section{External scheduling}
230% ======================================================================
231% ======================================================================
232Similarly to internal scheduling, external scheduling for multiple monitors relies on the idea that waiting-thread queues are no longer specific to a single monitor, as mentionned in section \ref{extsched}. For internal scheduling, these queues are part of condition variables which are still unique for a given scheduling operation (e.g., no single statment uses multiple conditions). However, in the case of external scheduling, there is no equivalent object which is associated with \code{waitfor} statements. This absence means the queues holding the waiting threads must be stored inside at least one of the monitors that is acquired. The monitors being the only objects that have sufficient lifetime and are available on both sides of the \code{waitfor} statment. This requires an algorithm to choose which monitor holds the relevant queue. It is also important that said algorithm be independent of the order in which users list parameters. The proposed algorithm is to fall back on monitor lock ordering and specify that the monitor that is acquired first is the one with the relevant wainting queue. This assumes that the lock acquiring order is static for the lifetime of all concerned objects but that is a reasonable constraint.
233
234This algorithm choice has two consequences :
235\begin{itemize}
236        \item The queue of the highest priority monitor is no longer a true FIFO queue because threads can be moved to the front of the queue. These queues need to contain a set of monitors for each of the waiting threads. Therefore, another thread whose set contains the same highest priority monitor but different lower priority monitors may arrive first but enter the critical section after a thread with the correct pairing.
237        \item The queue of the lowest priority monitor is both required and potentially unused. Indeed, since it is not known at compile time which monitor will be the lowest priority monitor, every monitor needs to have the correct queues even though it is possible that some queues will go unused for the entire duration of the program, for example if a monitor is only used in a specific pair.
238\end{itemize}
239
240Therefore, the following modifications need to be made to support external scheduling :
241\begin{itemize}
242        \item The threads waiting on the entry-queue need to keep track of which routine is trying to enter, and using which set of monitors. The \code{mutex} routine already has all the required information on its stack so the thread only needs to keep a pointer to that information.
243        \item The monitors need to keep a mask of acceptable routines. This mask contains for each acceptable routine, a routine pointer and an array of monitors to go with it. It also needs storage to keep track of which routine was accepted. Since this information is not specific to any monitor, the monitors actually contain a pointer to an integer on the stack of the waiting thread. Note that the complete mask can be pushed to any owned monitors, regardless of \code{when} statements, the \code{waitfor} statement is used in a context where the thread already has full ownership of (at least) every concerned monitor and therefore monitors will refuse all calls no matter what.
244        \item The entry/exit routine need to be updated as shown in listing \ref{lst:entry3}.
245\end{itemize}
246
247\subsection{External scheduling - destructors}
248Finally, to support the ordering inversion of destructors, the code generation needs to be modified to use a special entry routine. This routine is needed because of the storage requirements of the call order inversion. Indeed, when waiting for the destructors, storage is need for the waiting context and the lifetime of said storage needs to outlive the waiting operation it is needed for. For regular \code{waitfor} statements, the callstack of the routine itself matches this requirement but it is no longer the case when waiting for the destructor since it is pushed on to the AS-stack for later. The waitfor semantics can then be adjusted correspondingly, as seen in listing \ref{lst:entry-dtor}
249
250\begin{figure}
251\begin{multicols}{2}
252Entry
253\begin{pseudo}
254if monitor is free
255        enter
257        continue
259        push criterions to AS-stack
260        continue
261else
262        block
263increment recursion
264\end{pseudo}
265\columnbreak
266Exit
267\begin{pseudo}
268decrement recursion
269if recursion == 0
270        if signal_stack not empty
274                endif
275        endif
276
277        if entry queue not empty
279        endif
280\end{pseudo}
281\end{multicols}
282\caption{Entry and exit routine for monitors with internal scheduling and external scheduling}
283\label{lst:entry3}
284\end{figure}
285
286\begin{figure}
287\begin{multicols}{2}
288Destructor Entry
289\begin{pseudo}
290if monitor is free
291        enter
293        increment recursion
294        return
295create wait context
298        push self to AS-stack
299        baton pass
300else
301        wait
302increment recursion
303\end{pseudo}
304\columnbreak
305Waitfor
306\begin{pseudo}
308        if found destructor
309                push destructor to AS-stack
310                unlock all monitors
311        else
312                push self to AS-stack
313                baton pass
314        endif
315        return
316endif
317if non-blocking
318        Unlock all monitors
319        Return
320endif
321
322push self to AS-stack