source: doc/theses/thierry_delisle_MMath/text/internals.tex @ 4b75ae9

Last change on this file since 4b75ae9 was 67982887, checked in by Peter A. Buhr <pabuhr@…>, 6 years ago
specialize thesis directory-names
\chapter{Behind the Scenes}
There are several challenges specific to \CFA when implementing concurrency. These challenges are a direct result of \gls{bulk-acq} and loose object definitions. These two constraints are the root cause of most design decisions in the implementation. Furthermore, to avoid contention from dynamically allocating memory in a concurrent environment, the internal-scheduling design is (almost) entirely free of mallocs. This approach avoids the chicken and egg problem~\cite{Chicken} of having a memory allocator that relies on the threading system and a threading system that relies on the runtime. This extra goal means that memory management is a constant concern in the design of the system.

The main memory concern for concurrency is queues. All blocking operations are made by parking threads onto queues and all queues are designed with intrusive nodes, where each node has pre-allocated link fields for chaining, to avoid the need for memory allocation. Since several concurrency operations can use an unbounded amount of memory (depending on \gls{bulk-acq}), statically defining information in the intrusive fields of threads is insufficient. The only way to use a variable amount of memory without requiring memory allocation is to pre-allocate large buffers of memory eagerly and store the information in these buffers. Conveniently, the call stack fits that description and is easy to use, which is why it is used heavily in the implementation of internal scheduling, particularly variable-length arrays. Since stack allocation is based on scopes, the first step of the implementation is to identify the scopes that are available to store the information, and which of these can have a variable-length array. The threads and the condition both have a fixed amount of memory, while \code{mutex} routines and blocking calls allow for an unbounded amount, within the stack size.
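
The combination of intrusive nodes and stack storage can be sketched in plain C rather than \CFA. The following is only an illustration under assumed names (\code{node}, \code{queue} and \code{park_and_drain} are hypothetical and are not taken from the \CFA runtime): the link field lives inside each node, so enqueueing never allocates, and the nodes themselves live in a variable-length array whose lifetime is the enclosing call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only; not the CFA runtime's. */
struct node {
	struct node *next;   /* intrusive link field, pre-allocated in the node */
	int id;              /* payload used only by this example */
};

struct queue {
	struct node *head;
	struct node **tail;  /* address of the last link, for O(1) append */
};

void queue_init(struct queue *q) {
	q->head = NULL;
	q->tail = &q->head;
}

/* Append a node; no memory is allocated because the link is in the node. */
void queue_push(struct queue *q, struct node *n) {
	n->next = NULL;
	*q->tail = n;
	q->tail = &n->next;
}

/* Remove and return the oldest node, or NULL if the queue is empty. */
struct node *queue_pop(struct queue *q) {
	struct node *n = q->head;
	if (n != NULL) {
		q->head = n->next;
		if (q->head == NULL) q->tail = &q->head;
		n->next = NULL;
	}
	return n;
}

/* Queue `count` nodes stored in a variable-length array on this frame,
 * then drain them in FIFO order. `count` must be greater than zero. */
int park_and_drain(size_t count) {
	struct node nodes[count];          /* storage lives on the call stack */
	struct queue q;
	queue_init(&q);
	for (size_t i = 0; i < count; i++) {
		nodes[i].id = (int)i;
		queue_push(&q, &nodes[i]);
	}
	int sum = 0;
	for (struct node *n; (n = queue_pop(&q)) != NULL; ) sum += n->id;
	return sum;                        /* all nodes are unlinked before return */
}
```

The important constraint, which this sketch respects, is that every node is unlinked before the frame owning the array returns; in the runtime this corresponds to a blocking call not returning until the thread has been removed from all queues.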

Note that since the major contributions of this thesis are extending monitor semantics to \gls{bulk-acq} and loose object definitions, any challenges that do not result from these characteristics of \CFA are considered solved problems and are therefore not discussed.

% ======================================================================
% ======================================================================
\section{Mutex Routines}
% ======================================================================
% ======================================================================

The first step towards the monitor implementation is simple \code{mutex} routines. In the single monitor case, mutual-exclusion is done using the entry/exit procedure in listing \ref{lst:entry1}. The entry/exit procedures do not have to be extended to support multiple monitors. Indeed, it is sufficient to enter/leave monitors one-by-one as long as the order is correct to prevent deadlock~\cite{Havender68}. In \CFA, ordering of monitor acquisition relies on memory ordering. This approach is sufficient because all objects are guaranteed to have distinct non-overlapping memory layouts and mutual-exclusion for a monitor is only defined for its lifetime, meaning that destroying a monitor while it is acquired is undefined behaviour. When a mutex call is made, the concerned monitors are aggregated into a variable-length pointer array and sorted based on pointer values. This array persists for the entire duration of the mutual-exclusion and its ordering is reused extensively.
\begin{figure}
\begin{multicols}{2}
Entry
\begin{pseudo}
if monitor is free
	enter
elif already own the monitor
	continue
else
	block
increment recursion
\end{pseudo}
\columnbreak
Exit
\begin{pseudo}
decrement recursion
if recursion == 0
	if entry queue not empty
		wake-up thread
\end{pseudo}
\end{multicols}
\begin{pseudo}[caption={Initial entry and exit routine for monitors},label={lst:entry1}]
\end{pseudo}
\end{figure}
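
The address-based ordering and the recursive entry/exit procedure of listing \ref{lst:entry1} can be illustrated with a small single-threaded C sketch. All names here (\code{monitor}, \code{sort_monitors}, \code{try_enter}, \code{leave}) are assumptions made for the example, the protection of the monitor's own fields is omitted, and blocking is reduced to a return code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical monitor descriptor; field names are assumptions. */
struct monitor {
	const void *owner;    /* NULL when free, else an opaque thread identity */
	unsigned recursion;   /* nesting depth of the current owner */
};

/* qsort comparator ordering monitor pointers by their address. */
static int by_address(const void *a, const void *b) {
	uintptr_t x = (uintptr_t)*(struct monitor *const *)a;
	uintptr_t y = (uintptr_t)*(struct monitor *const *)b;
	return (x > y) - (x < y);
}

/* Sort the pointer array so every thread acquires in the same order. */
void sort_monitors(struct monitor **ms, size_t n) {
	qsort(ms, n, sizeof *ms, by_address);
}

/* Entry: returns 1 on success, 0 where a real runtime would block. */
int try_enter(struct monitor *m, const void *self) {
	if (m->owner == NULL) m->owner = self;   /* monitor is free: enter */
	else if (m->owner != self) return 0;     /* owned by another thread */
	m->recursion++;                          /* free or already owned */
	return 1;
}

/* Exit: release ownership only when the outermost call leaves. */
void leave(struct monitor *m) {
	if (--m->recursion == 0) {
		m->owner = NULL;   /* a real runtime would wake the entry queue here */
	}
}
```

Because every caller sorts its array the same way, two threads that need overlapping sets of monitors always attempt the shared monitors in the same relative order, which is the condition required to avoid deadlock.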

\subsection{Details: Interaction with polymorphism}
Depending on the choice of semantics for when monitor locks are acquired, interaction between monitors and \CFA's concept of polymorphism can be more complex to support. However, it is shown that entry-point locking solves most of the issues.

First of all, interaction between \code{otype} polymorphism (see Section~\ref{s:ParametricPolymorphism}) and monitors is impossible since monitors do not support copying. Therefore, the main question is how to support \code{dtype} polymorphism. It is important to present the difference between the two acquiring options: \gls{callsite-locking} and entry-point locking, i.e., acquiring the monitors before making a mutex routine-call or as the first operation of the mutex routine-call. For example:
\begin{table}[H]
\begin{center}
\begin{tabular}{|c|c|c|}
Mutex & \gls{callsite-locking} & \gls{entry-point-locking} \\
call & pseudo-code & pseudo-code \\
\hline
\begin{cfacode}[tabsize=3]
void foo(monitor& mutex a){

	//Do Work
	//...

}

void main() {
	monitor a;

	foo(a);

}
\end{cfacode} & \begin{pseudo}[tabsize=3]
foo(& a) {

	//Do Work
	//...

}

main() {
	monitor a;
	acquire(a);
	foo(a);
	release(a);
}
\end{pseudo} & \begin{pseudo}[tabsize=3]
foo(& a) {
	acquire(a);
	//Do Work
	//...
	release(a);
}

main() {
	monitor a;

	foo(a);

}
\end{pseudo}
\end{tabular}
\end{center}
\caption{Call-site vs entry-point locking for mutex calls}
\label{tbl:locking-site}
\end{table}

Note the \code{mutex} keyword relies on the type system, which means that in cases where a generic monitor-routine is desired, writing the mutex routine is possible with the proper trait, e.g.:
\begin{cfacode}
//Incorrect: T may not be a monitor
forall(dtype T)
void foo(T * mutex t);

//Correct: this function only works on monitors (any monitor)
forall(dtype T | is_monitor(T))
void bar(T * mutex t);
\end{cfacode}

Both entry-point locking and \gls{callsite-locking} are feasible implementations. The current \CFA implementation uses entry-point locking because it requires less work when using \gls{raii}, effectively transferring the burden of implementation to object construction/destruction. It is harder to use \gls{raii} for call-site locking, as it does not necessarily have an existing scope that matches exactly the scope of the mutual exclusion, i.e., the function body. For example, the monitor call can appear in the middle of an expression. Furthermore, entry-point locking requires less code generation since any useful routine is called multiple times but there is only one entry point for many call sites.

% ======================================================================
% ======================================================================
\section{Threading} \label{impl:thread}
% ======================================================================
% ======================================================================

Figure \ref{fig:system1} shows a high-level picture of the \CFA runtime system with regard to concurrency. Each component of the picture is explained in detail in the following sections.

\begin{figure}
\begin{center}
{\resizebox{\textwidth}{!}{\input{system.pstex_t}}}
\end{center}
\caption{Overview of the entire system}
\label{fig:system1}
\end{figure}

\subsection{Processors}
Parallelism in \CFA is built around using processors to specify how much parallelism is desired. \CFA processors are object wrappers around kernel threads, specifically \texttt{pthread}s in the current implementation of \CFA. Indeed, any parallelism must go through operating-system libraries. However, \glspl{uthread} are still the main source of concurrency; processors are simply the underlying source of parallelism. Indeed, processor \glspl{kthread} simply fetch a \gls{uthread} from the scheduler and run it; they are effectively executors for user-threads. The main benefit of this approach is that it offers a well-defined boundary between kernel code and user code, for example, kernel thread quiescing, scheduling and interrupt handling. Processors internally use coroutines to take advantage of the existing context-switching semantics.

\subsection{Stack Management}
One of the challenges of this system is to reduce the footprint as much as possible. Specifically, all \texttt{pthread}s created also have a stack created with them, which should be used as much as possible. Normally, coroutines also create their own stack to run on; however, in the case of the coroutines used for processors, these coroutines run directly on the \gls{kthread} stack, effectively stealing the processor stack. The exception to this rule is the Main Processor, i.e., the initial \gls{kthread} that is given to any program. In order to respect C user expectations, the stack of the initial kernel thread (the main stack of the program, which can grow very large) is used by the main user thread rather than the main processor.

\subsection{Context Switching}
As mentioned in section \ref{coroutine}, coroutines are a stepping stone for implementing threading, because they share the same mechanism for context-switching between different stacks. To improve performance and simplicity, context-switching is implemented using the following assumption: all context-switches happen inside a specific function call. This assumption means that the context-switch only has to copy the callee-saved registers onto the stack and then switch the stack registers with the ones of the target coroutine/thread. Note that the instruction pointer can be left untouched since the context-switch is always inside the same function. Threads, however, do not context-switch between each other directly. They context-switch to the scheduler. This method is called a 2-step context-switch and has the advantage of having a clear distinction between user code and the kernel where scheduling and other system operations happen. Obviously, this doubles the context-switch cost because threads must context-switch to an intermediate stack. The alternative 1-step context-switch uses the stack of the ``from'' thread to schedule and then context-switches directly to the ``to'' thread. However, the performance of the 2-step context-switch is still superior to a \code{pthread_yield} (see section \ref{results}). Additionally, for users in need of optimal performance, it is important to note that having a 2-step context-switch as the default does not prevent \CFA from offering a 1-step context-switch (akin to the Microsoft \code{SwitchToFiber}~\cite{switchToWindows} routine). This option is not currently present in \CFA, but the changes required to add it are strictly additive.

\subsection{Preemption} \label{preemption}
Finally, an important aspect for any complete threading system is preemption. As mentioned in chapter \ref{basics}, preemption introduces an extra degree of uncertainty, which enables users to have multiple threads interleave transparently, rather than having to cooperate among threads for proper scheduling and CPU distribution. Indeed, preemption is desirable because it adds a degree of isolation among threads. In a fully cooperative system, any thread that runs a long loop can starve other threads, while in a preemptive system, starvation can still occur but it does not rely on every thread having to yield or block on a regular basis, which significantly reduces the programmer's burden. Obviously, preemption is not optimal for every workload. However, any preemptive system can become a cooperative system by making the time slices extremely large. Therefore, \CFA uses a preemptive threading system.

Preemption in \CFA\footnote{Note that the implementation of preemption is strongly tied to the underlying threading system. For this reason, only the Linux implementation is covered; \CFA does not run on Windows at the time of writing.} is based on kernel timers, which are used to run a discrete-event simulation. Every processor keeps track of the current time and registers an expiration time with the preemption system. When the preemption system receives a change in preemption, it inserts the time in sorted order and sets a kernel timer for the closest one, effectively stepping through preemption events on each signal sent by the timer. These timers use the Linux signal {\tt SIGALRM}, which is delivered to the process rather than the kernel-thread. This results in an implementation problem, because when delivering signals to a process, the kernel can deliver the signal to any kernel thread for which the signal is not blocked, i.e.:
\begin{quote}
A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked. If more than one of the threads has the signal unblocked, then the kernel chooses an arbitrary thread to which to deliver the signal.
SIGNAL(7) - Linux Programmer's Manual
\end{quote}
For the sake of simplicity, and in order to prevent the case of having two threads receiving alarms simultaneously, \CFA programs block the {\tt SIGALRM} signal on every kernel thread except one.
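
The sorted list of expiration times that drives this discrete-event simulation can be sketched as follows. This is a minimal illustration under assumed names (\code{alarm}, \code{alarm_insert}, \code{alarm_expire} are hypothetical, as is the time unit); arming the actual kernel timer and signalling processors are left as comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical alarm node; one per processor in this sketch. */
struct alarm {
	struct alarm *next;   /* intrusive link, so insertion never allocates */
	uint64_t expiry;      /* absolute expiration time, unit is an assumption */
};

/* Insert in ascending expiry order. Returns 1 when the new alarm became
 * the head, i.e., when the kernel timer would have to be re-armed. */
int alarm_insert(struct alarm **head, struct alarm *a) {
	struct alarm **it = head;
	while (*it != NULL && (*it)->expiry <= a->expiry) it = &(*it)->next;
	a->next = *it;
	*it = a;
	return it == head;
}

/* Called on each timer signal: pop every alarm that expired at or
 * before `now` and return how many were popped. */
size_t alarm_expire(struct alarm **head, uint64_t now) {
	size_t fired = 0;
	while (*head != NULL && (*head)->expiry <= now) {
		struct alarm *a = *head;
		*head = a->next;
		a->next = NULL;
		fired++;   /* a real system would preempt the matching processor */
	}
	return fired;   /* caller re-arms the timer for (*head)->expiry, if any */
}
```

Only the head of the list ever needs a kernel timer, which is why a single timer is enough regardless of the number of processors.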

Now because of how involuntary context-switches are handled, the kernel thread handling {\tt SIGALRM} cannot also be a processor thread. Hence, involuntary context-switching is done by sending signal {\tt SIGUSR1} to the corresponding proces\-sor and having the thread yield from inside the signal handler. This approach effectively context-switches away from the signal handler back to the kernel and the signal handler frame is eventually unwound when the thread is scheduled again. As a result, a signal handler can start on one kernel thread and terminate on a second kernel thread (but the same user thread). It is important to note that signal handlers save and restore signal masks because user-thread migration can cause a signal mask to migrate from one kernel thread to another. This behaviour is only a problem if the kernel threads among which a user thread can migrate differ in terms of signal masks\footnote{Sadly, official POSIX documentation is silent on what distinguishes ``async-signal-safe'' functions from other functions.}. However, since the kernel thread handling preemption requires a different signal mask, executing user threads on the kernel-alarm thread can cause deadlocks. For this reason, the alarm thread is in a tight loop around a system call to \code{sigwaitinfo}, requiring very little CPU time for preemption. One final detail about the alarm thread is how to wake it when additional communication is required (e.g., on thread termination). This unblocking is also done using {\tt SIGALRM}, but sent through \code{pthread_sigqueue}. Indeed, \code{sigwaitinfo} can differentiate signals sent from \code{pthread_sigqueue} from signals sent from alarms or the kernel.
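
One way the alarm thread could tell the two kinds of {\tt SIGALRM} apart is through the \code{si_code} field filled in by \code{sigwaitinfo}, which is \code{SI_QUEUE} for signals queued with the \code{sigqueue} family. The sketch below is an assumption about how such a loop body might look, not the actual \CFA code; \code{wake_reason}, \code{classify_wakeup} and \code{wait_for_wakeup} are hypothetical names, and every \code{si_code} other than \code{SI_QUEUE} is simply treated as a timer expiry.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>

/* Hypothetical classification of why the alarm thread woke up. */
enum wake_reason { WAKE_TIMER, WAKE_REQUEST };

/* Signals queued explicitly carry SI_QUEUE; in this sketch anything
 * else is assumed to come from a timer or the kernel. */
enum wake_reason classify_wakeup(const siginfo_t *info) {
	return info->si_code == SI_QUEUE ? WAKE_REQUEST : WAKE_TIMER;
}

/* Block until a signal in `set` is pending for the calling thread.
 * The signals in `set` must already be blocked by the caller.
 * Returns the signal number, or -1 on error. */
int wait_for_wakeup(const sigset_t *set, enum wake_reason *reason) {
	siginfo_t info;
	int sig;
	do {
		sig = sigwaitinfo(set, &info);
	} while (sig == -1 && errno == EINTR);   /* retry if interrupted */
	if (sig == -1) return -1;
	*reason = classify_wakeup(&info);
	return sig;
}
```

A loop around \code{wait_for_wakeup} would then process expired alarms on \code{WAKE_TIMER} and re-read its shared state on \code{WAKE_REQUEST}.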

\subsection{Scheduler}
Finally, an aspect that has not been mentioned yet is the scheduling algorithm. Currently, the \CFA scheduler uses a single ready queue for all processors, which is the simplest approach to scheduling. Further discussion on scheduling is present in section \ref{futur:sched}.

% ======================================================================
% ======================================================================
\section{Internal Scheduling} \label{impl:intsched}
% ======================================================================
% ======================================================================
The following figure is the traditional illustration of a monitor (repeated from page~\pageref{fig:ClassicalMonitor} for convenience):

\begin{figure}[H]
\begin{center}
{\resizebox{0.4\textwidth}{!}{\input{monitor}}}
\end{center}
\caption{Traditional illustration of a monitor}
\end{figure}

This picture has several components, the two most important being the entry queue and the AS-stack. The entry queue is an (almost) FIFO list where threads waiting to enter are parked, while the acceptor/signaller (AS) stack is a FILO list used for threads that have been signalled or otherwise marked as running next.

For \CFA, this picture does not have support for blocking multiple monitors on a single condition. To support \gls{bulk-acq}, two changes to this picture are required. First, it is no longer helpful to attach the condition to \emph{a single} monitor. Secondly, the thread waiting on the condition has to be separated across multiple monitors, as seen in figure \ref{fig:monitor_cfa}.

\begin{figure}[H]
\begin{center}
{\resizebox{0.8\textwidth}{!}{\input{int_monitor}}}
\end{center}
\caption{Illustration of \CFA Monitor}
\label{fig:monitor_cfa}
\end{figure}

This picture and the proper entry and leave algorithms (see listing \ref{lst:entry2}) are the fundamental implementation of internal scheduling. Note that when a thread is moved from the condition to the AS-stack, it is conceptually split into N pieces, where N is the number of monitors specified in the parameter list. The thread is woken up when all the pieces have been popped from the AS-stacks and made active. In this picture, the threads are split into halves but this is only because there are two monitors. For a specific signalling operation, every monitor needs a piece of the thread on its AS-stack.

\begin{figure}[b]
\begin{multicols}{2}
Entry
\begin{pseudo}
if monitor is free
	enter
elif already own the monitor
	continue
else
	block
increment recursion

\end{pseudo}
\columnbreak
Exit
\begin{pseudo}
decrement recursion
if recursion == 0
	if signal_stack not empty
		set_owner to thread
		if all monitors ready
			wake-up thread

	if entry queue not empty
		wake-up thread
\end{pseudo}
\end{multicols}
\begin{pseudo}[caption={Entry and exit routine for monitors with internal scheduling},label={lst:entry2}]
\end{pseudo}
\end{figure}

The solution discussed in section \ref{intsched} can be seen in the exit routine of listing \ref{lst:entry2}. Basically, the solution boils down to having a separate data structure for the condition queue and the AS-stack, and unconditionally transferring ownership of the monitors but only unblocking the thread when the last monitor has transferred ownership. This solution is deadlock safe and prevents any potential barging. The data structures used for the AS-stack are reused extensively for external scheduling, but in the case of internal scheduling, the data is allocated using variable-length arrays on the call stack of the \code{wait} and \code{signal_block} routines.

\begin{figure}[H]
\begin{center}
{\resizebox{0.8\textwidth}{!}{\input{monitor_structs.pstex_t}}}
\end{center}
\caption{Data structures involved in internal/external scheduling}
\label{fig:structs}
\end{figure}

Figure \ref{fig:structs} shows a high-level representation of these data structures. The main idea behind them is that a thread cannot contain an arbitrary number of intrusive ``next'' pointers for linking onto monitors. The \code{condition node} is the data structure that is queued onto a condition variable and, when signalled, the condition queue is popped and each \code{condition criterion} is moved to the AS-stack. Once all the criteria have been popped from their respective AS-stacks, the thread is woken up, which is what is shown in listing \ref{lst:entry2}.
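
The relationship between a condition node and its criteria can be sketched in C as below. The structure layouts and names (\code{criterion}, \code{condition_node}, \code{as_push}, \code{as_pop_ready}) are assumptions chosen for the illustration and are not the runtime's definitions; the sketch only shows that one criterion per monitor is pushed on that monitor's AS-stack and that the thread becomes runnable when the last criterion is popped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct condition_node;

/* One piece of a waiting thread, linked onto a single monitor's AS-stack. */
struct criterion {
	struct criterion *next;         /* intrusive link for the AS-stack */
	struct condition_node *owner;   /* node this piece belongs to */
	bool ready;                     /* set once popped by its monitor */
};

/* Queued on the condition; in this sketch the criteria array is expected
 * to live on the stack of the waiting routine. */
struct condition_node {
	struct condition_node *next;    /* intrusive link for the condition queue */
	struct criterion *criteria;     /* one entry per monitor */
	size_t count;
};

/* Push a criterion on one monitor's AS-stack (last in, first out). */
void as_push(struct criterion **stack, struct criterion *c) {
	c->next = *stack;
	*stack = c;
}

/* Pop the top criterion of one monitor's AS-stack and mark it ready.
 * Returns true only when every criterion of the same node is ready,
 * i.e., when the waiting thread can be woken. */
bool as_pop_ready(struct criterion **stack) {
	struct criterion *c = *stack;
	if (c == NULL) return false;
	*stack = c->next;
	c->next = NULL;
	c->ready = true;
	const struct condition_node *n = c->owner;
	for (size_t i = 0; i < n->count; i++) {
		if (!n->criteria[i].ready) return false;
	}
	return true;
}
```

With two monitors, the first call to \code{as_pop_ready} reports that the thread must keep waiting and the second reports that it can run, matching the ``if all monitors ready'' test of the exit routine.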

% ======================================================================
% ======================================================================
\section{External Scheduling}
% ======================================================================
% ======================================================================
Similarly to internal scheduling, external scheduling for multiple monitors relies on the idea that waiting-thread queues are no longer specific to a single monitor, as mentioned in section \ref{extsched}. For internal scheduling, these queues are part of condition variables, which are still unique for a given scheduling operation (i.e., no signal statement uses multiple conditions). However, in the case of external scheduling, there is no equivalent object which is associated with \code{waitfor} statements. This absence means the queues holding the waiting threads must be stored inside at least one of the monitors that is acquired. These monitors are the only objects that have sufficient lifetime and are available on both sides of the \code{waitfor} statement. This requires an algorithm to choose which monitor holds the relevant queue. It is also important that said algorithm be independent of the order in which users list parameters. The proposed algorithm is to fall back on monitor lock ordering (sorting by address) and specify that the monitor that is acquired first is the one with the relevant waiting queue. This assumes that the lock acquiring order is static for the lifetime of all concerned objects, but that is a reasonable constraint.

This algorithm choice has two consequences:
\begin{itemize}
	\item The queue of the monitor with the lowest address is no longer a true FIFO queue because threads can be moved to the front of the queue. These queues need to contain a set of monitors for each of the waiting threads. Therefore, another thread whose set contains the same lowest-address monitor but different lower-priority monitors may arrive first but enter the critical section after a thread with the correct pairing.
	\item The queue of the lowest-priority monitor is both required and potentially unused. Indeed, since it is not known at compile time which monitor has the lowest address, every monitor needs to have the correct queues even though it is possible that some queues go unused for the entire duration of the program, for example if a monitor is only used in a specific pair.
\end{itemize}
Therefore, the following modifications need to be made to support external scheduling:
\begin{itemize}
	\item The threads waiting on the entry queue need to keep track of which routine they are trying to enter, and using which set of monitors. The \code{mutex} routine already has all the required information on its stack, so the thread only needs to keep a pointer to that information.
	\item The monitors need to keep a mask of acceptable routines. This mask contains, for each acceptable routine, a routine pointer and an array of monitors to go with it. It also needs storage to keep track of which routine was accepted. Since this information is not specific to any monitor, the monitors actually contain a pointer to an integer on the stack of the waiting thread. Note that if a thread has acquired two monitors but executes a \code{waitfor} with only one monitor as a parameter, setting the mask of acceptable routines to both monitors will not cause any problems since the extra monitor will not change ownership regardless. This becomes relevant when \code{when} clauses affect the number of monitors passed to a \code{waitfor} statement.
	\item The entry/exit routines need to be updated as shown in listing \ref{lst:entry3}.
\end{itemize}
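
The mask of acceptable routines described in the list above can be illustrated with the following C sketch. Every name in it is hypothetical (\code{acceptable}, \code{waitfor_mask}, \code{mask_matches}, and the two placeholder routines), and the monitor arrays are assumed to be already sorted by address so that an element-wise comparison is sufficient.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder monitor type; the real layout is irrelevant here. */
struct monitor { int placeholder; };

typedef void (*routine_ptr)(void);

/* Placeholder mutex routines used only to obtain distinct addresses. */
void example_insert(void) {}
void example_remove(void) {}

/* One acceptable routine with the monitors it must be called on. */
struct acceptable {
	routine_ptr routine;
	struct monitor *const *monitors;   /* assumed sorted by address */
	size_t count;
};

/* The mask stored in a monitor while a thread waits in a waitfor. */
struct waitfor_mask {
	const struct acceptable *clauses;
	size_t count;
	int *accepted;   /* points at an integer on the waiting thread's stack */
};

/* Check an entering call against the mask. On a match, record the index
 * of the accepted clause through the pointer and return true. */
bool mask_matches(const struct waitfor_mask *mask, routine_ptr routine,
                  struct monitor *const *monitors, size_t count) {
	for (size_t i = 0; i < mask->count; i++) {
		const struct acceptable *c = &mask->clauses[i];
		if (c->routine != routine || c->count != count) continue;
		size_t j = 0;
		while (j < count && c->monitors[j] == monitors[j]) j++;
		if (j == count) {
			*mask->accepted = (int)i;
			return true;
		}
	}
	return false;
}
```

In this sketch, a call is accepted only if both the routine and the exact set of monitors match a clause, which reflects the requirement that waiting threads record which routine they are entering and with which monitors.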

\subsection{External Scheduling - Destructors}
Finally, to support the ordering inversion of destructors, the code generation needs to be modified to use a special entry routine. This routine is needed because of the storage requirements of the call order inversion. Indeed, when waiting for the destructors, storage is needed for the waiting context and the lifetime of said storage needs to outlive the waiting operation it is needed for. For regular \code{waitfor} statements, the call stack of the routine itself matches this requirement, but this is no longer the case when waiting for the destructor since it is pushed onto the AS-stack for later. The \code{waitfor} semantics can then be adjusted correspondingly, as seen in listing \ref{lst:entry-dtor}.

\begin{figure}
\begin{multicols}{2}
Entry
\begin{pseudo}
if monitor is free
	enter
elif already own the monitor
	continue
elif matches waitfor mask
	push criteria to AS-stack
	continue
else
	block
increment recursion
\end{pseudo}
\columnbreak
Exit
\begin{pseudo}
decrement recursion
if recursion == 0
	if signal_stack not empty
		set_owner to thread
		if all monitors ready
			wake-up thread
		endif
	endif

	if entry queue not empty
		wake-up thread
	endif
\end{pseudo}
\end{multicols}
\begin{pseudo}[caption={Entry and exit routine for monitors with internal scheduling and external scheduling},label={lst:entry3}]
\end{pseudo}
\end{figure}

\begin{figure}
\begin{multicols}{2}
Destructor Entry
\begin{pseudo}
if monitor is free
	enter
elif already own the monitor
	increment recursion
	return
create wait context
if matches waitfor mask
	reset mask
	push self to AS-stack
	baton pass
else
	wait
increment recursion
\end{pseudo}
\columnbreak
Waitfor
\begin{pseudo}
if matching thread is already there
	if found destructor
		push destructor to AS-stack
		unlock all monitors
	else
		push self to AS-stack
		baton pass
	endif
	return
endif
if non-blocking
	Unlock all monitors
	Return
endif

push self to AS-stack
set waitfor mask
block
return
\end{pseudo}
\end{multicols}
\begin{pseudo}[caption={Pseudo code for the \code{waitfor} routine and the \code{mutex} entry routine for destructors},label={lst:entry-dtor}]
\end{pseudo}
\end{figure}
