Ignore:
Timestamp:
Nov 13, 2017, 10:45:32 AM (7 years ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, aaron-thesis, arm-eh, ast-experimental, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, new-env, no_list, persistent-indexer, pthread-emulation, qualifiedEnum, resolv-new, with_gc
Children:
b3ffb61
Parents:
6d2386e
Message:

Commit after new draft

Location:
doc/proposals/concurrency/text
Files:
6 edited

Legend:

Unmodified
Added
Removed
  • doc/proposals/concurrency/text/concurrency.tex

    r6d2386e r20ffcf3  
    771771For the first two conditions, it is easy to implement a check that can evaluate the condition in a few instruction. However, a fast check for \pscode{monitor accepts me} is much harder to implement depending on the constraints put on the monitors. Indeed, monitors are often expressed as an entry queue and some acceptor queue as in the following figure:
    772772
     773\begin{figure}[H]
    773774\begin{center}
    774775{\resizebox{0.4\textwidth}{!}{\input{monitor}}}
    775776\end{center}
     777\label{fig:monitor}
     778\end{figure}
    776779
    777780There are other alternatives to these pictures, but in the case of this picture, implementing a fast accept check is relatively easy. Restricted to a fixed number of mutex members, N, the accept check reduces to updating a bitmask when the acceptor queue changes, a check that executes in a single instruction even with a fairly large number (e.g., 128) of mutex members. This technique cannot be used in \CFA because it relies on the fact that the monitor type enumerates (declares) all the acceptable routines. For OO languages this does not compromise much since monitors already have an exhaustive list of member routines. However, for \CFA this is not the case; routines can be added to a type anywhere after its declaration. It is important to note that the bitmask approach does not actually require an exhaustive list of routines, but it requires a dense unique ordering of routines with an upper-bound and that ordering must be consistent across translation units.
  • doc/proposals/concurrency/text/future.tex

    r6d2386e r20ffcf3  
    66
    77\section{Flexible Scheduling} \label{futur:sched}
    8 
     8An important part of concurrency is scheduling. Different scheduling algorithm can affact peformance (both in terms of average and variation). However, no single scheduler is optimal for all workloads and therefore there is value in being able to change the scheduler for given programs. One solution is to offer various tweaking options to users, allowing the scheduler to be adjusted the to requirements of the workload. However, in order to be truly flexible, it would be interesting to allow users to add arbitrary data and arbirary scheduling algorithms to the scheduler. For example, a web server could attach Type-of-Service information to threads and have a ``ToS aware'' scheduling algorithm tailored to this specific web server. This path of flexible schedulers will be explored for \CFA.
    99
    1010\section{Non-Blocking IO} \label{futur:nbio}
     
    1212However, many modern workloads are not bound on computation but on IO operations, an common case being webservers and XaaS (anything as a service). These type of workloads often require significant engineering around amortising costs of blocking IO operations. While improving throughtput of these operations is outside what \CFA can do as a language, it can help users to make better use of the CPU time otherwise spent waiting on IO operations. The current trend is to use asynchronous programming using tools like callbacks and/or futurs and promises\cit. However, while these are valid solutions, they lead to code that is harder to read and maintain because it is much less linear
    1313
    14 
    15 
    1614\section{Other concurrency tools} \label{futur:tools}
    17 
     15While monitors offer a flexible and powerful concurent core for \CFA, other concurrency tools are also necessary for a complete multi-paradigm concurrency package. Example of such tools can include simple locks and condition variables, futures and promises, and executors. These additional features are useful when monitors offer a level of abstraction which is indaquate for certain tasks.
    1816
    1917\section{Implicit threading} \label{futur:implcit}
     
    103101\end{figure}
    104102
    105 Implicit parallelism is a general solution and therefore is
    106 
    107 \section{Multiple Paradigms} \label{futur:paradigms}
     103Implicit parallelism is a general solution and therefore has its limitations. However, it is a quick and simple approach to parallelism which may very well be sufficient for smaller applications and reduces the amount of boiler-plate that is needed to start benefiting from parallelism in modern CPUs.
    108104
    109105
    110 \section{Transactions} \label{futur:transaction}
    111 Concurrency and parallelism is still a very active field that strongly benefits from hardware advances. As such certain features that aren't necessarily mature enough in their current state could become relevant in the lifetime of \CFA.
  • doc/proposals/concurrency/text/internals.tex

    r6d2386e r20ffcf3  
    11
    22\chapter{Behind the scene}
    3 There are several challenges specific to \CFA when implementing concurrency. These challenges are direct results of \gls{bulk-acq} and loose object definitions. These two constraints are to root cause of most design decisions in the implementation. Furthermore, to avoid the head-aches of dynamically allocating memory in a concurrent environment, the internal-scheduling design is (almost) entirely free of mallocs and other dynamic memory allocation scheme. This is to avoid the chicken and egg problem \cite{Chicken} of having a memory allocator that relies on the threading system and a threading system that relies on the runtime. This extra goal, means that memory management is a constant concern in the design of the system.
    4 
    5 The main memory concern for concurrency is queues. All blocking operations are made by parking threads onto queues. These queues need to be intrinsic\cit to avoid the need memory allocation. This entails that all the fields needed to keep track of all needed information. Since many conconcurrency operations can use an unbound amount of memory (depending on \gls{bulk-acq}) statically defining information in the intrusive fields of threads is insufficient. The only variable sized container that does not require memory allocation is the callstack, which is heavily used in the implementation of internal scheduling. Particularly the GCC extension variable length arrays which is used extensively.
     3There are several challenges specific to \CFA when implementing concurrency. These challenges are a direct result of \gls{bulk-acq} and loose object-definitions. These two constraints are the root cause of most design decisions in the implementation. Furthermore, to avoid contention from dynamically allocating memory in a concurrent environment, the internal-scheduling design is (almost) entirely free of mallocs. This is to avoid the chicken and egg problem \cite{Chicken} of having a memory allocator that relies on the threading system and a threading system that relies on the runtime. This extra goal, means that memory management is a constant concern in the design of the system.
     4
     5The main memory concern for concurrency is queues. All blocking operations are made by parking threads onto queues. The queue design needs to be intrinsic\cit to avoid the need for memory allocation, which entails that all the nodes need specific fields to keep track of all needed information. Since many concurrency operations can use an unbound amount of memory (depending on \gls{bulk-acq}), statically defining information in the intrusive fields of threads is insufficient. The only variable sized container that does not require memory allocation is the callstack, which is heavily used in the implementation of internal scheduling. Particularly variable length arrays, which are used extensively.
    66
    77Since stack allocation is based around scope, the first step of the implementation is to identify the scopes that are available to store the information, and which of these can have a variable length. The threads and the condition both allow a fixed amount of memory to be stored, while mutex-routines and the actual blocking call allow for an unbound amount (though the later is preferable in terms of performance).
    88
    9 Note that since the major contributions of this thesis are extending monitor semantics to \gls{bulk-acq} and loose object definitions, any challenges that are not resulting of these characteristiques of \CFA are consired as problems which have already been solved and therefore will not be discussed further.
     9Note that since the major contributions of this thesis are extending monitor semantics to \gls{bulk-acq} and loose object definitions, any challenges that are not resulting of these characteristiques of \CFA are considered as solved problems and therefore not discussed further.
    1010
    1111% ======================================================================
     
    1515% ======================================================================
    1616
    17 The first step towards the monitor implementation is simple mutex-routines using monitors. In the single monitor case, this is done using the entry/exit procedure highlighted in listing \ref{lst:entry1}. This entry/exit procedure doesn't actually have to be extended to support multiple monitors, indeed it is sufficient to enter/leave monitors one-by-one as long as the order is correct to prevent deadlocks\cit. In \CFA, ordering of monitor relies on memory ordering, this is sufficient because all objects are guaranteed to have distinct non-overlaping memory layouts and mutual-exclusion for a monitor is only defined for its lifetime, meaning that destroying a monitor while it is acquired is undefined behavior. When a mutex call is made, the concerned monitors are agregated into an variable-length pointer array and sorted based on pointer values. This array is concerved during the entire duration of the mutual-exclusion and it's ordering reused extensively.
     17The first step towards the monitor implementation is simple mutex-routines using monitors. In the single monitor case, this is done using the entry/exit procedure highlighted in listing \ref{lst:entry1}. This entry/exit procedure does not actually have to be extended to support multiple monitors, indeed it is sufficient to enter/leave monitors one-by-one as long as the order is correct to prevent deadlocks\cit. In \CFA, ordering of monitor relies on memory ordering, this is sufficient because all objects are guaranteed to have distinct non-overlaping memory layouts and mutual-exclusion for a monitor is only defined for its lifetime, meaning that destroying a monitor while it is acquired is undefined behavior. When a mutex call is made, the concerned monitors are agregated into a variable-length pointer array and sorted based on pointer values. This array presists for the entire duration of the mutual-exclusion and its ordering reused extensively.
    1818\begin{figure}
    1919\begin{multicols}{2}
     
    9696\end{tabular}
    9797\end{center}
    98 \caption{Callsite vs entry-point locking for mutex calls}
     98\caption{Call-site vs entry-point locking for mutex calls}
    9999\label{fig:locking-site}
    100100\end{figure}
    101101
    102 Note the \code{mutex} keyword relies on the type system, which means that in cases where a generic monitor routine is actually desired, writing a mutex routine is possible with the proper trait, for example:
     102Note the \code{mutex} keyword relies on the type system, which means that in cases where a generic monitor routine is desired, writing the mutex routine is possible with the proper trait, for example:
    103103\begin{cfacode}
    104 //Incorrect: T is not a monitor
     104//Incorrect: T may not be monitor
    105105forall(dtype T)
    106106void foo(T * mutex t);
     
    111111\end{cfacode}
    112112
    113 Both entry-point and callsite locking are valid implementations. The current \CFA implementations uses entry-point locking because it seems to require less work if done using \gls{raii}, effectively transferring the burden of implementation to object construction/destruction. The same could be said of callsite locking, the difference being that the later does not necessarily have an existing scope that matches exactly the scope of the mutual exclusion, i.e.: the function body.
     113Both entry-point and callsite locking are feasible implementations. The current \CFA implementations uses entry-point locking because it requires less work when using \gls{raii}, effectively transferring the burden of implementation to object construction/destruction. The same could be said of callsite locking, the difference being that the later does not necessarily have an existing scope that matches exactly the scope of the mutual exclusion, i.e.: the function body. Furthermore, entry-point locking requires less code generation since any useful routine is called at least as often as it is define, there can be only one entry-point but many callsites.
    114114
    115115% ======================================================================
     
    119119% ======================================================================
    120120
    121 Figure \ref{fig:system1} shows a high-level picture if the \CFA runtime system in regards to concurrency.
     121Figure \ref{fig:system1} shows a high-level picture if the \CFA runtime system in regards to concurrency. Each component of the picture is explained in details in the fllowing sections.
    122122
    123123\begin{figure}
     
    130130
    131131\subsection{Context Switching}
    132 As mentionned in section \ref{coroutine}, coroutines are a stepping stone for implementing threading. This is because they share the same mechanism for context-switching between different stacks. To improve performance and simplicity, context-switching is implemented using the following assumption: all context-switches happen inside a specific function call. This assumptions means that the basic recipe for context-switch is only to copy all callee-saved registers unto the stack and then switch the stack registers with the ones of the target coroutine/thread. Note that instruction pointer can be left untouched since the context-switch always inside the same function. In the case of coroutines, that is the entire story. Threads however do not simply context-switch between each other directly. The context-switch to processors which is where the scheduling happens. This method is called a 2-step context-switch and has the advantage of having a clear distinction between user code and the "kernel" where scheduling and other system operation happen. Obiously, this has the cost of doubling the context-switch cost from because threads must context-switch to an intermediate stack. However, the performance of the 2-step context-switch is still superior to a \code{pthread_yield}(see section \ref{results}). additionally, for users in need for optimal performance, it is important to note that having a 2-step context-switch as the default does not prevent \CFA from offering a 1-step context-switch to use manually (or as part of monitors). This option is not currently present in \CFA but the changes required to add it are strictly additive.
     132As mentionned in section \ref{coroutine}, coroutines are a stepping stone for implementing threading. This is because they share the same mechanism for context-switching between different stacks. To improve performance and simplicity, context-switching is implemented using the following assumption: all context-switches happen inside a specific function call. This assumption means that the context-switch only has to copy the callee-saved registers onto the stack and then switch the stack registers with the ones of the target coroutine/thread. Note that the instruction pointer can be left untouched since the context-switch is always inside the same function. Threads however do not context-switch between each other directly. They context-switch to the scheduler. This method is called a 2-step context-switch and has the advantage of having a clear distinction between user code and the kernel where scheduling and other system operation happen. Obiously, this has the cost of doubling the context-switch cost because threads must context-switch to an intermediate stack. However, the performance of the 2-step context-switch is still superior to a \code{pthread_yield}(see section \ref{results}). additionally, for users in need for optimal performance, it is important to note that having a 2-step context-switch as the default does not prevent \CFA from offering a 1-step context-switch to use manually (or as part of monitors). This option is not currently present in \CFA but the changes required to add it are strictly additive.
    133133
    134134\subsection{Processors}
    135 Parallelism in \CFA are built around using processors to specify how much parallelism is desired. \CFA processors are object wrappers around kernel threads, specifically pthreads in the current implementation of \CFA. Indeed, any parallelism must go through operatiing system librairies. However, \gls{cfathread} are still the main source of concurrency, processors are simply the underlying source of parallelism. Indeed, processor kernel threads simply fetch a user-level thread from the scheduler and run, they are effectively executers for user-threads. The main benefit of this approach is that it offers a well defined boundary between kernel code and user-code, for example kernel thread quiescing, scheduling and interrupt handling. Processors internally use coroutines to take advantage of the existing context-switching semantics.
     135Parallelism in \CFA is built around using processors to specify how much parallelism is desired. \CFA processors are object wrappers around kernel threads, specifically pthreads in the current implementation of \CFA. Indeed, any parallelism must go through operating-system librairies. However, \glspl{uthread} are still the main source of concurrency, processors are simply the underlying source of parallelism. Indeed, processor \glspl{kthread} simply fetch a \glspl{uthread} from the scheduler and run, they are effectively executers for user-threads. The main benefit of this approach is that it offers a well defined boundary between kernel code and user code, for example, kernel thread quiescing, scheduling and interrupt handling. Processors internally use coroutines to take advantage of the existing context-switching semantics.
    136136
    137137\subsection{Stack management}
    138138One of the challenges of this system is to reduce the footprint as much as possible. Specifically, all pthreads created also have a stack created with them, which should be used as much as possible. Normally, coroutines also create there own stack to run on, however, in the case of the coroutines used for processors, these coroutines run directly on the kernel thread stack, effectively stealing the processor stack. The exception to this rule is the Main Processor, i.e. the initial kernel thread that is given to any program. In order to respect user expectations, the stack of the initial kernel thread, the main stack of the program, is used by the main user thread rather than the main processor.
    139139
    140 \subsection{Preemption}
    141 Finally, an important aspect for any complete threading system is preemption. As mentionned in chapter \ref{basics}, preemption introduces an extra degree of unceretainty, which enables users to have multiple threads interleave transparrently between eachother, rather than having to cooperate between thread for proper scheduling and CPU distribution. Indeed, preemption is desireable because it adds a degree of isolation between tasks. In a fully cooperative system, any thread that runs into a long loop can starve other threads, while in a preemptive system starvation can still occur but it does not rely on every thread having to yield or block on a regular basis, which reduces significantly programmer burden. Obviously, preemption is not optimal for every workload, however any preemptive system can become a cooperative system by making the time-slices extremely large. Which is why \CFA uses a preemptive threading system.
    142 
    143 Preemption in \CFA is based on kernel timers which are used to run a discreet event simulation. Every processor keeps track of the current time and registers an expiration time with the preemption system. When the preemption system receives a change in preemption it sorts these expiration times in a list and sets a kernel timer for the closest one, effectiveling stepping between preemption events on each signals sent by the timer. These timers use the linux signal {\tt SIGALRM}, which is delivered to the process. This is important because when delivering signals to a process, the kernel documentation states that the signal can be delivered to any kernel thread for which the signal isn't block i.e. :
     140\subsection{Preemption} \label{preemption}
     141Finally, an important aspect for any complete threading system is preemption. As mentionned in chapter \ref{basics}, preemption introduces an extra degree of uncertainty, which enables users to have multiple threads interleave transparently, rather than having to cooperate among threads for proper scheduling and CPU distribution. Indeed, preemption is desireable because it adds a degree of isolation among threads. In a fully cooperative system, any thread that runs into a long loop can starve other threads, while in a preemptive system starvation can still occur but it does not rely on every thread having to yield or block on a regular basis, which reduces significantly a programmer burden. Obviously, preemption is not optimal for every workload, however any preemptive system can become a cooperative system by making the time-slices extremely large. Which is why \CFA uses a preemptive threading system.
     142
     143Preemption in \CFA is based on kernel timers, which are used to run a discrete-event simulation. Every processor keeps track of the current time and registers an expiration time with the preemption system. When the preemption system receives a change in preemption, it sorts these expiration times in a list and sets a kernel timer for the closest one, effectively stepping between preemption events on each signals sent by the timer. These timers use the linux signal {\tt SIGALRM}, which is delivered to the process rather than the kernel-thread. This results in an implementation problem,because when delivering signals to a process, the kernel documentation states that the signal can be delivered to any kernel thread for which the signal is not blocked i.e. :
    144144\begin{quote}
    145145A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked. If more than one of the threads has the signal unblocked, then the kernel chooses an arbitrary thread to which to deliver the signal.
     
    148148For the sake of simplicity and in order to prevent the case of having two threads receiving alarms simultaneously, \CFA programs block the {\tt SIGALRM} signal on every thread except one. Now because of how involontary context-switches are handled, the kernel thread handling {\tt SIGALRM} cannot also be a processor thread.
    149149
    150 Involontary context-switching is done by sending {\tt SIGUSER1} to the corresponding processor and having the thread yield from inside the signal handler. Effectively context-switch away from the signal-handler back to the kernel and the signal-handler frame will be unwound when the thread is scheduled again. This means that a signal-handler can start on one kernel thread and terminate on a second kernel thread (but the same user thread). It is important to note that signal-handlers save and restore signal masks because user-thread migration can cause signal mask to migrate from one kernel thread to another. This is only a problem if all kernel threads among which a user thread can migrate differ in terms of signal masks. However, since the kernel thread hanlding preemption requires a different signal mask, executing user threads on the kernel alarm thread can cause deadlocks. For this reason, the alarm thread is on a tight loop around a system call to \code{sigwait} or more specifically \code{sigwaitinfo}, requiring very little CPU time for preemption. One final detail about the alarm thread is how to wake it when additional communication is required (e.g. on thread termination). This is also done using {\tt SIGALRM}, but sent throught the \code{pthread_sigqueue}. Indeed, \code{sigwait} can differentiate signals sent from \code{pthread_sigqueue} from signals sent from alarms or the kernel.
    151 
    152 \subsection{Scheduler} \footnote{ I'm not sure what to write here, is this section even needed. }
    153 Finally, an aspect that was not mentionned yet is the scheduling algorithm. Currently, the \CFA scheduler uses a single ready queue for all processors. Will this is not the highest performance algorithm, it has the significant advantage of being robust to heterogenous workloads. This is a very simple scheduling approach but is sufficient to for the context of this thesis.
    154 
    155 What to do here?
    156 
    157 However, when
    158 As will be mentionned \ref{futur:sched} it needs to be updated when clusters will be
    159 
    160 clusters
    161 
    162 
    163 
    164 Among the most pressing updates to the \CFA
    165 uses single queue
    166 in future should move to multiple queues with workstealing
    167 general purpouse means robust > fast
    168 worksharing can higher standard deviation in performance
    169 
     150Involuntary context-switching is done by sending signal {\tt SIGUSER1} to the corresponding processor and having the thread yield from inside the signal handler. Effectively context-switching away from the signal-handler back to the kernel and the signal-handler frame is eventually unwound when the thread is scheduled again. This approach means that a signal-handler can start on one kernel thread and terminate on a second kernel thread (but the same user thread). It is important to note that signal-handlers save and restore signal masks because user-thread migration can cause signal mask to migrate from one kernel thread to another. This behaviour is only a problem if all kernel threads among which a user thread can migrate differ in terms of signal masks\footnote{Sadly, official POSIX documentation is silent on what distiguishes ``async-signal-safe'' functions from other functions}. However, since the kernel thread hanlding preemption requires a different signal mask, executing user threads on the kernel alarm thread can cause deadlocks. For this reason, the alarm thread is on a tight loop around a system call to \code{sigwaitinfo}, requiring very little CPU time for preemption. One final detail about the alarm thread is how to wake it when additional communication is required (e.g., on thread termination). This unblocking is also done using {\tt SIGALRM}, but sent throught the \code{pthread_sigqueue}. Indeed, \code{sigwait} can differentiate signals sent from \code{pthread_sigqueue} from signals sent from alarms or the kernel.
     151
     152\subsection{Scheduler}
     153Finally, an aspect that was not mentionned yet is the scheduling algorithm. Currently, the \CFA scheduler uses a single ready queue for all processors, which is the simplest approach to scheduling. Further discussion on scheduling is present in section \label{futur:sched}.
    170154
    171155% ======================================================================
     
    174158% ======================================================================
    175159% ======================================================================
    176 To ease the understanding of monitors, like many other concepts, they are generelly represented graphically. While non-scheduled monitors are simple enough for a graphical representation to be useful, internal scheduling is complex enough to justify a visual representation. The following figure is the traditionnal illustration of a monitor :
    177 
     160The following figure is the traditional illustration of a monitor (repeated from page~\pageref{fig:monitor} for convenience) :
     161
     162\begin{figure}[H]
    178163\begin{center}
    179164{\resizebox{0.4\textwidth}{!}{\input{monitor}}}
    180165\end{center}
    181 
    182 This picture has several components, the two most important being the entry-queue and the AS-stack. The entry-queue is a (almost) FIFO list where threads waiting to enter are parked, while the AS-stack is a FILO list used for threads that have been signaled or otherwise marked as running next. For \CFA, the previous picture does not have support for blocking multiple monitors on a single condition. To support \gls{bulk-acq} two changes to this picture are required. First, it doesn't make sense to tie the condition to a single monitor since blocking two monitors as one would require arbitrarily picking a monitor to hold the condition. Secondly, the object waiting on the conditions and AS-stack cannot simply contain the waiting thread since a single thread can potentially wait on multiple monitors. As mentionned in section \ref{intsched}, the handling in multiple monitors is done by partially passing, which entails that each concerned monitor needs to have a node object. However, for waiting on the condition, since all threads need to wait together, a single object needs to be queued in the condition. Moving out the condition and updating the node types yields :
    183 
     166\caption{Traditional illustration of a monitor}
     167\label{fig:monitor}
     168\end{figure}
     169
     170This picture has several components, the two most important being the entry-queue and the AS-stack. The entry-queue is an (almost) FIFO list where threads waiting to enter are parked, while the acceptor-signalor (AS) stack is a FILO list used for threads that have been signalled or otherwise marked as running next.
     171
     172For \CFA, this picture does not have support for blocking multiple monitors on a single condition. To support \gls{bulk-acq} two changes to this picture are required. First, it is non longer helpful to attach the condition to a single monitor. Secondly, the thread waiting on the conditions has to be seperated multiple monitors, which yields :
     173
     174\begin{figure}[H]
    184175\begin{center}
    185176{\resizebox{0.8\textwidth}{!}{\input{int_monitor}}}
    186177\end{center}
    187 
    188 This picture and the proper entry and leave algorithms is the fundamental implementation of internal scheduling (see listing \ref{lst:entry2}).
     178\caption{Illustration of \CFA monitor}
     179\label{fig:monitor_cfa}
     180\end{figure}
     181
     182This picture and the proper entry and leave algorithms is the fundamental implementation of internal scheduling (see listing \ref{lst:entry2}). Note that when threads are moved from the condition to the AS-stack, it splits the thread into to pieces. The thread is woken up when all the pieces have moved from the AS-stacks to the active thread seat. In this picture, the threads are split into halves but this is only because there are two monitors in this picture. For a specific signaling operation every monitor needs a piece of thread on its AS-stack.
    189183
    190184\begin{figure}[b]
     
    219213\end{figure}
    220214
    221 Some important things to notice about the exit routine. The solution discussed in \ref{intsched} can be seen in the exit routine of listing \ref{lst:entry2}. Basically, the solution boils down to having a seperate data structure for the condition queue and the AS-stack, and unconditionally transferring ownership of the monitors but only unblocking the thread when the last monitor has transferred ownership. This solution is deadlock safe as well as preventing any potential barging.
    222 
    223 The data structure used for the AS-stack are reused extensively for external scheduling, but in the case of internal scheduling, the data is allocated using variable-length arrays on the callstack of the \code{wait} and \code{signal_block} routines.
     215Some important things to notice about the exit routine. The solution discussed in \ref{intsched} can be seen in the exit routine of listing \ref{lst:entry2}. Basically, the solution boils down to having a seperate data structure for the condition queue and the AS-stack, and unconditionally transferring ownership of the monitors but only unblocking the thread when the last monitor has transferred ownership. This solution is deadlock safe as well as preventing any potential barging. The data structure used for the AS-stack are reused extensively for external scheduling, but in the case of internal scheduling, the data is allocated using variable-length arrays on the callstack of the \code{wait} and \code{signal_block} routines.
     216
     217\begin{figure}[H]
     218\begin{center}
     219{\resizebox{0.8\textwidth}{!}{\input{monitor_structs.pstex_t}}}
     220\end{center}
     221\caption{Data structures involved in internal/external scheduling}
     222\label{fig:structs}
     223\end{figure}
     224
     225Figure \ref{fig:structs} shows a high level representation of these data-structures. The main idea behind them is that, while figure \ref{fig:monitor_cfa} is a nice illustration in theory, in practice breaking a threads into multiple pieces to put unto intrusive stacks does not make sense. The \code{condition node} is the data structure that is queued into a condition variable and, when signaled, the condition queue is popped and each \code{condition criterion} are moved to the AS-stack. Once all the criterion have be popped from their respective AS-stacks, the thread is woken-up, which is what is shown in listing \ref{lst:entry2}.
    224226
    225227% ======================================================================
     
    228230% ======================================================================
    229231% ======================================================================
    230 Similarly to internal scheduling, external scheduling for multiple monitors relies on the idea that entry-queues are no longer specific to a single monitor, as mentionned in section \ref{extsched}. This means that some kind of entry-queues must be used that is aware of both monitors and which holds threads that are currently waiting to enter the critical section. This challenge is solved for internal scheduling by having the entry-queues in conditions no longer be tied to a monitor, effectively allowing conditions to be moved outside of monitors. However, in the case of external scheduling, acceptable routines must be aware of the entry queues, which means they must be stored inside at least one of the monitors that will be acquired. This in turn adds the requirement that a systematic algorithm of disambiguating which monitor holds the relevant queue regardless of user ordering. The proposed algorithm is to fall back on monitor lock ordering and specify that the monitor that is acquired first is the one with the relevant entry queue. This assumes that the lock acquiring order is static for the lifetime of all concerned objects but that is a reasonable constraint.
    231 
    232 This algorithm choice has two consequences, the entry queue of the highest priority monitor is no longer a true FIFO queue and the queue of the lowest priority monitor is both required and probably unused. The queue can no longer be a FIFO queue because instead of simply containing the waiting threads in order of arrival, they also contain a set of monitors. Therefore, another thread whos set contains the same highest priority monitor but different lower priority monitors may arrive first but enter the critical section after a thread with the correct pairing. Secondly, since it is not known at compile time which monitor will be the lowest priority monitor, every monitor needs to have the correct queues even though it is probable that some queues will go unused for the entire duration of the program, for example if a monitor is only used in a pair.
     232Similarly to internal scheduling, external scheduling for multiple monitors relies on the idea that waiting-thread queues are no longer specific to a single monitor, as mentionned in section \ref{extsched}. For internal scheduling, these queues are part of condition variables which are still unique for a given scheduling operation (e.g., no single statment uses multiple conditions). However, in the case of external scheduling, there is no equivalent object which is associated with \code{waitfor} statements. This absence means the queues holding the waiting threads must be stored inside at least one of the monitors that is acquired. The monitors being the only objects that have sufficient lifetime and are available on both sides of the \code{waitfor} statment. This requires an algorithm to choose which monitor holds the relevant queue. It is also important that said algorithm be independent of the order in which users list parameters. The proposed algorithm is to fall back on monitor lock ordering and specify that the monitor that is acquired first is the one with the relevant wainting queue. This assumes that the lock acquiring order is static for the lifetime of all concerned objects but that is a reasonable constraint.
     233
     234This algorithm choice has two consequences :
     235\begin{itemize}
     236        \item The queue of the highest priority monitor is no longer a true FIFO queue because threads can be moved to the front of the queue. These queues need to contain a set of monitors for each of the waiting threads. Therefore, another thread whose set contains the same highest priority monitor but different lower priority monitors may arrive first but enter the critical section after a thread with the correct pairing.
     237        \item The queue of the lowest priority monitor is both required and potentially unused. Indeed, since it is not known at compile time which monitor will be the lowest priority monitor, every monitor needs to have the correct queues even though it is possible that some queues will go unused for the entire duration of the program, for example if a monitor is only used in a specific pair.
     238\end{itemize}
    233239
    234240Therefore, the following modifications need to be made to support external scheduling :
    235241\begin{itemize}
    236         \item The threads waiting on the entry-queue need to keep track of which routine is trying to enter, and using which set of monitors. The \code{mutex} routine already has all the required information on it's stack so the thread only needs to keep a pointer to that information.
     242        \item The threads waiting on the entry-queue need to keep track of which routine is trying to enter, and using which set of monitors. The \code{mutex} routine already has all the required information on its stack so the thread only needs to keep a pointer to that information.
    237243        \item The monitors need to keep a mask of acceptable routines. This mask contains for each acceptable routine, a routine pointer and an array of monitors to go with it. It also needs storage to keep track of which routine was accepted. Since this information is not specific to any monitor, the monitors actually contain a pointer to an integer on the stack of the waiting thread. Note that the complete mask can be pushed to any owned monitors, regardless of \code{when} statements, the \code{waitfor} statement is used in a context where the thread already has full ownership of (at least) every concerned monitor and therefore monitors will refuse all calls no matter what.
    238244        \item The entry/exit routine need to be updated as shown in listing \ref{lst:entry3}.
    239245\end{itemize}
    240246
     247\subsection{External scheduling - destructors}
    241248Finally, to support the ordering inversion of destructors, the code generation needs to be modified to use a special entry routine. This routine is needed because of the storage requirements of the call order inversion. Indeed, when waiting for the destructors, storage is need for the waiting context and the lifetime of said storage needs to outlive the waiting operation it is needed for. For regular \code{waitfor} statements, the callstack of the routine itself matches this requirement but it is no longer the case when waiting for the destructor since it is pushed on to the AS-stack for later. The waitfor semantics can then be adjusted correspondingly, as seen in listing \ref{lst:entry-dtor}
    242249
     
    250257        continue
    251258elif matches waitfor mask
    252         push waiter to AS-stack
     259        push criterions to AS-stack
    253260        continue
    254261else
     
    265272                if all monitors ready
    266273                        wake-up thread
     274                endif
     275        endif
    267276
    268277        if entry queue not empty
    269278                wake-up thread
     279        endif
    270280\end{pseudo}
    271281\end{multicols}
     
    295305Waitfor
    296306\begin{pseudo}
    297 lock all monitors
    298307if matching thread is already there
    299308        if found destructor
     
    303312                push self to AS-stack
    304313                baton pass
     314        endif
    305315        return
    306 
     316endif
    307317if non-blocking
    308318        Unlock all monitors
    309319        Return
     320endif
    310321
    311322push self to AS-stack
  • doc/proposals/concurrency/text/parallelism.tex

    r6d2386e r20ffcf3  
    1515Examples of languages that support \glspl{uthread} are Erlang~\cite{Erlang} and \uC~\cite{uC++book}.
    1616
    17 \subsection{Fibers : user-level threads without preemption}
     17\subsection{Fibers : user-level threads without preemption} \label{fibers}
    1818A popular varient of \glspl{uthread} is what is often refered to as \glspl{fiber}. However, \glspl{fiber} do not present meaningful semantical differences with \glspl{uthread}. The significant difference between \glspl{uthread} and \glspl{fiber} is the lack of \gls{preemption} in the later one. Advocates of \glspl{fiber} list their high performance and ease of implementation as majors strenghts of \glspl{fiber} but the performance difference between \glspl{uthread} and \glspl{fiber} is controversial, and the ease of implementation, while true, is a weak argument in the context of language design. Therefore this proposal largely ignores fibers.
    1919
  • doc/proposals/concurrency/text/results.tex

    r6d2386e r20ffcf3  
    11% ======================================================================
    22% ======================================================================
    3 \chapter{Performance results}
     3\chapter{Performance results} \label{results}
    44% ======================================================================
    55% ======================================================================
    6 
    76\section{Machine setup}
    8 
    9 \begin{figure}
     7Table \ref{tab:machine} shows the characteristiques of the machine used to run the benchmarks. All tests where made on this machine.
     8\begin{figure}[H]
    109\begin{center}
    1110\begin{tabular}{| l | r | l | r |}
     
    3736
    3837\section{Micro benchmarks}
     38All benchmarks are run using the same harness to produce the results, seen as the \code{BENCH()} macro in the following examples. This macro uses the following logic to benchmark the code :
     39\begin{pseudo}
     40#define BENCH(run, result)
     41        gettime();
     42        run;
     43        gettime();
     44        result = (after - before) / N;
     45\end{pseudo}
     46The method used to get time is \code{clock_gettime(CLOCK_THREAD_CPUTIME_ID);}. Each benchmark is using many interations of a simple call to measure the cost of the call. The specific number of interation dependes on the specific benchmark.
     47
     48\subsection{Context-switching}
     49The first interesting benchmark is to measure how long context-switches take. The simplest approach to do this is to yield on a thread, which executes a 2-step context switch. In order to make the comparison fair, coroutines also execute a 2-step context-switch, which is a resume/suspend cycle instead of a yield. Listing \ref{lst:ctx-switch} shows the code for coroutines and threads. All omitted tests are functionally identical to one of these tests. The results can be shown in table \ref{tab:ctx-switch}.
     50\begin{figure}
     51\begin{multicols}{2}
     52\CFA Coroutines
     53\begin{cfacode}
     54coroutine GreatSuspender {};
     55void main(GreatSuspender& this) {
     56        while(true) { suspend(); }
     57}
     58int main() {
     59        GreatSuspender s;
     60        resume(s);
     61        BENCH(
     62                for(size_t i=0; i<n; i++) {
     63                        resume(s);
     64                },
     65                result
     66        )
     67        printf("%llu\n", result);
     68}
     69\end{cfacode}
     70\columnbreak
     71\CFA Threads
     72\begin{cfacode}
     73
     74
     75
     76
     77int main() {
     78
     79
     80        BENCH(
     81                for(size_t i=0; i<n; i++) {
     82                        yield();
     83                },
     84                result
     85        )
     86        printf("%llu\n", result);
     87}
     88\end{cfacode}
     89\end{multicols}
     90\caption{\CFA benchmark code used to measure context-switches for coroutines and threads.}
     91\label{lst:ctx-switch}
     92\end{figure}
    3993
    4094\begin{figure}
     
    54108\caption{Context Switch comparaison. All numbers are in nanoseconds(\si{\nano\second})}
    55109\label{tab:ctx-switch}
     110\end{figure}
     111
     112\subsection{Mutual-exclusion}
     113The next interesting benchmark is to measure the overhead to enter/leave a critical-section. For monitors, the simplest appraoch is to measure how long it takes enter and leave a monitor routine. Listing \ref{lst:mutex} shows the code for \CFA. To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a pthread mutex lock are also mesured. The results can be shown in table \ref{tab:mutex}.
     114
     115\begin{figure}
     116\begin{cfacode}
     117monitor M {};
     118void __attribute__((noinline)) call( M & mutex m /*, m2, m3, m4*/ ) {}
     119
     120int main() {
     121        M m/*, m2, m3, m4*/;
     122        BENCH(
     123                for(size_t i=0; i<n; i++) {
     124                        call(m/*, m2, m3, m4*/);
     125                },
     126                result
     127        )
     128        printf("%llu\n", result);
     129}
     130\end{cfacode}
     131\caption{\CFA benchmark code used to measure mutex routines.}
     132\label{lst:mutex}
    56133\end{figure}
    57134
     
    75152\end{figure}
    76153
     154\subsection{Internal scheduling}
     155The Internal scheduling benchmark measures the cost of waiting on and signaling a condition variable. Listing \ref{lst:int-sched} shows the code for \CFA. The results can be shown in table \ref{tab:int-sched}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests.
     156
     157\begin{figure}
     158\begin{cfacode}
     159volatile int go = 0;
     160condition c;
     161monitor M {};
     162M m1;
     163
     164void __attribute__((noinline)) do_call( M & mutex a1 ) { signal(c); }
     165
     166thread T {};
     167void ^?{}( T & mutex this ) {}
     168void main( T & this ) {
     169        while(go == 0) { yield(); }
     170        while(go == 1) { do_call(m1); }
     171}
     172int  __attribute__((noinline)) do_wait( M & mutex a1 ) {
     173        go = 1;
     174        BENCH(
     175                for(size_t i=0; i<n; i++) {
     176                        wait(c);
     177                },
     178                result
     179        )
     180        printf("%llu\n", result);
     181        go = 0;
     182        return 0;
     183}
     184int main() {
     185        T t;
     186        return do_wait(m1);
     187}
     188\end{cfacode}
     189\caption{Benchmark code for internal scheduling}
     190\label{lst:int-sched}
     191\end{figure}
     192
    77193\begin{figure}
    78194\begin{center}
     
    92208\end{figure}
    93209
     210\subsection{External scheduling}
     211The Internal scheduling benchmark measures the cost of the \code{waitfor} statement (\code{_Accept} in \uC). Listing \ref{lst:ext-sched} shows the code for \CFA. The results can be shown in table \ref{tab:ext-sched}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests.
     212
     213\begin{figure}
     214\begin{cfacode}
     215volatile int go = 0;
     216monitor M {};
     217M m1;
     218thread T {};
     219
     220void __attribute__((noinline)) do_call( M & mutex a1 ) {}
     221
     222void ^?{}( T & mutex this ) {}
     223void main( T & this ) {
     224        while(go == 0) { yield(); }
     225        while(go == 1) { do_call(m1); }
     226}
     227int  __attribute__((noinline)) do_wait( M & mutex a1 ) {
     228        go = 1;
     229        BENCH(
     230                for(size_t i=0; i<n; i++) {
     231                        waitfor(call, a1);
     232                },
     233                result
     234        )
     235        printf("%llu\n", result);
     236        go = 0;
     237        return 0;
     238}
     239int main() {
     240        T t;
     241        return do_wait(m1);
     242}
     243\end{cfacode}
     244\caption{Benchmark code for external scheduling}
     245\label{lst:ext-sched}
     246\end{figure}
     247
    94248\begin{figure}
    95249\begin{center}
     
    109263\end{figure}
    110264
    111 \begin{figure}
    112 \begin{center}
    113 \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |}
    114 \cline{2-4}
    115 \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\
    116 \hline
    117 Pthreads                & 26974.5       & 26977 & 124.12 \\
    118 \CFA Coroutines & 5             & 5             & 0      \\
    119 \CFA Threads    & 1122.5        & 1109.86       & 36.54  \\
    120 \uC Coroutines  & 106           & 107.04        & 1.61   \\
    121 \uC Threads             & 525.5 & 533.04        & 11.14  \\
     265\subsection{Object creation}
     266Finaly, the last benchmark measured is the cost of creation for concurrent objects. Listing \ref{lst:creation} shows the code for pthreads and \CFA threads. The results can be shown in table \ref{tab:creation}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests. The only note here is that the callstacks of \CFA coroutines are lazily created, therefore without priming the coroutine, the creation cost is very low.
     267
     268\begin{figure}
     269\begin{multicols}{2}
     270pthread
     271\begin{cfacode}
     272int main() {
     273        BENCH(
     274                for(size_t i=0; i<n; i++) {
     275                        pthread_t thread;
     276                        if(pthread_create(
     277                                &thread,
     278                                NULL,
     279                                foo,
     280                                NULL
     281                        ) < 0) {
     282                                perror( "failure" );
     283                                return 1;
     284                        }
     285
     286                        if(pthread_join(
     287                                thread,
     288                                NULL
     289                        ) < 0) {
     290                                perror( "failure" );
     291                                return 1;
     292                        }
     293                },
     294                result
     295        )
     296        printf("%llu\n", result);
     297}
     298\end{cfacode}
     299\columnbreak
     300\CFA Threads
     301\begin{cfacode}
     302int main() {
     303        BENCH(
     304                for(size_t i=0; i<n; i++) {
     305                        MyThread m;
     306                },
     307                result
     308        )
     309
     310        printf("%llu\n", result);
     311}
     312\end{cfacode}
     313\end{multicols}
     314\caption{Bechmark code for pthreads and \CFA to measure object creation}
     315\label{lst:creation}
     316\end{figure}
     317
     318\begin{figure}
     319\begin{center}
     320\begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |}
     321\cline{2-4}
     322\multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\
     323\hline
     324Pthreads                        & 26974.5       & 26977 & 124.12 \\
     325\CFA Coroutines Lazy    & 5             & 5             & 0      \\
     326\CFA Coroutines Eager   & 335.0 & 357.67        & 34.2   \\
     327\CFA Threads            & 1122.5        & 1109.86       & 36.54  \\
     328\uC Coroutines          & 106           & 107.04        & 1.61   \\
     329\uC Threads                     & 525.5 & 533.04        & 11.14  \\
    122330\hline
    123331\end{tabular}
  • doc/proposals/concurrency/text/together.tex

    r6d2386e r20ffcf3  
    77
    88\section{Threads as monitors}
    9 As it was subtely alluded in section \ref{threads}, \code{threads} in \CFA are in fact monitors. This means that all the monitors features are available when using threads. For example, here is a very simple two thread pipeline that could be used for a simulator of a game engine :
     9As it was subtely alluded in section \ref{threads}, \code{threads} in \CFA are in fact monitors, which means that all monitor features are available when using threads. For example, here is a very simple two thread pipeline that could be used for a simulator of a game engine :
    1010\begin{cfacode}
    1111// Visualization declaration
     
    7272        }
    7373}
     74
     75// Call destructor for simulator once simulator finishes
     76// Call destructor for renderer to signify shutdown
    7477\end{cfacode}
    7578
    7679\section{Fibers \& Threads}
     80As mentionned in section \ref{preemption}, \CFA uses preemptive threads by default but can use fibers on demand. Currently, using fibers is done by adding the following line of code to the program~:
     81\begin{cfacode}
     82unsigned int default_preemption() {
     83        return 0;
     84}
     85\end{cfacode}
     86This function is called by the kernel to fetch the default preemption rate, where 0 signifies an infinite time-slice i.e. no preemption. However, once clusters are fully implemented, it will be possible to create fibers and uthreads in on the same system :
     87\begin{figure}
     88\begin{cfacode}
     89//Cluster forward declaration
     90struct cluster;
     91
     92//Processor forward declaration
     93struct processor;
     94
     95//Construct clusters with a preemption rate
     96void ?{}(cluster& this, unsigned int rate);
     97//Construct processor and add it to cluster
     98void ?{}(processor& this, cluster& cluster);
     99//Construct thread and schedule it on cluster
     100void ?{}(thread& this, cluster& cluster);
     101
     102//Declare two clusters
     103cluster thread_cluster = { 10`ms };                     //Preempt every 10 ms
     104cluster fibers_cluster = { 0 };                         //Never preempt
     105
     106//Construct 4 processors
     107processor processors[4] = {
     108        //2 for the thread cluster
     109        thread_cluster;
     110        thread_cluster;
     111        //2 for the fibers cluster
     112        fibers_cluster;
     113        fibers_cluster;
     114};
     115
     116//Declares thread
     117thread UThread {};
     118void ?{}(UThread& this) {
     119        //Construct underlying thread to automatically
     120        //be scheduled on the thread cluster
     121        (this){ thread_cluster }
     122}
     123
     124void main(UThread & this);
     125
     126//Declares fibers
     127thread Fiber {};
     128void ?{}(Fiber& this) {
     129        //Construct underlying thread to automatically
     130        //be scheduled on the fiber cluster
     131        (this.__thread){ fibers_cluster }
     132}
     133
     134void main(Fiber & this);
     135\end{cfacode}
     136\end{figure}
Note: See TracChangeset for help on using the changeset viewer.