\chapter{Behind the Scenes}


% ======================================================================
% ======================================================================
\section{Implementation Details: Interaction with polymorphism}
% ======================================================================
% ======================================================================
Depending on the choice of semantics for when monitor locks are acquired, interaction between monitors and \CFA's concept of polymorphism can be complex to support. However, it is shown that entry-point locking solves most of the issues.

First of all, interaction between \code{otype} polymorphism and monitors is impossible since monitors do not support copying. Therefore, the main question is how to support \code{dtype} polymorphism. Since a monitor's main purpose is to ensure mutual exclusion when accessing shared data, this implies that mutual exclusion is only required for routines that do in fact access shared data. However, since \code{dtype} polymorphism always handles incomplete types (by definition), no \code{dtype} polymorphic routine can access shared data since the data requires knowledge about the type. Therefore, the only concern when combining \code{dtype} polymorphism and monitors is to protect access to routines.

Before looking into complex control-flow, it is important to present the difference between the two acquiring options: callsite and entry-point locking, i.e., acquiring the monitors before making a mutex routine call or as the first operation of the mutex routine call. For example:
\begin{figure}
\label{fig:locking-site}
\begin{center}
\setlength\tabcolsep{1.5pt}
\begin{tabular}{|c|c|c|}
Mutex & \gls{callsite-locking} & \gls{entry-point-locking} \\
call & pseudo-code & pseudo-code \\
\hline
\begin{cfacode}[tabsize=3]
void foo(monitor& mutex a){

	//Do Work
	//...

}

void main() {
	monitor a;

	foo(a);

}
\end{cfacode} & \begin{pseudo}[tabsize=3]
foo(& a) {

	//Do Work
	//...

}

main() {
	monitor a;
	acquire(a);
	foo(a);
	release(a);
}
\end{pseudo} & \begin{pseudo}[tabsize=3]
foo(& a) {
	acquire(a);
	//Do Work
	//...
	release(a);
}

main() {
	monitor a;

	foo(a);

}
\end{pseudo}
\end{tabular}
\end{center}
\caption{Callsite vs entry-point locking for mutex calls}
\end{figure}


Note that the \code{mutex} keyword relies on the type system. Therefore, when a generic monitor routine is actually desired, a mutex routine can still be written by using the proper trait, which is possible because monitors are designed in terms of a trait. For example:
\begin{cfacode}
//Incorrect: T is not a monitor
forall(dtype T)
void foo(T * mutex t);

//Correct: this function only works on monitors (any monitor)
forall(dtype T | is_monitor(T))
void bar(T * mutex t);
\end{cfacode}


% ======================================================================
% ======================================================================
\section{Internal scheduling: Implementation} \label{inschedimpl}
% ======================================================================
% ======================================================================
There are several challenges specific to \CFA when implementing internal scheduling. These challenges are direct results of \gls{bulk-acq} and loose object definitions. These two constraints are the root cause of most design decisions in the implementation of internal scheduling. Furthermore, to avoid the headaches of dynamically allocating memory in a concurrent environment, the internal-scheduling design is entirely free of mallocs and other dynamic memory allocation schemes. This is to avoid the chicken and egg problem \cite{Chicken} of having a memory allocator that relies on the threading system and a threading system that relies on the runtime. This extra goal means that memory management is a constant concern in the design of the system.

The main memory concern for concurrency is queues. All blocking operations are made by parking threads onto queues. These queues need to be intrusive\cit to avoid the need for memory allocation. This entails that all the fields needed to keep track of the relevant information must be stored in the queued objects themselves. Since internal scheduling can use an unbounded amount of memory (depending on \gls{bulk-acq}), statically defining the information in the intrusive fields of threads is insufficient. The only variable-sized container that does not require memory allocation is the call stack, which is heavily used in the implementation of internal scheduling, particularly through variable length arrays (a C99 feature that GCC also provides as an extension).

Since stack allocation is based around scope, the first step of the implementation is to identify the scopes that are available to store the information, and which of these can have a variable length. In the case of external scheduling, the threads and the condition both allow a fixed amount of memory to be stored, while mutex routines and the actual blocking call allow for an unbounded amount (though adding too much to the mutex-routine stack size can quickly become expensive).
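
The idea of keeping queue nodes in a variable-length scope can be illustrated with a minimal sketch in plain C. It is not the \CFA runtime code: every type and function name in it is hypothetical, and the blocking step is replaced by simply draining the queue, so only the allocation-free bookkeeping is shown.

\begin{cfacode}
#include <assert.h>
#include <stddef.h>

// Hypothetical intrusive node: the link field lives inside the node,
// so queueing a node never allocates memory.
struct wait_node {
	struct wait_node * next;
	void * monitor;                  // monitor this node stands for
};

// Hypothetical intrusive FIFO queue of wait nodes.
struct node_queue {
	struct wait_node * head;
	struct wait_node ** tail;
};

static void queue_init(struct node_queue * q) {
	q->head = NULL;
	q->tail = &q->head;
}

static void queue_push(struct node_queue * q, struct wait_node * n) {
	n->next = NULL;
	*q->tail = n;
	q->tail = &n->next;
}

static struct wait_node * queue_pop(struct node_queue * q) {
	struct wait_node * n = q->head;
	if (n) {
		q->head = n->next;
		if (!q->head) q->tail = &q->head;
	}
	return n;
}

// Stand-in for a blocking call: one node per monitor is created in a
// variable length array on the caller's stack and linked into a queue.
// A real implementation would block here; this sketch only drains the
// queue and returns how many nodes were queued.
static size_t park_on(void * monitors[], size_t count) {
	assert(count > 0);
	struct wait_node nodes[count];   // VLA: size known only at run time
	struct node_queue q;
	queue_init(&q);
	for (size_t i = 0; i < count; i += 1) {
		nodes[i].monitor = monitors[i];
		queue_push(&q, &nodes[i]);
	}
	size_t queued = 0;
	while (queue_pop(&q)) queued += 1;
	return queued;
}
\end{cfacode}

Because the nodes live in the frame of the blocking call, they remain valid for exactly as long as the thread is parked, which is why no heap allocation is needed in this sketch.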

The following figure is the traditional illustration of a monitor:

\begin{center}
{\resizebox{0.4\textwidth}{!}{\input{monitor}}}
\end{center}

For \CFA, the previous picture does not have support for blocking multiple monitors on a single condition. To support \gls{bulk-acq}, two changes to this picture are required. First, it does not make sense to tie the condition to a single monitor, since blocking two monitors as one would require arbitrarily picking a monitor to hold the condition. Second, the objects waiting on the condition and the AS-stack cannot simply contain the waiting thread, since a single thread can potentially wait on multiple monitors. As mentioned in section \ref{inschedimpl}, the handling of multiple monitors is done by partially passing, which entails that each concerned monitor needs to have a node object. However, for waiting on the condition, since all threads need to wait together, a single object needs to be queued in the condition. Moving out the condition and updating the node types yields:

\begin{center}
{\resizebox{0.8\textwidth}{!}{\input{int_monitor}}}
\end{center}

\newpage

This picture and the proper entry and leave algorithms are the fundamental implementation of internal scheduling.

\begin{multicols}{2}
Entry
\begin{pseudo}[numbers=left]
if monitor is free
	enter
elif I already own the monitor
	continue
else
	block
increment recursion

\end{pseudo}
\columnbreak
Exit
\begin{pseudo}[numbers=left, firstnumber=8]
decrement recursion
if recursion == 0
	if signal_stack not empty
		set_owner to thread
		if all monitors ready
			wake-up thread

	if entry queue not empty
		wake-up thread
\end{pseudo}
\end{multicols}

There are some important things to notice about the exit routine. The solution discussed in \ref{inschedimpl} can be seen on line 11 of the previous pseudo code. Basically, the solution boils down to having a separate data structure for the condition queue and the AS-stack, and unconditionally transferring ownership of the monitors but only unblocking the thread when the last monitor has transferred ownership. This solution is safe and also prevents any potential barging.
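
The ownership and recursion bookkeeping of the entry and exit routines can be sketched in plain C as follows. This is a simplified, single-monitor illustration under stated assumptions, not the actual runtime: all names are hypothetical, blocking is modelled by returning \code{false} instead of parking the thread, and the hand-off to the signal stack or the entry queue is reduced to releasing the monitor.

\begin{cfacode}
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct thread_desc { int id; };      // hypothetical thread descriptor

struct monitor_desc {                // hypothetical monitor descriptor
	struct thread_desc * owner;      // NULL when the monitor is free
	unsigned recursion;              // nesting depth of the current owner
};

// Entry: returns true if the caller owns the monitor afterwards,
// false where the real routine would block the caller.
static bool try_enter(struct monitor_desc * m, struct thread_desc * self) {
	if (m->owner == NULL) {
		m->owner = self;             // monitor is free: enter
	} else if (m->owner != self) {
		return false;                // owned by another thread: would block
	}
	m->recursion += 1;               // free or already owned: count the entry
	return true;
}

// Exit: returns true if this call released the monitor.
static bool leave(struct monitor_desc * m, struct thread_desc * self) {
	assert(m->owner == self && m->recursion > 0);
	m->recursion -= 1;
	if (m->recursion != 0) return false;   // still inside a nested call
	// A real runtime would transfer ownership here to a thread on the
	// signal stack or the entry queue; this sketch just frees the monitor.
	m->owner = NULL;
	return true;
}
\end{cfacode}

In this sketch a monitor entered twice by the same thread is only released by the second call to the exit routine, mirroring the recursion check at the top of the exit pseudo code.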

% ======================================================================
% ======================================================================
\section{Implementation Details: External scheduling queues}
% ======================================================================
% ======================================================================
Supporting multi-monitor external scheduling means that some kind of entry queue must be used that is aware of both monitors. However, acceptable routines must be aware of the entry queues, which means the queues must be stored inside at least one of the monitors that will be acquired. This in turn adds the requirement of a systematic algorithm for disambiguating which queue is relevant regardless of user ordering. The proposed algorithm is to fall back on the monitors' lock ordering and specify that the monitor that is acquired first is the lock with the relevant entry queue. This assumes that the lock-acquiring order is static for the lifetime of all concerned objects, but that is a reasonable constraint. This algorithm choice has two consequences: the entry queue of the highest priority monitor is no longer a true FIFO queue, and the queue of the lowest priority monitor is both required and probably unused. The queue can no longer be a FIFO queue because, instead of simply containing the waiting threads in order of arrival, it also contains the second mutex. Therefore, another thread with the same highest priority monitor but a different lowest priority monitor may arrive first but enter the critical section after a thread with the correct pairing. Secondly, since it may not be known at compile time which monitor will be the lowest priority monitor, every monitor needs to have the correct queues even though it is probable that half the multi-monitor queues will go unused for the entire duration of the program.
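
The disambiguation step can be illustrated with a short sketch in plain C. It assumes, purely for illustration, that the lock ordering is the ordering of monitor addresses; the text above only requires that the order be static, and all names in the sketch are hypothetical.

\begin{cfacode}
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical monitor descriptor; only its address matters here.
struct monitor_desc { int entry_queue_len; };

// Returns the monitor whose entry queue is used for a group of monitors:
// the one acquired first, which in this sketch is assumed to be the one
// with the lowest address. The result does not depend on the order in
// which the user listed the monitors.
static struct monitor_desc * queue_holder(struct monitor_desc * monitors[], size_t count) {
	assert(count > 0);
	struct monitor_desc * first = monitors[0];
	for (size_t i = 1; i < count; i += 1) {
		if ((uintptr_t)monitors[i] < (uintptr_t)first) first = monitors[i];
	}
	return first;
}
\end{cfacode}

Calling this routine with the same two monitors in either order yields the same queue-holding monitor, which is the property the algorithm relies on.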


\section{Internals}
The complete mask can be pushed to any one of the monitors: at this point the context already has full ownership of (at least) every concerned monitor, and therefore the monitors refuse all calls no matter what.