\chapter{Scheduling in practice}\label{practice}
The scheduling algorithm described in Chapter~\ref{core} addresses scheduling in a stable state.
However, it does not address problems that occur when the system changes state.
Indeed, the \CFA runtime supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically.
This entails that the scheduling algorithm must support these transitions.

More precisely, \CFA supports adding \procs using the RAII object @processor@.
These objects can be created and destroyed at any time.
They are normally created as automatic stack variables, but this is not a requirement.
The consequence is that the scheduler and \io subsystems must support \procs coming in and out of existence.

\section{Manual Resizing}
The consequence of dynamically changing the number of \procs is that all internal arrays that are sized based on the number of \procs need to be \texttt{realloc}ed.
This also means that any references into these arrays, pointers or indexes, may need to be fixed when shrinking\footnote{Indexes may still need fixing because there is no guarantee the \proc causing the shrink had the highest index. Therefore indexes need to be reassigned to preserve contiguous indexes.}.

There are no performance requirements, within reason, for resizing since this is usually considered as part of setup and teardown.
However, this operation has strict correctness requirements since shrinking and idle sleep can easily lead to deadlocks.
It should also avoid as much as possible any effect on performance when the number of \procs remains constant.
This latter requirement prohibits simple solutions, like simply adding a global lock to these arrays.

\subsection{Read-Copy-Update}
One solution is to use the Read-Copy-Update\cite{wiki:rcu} pattern.
In this pattern, resizing is done by creating a copy of the internal data structures, updating the copy with the desired changes, and then attempting an Indiana Jones Switch to replace the original with the copy.
This approach potentially has the advantage that it may not need any synchronization to do the switch.
However, the switch definitely implies a race where \procs could still use the previous, original, data structure after the copy was switched in.
The important question then becomes whether or not this race can be recovered from.
If the changes that arrived late can be transferred from the original to the copy, then this solution works.

For linked-lists, dequeuing is somewhat of a problem.
Dequeuing from the original will not necessarily update the copy, which could lead to multiple \procs dequeuing the same \at.
Fixing this requires making the array contain pointers to subqueues rather than the subqueues themselves.

Another challenge is that the original must be kept until all \procs have witnessed the change.
This is a straightforward memory reclamation challenge, but it does mean that every operation will need \emph{some} form of synchronization.
If each of these operations does need synchronization, then it is possible a simpler solution achieves the same performance.
Indeed, in addition to the classic challenge of memory reclamation, transferring the original data to the copy before reclaiming it poses additional challenges, especially merging subqueues while having a minimal impact on fairness and locality.

\subsection{Readers-Writer Lock}
A simpler approach would be to use a \newterm{Readers-Writer Lock}\cite{wiki:rwlock} where resizing requires acquiring the lock as a writer, while simply enqueuing/dequeuing \ats requires acquiring the lock as a reader.
Using a Readers-Writer lock solves the problem of dynamically resizing and leaves the challenge of finding or building a lock with sufficiently good read-side performance.
Since finding or building such a lock is not a very complex challenge and an ad-hoc solution is perfectly acceptable, building a Readers-Writer lock was the path taken.

To maximize reader scalability, the readers should not contend with each other when attempting to acquire and release the critical section.
This effectively requires that each reader have its own piece of memory to mark as locked and unlocked.
Readers acquire the lock by first waiting for writers to finish the critical section and then acquiring their local spinlock.
Writers acquire the global lock, so writers have mutual exclusion among themselves, and then acquire each of the local reader locks.
Acquiring all the local locks guarantees mutual exclusion between the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer.
\todo{reference listings}

\begin{lstlisting}
void read_lock() {
	// Step 1 : make sure no writers in
	while write_lock { Pause(); }

	// May need fence here

	// Step 2 : acquire our local lock
	while atomic_xchg( tls.lock ) {
		Pause();
	}
}

void read_unlock() {
	tls.lock = false;
}
\end{lstlisting}

\begin{lstlisting}
void write_lock() {
	// Step 1 : lock global lock
	while atomic_xchg( write_lock ) {
		Pause();
	}

	// Step 2 : lock per-proc locks
	for t in all_tls {
		while atomic_xchg( t.lock ) {
			Pause();
		}
	}
}

void write_unlock() {
	// Step 1 : release local locks
	for t in all_tls {
		t.lock = false;
	}

	// Step 2 : release global lock
	write_lock = false;
}
\end{lstlisting}

\section{Idle-Sleep}

\subsection{Tracking Sleepers}

\subsection{Event FDs}

\subsection{Epoll}

\subsection{\texttt{io\_uring}}

\subsection{Reducing Latency}