# Changeset 4a40f69 for doc/theses/thierry_delisle_PhD/comp_II/comp_II.tex

Ignore:
Timestamp:
Mar 5, 2020, 9:40:02 PM (3 years ago)
Branches:
arm-eh, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
Children:
42a99b0, 9c6f459
Parents:
bc9384ac
Message:

 rbc9384ac \documentclass[11pt,fullpage]{article} \documentclass[11pt]{article} \usepackage{fullpage} \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage{glossaries} \usepackage{textcomp} \usepackage{geometry} \usepackage{float} %\usepackage[margin=1in]{geometry} %\usepackage{float} % cfa macros used in the document A simple ready-queue can be built from a FIFO queue, user-threads are pushed onto the queue when they are ready to run and processors (kernel-threads acting as virtual processors) pop the user-threads from the queue and execute them. Using Trevor's paper\cit as basis, it is simple to build a relaxed FIFO list that is fast and scalable for loaded or overloaded systems. The described queue uses an array of underlying strictly FIFO queues as shown in Figure~\ref{fig:base}\footnote{For this section, the number of underlying queues is assumed to be constant, Section~\ref{sec:resize} will discuss resizing the array.}. Pushing new data is done by selecting one of these underlying queues at random, recording a timestamp for the push and pushing to the selected queue. Popping is done by selecting two queues at random and popping from the queue for which the head has the oldest timestamp. A higher number of underlying queues leads to less contention on each queue and therefore better performance. In a loaded system, it is higly likely the queues are non-empty, i.e., several tasks are on each of the underlying queues. This means that selecting a queue at random to pop from is higly likely to yield a queue with available items. In Figure~\ref{fig:base}, ignoring the ellipsis, the chances of getting an empty queue is 2/7 per pick, meaning two randoms pick will yield an item approximately 9 times out of 10. \begin{figure}[H] \begin{center} {\resizebox{0.8\textwidth}{!}{\input{base}}} \begin{figure} \begin{center} %               {\resizebox{0.8\textwidth}{!}{\input{base}}} \input{base} \end{center} \caption{Relaxed FIFO list at the base of the scheduler: an array of strictly FIFO lists. } \end{figure} \begin{figure}[H] \begin{center} {\resizebox{0.8\textwidth}{!}{\input{empty}}} \begin{figure} \begin{center} %               {\resizebox{0.8\textwidth}{!}{\input{empty}}} \input{empty} \end{center} \caption{More empty'' state of the queue: the array contains many empty cells.} When the ready queue is "more empty", i.e., several of the inner queues are empty, selecting a random queue for popping is less likely to yield a valid selection and more attempts need to be made, resulting in a performance degradation. Figure~\ref{fig:empty} shows an example with fewer elements where the chances of getting an empty queue is 5/7 per pick, meaning two randoms pick will yield an item only half the time. Since the overarching queue is not empty, the pop operation \emph{must} find an element before returning and therefore must retry. Overall performance is therefore influenced by the contention on the underlying queues and pop performance is influenced by the items density. This leads to four performance cases, as depicted in Table~\ref{tab:perfcases}. \begin{table}[H] \begin{table} \begin{center} \begin{tabular}{|c|c|c|} A bitmask can be used to identify which inner queues are currently in use, as shown in Figure~\ref{fig:emptybit}. This means that processors can often find user-threads in constant time, regardless of how many underlying queues are empty. Furthermore, modern x86 CPUs have an extension (BMI2) which allow using the bitmask with very little overhead compared to a filled readyqueue, offerring decent performance even in cases with many empty inner queues. However, this technique has its limits, with a single word\footnote{Word refers here to however many bits can be written atomicly.} bitmask, the total number of underlying queues in the overarching queue is limited to the number of bits in the word. With a multi-word bitmask, this maximum limit can be increased arbitrarily, but it is not possible to check if the queue is empty by reading the bitmask atomicly. A dense bitmap, i.e., either a single word bitmask or a multi word bitmask where all words are densely packed, also causes additionnal problems in case~C (Table~\ref{tab:perfcases}), which the increased contention on the bitmask both causes new performance degradation and means the accuracy of the bitmask is less reliable due to more hardware threads potentially racing to read and/or update that information. \begin{figure}[H] \begin{figure} \begin{center} {\resizebox{0.8\textwidth}{!}{\input{emptybit}}} Another approach is to use a hiearchical data structure, for example Figure~\ref{fig:emptytree}. Creating a tree of nodes to reduce contention has been shown to work in similar cases\cit(SNZI: Scalable NonZero Indicators)\footnote{This particular paper seems to be patented in the US. How does that affect \CFA? Can I use it in my work?}. However, this approach may lead to poorer performance in case~B (Table~\ref{tab:perfcases}) due to the inherent pointer chasing cost and already low contention cost in that case. \begin{figure}[H] \begin{figure} \begin{center} {\resizebox{0.8\textwidth}{!}{\input{emptytree}}} The \CFA runtime system currently handles dynamically adding and removing processors from clusters at any time. Since this is part of the existing design, the proposed scheduler must also support this behaviour. However, dynamicly resizing the clusters is considered a rare event associated with setup, teardown and major configuration changes. This assumptions is made both in the design of the proposed scheduler as well as in the original design of the \CFA runtime system. As such, the proposed scheduler must honor the correctness of these behaviour but does not have any performance objectives with regards to resizing a cluster. How long adding or removing processors take and how much this disrupts the performance of other threads is considered a secondary concern since it should be amortized over long period of times. However, as mentionned in Section~\ref{sec:queue}, contention on the underlying queues can have a direct impact on performance, the number of underlying queues must therefore be adjusted as the number of processors grows or shrinks. Since the underlying queues are stored in a dense array, changing the number of queues requires resizing the array and therefore moving it. This can introduce memory reclamation problems if not done correctly. \begin{figure}[H] \begin{center} {\resizebox{0.8\textwidth}{!}{\input{resize}}} \begin{figure} \begin{center} %               {\resizebox{0.8\textwidth}{!}{\input{resize}}} \input{resize} \end{center} \caption{Copy of data structure shown in Figure~\ref{fig:base}. The cells of the array can be modified concurrently but resizing the array, which requires moving it, is not safe to do concurrently. This can also be true of the accompanying data structures used to find non-empty queues.}