\chapter{Conclusion}\label{conclusion}

\Gls{uthrding} is a popular approach to concurrency, and supporting it is a natural fit for \CFA.
An important aspect of this approach to threading is how threads are scheduled.
As \CFA aims to increase the productivity and safety of C while maintaining its performance, so too should its threading runtime achieve these goals.
For scheduling, productivity and safety manifest as removing pitfalls from the efficient use of the threading runtime.
This thesis contributes to this goal by presenting a low-latency scheduler that offers improved starvation prevention compared to other state-of-the-art schedulers.
It presents a core algorithm (Chapter~\ref{core}) that provides increased fairness through helping (Section~\ref{helping}), as well as optimizations that virtually remove the cost of this fairness (Section~\ref{relaxedtimes}).
Building upon this fundamental scheduling algorithm, an implementation of user-level \io blocking is presented (Chapter~\ref{io}), which achieves the same balance of performance and fairness as the scheduler itself.
From these core algorithms, a low-latency idle-sleep mechanism is presented (Chapter~\ref{practice}), which allows the \CFA runtime to remain viable for workloads that do not consistently saturate the system.
\section{Future Work}
While the \CFA runtime achieves a better compromise between performance and fairness than other schedulers, I believe further improvements could be made to reduce the number of cases where performance deteriorates.
Furthermore, I believe that achieving performance and starvation freedom simultaneously is a general challenge, even outside of scheduling algorithms.
\subsection{Idle Sleep}
A difficult challenge that was not fully addressed in this thesis is idle sleep.
While a correct and somewhat low-cost idle-sleep mechanism was presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system.
The idle-sleep mechanism could therefore benefit from a reduction in spurious sleeping.
Furthermore, this thesis did not present any heuristic for when \procs should be put to sleep and when they should be woken up.
It is especially worth noting that relaxed timestamps and topology-aware helping lead to notable improvements in performance, yet neither of these techniques is used by the idle-sleep mechanism.

There are opportunities where these techniques could be used.
The mechanism uses a hand-shake between notification and sleep to ensure that no \at is missed.
The correctness of that hand-shake is critical when the last \proc goes to sleep, but could be relaxed when several \procs are awake.
Furthermore, organizing the sleeping \procs as a LIFO stack makes sense to keep cold \procs as cold as possible, but it might be more appropriate to try to keep entire CPU sockets cold instead.
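
To make the hand-shake concrete, the following is a minimal sketch of the idea in plain C with pthread primitives; it is not the \CFA implementation, and the names (\texttt{proc\_t}, \texttt{proc\_sleep}, \texttt{proc\_notify}, \texttt{work\_available}) are invented for illustration.
The key point is that a \proc publishes its intent to sleep \emph{before} the final check for work, while a notifier clears that flag \emph{after} enqueuing, so a wake-up cannot be missed.
\begin{lstlisting}
// Hypothetical sketch, not the actual CFA runtime code.
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
	atomic_bool sleeping;       // intent to sleep, published before the final check for work
	pthread_mutex_t lock;
	pthread_cond_t wake;
} proc_t;

// Called by a processor that found no ready thread.
static void proc_sleep( proc_t * proc, bool (*work_available)( void ) ) {
	atomic_store( &proc->sleeping, true );         // 1. announce intent to sleep
	if ( work_available() ) {                      // 2. re-check for work after announcing
		atomic_store( &proc->sleeping, false );    //    a notifier may have raced; stay awake
		return;
	}
	pthread_mutex_lock( &proc->lock );
	while ( atomic_load( &proc->sleeping ) )       // 3. wait until a notifier clears the flag
		pthread_cond_wait( &proc->wake, &proc->lock );
	pthread_mutex_unlock( &proc->lock );
}

// Called after enqueuing a new thread.
static void proc_notify( proc_t * proc ) {
	if ( atomic_exchange( &proc->sleeping, false ) ) {  // wake only a proc that announced sleep
		pthread_mutex_lock( &proc->lock );
		pthread_cond_signal( &proc->wake );
		pthread_mutex_unlock( &proc->lock );
	}
}
\end{lstlisting}
In this form, every notification pays for the atomic exchange and possibly the lock; the relaxation mentioned above amounts to skipping or weakening these steps whenever enough \procs are known to be awake.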

However, using relaxed timestamps and topology-aware helping in the idle-sleep path would require significant investigation.
For example, keeping a CPU socket cold might be appropriate for power-consumption reasons, but it can affect overall memory bandwidth.
The balance between these concerns is not necessarily obvious.
\subsection{Hardware}
One challenge that needed to be overcome for this thesis was that modern x86-64 processors offer very few tools for implementing fairness.
\Glspl{proc} attempting to help each other inherently cause cache-coherence traffic.
However, as mentioned in Section~\ref{helping}, relaxed requirements mean this traffic is not necessarily productive.
In cases like this one, there is an opportunity to improve performance by extending the hardware.

Many different extensions would be suitable here.
For example, when reading remote timestamps to decide whether or not to help, it could be useful to be able to cancel the remote read if it would incur significant latency.
If the latency is due to a recent cache invalidation, it is unlikely that the timestamp is old and that helping is needed.
As such, simply moving on without the result is likely to be acceptable.
Another option would be to attempt to read multiple memory addresses and only wait for \emph{one of} these reads to retire.
This would have a similar effect, where cache lines with more traffic would be waited on less often.
In both of these examples, some care would be needed to make sure that reads to a given address \emph{sometimes} retire.
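
To illustrate where such an extension would fit, the following hypothetical sketch shows the kind of check involved: reading a timestamp belonging to another \proc's queue and helping only if it looks stale.
The names (\texttt{ready\_queue\_t}, \texttt{should\_help}, \texttt{max\_age}) are invented; the point is that the single remote load is the operation a cancellable or ``first-of-many'' read could cut short.
\begin{lstlisting}
// Hypothetical sketch of the helping check, not the actual CFA runtime code.
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
	_Atomic(uint64_t) last_dequeue;  // timestamp of the last successful dequeue
} ready_queue_t;

// Help the remote queue only if its timestamp looks stale.
// The remote load below is the access a cancellable load could abandon: if the
// cache line is hot (recently invalidated), the timestamp is probably fresh,
// so giving up on the read and not helping is an acceptable outcome.
static bool should_help( ready_queue_t * remote, uint64_t now, uint64_t max_age ) {
	uint64_t ts = atomic_load_explicit( &remote->last_dequeue, memory_order_relaxed );
	return (now - ts) > max_age;
}
\end{lstlisting}
With a load that can give up, \texttt{should\_help} would simply return false when the read is cancelled, trading an occasional missed helping opportunity for lower latency on the common path.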

Note that cancelling reads in this fashion is similar to \newterm{Hardware Transactional Memory}~\cite{HTM}, which allows groups of instructions to be aborted and rolled back if they encounter memory conflicts when being retired.
However, I believe that feature is generally aimed at large groups of instructions.
A more fine-grained approach may be more amenable to carefully picking which aspects of an algorithm require exact correctness and which do not.
---|