% ======================================================================
% ======================================================================
\chapter{Performance Results} \label{results}
% ======================================================================
% ======================================================================
\section{Machine Setup}
Table~\ref{tab:machine} shows the characteristics of the machine used to run the benchmarks. All tests were run on this machine.
\begin{table}[H]
\begin{center}
\begin{tabular}{| l | r | l | r |}
\hline
Architecture & x86\_64 & NUMA node(s) & 8 \\
\hline
CPU op-mode(s) & 32-bit, 64-bit & Model name & AMD Opteron\texttrademark\ Processor 6380 \\
\hline
Byte Order & Little Endian & CPU Freq & \SI{2.5}{\giga\hertz} \\
\hline
CPU(s) & 64 & L1d cache & \SI{16}{\kibi\byte} \\
\hline
Thread(s) per core & 2 & L1i cache & \SI{64}{\kibi\byte} \\
\hline
Core(s) per socket & 8 & L2 cache & \SI{2048}{\kibi\byte} \\
\hline
Socket(s) & 4 & L3 cache & \SI{6144}{\kibi\byte} \\
\hline
\hline
Operating system & Ubuntu 16.04.3 LTS & Kernel & Linux 4.4-97-generic \\
\hline
Compiler & GCC 6.3 & Translator & CFA 1 \\
\hline
Java version & OpenJDK-9 & Go version & 1.9.2 \\
\hline
\end{tabular}
\end{center}
\caption{Machine setup used for the tests}
\label{tab:machine}
\end{table}

\section{Micro Benchmarks}
All benchmarks are run with the same harness, which appears as the \code{BENCH()} macro in the following examples. The macro uses the following logic to time the code:
\begin{pseudo}
#define BENCH(run, result) \
	before = gettime(); \
	run; \
	after = gettime(); \
	result = (after - before) / N;
\end{pseudo}
Here \code{N} is the number of iterations, so \code{result} is the average time per iteration. Time is read with \code{clock_gettime(CLOCK_THREAD_CPUTIME_ID);}. Each benchmark runs many iterations of a simple call to measure the cost of that call; the number of iterations depends on the benchmark.
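To make the harness logic concrete, the following is a minimal C sketch of how the \code{BENCH()} pseudocode could be realized with \code{clock_gettime(CLOCK_THREAD_CPUTIME_ID)}. It is an illustration under stated assumptions, not the harness used to produce the results: the helper \code{thread_cpu_time_ns()}, the placeholder operation \code{placeholder_call()}, the driver \code{measure_placeholder()}, the \code{sink} variable and the value of \code{N} are all hypothetical, and the \code{do}/\code{while(0)} wrapper with local timestamps is a stylistic choice made here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime and the clock IDs */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical iteration count; the text only says it varies per benchmark. */
enum { N = 1000000 };

/* Read the calling thread's CPU-time clock and convert it to nanoseconds. */
static uint64_t thread_cpu_time_ns(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    assert(rc == 0);   /* sketch only: no real error handling */
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Same four steps as the pseudocode: timestamp, run, timestamp, divide by N.
 * The result is the average cost of one iteration, in nanoseconds. */
#define BENCH(run, result)                              \
    do {                                                \
        uint64_t before_ = thread_cpu_time_ns();        \
        run;                                            \
        uint64_t after_ = thread_cpu_time_ns();         \
        (result) = (double)(after_ - before_) / N;      \
    } while (0)

/* Assumed stand-in for the operation under test. The volatile counter keeps
 * the compiler from deleting the loop body. */
static volatile uint64_t sink = 0;
static void placeholder_call(void) { sink++; }

/* Time N calls of the placeholder and return the average cost per call. */
static double measure_placeholder(void) {
    double ns_per_call = 0.0;
    BENCH(
        for (size_t i = 0; i < N; i++) { placeholder_call(); },
        ns_per_call
    );
    return ns_per_call;
}
```

Calling \code{measure_placeholder()} from a \code{main()} routine and printing the returned value gives the average cost of one iteration in nanoseconds. Because the clock counts CPU time of the calling thread only, time the thread spends descheduled is not counted.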
\subsection{Context-Switching}
The first benchmark measures how long a context-switch takes. The simplest way to do this is to yield on a thread, which executes a 2-step context-switch. Yielding causes the thread to context-switch to the scheduler and back; more precisely, from the \gls{uthread} to the \gls{kthread}, then from the \gls{kthread} back to the same \gls{uthread} (or a different one in the general case). To make the comparison fair, coroutines also execute a 2-step context-switch by resuming another coroutine that does nothing but suspend in a tight loop, which is a resume/suspend cycle instead of a yield. Listing~\ref{lst:ctx-switch} shows the code for coroutines and threads, with the results in Table~\ref{tab:ctx-switch}. All omitted tests are functionally identical to one of these tests. The difference between coroutines and threads can be attributed to the cost of scheduling.
\begin{figure}
\begin{multicols}{2}
\CFA Coroutines
\begin{cfacode}
coroutine GreatSuspender {};
void main(GreatSuspender& this) {
	while(true) { suspend(); }
}
int main() {
	GreatSuspender s;
	resume(s);
	BENCH(
		for(size_t i=0; i