% ======================================================================
% ======================================================================
\chapter{Performance results} \label{results}
% ======================================================================
% ======================================================================
\section{Machine setup}
Table~\ref{tab:machine} shows the characteristics of the machine used to run the benchmarks. All tests were run on this machine.
\begin{table}[H]
\begin{center}
\begin{tabular}{| l | r | l | r |}
\hline
Architecture		& x86\_64			& NUMA node(s) 	& 8 \\
\hline
CPU op-mode(s)		& 32-bit, 64-bit		& Model name 	& AMD Opteron\texttrademark{} Processor 6380 \\
\hline
Byte Order		& Little Endian			& CPU Freq 	& \SI{2.5}{\giga\hertz} \\
\hline
CPU(s)			& 64				& L1d cache 	& \SI{16}{\kibi\byte} \\
\hline
Thread(s) per core	& 2				& L1i cache 	& \SI{64}{\kibi\byte} \\
\hline
Core(s) per socket	& 8				& L2 cache 	& \SI{2048}{\kibi\byte} \\
\hline
Socket(s)		& 4				& L3 cache 	& \SI{6144}{\kibi\byte} \\
\hline
\hline
Operating system	& Ubuntu 16.04.3 LTS		& Kernel	& Linux 4.4.0-97-generic \\
\hline
Compiler		& GCC 6.3.0			& Translator	& CFA 1.0.0 \\
\hline
Java version		& OpenJDK-9			& Go version	& 1.9.2 \\
\hline
\end{tabular}
\end{center}
\caption{Machine setup used for the tests}
\label{tab:machine}
\end{table}

\section{Micro benchmarks}
All benchmarks are run using the same harness to produce the results, seen as the \code{BENCH()} macro in the following examples. This macro uses the following logic to benchmark the code:
\begin{pseudo}
#define BENCH(run, result)
	before = gettime();
	run;
	after  = gettime();
	result = (after - before) / N;
\end{pseudo}
Time is read with \code{clock_gettime(CLOCK_THREAD_CPUTIME_ID);}. Each benchmark runs many iterations of a simple call and divides the elapsed time by the iteration count to obtain the cost of a single call. The number of iterations depends on the specific benchmark.
\subsection{Context-switching}
The first interesting benchmark measures how long context-switches take. The simplest approach is to yield on a thread, which executes a 2-step context-switch. To make the comparison fair, coroutines also execute a 2-step context-switch (\gls{uthread} to \gls{kthread} then \gls{kthread} to \gls{uthread}), which is a resume/suspend cycle instead of a yield. Listing \ref{lst:ctx-switch} shows the code for coroutines and threads, with the results in Table~\ref{tab:ctx-switch}. All omitted tests are functionally identical to one of these tests.
\begin{figure}
\begin{multicols}{2}
\CFA Coroutines
\begin{cfacode}
coroutine GreatSuspender {};
void main(GreatSuspender& this) {
	while(true) { suspend(); }
}
int main() {
	GreatSuspender s;
	resume(s);
	BENCH(
		for(size_t i=0; i