| 1 | % ======================================================================
|
|---|
| 2 | % ======================================================================
|
|---|
| 3 | \chapter{Performance results} \label{results}
|
|---|
| 4 | % ======================================================================
|
|---|
| 5 | % ======================================================================
|
|---|
| 6 | \section{Machine setup}
|
|---|
| 7 | Table \ref{tab:machine} shows the characteristics of the machine used to run the benchmarks. All tests where made on this machine.
|
|---|
| 8 | \begin{figure}[H]
|
|---|
| 9 | \begin{center}
|
|---|
| 10 | \begin{tabular}{| l | r | l | r |}
|
|---|
| 11 | \hline
|
|---|
| 12 | Architecture & x86\_64 & NUMA node(s) & 8 \\
|
|---|
| 13 | \hline
|
|---|
| 14 | CPU op-mode(s) & 32-bit, 64-bit & Model name & AMD Opteron\texttrademark Processor 6380 \\
|
|---|
| 15 | \hline
|
|---|
| 16 | Byte Order & Little Endian & CPU Freq & 2.5\si{\giga\hertz} \\
|
|---|
| 17 | \hline
|
|---|
| 18 | CPU(s) & 64 & L1d cache & \SI{16}{\kibi\byte} \\
|
|---|
| 19 | \hline
|
|---|
| 20 | Thread(s) per core & 2 & L1i cache & \SI{64}{\kibi\byte} \\
|
|---|
| 21 | \hline
|
|---|
| 22 | Core(s) per socket & 8 & L2 cache & \SI{2048}{\kibi\byte} \\
|
|---|
| 23 | \hline
|
|---|
| 24 | Socket(s) & 4 & L3 cache & \SI{6144}{\kibi\byte} \\
|
|---|
| 25 | \hline
|
|---|
| 26 | \hline
|
|---|
| 27 | Operating system & Ubuntu 16.04.3 LTS & Kernel & Linux 4.4.0-97-generic \\
|
|---|
| 28 | \hline
|
|---|
| 29 | Compiler & GCC 6.3.0 & Translator & CFA 1.0.0 \\
|
|---|
| 30 | \hline
|
|---|
| 31 | \end{tabular}
|
|---|
| 32 | \end{center}
|
|---|
| 33 | \caption{Machine setup used for the tests}
|
|---|
| 34 | \label{tab:machine}
|
|---|
| 35 | \end{figure}
|
|---|
| 36 |
|
|---|
| 37 | \section{Micro benchmarks}
|
|---|
| 38 | All benchmarks are run using the same harness to produce the results, seen as the \code{BENCH()} macro in the following examples. This macro uses the following logic to benchmark the code :
|
|---|
| 39 | \begin{pseudo}
|
|---|
| 40 | #define BENCH(run, result)
|
|---|
| 41 | gettime();
|
|---|
| 42 | run;
|
|---|
| 43 | gettime();
|
|---|
| 44 | result = (after - before) / N;
|
|---|
| 45 | \end{pseudo}
|
|---|
| 46 | The method used to get time is \code{clock_gettime(CLOCK_THREAD_CPUTIME_ID);}. Each benchmark is using many iterations of a simple call to measure the cost of the call. The specific number of iteration depends on the specific benchmark.
|
|---|
| 47 |
|
|---|
| 48 | \subsection{Context-switching}
|
|---|
| 49 | The first interesting benchmark is to measure how long context-switches take. The simplest approach to do this is to yield on a thread, which executes a 2-step context switch. In order to make the comparison fair, coroutines also execute a 2-step context-switch, which is a resume/suspend cycle instead of a yield. Listing \ref{lst:ctx-switch} shows the code for coroutines and threads. All omitted tests are functionally identical to one of these tests. The results can be shown in table \ref{tab:ctx-switch}.
|
|---|
| 50 | \begin{figure}
|
|---|
| 51 | \begin{multicols}{2}
|
|---|
| 52 | \CFA Coroutines
|
|---|
| 53 | \begin{cfacode}
|
|---|
| 54 | coroutine GreatSuspender {};
|
|---|
| 55 | void main(GreatSuspender& this) {
|
|---|
| 56 | while(true) { suspend(); }
|
|---|
| 57 | }
|
|---|
| 58 | int main() {
|
|---|
| 59 | GreatSuspender s;
|
|---|
| 60 | resume(s);
|
|---|
| 61 | BENCH(
|
|---|
| 62 | for(size_t i=0; i<n; i++) {
|
|---|
| 63 | resume(s);
|
|---|
| 64 | },
|
|---|
| 65 | result
|
|---|
| 66 | )
|
|---|
| 67 | printf("%llu\n", result);
|
|---|
| 68 | }
|
|---|
| 69 | \end{cfacode}
|
|---|
| 70 | \columnbreak
|
|---|
| 71 | \CFA Threads
|
|---|
| 72 | \begin{cfacode}
|
|---|
| 73 |
|
|---|
| 74 |
|
|---|
| 75 |
|
|---|
| 76 |
|
|---|
| 77 | int main() {
|
|---|
| 78 |
|
|---|
| 79 |
|
|---|
| 80 | BENCH(
|
|---|
| 81 | for(size_t i=0; i<n; i++) {
|
|---|
| 82 | yield();
|
|---|
| 83 | },
|
|---|
| 84 | result
|
|---|
| 85 | )
|
|---|
| 86 | printf("%llu\n", result);
|
|---|
| 87 | }
|
|---|
| 88 | \end{cfacode}
|
|---|
| 89 | \end{multicols}
|
|---|
| 90 | \caption{\CFA benchmark code used to measure context-switches for coroutines and threads.}
|
|---|
| 91 | \label{lst:ctx-switch}
|
|---|
| 92 | \end{figure}
|
|---|
| 93 |
|
|---|
| 94 | \begin{figure}
|
|---|
| 95 | \begin{center}
|
|---|
| 96 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |}
|
|---|
| 97 | \cline{2-4}
|
|---|
| 98 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\
|
|---|
| 99 | \hline
|
|---|
| 100 | Kernel Threads & 239 & 242.57 & 5.54 \\
|
|---|
| 101 | \CFA Coroutines & 38 & 38 & 0 \\
|
|---|
| 102 | \CFA Threads & 102 & 102.39 & 1.57 \\
|
|---|
| 103 | \uC Coroutines & 46 & 46.68 & 0.47 \\
|
|---|
| 104 | \uC Threads & 98 & 99.39 & 1.52 \\
|
|---|
| 105 | \hline
|
|---|
| 106 | \end{tabular}
|
|---|
| 107 | \end{center}
|
|---|
| 108 | \caption{Context Switch comparison. All numbers are in nanoseconds(\si{\nano\second})}
|
|---|
| 109 | \label{tab:ctx-switch}
|
|---|
| 110 | \end{figure}
|
|---|
| 111 |
|
|---|
| 112 | \subsection{Mutual-exclusion}
|
|---|
| 113 | The next interesting benchmark is to measure the overhead to enter/leave a critical-section. For monitors, the simplest approach is to measure how long it takes enter and leave a monitor routine. Listing \ref{lst:mutex} shows the code for \CFA. To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a pthread mutex lock are also measured. The results can be shown in table \ref{tab:mutex}.
|
|---|
| 114 |
|
|---|
| 115 | \begin{figure}
|
|---|
| 116 | \begin{cfacode}
|
|---|
| 117 | monitor M {};
|
|---|
| 118 | void __attribute__((noinline)) call( M & mutex m /*, m2, m3, m4*/ ) {}
|
|---|
| 119 |
|
|---|
| 120 | int main() {
|
|---|
| 121 | M m/*, m2, m3, m4*/;
|
|---|
| 122 | BENCH(
|
|---|
| 123 | for(size_t i=0; i<n; i++) {
|
|---|
| 124 | call(m/*, m2, m3, m4*/);
|
|---|
| 125 | },
|
|---|
| 126 | result
|
|---|
| 127 | )
|
|---|
| 128 | printf("%llu\n", result);
|
|---|
| 129 | }
|
|---|
| 130 | \end{cfacode}
|
|---|
| 131 | \caption{\CFA benchmark code used to measure mutex routines.}
|
|---|
| 132 | \label{lst:mutex}
|
|---|
| 133 | \end{figure}
|
|---|
| 134 |
|
|---|
| 135 | \begin{figure}
|
|---|
| 136 | \begin{center}
|
|---|
| 137 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |}
|
|---|
| 138 | \cline{2-4}
|
|---|
| 139 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\
|
|---|
| 140 | \hline
|
|---|
| 141 | C routine & 2 & 2 & 0 \\
|
|---|
| 142 | Pthreads Mutex Lock & 31 & 31.86 & 0.99 \\
|
|---|
| 143 | \uC \code{monitor} member routine & 30 & 30 & 0 \\
|
|---|
| 144 | \CFA \code{mutex} routine, 1 argument & 46 & 46.14 & 0.74 \\
|
|---|
| 145 | \CFA \code{mutex} routine, 2 argument & 82 & 83 & 1.93 \\
|
|---|
| 146 | \CFA \code{mutex} routine, 4 argument & 165 & 161.15 & 54.04 \\
|
|---|
| 147 | \hline
|
|---|
| 148 | \end{tabular}
|
|---|
| 149 | \end{center}
|
|---|
| 150 | \caption{Mutex routine comparison. All numbers are in nanoseconds(\si{\nano\second})}
|
|---|
| 151 | \label{tab:mutex}
|
|---|
| 152 | \end{figure}
|
|---|
| 153 |
|
|---|
| 154 | \subsection{Internal scheduling}
|
|---|
| 155 | The Internal scheduling benchmark measures the cost of waiting on and signalling a condition variable. Listing \ref{lst:int-sched} shows the code for \CFA. The results can be shown in table \ref{tab:int-sched}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests.
|
|---|
| 156 |
|
|---|
| 157 | \begin{figure}
|
|---|
| 158 | \begin{cfacode}
|
|---|
| 159 | volatile int go = 0;
|
|---|
| 160 | condition c;
|
|---|
| 161 | monitor M {};
|
|---|
| 162 | M m1;
|
|---|
| 163 |
|
|---|
| 164 | void __attribute__((noinline)) do_call( M & mutex a1 ) { signal(c); }
|
|---|
| 165 |
|
|---|
| 166 | thread T {};
|
|---|
| 167 | void ^?{}( T & mutex this ) {}
|
|---|
| 168 | void main( T & this ) {
|
|---|
| 169 | while(go == 0) { yield(); }
|
|---|
| 170 | while(go == 1) { do_call(m1); }
|
|---|
| 171 | }
|
|---|
| 172 | int __attribute__((noinline)) do_wait( M & mutex a1 ) {
|
|---|
| 173 | go = 1;
|
|---|
| 174 | BENCH(
|
|---|
| 175 | for(size_t i=0; i<n; i++) {
|
|---|
| 176 | wait(c);
|
|---|
| 177 | },
|
|---|
| 178 | result
|
|---|
| 179 | )
|
|---|
| 180 | printf("%llu\n", result);
|
|---|
| 181 | go = 0;
|
|---|
| 182 | return 0;
|
|---|
| 183 | }
|
|---|
| 184 | int main() {
|
|---|
| 185 | T t;
|
|---|
| 186 | return do_wait(m1);
|
|---|
| 187 | }
|
|---|
| 188 | \end{cfacode}
|
|---|
| 189 | \caption{Benchmark code for internal scheduling}
|
|---|
| 190 | \label{lst:int-sched}
|
|---|
| 191 | \end{figure}
|
|---|
| 192 |
|
|---|
| 193 | \begin{figure}
|
|---|
| 194 | \begin{center}
|
|---|
| 195 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |}
|
|---|
| 196 | \cline{2-4}
|
|---|
| 197 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\
|
|---|
| 198 | \hline
|
|---|
| 199 | \uC \code{signal} & 322 & 322.57 & 2.77 \\
|
|---|
| 200 | \CFA \code{signal}, 1 \code{monitor} & 1145 & 1163.64 & 27.52 \\
|
|---|
| 201 | \CFA \code{signal}, 2 \code{monitor} & 1531 & 1550.75 & 32.77 \\
|
|---|
| 202 | \CFA \code{signal}, 4 \code{monitor} & 2288.5 & 2326.86 & 54.73 \\
|
|---|
| 203 | \hline
|
|---|
| 204 | \end{tabular}
|
|---|
| 205 | \end{center}
|
|---|
| 206 | \caption{Internal scheduling comparison. All numbers are in nanoseconds(\si{\nano\second})}
|
|---|
| 207 | \label{tab:int-sched}
|
|---|
| 208 | \end{figure}
|
|---|
| 209 |
|
|---|
| 210 | \subsection{External scheduling}
|
|---|
| 211 | The Internal scheduling benchmark measures the cost of the \code{waitfor} statement (\code{_Accept} in \uC). Listing \ref{lst:ext-sched} shows the code for \CFA. The results can be shown in table \ref{tab:ext-sched}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests.
|
|---|
| 212 |
|
|---|
| 213 | \begin{figure}
|
|---|
| 214 | \begin{cfacode}
|
|---|
| 215 | volatile int go = 0;
|
|---|
| 216 | monitor M {};
|
|---|
| 217 | M m1;
|
|---|
| 218 | thread T {};
|
|---|
| 219 |
|
|---|
| 220 | void __attribute__((noinline)) do_call( M & mutex a1 ) {}
|
|---|
| 221 |
|
|---|
| 222 | void ^?{}( T & mutex this ) {}
|
|---|
| 223 | void main( T & this ) {
|
|---|
| 224 | while(go == 0) { yield(); }
|
|---|
| 225 | while(go == 1) { do_call(m1); }
|
|---|
| 226 | }
|
|---|
| 227 | int __attribute__((noinline)) do_wait( M & mutex a1 ) {
|
|---|
| 228 | go = 1;
|
|---|
| 229 | BENCH(
|
|---|
| 230 | for(size_t i=0; i<n; i++) {
|
|---|
| 231 | waitfor(call, a1);
|
|---|
| 232 | },
|
|---|
| 233 | result
|
|---|
| 234 | )
|
|---|
| 235 | printf("%llu\n", result);
|
|---|
| 236 | go = 0;
|
|---|
| 237 | return 0;
|
|---|
| 238 | }
|
|---|
| 239 | int main() {
|
|---|
| 240 | T t;
|
|---|
| 241 | return do_wait(m1);
|
|---|
| 242 | }
|
|---|
| 243 | \end{cfacode}
|
|---|
| 244 | \caption{Benchmark code for external scheduling}
|
|---|
| 245 | \label{lst:ext-sched}
|
|---|
| 246 | \end{figure}
|
|---|
| 247 |
|
|---|
| 248 | \begin{figure}
|
|---|
| 249 | \begin{center}
|
|---|
| 250 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |}
|
|---|
| 251 | \cline{2-4}
|
|---|
| 252 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\
|
|---|
| 253 | \hline
|
|---|
| 254 | \uC \code{Accept} & 349 & 339.32 & 3.14 \\
|
|---|
| 255 | \CFA \code{waitfor}, 1 \code{monitor} & 1155.5 & 1142.04 & 25.23 \\
|
|---|
| 256 | \CFA \code{waitfor}, 2 \code{monitor} & 1361 & 1376.75 & 28.81 \\
|
|---|
| 257 | \CFA \code{waitfor}, 4 \code{monitor} & 1941.5 & 1957.07 & 34.7 \\
|
|---|
| 258 | \hline
|
|---|
| 259 | \end{tabular}
|
|---|
| 260 | \end{center}
|
|---|
| 261 | \caption{External scheduling comparison. All numbers are in nanoseconds(\si{\nano\second})}
|
|---|
| 262 | \label{tab:ext-sched}
|
|---|
| 263 | \end{figure}
|
|---|
| 264 |
|
|---|
| 265 | \subsection{Object creation}
|
|---|
| 266 | Finally, the last benchmark measured is the cost of creation for concurrent objects. Listing \ref{lst:creation} shows the code for pthreads and \CFA threads. The results can be shown in table \ref{tab:creation}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests. The only note here is that the call-stacks of \CFA coroutines are lazily created, therefore without priming the coroutine, the creation cost is very low.
|
|---|
| 267 |
|
|---|
| 268 | \begin{figure}
|
|---|
| 269 | \begin{multicols}{2}
|
|---|
| 270 | pthread
|
|---|
| 271 | \begin{cfacode}
|
|---|
| 272 | int main() {
|
|---|
| 273 | BENCH(
|
|---|
| 274 | for(size_t i=0; i<n; i++) {
|
|---|
| 275 | pthread_t thread;
|
|---|
| 276 | if(pthread_create(
|
|---|
| 277 | &thread,
|
|---|
| 278 | NULL,
|
|---|
| 279 | foo,
|
|---|
| 280 | NULL
|
|---|
| 281 | ) < 0) {
|
|---|
| 282 | perror( "failure" );
|
|---|
| 283 | return 1;
|
|---|
| 284 | }
|
|---|
| 285 |
|
|---|
| 286 | if(pthread_join(
|
|---|
| 287 | thread,
|
|---|
| 288 | NULL
|
|---|
| 289 | ) < 0) {
|
|---|
| 290 | perror( "failure" );
|
|---|
| 291 | return 1;
|
|---|
| 292 | }
|
|---|
| 293 | },
|
|---|
| 294 | result
|
|---|
| 295 | )
|
|---|
| 296 | printf("%llu\n", result);
|
|---|
| 297 | }
|
|---|
| 298 | \end{cfacode}
|
|---|
| 299 | \columnbreak
|
|---|
| 300 | \CFA Threads
|
|---|
| 301 | \begin{cfacode}
|
|---|
| 302 | int main() {
|
|---|
| 303 | BENCH(
|
|---|
| 304 | for(size_t i=0; i<n; i++) {
|
|---|
| 305 | MyThread m;
|
|---|
| 306 | },
|
|---|
| 307 | result
|
|---|
| 308 | )
|
|---|
| 309 |
|
|---|
| 310 | printf("%llu\n", result);
|
|---|
| 311 | }
|
|---|
| 312 | \end{cfacode}
|
|---|
| 313 | \end{multicols}
|
|---|
| 314 | \caption{Benchmark code for pthreads and \CFA to measure object creation}
|
|---|
| 315 | \label{lst:creation}
|
|---|
| 316 | \end{figure}
|
|---|
| 317 |
|
|---|
| 318 | \begin{figure}
|
|---|
| 319 | \begin{center}
|
|---|
| 320 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |}
|
|---|
| 321 | \cline{2-4}
|
|---|
| 322 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\
|
|---|
| 323 | \hline
|
|---|
| 324 | Pthreads & 26974.5 & 26977 & 124.12 \\
|
|---|
| 325 | \CFA Coroutines Lazy & 5 & 5 & 0 \\
|
|---|
| 326 | \CFA Coroutines Eager & 335.0 & 357.67 & 34.2 \\
|
|---|
| 327 | \CFA Threads & 1122.5 & 1109.86 & 36.54 \\
|
|---|
| 328 | \uC Coroutines & 106 & 107.04 & 1.61 \\
|
|---|
| 329 | \uC Threads & 525.5 & 533.04 & 11.14 \\
|
|---|
| 330 | \hline
|
|---|
| 331 | \end{tabular}
|
|---|
| 332 | \end{center}
|
|---|
| 333 | \caption{Creation comparison. All numbers are in nanoseconds(\si{\nano\second})}
|
|---|
| 334 | \label{tab:creation}
|
|---|
| 335 | \end{figure}
|
|---|