[64b272a] | 1 | % ====================================================================== |
---|
| 2 | % ====================================================================== |
---|
[20ffcf3] | 3 | \chapter{Performance results} \label{results} |
---|
[64b272a] | 4 | % ====================================================================== |
---|
| 5 | % ====================================================================== |
---|
| 6 | \section{Machine setup} |
---|
[07c1e595] | 7 | Table \ref{tab:machine} shows the characteristics of the machine used to run the benchmarks. All tests where made on this machine. |
---|
[cf966b5] | 8 | \begin{table}[H] |
---|
[64b272a] | 9 | \begin{center} |
---|
| 10 | \begin{tabular}{| l | r | l | r |} |
---|
| 11 | \hline |
---|
| 12 | Architecture & x86\_64 & NUMA node(s) & 8 \\ |
---|
| 13 | \hline |
---|
| 14 | CPU op-mode(s) & 32-bit, 64-bit & Model name & AMD Opteron\texttrademark Processor 6380 \\ |
---|
| 15 | \hline |
---|
| 16 | Byte Order & Little Endian & CPU Freq & 2.5\si{\giga\hertz} \\ |
---|
| 17 | \hline |
---|
| 18 | CPU(s) & 64 & L1d cache & \SI{16}{\kibi\byte} \\ |
---|
| 19 | \hline |
---|
| 20 | Thread(s) per core & 2 & L1i cache & \SI{64}{\kibi\byte} \\ |
---|
| 21 | \hline |
---|
| 22 | Core(s) per socket & 8 & L2 cache & \SI{2048}{\kibi\byte} \\ |
---|
| 23 | \hline |
---|
| 24 | Socket(s) & 4 & L3 cache & \SI{6144}{\kibi\byte} \\ |
---|
| 25 | \hline |
---|
| 26 | \hline |
---|
[383e159] | 27 | Operating system & Ubuntu 16.04.3 LTS & Kernel & Linux 4.4-97-generic \\ |
---|
[64b272a] | 28 | \hline |
---|
[383e159] | 29 | Compiler & GCC 6.3 & Translator & CFA 1 \\ |
---|
[64b272a] | 30 | \hline |
---|
[cf966b5] | 31 | Java version & OpenJDK-9 & Go version & 1.9.2 \\ |
---|
| 32 | \hline |
---|
[64b272a] | 33 | \end{tabular} |
---|
| 34 | \end{center} |
---|
| 35 | \caption{Machine setup used for the tests} |
---|
| 36 | \label{tab:machine} |
---|
[cf966b5] | 37 | \end{table} |
---|
[64b272a] | 38 | |
---|
| 39 | \section{Micro benchmarks} |
---|
[20ffcf3] | 40 | All benchmarks are run using the same harness to produce the results, seen as the \code{BENCH()} macro in the following examples. This macro uses the following logic to benchmark the code : |
---|
| 41 | \begin{pseudo} |
---|
| 42 | #define BENCH(run, result) |
---|
[cf966b5] | 43 | before = gettime(); |
---|
[20ffcf3] | 44 | run; |
---|
[cf966b5] | 45 | after = gettime(); |
---|
[20ffcf3] | 46 | result = (after - before) / N; |
---|
| 47 | \end{pseudo} |
---|
[07c1e595] | 48 | The method used to get time is \code{clock_gettime(CLOCK_THREAD_CPUTIME_ID);}. Each benchmark is using many iterations of a simple call to measure the cost of the call. The specific number of iteration depends on the specific benchmark. |
---|
[20ffcf3] | 49 | |
---|
| 50 | \subsection{Context-switching} |
---|
[cf966b5] | 51 | The first interesting benchmark is to measure how long context-switches take. The simplest approach to do this is to yield on a thread, which executes a 2-step context switch. In order to make the comparison fair, coroutines also execute a 2-step context-switch (\gls{uthread} to \gls{kthread} then \gls{kthread} to \gls{uthread}), which is a resume/suspend cycle instead of a yield. Listing \ref{lst:ctx-switch} shows the code for coroutines and threads whith the results in table \ref{tab:ctx-switch}. All omitted tests are functionally identical to one of these tests. |
---|
[20ffcf3] | 52 | \begin{figure} |
---|
| 53 | \begin{multicols}{2} |
---|
| 54 | \CFA Coroutines |
---|
| 55 | \begin{cfacode} |
---|
| 56 | coroutine GreatSuspender {}; |
---|
| 57 | void main(GreatSuspender& this) { |
---|
| 58 | while(true) { suspend(); } |
---|
| 59 | } |
---|
| 60 | int main() { |
---|
| 61 | GreatSuspender s; |
---|
| 62 | resume(s); |
---|
| 63 | BENCH( |
---|
| 64 | for(size_t i=0; i<n; i++) { |
---|
| 65 | resume(s); |
---|
| 66 | }, |
---|
| 67 | result |
---|
| 68 | ) |
---|
| 69 | printf("%llu\n", result); |
---|
| 70 | } |
---|
| 71 | \end{cfacode} |
---|
| 72 | \columnbreak |
---|
| 73 | \CFA Threads |
---|
| 74 | \begin{cfacode} |
---|
| 75 | |
---|
| 76 | |
---|
| 77 | |
---|
| 78 | |
---|
| 79 | int main() { |
---|
| 80 | |
---|
| 81 | |
---|
| 82 | BENCH( |
---|
| 83 | for(size_t i=0; i<n; i++) { |
---|
| 84 | yield(); |
---|
| 85 | }, |
---|
| 86 | result |
---|
| 87 | ) |
---|
| 88 | printf("%llu\n", result); |
---|
| 89 | } |
---|
| 90 | \end{cfacode} |
---|
| 91 | \end{multicols} |
---|
[cf966b5] | 92 | \begin{cfacode}[caption={\CFA benchmark code used to measure context-switches for coroutines and threads.},label={lst:ctx-switch}] |
---|
| 93 | \end{cfacode} |
---|
[20ffcf3] | 94 | \end{figure} |
---|
[64b272a] | 95 | |
---|
[cf966b5] | 96 | \begin{table} |
---|
[64b272a] | 97 | \begin{center} |
---|
| 98 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |} |
---|
| 99 | \cline{2-4} |
---|
| 100 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\ |
---|
| 101 | \hline |
---|
[383e159] | 102 | Kernel Thread & 241.5 & 243.86 & 5.08 \\ |
---|
[cf966b5] | 103 | \CFA Coroutine & 38 & 38 & 0 \\ |
---|
[383e159] | 104 | \CFA Thread & 103 & 102.96 & 2.96 \\ |
---|
| 105 | \uC Coroutine & 46 & 45.86 & 0.35 \\ |
---|
| 106 | \uC Thread & 98 & 99.11 & 1.42 \\ |
---|
| 107 | Goroutine & 150 & 149.96 & 3.16 \\ |
---|
| 108 | Java Thread & 289 & 290.68 & 8.72 \\ |
---|
[64b272a] | 109 | \hline |
---|
| 110 | \end{tabular} |
---|
| 111 | \end{center} |
---|
[07c1e595] | 112 | \caption{Context Switch comparison. All numbers are in nanoseconds(\si{\nano\second})} |
---|
[64b272a] | 113 | \label{tab:ctx-switch} |
---|
[cf966b5] | 114 | \end{table} |
---|
[64b272a] | 115 | |
---|
[20ffcf3] | 116 | \subsection{Mutual-exclusion} |
---|
[cf966b5] | 117 | The next interesting benchmark is to measure the overhead to enter/leave a critical-section. For monitors, the simplest approach is to measure how long it takes to enter and leave a monitor routine. Listing \ref{lst:mutex} shows the code for \CFA. To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a pthread mutex lock are also measured. The results can be shown in table \ref{tab:mutex}. |
---|
[20ffcf3] | 118 | |
---|
| 119 | \begin{figure} |
---|
[cf966b5] | 120 | \begin{cfacode}[caption={\CFA benchmark code used to measure mutex routines.},label={lst:mutex}] |
---|
[20ffcf3] | 121 | monitor M {}; |
---|
| 122 | void __attribute__((noinline)) call( M & mutex m /*, m2, m3, m4*/ ) {} |
---|
| 123 | |
---|
| 124 | int main() { |
---|
| 125 | M m/*, m2, m3, m4*/; |
---|
| 126 | BENCH( |
---|
| 127 | for(size_t i=0; i<n; i++) { |
---|
| 128 | call(m/*, m2, m3, m4*/); |
---|
| 129 | }, |
---|
| 130 | result |
---|
| 131 | ) |
---|
| 132 | printf("%llu\n", result); |
---|
| 133 | } |
---|
| 134 | \end{cfacode} |
---|
| 135 | \end{figure} |
---|
| 136 | |
---|
[cf966b5] | 137 | \begin{table} |
---|
[64b272a] | 138 | \begin{center} |
---|
| 139 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |} |
---|
| 140 | \cline{2-4} |
---|
| 141 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\ |
---|
| 142 | \hline |
---|
[383e159] | 143 | C routine & 2 & 2 & 0 \\ |
---|
| 144 | FetchAdd + FetchSub & 26 & 26 & 0 \\ |
---|
| 145 | Pthreads Mutex Lock & 31 & 31.86 & 0.99 \\ |
---|
| 146 | \uC \code{monitor} member routine & 30 & 30 & 0 \\ |
---|
| 147 | \CFA \code{mutex} routine, 1 argument & 41 & 41.57 & 0.9 \\ |
---|
| 148 | \CFA \code{mutex} routine, 2 argument & 76 & 76.96 & 1.57 \\ |
---|
| 149 | \CFA \code{mutex} routine, 4 argument & 145 & 146.68 & 3.85 \\ |
---|
| 150 | Java synchronized routine & 27 & 28.57 & 2.6 \\ |
---|
[64b272a] | 151 | \hline |
---|
| 152 | \end{tabular} |
---|
| 153 | \end{center} |
---|
[07c1e595] | 154 | \caption{Mutex routine comparison. All numbers are in nanoseconds(\si{\nano\second})} |
---|
[64b272a] | 155 | \label{tab:mutex} |
---|
[cf966b5] | 156 | \end{table} |
---|
[64b272a] | 157 | |
---|
[20ffcf3] | 158 | \subsection{Internal scheduling} |
---|
[cf966b5] | 159 | The internal-scheduling benchmark measures the cost of waiting on and signalling a condition variable. Listing \ref{lst:int-sched} shows the code for \CFA, with results table \ref{tab:int-sched}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests. |
---|
[20ffcf3] | 160 | |
---|
| 161 | \begin{figure} |
---|
[cf966b5] | 162 | \begin{cfacode}[caption={Benchmark code for internal scheduling},label={lst:int-sched}] |
---|
[20ffcf3] | 163 | volatile int go = 0; |
---|
| 164 | condition c; |
---|
| 165 | monitor M {}; |
---|
| 166 | M m1; |
---|
| 167 | |
---|
| 168 | void __attribute__((noinline)) do_call( M & mutex a1 ) { signal(c); } |
---|
| 169 | |
---|
| 170 | thread T {}; |
---|
| 171 | void ^?{}( T & mutex this ) {} |
---|
| 172 | void main( T & this ) { |
---|
| 173 | while(go == 0) { yield(); } |
---|
| 174 | while(go == 1) { do_call(m1); } |
---|
| 175 | } |
---|
| 176 | int __attribute__((noinline)) do_wait( M & mutex a1 ) { |
---|
| 177 | go = 1; |
---|
| 178 | BENCH( |
---|
| 179 | for(size_t i=0; i<n; i++) { |
---|
| 180 | wait(c); |
---|
| 181 | }, |
---|
| 182 | result |
---|
| 183 | ) |
---|
| 184 | printf("%llu\n", result); |
---|
| 185 | go = 0; |
---|
| 186 | return 0; |
---|
| 187 | } |
---|
| 188 | int main() { |
---|
| 189 | T t; |
---|
| 190 | return do_wait(m1); |
---|
| 191 | } |
---|
| 192 | \end{cfacode} |
---|
| 193 | \end{figure} |
---|
| 194 | |
---|
[cf966b5] | 195 | \begin{table} |
---|
[64b272a] | 196 | \begin{center} |
---|
| 197 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |} |
---|
| 198 | \cline{2-4} |
---|
| 199 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\ |
---|
| 200 | \hline |
---|
[383e159] | 201 | \uC \code{signal} & 322 & 323 & 3.36 \\ |
---|
| 202 | \CFA \code{signal}, 1 \code{monitor} & 352.5 & 353.11 & 3.66 \\ |
---|
| 203 | \CFA \code{signal}, 2 \code{monitor} & 430 & 430.29 & 8.97 \\ |
---|
| 204 | \CFA \code{signal}, 4 \code{monitor} & 594.5 & 606.57 & 18.33 \\ |
---|
| 205 | Java \code{notify} & 13831.5 & 15698.21 & 4782.3 \\ |
---|
[64b272a] | 206 | \hline |
---|
| 207 | \end{tabular} |
---|
| 208 | \end{center} |
---|
[07c1e595] | 209 | \caption{Internal scheduling comparison. All numbers are in nanoseconds(\si{\nano\second})} |
---|
[64b272a] | 210 | \label{tab:int-sched} |
---|
[cf966b5] | 211 | \end{table} |
---|
[64b272a] | 212 | |
---|
[20ffcf3] | 213 | \subsection{External scheduling} |
---|
[cf966b5] | 214 | The Internal scheduling benchmark measures the cost of the \code{waitfor} statement (\code{_Accept} in \uC). Listing \ref{lst:ext-sched} shows the code for \CFA, with results in table \ref{tab:ext-sched}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests. |
---|
[20ffcf3] | 215 | |
---|
| 216 | \begin{figure} |
---|
[cf966b5] | 217 | \begin{cfacode}[caption={Benchmark code for external scheduling},label={lst:ext-sched}] |
---|
[20ffcf3] | 218 | volatile int go = 0; |
---|
| 219 | monitor M {}; |
---|
| 220 | M m1; |
---|
| 221 | thread T {}; |
---|
| 222 | |
---|
| 223 | void __attribute__((noinline)) do_call( M & mutex a1 ) {} |
---|
| 224 | |
---|
| 225 | void ^?{}( T & mutex this ) {} |
---|
| 226 | void main( T & this ) { |
---|
| 227 | while(go == 0) { yield(); } |
---|
| 228 | while(go == 1) { do_call(m1); } |
---|
| 229 | } |
---|
| 230 | int __attribute__((noinline)) do_wait( M & mutex a1 ) { |
---|
| 231 | go = 1; |
---|
| 232 | BENCH( |
---|
| 233 | for(size_t i=0; i<n; i++) { |
---|
| 234 | waitfor(call, a1); |
---|
| 235 | }, |
---|
| 236 | result |
---|
| 237 | ) |
---|
| 238 | printf("%llu\n", result); |
---|
| 239 | go = 0; |
---|
| 240 | return 0; |
---|
| 241 | } |
---|
| 242 | int main() { |
---|
| 243 | T t; |
---|
| 244 | return do_wait(m1); |
---|
| 245 | } |
---|
| 246 | \end{cfacode} |
---|
| 247 | \end{figure} |
---|
| 248 | |
---|
[cf966b5] | 249 | \begin{table} |
---|
[64b272a] | 250 | \begin{center} |
---|
| 251 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |} |
---|
| 252 | \cline{2-4} |
---|
| 253 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\ |
---|
| 254 | \hline |
---|
[383e159] | 255 | \uC \code{Accept} & 350 & 350.61 & 3.11 \\ |
---|
| 256 | \CFA \code{waitfor}, 1 \code{monitor} & 358.5 & 358.36 & 3.82 \\ |
---|
| 257 | \CFA \code{waitfor}, 2 \code{monitor} & 422 & 426.79 & 7.95 \\ |
---|
| 258 | \CFA \code{waitfor}, 4 \code{monitor} & 579.5 & 585.46 & 11.25 \\ |
---|
[64b272a] | 259 | \hline |
---|
| 260 | \end{tabular} |
---|
| 261 | \end{center} |
---|
[07c1e595] | 262 | \caption{External scheduling comparison. All numbers are in nanoseconds(\si{\nano\second})} |
---|
[64b272a] | 263 | \label{tab:ext-sched} |
---|
[cf966b5] | 264 | \end{table} |
---|
[64b272a] | 265 | |
---|
[20ffcf3] | 266 | \subsection{Object creation} |
---|
[cf966b5] | 267 | Finally, the last benchmark measurs the cost of creation for concurrent objects. Listing \ref{lst:creation} shows the code for pthreads and \CFA threads, with results shown in table \ref{tab:creation}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests. The only note here is that the call-stacks of \CFA coroutines are lazily created, therefore without priming the coroutine, the creation cost is very low. |
---|
[20ffcf3] | 268 | |
---|
| 269 | \begin{figure} |
---|
[cf966b5] | 270 | \begin{center} |
---|
[20ffcf3] | 271 | pthread |
---|
[cf966b5] | 272 | \begin{ccode} |
---|
[20ffcf3] | 273 | int main() { |
---|
| 274 | BENCH( |
---|
| 275 | for(size_t i=0; i<n; i++) { |
---|
| 276 | pthread_t thread; |
---|
[cf966b5] | 277 | if(pthread_create(&thread,NULL,foo,NULL)<0) { |
---|
[20ffcf3] | 278 | perror( "failure" ); |
---|
| 279 | return 1; |
---|
| 280 | } |
---|
| 281 | |
---|
[cf966b5] | 282 | if(pthread_join(thread, NULL)<0) { |
---|
[20ffcf3] | 283 | perror( "failure" ); |
---|
| 284 | return 1; |
---|
| 285 | } |
---|
| 286 | }, |
---|
| 287 | result |
---|
| 288 | ) |
---|
| 289 | printf("%llu\n", result); |
---|
| 290 | } |
---|
[cf966b5] | 291 | \end{ccode} |
---|
| 292 | |
---|
| 293 | |
---|
| 294 | |
---|
[20ffcf3] | 295 | \CFA Threads |
---|
| 296 | \begin{cfacode} |
---|
| 297 | int main() { |
---|
| 298 | BENCH( |
---|
| 299 | for(size_t i=0; i<n; i++) { |
---|
| 300 | MyThread m; |
---|
| 301 | }, |
---|
| 302 | result |
---|
| 303 | ) |
---|
| 304 | printf("%llu\n", result); |
---|
| 305 | } |
---|
| 306 | \end{cfacode} |
---|
[cf966b5] | 307 | \end{center} |
---|
| 308 | \begin{cfacode}[caption={Benchmark code for pthreads and \CFA to measure object creation},label={lst:creation}] |
---|
| 309 | \end{cfacode} |
---|
[20ffcf3] | 310 | \end{figure} |
---|
| 311 | |
---|
[cf966b5] | 312 | \begin{table} |
---|
[64b272a] | 313 | \begin{center} |
---|
| 314 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |} |
---|
| 315 | \cline{2-4} |
---|
| 316 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\ |
---|
| 317 | \hline |
---|
[383e159] | 318 | Pthreads & 26996 & 26984.71 & 156.6 \\ |
---|
| 319 | \CFA Coroutine Lazy & 6 & 5.71 & 0.45 \\ |
---|
| 320 | \CFA Coroutine Eager & 708 & 706.68 & 4.82 \\ |
---|
| 321 | \CFA Thread & 1173.5 & 1176.18 & 15.18 \\ |
---|
| 322 | \uC Coroutine & 109 & 107.46 & 1.74 \\ |
---|
| 323 | \uC Thread & 526 & 530.89 & 9.73 \\ |
---|
| 324 | Goroutine & 2520.5 & 2530.93 & 61,56 \\ |
---|
| 325 | Java Thread & 91114.5 & 92272.79 & 961.58 \\ |
---|
[64b272a] | 326 | \hline |
---|
| 327 | \end{tabular} |
---|
| 328 | \end{center} |
---|
[07c1e595] | 329 | \caption{Creation comparison. All numbers are in nanoseconds(\si{\nano\second})} |
---|
[64b272a] | 330 | \label{tab:creation} |
---|
[cf966b5] | 331 | \end{table} |
---|