1 | % ====================================================================== |
---|
2 | % ====================================================================== |
---|
3 | \chapter{Performance results} \label{results} |
---|
4 | % ====================================================================== |
---|
5 | % ====================================================================== |
---|
6 | \section{Machine setup} |
---|
7 | Table \ref{tab:machine} shows the characteristics of the machine used to run the benchmarks. All tests where made on this machine. |
---|
8 | \begin{figure}[H] |
---|
9 | \begin{center} |
---|
10 | \begin{tabular}{| l | r | l | r |} |
---|
11 | \hline |
---|
12 | Architecture & x86\_64 & NUMA node(s) & 8 \\ |
---|
13 | \hline |
---|
14 | CPU op-mode(s) & 32-bit, 64-bit & Model name & AMD Opteron\texttrademark Processor 6380 \\ |
---|
15 | \hline |
---|
16 | Byte Order & Little Endian & CPU Freq & 2.5\si{\giga\hertz} \\ |
---|
17 | \hline |
---|
18 | CPU(s) & 64 & L1d cache & \SI{16}{\kibi\byte} \\ |
---|
19 | \hline |
---|
20 | Thread(s) per core & 2 & L1i cache & \SI{64}{\kibi\byte} \\ |
---|
21 | \hline |
---|
22 | Core(s) per socket & 8 & L2 cache & \SI{2048}{\kibi\byte} \\ |
---|
23 | \hline |
---|
24 | Socket(s) & 4 & L3 cache & \SI{6144}{\kibi\byte} \\ |
---|
25 | \hline |
---|
26 | \hline |
---|
27 | Operating system & Ubuntu 16.04.3 LTS & Kernel & Linux 4.4.0-97-generic \\ |
---|
28 | \hline |
---|
29 | Compiler & GCC 6.3.0 & Translator & CFA 1.0.0 \\ |
---|
30 | \hline |
---|
31 | \end{tabular} |
---|
32 | \end{center} |
---|
33 | \caption{Machine setup used for the tests} |
---|
34 | \label{tab:machine} |
---|
35 | \end{figure} |
---|
36 | |
---|
37 | \section{Micro benchmarks} |
---|
38 | All benchmarks are run using the same harness to produce the results, seen as the \code{BENCH()} macro in the following examples. This macro uses the following logic to benchmark the code : |
---|
39 | \begin{pseudo} |
---|
40 | #define BENCH(run, result) |
---|
41 | gettime(); |
---|
42 | run; |
---|
43 | gettime(); |
---|
44 | result = (after - before) / N; |
---|
45 | \end{pseudo} |
---|
46 | The method used to get time is \code{clock_gettime(CLOCK_THREAD_CPUTIME_ID);}. Each benchmark is using many iterations of a simple call to measure the cost of the call. The specific number of iteration depends on the specific benchmark. |
---|
47 | |
---|
48 | \subsection{Context-switching} |
---|
49 | The first interesting benchmark is to measure how long context-switches take. The simplest approach to do this is to yield on a thread, which executes a 2-step context switch. In order to make the comparison fair, coroutines also execute a 2-step context-switch, which is a resume/suspend cycle instead of a yield. Listing \ref{lst:ctx-switch} shows the code for coroutines and threads. All omitted tests are functionally identical to one of these tests. The results can be shown in table \ref{tab:ctx-switch}. |
---|
50 | \begin{figure} |
---|
51 | \begin{multicols}{2} |
---|
52 | \CFA Coroutines |
---|
53 | \begin{cfacode} |
---|
54 | coroutine GreatSuspender {}; |
---|
55 | void main(GreatSuspender& this) { |
---|
56 | while(true) { suspend(); } |
---|
57 | } |
---|
58 | int main() { |
---|
59 | GreatSuspender s; |
---|
60 | resume(s); |
---|
61 | BENCH( |
---|
62 | for(size_t i=0; i<n; i++) { |
---|
63 | resume(s); |
---|
64 | }, |
---|
65 | result |
---|
66 | ) |
---|
67 | printf("%llu\n", result); |
---|
68 | } |
---|
69 | \end{cfacode} |
---|
70 | \columnbreak |
---|
71 | \CFA Threads |
---|
72 | \begin{cfacode} |
---|
73 | |
---|
74 | |
---|
75 | |
---|
76 | |
---|
77 | int main() { |
---|
78 | |
---|
79 | |
---|
80 | BENCH( |
---|
81 | for(size_t i=0; i<n; i++) { |
---|
82 | yield(); |
---|
83 | }, |
---|
84 | result |
---|
85 | ) |
---|
86 | printf("%llu\n", result); |
---|
87 | } |
---|
88 | \end{cfacode} |
---|
89 | \end{multicols} |
---|
90 | \caption{\CFA benchmark code used to measure context-switches for coroutines and threads.} |
---|
91 | \label{lst:ctx-switch} |
---|
92 | \end{figure} |
---|
93 | |
---|
94 | \begin{figure} |
---|
95 | \begin{center} |
---|
96 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |} |
---|
97 | \cline{2-4} |
---|
98 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\ |
---|
99 | \hline |
---|
100 | Kernel Threads & 239 & 242.57 & 5.54 \\ |
---|
101 | \CFA Coroutines & 38 & 38 & 0 \\ |
---|
102 | \CFA Threads & 102 & 102.39 & 1.57 \\ |
---|
103 | \uC Coroutines & 46 & 46.68 & 0.47 \\ |
---|
104 | \uC Threads & 98 & 99.39 & 1.52 \\ |
---|
105 | \hline |
---|
106 | \end{tabular} |
---|
107 | \end{center} |
---|
108 | \caption{Context Switch comparison. All numbers are in nanoseconds(\si{\nano\second})} |
---|
109 | \label{tab:ctx-switch} |
---|
110 | \end{figure} |
---|
111 | |
---|
112 | \subsection{Mutual-exclusion} |
---|
113 | The next interesting benchmark is to measure the overhead to enter/leave a critical-section. For monitors, the simplest approach is to measure how long it takes enter and leave a monitor routine. Listing \ref{lst:mutex} shows the code for \CFA. To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a pthread mutex lock are also measured. The results can be shown in table \ref{tab:mutex}. |
---|
114 | |
---|
115 | \begin{figure} |
---|
116 | \begin{cfacode} |
---|
117 | monitor M {}; |
---|
118 | void __attribute__((noinline)) call( M & mutex m /*, m2, m3, m4*/ ) {} |
---|
119 | |
---|
120 | int main() { |
---|
121 | M m/*, m2, m3, m4*/; |
---|
122 | BENCH( |
---|
123 | for(size_t i=0; i<n; i++) { |
---|
124 | call(m/*, m2, m3, m4*/); |
---|
125 | }, |
---|
126 | result |
---|
127 | ) |
---|
128 | printf("%llu\n", result); |
---|
129 | } |
---|
130 | \end{cfacode} |
---|
131 | \caption{\CFA benchmark code used to measure mutex routines.} |
---|
132 | \label{lst:mutex} |
---|
133 | \end{figure} |
---|
134 | |
---|
135 | \begin{figure} |
---|
136 | \begin{center} |
---|
137 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |} |
---|
138 | \cline{2-4} |
---|
139 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\ |
---|
140 | \hline |
---|
141 | C routine & 2 & 2 & 0 \\ |
---|
142 | Pthreads Mutex Lock & 31 & 31.86 & 0.99 \\ |
---|
143 | \uC \code{monitor} member routine & 30 & 30 & 0 \\ |
---|
144 | \CFA \code{mutex} routine, 1 argument & 46 & 46.14 & 0.74 \\ |
---|
145 | \CFA \code{mutex} routine, 2 argument & 82 & 83 & 1.93 \\ |
---|
146 | \CFA \code{mutex} routine, 4 argument & 165 & 161.15 & 54.04 \\ |
---|
147 | \hline |
---|
148 | \end{tabular} |
---|
149 | \end{center} |
---|
150 | \caption{Mutex routine comparison. All numbers are in nanoseconds(\si{\nano\second})} |
---|
151 | \label{tab:mutex} |
---|
152 | \end{figure} |
---|
153 | |
---|
154 | \subsection{Internal scheduling} |
---|
155 | The Internal scheduling benchmark measures the cost of waiting on and signalling a condition variable. Listing \ref{lst:int-sched} shows the code for \CFA. The results can be shown in table \ref{tab:int-sched}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests. |
---|
156 | |
---|
157 | \begin{figure} |
---|
158 | \begin{cfacode} |
---|
159 | volatile int go = 0; |
---|
160 | condition c; |
---|
161 | monitor M {}; |
---|
162 | M m1; |
---|
163 | |
---|
164 | void __attribute__((noinline)) do_call( M & mutex a1 ) { signal(c); } |
---|
165 | |
---|
166 | thread T {}; |
---|
167 | void ^?{}( T & mutex this ) {} |
---|
168 | void main( T & this ) { |
---|
169 | while(go == 0) { yield(); } |
---|
170 | while(go == 1) { do_call(m1); } |
---|
171 | } |
---|
172 | int __attribute__((noinline)) do_wait( M & mutex a1 ) { |
---|
173 | go = 1; |
---|
174 | BENCH( |
---|
175 | for(size_t i=0; i<n; i++) { |
---|
176 | wait(c); |
---|
177 | }, |
---|
178 | result |
---|
179 | ) |
---|
180 | printf("%llu\n", result); |
---|
181 | go = 0; |
---|
182 | return 0; |
---|
183 | } |
---|
184 | int main() { |
---|
185 | T t; |
---|
186 | return do_wait(m1); |
---|
187 | } |
---|
188 | \end{cfacode} |
---|
189 | \caption{Benchmark code for internal scheduling} |
---|
190 | \label{lst:int-sched} |
---|
191 | \end{figure} |
---|
192 | |
---|
193 | \begin{figure} |
---|
194 | \begin{center} |
---|
195 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |} |
---|
196 | \cline{2-4} |
---|
197 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\ |
---|
198 | \hline |
---|
199 | \uC \code{signal} & 322 & 322.57 & 2.77 \\ |
---|
200 | \CFA \code{signal}, 1 \code{monitor} & 1145 & 1163.64 & 27.52 \\ |
---|
201 | \CFA \code{signal}, 2 \code{monitor} & 1531 & 1550.75 & 32.77 \\ |
---|
202 | \CFA \code{signal}, 4 \code{monitor} & 2288.5 & 2326.86 & 54.73 \\ |
---|
203 | \hline |
---|
204 | \end{tabular} |
---|
205 | \end{center} |
---|
206 | \caption{Internal scheduling comparison. All numbers are in nanoseconds(\si{\nano\second})} |
---|
207 | \label{tab:int-sched} |
---|
208 | \end{figure} |
---|
209 | |
---|
210 | \subsection{External scheduling} |
---|
211 | The Internal scheduling benchmark measures the cost of the \code{waitfor} statement (\code{_Accept} in \uC). Listing \ref{lst:ext-sched} shows the code for \CFA. The results can be shown in table \ref{tab:ext-sched}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests. |
---|
212 | |
---|
213 | \begin{figure} |
---|
214 | \begin{cfacode} |
---|
215 | volatile int go = 0; |
---|
216 | monitor M {}; |
---|
217 | M m1; |
---|
218 | thread T {}; |
---|
219 | |
---|
220 | void __attribute__((noinline)) do_call( M & mutex a1 ) {} |
---|
221 | |
---|
222 | void ^?{}( T & mutex this ) {} |
---|
223 | void main( T & this ) { |
---|
224 | while(go == 0) { yield(); } |
---|
225 | while(go == 1) { do_call(m1); } |
---|
226 | } |
---|
227 | int __attribute__((noinline)) do_wait( M & mutex a1 ) { |
---|
228 | go = 1; |
---|
229 | BENCH( |
---|
230 | for(size_t i=0; i<n; i++) { |
---|
231 | waitfor(call, a1); |
---|
232 | }, |
---|
233 | result |
---|
234 | ) |
---|
235 | printf("%llu\n", result); |
---|
236 | go = 0; |
---|
237 | return 0; |
---|
238 | } |
---|
239 | int main() { |
---|
240 | T t; |
---|
241 | return do_wait(m1); |
---|
242 | } |
---|
243 | \end{cfacode} |
---|
244 | \caption{Benchmark code for external scheduling} |
---|
245 | \label{lst:ext-sched} |
---|
246 | \end{figure} |
---|
247 | |
---|
248 | \begin{figure} |
---|
249 | \begin{center} |
---|
250 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |} |
---|
251 | \cline{2-4} |
---|
252 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\ |
---|
253 | \hline |
---|
254 | \uC \code{Accept} & 349 & 339.32 & 3.14 \\ |
---|
255 | \CFA \code{waitfor}, 1 \code{monitor} & 1155.5 & 1142.04 & 25.23 \\ |
---|
256 | \CFA \code{waitfor}, 2 \code{monitor} & 1361 & 1376.75 & 28.81 \\ |
---|
257 | \CFA \code{waitfor}, 4 \code{monitor} & 1941.5 & 1957.07 & 34.7 \\ |
---|
258 | \hline |
---|
259 | \end{tabular} |
---|
260 | \end{center} |
---|
261 | \caption{External scheduling comparison. All numbers are in nanoseconds(\si{\nano\second})} |
---|
262 | \label{tab:ext-sched} |
---|
263 | \end{figure} |
---|
264 | |
---|
265 | \subsection{Object creation} |
---|
266 | Finally, the last benchmark measured is the cost of creation for concurrent objects. Listing \ref{lst:creation} shows the code for pthreads and \CFA threads. The results can be shown in table \ref{tab:creation}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests. The only note here is that the call-stacks of \CFA coroutines are lazily created, therefore without priming the coroutine, the creation cost is very low. |
---|
267 | |
---|
268 | \begin{figure} |
---|
269 | \begin{multicols}{2} |
---|
270 | pthread |
---|
271 | \begin{cfacode} |
---|
272 | int main() { |
---|
273 | BENCH( |
---|
274 | for(size_t i=0; i<n; i++) { |
---|
275 | pthread_t thread; |
---|
276 | if(pthread_create( |
---|
277 | &thread, |
---|
278 | NULL, |
---|
279 | foo, |
---|
280 | NULL |
---|
281 | ) < 0) { |
---|
282 | perror( "failure" ); |
---|
283 | return 1; |
---|
284 | } |
---|
285 | |
---|
286 | if(pthread_join( |
---|
287 | thread, |
---|
288 | NULL |
---|
289 | ) < 0) { |
---|
290 | perror( "failure" ); |
---|
291 | return 1; |
---|
292 | } |
---|
293 | }, |
---|
294 | result |
---|
295 | ) |
---|
296 | printf("%llu\n", result); |
---|
297 | } |
---|
298 | \end{cfacode} |
---|
299 | \columnbreak |
---|
300 | \CFA Threads |
---|
301 | \begin{cfacode} |
---|
302 | int main() { |
---|
303 | BENCH( |
---|
304 | for(size_t i=0; i<n; i++) { |
---|
305 | MyThread m; |
---|
306 | }, |
---|
307 | result |
---|
308 | ) |
---|
309 | |
---|
310 | printf("%llu\n", result); |
---|
311 | } |
---|
312 | \end{cfacode} |
---|
313 | \end{multicols} |
---|
314 | \caption{Benchmark code for pthreads and \CFA to measure object creation} |
---|
315 | \label{lst:creation} |
---|
316 | \end{figure} |
---|
317 | |
---|
318 | \begin{figure} |
---|
319 | \begin{center} |
---|
320 | \begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |} |
---|
321 | \cline{2-4} |
---|
322 | \multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\ |
---|
323 | \hline |
---|
324 | Pthreads & 26974.5 & 26977 & 124.12 \\ |
---|
325 | \CFA Coroutines Lazy & 5 & 5 & 0 \\ |
---|
326 | \CFA Coroutines Eager & 335.0 & 357.67 & 34.2 \\ |
---|
327 | \CFA Threads & 1122.5 & 1109.86 & 36.54 \\ |
---|
328 | \uC Coroutines & 106 & 107.04 & 1.61 \\ |
---|
329 | \uC Threads & 525.5 & 533.04 & 11.14 \\ |
---|
330 | \hline |
---|
331 | \end{tabular} |
---|
332 | \end{center} |
---|
333 | \caption{Creation comparison. All numbers are in nanoseconds(\si{\nano\second})} |
---|
334 | \label{tab:creation} |
---|
335 | \end{figure} |
---|