Context Navigation

source: doc/proposals/concurrency/text/results.tex @ cf966b5

ADTaaron-thesisarm-ehast-experimentalcleanup-dtorsdeferred_resndemanglerenumforall-pointer-decayjacob/cs343-translationjenkins-sandboxnew-astnew-ast-unique-exprnew-envno_listpersistent-indexerpthread-emulationqualifiedEnumresolv-newwith_gc

Last change on this file since cf966b5 was cf966b5, checked in by Thierry Delisle <tdelisle@…>, 6 years ago
Results need to be updated but otherwise, tentative final draft
Property mode set to `100644`
File size: 11.0 KB

Line
1	% ======================================================================
2	% ======================================================================
3	\chapter{Performance results} \label{results}
4	% ======================================================================
5	% ======================================================================
6	\section{Machine setup}
7	Table \ref{tab:machine} shows the characteristics of the machine used to run the benchmarks. All tests where made on this machine.
8	\begin{table}[H]
9	\begin{center}
10	\begin{tabular}{\| l \| r \| l \| r \|}
11	\hline
12	Architecture & x86\_64 & NUMA node(s) & 8 \\
13	\hline
14	CPU op-mode(s) & 32-bit, 64-bit & Model name & AMD Opteron\texttrademark Processor 6380 \\
15	\hline
16	Byte Order & Little Endian & CPU Freq & 2.5\si{\giga\hertz} \\
17	\hline
18	CPU(s) & 64 & L1d cache & \SI{16}{\kibi\byte} \\
19	\hline
20	Thread(s) per core & 2 & L1i cache & \SI{64}{\kibi\byte} \\
21	\hline
22	Core(s) per socket & 8 & L2 cache & \SI{2048}{\kibi\byte} \\
23	\hline
24	Socket(s) & 4 & L3 cache & \SI{6144}{\kibi\byte} \\
25	\hline
26	\hline
27	Operating system & Ubuntu 16.04.3 LTS & Kernel & Linux 4.4.0-97-generic \\
28	\hline
29	Compiler & GCC 6.3.0 & Translator & CFA 1.0.0 \\
30	\hline
31	Java version & OpenJDK-9 & Go version & 1.9.2 \\
32	\hline
33	\end{tabular}
34	\end{center}
35	\caption{Machine setup used for the tests}
36	\label{tab:machine}
37	\end{table}
38
39	\section{Micro benchmarks}
40	All benchmarks are run using the same harness to produce the results, seen as the \code{BENCH()} macro in the following examples. This macro uses the following logic to benchmark the code :
41	\begin{pseudo}
42	#define BENCH(run, result)
43	before = gettime();
44	run;
45	after = gettime();
46	result = (after - before) / N;
47	\end{pseudo}
48	The method used to get time is \code{clock_gettime(CLOCK_THREAD_CPUTIME_ID);}. Each benchmark is using many iterations of a simple call to measure the cost of the call. The specific number of iteration depends on the specific benchmark.
49
50	\subsection{Context-switching}
51	The first interesting benchmark is to measure how long context-switches take. The simplest approach to do this is to yield on a thread, which executes a 2-step context switch. In order to make the comparison fair, coroutines also execute a 2-step context-switch (\gls{uthread} to \gls{kthread} then \gls{kthread} to \gls{uthread}), which is a resume/suspend cycle instead of a yield. Listing \ref{lst:ctx-switch} shows the code for coroutines and threads whith the results in table \ref{tab:ctx-switch}. All omitted tests are functionally identical to one of these tests.
52	\begin{figure}
53	\begin{multicols}{2}
54	\CFA Coroutines
55	\begin{cfacode}
56	coroutine GreatSuspender {};
57	void main(GreatSuspender& this) {
58	while(true) { suspend(); }
59	}
60	int main() {
61	GreatSuspender s;
62	resume(s);
63	BENCH(
64	for(size_t i=0; i<n; i++) {
65	resume(s);
66	},
67	result
68	)
69	printf("%llu\n", result);
70	}
71	\end{cfacode}
72	\columnbreak
73	\CFA Threads
74	\begin{cfacode}
75
76
77
78
79	int main() {
80
81
82	BENCH(
83	for(size_t i=0; i<n; i++) {
84	yield();
85	},
86	result
87	)
88	printf("%llu\n", result);
89	}
90	\end{cfacode}
91	\end{multicols}
92	\begin{cfacode}[caption={\CFA benchmark code used to measure context-switches for coroutines and threads.},label={lst:ctx-switch}]
93	\end{cfacode}
94	\end{figure}
95
96	\begin{table}
97	\begin{center}
98	\begin{tabular}{\| l \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \|}
99	\cline{2-4}
100	\multicolumn{1}{c \|}{} & \multicolumn{1}{c \|}{ Median } &\multicolumn{1}{c \|}{ Average } & \multicolumn{1}{c \|}{ Standard Deviation} \\
101	\hline
102	Kernel Thread & 239 & 242.57 & 5.54 \\
103	\CFA Coroutine & 38 & 38 & 0 \\
104	\CFA Thread & 102 & 102.39 & 1.57 \\
105	\uC Coroutine & 46 & 46.68 & 0.47 \\
106	\uC Thread & 98 & 99.39 & 1.52 \\
107	Goroutine & 148 & 148.0 & 0 \\
108	Java Thread & 271 & 271.0 & 0 \\
109	\hline
110	\end{tabular}
111	\end{center}
112	\caption{Context Switch comparison. All numbers are in nanoseconds(\si{\nano\second})}
113	\label{tab:ctx-switch}
114	\end{table}
115
116	\subsection{Mutual-exclusion}
117	The next interesting benchmark is to measure the overhead to enter/leave a critical-section. For monitors, the simplest approach is to measure how long it takes to enter and leave a monitor routine. Listing \ref{lst:mutex} shows the code for \CFA. To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a pthread mutex lock are also measured. The results can be shown in table \ref{tab:mutex}.
118
119	\begin{figure}
120	\begin{cfacode}[caption={\CFA benchmark code used to measure mutex routines.},label={lst:mutex}]
121	monitor M {};
122	void __attribute__((noinline)) call( M & mutex m /, m2, m3, m4/ ) {}
123
124	int main() {
125	M m/, m2, m3, m4/;
126	BENCH(
127	for(size_t i=0; i<n; i++) {
128	call(m/, m2, m3, m4/);
129	},
130	result
131	)
132	printf("%llu\n", result);
133	}
134	\end{cfacode}
135	\end{figure}
136
137	\begin{table}
138	\begin{center}
139	\begin{tabular}{\| l \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \|}
140	\cline{2-4}
141	\multicolumn{1}{c \|}{} & \multicolumn{1}{c \|}{ Median } &\multicolumn{1}{c \|}{ Average } & \multicolumn{1}{c \|}{ Standard Deviation} \\
142	\hline
143	C routine & 2 & 2 & 0 \\
144	FetchAdd + FetchSub & 2 & 2 & 0 \\
145	Pthreads Mutex Lock & 31 & 31.86 & 0.99 \\
146	\uC \code{monitor} member routine & 30 & 30 & 0 \\
147	\CFA \code{mutex} routine, 1 argument & 46 & 46.14 & 0.74 \\
148	\CFA \code{mutex} routine, 2 argument & 82 & 83 & 1.93 \\
149	\CFA \code{mutex} routine, 4 argument & 165 & 161.15 & 54.04 \\
150	Java synchronized routine & 165 & 161.15 & 54.04 \\
151	\hline
152	\end{tabular}
153	\end{center}
154	\caption{Mutex routine comparison. All numbers are in nanoseconds(\si{\nano\second})}
155	\label{tab:mutex}
156	\end{table}
157
158	\subsection{Internal scheduling}
159	The internal-scheduling benchmark measures the cost of waiting on and signalling a condition variable. Listing \ref{lst:int-sched} shows the code for \CFA, with results table \ref{tab:int-sched}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests.
160
161	\begin{figure}
162	\begin{cfacode}[caption={Benchmark code for internal scheduling},label={lst:int-sched}]
163	volatile int go = 0;
164	condition c;
165	monitor M {};
166	M m1;
167
168	void __attribute__((noinline)) do_call( M & mutex a1 ) { signal(c); }
169
170	thread T {};
171	void ^?{}( T & mutex this ) {}
172	void main( T & this ) {
173	while(go == 0) { yield(); }
174	while(go == 1) { do_call(m1); }
175	}
176	int __attribute__((noinline)) do_wait( M & mutex a1 ) {
177	go = 1;
178	BENCH(
179	for(size_t i=0; i<n; i++) {
180	wait(c);
181	},
182	result
183	)
184	printf("%llu\n", result);
185	go = 0;
186	return 0;
187	}
188	int main() {
189	T t;
190	return do_wait(m1);
191	}
192	\end{cfacode}
193	\end{figure}
194
195	\begin{table}
196	\begin{center}
197	\begin{tabular}{\| l \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \|}
198	\cline{2-4}
199	\multicolumn{1}{c \|}{} & \multicolumn{1}{c \|}{ Median } &\multicolumn{1}{c \|}{ Average } & \multicolumn{1}{c \|}{ Standard Deviation} \\
200	\hline
201	\uC \code{signal} & 322 & 322.57 & 2.77 \\
202	\CFA \code{signal}, 1 \code{monitor} & 1145 & 1163.64 & 27.52 \\
203	\CFA \code{signal}, 2 \code{monitor} & 1531 & 1550.75 & 32.77 \\
204	\CFA \code{signal}, 4 \code{monitor} & 2288.5 & 2326.86 & 54.73 \\
205	Java \code{notify} & 2288.5 & 2326.86 & 54.73 \\
206	\hline
207	\end{tabular}
208	\end{center}
209	\caption{Internal scheduling comparison. All numbers are in nanoseconds(\si{\nano\second})}
210	\label{tab:int-sched}
211	\end{table}
212
213	\subsection{External scheduling}
214	The Internal scheduling benchmark measures the cost of the \code{waitfor} statement (\code{_Accept} in \uC). Listing \ref{lst:ext-sched} shows the code for \CFA, with results in table \ref{tab:ext-sched}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests.
215
216	\begin{figure}
217	\begin{cfacode}[caption={Benchmark code for external scheduling},label={lst:ext-sched}]
218	volatile int go = 0;
219	monitor M {};
220	M m1;
221	thread T {};
222
223	void __attribute__((noinline)) do_call( M & mutex a1 ) {}
224
225	void ^?{}( T & mutex this ) {}
226	void main( T & this ) {
227	while(go == 0) { yield(); }
228	while(go == 1) { do_call(m1); }
229	}
230	int __attribute__((noinline)) do_wait( M & mutex a1 ) {
231	go = 1;
232	BENCH(
233	for(size_t i=0; i<n; i++) {
234	waitfor(call, a1);
235	},
236	result
237	)
238	printf("%llu\n", result);
239	go = 0;
240	return 0;
241	}
242	int main() {
243	T t;
244	return do_wait(m1);
245	}
246	\end{cfacode}
247	\end{figure}
248
249	\begin{table}
250	\begin{center}
251	\begin{tabular}{\| l \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \|}
252	\cline{2-4}
253	\multicolumn{1}{c \|}{} & \multicolumn{1}{c \|}{ Median } &\multicolumn{1}{c \|}{ Average } & \multicolumn{1}{c \|}{ Standard Deviation} \\
254	\hline
255	\uC \code{Accept} & 349 & 339.32 & 3.14 \\
256	\CFA \code{waitfor}, 1 \code{monitor} & 1155.5 & 1142.04 & 25.23 \\
257	\CFA \code{waitfor}, 2 \code{monitor} & 1361 & 1376.75 & 28.81 \\
258	\CFA \code{waitfor}, 4 \code{monitor} & 1941.5 & 1957.07 & 34.7 \\
259	\hline
260	\end{tabular}
261	\end{center}
262	\caption{External scheduling comparison. All numbers are in nanoseconds(\si{\nano\second})}
263	\label{tab:ext-sched}
264	\end{table}
265
266	\subsection{Object creation}
267	Finally, the last benchmark measurs the cost of creation for concurrent objects. Listing \ref{lst:creation} shows the code for pthreads and \CFA threads, with results shown in table \ref{tab:creation}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests. The only note here is that the call-stacks of \CFA coroutines are lazily created, therefore without priming the coroutine, the creation cost is very low.
268
269	\begin{figure}
270	\begin{center}
271	pthread
272	\begin{ccode}
273	int main() {
274	BENCH(
275	for(size_t i=0; i<n; i++) {
276	pthread_t thread;
277	if(pthread_create(&thread,NULL,foo,NULL)<0) {
278	perror( "failure" );
279	return 1;
280	}
281
282	if(pthread_join(thread, NULL)<0) {
283	perror( "failure" );
284	return 1;
285	}
286	},
287	result
288	)
289	printf("%llu\n", result);
290	}
291	\end{ccode}
292
293
294
295	\CFA Threads
296	\begin{cfacode}
297	int main() {
298	BENCH(
299	for(size_t i=0; i<n; i++) {
300	MyThread m;
301	},
302	result
303	)
304	printf("%llu\n", result);
305	}
306	\end{cfacode}
307	\end{center}
308	\begin{cfacode}[caption={Benchmark code for pthreads and \CFA to measure object creation},label={lst:creation}]
309	\end{cfacode}
310	\end{figure}
311
312	\begin{table}
313	\begin{center}
314	\begin{tabular}{\| l \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \|}
315	\cline{2-4}
316	\multicolumn{1}{c \|}{} & \multicolumn{1}{c \|}{ Median } &\multicolumn{1}{c \|}{ Average } & \multicolumn{1}{c \|}{ Standard Deviation} \\
317	\hline
318	Pthreads & 26974.5 & 26977 & 124.12 \\
319	\CFA Coroutine Lazy & 5 & 5 & 0 \\
320	\CFA Coroutine Eager & 335.0 & 357.67 & 34.2 \\
321	\CFA Thread & 1122.5 & 1109.86 & 36.54 \\
322	\uC Coroutine & 106 & 107.04 & 1.61 \\
323	\uC Thread & 525.5 & 533.04 & 11.14 \\
324	Goroutine & 525.5 & 533.04 & 11.14 \\
325	Java Thread & 525.5 & 533.04 & 11.14 \\
326	\hline
327	\end{tabular}
328	\end{center}
329	\caption{Creation comparison. All numbers are in nanoseconds(\si{\nano\second})}
330	\label{tab:creation}
331	\end{table}

Note: See TracBrowser for help on using the repository browser.

Download in other formats: