Context Navigation

source: doc/theses/thierry_delisle/text/results.tex @ 161cdf1

ADTaaron-thesisarm-ehast-experimentalcleanup-dtorsdeferred_resndemanglerenumforall-pointer-decayjacob/cs343-translationjenkins-sandboxnew-astnew-ast-unique-exprnew-envno_listpersistent-indexerpthread-emulationqualifiedEnumwith_gc

Last change on this file since 161cdf1 was 728df66, checked in by Peter A. Buhr <pabuhr@…>, 7 years ago
more documentation name changes
Property mode set to `100644`
File size: 11.5 KB

Line
1	% ======================================================================
2	% ======================================================================
3	\chapter{Performance Results} \label{results}
4	% ======================================================================
5	% ======================================================================
6	\section{Machine Setup}
7	Table \ref{tab:machine} shows the characteristics of the machine used to run the benchmarks. All tests were made on this machine.
8	\begin{table}[H]
9	\begin{center}
10	\begin{tabular}{\| l \| r \| l \| r \|}
11	\hline
12	Architecture & x86\_64 & NUMA node(s) & 8 \\
13	\hline
14	CPU op-mode(s) & 32-bit, 64-bit & Model name & AMD Opteron\texttrademark Processor 6380 \\
15	\hline
16	Byte Order & Little Endian & CPU Freq & 2.5\si{\giga\hertz} \\
17	\hline
18	CPU(s) & 64 & L1d cache & \SI{16}{\kibi\byte} \\
19	\hline
20	Thread(s) per core & 2 & L1i cache & \SI{64}{\kibi\byte} \\
21	\hline
22	Core(s) per socket & 8 & L2 cache & \SI{2048}{\kibi\byte} \\
23	\hline
24	Socket(s) & 4 & L3 cache & \SI{6144}{\kibi\byte} \\
25	\hline
26	\hline
27	Operating system & Ubuntu 16.04.3 LTS & Kernel & Linux 4.4-97-generic \\
28	\hline
29	Compiler & GCC 6.3 & Translator & CFA 1 \\
30	\hline
31	Java version & OpenJDK-9 & Go version & 1.9.2 \\
32	\hline
33	\end{tabular}
34	\end{center}
35	\caption{Machine setup used for the tests}
36	\label{tab:machine}
37	\end{table}
38
39	\section{Micro Benchmarks}
40	All benchmarks are run using the same harness to produce the results, seen as the \code{BENCH()} macro in the following examples. This macro uses the following logic to benchmark the code:
41	\begin{pseudo}
42	#define BENCH(run, result) \
43	before = gettime(); \
44	run; \
45	after = gettime(); \
46	result = (after - before) / N;
47	\end{pseudo}
48	The method used to get time is \code{clock_gettime(CLOCK_THREAD_CPUTIME_ID);}. Each benchmark is using many iterations of a simple call to measure the cost of the call. The specific number of iterations depends on the specific benchmark.
49
50	\subsection{Context-Switching}
51	The first interesting benchmark is to measure how long context-switches take. The simplest approach to do this is to yield on a thread, which executes a 2-step context switch. Yielding causes the thread to context-switch to the scheduler and back, more precisely: from the \gls{uthread} to the \gls{kthread} then from the \gls{kthread} back to the same \gls{uthread} (or a different one in the general case). In order to make the comparison fair, coroutines also execute a 2-step context-switch by resuming another coroutine which does nothing but suspending in a tight loop, which is a resume/suspend cycle instead of a yield. Listing \ref{lst:ctx-switch} shows the code for coroutines and threads with the results in table \ref{tab:ctx-switch}. All omitted tests are functionally identical to one of these tests. The difference between coroutines and threads can be attributed to the cost of scheduling.
52	\begin{figure}
53	\begin{multicols}{2}
54	\CFA Coroutines
55	\begin{cfacode}
56	coroutine GreatSuspender {};
57	void main(GreatSuspender& this) {
58	while(true) { suspend(); }
59	}
60	int main() {
61	GreatSuspender s;
62	resume(s);
63	BENCH(
64	for(size_t i=0; i<n; i++) {
65	resume(s);
66	},
67	result
68	)
69	printf("%llu\n", result);
70	}
71	\end{cfacode}
72	\columnbreak
73	\CFA Threads
74	\begin{cfacode}
75
76
77
78
79	int main() {
80
81
82	BENCH(
83	for(size_t i=0; i<n; i++) {
84	yield();
85	},
86	result
87	)
88	printf("%llu\n", result);
89	}
90	\end{cfacode}
91	\end{multicols}
92	\begin{cfacode}[caption={\CFA benchmark code used to measure context-switches for coroutines and threads.},label={lst:ctx-switch}]
93	\end{cfacode}
94	\end{figure}
95
96	\begin{table}
97	\begin{center}
98	\begin{tabular}{\| l \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \|}
99	\cline{2-4}
100	\multicolumn{1}{c \|}{} & \multicolumn{1}{c \|}{ Median } &\multicolumn{1}{c \|}{ Average } & \multicolumn{1}{c \|}{ Standard Deviation} \\
101	\hline
102	Kernel Thread & 241.5 & 243.86 & 5.08 \\
103	\CFA Coroutine & 38 & 38 & 0 \\
104	\CFA Thread & 103 & 102.96 & 2.96 \\
105	\uC Coroutine & 46 & 45.86 & 0.35 \\
106	\uC Thread & 98 & 99.11 & 1.42 \\
107	Goroutine & 150 & 149.96 & 3.16 \\
108	Java Thread & 289 & 290.68 & 8.72 \\
109	\hline
110	\end{tabular}
111	\end{center}
112	\caption{Context Switch comparison. All numbers are in nanoseconds(\si{\nano\second})}
113	\label{tab:ctx-switch}
114	\end{table}
115
116	\subsection{Mutual-Exclusion}
117	The next interesting benchmark is to measure the overhead to enter/leave a critical-section. For monitors, the simplest approach is to measure how long it takes to enter and leave a monitor routine. Listing \ref{lst:mutex} shows the code for \CFA. To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a \code{pthread_mutex} lock is also measured. The results can be shown in table \ref{tab:mutex}.
118
119	\begin{figure}
120	\begin{cfacode}[caption={\CFA benchmark code used to measure mutex routines.},label={lst:mutex}]
121	monitor M {};
122	void __attribute__((noinline)) call( M & mutex m /, m2, m3, m4/ ) {}
123
124	int main() {
125	M m/, m2, m3, m4/;
126	BENCH(
127	for(size_t i=0; i<n; i++) {
128	call(m/, m2, m3, m4/);
129	},
130	result
131	)
132	printf("%llu\n", result);
133	}
134	\end{cfacode}
135	\end{figure}
136
137	\begin{table}
138	\begin{center}
139	\begin{tabular}{\| l \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \|}
140	\cline{2-4}
141	\multicolumn{1}{c \|}{} & \multicolumn{1}{c \|}{ Median } &\multicolumn{1}{c \|}{ Average } & \multicolumn{1}{c \|}{ Standard Deviation} \\
142	\hline
143	C routine & 2 & 2 & 0 \\
144	FetchAdd + FetchSub & 26 & 26 & 0 \\
145	Pthreads Mutex Lock & 31 & 31.86 & 0.99 \\
146	\uC \code{monitor} member routine & 30 & 30 & 0 \\
147	\CFA \code{mutex} routine, 1 argument & 41 & 41.57 & 0.9 \\
148	\CFA \code{mutex} routine, 2 argument & 76 & 76.96 & 1.57 \\
149	\CFA \code{mutex} routine, 4 argument & 145 & 146.68 & 3.85 \\
150	Java synchronized routine & 27 & 28.57 & 2.6 \\
151	\hline
152	\end{tabular}
153	\end{center}
154	\caption{Mutex routine comparison. All numbers are in nanoseconds(\si{\nano\second})}
155	\label{tab:mutex}
156	\end{table}
157
158	\subsection{Internal Scheduling}
159	The internal-scheduling benchmark measures the cost of waiting on and signalling a condition variable. Listing \ref{lst:int-sched} shows the code for \CFA, with results table \ref{tab:int-sched}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests.
160
161	\begin{figure}
162	\begin{cfacode}[caption={Benchmark code for internal scheduling},label={lst:int-sched}]
163	volatile int go = 0;
164	condition c;
165	monitor M {};
166	M m1;
167
168	void __attribute__((noinline)) do_call( M & mutex a1 ) { signal(c); }
169
170	thread T {};
171	void ^?{}( T & mutex this ) {}
172	void main( T & this ) {
173	while(go == 0) { yield(); }
174	while(go == 1) { do_call(m1); }
175	}
176	int __attribute__((noinline)) do_wait( M & mutex a1 ) {
177	go = 1;
178	BENCH(
179	for(size_t i=0; i<n; i++) {
180	wait(c);
181	},
182	result
183	)
184	printf("%llu\n", result);
185	go = 0;
186	return 0;
187	}
188	int main() {
189	T t;
190	return do_wait(m1);
191	}
192	\end{cfacode}
193	\end{figure}
194
195	\begin{table}
196	\begin{center}
197	\begin{tabular}{\| l \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \|}
198	\cline{2-4}
199	\multicolumn{1}{c \|}{} & \multicolumn{1}{c \|}{ Median } &\multicolumn{1}{c \|}{ Average } & \multicolumn{1}{c \|}{ Standard Deviation} \\
200	\hline
201	Pthreads Condition Variable & 5902.5 & 6093.29 & 714.78 \\
202	\uC \code{signal} & 322 & 323 & 3.36 \\
203	\CFA \code{signal}, 1 \code{monitor} & 352.5 & 353.11 & 3.66 \\
204	\CFA \code{signal}, 2 \code{monitor} & 430 & 430.29 & 8.97 \\
205	\CFA \code{signal}, 4 \code{monitor} & 594.5 & 606.57 & 18.33 \\
206	Java \code{notify} & 13831.5 & 15698.21 & 4782.3 \\
207	\hline
208	\end{tabular}
209	\end{center}
210	\caption{Internal scheduling comparison. All numbers are in nanoseconds(\si{\nano\second})}
211	\label{tab:int-sched}
212	\end{table}
213
214	\subsection{External Scheduling}
215	The Internal scheduling benchmark measures the cost of the \code{waitfor} statement (\code{_Accept} in \uC). Listing \ref{lst:ext-sched} shows the code for \CFA, with results in table \ref{tab:ext-sched}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests.
216
217	\begin{figure}
218	\begin{cfacode}[caption={Benchmark code for external scheduling},label={lst:ext-sched}]
219	volatile int go = 0;
220	monitor M {};
221	M m1;
222	thread T {};
223
224	void __attribute__((noinline)) do_call( M & mutex a1 ) {}
225
226	void ^?{}( T & mutex this ) {}
227	void main( T & this ) {
228	while(go == 0) { yield(); }
229	while(go == 1) { do_call(m1); }
230	}
231	int __attribute__((noinline)) do_wait( M & mutex a1 ) {
232	go = 1;
233	BENCH(
234	for(size_t i=0; i<n; i++) {
235	waitfor(call, a1);
236	},
237	result
238	)
239	printf("%llu\n", result);
240	go = 0;
241	return 0;
242	}
243	int main() {
244	T t;
245	return do_wait(m1);
246	}
247	\end{cfacode}
248	\end{figure}
249
250	\begin{table}
251	\begin{center}
252	\begin{tabular}{\| l \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \|}
253	\cline{2-4}
254	\multicolumn{1}{c \|}{} & \multicolumn{1}{c \|}{ Median } &\multicolumn{1}{c \|}{ Average } & \multicolumn{1}{c \|}{ Standard Deviation} \\
255	\hline
256	\uC \code{Accept} & 350 & 350.61 & 3.11 \\
257	\CFA \code{waitfor}, 1 \code{monitor} & 358.5 & 358.36 & 3.82 \\
258	\CFA \code{waitfor}, 2 \code{monitor} & 422 & 426.79 & 7.95 \\
259	\CFA \code{waitfor}, 4 \code{monitor} & 579.5 & 585.46 & 11.25 \\
260	\hline
261	\end{tabular}
262	\end{center}
263	\caption{External scheduling comparison. All numbers are in nanoseconds(\si{\nano\second})}
264	\label{tab:ext-sched}
265	\end{table}
266
267	\subsection{Object Creation}
268	Finally, the last benchmark measures the cost of creation for concurrent objects. Listing \ref{lst:creation} shows the code for \texttt{pthread}s and \CFA threads, with results shown in table \ref{tab:creation}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests. The only note here is that the call stacks of \CFA coroutines are lazily created, therefore without priming the coroutine, the creation cost is very low.
269
270	\begin{figure}
271	\begin{center}
272	\texttt{pthread}
273	\begin{ccode}
274	int main() {
275	BENCH(
276	for(size_t i=0; i<n; i++) {
277	pthread_t thread;
278	if(pthread_create(&thread,NULL,foo,NULL)<0) {
279	perror( "failure" );
280	return 1;
281	}
282
283	if(pthread_join(thread, NULL)<0) {
284	perror( "failure" );
285	return 1;
286	}
287	},
288	result
289	)
290	printf("%llu\n", result);
291	}
292	\end{ccode}
293
294
295
296	\CFA Threads
297	\begin{cfacode}
298	int main() {
299	BENCH(
300	for(size_t i=0; i<n; i++) {
301	MyThread m;
302	},
303	result
304	)
305	printf("%llu\n", result);
306	}
307	\end{cfacode}
308	\end{center}
309	\begin{cfacode}[caption={Benchmark code for \texttt{pthread}s and \CFA to measure object creation},label={lst:creation}]
310	\end{cfacode}
311	\end{figure}
312
313	\begin{table}
314	\begin{center}
315	\begin{tabular}{\| l \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \| S[table-format=5.2,table-number-alignment=right] \|}
316	\cline{2-4}
317	\multicolumn{1}{c \|}{} & \multicolumn{1}{c \|}{ Median } &\multicolumn{1}{c \|}{ Average } & \multicolumn{1}{c \|}{ Standard Deviation} \\
318	\hline
319	Pthreads & 26996 & 26984.71 & 156.6 \\
320	\CFA Coroutine Lazy & 6 & 5.71 & 0.45 \\
321	\CFA Coroutine Eager & 708 & 706.68 & 4.82 \\
322	\CFA Thread & 1173.5 & 1176.18 & 15.18 \\
323	\uC Coroutine & 109 & 107.46 & 1.74 \\
324	\uC Thread & 526 & 530.89 & 9.73 \\
325	Goroutine & 2520.5 & 2530.93 & 61,56 \\
326	Java Thread & 91114.5 & 92272.79 & 961.58 \\
327	\hline
328	\end{tabular}
329	\end{center}
330	\caption{Creation comparison. All numbers are in nanoseconds(\si{\nano\second}).}
331	\label{tab:creation}
332	\end{table}

Note: See TracBrowser for help on using the repository browser.

Download in other formats: