1 | \chapter{Performance} |
---|
2 | \label{c:Performance} |
---|
3 | |
---|
4 | This chapter uses the micro-benchmarks from \VRef[Chapter]{s:Benchmarks} to test a number of current memory allocators, including llheap. |
---|
5 | The goal is to see if llheap is competitive with the current best memory allocators. |
---|
6 | |
---|
7 | |
---|
8 | \section{Machine Specification} |
---|
9 | |
---|
10 | The performance experiments were run on two different multi-core architectures (x64 and ARM) to determine if there is consistency across platforms: |
---|
11 | \begin{itemize} |
---|
12 | \item |
---|
13 | \textbf{Nasus} AMD EPYC 7662, 64-core socket $\times$ 2, 2.0 GHz, GCC version 9.3.0 |
---|
14 | \item |
---|
15 | \textbf{Algol} Huawei ARM TaiShan 2280 V2 Kunpeng 920, 24-core socket $\times$ 4, 2.6 GHz, GCC version 9.4.0 |
---|
16 | \end{itemize} |
---|
17 | |
---|
18 | |
---|
19 | \section{Existing Memory Allocators} |
---|
20 | \label{sec:curAllocatorSec} |
---|
21 | |
---|
22 | With dynamic allocation being an important feature of C, there are many stand-alone memory allocators that have been designed for different purposes. |
---|
23 | For this thesis, seven of the most popular and widely used memory allocators were selected for comparison, along with llheap. |
---|
24 | |
---|
25 | \paragraph{llheap (\textsf{llh})} |
---|
26 | is the thread-safe allocator from \VRef[Chapter]{c:Allocator}. |
---|
27 | \\ |
---|
28 | \textbf{Version:} 1.0\\ |
---|
29 | \textbf{Configuration:} Compiled with dynamic linking, but without statistics or debugging.\\ |
---|
30 | \textbf{Compilation command:} @make@ |
---|
31 | |
---|
32 | \paragraph{glibc (\textsf{glc})} |
---|
33 | \cite{glibc} is the default gcc thread-safe allocator. |
---|
34 | \\ |
---|
35 | \textbf{Version:} Ubuntu GLIBC 2.31-0ubuntu9.7 2.31\\ |
---|
36 | \textbf{Configuration:} Compiled by Ubuntu 20.04.\\ |
---|
37 | \textbf{Compilation command:} N/A |
---|
38 | |
---|
39 | \paragraph{dlmalloc (\textsf{dl})} |
---|
40 | \cite{dlmalloc} is a thread-safe allocator that is single-threaded and uses a single heap. |
---|
41 | It maintains free-lists of different sizes to store freed dynamic memory. |
---|
42 | \\ |
---|
43 | \textbf{Version:} 2.8.6\\ |
---|
44 | \textbf{Configuration:} Compiled with preprocessor @USE_LOCKS@.\\ |
---|
45 | \textbf{Compilation command:} @gcc -g3 -O3 -Wall -Wextra -fno-builtin-malloc -fno-builtin-calloc@ @-fno-builtin-realloc -fno-builtin-free -fPIC -shared -DUSE_LOCKS -o libdlmalloc.so malloc-2.8.6.c@ |
---|
46 | |
---|
47 | \paragraph{hoard (\textsf{hrd})} |
---|
48 | \cite{hoard} is a thread-safe allocator that is multi-threaded and uses a heap-layer framework. It has per-thread heaps that have thread-local free-lists, and a global shared heap. |
---|
49 | \\ |
---|
50 | \textbf{Version:} 3.13\\ |
---|
51 | \textbf{Configuration:} Compiled with hoard's default configurations and @Makefile@.\\ |
---|
52 | \textbf{Compilation command:} @make all@ |
---|
53 | |
---|
54 | \paragraph{jemalloc (\textsf{je})} |
---|
55 | \cite{jemalloc} is a thread-safe allocator that uses multiple arenas. Each thread is assigned an arena. |
---|
56 | Each arena has chunks that contain contiguous memory regions of the same size. An arena has multiple chunks that contain regions of multiple sizes. |
---|
57 | \\ |
---|
58 | \textbf{Version:} 5.2.1\\ |
---|
59 | \textbf{Configuration:} Compiled with jemalloc's default configurations and @Makefile@.\\ |
---|
60 | \textbf{Compilation command:} @autogen.sh; configure; make; make install@ |
---|
61 | |
---|
62 | \paragraph{ptmalloc3 (\textsf{pt3})} |
---|
63 | \cite{ptmalloc3} is a modification of dlmalloc. |
---|
64 | It is a thread-safe multi-threaded memory allocator that uses multiple heaps. |
---|
65 | A ptmalloc3 heap has a similar design to dlmalloc's heap. |
---|
66 | \\ |
---|
67 | \textbf{Version:} 1.8\\ |
---|
68 | \textbf{Configuration:} Compiled with ptmalloc3's @Makefile@ using option ``linux-shared''.\\ |
---|
69 | \textbf{Compilation command:} @make linux-shared@ |
---|
70 | |
---|
71 | \paragraph{rpmalloc (\textsf{rp})} |
---|
72 | \cite{rpmalloc} is a thread-safe allocator that is multi-threaded and uses a per-thread heap. |
---|
73 | Each heap has multiple size-classes and each size-class contains memory regions of the relevant size. |
---|
74 | \\ |
---|
75 | \textbf{Version:} 1.4.1\\ |
---|
76 | \textbf{Configuration:} Compiled with rpmalloc's default configurations and ninja build system.\\ |
---|
77 | \textbf{Compilation command:} @python3 configure.py; ninja@ |
---|
78 | |
---|
79 | \paragraph{tbb malloc (\textsf{tbb})} |
---|
80 | \cite{tbbmalloc} is a thread-safe allocator that is multi-threaded and uses a private heap for each thread. |
---|
81 | Each private-heap has multiple bins of different sizes. Each bin contains free regions of the same size. |
---|
82 | \\ |
---|
83 | \textbf{Version:} intel tbb 2020 update 2, tbb\_interface\_version == 11102\\ |
---|
84 | \textbf{Configuration:} Compiled with tbbmalloc's default configurations and @Makefile@.\\ |
---|
85 | \textbf{Compilation command:} @make@ |
---|
86 | |
---|
87 | % \section{Experiment Environment} |
---|
88 | % We used our micro benchmark suite (FIX ME: cite mbench) to evaluate these memory allocators \ref{sec:curAllocatorSec} and our own memory allocator uHeap \ref{sec:allocatorSec}. |
---|
89 | |
---|
90 | \section{Experiments} |
---|
91 | |
---|
92 | Each micro-benchmark is configured and run with each of the allocators. |
---|
93 | The less time an allocator takes to complete a benchmark the better, so lower in the graphs is better. |
---|
94 | |
---|
95 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
96 | %% CHURN |
---|
97 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
98 | |
---|
99 | \subsection{Churn Micro-Benchmark} |
---|
100 | |
---|
101 | Churn tests allocators for speed under intensive dynamic memory usage (see \VRef{s:ChurnBenchmark}). |
---|
102 | This experiment was run with the following configuration: |
---|
103 | \begin{description}[itemsep=0pt,parsep=0pt] |
---|
104 | \item[thread:] |
---|
105 | 1, 2, 4, 8, 16 |
---|
106 | \item[spots:] |
---|
107 | 16 |
---|
108 | \item[obj:] |
---|
109 | 100,000 |
---|
110 | \item[max:] |
---|
111 | 500 |
---|
112 | \item[min:] |
---|
113 | 50 |
---|
114 | \item[step:] |
---|
115 | 50 |
---|
116 | \item[distro:] |
---|
117 | fisher |
---|
118 | \end{description} |
---|
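As a rough illustration of how the parameters above expand into individual runs, the following Python sketch enumerates the object sizes and the allocator/thread-count combinations. It assumes that min, max, and step describe an inclusive range of object sizes, which the configuration list does not state explicitly. The names @ALLOCATORS@, @object_sizes@, and @run_matrix@ are hypothetical and are not part of the benchmark suite.

```python
from itertools import product

# Short allocator names used in this chapter.
ALLOCATORS = ["llh", "glc", "dl", "hrd", "je", "pt3", "rp", "tbb"]
# Thread counts from the configuration list.
THREADS = [1, 2, 4, 8, 16]


def object_sizes(min_size=50, max_size=500, step=50):
    """Sizes from min to max in increments of step (assumed inclusive)."""
    return list(range(min_size, max_size + 1, step))


def run_matrix(allocators=ALLOCATORS, threads=THREADS):
    """One (allocator, thread count) pair per benchmark run."""
    return list(product(allocators, threads))


if __name__ == "__main__":
    print(object_sizes())     # 50, 100, ..., 500
    print(len(run_matrix()))  # 8 allocators x 5 thread counts = 40 runs
```

Under these assumptions, each platform needs 40 runs of the micro-benchmark, one per allocator and thread count.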
119 | |
---|
120 | % -maxS : 500 |
---|
121 | % -minS : 50 |
---|
122 | % -stepS : 50 |
---|
123 | % -distroS : fisher |
---|
124 | % -objN : 100000 |
---|
125 | % -cSpots : 16 |
---|
126 | % -threadN : 1, 2, 4, 8, 16 |
---|
127 | |
---|
128 | \VRef[Figure]{fig:churn} shows the results for Algol and Nasus. |
---|
129 | The X-axis shows the number of threads; |
---|
130 | the Y-axis shows the total experiment time. |
---|
131 | Each allocator's performance at each thread count is shown in a different color. |
---|
132 | |
---|
133 | \begin{figure} |
---|
134 | \centering |
---|
135 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/churn} } |
---|
136 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/churn} } |
---|
137 | \caption{Churn} |
---|
138 | \label{fig:churn} |
---|
139 | \end{figure} |
---|
140 | |
---|
141 | All allocators did well in this micro-benchmark, except for \textsf{dl} on the ARM. |
---|
142 | llheap is slightly slower because it uses ownership, and many of the allocations in this benchmark have remote frees, which require locking. |
---|
143 | When llheap is compiled without ownership, its performance is the same as the other allocators (not shown). |
---|
144 | |
---|
145 | |
---|
146 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
147 | %% THRASH |
---|
148 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
149 | |
---|
150 | \subsection{Cache Thrash} |
---|
151 | |
---|
152 | Thrash tests memory allocators for active false sharing (see \VRef{sec:benchThrashSec}). |
---|
153 | This experiment was run with the following configuration: |
---|
154 | \begin{description}[itemsep=0pt,parsep=0pt] |
---|
155 | \item[threads:] |
---|
156 | 1, 2, 4, 8, 16 |
---|
157 | \item[iterations:] |
---|
158 | 1,000 |
---|
159 | \item[cacheRW:] |
---|
160 | 1,000,000 |
---|
161 | \item[size:] |
---|
162 | 1 |
---|
163 | \end{description} |
---|
164 | |
---|
165 | % * Each allocator was tested for its performance across different number of threads. |
---|
166 | % Experiment was repeated for each allocator for 1, 2, 4, 8, and 16 threads by setting the configuration -threadN. |
---|
167 | |
---|
168 | \VRef[Figure]{fig:cacheThrash} shows the results for Algol and Nasus. |
---|
169 | The X-axis shows the number of threads; |
---|
170 | the Y-axis shows the total experiment time. |
---|
171 | Each allocator's performance at each thread count is shown in a different color. |
---|
172 | |
---|
173 | \begin{figure} |
---|
174 | \centering |
---|
175 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/cache-time-0-thrash} } |
---|
176 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/cache-time-0-thrash} } |
---|
177 | \caption{Cache Thrash} |
---|
178 | \label{fig:cacheThrash} |
---|
179 | \end{figure} |
---|
180 | |
---|
181 | All allocators did well in this micro-benchmark, except for \textsf{dl} and \textsf{pt3} on the x64. |
---|
182 | Either the memory allocators generate little active false-sharing or the micro-benchmark is not generating scenarios that cause active false-sharing. |
---|
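To make the notion of false sharing concrete, the following Python sketch checks whether two small objects fall on the same cache line, which is the condition under which writes by different threads interfere with each other. The 64-byte line size and the example addresses are assumptions for illustration only; they are not measurements from Algol or Nasus, and the helper names are hypothetical.

```python
CACHE_LINE = 64  # bytes; assumed line size, not taken from the test machines


def cache_line_index(addr, line=CACHE_LINE):
    """Index of the cache line that contains the byte at addr."""
    return addr // line


def may_false_share(addr_a, size_a, addr_b, size_b, line=CACHE_LINE):
    """True if the two objects touch at least one common cache line."""
    lines_a = range(cache_line_index(addr_a, line),
                    cache_line_index(addr_a + size_a - 1, line) + 1)
    lines_b = range(cache_line_index(addr_b, line),
                    cache_line_index(addr_b + size_b - 1, line) + 1)
    # Two half-open intervals of line indices overlap.
    return lines_a.start < lines_b.stop and lines_b.start < lines_a.stop


if __name__ == "__main__":
    # Two 1-byte objects 16 bytes apart share a line (hypothetical addresses).
    print(may_false_share(0x1000, 1, 0x1010, 1))
    # Two 1-byte objects 64 bytes apart, line-aligned, do not.
    print(may_false_share(0x1000, 1, 0x1040, 1))
```

An allocator that hands such adjacent small objects to different threads can induce false sharing, whereas one that separates them by at least a cache line avoids it.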
183 | |
---|
184 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
185 | %% SCRATCH |
---|
186 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
187 | |
---|
188 | \subsection{Cache Scratch} |
---|
189 | |
---|
190 | Scratch tests memory allocators for program-induced allocator-preserved passive false-sharing (see \VRef{s:CacheScratch}). |
---|
191 | This experiment was run with the following configuration: |
---|
192 | \begin{description}[itemsep=0pt,parsep=0pt] |
---|
193 | \item[threads:] |
---|
194 | 1, 2, 4, 8, 16 |
---|
195 | \item[iterations:] |
---|
196 | 1,000 |
---|
197 | \item[cacheRW:] |
---|
198 | 1,000,000 |
---|
199 | \item[size:] |
---|
200 | 1 |
---|
201 | \end{description} |
---|
202 | |
---|
203 | % * Each allocator was tested for its performance across different number of threads. |
---|
204 | % Experiment was repeated for each allocator for 1, 2, 4, 8, and 16 threads by setting the configuration -threadN. |
---|
205 | |
---|
206 | \VRef[Figure]{fig:cacheScratch} shows the results for Algol and Nasus. |
---|
207 | The X-axis shows the number of threads; |
---|
208 | the Y-axis shows the total experiment time. |
---|
209 | Each allocator's performance at each thread count is shown in a different color. |
---|
210 | |
---|
211 | \begin{figure} |
---|
212 | \centering |
---|
213 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/cache-time-0-scratch} } |
---|
214 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/cache-time-0-scratch} } |
---|
215 | \caption{Cache Scratch} |
---|
216 | \label{fig:cacheScratch} |
---|
217 | \end{figure} |
---|
218 | |
---|
219 | All allocators did well in this micro-benchmark on the ARM. |
---|
220 | Allocators \textsf{llh}, \textsf{je}, and \textsf{rp} did well on the x64, while the remaining allocators experienced significant slowdowns from the false sharing. |
---|
221 | |
---|
222 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
223 | %% SPEED |
---|
224 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
225 | |
---|
226 | \subsection{Speed Micro-Benchmark} |
---|
227 | |
---|
228 | Speed tests memory allocators for runtime latency (see \VRef{s:SpeedMicroBenchmark}). |
---|
229 | This experiment was run with the following configuration: |
---|
230 | \begin{description}[itemsep=0pt,parsep=0pt] |
---|
231 | \item[max:] |
---|
232 | 500 |
---|
233 | \item[min:] |
---|
234 | 50 |
---|
235 | \item[step:] |
---|
236 | 50 |
---|
237 | \item[distro:] |
---|
238 | fisher |
---|
239 | \item[objects:] |
---|
240 | 1,000,000 |
---|
241 | \item[workers:] |
---|
242 | 1, 2, 4, 8, 16 |
---|
243 | \end{description} |
---|
244 | |
---|
245 | % -maxS : 500 |
---|
246 | % -minS : 50 |
---|
247 | % -stepS : 50 |
---|
248 | % -distroS : fisher |
---|
249 | % -objN : 1000000 |
---|
250 | % -threadN : \{ 1, 2, 4, 8, 16 \} * |
---|
251 | |
---|
252 | %* Each allocator was tested for its performance across different number of threads. |
---|
253 | %Experiment was repeated for each allocator for 1, 2, 4, 8, and 16 threads by setting the configuration -threadN. |
---|
254 | |
---|
255 | \VRefrange[Figures]{fig:speed-3-malloc}{fig:speed-14-malloc-calloc-realloc-free} show 12 figures, one figure for each chain of the speed benchmark. |
---|
256 | The X-axis shows the number of threads; |
---|
257 | the Y-axis shows the total experiment time. |
---|
258 | Each allocator's performance at each thread count is shown in a different color. |
---|
259 | |
---|
260 | \begin{itemize} |
---|
261 | \item \VRef[Figure]{fig:speed-3-malloc} shows results for chain: malloc |
---|
262 | \item \VRef[Figure]{fig:speed-4-realloc} shows results for chain: realloc |
---|
263 | \item \VRef[Figure]{fig:speed-5-free} shows results for chain: free |
---|
264 | \item \VRef[Figure]{fig:speed-6-calloc} shows results for chain: calloc |
---|
265 | \item \VRef[Figure]{fig:speed-7-malloc-free} shows results for chain: malloc-free |
---|
266 | \item \VRef[Figure]{fig:speed-8-realloc-free} shows results for chain: realloc-free |
---|
267 | \item \VRef[Figure]{fig:speed-9-calloc-free} shows results for chain: calloc-free |
---|
268 | \item \VRef[Figure]{fig:speed-10-malloc-realloc} shows results for chain: malloc-realloc |
---|
269 | \item \VRef[Figure]{fig:speed-11-calloc-realloc} shows results for chain: calloc-realloc |
---|
270 | \item \VRef[Figure]{fig:speed-12-malloc-realloc-free} shows results for chain: malloc-realloc-free |
---|
271 | \item \VRef[Figure]{fig:speed-13-calloc-realloc-free} shows results for chain: calloc-realloc-free |
---|
272 | \item \VRef[Figure]{fig:speed-14-malloc-calloc-realloc-free} shows results for chain: malloc-calloc-realloc-free |
---|
273 | \end{itemize} |
---|
274 | |
---|
275 | All allocators did well in this micro-benchmark across all allocation chains, except for \textsf{dl} and \textsf{pt3}. |
---|
276 | |
---|
277 | %speed-3-malloc.eps |
---|
278 | \begin{figure} |
---|
279 | \centering |
---|
280 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-3-malloc} } |
---|
281 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-3-malloc} } |
---|
282 | \caption{Speed benchmark chain: malloc} |
---|
283 | \label{fig:speed-3-malloc} |
---|
284 | \end{figure} |
---|
285 | |
---|
286 | %speed-4-realloc.eps |
---|
287 | \begin{figure} |
---|
288 | \centering |
---|
289 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-4-realloc} } |
---|
290 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-4-realloc} } |
---|
291 | \caption{Speed benchmark chain: realloc} |
---|
292 | \label{fig:speed-4-realloc} |
---|
293 | \end{figure} |
---|
294 | |
---|
295 | %speed-5-free.eps |
---|
296 | \begin{figure} |
---|
297 | \centering |
---|
298 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-5-free} } |
---|
299 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-5-free} } |
---|
300 | \caption{Speed benchmark chain: free} |
---|
301 | \label{fig:speed-5-free} |
---|
302 | \end{figure} |
---|
303 | |
---|
304 | %speed-6-calloc.eps |
---|
305 | \begin{figure} |
---|
306 | \centering |
---|
307 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-6-calloc} } |
---|
308 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-6-calloc} } |
---|
309 | \caption{Speed benchmark chain: calloc} |
---|
310 | \label{fig:speed-6-calloc} |
---|
311 | \end{figure} |
---|
312 | |
---|
313 | %speed-7-malloc-free.eps |
---|
314 | \begin{figure} |
---|
315 | \centering |
---|
316 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-7-malloc-free} } |
---|
317 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-7-malloc-free} } |
---|
318 | \caption{Speed benchmark chain: malloc-free} |
---|
319 | \label{fig:speed-7-malloc-free} |
---|
320 | \end{figure} |
---|
321 | |
---|
322 | %speed-8-realloc-free.eps |
---|
323 | \begin{figure} |
---|
324 | \centering |
---|
325 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-8-realloc-free} } |
---|
326 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-8-realloc-free} } |
---|
327 | \caption{Speed benchmark chain: realloc-free} |
---|
328 | \label{fig:speed-8-realloc-free} |
---|
329 | \end{figure} |
---|
330 | |
---|
331 | %speed-9-calloc-free.eps |
---|
332 | \begin{figure} |
---|
333 | \centering |
---|
334 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-9-calloc-free} } |
---|
335 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-9-calloc-free} } |
---|
336 | \caption{Speed benchmark chain: calloc-free} |
---|
337 | \label{fig:speed-9-calloc-free} |
---|
338 | \end{figure} |
---|
339 | |
---|
340 | %speed-10-malloc-realloc.eps |
---|
341 | \begin{figure} |
---|
342 | \centering |
---|
343 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-10-malloc-realloc} } |
---|
344 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-10-malloc-realloc} } |
---|
345 | \caption{Speed benchmark chain: malloc-realloc} |
---|
346 | \label{fig:speed-10-malloc-realloc} |
---|
347 | \end{figure} |
---|
348 | |
---|
349 | %speed-11-calloc-realloc.eps |
---|
350 | \begin{figure} |
---|
351 | \centering |
---|
352 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-11-calloc-realloc} } |
---|
353 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-11-calloc-realloc} } |
---|
354 | \caption{Speed benchmark chain: calloc-realloc} |
---|
355 | \label{fig:speed-11-calloc-realloc} |
---|
356 | \end{figure} |
---|
357 | |
---|
358 | %speed-12-malloc-realloc-free.eps |
---|
359 | \begin{figure} |
---|
360 | \centering |
---|
361 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-12-malloc-realloc-free} } |
---|
362 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-12-malloc-realloc-free} } |
---|
363 | \caption{Speed benchmark chain: malloc-realloc-free} |
---|
364 | \label{fig:speed-12-malloc-realloc-free} |
---|
365 | \end{figure} |
---|
366 | |
---|
367 | %speed-13-calloc-realloc-free.eps |
---|
368 | \begin{figure} |
---|
369 | \centering |
---|
370 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-13-calloc-realloc-free} } |
---|
371 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-13-calloc-realloc-free} } |
---|
372 | \caption{Speed benchmark chain: calloc-realloc-free} |
---|
373 | \label{fig:speed-13-calloc-realloc-free} |
---|
374 | \end{figure} |
---|
375 | |
---|
376 | %speed-14-{m,c,re}alloc-free.eps |
---|
377 | \begin{figure} |
---|
378 | \centering |
---|
379 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/speed-14-m-c-re-alloc-free} } |
---|
380 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/speed-14-m-c-re-alloc-free} } |
---|
381 | \caption{Speed benchmark chain: malloc-calloc-realloc-free} |
---|
382 | \label{fig:speed-14-malloc-calloc-realloc-free} |
---|
383 | \end{figure} |
---|
384 | |
---|
385 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
386 | %% MEMORY |
---|
387 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
388 | |
---|
389 | \newpage |
---|
390 | \subsection{Memory Micro-Benchmark} |
---|
391 | |
---|
392 | This experiment is run with the following two configurations for each allocator. |
---|
393 | The difference between the two configurations is the number of producers and consumers. |
---|
394 | Configuration 1 has one producer and one consumer, and configuration 2 has 4 producers, where each producer has 4 consumers. |
---|
395 | |
---|
396 | \noindent |
---|
397 | Configuration 1: |
---|
398 | \begin{description}[itemsep=0pt,parsep=0pt] |
---|
399 | \item[producer (K):] |
---|
400 | 1 |
---|
401 | \item[consumer (M):] |
---|
402 | 1 |
---|
403 | \item[round:] |
---|
404 | 100,000 |
---|
405 | \item[max:] |
---|
406 | 500 |
---|
407 | \item[min:] |
---|
408 | 50 |
---|
409 | \item[step:] |
---|
410 | 50 |
---|
411 | \item[distro:] |
---|
412 | fisher |
---|
413 | \item[objects (N):] |
---|
414 | 100,000 |
---|
415 | \end{description} |
---|
416 | |
---|
417 | % -threadA : 1 |
---|
418 | % -threadF : 1 |
---|
419 | % -maxS : 500 |
---|
420 | % -minS : 50 |
---|
421 | % -stepS : 50 |
---|
422 | % -distroS : fisher |
---|
423 | % -objN : 100000 |
---|
424 | % -consumeS: 100000 |
---|
425 | |
---|
426 | \noindent |
---|
427 | Configuration 2: |
---|
428 | \begin{description}[itemsep=0pt,parsep=0pt] |
---|
429 | \item[producer (K):] |
---|
430 | 4 |
---|
431 | \item[consumer (M):] |
---|
432 | 4 |
---|
433 | \item[round:] |
---|
434 | 100,000 |
---|
435 | \item[max:] |
---|
436 | 500 |
---|
437 | \item[min:] |
---|
438 | 50 |
---|
439 | \item[step:] |
---|
440 | 50 |
---|
441 | \item[distro:] |
---|
442 | fisher |
---|
443 | \item[objects (N):] |
---|
444 | 100,000 |
---|
445 | \end{description} |
---|
446 | |
---|
447 | % -threadA : 4 |
---|
448 | % -threadF : 4 |
---|
449 | % -maxS : 500 |
---|
450 | % -minS : 50 |
---|
451 | % -stepS : 50 |
---|
452 | % -distroS : fisher |
---|
453 | % -objN : 100000 |
---|
454 | % -consumeS: 100000 |
---|
455 | |
---|
456 | % \begin{table}[b] |
---|
457 | % \centering |
---|
458 | % \begin{tabular}{ |c|c|c| } |
---|
459 | % \hline |
---|
460 | % Memory Allocator & Configuration 1 Result & Configuration 2 Result\\ |
---|
461 | % \hline |
---|
462 | % llh & \VRef[Figure]{fig:mem-1-prod-1-cons-100-cfa} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-cfa}\\ |
---|
463 | % \hline |
---|
464 | % dl & \VRef[Figure]{fig:mem-1-prod-1-cons-100-dl} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-dl}\\ |
---|
465 | % \hline |
---|
466 | % glibc & \VRef[Figure]{fig:mem-1-prod-1-cons-100-glc} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-glc}\\ |
---|
467 | % \hline |
---|
468 | % hoard & \VRef[Figure]{fig:mem-1-prod-1-cons-100-hrd} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-hrd}\\ |
---|
469 | % \hline |
---|
470 | % je & \VRef[Figure]{fig:mem-1-prod-1-cons-100-je} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-je}\\ |
---|
471 | % \hline |
---|
472 | % pt3 & \VRef[Figure]{fig:mem-1-prod-1-cons-100-pt3} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-pt3}\\ |
---|
473 | % \hline |
---|
474 | % rp & \VRef[Figure]{fig:mem-1-prod-1-cons-100-rp} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-rp}\\ |
---|
475 | % \hline |
---|
476 | % tbb & \VRef[Figure]{fig:mem-1-prod-1-cons-100-tbb} & \VRef[Figure]{fig:mem-4-prod-4-cons-100-tbb}\\ |
---|
477 | % \hline |
---|
478 | % \end{tabular} |
---|
479 | % \caption{Memory benchmark results} |
---|
480 | % \label{table:mem-benchmark-figs} |
---|
481 | % \end{table} |
---|
482 | % Table \ref{table:mem-benchmark-figs} shows the list of figures that contain memory benchmark results. |
---|
483 | |
---|
484 | \VRefrange[Figures]{fig:mem-1-prod-1-cons-100-cfa}{fig:mem-4-prod-4-cons-100-tbb} show 16 figures, two figures for each of the 8 allocators, one for each configuration. |
---|
485 | Each figure has 2 graphs, one for each experiment environment. |
---|
486 | Each graph has the following 5 subgraphs that show memory usage and statistics throughout the micro-benchmark's lifetime. |
---|
487 | \begin{itemize} |
---|
488 | \item \textit{\textbf{current\_req\_mem(B)}} shows the amount of dynamic memory requested by the benchmark that is currently in use. |
---|
489 | \item \textit{\textbf{heap}}* shows the memory requested by the program (allocator) from the system that lies in the heap (@sbrk@) area. |
---|
490 | \item \textit{\textbf{mmap\_so}}* shows the memory requested by the program (allocator) from the system that lies in the @mmap@ area. |
---|
491 | \item \textit{\textbf{mmap}}* shows the memory requested by the program (allocator or shared libraries) from the system that lies in the @mmap@ area. |
---|
492 | \item \textit{\textbf{total\_dynamic}} shows the total usage of dynamic memory by the benchmark program, which is a sum of \textit{heap}, \textit{mmap}, and \textit{mmap\_so}. |
---|
493 | \end{itemize} |
---|
494 | * These statistics are gathered by monitoring a process's @/proc/self/maps@ file. |
---|
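As a minimal sketch of how such statistics might be gathered, the following Python code sums the sizes of the mappings listed in maps-formatted text, grouped into broad categories. The categories (@heap@, @anon@, @file@, @other@) and the function names are assumptions made for this illustration; how the benchmark itself maps regions onto the \textit{heap}, \textit{mmap}, and \textit{mmap\_so} series is not shown here.

```python
# Sample in the format of /proc/self/maps; addresses are made up.
SAMPLE = """\
55d0c8a00000-55d0c8a21000 rw-p 00000000 00:00 0                  [heap]
7f3c4c000000-7f3c4c021000 rw-p 00000000 00:00 0
7f3c50a00000-7f3c50bc5000 r-xp 00000000 08:01 123456             /usr/lib/libc-2.31.so
7ffd1b5e0000-7ffd1b601000 rw-p 00000000 00:00 0                  [stack]
"""


def summarize_maps(text):
    """Sum mapping sizes (bytes) per category from maps-formatted text."""
    totals = {"heap": 0, "anon": 0, "file": 0, "other": 0}
    for line in text.splitlines():
        # Fields: address range, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        if path == "[heap]":
            kind = "heap"    # program-break (sbrk) area
        elif path == "":
            kind = "anon"    # anonymous mapping, e.g. from mmap
        elif path.startswith("/"):
            kind = "file"    # file-backed mapping, e.g. a shared library
        else:
            kind = "other"   # [stack], [vdso], and similar
        totals[kind] += end - start
    return totals


def read_own_maps(path="/proc/self/maps"):
    """Read the calling process's maps file (Linux only)."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    print(summarize_maps(SAMPLE))
```

On Linux, calling @summarize_maps(read_own_maps())@ at regular intervals would give one sample per polling time.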
495 | |
---|
496 | The X-axis shows the time when the memory information is polled. |
---|
497 | The Y-axis shows the memory usage in bytes. |
---|
498 | |
---|
499 | For the experiment, at any given time in the program's life, the difference between the memory requested by the benchmark (\textit{current\_req\_mem(B)}) and the memory that the process has received from the system (\textit{heap}, \textit{mmap}) should be minimal. |
---|
500 | This difference is the memory overhead caused by the allocator and shows the level of fragmentation in the allocator. |
---|
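The difference described above can be written as a small calculation. The following Python sketch computes it for a series of polled samples; the function names and the sample values are hypothetical and are not data from this chapter.

```python
def allocator_overhead(requested, heap, mmap):
    """Bytes obtained from the system beyond what the benchmark requested."""
    return (heap + mmap) - requested


def overhead_series(samples):
    """Overhead at each polling time for (requested, heap, mmap) samples."""
    return [allocator_overhead(r, h, m) for r, h, m in samples]


if __name__ == "__main__":
    # Hypothetical samples in bytes; not measured data.
    samples = [(1_000_000, 1_200_000, 0), (2_000_000, 1_200_000, 1_048_576)]
    print(overhead_series(samples))  # [200000, 248576]
```

A smaller value at a given polling time means less memory held by the allocator beyond what the benchmark currently uses.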
501 | |
---|
502 | First, the difference in the shape of the curves between architectures (top ARM, bottom x64) is small; the differences are mainly in the amount of memory used. |
---|
503 | Hence, it is possible to focus on either the top or bottom graph. |
---|
504 | From the heap curve, it is possible that glibc, hoard, jemalloc, ptmalloc3, and rpmalloc do not use the @sbrk@ area and only use @mmap@. |
---|
505 | |
---|
506 | hoard and tbbmalloc use more total memory than the other allocators. |
---|
507 | |
---|
508 | ptmalloc3 gives memory back to the operating system. |
---|
509 | |
---|
510 | %mem-1-prod-1-cons-100-cfa.eps |
---|
511 | \begin{figure} |
---|
512 | \centering |
---|
513 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-cfa} } |
---|
514 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-cfa} } |
---|
515 | \caption{Memory benchmark results with 1 producer for llh memory allocator} |
---|
516 | \label{fig:mem-1-prod-1-cons-100-cfa} |
---|
517 | \end{figure} |
---|
518 | |
---|
519 | %mem-4-prod-4-cons-100-cfa.eps |
---|
520 | \begin{figure} |
---|
521 | \centering |
---|
522 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-cfa} } |
---|
523 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-cfa} } |
---|
524 | \caption{Memory benchmark results with 4 producers for llh memory allocator} |
---|
525 | \label{fig:mem-4-prod-4-cons-100-cfa} |
---|
526 | \end{figure} |
---|
527 | |
---|
528 | %mem-1-prod-1-cons-100-dl.eps |
---|
529 | \begin{figure} |
---|
530 | \centering |
---|
531 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-dl} } |
---|
532 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-dl} } |
---|
533 | \caption{Memory benchmark results with 1 producer for dl memory allocator} |
---|
534 | \label{fig:mem-1-prod-1-cons-100-dl} |
---|
535 | \end{figure} |
---|
536 | |
---|
537 | %mem-4-prod-4-cons-100-dl.eps |
---|
538 | \begin{figure} |
---|
539 | \centering |
---|
540 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-dl} } |
---|
541 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-dl} } |
---|
542 | \caption{Memory benchmark results with 4 producers for dl memory allocator} |
---|
543 | \label{fig:mem-4-prod-4-cons-100-dl} |
---|
544 | \end{figure} |
---|
545 | |
---|
546 | %mem-1-prod-1-cons-100-glc.eps |
---|
547 | \begin{figure} |
---|
548 | \centering |
---|
549 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-glc} } |
---|
550 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-glc} } |
---|
551 | \caption{Memory benchmark results with 1 producer for glibc memory allocator} |
---|
552 | \label{fig:mem-1-prod-1-cons-100-glc} |
---|
553 | \end{figure} |
---|
554 | |
---|
555 | %mem-4-prod-4-cons-100-glc.eps |
---|
556 | \begin{figure} |
---|
557 | \centering |
---|
558 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-glc} } |
---|
559 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-glc} } |
---|
560 | \caption{Memory benchmark results with 4 producers for glibc memory allocator} |
---|
561 | \label{fig:mem-4-prod-4-cons-100-glc} |
---|
562 | \end{figure} |
---|
563 | |
---|
564 | %mem-1-prod-1-cons-100-hrd.eps |
---|
565 | \begin{figure} |
---|
566 | \centering |
---|
567 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-hrd} } |
---|
568 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-hrd} } |
---|
569 | \caption{Memory benchmark results with 1 producer for hoard memory allocator} |
---|
570 | \label{fig:mem-1-prod-1-cons-100-hrd} |
---|
571 | \end{figure} |
---|
572 | |
---|
573 | %mem-4-prod-4-cons-100-hrd.eps |
---|
574 | \begin{figure} |
---|
575 | \centering |
---|
576 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-hrd} } |
---|
577 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-hrd} } |
---|
578 | \caption{Memory benchmark results with 4 producers for hoard memory allocator} |
---|
579 | \label{fig:mem-4-prod-4-cons-100-hrd} |
---|
580 | \end{figure} |
---|
581 | |
---|
582 | %mem-1-prod-1-cons-100-je.eps |
---|
583 | \begin{figure} |
---|
584 | \centering |
---|
585 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-je} } |
---|
586 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-je} } |
---|
587 | \caption{Memory benchmark results with 1 producer for je memory allocator} |
---|
588 | \label{fig:mem-1-prod-1-cons-100-je} |
---|
589 | \end{figure} |
---|
590 | |
---|
591 | %mem-4-prod-4-cons-100-je.eps |
---|
592 | \begin{figure} |
---|
593 | \centering |
---|
594 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-je} } |
---|
595 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-je} } |
---|
596 | \caption{Memory benchmark results with 4 producers for je memory allocator} |
---|
597 | \label{fig:mem-4-prod-4-cons-100-je} |
---|
598 | \end{figure} |
---|
599 | |
---|
600 | %mem-1-prod-1-cons-100-pt3.eps |
---|
601 | \begin{figure} |
---|
602 | \centering |
---|
603 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-pt3} } |
---|
604 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-pt3} } |
---|
605 | \caption{Memory benchmark results with 1 producer for pt3 memory allocator} |
---|
606 | \label{fig:mem-1-prod-1-cons-100-pt3} |
---|
607 | \end{figure} |
---|
608 | |
---|
609 | %mem-4-prod-4-cons-100-pt3.eps |
---|
610 | \begin{figure} |
---|
611 | \centering |
---|
612 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-pt3} } |
---|
613 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-pt3} } |
---|
614 | \caption{Memory benchmark results with 4 producers for pt3 memory allocator} |
---|
615 | \label{fig:mem-4-prod-4-cons-100-pt3} |
---|
616 | \end{figure} |
---|
617 | |
---|
618 | %mem-1-prod-1-cons-100-rp.eps |
---|
619 | \begin{figure} |
---|
620 | \centering |
---|
621 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-rp} } |
---|
622 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-rp} } |
---|
623 | \caption{Memory benchmark results with 1 producer for rp memory allocator} |
---|
624 | \label{fig:mem-1-prod-1-cons-100-rp} |
---|
625 | \end{figure} |
---|
626 | |
---|
627 | %mem-4-prod-4-cons-100-rp.eps |
---|
628 | \begin{figure} |
---|
629 | \centering |
---|
630 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-rp} } |
---|
631 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-rp} } |
---|
632 | \caption{Memory benchmark results with 4 producers for rp memory allocator} |
---|
633 | \label{fig:mem-4-prod-4-cons-100-rp} |
---|
634 | \end{figure} |
---|
635 | |
---|
636 | %mem-1-prod-1-cons-100-tbb.eps |
---|
637 | \begin{figure} |
---|
638 | \centering |
---|
639 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-1-prod-1-cons-100-tbb} } |
---|
640 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-1-prod-1-cons-100-tbb} } |
---|
641 | \caption{Memory benchmark results with 1 producer for tbb memory allocator} |
---|
642 | \label{fig:mem-1-prod-1-cons-100-tbb} |
---|
643 | \end{figure} |
---|
644 | |
---|
645 | %mem-4-prod-4-cons-100-tbb.eps |
---|
646 | \begin{figure} |
---|
647 | \centering |
---|
648 | \subfigure[Algol]{ \includegraphics[width=0.95\textwidth]{evaluations/algol-perf-eps/mem-4-prod-4-cons-100-tbb} } |
---|
649 | \subfigure[Nasus]{ \includegraphics[width=0.95\textwidth]{evaluations/nasus-perf-eps/mem-4-prod-4-cons-100-tbb} } |
---|
650 | \caption{Memory benchmark results with 4 producers for tbb memory allocator} |
---|
651 | \label{fig:mem-4-prod-4-cons-100-tbb} |
---|
652 | \end{figure} |
---|
653 | |
---|
654 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|
655 | %% ANALYSIS |
---|
656 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% |
---|