\chapter{Benchmarks}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Micro Benchmark Suite
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

There are two basic approaches for evaluating computer software: benchmarks and micro-benchmarks.
\begin{description}
\item[Benchmarks]
are a suite of application programs (SPEC CPU/WEB) that are exercised in a common way (inputs) to find differences among underlying software implementations associated with an application (compiler, memory allocator, web server, \etc).
The applications are supposed to represent common execution patterns that need to perform well with respect to an underlying software implementation.
Benchmarks are often criticized for having overlapping patterns, insufficient patterns, or extraneous code that masks patterns.
\item[Micro-Benchmarks]
attempt to extract the common execution patterns associated with an application and run each pattern independently.
This approach removes any masking from extraneous application code, allows the execution pattern to be precise, and provides an opportunity for the execution pattern to have multiple independent tuning adjustments (knobs).
Micro-benchmarks are often criticized for inadequately representing real-world applications.
\end{description}

While some crucial software components have standard benchmarks, no standard benchmark exists for testing and comparing memory allocators.
In the past, an assortment of applications have been used for benchmarking allocators~\cite{Detlefs93,Berger00,Berger01,berger02reconsidering}: P2C, GS, Espresso/Espresso-2, CFRAC/CFRAC-2, GMake, GCC, Perl/Perl-2, Gawk/Gawk-2, XPDF/XPDF-2, ROBOOP, Lindsay.
As well, an assortment of micro-benchmarks have been used for benchmarking allocators~\cite{larson99memory,Berger00,streamflow}: threadtest, shbench, Larson, consume, false sharing.
Many of these applications and micro-benchmarks are old and may not reflect current application allocation patterns.

This thesis designs and examines a new set of micro-benchmarks for memory allocators that test a variety of allocation patterns, each with multiple tuning parameters.
The aim of the micro-benchmark suite is to create a set of programs that can evaluate a memory allocator based on the performance metrics described in (FIX ME: local cite).
% These programs can be taken as a standard to benchmark an allocator's basic goals.
These programs give details of an allocator's memory overhead and speed under certain allocation patterns.
The allocation patterns are configurable (adjustment knobs) to observe an allocator's performance across a spectrum of events for a desired allocation pattern, which is seldom possible with benchmark programs.
The micro-benchmark programs control the knobs via command-line arguments.

The new micro-benchmark suite measures performance by allocating dynamic objects and measuring specific metrics.
An allocator's speed is benchmarked in different ways, as are issues like false sharing.


\section{Prior Multi-Threaded Micro-Benchmarks}

Modern memory allocators, such as llheap, must handle multi-threaded programs at the KT level.
The following traditional multi-threaded micro-benchmarks are presented to give a sense of prior work~\cite{Berger00}.


\subsection{threadtest}

This benchmark stresses the ability of the allocator to handle different threads allocating and deallocating independently.
There is no interaction among threads, \ie no object sharing.
Each thread repeatedly allocates 100,000 8-byte objects then deallocates them in the order they were allocated.
The runtime of the benchmark is used to evaluate the allocator's efficiency.
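
A minimal sketch of this allocation pattern, assuming a plain pthread harness; the repetition count and thread count are illustrative, not the benchmark's actual values:
\begin{lstlisting}
#include <pthread.h>
#include <stdlib.h>

enum { Objects = 100000, Size = 8, Reps = 100, Threads = 4 };
static void * objs[Threads][Objects];           // per-thread object arrays, no sharing

void * worker( void * arg ) {
    void ** objects = objs[(size_t)arg];
    for ( int r = 0; r < Reps; r += 1 ) {
        for ( int i = 0; i < Objects; i += 1 )  // allocate all objects
            objects[i] = malloc( Size );
        for ( int i = 0; i < Objects; i += 1 )  // free in allocation order
            free( objects[i] );
    }
    return NULL;
}

int main( void ) {
    pthread_t t[Threads];
    for ( size_t i = 0; i < Threads; i += 1 ) pthread_create( &t[i], NULL, worker, (void *)i );
    for ( size_t i = 0; i < Threads; i += 1 ) pthread_join( t[i], NULL );
}
\end{lstlisting}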


\subsection{shbench}

Each thread randomly allocates and frees a number of random-sized objects.
It is a stress test that also uses runtime to determine the efficiency of the allocator.


\subsection{Larson}

Larson simulates a server environment.
Multiple threads are created, where each thread allocates and frees a number of random-sized objects within a size range.
Before a thread terminates, it passes its array of 10,000 objects to a new child thread to continue the process.
The number of thread generations varies depending on the thread speed.
The benchmark calculates memory operations per second as an indicator of the memory allocator's performance.


\section{New Multi-Threaded Micro-Benchmarks}

The following new benchmarks were created to assess multi-threaded programs at the KT level.


\subsection{Memory Micro-Benchmark}

The memory micro-benchmark measures the memory overhead of an allocator.
It allocates a number of dynamic objects.
Then, by reading @/proc/self/maps@, it gets the total memory requested by the allocator from the OS.
It calculates the memory overhead as the difference between the memory the allocator has requested from the OS and the memory the program has allocated.

\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
It creates a producer-consumer scenario with K producer threads, where each producer has M consumer threads.
A producer has a separate buffer for each consumer and allocates N objects of random sizes following the given distribution for each consumer.
A consumer frees these objects.
Program memory usage is recorded after every memory operation throughout the run.
This data is then used to visualize the memory usage and consumption of the program.

\begin{figure}
\centering
\begin{lstlisting}
Main Thread
        print memory snapshot
        create producer threads
Producer Thread
        set free start
        create consumer threads
        for ( N )
                allocate memory
                print memory snapshot
Consumer Thread
        wait while ( allocations < free start )
        for ( N )
                free memory
                print memory snapshot
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
\caption{Memory Overhead Benchmark (evaluate memory footprint)}
\label{fig:MemoryBenchFig}
\end{figure}
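
The memory snapshot itself can be obtained on Linux by parsing @/proc/self/maps@.
The following is a minimal sketch of one plausible approach, where summing the sizes of writable mappings approximates the memory requested from the OS; the filtering heuristic and the 10 MB probe are illustrative assumptions, not the benchmark's exact accounting:
\begin{lstlisting}
#include <stdio.h>
#include <stdlib.h>

// Sum the sizes of writable mappings in /proc/self/maps as an
// approximation of the memory the allocator has requested from the OS.
static size_t os_memory( void ) {
    FILE * maps = fopen( "/proc/self/maps", "r" );
    if ( maps == NULL ) return 0;
    size_t total = 0;
    char line[256], perms[8];
    unsigned long start, end;
    while ( fgets( line, sizeof(line), maps ) != NULL )
        if ( sscanf( line, "%lx-%lx %7s", &start, &end, perms ) == 3 && perms[1] == 'w' )
            total += end - start;               // writable mapping => candidate heap memory
    fclose( maps );
    return total;
}

int main( void ) {
    size_t before = os_memory();
    void * p = malloc( 10 * 1024 * 1024 );      // program requests 10 MB
    size_t after = os_memory();
    printf( "program 10485760 bytes, OS growth %zu bytes\n", after - before );
    free( p );
}
\end{lstlisting}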

The adjustment knobs for this micro-benchmark are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[producer:]
sets the number of producer threads.
\item[round:]
sets the production and consumption round size.
\item[consumer:]
sets the number of consumer threads for each producer.
\end{description}

The adjustment knobs for the object allocation size are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[max:]
sets the maximum object size.
\item[min:]
sets the minimum object size.
\item[step:]
sets the object-size increment.
\item[distro:]
sets the object-size distribution.
\item[obj:]
sets the number of objects per thread.
\end{description}


\subsection{Speed Micro-Benchmark}

The speed benchmark measures the runtime speed of individual memory-allocation routines and sequences of routines:
\begin{enumerate}[itemsep=0pt,parsep=0pt]
\item malloc
\item realloc
\item free
\item calloc
\item malloc-free
\item realloc-free
\item calloc-free
\item malloc-realloc
\item calloc-realloc
\item malloc-realloc-free
\item calloc-realloc-free
\item malloc-realloc-free-calloc
\end{enumerate}

\VRef[Figure]{fig:SpeedBenchFig} shows the pseudo code for the speed micro-benchmark.
Each routine in the chain is called for N objects, and the allocated objects are then used when calling the next routine in the allocation chain.
In this way, the complete latency of the memory allocator is measured when multiple routines are chained together, \eg malloc-realloc-free-calloc exercises the major allocation routines combined in a chain.
For each chain, the time taken is recorded, which can then be used to visualize the performance of a memory allocator on each chain.

\begin{figure}
\centering
\begin{lstlisting}[morekeywords={foreach}]
Main Thread
        create worker threads
        foreach ( allocation chain )
                note time T1
                ...
                note time T2
                chain_speed = (T2 - T1) / ( number-of-worker-threads * N )
Worker Thread
        initialize variables
        ...
        foreach ( routine in allocation chain )
                call routine N times
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-speed.eps}
\caption{Speed Benchmark}
\label{fig:SpeedBenchFig}
\end{figure}
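
As a concrete illustration of the timing harness, the following single-threaded sketch times the malloc-free chain with @clock_gettime@; the names, constants, and single-thread simplification are illustrative assumptions (the benchmark divides by the number of worker threads times N):
\begin{lstlisting}
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { N = 1000000, Size = 42 };
static void * objs[N];                          // objects passed along the chain

int main( void ) {
    struct timespec t1, t2;
    clock_gettime( CLOCK_MONOTONIC, &t1 );      // note time T1
    for ( int i = 0; i < N; i += 1 )            // malloc step of the chain
        objs[i] = malloc( Size );
    for ( int i = 0; i < N; i += 1 )            // free step of the chain
        free( objs[i] );
    clock_gettime( CLOCK_MONOTONIC, &t2 );      // note time T2
    double ns = (t2.tv_sec - t1.tv_sec) * 1e9 + (t2.tv_nsec - t1.tv_nsec);
    printf( "malloc-free: %.1f ns per chain iteration\n", ns / N );
}
\end{lstlisting}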

The adjustment knobs for this micro-benchmark are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[max:]
sets the maximum object size.
\item[min:]
sets the minimum object size.
\item[step:]
sets the object-size increment.
\item[distro:]
sets the object-size distribution.
\item[obj:]
sets the number of objects per thread.
\item[workers:]
sets the number of worker threads.
\end{description}


\subsection{Churn Benchmark}

The churn benchmark measures the overall runtime speed of an allocator in a multi-threaded scenario where each thread extensively allocates and frees dynamic memory.

\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
This benchmark creates a buffer with M spots and starts K threads.
Each thread randomly picks a spot among the M spots, frees the object currently at that spot, and allocates a new object for that spot.
Each thread repeats this cycle N times.
The main thread measures the total time taken for the whole benchmark, and this time is used to evaluate the memory allocator's performance.

\begin{figure}
\centering
\begin{lstlisting}
Main Thread
        create worker threads
        note time T1
        ...
        note time T2
        churn_speed = (T2 - T1)
Worker Thread
        initialize variables
        ...
        for ( N )
                R -> random spot in array
                free R
                allocate new object at R
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
\caption{Churn Benchmark}
\label{fig:ChurnBenchFig}
\end{figure}
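
A worker's churn loop can be sketched as follows; the per-spot locks are an assumption added only to keep the sketch race-free, and all names and constants are illustrative:
\begin{lstlisting}
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

enum { K = 4, M = 1024, N = 100000, MaxSize = 512 };
static void * spots[M];                         // shared buffer of M spots
static pthread_mutex_t locks[M];                // sketch-level synchronization of spots

void * worker( void * arg ) {
    unsigned int seed = (uintptr_t)arg;         // per-thread random seed
    for ( int i = 0; i < N; i += 1 ) {
        int r = rand_r( &seed ) % M;            // R -> random spot in array
        pthread_mutex_lock( &locks[r] );
        free( spots[r] );                       // free( NULL ) is legal on the first visit
        spots[r] = malloc( rand_r( &seed ) % MaxSize + 1 );
        pthread_mutex_unlock( &locks[r] );
    }
    return NULL;
}

int main( void ) {
    pthread_t workers[K];
    for ( int i = 0; i < M; i += 1 ) pthread_mutex_init( &locks[i], NULL );
    for ( int i = 0; i < K; i += 1 ) pthread_create( &workers[i], NULL, worker, (void *)(uintptr_t)(i + 1) );
    for ( int i = 0; i < K; i += 1 ) pthread_join( workers[i], NULL );
}
\end{lstlisting}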

Only malloc and free are used to allocate and free objects, eliminating any extra cost, such as the memcpy in realloc, so only the latency of memory allocation is measured.
Churn simulates a memory-intensive program and can be tuned to create different scenarios.

The adjustment knobs for the churn benchmark are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[thread:]
sets the number of threads, K.
\item[spots:]
sets the number of spots for churn, M.
\item[obj:]
sets the number of objects per thread, N.
\item[max:]
sets the maximum object size.
\item[min:]
sets the minimum object size.
\item[step:]
sets the object-size increment.
\item[distro:]
sets the object-size distribution.
\end{description}


\subsection{Cache Thrash}\label{sec:benchThrashSec}

The cache-thrash benchmark measures allocator-induced active false-sharing in an allocator, as illustrated in \VRef[Figure]{f:AllocatorInducedActiveFalseSharing}.
If a memory allocator places memory for multiple threads on the same cache line, program performance can degrade.
If two threads sharing a cache line frequently and concurrently read/write their objects on that line, each access can cause a cache miss, because the other thread may have written to its memory location on the same cache line.

The cache-thrash benchmark creates a scenario that should lead to allocator-induced false-sharing if the underlying memory allocator is allocating dynamic memory to multiple threads on the same cache lines.
Ideally, a memory allocator should separate the dynamic memory of one thread from that of other threads.
If it does not, multiple threads simultaneously allocating small objects should cause the memory allocator to place objects for different threads on the same cache line.

\VRef[Figure]{fig:benchThrashFig} shows the pseudo code for the cache-thrash micro-benchmark.
First, it creates K worker threads.
Each worker thread allocates an object and intensively reads/writes it M times, frequently invalidating cache lines and thereby slowing down any other threads sharing those cache lines.
Each thread repeats this N times.
The main thread measures the total time for all worker threads to complete.
Worker threads sharing cache lines with each other take longer.

\begin{figure}
\centering
\input{AllocInducedActiveFalseSharing}
\medskip
\begin{lstlisting}
Main Thread
        create worker threads
        ...
        signal workers to allocate
        ...
        signal workers to free
        ...
        print addresses from each $thread$
Worker Thread$\(_1\)$
        allocate, write, read, free
        warmup memory in chunks of 16 bytes
        ...
        malloc N objects
        ...
        free objects
        return object address to Main Thread
Worker Thread$\(_2\)$
        // same as Worker Thread$\(_1\)$
\end{lstlisting}
%\input{MemoryOverhead}
%\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
\caption{Allocator-Induced Active False-Sharing Benchmark}
\label{fig:benchThrashFig}
\end{figure}
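
The essence of a worker can be sketched as follows; whether active false-sharing actually occurs depends entirely on where the allocator places each thread's object, and the names and constants are illustrative:
\begin{lstlisting}
#include <pthread.h>
#include <stdlib.h>

enum { K = 4, M = 1000000, ObjSize = 16 };

void * worker( void * arg ) {
    volatile char * obj = malloc( ObjSize );    // placement decided by the allocator
    for ( int i = 0; i < M; i += 1 )            // intensive read/write
        obj[i % ObjSize] += 1;                  // each write may invalidate a shared cache line
    free( (void *)obj );
    return NULL;
}

int main( void ) {
    pthread_t t[K];
    for ( int i = 0; i < K; i += 1 ) pthread_create( &t[i], NULL, worker, NULL );
    for ( int i = 0; i < K; i += 1 ) pthread_join( t[i], NULL );
}
\end{lstlisting}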

The adjustment knobs for the cache-access scenarios are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[threadN:]
sets the number of threads, K.
\item[cacheIt:]
sets the number of iterations for the cache benchmark, N.
\item[cacheRep:]
sets the number of repetitions for the cache benchmark, M.
\item[cacheObj:]
sets the object size for the cache benchmark.
\end{description}


\subsection{Cache Scratch}

The cache-scratch benchmark measures allocator-induced passive false-sharing in an allocator.
An allocator can unintentionally induce false sharing depending upon its management of freed objects, as described in \VRef[Figure]{f:AllocatorInducedPassiveFalseSharing}.
If Thread$_1$ allocates multiple objects together, the memory allocator may place them on the same cache line.
If Thread$_1$ then passes one of these objects to Thread$_2$, both threads may share the same cache line, but this sharing is not induced by the allocator;
instead, the program induced this situation.
Now, if Thread$_2$ frees this object and then allocates an object of the same size, the allocator may return the same storage, which is on a cache line shared with Thread$_1$.
This false sharing is now caused by the memory allocator, although it was started by the program.

The cache-scratch main-thread induces false sharing and creates a scenario that makes the memory allocator preserve the program-induced false sharing, if the allocator re-uses a freed object instantly rather than returning it to its owner thread.
An allocator using object ownership, as described in \VRef[Section]{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
If a freed object is returned to the thread that originally allocated it, then Thread$_2$ gets a new object that is less likely to be on the same cache line as Thread$_1$'s.

\VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
First, the main thread allocates K dynamic objects together, one for each of the K worker threads, possibly causing the memory allocator to place these objects on the same cache line.
Then it creates the K worker threads, passing one of the allocated objects to each.
Each worker thread frees the object passed by the main thread.
Then, it allocates a new object and reads/writes it M times, possibly causing frequent cache invalidations.
Each worker repeats this N times.

\begin{figure}
\centering
\input{AllocInducedPassiveFalseSharing}
\medskip
\begin{lstlisting}
Main Thread
        malloc N objects for each worker thread
        create worker threads and pass N objects to each worker
        ...
        signal workers to allocate
        ...
        signal workers to free
        ...
        print addresses from each $thread$
Worker Thread$\(_1\)$
        allocate, write, read, free
        warmup memory in chunks of 16 bytes
        ...
        for ( N )
                free an object passed by Main Thread
                malloc new object
        ...
        free objects
        return new object addresses to Main Thread
Worker Thread$\(_2\)$
        // same as Worker Thread$\(_1\)$
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
\caption{Program-Induced Passive False-Sharing Benchmark}
\label{fig:benchScratchFig}
\end{figure}

If the allocator does not return a freed object to its owner thread (here, the main thread), each worker's allocation after freeing the original object should return the same storage that was initially allocated by the main thread.
Then, intensive read/write on the shared cache line by multiple threads should slow down the worker threads due to high cache invalidations and misses.
The main thread measures the total time for all workers to complete.
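
The handoff at the heart of the benchmark can be sketched as follows; the key question is whether the worker's @malloc@ after the @free@ returns the storage just freed, and the names and constants are illustrative assumptions:
\begin{lstlisting}
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

enum { ObjSize = 16, M = 1000000 };

void * worker( void * arg ) {
    char * passed = arg;                        // object allocated by the main thread
    free( passed );                             // free the program-induced shared storage
    volatile char * obj = malloc( ObjSize );    // same storage again, or a fresh cache line?
    for ( int i = 0; i < M; i += 1 )            // intensive read/write
        obj[i % ObjSize] += 1;
    return (void *)obj;                         // address compared by the main thread
}

int main( void ) {
    enum { K = 4 };
    pthread_t t[K];
    char * objs[K];
    for ( int i = 0; i < K; i += 1 ) objs[i] = malloc( ObjSize );   // adjacent allocations
    for ( int i = 0; i < K; i += 1 ) pthread_create( &t[i], NULL, worker, objs[i] );
    for ( int i = 0; i < K; i += 1 ) {
        void * addr;
        pthread_join( t[i], &addr );
        printf( "worker %d reallocated at %p\n", i, addr );
    }
}
\end{lstlisting}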

Similar to the cache-thrash benchmark in \VRef[Section]{sec:benchThrashSec}, different cache-access scenarios can be created using the following command-line arguments:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[threads:]
sets the number of threads, K.
\item[iterations:]
sets the number of iterations for the cache benchmark, N.
\item[repetitions:]
sets the number of repetitions for the cache benchmark, M.
\item[objsize:]
sets the object size for the cache benchmark.
\end{description}