% source: doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex @ e357efb

% Last change by Peter A. Buhr <pabuhr@…>: second proofread of chapter benchmarks

\chapter{Benchmarks}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Micro Benchmark Suite
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

There are two basic approaches for evaluating computer software: benchmarks and micro-benchmarks.
\begin{description}
\item[Benchmarks]
are a suite of application programs (SPEC CPU/WEB) that are exercised in a common way (inputs) to find differences among underlying software implementations associated with an application (compiler, memory allocator, web server, \etc).
The applications are supposed to represent common execution patterns that need to perform well with respect to an underlying software implementation.
Benchmarks are often criticized for having overlapping patterns, insufficient patterns, or extraneous code that masks patterns.
\item[Micro-Benchmarks]
attempt to extract the common execution patterns associated with an application and run each pattern independently.
This approach removes any masking from extraneous application code, allows the execution pattern to be precise, and provides an opportunity for the execution pattern to have multiple independent tuning adjustments (knobs).
Micro-benchmarks are often criticized for inadequately representing real-world applications.
\end{description}

While some crucial software components have standard benchmarks, no standard benchmark exists for testing and comparing memory allocators.
In the past, an assortment of applications has been used for benchmarking allocators~\cite{Detlefs93,Berger00,Berger01,berger02reconsidering}: P2C, GS, Espresso/Espresso-2, CFRAC/CFRAC-2, GMake, GCC, Perl/Perl-2, Gawk/Gawk-2, XPDF/XPDF-2, ROBOOP, Lindsay.
As well, an assortment of micro-benchmarks has been used for benchmarking allocators~\cite{larson99memory,Berger00,streamflow}: threadtest, shbench, Larson, consume, false sharing.
Many of these benchmark applications and micro-benchmarks are old and may not reflect current application allocation patterns.

This thesis designs and examines a new set of micro-benchmarks for memory allocators that test a variety of allocation patterns, each with multiple tuning parameters.
The aim of the micro-benchmark suite is to create a set of programs that can evaluate a memory allocator based on the performance metrics described in (FIX ME: local cite).
% These programs can be taken as a standard to benchmark an allocator's basic goals.
These programs give details of an allocator's memory overhead and speed under certain allocation patterns.
The allocation patterns are configurable (adjustment knobs) to observe an allocator's performance across a spectrum of events for a desired allocation pattern, which is seldom possible with benchmark programs.
Each micro-benchmark program has multiple control knobs specified by command-line arguments.

The new micro-benchmark suite measures performance by allocating dynamic objects and measuring specific metrics.
An allocator's speed is benchmarked in different ways, as are issues like false sharing.


\section{Prior Multi-Threaded Micro-Benchmarks}

Modern memory allocators, such as llheap, must handle multi-threaded programs at the KT and UT level.
The following multi-threaded micro-benchmarks are presented to give a sense of prior work~\cite{Berger00} at the KT level.
None of the prior work addresses multi-threading at the UT level.


\subsection{threadtest}

This benchmark stresses the ability of the allocator to handle different threads allocating and deallocating independently.
There is no interaction among threads, \ie no object sharing.
Each thread repeatedly allocates 100,000 \emph{8-byte} objects and then deallocates them in the order they were allocated.
The benchmark's runtime is used to evaluate the allocator's efficiency.
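The pattern can be sketched in C as follows; the thread count and the number of rounds per thread are illustrative assumptions, while the 100,000 8-byte objects freed in allocation order come from the benchmark description.

```c
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 4       // worker threads (illustrative)
#define NROUNDS 3        // allocate/free rounds per thread (illustrative)
#define NOBJS 100000     // 8-byte objects per round, as in threadtest

// Each thread allocates NOBJS 8-byte objects, then frees them in the
// order they were allocated; there is no sharing among threads.
static void * worker( void * arg ) {
	static __thread void * objs[NOBJS];
	for ( int r = 0; r < NROUNDS; r += 1 ) {
		for ( int i = 0; i < NOBJS; i += 1 ) objs[i] = malloc( 8 );
		for ( int i = 0; i < NOBJS; i += 1 ) free( objs[i] );   // allocation order
	}
	return NULL;
}

int threadtest( void ) {
	pthread_t t[NTHREADS];
	for ( int i = 0; i < NTHREADS; i += 1 ) pthread_create( &t[i], NULL, worker, NULL );
	for ( int i = 0; i < NTHREADS; i += 1 ) pthread_join( t[i], NULL );
	return NTHREADS;     // threads completed
}
```

The elapsed wall-clock time for all threads to finish is the benchmark result.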


\subsection{shbench}

This benchmark is similar to threadtest, but each thread randomly allocates and frees a number of \emph{random-sized} objects.
It is a stress test that also uses runtime to determine the efficiency of the allocator.


\subsection{Larson}

This benchmark simulates a server environment.
Multiple threads are created, where each thread allocates and frees a number of random-sized objects within a size range.
Before a thread terminates, it passes its array of 10,000 objects to a new child thread to continue the process.
The number of thread generations varies depending on the thread speed.
The benchmark calculates memory operations per second as an indicator of a memory allocator's performance.
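A sketch of this generational handoff follows; the operation count, size range, and create-then-join handoff are assumptions for illustration, while the 10,000-slot array comes from the description above.

```c
#include <pthread.h>
#include <stdlib.h>

#define SLOTS 10000      // objects passed between generations, as in Larson
#define OPS 50000        // allocate/free operations per generation (illustrative)

struct gen_arg { void ** slots; int generations; };

// One server generation: allocate and free random-sized objects in the
// slot array, then hand the array on to a child thread.
static void * larson_worker( void * varg ) {
	struct gen_arg * arg = varg;
	unsigned seed = (unsigned)arg->generations + 1;
	for ( int i = 0; i < OPS; i += 1 ) {
		int s = rand_r( &seed ) % SLOTS;
		free( arg->slots[s] );                                  // free current occupant (may be NULL)
		arg->slots[s] = malloc( 16 + rand_r( &seed ) % 1024 );  // random size in range
	}
	if ( arg->generations > 0 ) {          // pass the array to a child thread
		arg->generations -= 1;
		pthread_t child;
		pthread_create( &child, NULL, larson_worker, arg );
		pthread_join( child, NULL );
	}
	return NULL;
}

int larson( int generations ) {
	static void * slots[SLOTS];            // zero-initialized, so free(NULL) is a no-op
	struct gen_arg arg = { slots, generations };
	pthread_t first;
	pthread_create( &first, NULL, larson_worker, &arg );
	pthread_join( first, NULL );
	for ( int i = 0; i < SLOTS; i += 1 ) free( slots[i] );
	return generations;                    // generations run, for checking
}
```

Dividing the total memory operations by the elapsed time gives the operations-per-second figure the benchmark reports.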


\section{New Multi-Threaded Micro-Benchmarks}

The following new benchmarks were created to assess multi-threaded programs at the KT and UT level.


\subsection{Memory Micro-Benchmark}

The memory micro-benchmark measures the memory overhead of an allocator.
It allocates a number of dynamic objects and reads @/proc/self/maps@ to get the total memory requested by the allocator from the OS.
It calculates the memory overhead by computing the difference between the memory the allocator requests from the OS and the memory that the program allocates.
This micro-benchmark is like Larson and stresses the ability of an allocator to deal with object sharing.

\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
It creates a producer-consumer scenario with K producer threads, where each producer has M consumer threads.
A producer has a separate buffer for each consumer and allocates N objects of random sizes following a settable distribution for each consumer.
A consumer frees these objects.
Program memory usage is recorded after every memory operation throughout the run.
This data is used to visualize the memory usage and consumption of the program.
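The memory snapshot itself can be sketched by summing the extents of all mappings listed in @/proc/self/maps@; the parsing below is a minimal illustration of the idea (Linux-specific), not the benchmark's actual code.

```c
#include <stdio.h>
#include <stdlib.h>

// Estimate total memory mapped into this process by summing the
// start-end address ranges in /proc/self/maps.
size_t mapped_bytes( void ) {
	FILE * maps = fopen( "/proc/self/maps", "r" );
	if ( maps == NULL ) return 0;
	size_t total = 0;
	char line[512];
	unsigned long start, end;
	while ( fgets( line, sizeof( line ), maps ) != NULL ) {
		// each line begins "start-end perms offset dev inode path"
		if ( sscanf( line, "%lx-%lx", &start, &end ) == 2 ) total += end - start;
	}
	fclose( maps );
	return total;
}
```

The overhead at any snapshot is this total minus the bytes the program itself has requested from the allocator.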

\begin{figure}
\centering
\begin{lstlisting}
Main Thread
        print memory snapshot
        create producer threads
Producer Thread (K)
        set free start
        create consumer threads
        for ( N )
                allocate memory
                print memory snapshot
Consumer Thread (M)
        wait while ( allocations < free start )
        for ( N )
                free memory
                print memory snapshot
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
\caption{Memory Footprint Micro-Benchmark}
\label{fig:MemoryBenchFig}
\end{figure}


The global adjustment knobs for this micro-benchmark are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[producer (K):]
sets the number of producer threads.
\item[consumer (M):]
sets the number of consumer threads for each producer.
\item[round:]
sets the production and consumption round size.
\end{description}

The adjustment knobs for object allocation are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[max:]
maximum object size.
\item[min:]
minimum object size.
\item[step:]
object size increment.
\item[distro:]
object size distribution.
\item[objects (N):]
number of objects per thread.
\end{description}



\subsection{Speed Micro-Benchmark}

The speed benchmark measures the runtime speed of individual memory-allocation routines and sequences of routines:
\begin{enumerate}[itemsep=0pt,parsep=0pt]
\item malloc
\item realloc
\item free
\item calloc
\item malloc-free
\item realloc-free
\item calloc-free
\item malloc-realloc
\item calloc-realloc
\item malloc-realloc-free
\item calloc-realloc-free
\item malloc-realloc-free-calloc
\end{enumerate}

\VRef[Figure]{fig:SpeedBenchFig} shows the pseudo code for the speed micro-benchmark.
Each routine in the chain is called for N objects, and those allocated objects are then used when calling the next routine in the allocation chain.
This tests the latency of the memory allocator when multiple routines are chained together, \eg the call sequence malloc-realloc-free-calloc gives a complete picture of the major allocation routines when combined together.
For each chain, the time is recorded to visualize the performance of a memory allocator across the chains.

\begin{figure}
\centering
\begin{lstlisting}[morekeywords={foreach}]
Main Thread
        create worker threads
        foreach ( allocation chain )
                note time T1
                ...
                note time T2
                chain_speed = (T2 - T1) / ( number-of-worker-threads * N )
Worker Thread
        initialize variables
        ...
        foreach ( routine in allocation chain )
                call routine N times
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-speed.eps}
\caption{Speed Benchmark}
\label{fig:SpeedBenchFig}
\end{figure}


The adjustment knobs for memory usage are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[max:]
maximum object size.
\item[min:]
minimum object size.
\item[step:]
object size increment.
\item[distro:]
object size distribution.
\item[objects:]
number of objects per thread.
\item[workers:]
number of worker threads.
\end{description}



\subsection{Churn Benchmark}

The churn benchmark measures the runtime speed of an allocator in a multi-threaded scenario, where each thread extensively allocates and frees dynamic memory.
Only @malloc@ and @free@ are used to eliminate any extra cost, such as @memcpy@ in @calloc@ or @realloc@.
Churn simulates a memory-intensive program and can be tuned to create different scenarios.

\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
This benchmark creates a buffer with M spots and starts K threads.
Each thread picks a random spot in M, frees the object currently at that spot, and allocates a new object for that spot.
Each thread repeats this cycle N times.
The main thread measures the total time taken for the whole benchmark, and that time is used to evaluate the memory allocator's performance.
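The worker loop can be sketched as follows; the per-spot lock, the thread/spot/operation counts, and the size range are assumptions for illustration, while the free-then-allocate cycle over a shared spot array is the benchmark's pattern.

```c
#include <pthread.h>
#include <stdlib.h>

#define M 64             // spots in the shared buffer (illustrative)
#define K 4              // worker threads (illustrative)
#define NOPS 10000       // free/allocate cycles per thread (illustrative)

static void * spots[M];                  // NULL-initialized; free(NULL) is a no-op
static pthread_mutex_t locks[M];         // per-spot lock keeps this sketch race-free

static void * churn_worker( void * arg ) {
	unsigned seed = (unsigned)(size_t)arg + 1;
	for ( int i = 0; i < NOPS; i += 1 ) {
		int r = rand_r( &seed ) % M;                            // pick a random spot
		pthread_mutex_lock( &locks[r] );
		free( spots[r] );                                       // free current occupant
		spots[r] = malloc( 16 + rand_r( &seed ) % 512 );        // new random-sized object
		pthread_mutex_unlock( &locks[r] );
	}
	return NULL;
}

int churn( void ) {
	pthread_t t[K];
	for ( int i = 0; i < M; i += 1 ) pthread_mutex_init( &locks[i], NULL );
	for ( int i = 0; i < K; i += 1 ) pthread_create( &t[i], NULL, churn_worker, (void *)(size_t)i );
	for ( int i = 0; i < K; i += 1 ) pthread_join( t[i], NULL );
	for ( int i = 0; i < M; i += 1 ) { free( spots[i] ); spots[i] = NULL; }
	return K;            // threads completed
}
```

Bracketing the create/join calls with two timestamps, as in the pseudo code, yields the churn_speed result.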

\begin{figure}
\centering
\begin{lstlisting}
Main Thread
        create worker threads
        note time T1
        ...
        note time T2
        churn_speed = (T2 - T1)
Worker Thread
        initialize variables
        ...
        for ( N )
                R = random spot in array
                free R
                allocate new object at R
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
\caption{Churn Benchmark}
\label{fig:ChurnBenchFig}
\end{figure}


The adjustment knobs for churn are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[thread:]
number of threads (K).
\item[spots:]
number of spots for churn (M).
\item[obj:]
number of objects per thread (N).
\item[max:]
maximum object size.
\item[min:]
minimum object size.
\item[step:]
object size increment.
\item[distro:]
object size distribution.
\end{description}


\subsection{Cache Thrash}\label{sec:benchThrashSec}

The cache-thrash micro-benchmark measures allocator-induced active false-sharing as illustrated in \VRef{s:AllocatorInducedActiveFalseSharing}.
If memory for multiple threads is allocated on the same cache line, program performance can be significantly degraded.
When threads share a cache line, frequent reads/writes to their cache-line objects cause cache misses, which cause escalating delays as cache distance increases.

Cache thrash tries to create a scenario that leads to false sharing, if the underlying memory allocator is allocating dynamic memory to multiple threads on the same cache lines.
Ideally, a memory allocator should distance the dynamic memory region of one thread from another.
Having multiple threads allocate small objects simultaneously can cause a memory allocator to allocate objects on the same cache line, if it is not distancing the memory among different threads.

\VRef[Figure]{fig:benchThrashFig} shows the pseudo code for the cache-thrash micro-benchmark.
First, it creates K worker threads.
Each worker thread allocates an object and intensively reads/writes it M times to possibly invalidate cache lines that may interfere with other threads sharing the same cache line.
Each thread repeats this N times.
The main thread measures the total time taken for all worker threads to complete.
Worker threads sharing cache lines with each other will take longer.
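One worker's allocate-and-hammer step might be sketched as follows; the repetition count and object size are illustrative assumptions.

```c
#include <stdlib.h>

#define RW 100000        // M read/write repetitions (illustrative)

// Allocate a small object and read/write it intensively; if the
// allocator placed two workers' objects on the same cache line, each
// write invalidates the other worker's cached copy of that line.
void * thrash_once( size_t size ) {
	volatile char * obj = malloc( size );    // small object, may share a cache line
	if ( obj == NULL ) return NULL;
	for ( int i = 0; i < RW; i += 1 ) {
		obj[0] = (char)i;                    // write
		(void)obj[0];                        // read
	}
	return (void *)obj;                      // address kept so cache-line sharing can be checked
}
```

In the benchmark, K threads run this step concurrently N times each, and the returned addresses reveal whether objects landed on shared cache lines.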

\begin{figure}
\centering
\input{AllocInducedActiveFalseSharing}
\medskip
\begin{lstlisting}
Main Thread
        create worker threads
        ...
        signal workers to allocate
        ...
        signal workers to free
        ...
        print addresses from each $thread$
Worker Thread$\(_1\)$
        allocate, write, read, free
        warmup memory in chunks of 16 bytes
        ...
        malloc N objects
        ...
        free objects
        return object address to Main Thread
Worker Thread$\(_2\)$
        // same as Worker Thread$\(_1\)$
\end{lstlisting}
%\input{MemoryOverhead}
%\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
\caption{Allocator-Induced Active False-Sharing Benchmark}
\label{fig:benchThrashFig}
\end{figure}


The adjustment knobs for cache access scenarios are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[thread:]
number of threads (K).
\item[iterations:]
iterations of cache benchmark (N).
\item[cacheRW:]
repetitions of reads/writes to object (M).
\item[size:]
object size.
\end{description}



\subsection{Cache Scratch}

The cache-scratch micro-benchmark measures allocator-induced passive false-sharing as illustrated in \VRef{s:AllocatorInducedPassiveFalseSharing}.
As with cache thrash, if memory for multiple threads is allocated on the same cache line, program performance can be significantly degraded.
In this scenario, the false sharing is preserved by the memory allocator, although it is started by the program sharing an object.

% An allocator can unintentionally induce false sharing depending upon its management of the freed objects.
% If thread Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
% If Thread$_1$ passes these object to thread Thread$_2$, then both threads may share the same cache line but this scenerio is not induced by the allocator;
% instead, the program induced this situation.
% Now if Thread$_2$ frees this object and then allocate an object of the same size, the allocator may return the same object, which is on a cache line shared with thread Thread$_1$.

Cache scratch creates a scenario of program-induced false sharing, which the memory allocator preserves if it does not return a freed object to its owner thread but instead reuses it immediately.
An allocator using object ownership, as described in \VRef{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
If a freed object is returned to the thread that owns it, then a thread allocating a new object is less likely to receive one on a shared cache line.

\VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
First, the main thread allocates K dynamic objects together, one for each of the K worker threads, possibly causing the memory allocator to allocate these objects on the same cache line.
Then it creates the K worker threads and passes one of the K allocated objects to each thread.
Each worker thread frees the object passed by the main thread.
Then, it allocates an object and reads/writes it M times, possibly causing frequent cache invalidations.
Each worker repeats this N times.
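The free-then-reallocate step each worker performs can be sketched as follows; for simplicity the sketch runs the worker steps sequentially in one thread, whereas the real benchmark runs them in K concurrent threads, and the counts and sizes are illustrative assumptions.

```c
#include <stdlib.h>

#define KW 4             // worker count (illustrative)

// One worker step: free the object passed in by the main thread, then
// allocate a same-sized replacement; an allocator without ownership may
// hand back the just-freed storage, preserving any cache-line sharing.
void * scratch_step( void * passed, size_t size ) {
	free( passed );
	return malloc( size );
}

// Setup: allocate KW small objects back to back, so they may sit on one
// cache line, then run the worker step once per object.
int scratch_setup( void ) {
	void * objs[KW];
	int ok = 0;
	for ( int i = 0; i < KW; i += 1 ) objs[i] = malloc( 8 );    // adjacent allocations
	for ( int i = 0; i < KW; i += 1 ) {
		void * repl = scratch_step( objs[i], 8 );
		if ( repl != NULL ) { ok += 1; free( repl ); }
	}
	return ok;           // worker steps that obtained a replacement object
}
```

Comparing the replacement addresses against the original allocations shows whether the allocator reused the shared-line storage.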

\begin{figure}
\centering
\input{AllocInducedPassiveFalseSharing}
\medskip
\begin{lstlisting}
Main Thread
        malloc N objects $for$ each worker $thread$
        create worker threads and pass N objects to each worker
        ...
        signal workers to allocate
        ...
        signal workers to free
        ...
        print addresses from each $thread$
Worker Thread$\(_1\)$
        allocate, write, read, free
        warmup memory in chunks of 16 bytes
        ...
        for ( N )
                free an object passed by Main Thread
                malloc new object
        ...
        free objects
        return new object addresses to Main Thread
Worker Thread$\(_2\)$
        // same as Worker Thread$\(_1\)$
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
\caption{Program-Induced Passive False-Sharing Benchmark}
\label{fig:benchScratchFig}
\end{figure}


If the allocator does not return the initial object back to its owner (the main thread), then a worker thread allocating an object after freeing the object passed by the main thread is likely to receive that same object back.
Intensive reads/writes on the shared cache line by multiple threads should then slow down the worker threads due to high cache invalidations and misses.
The main thread measures the total time taken for all the workers to complete.

Similar to the cache-thrash benchmark in \VRef{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.
\begin{description}[itemsep=0pt,parsep=0pt]
\item[threads:]
number of threads (K).
\item[iterations:]
iterations of cache benchmark (N).
\item[cacheRW:]
repetitions of reads/writes to object (M).
\item[size:]
object size.
\end{description}