Changeset e357efb


Timestamp:
Apr 25, 2022, 10:55:56 AM
Author:
Peter A. Buhr <pabuhr@…>
Branches:
ADT, ast-experimental, master, pthread-emulation, qualifiedEnum
Children:
69ec0fb
Parents:
52a532ae
Message:

second proofread of chapter benchmarks

File:
1 edited

  • doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex

    r52a532ae re357efb  
    2222In the past, an assortment of applications has been used for benchmarking allocators~\cite{Detlefs93,Berger00,Berger01,berger02reconsidering}: P2C, GS, Espresso/Espresso-2, CFRAC/CFRAC-2, GMake, GCC, Perl/Perl-2, Gawk/Gawk-2, XPDF/XPDF-2, ROBOOP, Lindsay.
    2323As well, an assortment of micro-benchmarks has been used for benchmarking allocators~\cite{larson99memory,Berger00,streamflow}: threadtest, shbench, Larson, consume, false sharing.
    24 Many of these applications and micro-benchmark are old and may not reflect current application allocation patterns.
     24Many of these benchmark applications and micro-benchmarks are old and may not reflect current application allocation patterns.
    2525
    2626This thesis designs and examines a new set of micro-benchmarks for memory allocators that test a variety of allocation patterns, each with multiple tuning parameters.
     
    2929These programs give details of an allocator's memory overhead and speed under certain allocation patterns.
    3030The allocation patterns are configurable (adjustment knobs) to observe an allocator's performance across a spectrum of events for a desired allocation pattern, which is seldom possible with benchmark programs.
    31 The micro-benchmark programs control knobs by command-line arguments.
     31Each micro-benchmark program has multiple control knobs specified by command-line arguments.
    3232
    3333The new micro-benchmark suite measures performance by allocating dynamic objects and measuring specific metrics.
     
    3737\section{Prior Multi-Threaded Micro-Benchmarks}
    3838
    39 Modern memory allocators, such as llheap, must handle multi-threaded programs at the KT level.
    40 The following traditional multi-threaded micro-benchmarks are presented to give a sense of prior work~\cite{Berger00}.
     39Modern memory allocators, such as llheap, must handle multi-threaded programs at the KT and UT levels.
     40The following multi-threaded micro-benchmarks are presented to give a sense of prior work~\cite{Berger00} at the KT level.
     41None of the prior work addresses multi-threading at the UT level.
    4142
    4243
     
    4546This benchmark stresses the ability of the allocator to handle different threads allocating and deallocating independently.
    4647There is no interaction among threads, \ie no object sharing.
    47 Each thread repeatedly allocate 100,000 8-byte objects then deallocates them in the order they were allocated.
     48Each thread repeatedly allocates 100,000 \emph{8-byte} objects and then deallocates them in the order they were allocated.
    4849The benchmark's runtime is used to evaluate the allocator's efficiency.
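
A minimal C/pthreads sketch of this pattern (not the original benchmark code) is shown below; the thread count and repetition count are illustrative assumptions.
\begin{lstlisting}
// Sketch of the threadtest pattern: each thread repeatedly allocates a batch
// of 8-byte objects and frees them in allocation order.  No object sharing.
#include <pthread.h>
#include <stdlib.h>

enum { K = 4, NOBJS = 100000, ROUNDS = 50 };     // illustrative parameters

static void * worker( void * arg ) {
    (void)arg;
    void * objs[NOBJS];                          // records allocation order
    for ( int r = 0; r < ROUNDS; r += 1 ) {
        for ( int i = 0; i < NOBJS; i += 1 ) objs[i] = malloc( 8 );  // 8-byte objects
        for ( int i = 0; i < NOBJS; i += 1 ) free( objs[i] );        // free in allocation order
    }
    return NULL;
}

int main() {
    pthread_t threads[K];
    for ( int t = 0; t < K; t += 1 ) pthread_create( &threads[t], NULL, worker, NULL );
    for ( int t = 0; t < K; t += 1 ) pthread_join( threads[t], NULL );
}
\end{lstlisting}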
    4950
     
    5152\subsection{shbench}
    5253
    53 Each thread randomly allocates and frees a number of random-sized objects.
     54This benchmark is similar to threadtest, but each thread randomly allocates and frees a number of \emph{random-sized} objects.
    5455It is a stress test that also uses runtime to determine the efficiency of the allocator.
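
A minimal sketch of this pattern is shown below, assuming illustrative slot, operation, and size parameters; @rand_r@ gives each thread its own random stream.
\begin{lstlisting}
// Sketch of the shbench pattern: each thread randomly allocates and frees
// random-sized objects, using a per-thread slot array.
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

enum { K = 4, SLOTS = 1024, OPS = 1000000 };     // illustrative parameters

static void * worker( void * arg ) {
    unsigned int seed = (unsigned int)(uintptr_t)arg;
    void * slots[SLOTS] = { NULL };
    for ( int i = 0; i < OPS; i += 1 ) {
        int s = rand_r( &seed ) % SLOTS;                 // random slot => random free order
        free( slots[s] );                                // free( NULL ) is a no-op
        slots[s] = malloc( rand_r( &seed ) % 512 + 1 );  // random size, 1..512 bytes
    }
    for ( int s = 0; s < SLOTS; s += 1 ) free( slots[s] );
    return NULL;
}

int main() {
    pthread_t threads[K];
    for ( uintptr_t t = 0; t < K; t += 1 ) pthread_create( &threads[t], NULL, worker, (void *)(t + 1) );
    for ( int t = 0; t < K; t += 1 ) pthread_join( threads[t], NULL );
}
\end{lstlisting}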
    5556
     
    5758\subsection{Larson}
    5859
    59 Larson simulates a server environment.
     60This benchmark simulates a server environment.
    6061Multiple threads are created where each thread allocates and frees a number of random-sized objects within a size range.
    6162Before the thread terminates, it passes its array of 10,000 objects to a new child thread to continue the process.
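
The following simplified sketch illustrates the hand-off pattern; unlike the real Larson benchmark, the parent joins its child so the whole chain can be awaited from the main thread, and all counts and sizes are illustrative.
\begin{lstlisting}
// Sketch of the Larson server pattern: a thread churns an array of objects,
// then hands the array to a new child thread to continue the process.
#include <pthread.h>
#include <stdlib.h>

enum { OBJS = 10000, OPS = 100000, GENERATIONS = 10 };  // illustrative parameters

typedef struct { void * objs[OBJS]; int generation; } Work;

static void * server( void * arg ) {
    Work * w = arg;
    unsigned int seed = w->generation + 1;
    for ( int i = 0; i < OPS; i += 1 ) {                 // replace random objects
        int s = rand_r( &seed ) % OBJS;
        free( w->objs[s] );                              // free( NULL ) is a no-op
        w->objs[s] = malloc( rand_r( &seed ) % 100 + 10 );  // sizes 10..109 bytes
    }
    w->generation += 1;
    if ( w->generation < GENERATIONS ) {                 // pass the array to a child thread
        pthread_t child;
        pthread_create( &child, NULL, server, w );
        pthread_join( child, NULL );                     // simplification: real Larson does not wait
    }
    return NULL;
}

int main() {
    Work * w = calloc( 1, sizeof( Work ) );              // objects start as NULL
    pthread_t first;
    pthread_create( &first, NULL, server, w );
    pthread_join( first, NULL );
    for ( int s = 0; s < OBJS; s += 1 ) free( w->objs[s] );
    free( w );
}
\end{lstlisting}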
     
    6667\section{New Multi-Threaded Micro-Benchmarks}
    6768
    68 The following new benchmarks were created to assess multi-threaded programs at the KT level.
     69The following new benchmarks were created to assess multi-threaded programs at the KT and UT levels.
    6970
    7071
     
    7273
    7374The memory micro-benchmark measures the memory overhead of an allocator.
    74 It allocates a number of dynamic objects.
    75 Then, by reading @/proc/self/proc/maps@, it gets the total memory requested by the allocator from the OS.
    76 It calculates the memory overhead by computing the difference between the memory the allocator has requested from the OS and the memory that the program has allocated.
     75It allocates a number of dynamic objects and reads @/proc/self/maps@ to get the total memory requested by the allocator from the OS.
     76It calculates the memory overhead by computing the difference between the memory the allocator requests from the OS and the memory that the program allocates.
     77This micro-benchmark is like Larson and stresses the ability of an allocator to deal with object sharing.
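
For example, the address space requested from the OS can be estimated by summing the address ranges listed in @/proc/self/maps@. The following sketch is a simplification: it sums all mappings (code, stacks, heap), whereas the benchmark would isolate allocator-related regions.
\begin{lstlisting}
// Sketch: estimate the address space requested from the OS by summing the
// address ranges in /proc/self/maps, before and after an allocation.
#include <stdio.h>
#include <stdlib.h>

static size_t mapped_bytes( void ) {
    FILE * maps = fopen( "/proc/self/maps", "r" );
    if ( maps == NULL ) return 0;
    size_t total = 0;
    unsigned long start, end;
    char line[512];
    while ( fgets( line, sizeof( line ), maps ) != NULL ) {
        if ( sscanf( line, "%lx-%lx", &start, &end ) == 2 ) total += end - start;
    }
    fclose( maps );
    return total;
}

int main() {
    size_t before = mapped_bytes();
    void * p = malloc( 16 * 1024 * 1024 );        // force the allocator to request more memory
    size_t after = mapped_bytes();
    printf( "%zu bytes mapped to satisfy a 16 MiB request\n", after - before );
    free( p );
}
\end{lstlisting}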
    7778
    7879\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
    7980It creates a producer-consumer scenario with K producer threads, where each producer has M consumer threads.
    80 A producer has a separate buffer for each consumer and allocates N objects of random sizes following the given distribution for each consumer.
     81A producer has a separate buffer for each consumer, which it fills with N objects of random sizes following a settable distribution.
    8182A consumer frees these objects.
    8283Program memory usage is recorded after every memory operation throughout the run.
    83 This data then is used to visualize the memory usage and consumption for the program.
     84This data is used to visualize the memory usage and consumption for the program.
    8485
    8586\begin{figure}
     
    8990        print memory snapshot
    9091        create producer threads
    91 Producer Thread
     92Producer Thread (K)
    9293        set free start
    9394        create consumer threads
     
    9596                allocate memory
    9697                print memory snapshot
    97 Consumer Thread
     98Consumer Thread (M)
    9899        wait while ( allocations < free start )
    99100        for ( N )
     
    102103\end{lstlisting}
    103104%\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
    104 \caption{Memory Overhead Benchmark (evaluate memory footprint)}
     105\caption{Memory Footprint Micro-Benchmark}
    105106\label{fig:MemoryBenchFig}
    106107\end{figure}
    107108
    108 The adjustment knobs for this micro-benchmark are:
    109 \begin{description}[itemsep=0pt,parsep=0pt]
    110 \item[producer:]
     109The global adjustment knobs for this micro-benchmark are:
     110\begin{description}[itemsep=0pt,parsep=0pt]
     111\item[producer (K):]
    111112sets the number of producer threads.
     113\item[consumer (M):]
     114sets the number of consumer threads for each producer.
    112115\item[round:]
    113116sets production and consumption round size.
    114 \item[consumer:]
    115 sets number of consumers threads for each producer.
    116 \end{description}
    117 
    118 The adjustment knobs for the object allocation size are:
     117\end{description}
     118
     119The adjustment knobs for object allocation are:
    119120\begin{description}[itemsep=0pt,parsep=0pt]
    120121\item[max:]
    121 sets max object size.
     122maximum object size.
    122123\item[min:]
    123 sets min object size.
     124minimum object size.
    124125\item[step:]
    125 sets object size increment.
     126object size increment.
    126127\item[distro:]
    127 sets object size distribution.
    128 \item[obj:]
    129 sets number of objects per thread.
     128object size distribution.
     129\item[objects (N):]
     130number of objects per thread.
    130131\end{description}
    131132
     
    150151
    151152\VRef[Figure]{fig:SpeedBenchFig} shows the pseudo code for the speed micro-benchmark.
    152 Each routine in the chain is called for N objects and then those allocated objects are used when call the next routine in the allocation chain.
    153 This way we can measure the complete latency of memory allocator when multiple routines are chained together e.g. malloc-realloc-free-calloc gives us the whole picture of the major allocation routines when combined together in a chain.
    154 For each chain, time taken is recorded which then can be used to visualize performance of a memory allocator against each chain.
     153Each routine in the chain is called for N objects and then those allocated objects are used when calling the next routine in the allocation chain.
     154This tests the latency of the memory allocator when multiple routines are chained together, \eg the call sequence malloc-realloc-free-calloc gives a complete picture of the major allocation routines combined.
     155For each chain, the time taken is recorded and used to visualize the performance of the memory allocator on that chain.
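
A minimal sketch of timing one such chain is shown below; the object count and sizes are illustrative.
\begin{lstlisting}
// Sketch: time the malloc-realloc-free-calloc chain over N objects.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { N = 1000000 };                  // objects per chain (illustrative)
static void * objs[N];                 // static to keep the array off the stack

int main() {
    struct timespec begin, end;
    clock_gettime( CLOCK_MONOTONIC, &begin );
    for ( int i = 0; i < N; i += 1 ) objs[i] = malloc( 32 );            // malloc
    for ( int i = 0; i < N; i += 1 ) objs[i] = realloc( objs[i], 64 );  // realloc
    for ( int i = 0; i < N; i += 1 ) free( objs[i] );                   // free
    for ( int i = 0; i < N; i += 1 ) objs[i] = calloc( 1, 32 );         // calloc
    clock_gettime( CLOCK_MONOTONIC, &end );
    printf( "malloc-realloc-free-calloc: %.3f seconds for %d objects\n",
            (end.tv_sec - begin.tv_sec) + (end.tv_nsec - begin.tv_nsec) * 1e-9, N );
    for ( int i = 0; i < N; i += 1 ) free( objs[i] );
}
\end{lstlisting}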
    155156
    156157\begin{figure}
     
    178179\begin{description}[itemsep=0pt,parsep=0pt]
    179180\item[max:]
    180 sets max object size.
     181maximum object size.
    181182\item[min:]
    182 sets min object size.
     183minimum object size.
    183184\item[step:]
    184 sets object size increment.
     185object size increment.
    185186\item[distro:]
    186 sets object size distribution.
    187 \item[obj:]
    188 sets number of objects per thread.
     187object size distribution.
     188\item[objects:]
     189number of objects per thread.
    189190\item[workers:]
    190 sets number of worker threads.
     191number of worker threads.
    191192\end{description}
    192193
     
    194195\subsection{Churn Benchmark}
    195196
    196 Churn benchmark measures the overall runtime speed of an allocator in a multi-threaded scenerio where each thread extensively allocates and frees dynamic memory.
     197The churn benchmark measures the runtime speed of an allocator in a multi-threaded scenario, where each thread extensively allocates and frees dynamic memory.
     198Only @malloc@ and @free@ are used to eliminate any extra cost, such as @memcpy@ in @calloc@ or @realloc@.
     199Churn simulates a memory-intensive program that can be tuned to create different scenarios.
    197200
    198201\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
    199202This benchmark creates a buffer with M spots and starts K threads.
    200 Each thread randomly picks a spot out of M spots, it frees the object currently at that spot and allocates a new object for that spot.
    201 Each thread repeats this cycle for N times.
    202 Main threads measures the total time taken for the whole benchmark and that time is used to evaluate memory allocator's performance.
     203Each thread picks a random spot in M, frees the object currently at that spot, and allocates a new object for that spot.
     204Each thread repeats this cycle N times.
     205The main thread measures the total time taken for the whole benchmark, which is used to evaluate the memory allocator's performance.
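
A minimal sketch of the churn pattern is shown below, assuming a per-thread spot array and a fixed object size for brevity; the benchmark draws sizes from the configured distribution.
\begin{lstlisting}
// Sketch of the churn pattern: each thread repeatedly frees and reallocates
// the object at a random spot in its spot array.
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { K = 4, M = 1000, N = 1000000 };  // threads, spots, operations (illustrative)

static void * churn( void * arg ) {
    unsigned int seed = (unsigned int)(uintptr_t)arg;
    void * spots[M] = { NULL };
    for ( int i = 0; i < N; i += 1 ) {
        int r = rand_r( &seed ) % M;    // pick a random spot
        free( spots[r] );               // free( NULL ) is a no-op on the first visit
        spots[r] = malloc( 64 );        // fixed size for brevity
    }
    for ( int r = 0; r < M; r += 1 ) free( spots[r] );
    return NULL;
}

int main() {
    pthread_t threads[K];
    struct timespec begin, end;
    clock_gettime( CLOCK_MONOTONIC, &begin );
    for ( uintptr_t t = 0; t < K; t += 1 ) pthread_create( &threads[t], NULL, churn, (void *)(t + 1) );
    for ( int t = 0; t < K; t += 1 ) pthread_join( threads[t], NULL );
    clock_gettime( CLOCK_MONOTONIC, &end );
    printf( "churn: %.3f seconds\n",
            (end.tv_sec - begin.tv_sec) + (end.tv_nsec - begin.tv_nsec) * 1e-9 );
}
\end{lstlisting}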
    203206
    204207\begin{figure}
     
    215218        ...
    216219        for ( N )
    217                 R -> random spot in array
     220                R = random spot in array
    218221                free R
    219222                allocate new object at R
     
    224227\end{figure}
    225228
    226 Only malloc and free are used to allocate and free an object to eliminate any extra cost such as memcpy in realloc etc.
    227 Malloc/free allows us to measure latency of memory allocation only without paying any extra cost.
    228 Churn simulates a memory intensive program that can be tuned to create different scenarios.
    229 
    230 The adjustment knobs for churn usage are:
     229The adjustment knobs for churn are:
    231230\begin{description}[itemsep=0pt,parsep=0pt]
    232231\item[thread:]
    233 sets number of threads, K.
     232number of threads (K).
    234233\item[spots:]
    235 sets number of spots for churn, M.
     234number of spots for churn (M).
    236235\item[obj:]
    237 sets number of objects per thread, N.
     236number of objects per thread (N).
    238237\item[max:]
    239 sets max object size.
     238maximum object size.
    240239\item[min:]
    241 sets min object size.
     240minimum object size.
    242241\item[step:]
    243 sets object size increment.
     242object size increment.
    244243\item[distro:]
    245 sets object size distribution
    246 \end{description}
    247 
    248 
    249 \section{Cache Thrash}\label{sec:benchThrashSec}
    250 
    251 Cache Thrash benchmark measures allocator induced active false sharing in an allocator as illustrated in figure \VRef{f:AllocatorInducedActiveFalseSharing}.
    252 If memory allocator allocates memory for multiple threads on same cache line, this can slow down the program performance.
    253 If both threads, who share one cache line, frequently read/write to their object on the cache line concurrently then this will cause cache miss every time a thread accesses the object as the other thread might have written something at their memory location on the same cache line.
    254 
    255 Cache thrash tries to create a scenerio that should lead to allocator induced false sharing if the underlying memory allocator is allocating dynamic memory to multiple threads on same cache lines.
    256 Ideally, a memory allocator should distance dynamic memory region of one thread from other threads'.
    257 Having multiple threads allocating small objects simultaneously should cause the memory allocator to allocate objects for multiple objects on the same cache line if its
    258 not distancing the memory among different threads.
     244object size distribution.
     245\end{description}
     246
     247
     248\subsection{Cache Thrash}\label{sec:benchThrashSec}
     249
     250The cache-thrash micro-benchmark measures allocator-induced active false-sharing as illustrated in \VRef{s:AllocatorInducedActiveFalseSharing}.
     251If memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
     252When threads share a cache line, frequent reads/writes to their objects on that line cause cache misses, which cause escalating delays as cache distance increases.
     253
     254Cache thrash tries to create a scenario that leads to false sharing if the underlying memory allocator allocates dynamic memory for multiple threads on the same cache lines.
     255Ideally, a memory allocator should distance the dynamic memory region of one thread from another.
     256Having multiple threads allocating small objects simultaneously can cause a memory allocator to allocate objects on the same cache line, if it is not distancing the memory among different threads.
    259257
    260258\VRef[Figure]{fig:benchThrashFig} shows the pseudo code for the cache-thrash micro-benchmark.
    261259First, it creates K worker threads.
    262 Each worker thread allocates an object and intensively read/write it for M times to invalidate cache lines frequently to slow down other threads who might be sharing this cache line with it.
     260Each worker thread allocates an object and intensively reads/writes it M times, possibly invalidating cache lines shared with other threads and slowing them down.
    263261Each thread repeats this N times.
    264 Main thread measures the total time taken to for all worker threads to complete.
     262The main thread measures the total time taken for all worker threads to complete.
    265263Worker threads sharing cache lines with each other will take longer.
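
A minimal sketch of the thrash pattern is shown below; @volatile@ prevents the compiler from optimizing away the repeated reads/writes, and all counts are illustrative.
\begin{lstlisting}
// Sketch of cache thrash: each worker allocates a small object and hammers
// it; if the allocator places objects from different threads on one cache
// line, the writes cause false-sharing misses in the other threads.
#include <pthread.h>
#include <stdlib.h>

enum { K = 4, N = 1000, M = 100000 };  // threads, iterations, reads/writes (illustrative)

static void * worker( void * arg ) {
    (void)arg;
    for ( int i = 0; i < N; i += 1 ) {
        volatile char * obj = malloc( 8 );             // small object, may share a cache line
        for ( int j = 0; j < M; j += 1 ) obj[0] += 1;  // repeatedly dirty the cache line
        free( (void *)obj );
    }
    return NULL;
}

int main() {
    pthread_t threads[K];
    for ( int t = 0; t < K; t += 1 ) pthread_create( &threads[t], NULL, worker, NULL );
    for ( int t = 0; t < K; t += 1 ) pthread_join( threads[t], NULL );
}
\end{lstlisting}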
    266264
     
    297295The adjustment knobs for cache access scenarios are:
    298296\begin{description}[itemsep=0pt,parsep=0pt]
    299 \item[threadN:]
    300 sets number of threads, K.
    301 \item[cacheIt:]
    302 iterations for cache benchmark, N.
    303 \item[cacheRep:]
    304 repetitions for cache benchmark, M.
    305 \item[cacheObj:]
    306 object size for cache benchmark.
     297\item[thread:]
     298number of threads (K).
     299\item[iterations:]
     300iterations of cache benchmark (N).
     301\item[cacheRW:]
     302repetitions of reads/writes to object (M).
     303\item[size:]
     304object size.
    307305\end{description}
    308306
     
    310308\subsection{Cache Scratch}
    311309
    312 The cache scratch benchmark measures allocator induced passive false sharing in an allocator.
    313 An allocator can unintentionally induce false sharing depending upon its management of the freed objects as described in \VRef[Figure]{f:AllocatorInducedPassiveFalseSharing}.
    314 If thread Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
    315 If Thread$_1$ passes these object to thread Thread$_2$, then both threads may share the same cache line but this scenerio is not induced by the allocator;
    316 instead, the program induced this situation.
    317 Now if Thread$_2$ frees this object and then allocate an object of the same size, the allocator may return the same object, which is on a cache line shared with thread Thread$_1$.
    318 Now this false sharing is being caused by the memory allocator although it was started by the program.
    319 
    320 The cache-scratch main-thread induces false sharing and creates a scenerio that should make the memory allocator preserve the program-induced false sharing if it does not return a freed object to its owner thread and, instead, re-uses it instantly.
     310The cache-scratch micro-benchmark measures allocator-induced passive false-sharing as illustrated in \VRef{s:AllocatorInducedPassiveFalseSharing}.
     311As for cache thrash, if memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
     312In this scenario, the false sharing is caused by the memory allocator, although it is initiated by the program sharing an object.
     313
     314% An allocator can unintentionally induce false sharing depending upon its management of the freed objects.
     315% If thread Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
     316% If Thread$_1$ passes these object to thread Thread$_2$, then both threads may share the same cache line but this scenerio is not induced by the allocator;
     317% instead, the program induced this situation.
     318% Now if Thread$_2$ frees this object and then allocate an object of the same size, the allocator may return the same object, which is on a cache line shared with thread Thread$_1$.
     319
     320Cache scratch creates a scenario of program-induced false sharing, which the memory allocator preserves if, instead of returning a freed object to its owner thread, it re-uses the object instantly.
    321321An allocator using object ownership, as described in section \VRef{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
    322 If the object is returned to the thread who owns it or originally allocated it, then the thread Thread$_2$ gets a new object that is less likely to be on the same cache line as Thread$_1$.
     322If the object is returned to the thread that owns it, then the new object the other thread receives is less likely to be on the same cache line.
    323323
    324324\VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
    325 First, it allocates K dynamic objects together, one for each of the K worker threads, possibly causing memory allocator to allocate these objects on the same cache-line.
     325First, it allocates K dynamic objects together, one for each of the K worker threads, possibly causing the memory allocator to allocate these objects on the same cache line.
    326326Then it creates K worker threads and passes one of the K allocated objects to each thread.
    327327Each worker thread frees the object passed by the main thread.
     
    335335\begin{lstlisting}
    336336Main Thread
    337         malloc N objects for each worker thread
     337        malloc N objects $for$ each worker $thread$
    338338        create worker threads and pass N objects to each worker
    339339        ...
     
    365365The main thread measures the total time taken for all the workers to complete.
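
A minimal sketch of the scratch pattern is shown below, with illustrative counts: the main thread's back-to-back allocations make cache-line sharing likely, and each worker's free/malloc pair exposes whether the allocator instantly re-uses the freed, possibly shared storage.
\begin{lstlisting}
// Sketch of cache scratch: the main thread allocates the initial objects
// together; each worker frees its object and reallocates, which may return
// the same cache-line-sharing storage if the allocator lacks ownership.
#include <pthread.h>
#include <stdlib.h>

enum { K = 4, N = 1000, M = 100000 };  // threads, iterations, reads/writes (illustrative)

static void * worker( void * arg ) {
    volatile char * obj = arg;                         // object allocated by the main thread
    for ( int i = 0; i < N; i += 1 ) {
        free( (void *)obj );
        obj = malloc( 8 );                             // may instantly re-use the freed storage
        for ( int j = 0; j < M; j += 1 ) obj[0] += 1;  // repeatedly dirty the cache line
    }
    free( (void *)obj );
    return NULL;
}

int main() {
    pthread_t threads[K];
    void * objs[K];
    for ( int t = 0; t < K; t += 1 ) objs[t] = malloc( 8 );  // allocated together: likely one cache line
    for ( int t = 0; t < K; t += 1 ) pthread_create( &threads[t], NULL, worker, objs[t] );
    for ( int t = 0; t < K; t += 1 ) pthread_join( threads[t], NULL );
}
\end{lstlisting}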
    366366
    367 Similar to benchmark cache thrash in section \VRef{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.\\
     367Similar to benchmark cache thrash in section \VRef{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.
    368368\begin{description}[itemsep=0pt,parsep=0pt]
    369369\item[threads:]
    370 number of threads, K.
     370number of threads (K).
    371371\item[iterations:]
    372 iterations for cache benchmark, N.
    373 \item[repetitions:]
    374 repetitions for cache benchmark, M.
    375 \item[objsize:]
    376 object size for cache benchmark.
    377 \end{description}
     372iterations of cache benchmark (N).
     373\item[cacheRW:]
     374repetitions of reads/writes to object (M).
     375\item[size:]
     376object size.
     377\end{description}