Changeset cd1a5e8 for doc


Timestamp: Apr 26, 2022, 8:31:46 AM (2 years ago)
Author: Peter A. Buhr <pabuhr@…>
Branches: ADT, ast-experimental, master, pthread-emulation, qualifiedEnum
Children: c9136d9
Parents: c73213b
Message: reorder micro-benchmarks to match performance chapter

File: 1 edited
  • doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex

\chapter{Benchmarks}
\label{s:Benchmarks}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
...
\subsection{Churn Benchmark}
\label{s:ChurnBenchmark}

The churn benchmark measures the runtime speed of an allocator in a multi-threaded scenario, where each thread extensively allocates and frees dynamic memory.
Only @malloc@ and @free@ are used, to eliminate any extra cost, such as the @memcpy@ in @calloc@ or @realloc@.
Churn simulates a memory-intensive program that can be tuned to create different scenarios.

\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
This benchmark creates a buffer with M spots and starts K threads.
Each thread picks a random spot in M, frees the object currently at that spot, and allocates a new object for that spot.
Each thread repeats this cycle N times.
The main thread measures the total time taken for the whole benchmark, and that time is used to evaluate the memory allocator's performance.

\begin{figure}
\begin{lstlisting}
Main Thread
        create worker threads
        note time T1
        ...
        note time T2
        churn_speed = (T2 - T1)
Worker Thread
        initialize variables
        ...
        for ( N )
                R = random spot in array
                free R
                allocate new object at R
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
\caption{Churn Benchmark}
\label{fig:ChurnBenchFig}
\end{figure}

The adjustment knobs for churn are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[thread:]
number of threads (K).
\item[spots:]
number of spots for churn (M).
\item[obj:]
number of objects per thread (N).
\item[max:]
maximum object size.
\item[min:]
minimum object size.
\item[step:]
object size increment.
\item[distro:]
object size distribution.
\end{description}
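
To make the churn cycle concrete, the following is a minimal C sketch of the worker loop, where the values of @K@, @M@, @N@, and the size range stand in for the knobs above.
The atomic exchange guarding the shared spots is an assumption, since the pseudo code leaves the synchronization between threads unspecified.
\begin{lstlisting}
// Sketch only: knob values are assumed, not the benchmark's defaults.
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { K = 8, M = 1024, N = 100000 };           // threads, spots, cycles per thread
static _Atomic(void *) spots[M];                // shared buffer of objects

static void * worker( void * arg ) {
        unsigned int seed = (uintptr_t)arg;
        for ( int i = 0; i < N; i += 1 ) {
                int r = rand_r( &seed ) % M;                    // pick a random spot
                size_t size = 16 + rand_r( &seed ) % 512;       // random object size
                void * prev = atomic_exchange( &spots[r], malloc( size ) );
                free( prev );                                   // free( NULL ) is a no-op
        }
        return NULL;
}

int main( void ) {
        pthread_t t[K];
        struct timespec t1, t2;
        clock_gettime( CLOCK_MONOTONIC, &t1 );          // note time T1
        for ( int i = 0; i < K; i += 1 ) pthread_create( &t[i], NULL, worker, (void *)(uintptr_t)(i + 1) );
        for ( int i = 0; i < K; i += 1 ) pthread_join( t[i], NULL );
        clock_gettime( CLOCK_MONOTONIC, &t2 );          // note time T2
        double churn_speed = (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) / 1e9;
        printf( "churn time %.3f s\n", churn_speed );   // evaluate allocator performance
}
\end{lstlisting}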


\subsection{Cache Thrash}
\label{sec:benchThrashSec}

The cache-thrash micro-benchmark measures allocator-induced active false-sharing, as illustrated in \VRef{s:AllocatorInducedActiveFalseSharing}.
If memory for multiple threads is allocated on the same cache line, program performance can slow significantly.
When threads share a cache line, frequent reads/writes to their part of the line cause cache misses, which cause escalating delays as cache distance increases.

Cache thrash tries to create a scenario that leads to false sharing, if the underlying memory allocator places dynamic memory for multiple threads on the same cache lines.
Ideally, a memory allocator should distance the dynamic memory region of one thread from another.
Having multiple threads allocate small objects simultaneously can cause a memory allocator to allocate objects on the same cache line, if it is not distancing the memory among different threads.

\VRef[Figure]{fig:benchThrashFig} shows the pseudo code for the cache-thrash micro-benchmark.
First, it creates K worker threads.
Each worker thread allocates an object and intensively reads/writes it M times, to possibly invalidate cache lines that may interfere with other threads sharing the same cache line.
Each thread repeats this N times.
The main thread measures the total time taken for all worker threads to complete.
Worker threads sharing cache lines with each other will take longer.

\begin{figure}
\centering
\input{AllocInducedActiveFalseSharing}
\medskip
\begin{lstlisting}
Main Thread
        create worker threads
        ...
        signal workers to allocate
        ...
        signal workers to free
        ...
        print addresses from each $thread$
Worker Thread$\(_1\)$
        allocate, write, read, free
        warmup memory in chunks of 16 bytes
        ...
        malloc N objects
        ...
        free objects
        return object address to Main Thread
Worker Thread$\(_2\)$
        // same as Worker Thread$\(_1\)$
\end{lstlisting}
%\input{MemoryOverhead}
%\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
\caption{Allocator-Induced Active False-Sharing Benchmark}
\label{fig:benchThrashFig}
\end{figure}

The adjustment knobs for cache access scenarios are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[thread:]
number of threads (K).
\item[iterations:]
iterations of cache benchmark (N).
\item[cacheRW:]
repetitions of reads/writes to object (M).
\item[size:]
object size.
\end{description}
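
The following is a minimal C sketch of one worker's thrash loop, with assumed values standing in for the @thread@, @iterations@, @cacheRW@, and @size@ knobs; it illustrates the scenario rather than reproducing the benchmark code.
If the allocator places the small objects of different threads on the same cache line, the read/write loop generates the invalidation traffic the benchmark measures.
\begin{lstlisting}
// Sketch only: N_ITER, M_RW, and SIZE are assumed knob values.
#include <stdlib.h>

enum { N_ITER = 1000, M_RW = 100000, SIZE = 8 };        // iterations, cacheRW, size

static void * thrash_worker( void * arg ) {             // one of K threads
        (void)arg;
        for ( int n = 0; n < N_ITER; n += 1 ) {
                // small allocation: without per-thread distancing, objects of
                // different threads can land on the same cache line
                volatile char * obj = malloc( SIZE );
                for ( int m = 0; m < M_RW; m += 1 )
                        obj[m % SIZE] += 1;     // intensive reads/writes: misses if the line is shared
                free( (void *)obj );
        }
        return NULL;
}
\end{lstlisting}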


\subsection{Cache Scratch}
\label{s:CacheScratch}

The cache-scratch micro-benchmark measures allocator-induced passive false-sharing, as illustrated in \VRef{s:AllocatorInducedPassiveFalseSharing}.
As for cache thrash, if memory for multiple threads is allocated on the same cache line, program performance can slow significantly.
In this scenario, however, the false sharing is caused by the memory allocator, although it is started by the program sharing an object.

% An allocator can unintentionally induce false sharing depending upon its management of the freed objects.
% If thread Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
% If Thread$_1$ passes these objects to thread Thread$_2$, then both threads may share the same cache line but this scenario is not induced by the allocator;
% instead, the program induced this situation.
% Now if Thread$_2$ frees this object and then allocates an object of the same size, the allocator may return the same object, which is on a cache line shared with thread Thread$_1$.

Cache scratch tries to create a scenario that leads to false sharing, and should make the memory allocator preserve the program-induced false sharing if it does not return a freed object to its owner thread and, instead, reuses it instantly.
An allocator using object ownership, as described in \VRef{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
If the object is returned to the thread that owns it, then the new object the other thread gets is less likely to be on the same cache line.

\VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
First, it allocates K dynamic objects together, one for each of the K worker threads, possibly causing the memory allocator to allocate these objects on the same cache line.
Then it creates K worker threads and passes one of the K allocated objects to each of the K threads.
Each worker thread frees the object passed by the main thread.
Then, it allocates an object and reads/writes it repetitively M times, possibly causing frequent cache invalidations.
Each worker repeats this N times.

\begin{figure}
\centering
\input{AllocInducedPassiveFalseSharing}
\medskip
\begin{lstlisting}
Main Thread
        malloc N objects $for$ each worker $thread$
        create worker threads and pass N objects to each worker
        ...
        signal workers to allocate
        ...
        signal workers to free
        ...
        print addresses from each $thread$
Worker Thread$\(_1\)$
        allocate, write, read, free
        warmup memory in chunks of 16 bytes
        ...
        for ( N )
                free an object passed by Main Thread
                malloc new object
        ...
        free objects
        return new object addresses to Main Thread
Worker Thread$\(_2\)$
        // same as Worker Thread$\(_1\)$
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
\caption{Program-Induced Passive False-Sharing Benchmark}
\label{fig:benchScratchFig}
\end{figure}

Each thread allocating an object after freeing the original object passed by the main thread should cause the memory allocator to return the same object that was initially allocated by the main thread, if the allocator did not return the initial object back to its owner (the main thread).
Then, intensive reads/writes on the shared cache line by multiple threads should slow down the worker threads due to high cache invalidations and misses.
The main thread measures the total time taken for all the workers to complete.

Similar to the cache-thrash benchmark in \VRef{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.
\begin{description}[itemsep=0pt,parsep=0pt]
\item[threads:]
number of threads (K).
\item[iterations:]
iterations of cache benchmark (N).
\item[cacheRW:]
repetitions of reads/writes to object (M).
\item[size:]
object size.
\end{description}
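
The following is a minimal C sketch of the scratch scenario under assumed knob values, again an illustration rather than the benchmark code: the main thread's back-to-back allocations may share cache lines, and an allocator without ownership may hand a worker back the same memory it just freed.
\begin{lstlisting}
// Sketch only: K, N_ITER, M_RW, and SIZE are assumed knob values.
#include <pthread.h>
#include <stdlib.h>

enum { K = 4, N_ITER = 1000, M_RW = 100000, SIZE = 8 };

static void * scratch_worker( void * arg ) {
        volatile char * obj = arg;              // object passed by the Main Thread
        for ( int n = 0; n < N_ITER; n += 1 ) {
                free( (void *)obj );            // free passed/previous object
                obj = malloc( SIZE );           // without ownership, the same shared-line memory may return
                for ( int m = 0; m < M_RW; m += 1 )
                        obj[m % SIZE] += 1;     // passive false sharing appears here
        }
        free( (void *)obj );
        return NULL;
}

int main( void ) {
        pthread_t t[K];
        void * objs[K];
        for ( int i = 0; i < K; i += 1 ) objs[i] = malloc( SIZE );      // back-to-back: likely same cache line
        for ( int i = 0; i < K; i += 1 ) pthread_create( &t[i], NULL, scratch_worker, objs[i] );
        for ( int i = 0; i < K; i += 1 ) pthread_join( t[i], NULL );
}
\end{lstlisting}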


\subsection{Speed Micro-Benchmark}
\label{s:SpeedMicroBenchmark}

The speed benchmark measures the runtime speed of individual and sequences of memory allocation routines:
...
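
As a minimal sketch of the timing pattern, the following hypothetical harness times one assumed case, a fixed-size @malloc@/@free@ pair; the benchmark's actual routine sequences are not shown here.
\begin{lstlisting}
// Sketch only: the timed case and N are assumptions, not the benchmark's.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main( void ) {
        enum { N = 10000000 };
        struct timespec t1, t2;
        clock_gettime( CLOCK_MONOTONIC, &t1 );
        for ( int i = 0; i < N; i += 1 ) {
                void * p = malloc( 64 );        // fixed-size request
                free( p );                      // a real harness must prevent the compiler eliding the pair
        }
        clock_gettime( CLOCK_MONOTONIC, &t2 );
        double ns = ((t2.tv_sec - t1.tv_sec) * 1e9 + (t2.tv_nsec - t1.tv_nsec)) / N;
        printf( "malloc/free pair: %.1f ns\n", ns );
}
\end{lstlisting}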

\subsection{Memory Micro-Benchmark}
\label{s:MemoryMicroBenchmark}

The memory micro-benchmark measures the memory overhead of an allocator.
It allocates a number of dynamic objects and reads @/proc/self/maps@ to get the total memory requested by the allocator from the OS.
It computes the memory overhead as the difference between the memory the allocator requests from the OS and the memory the program allocates.
This micro-benchmark is like the Larson benchmark and stresses the ability of an allocator to deal with object sharing.
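
A minimal C sketch of the measurement step follows, assuming the OS-requested memory is obtained by summing every address range in @/proc/self/maps@; the actual benchmark may filter mappings differently.
\begin{lstlisting}
// Sketch only: summing all mappings over-approximates the heap, since it
// includes code, stacks, and shared libraries.
#include <stdio.h>

static size_t os_memory( void ) {
        FILE * f = fopen( "/proc/self/maps", "r" );
        if ( f == NULL ) return 0;
        size_t total = 0;
        unsigned long start, end;
        char line[512];
        while ( fgets( line, sizeof(line), f ) != NULL ) {
                if ( sscanf( line, "%lx-%lx", &start, &end ) == 2 )
                        total += end - start;   // size of this mapping
        }
        fclose( f );
        return total;   // overhead = total - bytes allocated by the program
}
\end{lstlisting}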

\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
It creates a producer-consumer scenario with K producer threads, where each producer has M consumer threads.
A producer has a separate buffer for each consumer and allocates N objects of random sizes, following a settable distribution, for each consumer.
A consumer frees these objects.
Program memory usage is recorded after every memory operation throughout the run.
This data is used to visualize the memory usage and consumption of the program.

\begin{figure}
\begin{lstlisting}
Main Thread
        print memory snapshot
        create producer threads
Producer Thread (K)
        set free start
        create consumer threads
        for ( N )
                allocate memory
                print memory snapshot
Consumer Thread (M)
        wait while ( allocations < free start )
        for ( N )
                free memory
                print memory snapshot
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
\caption{Memory Footprint Micro-Benchmark}
\label{fig:MemoryBenchFig}
\end{figure}

The global adjustment knobs for this micro-benchmark are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[producer (K):]
sets the number of producer threads.
\item[consumer (M):]
sets the number of consumer threads for each producer.
\item[round:]
sets the production and consumption round size.
\end{description}

The adjustment knobs for object allocation are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[max:]
maximum object size.
\item[min:]
minimum object size.
\item[step:]
object size increment.
\item[distro:]
object size distribution.
\item[objects (N):]
number of objects per thread.
\end{description}