Changeset a6e8f64


Timestamp:
Apr 24, 2022, 10:54:45 PM (2 years ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
ADT, ast-experimental, master, pthread-emulation, qualifiedEnum
Children:
52a532ae
Parents:
16d397a
Message:

first proofread of chapter benchmarks

File:
1 edited

  • doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex

    r16d397a ra6e8f64  
    77%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    88
    9 The aim of micro benchmark suite is to create a set of programs that can evaluate a memory allocator based on the
    10 performance matrices described in (FIX ME: local cite). These programs can be taken as a standard to benchmark an
    11 allocator's basic goals. These programs give details of an allocator's memory overhead and speed under a certain
    12 allocation pattern. The speed of the allocator is benchmarked in different ways. Similarly, false sharing happening in
    13 an allocator is also measured in multiple ways. These benchmarks evalute the allocator under a certain allocation
    14 pattern which is configurable and can be changed using a few knobs to benchmark observe an allocator's performance
    15 under a desired allocation pattern.
    16 
    17 Micro Benchmark Suite benchmarks an allocator's performance by allocating dynamic objects and, then, measuring specifc
    18 matrices. The benchmark suite evaluates an allocator with a certain allocation pattern. Bnechmarks have different knobs
    19 that can be used to change allocation pattern and evaluate an allocator under desired conditions. These can be set by
    20 giving commandline arguments to the benchmark on execution.
    21 
    22 \section{Current Benchmarks} There are multiple benchmarks that are built individually and evaluate different aspects of
    23  a memory allocator. But, there is not a set of benchamrks that can be used to evaluate multiple aspects of memory
    24  allocators.
    25 
    26 \subsection{threadtest}(FIX ME: cite benchmark and hoard) Each thread repeatedly allocates and then deallocates 100,000
    27  objects. Runtime of the benchmark evaluates its efficiency.
    28 
    29 \subsection{shbench}(FIX ME: cite benchmark and hoard) Each thread allocates and randomly frees a number of random-sized
    30  objects. It is a stress test that also uses runtime to determine efficiency of the allocator.
    31 
    32 \subsection{larson}(FIX ME: cite benchmark and hoard) Larson simulates a server environment. Multiple threads are
    33  created where each thread allocator and free a number of objects within a size range. Some objects are passed from
    34  threads to the child threads to free. It caluculates memory operations per second as an indicator of memory
    35  allocator's performance.
    36 
    37 \section{Memory Benchmark} Memory benchmark measures memory overhead of an allocator. It allocates a number of dynamic
    38  objects. Then, by reading /self/proc/maps, gets the total memory that the allocator has reuested from the OS. It
    39  calculates the memory head by taking the difference between the memory the allocator has requested from the OS and the
    40  memory that program has allocated.
    41 
    42 \begin{figure}
    43 \centering
    44 \includegraphics[width=1\textwidth]{figures/bench-memory.eps}
    45 \caption{Benchmark Memory Overhead}
    46 \label{fig:benchMemoryFig}
    47 \end{figure}
    48 
    49 Figure \ref{fig:benchMemoryFig} gives a flow of the memory benchmark. It creates a producer-consumer scenerio with K producers
    50  and each producer has M consumers. Producer has a separate buffer for each consumer. Producer allocates N objects of
    51  random sizes following the given distrubution for each consumer. Consumer frees those objects. After every memory
    52  operation, program memory usage is recorded throughout the runtime. This data then can be used to visualize the memory
    53  usage and consumption of the prigram.
    54 
    55 Different knobs can be adjusted to set certain thread model.\\
    56 -threadA :  sets number of alloc threads (producers) for mem benchmark\\
    57 -consumeS:  sets production and conumption round size\\
    58 -threadF :  sets number of free threads (consumers) for each producer for mem benchmark
    59 
    60 Object allocation size can be changed using the knobs:\\
    61 -maxS    :  sets max object size\\
    62 -minS    :  sets min object size\\
    63 -stepS   :  sets object size increment\\
    64 -distroS :  sets object size distribution\\
    65 -objN    :  sets number of objects per thread\\
    66 
    67 \section{Speed Benchmark} Speed benchmark measures the runtime speed of an allocator (FIX ME: cite allocator routines).
    68  Speed benchmark measures runtime speed of individual memory allocation routines. It also considers different
    69  allocation chains to measures the performance of the allocator by combining multiple allocation routines in a chain.
    70  It uses following chains and measures allocator runtime speed against them:
    71 \begin{itemize}
     9There are two basic approaches for evaluating computer software: benchmarks and micro-benchmarks.
     10\begin{description}
     11\item[Benchmarks]
     12are a suite of application programs (SPEC CPU/WEB) that are exercised in a common way (inputs) to find differences among underlying software implementations associated with an application (compiler, memory allocator, web server, \etc).
      13The applications are supposed to represent common execution patterns that need to perform well with respect to an underlying software implementation.
     14Benchmarks are often criticized for having overlapping patterns, insufficient patterns, or extraneous code that masks patterns.
     15\item[Micro-Benchmarks]
     16attempt to extract the common execution patterns associated with an application and run the pattern independently.
      17This approach removes any masking from extraneous application code, allows the execution pattern to be very precise, and provides an opportunity for the execution pattern to have multiple independent tuning adjustments (knobs).
     18Micro-benchmarks are often criticized for inadequately representing real-world applications.
     19\end{description}
     20
     21While some crucial software components have standard benchmarks, no standard benchmark exists for testing and comparing memory allocators.
      22In the past, an assortment of applications has been used for benchmarking allocators~\cite{Detlefs93,Berger00,Berger01,berger02reconsidering}: P2C, GS, Espresso/Espresso-2, CFRAC/CFRAC-2, GMake, GCC, Perl/Perl-2, Gawk/Gawk-2, XPDF/XPDF-2, ROBOOP, Lindsay.
      23As well, an assortment of micro-benchmarks has been used for benchmarking allocators~\cite{larson99memory,Berger00,streamflow}: threadtest, shbench, Larson, consume, false sharing.
      24Many of these applications and micro-benchmarks are old and may not reflect current application allocation patterns.
     25
     26This thesis designs and examines a new set of micro-benchmarks for memory allocators that test a variety of allocation patterns, each with multiple tuning parameters.
      27The aim of the micro-benchmark suite is to create a set of programs that can evaluate a memory allocator based on the performance metrics described in (FIX ME: local cite).
     28% These programs can be taken as a standard to benchmark an allocator's basic goals.
     29These programs give details of an allocator's memory overhead and speed under certain allocation patterns.
     30The allocation patterns are configurable (adjustment knobs) to observe an allocator's performance across a spectrum of events for a desired allocation pattern, which is seldom possible with benchmark programs.
     31The micro-benchmark programs control knobs by command-line arguments.
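
For example, a hypothetical invocation of the churn micro-benchmark might look as follows (the knob names follow the command-line arguments listed in this chapter; the @uniform@ distribution value is an assumption):
\begin{lstlisting}
./churn -threadN 4 -cSpots 1024 -objN 100000 -minS 32 -maxS 512 -stepS 16 -distroS uniform
\end{lstlisting}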
     32
      33The new micro-benchmark suite measures performance by allocating dynamic objects and measuring specific metrics.
     34An allocator's speed is benchmarked in different ways, as are issues like false sharing.
     35
     36
     37\section{Prior Multi-Threaded Micro-Benchmarks}
     38
     39Modern memory allocators, such as llheap, must handle multi-threaded programs at the KT level.
     40The following traditional multi-threaded micro-benchmarks are presented to give a sense of prior work~\cite{Berger00}.
     41
     42
     43\subsection{threadtest}
     44
     45This benchmark stresses the ability of the allocator to handle different threads allocating and deallocating independently.
     46There is no interaction among threads, \ie no object sharing.
      47Each thread repeatedly allocates 100,000 8-byte objects and then deallocates them in the order they were allocated.
     48Runtime of the benchmark evaluates its efficiency.
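
A minimal sketch of one threadtest worker is shown below (the 100,000-object batch and 8-byte size are as described above; the round count is an assumption):
\begin{lstlisting}
#include <stdlib.h>

enum { Objs = 100000, Rounds = 100 };	// batch size from threadtest; round count assumed

void worker( void ) {
	void ** objs = malloc( Objs * sizeof( void * ) );	// per-thread array, no sharing
	for ( int r = 0; r < Rounds; r += 1 ) {
		for ( int i = 0; i < Objs; i += 1 ) objs[i] = malloc( 8 );	// allocate 8-byte objects
		for ( int i = 0; i < Objs; i += 1 ) free( objs[i] );	// free in allocation order
	}
	free( objs );
}
\end{lstlisting}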
     49
     50
     51\subsection{shbench}
     52
     53Each thread randomly allocates and frees a number of random-sized objects.
     54It is a stress test that also uses runtime to determine efficiency of the allocator.
     55
     56
     57\subsection{Larson}
     58
     59Larson simulates a server environment.
     60Multiple threads are created where each thread allocates and frees a number of random-sized objects within a size range.
     61Before the thread terminates, it passes its array of 10,000 objects to a new child thread to continue the process.
     62The number of thread generations varies depending on the thread speed.
      63It calculates memory operations per second as an indicator of the memory allocator's performance.
     64
     65
     66\section{New Multi-Threaded Micro-Benchmarks}
     67
     68The following new benchmarks were created to assess multi-threaded programs at the KT level.
     69
     70
     71\subsection{Memory Micro-Benchmark}
     72
     73The memory micro-benchmark measures the memory overhead of an allocator.
     74It allocates a number of dynamic objects.
      75Then, by reading @/proc/self/maps@, it gets the total memory requested by the allocator from the OS.
      76The memory overhead is the difference between the memory the allocator has requested from the OS and the memory the program has allocated.
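
As an illustrative sketch (the helper below is hypothetical, not the benchmark's actual code), the total mapped memory can be obtained by summing the address ranges listed in @/proc/self/maps@:
\begin{lstlisting}
#include <stdio.h>

// Sum the sizes of all address ranges listed in /proc/self/maps.
static size_t mapped_bytes( void ) {
	FILE * maps = fopen( "/proc/self/maps", "r" );
	if ( maps == NULL ) return 0;
	size_t total = 0;
	char line[4096];
	unsigned long start, end;
	while ( fgets( line, sizeof(line), maps ) != NULL ) {
		if ( sscanf( line, "%lx-%lx", &start, &end ) == 2 )
			total += end - start;	// size of this mapping
	}
	fclose( maps );
	return total;	// memory requested from the OS
}
\end{lstlisting}
The overhead at a snapshot is then @mapped_bytes()@ minus the number of bytes the program currently has allocated.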
     77
     78\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
     79It creates a producer-consumer scenario with K producer threads and each producer has M consumer threads.
     80A producer has a separate buffer for each consumer and allocates N objects of random sizes following the given distribution for each consumer.
     81A consumer frees these objects.
     82After every memory operation, program memory usage is recorded throughout the runtime.
      83This data is then used to visualize the program's memory usage over its runtime.
     84
     85\begin{figure}
     86\centering
     87\begin{lstlisting}
     88Main Thread
     89        print memory snapshot
     90        create producer threads
     91Producer Thread
     92        set free start
     93        create consumer threads
     94        for ( N )
     95                allocate memory
     96                print memory snapshot
     97Consumer Thread
     98        wait while ( allocations < free start )
     99        for ( N )
     100                free memory
     101                print memory snapshot
     102\end{lstlisting}
     103%\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
     104\caption{Memory Overhead Benchmark (evaluate memory footprint)}
     105\label{fig:MemoryBenchFig}
     106\end{figure}
     107
     108The adjustment knobs for this micro-benchmark are:
     109\begin{description}[itemsep=0pt,parsep=0pt]
     110\item[producer:]
     111sets the number of producer threads.
     112\item[round:]
     113sets production and consumption round size.
     114\item[consumer:]
      115sets the number of consumer threads for each producer.
     116\end{description}
     117
     118The adjustment knobs for the object allocation size are:
     119\begin{description}[itemsep=0pt,parsep=0pt]
     120\item[max:]
     121sets max object size.
     122\item[min:]
     123sets min object size.
     124\item[step:]
     125sets object size increment.
     126\item[distro:]
     127sets object size distribution.
     128\item[obj:]
     129sets number of objects per thread.
     130\end{description}
     131
     132
     133\subsection{Speed Micro-Benchmark}
     134
      135The speed benchmark measures the runtime speed of individual memory-allocation routines and of sequences (chains) of routines:
     136\begin{enumerate}[itemsep=0pt,parsep=0pt]
    72137\item malloc
    73138\item realloc
     
    82147\item calloc-realloc-free
    83148\item malloc-realloc-free-calloc
    84 \end{itemize}
    85 
    86 \begin{figure}
    87 \centering
    88 \includegraphics[width=1\textwidth]{figures/bench-speed.eps}
    89 \caption{Benchmark Speed}
    90 \label{fig:benchSpeedFig}
    91 \end{figure}
    92 
    93 As laid out in figure \ref{fig:benchSpeedFig}, each chain is measured separately. Each routine in the chain is called for N objects and then
    94  those allocated objects are used when call the next routine in the allocation chain. This way we can measure the
    95  complete latency of memory allocator when multiple routines are chained together e.g. malloc-realloc-free-calloc gives
    96  us the whole picture of the major allocation routines when combined together in a chain.
    97 
    98 For each chain, time taken is recorded which then can be used to visualize performance of a memory allocator against
    99 each chain.
    100 
    101 Following knobs can be adjusted to tune memory usage.\\
    102 -maxS    :  sets max object size\\
    103 -minS    :  sets min object size\\
    104 -stepS   :  sets object size increment\\
    105 -distroS :  sets object size distribution\\
    106 -objN    :  sets number of objects per thread\\
    107 -threadN :  sets number of worker threads\\
    108 
    109 \section{Churn Benchmark} Churn benchmark measures the overall runtime speed of an allocator in a multi-threaded
    110  scenerio where each thread extinsevly allocates and frees dynamic memory.
    111 
    112 \begin{figure}
    113 \centering
    114 \includegraphics[width=1\textwidth]{figures/bench-churn.eps}
    115 \caption{Benchmark Churn}
    116 \label{fig:benchChurnFig}
    117 \end{figure}
    118 
    119 Figure \ref{fig:benchChurnFig} illustrates churn benchmark.
    120  This benchmark creates a buffer with M spots and starts K threads. Each thread randomly picks a
    121  spot out of M spots, it frees the object currently at that spot and allocates a new object for that spot. Each thread
    122  repeats this cycle for N times. Main threads measures the total time taken for the whole benchmark and that time is
    123  used to evaluate memory allocator's performance.
     149\end{enumerate}
     150
     151\VRef[Figure]{fig:SpeedBenchFig} shows the pseudo code for the speed micro-benchmark.
      152Each routine in the chain is called for N objects, and those allocated objects are then used when calling the next routine in the allocation chain.
      153This way, the complete latency of the memory allocator is measured when multiple routines are chained together, for example, malloc-realloc-free-calloc gives the whole picture of the major allocation routines combined in a chain.
      154For each chain, the time taken is recorded, which can then be used to visualize the performance of a memory allocator across chains.
     155
     156\begin{figure}
     157\centering
     158\begin{lstlisting}[morekeywords={foreach}]
     159Main Thread
     160        create worker threads
     161        foreach ( allocation chain )
     162                note time T1
     163                ...
     164                note time T2
      165chain_speed = (T2 - T1) / ( number-of-worker-threads * N )
     166Worker Thread
     167        initialize variables
     168        ...
     169        foreach ( routine in allocation chain )
     170                call routine N times
     171\end{lstlisting}
     172%\includegraphics[width=1\textwidth]{figures/bench-speed.eps}
     173\caption{Speed Benchmark}
     174\label{fig:SpeedBenchFig}
     175\end{figure}
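
As a concrete sketch of timing one chain (the 32/64-byte sizes and single-threaded timing are assumptions; the real benchmark varies sizes and threads via the knobs below), the malloc-realloc-free chain might be timed as:
\begin{lstlisting}
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

enum { N = 100000 };
static void * objs[N];

int main( void ) {
	struct timespec t1, t2;
	clock_gettime( CLOCK_MONOTONIC, &t1 );
	for ( int i = 0; i < N; i += 1 ) objs[i] = malloc( 32 );	// chain step 1
	for ( int i = 0; i < N; i += 1 ) objs[i] = realloc( objs[i], 64 );	// chain step 2: reuse step-1 objects
	for ( int i = 0; i < N; i += 1 ) free( objs[i] );	// chain step 3
	clock_gettime( CLOCK_MONOTONIC, &t2 );
	double ns = (t2.tv_sec - t1.tv_sec) * 1e9 + (t2.tv_nsec - t1.tv_nsec);
	printf( "malloc-realloc-free: %.1f ns per routine call\n", ns / (3.0 * N) );
}
\end{lstlisting}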
     176
      177The adjustment knobs for the speed benchmark are:
     178\begin{description}[itemsep=0pt,parsep=0pt]
     179\item[max:]
     180sets max object size.
     181\item[min:]
     182sets min object size.
     183\item[step:]
     184sets object size increment.
     185\item[distro:]
     186sets object size distribution.
     187\item[obj:]
     188sets number of objects per thread.
     189\item[workers:]
     190sets number of worker threads.
     191\end{description}
     192
     193
     194\subsection{Churn Benchmark}
     195
      196The churn benchmark measures the overall runtime speed of an allocator in a multi-threaded scenario where each thread extensively allocates and frees dynamic memory.
     197
     198\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
     199This benchmark creates a buffer with M spots and starts K threads.
      200Each thread randomly picks one of the M spots, frees the object currently at that spot, and allocates a new object for that spot.
      201Each thread repeats this cycle N times.
      202The main thread measures the total time taken for the whole benchmark, and that time is used to evaluate the memory allocator's performance.
     203
     204\begin{figure}
     205\centering
     206\begin{lstlisting}
     207Main Thread
     208        create worker threads
     209        note time T1
     210        ...
     211        note time T2
     212        churn_speed = (T2 - T1)
     213Worker Thread
     214        initialize variables
     215        ...
     216        for ( N )
     217                R -> random spot in array
     218                free R
     219                allocate new object at R
     220\end{lstlisting}
     221%\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
     222\caption{Churn Benchmark}
     223\label{fig:ChurnBenchFig}
     224\end{figure}
    124225
     125226Only malloc and free are used to allocate and free an object, eliminating any extra cost, such as the memcpy in realloc.
    126 Malloc/free allows us to measure latency of memory allocation only without paying any extra cost. Churn simulates a
    127 memory intensive program that can be tuned to create different scenerios.
    128 
    129 Following commandline arguments can be used to tune the benchmark.\\
    130 -threadN :  sets number of threads, K\\
    131 -cSpots  :  sets number of spots for churn, M\\
    132 -objN    :  sets number of objects per thread, N\\
    133 -maxS    :  sets max object size\\
    134 -minS    :  sets min object size\\
    135 -stepS   :  sets object size increment\\
    136 -distroS :  sets object size distribution
    137 
    138 \section{Cache Thrash}\label{sec:benchThrashSec} Cache Thrash benchmark measures allocator induced active false sharing
    139  in an allocator as illustrated in figure \ref{f:AllocatorInducedActiveFalseSharing}.
    140  If memory allocator allocates memory for multiple threads on
    141  same cache line, this can slow down the program performance. If both threads, who share one cache line, frequently
    142  read/write to their object on the cache line concurrently then this will cause cache miss everytime a thread accesse
    143  the object as the other thread might have written something at their memory location on the same cache line.
    144 
    145 \begin{figure}
    146 \centering
    147 \includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
    148 \caption{Benchmark Allocator Induced Active False Sharing}
      227Malloc/free alone measures the latency of memory allocation without paying any extra cost.
      228Churn simulates a memory-intensive program that can be tuned to create different scenarios.
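
A sketch of one churn worker under these rules follows (a per-thread buffer and a fixed 64-byte size are assumed here; the real benchmark draws sizes from the distribution knobs below):
\begin{lstlisting}
#include <stdlib.h>

enum { M = 1024, N = 1000000 };	// spots and operations per thread; values assumed

void worker( unsigned int seed ) {
	void * spots[M] = { NULL };	// buffer of M spots
	for ( int i = 0; i < N; i += 1 ) {
		int r = rand_r( &seed ) % M;	// randomly pick a spot
		free( spots[r] );	// free the object currently at the spot (free(NULL) is legal)
		spots[r] = malloc( 64 );	// allocate a new object for the spot
	}
	for ( int r = 0; r < M; r += 1 ) free( spots[r] );	// cleanup
}
\end{lstlisting}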
     229
      230The adjustment knobs for the churn benchmark are:
     231\begin{description}[itemsep=0pt,parsep=0pt]
     232\item[thread:]
     233sets number of threads, K.
     234\item[spots:]
     235sets number of spots for churn, M.
     236\item[obj:]
     237sets number of objects per thread, N.
     238\item[max:]
     239sets max object size.
     240\item[min:]
     241sets min object size.
     242\item[step:]
     243sets object size increment.
     244\item[distro:]
      245sets object size distribution.
     246\end{description}
     247
     248
     249\section{Cache Thrash}\label{sec:benchThrashSec}
     250
      251The cache-thrash benchmark measures allocator-induced active false-sharing, as illustrated in \VRef[Figure]{f:AllocatorInducedActiveFalseSharing}.
      252If the memory allocator allocates memory for multiple threads on the same cache line, program performance can suffer.
      253If two threads sharing a cache line frequently and concurrently read/write their objects on that line, every access can cause a cache miss, because the other thread may have written to its memory location on the same cache line.
     254
      255Cache thrash creates a scenario that should lead to allocator-induced false sharing if the underlying memory allocator allocates dynamic memory for multiple threads on the same cache lines.
      256Ideally, a memory allocator should distance the dynamic memory region of one thread from those of other threads.
      257Having multiple threads allocate small objects simultaneously should cause the memory allocator to place objects for multiple threads on the same cache line if it is
      258not distancing the memory of different threads.
     259
     260\VRef[Figure]{fig:benchThrashFig} shows the pseudo code for the cache-thrash micro-benchmark.
     261First, it creates K worker threads.
      262Each worker thread allocates an object and intensively reads/writes it M times to frequently invalidate cache lines and slow down other threads that might be sharing the cache line.
      263Each thread repeats this N times.
      264The main thread measures the total time taken for all worker threads to complete.
     265Worker threads sharing cache lines with each other will take longer.
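
A sketch of the worker's inner loop is shown below (object size and counts are assumptions):
\begin{lstlisting}
#include <stdlib.h>

enum { N = 1000, M = 100000, Size = 8 };	// repetitions and object size; values assumed

void worker( void ) {
	for ( int i = 0; i < N; i += 1 ) {
		volatile char * obj = malloc( Size );	// may land on a cache line shared with another thread's object
		for ( int j = 0; j < M; j += 1 )
			obj[j % Size] += 1;	// intensive read/write repeatedly invalidates the (possibly shared) line
		free( (void *)obj );
	}
}
\end{lstlisting}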
     266
     267\begin{figure}
     268\centering
     269\input{AllocInducedActiveFalseSharing}
     270\medskip
     271\begin{lstlisting}
     272Main Thread
     273        create worker threads
     274        ...
     275        signal workers to allocate
     276        ...
     277        signal workers to free
     278        ...
     279        print addresses from each $thread$
     280Worker Thread$\(_1\)$
     281        allocate, write, read, free
      282warmup memory in chunks of 16 bytes
     283        ...
     284        malloc N objects
     285        ...
     286        free objects
     287        return object address to Main Thread
     288Worker Thread$\(_2\)$
     289        // same as Worker Thread$\(_1\)$
     290\end{lstlisting}
     291%\input{MemoryOverhead}
     292%\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
     293\caption{Allocator-Induced Active False-Sharing Benchmark}
    149294\label{fig:benchThrashFig}
    150295\end{figure}
    151296
    152 Cache thrash tries to create a scenerio that should lead to allocator induced false sharing if the underlying memory
    153 allocator is allocating dynamic memory to multiple threads on same cache lines. Ideally, a memory allocator should
    154 distance dynamic memory region of one thread from other threads'. Having multiple threads allocating small objects
    155 simultanously should cause the memory allocator to allocate objects for multiple objects on the same cache line if its
    156 not distancing the memory among different threads.
    157 
    158 Figure \ref{fig:benchThrashFig} lays out flow of the cache thrash benchmark.
    159  It creates K worker threads. Each worker thread allocates an object and intensively read/write
    160  it for M times to invalidate cache lines frequently to slow down other threads who might be sharing this cache line
    161  with it. Each thread repeats this for N times. Main thread measures the total time taken to for all worker threads to
    162  complete. Worker threads sharing cahche lines with each other will take longer.
    163 
    164 Different cache access scenerios can be created using the following commandline arguments.\\
    165 -threadN :  sets number of threads, K\\
    166 -cacheIt :  iterations for cache benchmark, N\\
    167 -cacheRep:  repetations for cache benchmark, M\\
    168 -cacheObj:  object size for cache benchmark
    169 
    170 \section{Cache Scratch} Cache Scratch benchmark measures allocator induced passive false sharing in an allocator. An
    171  allocator can unintentionally induce false sharing depending upon its management of the freed objects as described in
    172  figure \ref{f:AllocatorInducedPassiveFalseSharing}. If a thread A allocates multiple objects together then they will be
    173   possibly allocated on the same cache line by the memory allocator. If the thread now passes this object to another
    174   thread B then the two of them will sharing the same cache line but this scenerio is not induced by the allocator.
    175   Instead, the program induced this situation. Now it might be possible that if thread B frees this object and then
    176   allocate an object of the same size then the allocator may return the same object which is on a cache line shared
    177   with thread A. Now this false sharing is being caused by the memory allocator although it was started by the
    178   program.
    179 
    180 \begin{figure}
    181 \centering
    182 \includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
    183 \caption{Benchmark Program Induced Passive False Sharing}
     297The adjustment knobs for cache access scenarios are:
     298\begin{description}[itemsep=0pt,parsep=0pt]
     299\item[threadN:]
     300sets number of threads, K.
     301\item[cacheIt:]
     302iterations for cache benchmark, N.
     303\item[cacheRep:]
     304repetitions for cache benchmark, M.
     305\item[cacheObj:]
     306object size for cache benchmark.
     307\end{description}
     308
     309
     310\subsection{Cache Scratch}
     311
      312The cache-scratch benchmark measures allocator-induced passive false-sharing.
      313An allocator can unintentionally induce false sharing depending upon its management of freed objects, as described in \VRef[Figure]{f:AllocatorInducedPassiveFalseSharing}.
      314If Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
      315If Thread$_1$ passes these objects to Thread$_2$, then the two threads may share the same cache line, but this scenario is not induced by the allocator;
      316instead, the program induced this situation.
      317Now if Thread$_2$ frees this object and then allocates an object of the same size, the allocator may return the same object, which is on a cache line shared with Thread$_1$.
      318Now this false sharing is caused by the memory allocator, although it was started by the program.
     319
      320The cache-scratch main-thread induces false sharing and creates a scenario that should make the memory allocator preserve the program-induced false sharing if it does not return a freed object to its owner thread and, instead, re-uses it instantly.
      321An allocator using object ownership, as described in \VRef[Section]{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
      322If the object is returned to the thread that owns it (originally allocated it), then Thread$_2$ gets a new object that is less likely to be on the same cache line as Thread$_1$'s.
     323
     324\VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
      325First, it allocates K dynamic objects together, one for each of the K worker threads, possibly causing the memory allocator to allocate these objects on the same cache line.
      326Then it creates K worker threads and passes one of the K allocated objects to each thread.
      327Each worker thread frees the object passed by the main thread.
      328Then, it allocates an object and reads/writes it repetitively M times, possibly causing frequent cache invalidations.
     329Each worker repeats this N times.
     330
     331\begin{figure}
     332\centering
     333\input{AllocInducedPassiveFalseSharing}
     334\medskip
     335\begin{lstlisting}
     336Main Thread
     337        malloc N objects for each worker thread
     338        create worker threads and pass N objects to each worker
     339        ...
     340        signal workers to allocate
     341        ...
     342        signal workers to free
     343        ...
     344        print addresses from each $thread$
     345Worker Thread$\(_1\)$
     346        allocate, write, read, free
      347warmup memory in chunks of 16 bytes
     348        ...
     349        for ( N )
     350                free an object passed by Main Thread
     351                malloc new object
     352        ...
     353        free objects
     354        return new object addresses to Main Thread
     355Worker Thread$\(_2\)$
     356        // same as Worker Thread$\(_1\)$
     357\end{lstlisting}
     358%\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
     359\caption{Program-Induced Passive False-Sharing Benchmark}
    184360\label{fig:benchScratchFig}
    185361\end{figure}
    186362
    187 Cache scratch main thread induces false sharing and creates a scenerio that should make memory allocator preserve the
    188  program-induced false sharing if it does not retur a freed object to its owner thread and, instead, re-uses it
    189  instantly. An alloator using object ownership, as described in section \ref{s:Ownership}, would be less susceptible to allocator induced passive
    190  false sharing. If the object is returned to the thread who owns it or originally allocated it then the thread B will
    191  get a new object that will be less likely to be on the same cache line as thread A.
    192 
    193 As in figure \ref{fig:benchScratchFig}, cache Scratch allocates K dynamic objects together, one for each of the K worker threads,
    194  possibly causing memory allocator to allocate these objects on the same cache-line. Then it create K worker threads and passes
    195  an object from the K allocated objects to each of the K threads. Each worker thread frees the object passed by the main thread.
    196  Then, it allocates an object and reads/writes it repetitively for M times causing frequent cache invalidations. Each worker
    197  repeats this for N times.
    198 
    199 Each thread allocating an object after freeing the original object passed by the main thread should cause the memory
    200 allocator to return the same object that was initially allocated by the main thread if the allocator did not return the
    201 intial object bakc to its owner (main thread). Then, intensive read/write on the shared cache line by multiple threads
    202 should slow down worker threads due to to high cache invalidations and misses. Main thread measures the total time
    203 taken for all the workers to complete.
    204 
    205 Similar to bechmark cache thrash in section \ref{sec:benchThrashSec}, different cache access scenerios can be created using the following commandline arguments.\\
    206 -threadN :  sets number of threads, K\\
    207 -cacheIt :  iterations for cache benchmark, N\\
    208 -cacheRep:  repetations for cache benchmark, M\\
    209 -cacheObj:  object size for cache benchmark
      363If the allocator does not return the initial object back to its owner (the main thread), each thread allocating an object after freeing the original object passed by the main thread should receive the same object from the memory allocator.
      364Then, intensive read/write on the shared cache line by multiple threads should slow down worker threads due to high cache invalidations and misses.
      365The main thread measures the total time taken for all the workers to complete.
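
A sketch of one cache-scratch worker follows (counts and size are assumptions; the initial object is allocated by the main thread):
\begin{lstlisting}
#include <stdlib.h>

enum { N = 1000, M = 100000, Size = 8 };	// values assumed

void worker( char * obj ) {	// obj allocated by the main thread
	for ( int i = 0; i < N; i += 1 ) {
		free( obj );	// free object originally owned by the main thread
		obj = malloc( Size );	// without ownership, the allocator may reuse the freed storage
		for ( int j = 0; j < M; j += 1 )
			obj[j % Size] += 1;	// read/write thrashes the line if still shared with the main thread
	}
	free( obj );
}
\end{lstlisting}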
     366
      367Similar to the cache-thrash benchmark in \VRef[Section]{sec:benchThrashSec}, different cache-access scenarios can be created using the following command-line arguments.
     368\begin{description}[itemsep=0pt,parsep=0pt]
     369\item[threads:]
     370number of threads, K.
     371\item[iterations:]
     372iterations for cache benchmark, N.
     373\item[repetitions:]
     374repetitions for cache benchmark, M.
     375\item[objsize:]
     376object size for cache benchmark.
     377\end{description}