Changeset a6e8f64
- Timestamp:
- Apr 24, 2022, 10:54:45 PM
- Branches:
- ADT, ast-experimental, master, pthread-emulation, qualifiedEnum
- Children:
- 52a532ae
- Parents:
- 16d397a
- File:
- 1 edited
doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
There are two basic approaches for evaluating computer software: benchmarks and micro-benchmarks.
\begin{description}
\item[Benchmarks]
are a suite of application programs (SPEC CPU/WEB) that are exercised in a common way (inputs) to find differences among underlying software implementations associated with an application (compiler, memory allocator, web server, \etc).
The applications are supposed to represent common execution patterns that need to perform well with respect to an underlying software implementation.
Benchmarks are often criticized for having overlapping patterns, insufficient patterns, or extraneous code that masks patterns.
\item[Micro-Benchmarks]
attempt to extract the common execution patterns associated with an application and run each pattern independently.
This approach removes any masking from extraneous application code, allows the execution pattern to be very precise, and provides an opportunity for the execution pattern to have multiple independent tuning adjustments (knobs).
Micro-benchmarks are often criticized for inadequately representing real-world applications.
\end{description}

While some crucial software components have standard benchmarks, no standard benchmark exists for testing and comparing memory allocators.
In the past, an assortment of applications has been used for benchmarking allocators~\cite{Detlefs93,Berger00,Berger01,berger02reconsidering}: P2C, GS, Espresso/Espresso-2, CFRAC/CFRAC-2, GMake, GCC, Perl/Perl-2, Gawk/Gawk-2, XPDF/XPDF-2, ROBOOP, Lindsay.
As well, an assortment of micro-benchmarks has been used for benchmarking allocators~\cite{larson99memory,Berger00,streamflow}: threadtest, shbench, Larson, consume, false sharing.
Many of these applications and micro-benchmarks are old and may not reflect current application allocation patterns.

This thesis designs and examines a new set of micro-benchmarks for memory allocators that test a variety of allocation patterns, each with multiple tuning parameters.
The aim of the micro-benchmark suite is to create a set of programs that can evaluate a memory allocator based on the performance metrics described in (FIX ME: local cite).
% These programs can be taken as a standard to benchmark an allocator's basic goals.
These programs give details of an allocator's memory overhead and speed under certain allocation patterns.
The allocation patterns are configurable (adjustment knobs) to observe an allocator's performance across a spectrum of events for a desired allocation pattern, which is seldom possible with benchmark programs.
The micro-benchmark programs control knobs by command-line arguments.

The new micro-benchmark suite measures performance by allocating dynamic objects and measuring specific metrics.
An allocator's speed is benchmarked in different ways, as are issues like false sharing.


\section{Prior Multi-Threaded Micro-Benchmarks}

Modern memory allocators, such as llheap, must handle multi-threaded programs at the KT level.
The following traditional multi-threaded micro-benchmarks are presented to give a sense of prior work~\cite{Berger00}.


\subsection{threadtest}

This benchmark stresses the ability of the allocator to handle different threads allocating and deallocating independently.
There is no interaction among threads, \ie no object sharing.
Each thread repeatedly allocates 100,000 8-byte objects then deallocates them in the order they were allocated.
The runtime of the benchmark evaluates its efficiency.
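To make the pattern concrete, the following C sketch reconstructs the threadtest kernel from the description above; it is illustrative rather than the benchmark's actual source, and the constants @ROUNDS@ and @K@ are invented for the example.
\begin{lstlisting}[language=C]
#include <pthread.h>
#include <stdlib.h>

enum { NOBJS = 100000, ROUNDS = 100, K = 4 };	// objects; repetitions and threads are illustrative

static void * worker( void * arg ) {
	void ** objs = malloc( NOBJS * sizeof( void * ) );	// pointer buffer for one round
	for ( int r = 0; r < ROUNDS; r += 1 ) {
		for ( int i = 0; i < NOBJS; i += 1 ) objs[i] = malloc( 8 );	// allocate 100,000 8-byte objects
		for ( int i = 0; i < NOBJS; i += 1 ) free( objs[i] );	// deallocate in allocation order
	}
	free( objs );
	return NULL;
}

int main() {
	pthread_t workers[K];
	for ( int i = 0; i < K; i += 1 ) pthread_create( &workers[i], NULL, worker, NULL );
	for ( int i = 0; i < K; i += 1 ) pthread_join( workers[i], NULL );	// runtime to here is the metric
}
\end{lstlisting}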
\subsection{shbench}

Each thread randomly allocates and frees a number of random-sized objects.
It is a stress test that also uses runtime to determine the efficiency of the allocator.


\subsection{Larson}

Larson simulates a server environment.
Multiple threads are created, where each thread allocates and frees a number of random-sized objects within a size range.
Before a thread terminates, it passes its array of 10,000 objects to a new child thread to continue the process.
The number of thread generations varies depending on the thread speed.
It calculates memory operations per second as an indicator of the memory allocator's performance.


\section{New Multi-Threaded Micro-Benchmarks}

The following new benchmarks were created to assess multi-threaded programs at the KT level.


\subsection{Memory Micro-Benchmark}

The memory micro-benchmark measures the memory overhead of an allocator.
It allocates a number of dynamic objects.
Then, by reading @/proc/self/maps@, it gets the total memory requested by the allocator from the OS.
It calculates the memory overhead by computing the difference between the memory the allocator has requested from the OS and the memory that the program has allocated.

\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
It creates a producer-consumer scenario with K producer threads, where each producer has M consumer threads.
A producer has a separate buffer for each consumer and allocates N objects of random sizes following a given distribution for each consumer.
A consumer frees these objects.
After every memory operation, program memory usage is recorded throughout the runtime.
This data is then used to visualize the memory usage and consumption of the program.
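A minimal C sketch of this measurement idea is shown below, under simplifying assumptions: it totals every mapping in @/proc/self/maps@, where a real probe would filter for heap-like (private, writable) regions, and a plain allocation loop stands in for the producer-consumer phase.
\begin{lstlisting}[language=C]
#include <stdio.h>
#include <stdlib.h>

// Total the address ranges listed in /proc/self/maps.  Summing every mapping
// (text, stack, libraries) overestimates; filtering is omitted for brevity.
static long mapped_bytes( void ) {
	FILE * maps = fopen( "/proc/self/maps", "r" );
	if ( maps == NULL ) return 0;
	long total = 0;  unsigned long start, end;  char line[512];
	while ( fgets( line, sizeof( line ), maps ) != NULL )
		if ( sscanf( line, "%lx-%lx", &start, &end ) == 2 ) total += end - start;
	fclose( maps );
	return total;
}

int main() {
	long before = mapped_bytes(), requested = 0;
	for ( int i = 0; i < 10000; i += 1 ) {	// stand-in for the benchmark's allocation phase
		void * obj = malloc( 64 );  (void)obj;	// deliberately retained, not freed
		requested += 64;
	}
	long overhead = mapped_bytes() - before - requested;	// allocator overhead estimate
	printf( "requested %ld bytes, allocator overhead approx. %ld bytes\n", requested, overhead );
}
\end{lstlisting}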
\begin{figure}
\centering
\begin{lstlisting}
Main Thread
	print memory snapshot
	create producer threads
Producer Thread
	set free start
	create consumer threads
	for ( N )
		allocate memory
		print memory snapshot
Consumer Thread
	wait while ( allocations < free start )
	for ( N )
		free memory
		print memory snapshot
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
\caption{Memory Overhead Benchmark (evaluate memory footprint)}
\label{fig:MemoryBenchFig}
\end{figure}

The adjustment knobs for this micro-benchmark are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[producer:]
sets the number of producer threads.
\item[round:]
sets the production and consumption round size.
\item[consumer:]
sets the number of consumer threads for each producer.
\end{description}

The adjustment knobs for the object allocation size are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[max:]
sets the maximum object size.
\item[min:]
sets the minimum object size.
\item[step:]
sets the object size increment.
\item[distro:]
sets the object size distribution.
\item[obj:]
sets the number of objects per thread.
\end{description}


\subsection{Speed Micro-Benchmark}

The speed benchmark measures the runtime speed of individual and sequences of memory allocation routines:
\begin{enumerate}[itemsep=0pt,parsep=0pt]
\item malloc
\item realloc
\item ...
\item calloc-realloc-free
\item malloc-realloc-free-calloc
\end{enumerate}

\VRef[Figure]{fig:SpeedBenchFig} shows the pseudo code for the speed micro-benchmark.
Each routine in the chain is called for N objects, and those allocated objects are then used when calling the next routine in the allocation chain.
This way, the complete latency of the memory allocator can be measured when multiple routines are chained together, \eg malloc-realloc-free-calloc gives the whole picture of the major allocation routines when combined in a chain.
For each chain, the time taken is recorded, which can then be used to visualize the performance of a memory allocator against each chain.
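As a single-threaded illustration of one chain, the following C sketch times malloc-realloc-free over N objects; the actual micro-benchmark runs chains inside worker threads and divides by the number of workers, and the object sizes here are arbitrary.
\begin{lstlisting}[language=C]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { N = 1000000 };	// objects per chain

int main() {
	struct timespec T1, T2;
	clock_gettime( CLOCK_MONOTONIC, &T1 );	// note time T1
	for ( int i = 0; i < N; i += 1 ) {
		void * obj = malloc( 42 );	// first routine in the chain
		obj = realloc( obj, 128 );	// allocated object is reused by the next routine
		free( obj );	// final routine in the chain
	}
	clock_gettime( CLOCK_MONOTONIC, &T2 );	// note time T2
	double sec = (T2.tv_sec - T1.tv_sec) + (T2.tv_nsec - T1.tv_nsec) / 1e9;
	printf( "malloc-realloc-free: %.1f ns per chain\n", sec / N * 1e9 );
}
\end{lstlisting}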
\begin{figure}
\centering
\begin{lstlisting}[morekeywords={foreach}]
Main Thread
	create worker threads
	foreach ( allocation chain )
		note time T1
		...
		note time T2
		chain_speed = (T2 - T1) / ( number-of-worker-threads * N )
Worker Thread
	initialize variables
	...
	foreach ( routine in allocation chain )
		call routine N times
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-speed.eps}
\caption{Speed Benchmark}
\label{fig:SpeedBenchFig}
\end{figure}

The adjustment knobs for memory usage are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[max:]
sets the maximum object size.
\item[min:]
sets the minimum object size.
\item[step:]
sets the object size increment.
\item[distro:]
sets the object size distribution.
\item[obj:]
sets the number of objects per thread.
\item[workers:]
sets the number of worker threads.
\end{description}


\subsection{Churn Benchmark}

The churn benchmark measures the overall runtime speed of an allocator in a multi-threaded scenario where each thread extensively allocates and frees dynamic memory.

\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
This benchmark creates a buffer with M spots and starts K threads.
Each thread randomly picks a spot out of the M spots, frees the object currently at that spot, and allocates a new object for that spot.
Each thread repeats this cycle N times.
The main thread measures the total time taken for the whole benchmark, and that time is used to evaluate the memory allocator's performance.

\begin{figure}
\centering
\begin{lstlisting}
Main Thread
	create worker threads
	note time T1
	...
	note time T2
	churn_speed = (T2 - T1)
Worker Thread
	initialize variables
	...
	for ( N )
		R -> random spot in array
		free R
		allocate new object at R
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
\caption{Churn Benchmark}
\label{fig:ChurnBenchFig}
\end{figure}

Only @malloc@ and @free@ are used to allocate and free an object, eliminating any extra cost, such as the @memcpy@ in @realloc@.
Malloc/free allows measuring the latency of memory allocation alone, without paying any extra cost.
Churn simulates a memory-intensive program that can be tuned to create different scenarios.
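The following C sketch illustrates the churn loop under the assumption of a per-thread spot buffer (avoiding synchronization on the buffer); the constants and object sizes are illustrative only.
\begin{lstlisting}[language=C]
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { K = 4, M = 1024, N = 1000000 };	// threads, spots, iterations (illustrative)

static void * churn( void * arg ) {
	unsigned int seed = (unsigned int)(size_t)arg;	// per-thread random-number seed
	void * spots[M] = { NULL };	// buffer of M spots
	for ( int i = 0; i < N; i += 1 ) {
		int r = rand_r( &seed ) % M;	// randomly pick a spot
		free( spots[r] );	// free object at that spot (free(NULL) is a no-op)
		spots[r] = malloc( 16 + r % 512 );	// allocate a new object for the spot
	}
	for ( int r = 0; r < M; r += 1 ) free( spots[r] );
	return NULL;
}

int main() {
	pthread_t workers[K];  struct timespec T1, T2;
	clock_gettime( CLOCK_MONOTONIC, &T1 );	// note time T1
	for ( int i = 0; i < K; i += 1 ) pthread_create( &workers[i], NULL, churn, (void *)(size_t)i );
	for ( int i = 0; i < K; i += 1 ) pthread_join( workers[i], NULL );
	clock_gettime( CLOCK_MONOTONIC, &T2 );	// note time T2
	double sec = (T2.tv_sec - T1.tv_sec) + (T2.tv_nsec - T1.tv_nsec) / 1e9;
	printf( "churn_speed %.3f s\n", sec );
}
\end{lstlisting}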
The adjustment knobs for churn usage are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[thread:]
sets the number of threads, K.
\item[spots:]
sets the number of spots for churn, M.
\item[obj:]
sets the number of objects per thread, N.
\item[max:]
sets the maximum object size.
\item[min:]
sets the minimum object size.
\item[step:]
sets the object size increment.
\item[distro:]
sets the object size distribution.
\end{description}


\subsection{Cache Thrash}\label{sec:benchThrashSec}

The cache-thrash benchmark measures allocator-induced active false-sharing, as illustrated in \VRef[Figure]{f:AllocatorInducedActiveFalseSharing}.
If a memory allocator places memory for multiple threads on the same cache line, program performance can degrade.
If two threads share a cache line and frequently read/write their respective objects on that line concurrently, every access can cause a cache miss, because the other thread may have written to its memory location on the same cache line.

Cache thrash tries to create a scenario that leads to allocator-induced false sharing, if the underlying memory allocator is allocating dynamic memory for multiple threads on the same cache lines.
Ideally, a memory allocator should distance the dynamic memory region of one thread from that of other threads.
Having multiple threads allocate small objects simultaneously should cause the memory allocator to allocate objects for multiple threads on the same cache line, if it is not distancing the memory among different threads.

\VRef[Figure]{fig:benchThrashFig} shows the pseudo code for the cache-thrash micro-benchmark.
First, it creates K worker threads.
Each worker thread allocates an object and intensively reads/writes it M times to invalidate cache lines frequently, slowing down other threads that might be sharing the cache line.
Each thread repeats this N times.
The main thread measures the total time taken for all worker threads to complete.
Worker threads sharing cache lines with each other will take longer.
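A simplified C sketch of the worker's hot loop follows; the warmup and the signalling between the main thread and the workers in the pseudo code are omitted, and the constants are illustrative.
\begin{lstlisting}[language=C]
#include <pthread.h>
#include <stdlib.h>

enum { K = 4, N = 1000, M = 100000, OBJSIZE = 8 };	// threads, iterations, repetitions, object size (illustrative)

static void * thrash( void * arg ) {
	for ( int n = 0; n < N; n += 1 ) {
		volatile char * obj = malloc( OBJSIZE );	// small object; may share a cache line with another thread's object
		for ( int m = 0; m < M; m += 1 )
			obj[m % OBJSIZE] += 1;	// intensive read/write repeatedly invalidates the cache line
		free( (void *)obj );
	}
	return NULL;
}

int main() {
	pthread_t workers[K];
	for ( int i = 0; i < K; i += 1 ) pthread_create( &workers[i], NULL, thrash, NULL );
	for ( int i = 0; i < K; i += 1 ) pthread_join( workers[i], NULL );	// main thread times this to completion
}
\end{lstlisting}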
\begin{figure}
\centering
\input{AllocInducedActiveFalseSharing}
\medskip
\begin{lstlisting}
Main Thread
	create worker threads
	...
	signal workers to allocate
	...
	signal workers to free
	...
	print addresses from each $thread$
Worker Thread$\(_1\)$
	allocate, write, read, free
	warmup memory in chunks of 16 bytes
	...
	malloc N objects
	...
	free objects
	return object address to Main Thread
Worker Thread$\(_2\)$
	// same as Worker Thread$\(_1\)$
\end{lstlisting}
%\input{MemoryOverhead}
%\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
\caption{Allocator-Induced Active False-Sharing Benchmark}
\label{fig:benchThrashFig}
\end{figure}

The adjustment knobs for cache access scenarios are:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[threadN:]
sets the number of threads, K.
\item[cacheIt:]
sets the number of iterations for the cache benchmark, N.
\item[cacheRep:]
sets the number of repetitions for the cache benchmark, M.
\item[cacheObj:]
sets the object size for the cache benchmark.
\end{description}


\subsection{Cache Scratch}

The cache-scratch benchmark measures allocator-induced passive false-sharing in an allocator.
An allocator can unintentionally induce false sharing depending upon its management of freed objects, as described in \VRef[Figure]{f:AllocatorInducedPassiveFalseSharing}.
If Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
If Thread$_1$ passes these objects to Thread$_2$, then both threads may share the same cache line, but this scenario is not induced by the allocator;
instead, the program induced this situation.
Now, if Thread$_2$ frees this object and then allocates an object of the same size, the allocator may return the same object, which is on a cache line shared with Thread$_1$.
This false sharing is now caused by the memory allocator, although it was started by the program.

The cache-scratch main-thread induces false sharing and creates a scenario that should make the memory allocator preserve the program-induced false sharing, if it does not return a freed object to its owner thread and, instead, re-uses it instantly.
An allocator using object ownership, as described in \VRef[Section]{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
If the object is returned to the thread that owns it or originally allocated it, then Thread$_2$ gets a new object that is less likely to be on the same cache line as Thread$_1$'s.

\VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
First, it allocates K dynamic objects together, one for each of the K worker threads, possibly causing the memory allocator to allocate these objects on the same cache line.
Then it creates K worker threads and passes an object from the K allocated objects to each of the K threads.
Each worker thread frees the object passed by the main thread.
Then, it allocates an object and reads/writes it repeatedly M times, possibly causing frequent cache invalidations.
Each worker repeats this N times.

\begin{figure}
\centering
\input{AllocInducedPassiveFalseSharing}
\medskip
\begin{lstlisting}
Main Thread
	malloc N objects for each worker thread
	create worker threads and pass N objects to each worker
	...
	signal workers to allocate
	...
	signal workers to free
	...
	print addresses from each $thread$
Worker Thread$\(_1\)$
	allocate, write, read, free
	warmup memory in chunks of 16 bytes
	...
	for ( N )
		free an object passed by Main Thread
		malloc new object
	...
	free objects
	return new object addresses to Main Thread
Worker Thread$\(_2\)$
	// same as Worker Thread$\(_1\)$
\end{lstlisting}
%\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
\caption{Program-Induced Passive False-Sharing Benchmark}
\label{fig:benchScratchFig}
\end{figure}

Each thread allocating an object after freeing the original object passed by the main thread should cause the memory allocator to return the same object that was initially allocated by the main thread, if the allocator did not return the initial object back to its owner (the main thread).
Then, intensive read/write on the shared cache line by multiple threads should slow down the worker threads due to high cache invalidations and misses.
The main thread measures the total time taken for all the workers to complete.
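The following C sketch condenses the benchmark to a single round (N = 1) per worker; warmup, signalling, and address reporting are omitted, and the constants are illustrative.
\begin{lstlisting}[language=C]
#include <pthread.h>
#include <stdlib.h>

enum { K = 4, M = 100000, OBJSIZE = 8 };	// threads, repetitions, object size (illustrative)

static void * scratch( void * arg ) {
	free( arg );	// remote free: without ownership, the allocator may reuse this storage here
	volatile char * obj = malloc( OBJSIZE );	// possibly the same storage, still on the shared cache line
	for ( int m = 0; m < M; m += 1 )
		obj[m % OBJSIZE] += 1;	// heavy access causes invalidations if the line is shared
	free( (void *)obj );
	return NULL;
}

int main() {
	pthread_t workers[K];  void * objs[K];
	for ( int i = 0; i < K; i += 1 ) objs[i] = malloc( OBJSIZE );	// allocated together: likely the same cache line
	for ( int i = 0; i < K; i += 1 ) pthread_create( &workers[i], NULL, scratch, objs[i] );	// pass one object to each worker
	for ( int i = 0; i < K; i += 1 ) pthread_join( workers[i], NULL );	// main thread times this to completion
}
\end{lstlisting}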
Similar to the cache-thrash benchmark in \VRef[Section]{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments:
\begin{description}[itemsep=0pt,parsep=0pt]
\item[threads:]
sets the number of threads, K.
\item[iterations:]
sets the number of iterations for the cache benchmark, N.
\item[repetitions:]
sets the number of repetitions for the cache benchmark, M.
\item[objsize:]
sets the object size for the cache benchmark.
\end{description}