Index: doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex	(revision c73213b2e9319532ec484b9c3c4593fcba205c9e)
+++ doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex	(revision cd1a5e82757d4c0f985e674d59731246eb66df2a)
@@ -1,3 +1,4 @@
 \chapter{Benchmarks}
+\label{s:Benchmarks}
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -70,17 +71,16 @@
 
 
-\subsection{Memory Micro-Benchmark}
-
-The memory micro-benchmark measures the memory overhead of an allocator.
-It allocates a number of dynamic objects and reads @/proc/self/proc/maps@ to get the total memory requested by the allocator from the OS.
-It calculates the memory overhead by computing the difference between the memory the allocator requests from the OS and the memory that the program allocates.
-This micro-benchmark is like Larson and stresses the ability of an allocator to deal with object sharing.
-
-\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
-It creates a producer-consumer scenario with K producer threads and each producer has M consumer threads.
-A producer has a separate buffer for each consumer and allocates N objects of random sizes following a settable distribution for each consumer.
-A consumer frees these objects.
-After every memory operation, program memory usage is recorded throughout the runtime.
-This data is used to visualize the memory usage and consumption for the program.
+\subsection{Churn Benchmark}
+\label{s:ChurnBenchmark}
+
+The churn benchmark measures the runtime speed of an allocator in a multi-threaded scenario, where each thread extensively allocates and frees dynamic memory.
+Only @malloc@ and @free@ are used to eliminate any extra cost, such as @memcpy@ in @calloc@ or @realloc@.
+Churn simulates a memory intensive program that can be tuned to create different scenarios.
+
+\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
+This benchmark creates a buffer with M spots and starts K threads.
+Each thread picks a random spot in M, frees the object currently at that spot, and allocates a new object for that spot.
+Each thread repeats this cycle N times.
+The main thread measures the total time taken for the whole benchmark and that time is used to evaluate the memory allocator's performance.
 
 \begin{figure}
@@ -88,35 +88,30 @@
 \begin{lstlisting}
 Main Thread
-	print memory snapshot
-	create producer threads
-Producer Thread (K)
-	set free start
-	create consumer threads
+	create worker threads
+	note time T1
+	...
+	note time T2
+	churn_speed = (T2 - T1)
+Worker Thread
+	initialize variables
+	...
 	for ( N )
-		allocate memory
-		print memory snapshot
-Consumer Thread (M)
-	wait while ( allocations < free start )
-	for ( N )
-		free memory
-		print memory snapshot
-\end{lstlisting}
-%\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
-\caption{Memory Footprint Micro-Benchmark}
-\label{fig:MemoryBenchFig}
-\end{figure}
-
-The global adjustment knobs for this micro-benchmark are:
-\begin{description}[itemsep=0pt,parsep=0pt]
-\item[producer (K):]
-sets the number of producer threads.
-\item[consumer (M):]
-sets number of consumers threads for each producer.
-\item[round:]
-sets production and consumption round size.
-\end{description}
-
-The adjustment knobs for object allocation are:
-\begin{description}[itemsep=0pt,parsep=0pt]
+		R = random spot in array
+		free R
+		allocate new object at R
+\end{lstlisting}
+%\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
+\caption{Churn Benchmark}
+\label{fig:ChurnBenchFig}
+\end{figure}
+
+The adjustment knobs for churn are:
+\begin{description}[itemsep=0pt,parsep=0pt]
+\item[thread:]
+number of threads (K).
+\item[spots:]
+number of spots for churn (M).
+\item[obj:]
+number of objects per thread (N).
 \item[max:]
 maximum object size.
@@ -126,11 +121,144 @@
 object size increment.
 \item[distro:]
-object size distribution.
-\item[objects (N):]
-number of objects per thread.
+object size distribution.
+\end{description}
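+
+The churn cycle can be sketched in C as follows; the names @spots@, @min_size@, and @max_size@ are illustrative stand-ins for the benchmark's knobs, and a real multi-threaded run must also synchronize or partition the spots among threads:
+\begin{lstlisting}
+// one worker's churn loop: N rounds of free-then-allocate at a random spot
+void churn( void * spots[], size_t M, size_t N,
+			size_t min_size, size_t max_size ) {
+	for ( size_t i = 0; i < N; i += 1 ) {
+		size_t r = rand() % M;				// pick a random spot
+		free( spots[r] );					// free current object (free(NULL) is legal)
+		size_t size = min_size + rand() % (max_size - min_size + 1);
+		spots[r] = malloc( size );			// allocate replacement of random size
+	}
+}
+\end{lstlisting}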
+
+
+\subsection{Cache Thrash}
+\label{sec:benchThrashSec}
+
+The cache-thrash micro-benchmark measures allocator-induced active false-sharing as illustrated in \VRef{s:AllocatorInducedActiveFalseSharing}.
+If memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
+When threads share a cache line, frequent reads/writes to their cache-line objects cause cache misses, which cause escalating delays as cache distance increases.
+
+Cache thrash tries to create a scenario that leads to false sharing, if the underlying memory allocator allocates dynamic memory to multiple threads on the same cache lines.
+Ideally, a memory allocator should distance the dynamic memory region of one thread from another.
+Having multiple threads allocate small objects simultaneously can cause a memory allocator to allocate objects on the same cache line, if it is not distancing the memory among different threads.
+
+\VRef[Figure]{fig:benchThrashFig} shows the pseudo code for the cache-thrash micro-benchmark.
+First, it creates K worker threads.
+Each worker thread allocates an object and intensively reads/writes it M times to possibly invalidate cache lines that may be shared with other threads.
+Each thread repeats this N times.
+The main thread measures the total time taken for all worker threads to complete.
+Worker threads sharing cache lines with each other will take longer.
+
+\begin{figure}
+\centering
+\input{AllocInducedActiveFalseSharing}
+\medskip
+\begin{lstlisting}
+Main Thread
+	create worker threads
+	...
+	signal workers to allocate
+	...
+	signal workers to free
+	...
+	print addresses from each $thread$
+Worker Thread$\(_1\)$
+	allocate, write, read, free
+	warmup memory in chunks of 16 bytes
+	...
+	malloc N objects
+	...
+	free objects
+	return object address to Main Thread
+Worker Thread$\(_2\)$
+	// same as Worker Thread$\(_1\)$
+\end{lstlisting}
+%\input{MemoryOverhead}
+%\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
+\caption{Allocator-Induced Active False-Sharing Benchmark}
+\label{fig:benchThrashFig}
+\end{figure}
+
+The adjustment knobs for cache access scenarios are:
+\begin{description}[itemsep=0pt,parsep=0pt]
+\item[thread:]
+number of threads (K).
+\item[iterations:]
+iterations of cache benchmark (N).
+\item[cacheRW:]
+repetitions of reads/writes to object (M).
+\item[size:]
+object size.
+\end{description}
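+
+The sharing condition cache thrash exercises can be made concrete: assuming a 64-byte cache line, two heap addresses fall on the same line exactly when they lie in the same 64-byte block, which is how the addresses returned to the main thread can be checked pairwise:
+\begin{lstlisting}
+#include <stdint.h>
+#define CACHE_LINE 64		// assumed cache-line size
+// true if two heap addresses share a cache line
+static int same_cache_line( void * p, void * q ) {
+	return (uintptr_t)p / CACHE_LINE == (uintptr_t)q / CACHE_LINE;
+}
+\end{lstlisting}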
+
+
+\subsection{Cache Scratch}
+\label{s:CacheScratch}
+
+The cache-scratch micro-benchmark measures allocator-induced passive false-sharing as illustrated in \VRef{s:AllocatorInducedPassiveFalseSharing}.
+As with cache thrash, if memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
+In this scenario, the false sharing is caused by the memory allocator, although it is initiated by the program sharing an object.
+
+% An allocator can unintentionally induce false sharing depending upon its management of the freed objects.
+% If thread Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
+% If Thread$_1$ passes these object to thread Thread$_2$, then both threads may share the same cache line but this scenerio is not induced by the allocator;
+% instead, the program induced this situation.
+% Now if Thread$_2$ frees this object and then allocate an object of the same size, the allocator may return the same object, which is on a cache line shared with thread Thread$_1$.
+
+Cache scratch tries to create a scenario that leads to false sharing and should make the memory allocator preserve the program-induced false sharing, if it does not return a freed object to its owner thread and, instead, re-uses it immediately.
+An allocator using object ownership, as described in \VRef{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
+If a freed object is returned to the thread that owns it, then a thread allocating a new object is less likely to receive one on the same cache line.
+
+\VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
+First, it allocates K dynamic objects together, one for each of the K worker threads, possibly causing the memory allocator to allocate these objects on the same cache line.
+Then it creates K worker threads and passes one of the K allocated objects to each thread.
+Each worker thread frees the object passed by the main thread.
+Then, it allocates an object and reads/writes it M times, possibly causing frequent cache invalidations.
+Each worker repeats this N times.
+
+\begin{figure}
+\centering
+\input{AllocInducedPassiveFalseSharing}
+\medskip
+\begin{lstlisting}
+Main Thread
+	malloc N objects $for$ each worker $thread$
+	create worker threads and pass N objects to each worker
+	...
+	signal workers to allocate
+	...
+	signal workers to free
+	...
+	print addresses from each $thread$
+Worker Thread$\(_1\)$
+	allocate, write, read, free
+	warmup memory in chunks of 16 bytes
+	...
+	for ( N )
+		free an object passed by Main Thread
+		malloc new object
+	...
+	free objects
+	return new object addresses to Main Thread
+Worker Thread$\(_2\)$
+	// same as Worker Thread$\(_1\)$
+\end{lstlisting}
+%\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
+\caption{Program-Induced Passive False-Sharing Benchmark}
+\label{fig:benchScratchFig}
+\end{figure}
+
+Each thread freeing the object passed by the main thread and then allocating a new object should cause the memory allocator to return that same object, initially allocated by the main thread, if the allocator does not return the freed object to its owner (the main thread).
+Intensive reads/writes on the shared cache line by multiple threads should then slow down the worker threads due to high cache invalidations and misses.
+The main thread measures the total time taken for all the workers to complete.
+
+Similar to the cache-thrash benchmark in \VRef{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.
+\begin{description}[itemsep=0pt,parsep=0pt]
+\item[threads:]
+number of threads (K).
+\item[iterations:]
+iterations of cache benchmark (N).
+\item[cacheRW:]
+repetitions of reads/writes to object (M).
+\item[size:]
+object size.
 \end{description}
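+
+The passive hand-off at the heart of cache scratch can be sketched in C; @obj_size@ and @passed@ are illustrative names, and whether the second @malloc@ returns the same address depends entirely on the allocator:
+\begin{lstlisting}
+// main thread: K adjacent allocations, likely on shared cache lines
+for ( int t = 0; t < K; t += 1 ) passed[t] = malloc( obj_size );
+// worker t: free the passed object, then allocate the same size;
+// without ownership, the allocator may return the same address,
+// recreating the cache-line sharing the free was meant to end
+free( passed[t] );
+void * mine = malloc( obj_size );	// may equal passed[t]
+\end{lstlisting}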
 
 
 \subsection{Speed Micro-Benchmark}
+\label{s:SpeedMicroBenchmark}
 
 The speed benchmark measures the runtime speed of individual and sequences of memory allocation routines:
@@ -193,15 +321,18 @@
 
 
-\subsection{Churn Benchmark}
-
-The churn benchmark measures the runtime speed of an allocator in a multi-threaded scenerio, where each thread extensively allocates and frees dynamic memory.
-Only @malloc@ and @free@ are used to eliminate any extra cost, such as @memcpy@ in @calloc@ or @realloc@.
-Churn simulates a memory intensive program that can be tuned to create different scenarios.
-
-\VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
-This benchmark creates a buffer with M spots and starts K threads.
-Each thread picks a random spot in M, frees the object currently at that spot, and allocates a new object for that spot.
-Each thread repeats this cycle N times.
-The main thread measures the total time taken for the whole benchmark and that time is used to evaluate the memory allocator's performance.
+\subsection{Memory Micro-Benchmark}
+\label{s:MemoryMicroBenchmark}
+
+The memory micro-benchmark measures the memory overhead of an allocator.
+It allocates a number of dynamic objects and reads @/proc/self/maps@ to get the total memory requested by the allocator from the OS.
+It calculates the memory overhead by computing the difference between the memory the allocator requests from the OS and the memory that the program allocates.
+This micro-benchmark is like Larson and stresses the ability of an allocator to deal with object sharing.
+
+\VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
+It creates a producer-consumer scenario with K producer threads and each producer has M consumer threads.
+A producer has a separate buffer for each consumer and allocates N objects of random sizes following a settable distribution for each consumer.
+A consumer frees these objects.
+Program memory usage is recorded after every memory operation throughout the runtime.
+This data is used to visualize the memory usage and consumption of the program.
 
 \begin{figure}
@@ -209,30 +340,35 @@
 \begin{lstlisting}
 Main Thread
-	create worker threads
-	note time T1
-	...
-	note time T2
-	churn_speed = (T2 - T1)
-Worker Thread
-	initialize variables
-	...
+	print memory snapshot
+	create producer threads
+Producer Thread (K)
+	set free start
+	create consumer threads
 	for ( N )
-		R = random spot in array
-		free R
-		allocate new object at R
-\end{lstlisting}
-%\includegraphics[width=1\textwidth]{figures/bench-churn.eps}
-\caption{Churn Benchmark}
-\label{fig:ChurnBenchFig}
-\end{figure}
-
-The adjustment knobs for churn are:
-\begin{description}[itemsep=0pt,parsep=0pt]
-\item[thread:]
-number of threads (K).
-\item[spots:]
-number of spots for churn (M).
-\item[obj:]
-number of objects per thread (N).
+		allocate memory
+		print memory snapshot
+Consumer Thread (M)
+	wait while ( allocations < free start )
+	for ( N )
+		free memory
+		print memory snapshot
+\end{lstlisting}
+%\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
+\caption{Memory Footprint Micro-Benchmark}
+\label{fig:MemoryBenchFig}
+\end{figure}
+
+The global adjustment knobs for this micro-benchmark are:
+\begin{description}[itemsep=0pt,parsep=0pt]
+\item[producer (K):]
+sets the number of producer threads.
+\item[consumer (M):]
+sets the number of consumer threads for each producer.
+\item[round:]
+sets production and consumption round size.
+\end{description}
+
+The adjustment knobs for object allocation are:
+\begin{description}[itemsep=0pt,parsep=0pt]
 \item[max:]
 maximum object size.
@@ -242,136 +378,6 @@
 object size increment.
 \item[distro:]
-object size distribution
-\end{description}
-
-
-\subsection{Cache Thrash}\label{sec:benchThrashSec}
-
-The cache-thrash micro-benchmark measures allocator-induced active false-sharing as illustrated in \VRef{s:AllocatorInducedActiveFalseSharing}.
-If memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
-When threads share a cache line, frequent reads/writes to their cache-line object causes cache misses, which cause escalating delays as cache distance increases.
-
-Cache thrash tries to create a scenerio that leads to false sharing, if the underlying memory allocator is allocating dynamic memory to multiple threads on the same cache lines.
-Ideally, a memory allocator should distance the dynamic memory region of one thread from another.
-Having multiple threads allocating small objects simultaneously can cause a memory allocator to allocate objects on the same cache line, if its not distancing the memory among different threads.
-
-\VRef[Figure]{fig:benchThrashFig} shows the pseudo code for the cache-thrash micro-benchmark.
-First, it creates K worker threads.
-Each worker thread allocates an object and intensively reads/writes it for M times to possible invalidate cache lines that may interfere with other threads sharing the same cache line.
-Each thread repeats this for N times.
-The main thread measures the total time taken to for all worker threads to complete.
-Worker threads sharing cache lines with each other will take longer.
-
-\begin{figure}
-\centering
-\input{AllocInducedActiveFalseSharing}
-\medskip
-\begin{lstlisting}
-Main Thread
-	create worker threads
-	...
-	signal workers to allocate
-	...
-	signal workers to free
-	...
-	print addresses from each $thread$
-Worker Thread$\(_1\)$
-	allocate, write, read, free
-	warmup memory in chunkc of 16 bytes
-	...
-	malloc N objects
-	...
-	free objects
-	return object address to Main Thread
-Worker Thread$\(_2\)$
-	// same as Worker Thread$\(_1\)$
-\end{lstlisting}
-%\input{MemoryOverhead}
-%\includegraphics[width=1\textwidth]{figures/bench-cache-thrash.eps}
-\caption{Allocator-Induced Active False-Sharing Benchmark}
-\label{fig:benchThrashFig}
-\end{figure}
-
-The adjustment knobs for cache access scenarios are:
-\begin{description}[itemsep=0pt,parsep=0pt]
-\item[thread:]
-number of threads (K).
-\item[iterations:]
-iterations of cache benchmark (N).
-\item[cacheRW:]
-repetitions of reads/writes to object (M).
-\item[size:]
-object size.
-\end{description}
-
-
-\subsection{Cache Scratch}
-
-The cache-scratch micro-benchmark measures allocator-induced passive false-sharing as illustrated in \VRef{s:AllocatorInducedPassiveFalseSharing}.
-As for cache thrash, if memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
-In this scenario, the false sharing is being caused by the memory allocator although it is started by the program sharing an object.
-
-% An allocator can unintentionally induce false sharing depending upon its management of the freed objects.
-% If thread Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
-% If Thread$_1$ passes these object to thread Thread$_2$, then both threads may share the same cache line but this scenerio is not induced by the allocator;
-% instead, the program induced this situation.
-% Now if Thread$_2$ frees this object and then allocate an object of the same size, the allocator may return the same object, which is on a cache line shared with thread Thread$_1$.
-
-Cache scratch tries to create a scenario that leads to false sharing and should make the memory allocator preserve the program-induced false sharing, if it does not return a freed object to its owner thread and, instead, re-uses it instantly.
-An allocator using object ownership, as described in section \VRef{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
-If the object is returned to the thread who owns it, then the thread that gets a new object is less likely to be on the same cache line.
-
-\VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
-First, it allocates K dynamic objects together, one for each of the K worker threads, possibly causing memory allocator to allocate these objects on the same cache line.
-Then it create K worker threads and passes an object from the K allocated objects to each of the K threads.
-Each worker thread frees the object passed by the main thread.
-Then, it allocates an object and reads/writes it repetitively for M times possibly causing frequent cache invalidations.
-Each worker repeats this N times.
-
-\begin{figure}
-\centering
-\input{AllocInducedPassiveFalseSharing}
-\medskip
-\begin{lstlisting}
-Main Thread
-	malloc N objects $for$ each worker $thread$
-	create worker threads and pass N objects to each worker
-	...
-	signal workers to allocate
-	...
-	signal workers to free
-	...
-	print addresses from each $thread$
-Worker Thread$\(_1\)$
-	allocate, write, read, free
-	warmup memory in chunkc of 16 bytes
-	...
-	for ( N )
-		free an object passed by Main Thread
-		malloc new object
-	...
-	free objects
-	return new object addresses to Main Thread
-Worker Thread$\(_2\)$
-	// same as Worker Thread$\(_1\)$
-\end{lstlisting}
-%\includegraphics[width=1\textwidth]{figures/bench-cache-scratch.eps}
-\caption{Program-Induced Passive False-Sharing Benchmark}
-\label{fig:benchScratchFig}
-\end{figure}
-
-Each thread allocating an object after freeing the original object passed by the main thread should cause the memory allocator to return the same object that was initially allocated by the main thread if the allocator did not return the initial object back to its owner (main thread).
-Then, intensive read/write on the shared cache line by multiple threads should slow down worker threads due to to high cache invalidations and misses.
-Main thread measures the total time taken for all the workers to complete.
-
-Similar to benchmark cache thrash in section \VRef{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.
-\begin{description}[itemsep=0pt,parsep=0pt]
-\item[threads:]
-number of threads (K).
-\item[iterations:]
-iterations of cache benchmark (N).
-\item[cacheRW:]
-repetitions of reads/writes to object (M).
-\item[size:]
-object size.
-\end{description}
+object size distribution.
+\item[objects (N):]
+number of objects per thread.
+\end{description}
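+
+A minimal C sketch of the memory snapshot, summing the writable private mappings listed in @/proc/self/maps@; counting every @rw-p@ region as allocator memory is a simplification, since stacks and globals are also writable:
+\begin{lstlisting}
+#include <stdio.h>
+// total bytes of writable private mappings for this process
+static size_t mapped_bytes( void ) {
+	FILE * maps = fopen( "/proc/self/maps", "r" );
+	if ( maps == NULL ) return 0;
+	char line[256], perms[8];
+	unsigned long start, end;  size_t total = 0;
+	while ( fgets( line, sizeof(line), maps ) )
+		if ( sscanf( line, "%lx-%lx %7s", &start, &end, perms ) == 3
+				&& perms[0] == 'r' && perms[1] == 'w' && perms[3] == 'p' )
+			total += end - start;
+	fclose( maps );
+	return total;
+}
+\end{lstlisting}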
