Index: doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex	(revision 52a532aeea4c5d07aed521ab8bb032f1a50074f7)
+++ doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex	(revision e357efb524377be08e829a02a3b82856a4aeb2f6)
@@ -22,5 +22,5 @@
 In the past, an assortment of applications have been used for benchmarking allocators~\cite{Detlefs93,Berger00,Berger01,berger02reconsidering}: P2C, GS, Espresso/Espresso-2, CFRAC/CFRAC-2, GMake, GCC, Perl/Perl-2, Gawk/Gawk-2, XPDF/XPDF-2, ROBOOP, Lindsay.
 As well, an assortment of micro-benchmarks have been used for benchmarking allocators~\cite{larson99memory,Berger00,streamflow}: threadtest, shbench, Larson, consume, false sharing.
-Many of these applications and micro-benchmark are old and may not reflect current application allocation patterns.
+Many of these benchmark applications and micro-benchmarks are old and may not reflect current application allocation patterns.
 
 This thesis designs and examines a new set of micro-benchmarks for memory allocators that test a variety of allocation patterns, each with multiple tuning parameters.
@@ -29,5 +29,5 @@
 These programs give details of an allocator's memory overhead and speed under certain allocation patterns.
 The allocation patterns are configurable (adjustment knobs) to observe an allocator's performance across a spectrum of events for a desired allocation pattern, which is seldom possible with benchmark programs.
-The micro-benchmark programs control knobs by command-line arguments.
+Each micro-benchmark program has multiple control knobs specified by command-line arguments.
 
 The new micro-benchmark suite measures performance by allocating dynamic objects and measuring specific metrics.
@@ -37,6 +37,7 @@
 \section{Prior Multi-Threaded Micro-Benchmarks}
 
-Modern memory allocators, such as llheap, must handle multi-threaded programs at the KT level.
-The following traditional multi-threaded micro-benchmarks are presented to give a sense of prior work~\cite{Berger00}.
+Modern memory allocators, such as llheap, must handle multi-threaded programs at the KT and UT levels.
+The following multi-threaded micro-benchmarks are presented to give a sense of prior work~\cite{Berger00} at the KT level.
+None of the prior work addresses multi-threading at the UT level.
 
 
@@ -45,5 +46,5 @@
 This benchmark stresses the ability of the allocator to handle different threads allocating and deallocating independently.
 There is no interaction among threads, \ie no object sharing.
-Each thread repeatedly allocate 100,000 8-byte objects then deallocates them in the order they were allocated.
+Each thread repeatedly allocates 100,000 \emph{8-byte} objects and then deallocates them in the order they were allocated.
 The total runtime of the benchmark measures the allocator's efficiency.
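The threadtest pattern above can be sketched in Python by driving the C allocator through @ctypes@ — a minimal illustration of the pattern only, not the actual C benchmark; the function names and default parameters are hypothetical, with the object count and size taken from the description above.

```python
import ctypes
import ctypes.util
import threading
import time

# Load the C library so malloc/free exercise the system allocator directly.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

def worker(n_objects, size=8):
    """Allocate n_objects size-byte objects, then free them in allocation order."""
    ptrs = [libc.malloc(size) for _ in range(n_objects)]
    for p in ptrs:                      # deallocate in allocation order
        libc.free(p)
    return len(ptrs)

def threadtest(n_threads=4, n_objects=100_000):
    """Run worker in n_threads independent threads; return elapsed seconds."""
    threads = [threading.Thread(target=worker, args=(n_objects,))
               for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```

There is no object sharing: each thread frees only what it allocated, so the total runtime isolates per-thread allocation/deallocation cost.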
 
@@ -51,5 +52,5 @@
 \subsection{shbench}
 
-Each thread randomly allocates and frees a number of random-sized objects.
+This benchmark is similar to threadtest, but each thread randomly allocates and frees a number of \emph{random-sized} objects.
 It is a stress test that also uses runtime to determine efficiency of the allocator.
 
@@ -57,5 +58,5 @@
 \subsection{Larson}
 
-Larson simulates a server environment.
+This benchmark simulates a server environment.
 Multiple threads are created where each thread allocates and frees a number of random-sized objects within a size range.
 Before the thread terminates, it passes its array of 10,000 objects to a new child thread to continue the process.
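The distinguishing feature of Larson — handing live objects to a child thread so they are freed by a different thread than the one that allocated them — can be sketched as follows; this is an illustrative @ctypes@ reconstruction, and the function names, sizes, and recursion depth are assumptions, not the benchmark's real parameters.

```python
import ctypes
import ctypes.util
import random
import threading

libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

def larson_round(objs, n_ops, min_sz=16, max_sz=128, depth=2, done=None):
    """Replace random slots, then pass the whole array to one child thread,
    so the child frees objects its parent allocated (cross-thread frees)."""
    for _ in range(n_ops):
        i = random.randrange(len(objs))
        libc.free(objs[i])                      # may free a parent's object
        objs[i] = libc.malloc(random.randint(min_sz, max_sz))
    if depth > 0:                               # hand off to a child thread
        child = threading.Thread(
            target=larson_round,
            args=(objs, n_ops, min_sz, max_sz, depth - 1, done))
        child.start()
        child.join()
    else:                                       # last generation cleans up
        for p in objs:
            libc.free(p)
        if done is not None:
            done.append(len(objs))

def larson(n_slots=100, n_ops=500):
    objs = [libc.malloc(64) for _ in range(n_slots)]
    done = []
    larson_round(objs, n_ops, done=done)
    return done[0]
```

The cross-thread frees are what stress an allocator's remote-free or ownership machinery, which per-thread benchmarks like threadtest never touch.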
@@ -66,5 +67,5 @@
 \section{New Multi-Threaded Micro-Benchmarks}
 
-The following new benchmarks were created to assess multi-threaded programs at the KT level.
+The following new benchmarks were created to assess multi-threaded programs at the KT and UT levels.
 
 
@@ -72,14 +73,14 @@
 
 The memory micro-benchmark measures the memory overhead of an allocator.
-It allocates a number of dynamic objects.
-Then, by reading @/proc/self/proc/maps@, it gets the total memory requested by the allocator from the OS.
-It calculates the memory overhead by computing the difference between the memory the allocator has requested from the OS and the memory that the program has allocated.
+It allocates a number of dynamic objects and reads @/proc/self/maps@ to get the total memory requested by the allocator from the OS.
+It calculates the memory overhead by computing the difference between the memory the allocator requests from the OS and the memory that the program allocates.
+This micro-benchmark is like Larson and stresses the ability of an allocator to deal with object sharing.
 
 \VRef[Figure]{fig:MemoryBenchFig} shows the pseudo code for the memory micro-benchmark.
 It creates a producer-consumer scenario with K producer threads and each producer has M consumer threads.
-A producer has a separate buffer for each consumer and allocates N objects of random sizes following the given distribution for each consumer.
+A producer has a separate buffer for each consumer and allocates N objects of random sizes following a settable distribution for each consumer.
 A consumer frees these objects.
 After every memory operation, program memory usage is recorded throughout the runtime.
-This data then is used to visualize the memory usage and consumption for the program.
+This data is used to visualize the memory usage and consumption for the program.
 
 \begin{figure}
@@ -89,5 +90,5 @@
 	print memory snapshot
 	create producer threads
-Producer Thread
+Producer Thread (K)
 	set free start
 	create consumer threads
@@ -95,5 +96,5 @@
 		allocate memory
 		print memory snapshot
-Consumer Thread
+Consumer Thread (M)
 	wait while ( allocations < free start )
 	for ( N )
@@ -102,30 +103,30 @@
 \end{lstlisting}
 %\includegraphics[width=1\textwidth]{figures/bench-memory.eps}
-\caption{Memory Overhead Benchmark (evaluate memory footprint)}
+\caption{Memory Footprint Micro-Benchmark}
 \label{fig:MemoryBenchFig}
 \end{figure}
 
-The adjustment knobs for this micro-benchmark are:
-\begin{description}[itemsep=0pt,parsep=0pt]
-\item[producer:]
+The global adjustment knobs for this micro-benchmark are:
+\begin{description}[itemsep=0pt,parsep=0pt]
+\item[producer (K):]
 sets the number of producer threads.
+\item[consumer (M):]
+sets number of consumer threads for each producer.
 \item[round:]
 sets production and consumption round size.
-\item[consumer:]
-sets number of consumers threads for each producer.
-\end{description}
-
-The adjustment knobs for the object allocation size are:
+\end{description}
+
+The adjustment knobs for object allocation are:
 \begin{description}[itemsep=0pt,parsep=0pt]
 \item[max:]
-sets max object size.
+maximum object size.
 \item[min:]
-sets min object size.
+minimum object size.
 \item[step:]
-sets object size increment.
+object size increment.
 \item[distro:]
-sets object size distribution.
-\item[obj:]
-sets number of objects per thread.
+object size distribution.
+\item[objects (N):]
+number of objects per thread.
 \end{description}
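The measurement side of this micro-benchmark can be sketched by parsing @/proc/self/maps@ (Linux-specific): sum the mapped address ranges before and after an allocation burst and subtract what the program itself requested. This is a minimal sketch under stated assumptions — it measures mapped address space, not resident memory, so it over-approximates the allocator's footprint; the variable names are illustrative.

```python
import os

def mapped_bytes(path="/proc/self/maps"):
    """Total bytes of address space currently mapped (0 on non-Linux)."""
    if not os.path.exists(path):
        return 0
    total = 0
    with open(path) as maps:
        for line in maps:
            # each line starts with "start-end" hexadecimal addresses
            lo, hi = line.split()[0].split("-")
            total += int(hi, 16) - int(lo, 16)
    return total

before = mapped_bytes()
block = bytearray(8 * 1024 * 1024)        # the program's own allocation
after = mapped_bytes()
# approximate allocator + runtime overhead for this burst (may be <= 0 if
# the request was served from already-mapped arena space)
overhead = (after - before) - len(block)
```

Taking such a snapshot after every memory operation, as the pseudo code does, yields the memory-usage curve that is later visualized.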
 
@@ -150,7 +151,7 @@
 
 \VRef[Figure]{fig:SpeedBenchFig} shows the pseudo code for the speed micro-benchmark.
-Each routine in the chain is called for N objects and then those allocated objects are used when call the next routine in the allocation chain.
-This way we can measure the complete latency of memory allocator when multiple routines are chained together e.g. malloc-realloc-free-calloc gives us the whole picture of the major allocation routines when combined together in a chain.
-For each chain, time taken is recorded which then can be used to visualize performance of a memory allocator against each chain.
+Each routine in the chain is called for N objects and then those allocated objects are used when calling the next routine in the allocation chain.
+This tests the latency of the memory allocator when multiple routines are chained together, \eg the call sequence malloc-realloc-free-calloc gives a complete picture of the major allocation routines when combined together.
+For each chain, the time is recorded to visualize performance of a memory allocator against each chain.
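One chain from the description, malloc-realloc-free-calloc, can be sketched with @ctypes@ — a hypothetical driver, not the thesis's C harness; the function name, batch size, and object sizes are illustrative.

```python
import ctypes
import ctypes.util
import time

libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.realloc.restype = ctypes.c_void_p
libc.realloc.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
libc.calloc.restype = ctypes.c_void_p
libc.calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

def chain_malloc_realloc_free_calloc(n, size=32):
    """Apply each routine to N objects, feeding each step's objects to the
    next step in the chain; return the chain's elapsed time in seconds."""
    start = time.perf_counter()
    ptrs = [libc.malloc(size) for _ in range(n)]        # malloc
    ptrs = [libc.realloc(p, 2 * size) for p in ptrs]    # realloc (grow)
    for p in ptrs:                                      # free
        libc.free(p)
    ptrs = [libc.calloc(1, size) for _ in range(n)]     # calloc (zero-filled)
    for p in ptrs:
        libc.free(p)
    return time.perf_counter() - start
```

Timing the chain end-to-end, rather than each routine in isolation, captures interactions such as @realloc@ reusing or moving blocks that @malloc@ just produced.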
 
 \begin{figure}
@@ -178,15 +179,15 @@
 \begin{description}[itemsep=0pt,parsep=0pt]
 \item[max:]
-sets max object size.
+maximum object size.
 \item[min:]
-sets min object size.
+minimum object size.
 \item[step:]
-sets object size increment.
+object size increment.
 \item[distro:]
-sets object size distribution.
-\item[obj:]
-sets number of objects per thread.
+object size distribution.
+\item[objects:]
+number of objects per thread.
 \item[workers:]
-sets number of worker threads.
+number of worker threads.
 \end{description}
 
@@ -194,11 +195,13 @@
 \subsection{Churn Benchmark}
 
-Churn benchmark measures the overall runtime speed of an allocator in a multi-threaded scenerio where each thread extensively allocates and frees dynamic memory.
+The churn benchmark measures the runtime speed of an allocator in a multi-threaded scenario, where each thread extensively allocates and frees dynamic memory.
+Only @malloc@ and @free@ are used to eliminate any extra cost, such as the @memcpy@ in @realloc@ or the zero-fill in @calloc@.
+Churn simulates a memory intensive program that can be tuned to create different scenarios.
 
 \VRef[Figure]{fig:ChurnBenchFig} shows the pseudo code for the churn micro-benchmark.
 This benchmark creates a buffer with M spots and starts K threads.
-Each thread randomly picks a spot out of M spots, it frees the object currently at that spot and allocates a new object for that spot.
-Each thread repeats this cycle for N times.
-Main threads measures the total time taken for the whole benchmark and that time is used to evaluate memory allocator's performance.
+Each thread picks a random spot in M, frees the object currently at that spot, and allocates a new object for that spot.
+Each thread repeats this cycle N times.
+The main thread measures the total time taken for the whole benchmark and that time is used to evaluate the memory allocator's performance.
 
 \begin{figure}
@@ -215,5 +218,5 @@
 	...
 	for ( N )
-		R -> random spot in array
+		R = random spot in array
 		free R
 		allocate new object at R
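The loop above can be sketched in Python via @ctypes@; this is an illustrative reconstruction in which each thread owns its own buffer of M spots (a shared buffer would need synchronization, which the pseudo code does not show), and the size range is an assumption.

```python
import ctypes
import ctypes.util
import random
import threading
import time

libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

def churn_thread(m_spots, n_ops, min_sz=16, max_sz=64):
    """Fill M spots, then repeatedly free a random spot and refill it."""
    spots = [libc.malloc(random.randint(min_sz, max_sz))
             for _ in range(m_spots)]
    for _ in range(n_ops):
        r = random.randrange(m_spots)       # R = random spot in array
        libc.free(spots[r])                 # free R
        spots[r] = libc.malloc(random.randint(min_sz, max_sz))
    for p in spots:                         # clean up remaining objects
        libc.free(p)

def churn(k_threads=4, m_spots=100, n_ops=1000):
    """Main thread: start K churners and time the whole benchmark."""
    threads = [threading.Thread(target=churn_thread, args=(m_spots, n_ops))
               for _ in range(k_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```

Because only @malloc@/@free@ pairs run in the hot loop, the elapsed time reflects raw allocation latency under a steady free/allocate churn.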
@@ -224,43 +227,38 @@
 \end{figure}
 
-Only malloc and free are used to allocate and free an object to eliminate any extra cost such as memcpy in realloc etc.
-Malloc/free allows us to measure latency of memory allocation only without paying any extra cost.
-Churn simulates a memory intensive program that can be tuned to create different scenarios.
-
-The adjustment knobs for churn usage are:
+The adjustment knobs for churn are:
 \begin{description}[itemsep=0pt,parsep=0pt]
 \item[thread:]
-sets number of threads, K.
+number of threads (K).
 \item[spots:]
-sets number of spots for churn, M.
+number of spots for churn (M).
 \item[obj:]
-sets number of objects per thread, N.
+number of objects per thread (N).
 \item[max:]
-sets max object size.
+maximum object size.
 \item[min:]
-sets min object size.
+minimum object size.
 \item[step:]
-sets object size increment.
+object size increment.
 \item[distro:]
-sets object size distribution
-\end{description}
-
-
-\section{Cache Thrash}\label{sec:benchThrashSec}
-
-Cache Thrash benchmark measures allocator induced active false sharing in an allocator as illustrated in figure \VRef{f:AllocatorInducedActiveFalseSharing}.
-If memory allocator allocates memory for multiple threads on same cache line, this can slow down the program performance.
-If both threads, who share one cache line, frequently read/write to their object on the cache line concurrently then this will cause cache miss every time a thread accesses the object as the other thread might have written something at their memory location on the same cache line.
-
-Cache thrash tries to create a scenerio that should lead to allocator induced false sharing if the underlying memory allocator is allocating dynamic memory to multiple threads on same cache lines.
-Ideally, a memory allocator should distance dynamic memory region of one thread from other threads'.
-Having multiple threads allocating small objects simultaneously should cause the memory allocator to allocate objects for multiple objects on the same cache line if its
-not distancing the memory among different threads.
+object size distribution.
+\end{description}
+
+
+\subsection{Cache Thrash}\label{sec:benchThrashSec}
+
+The cache-thrash micro-benchmark measures allocator-induced active false-sharing as illustrated in \VRef{s:AllocatorInducedActiveFalseSharing}.
+If memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
+When threads share a cache line, frequent reads/writes to their objects on that line cause cache misses, which cause escalating delays as cache distance increases.
+
+Cache thrash tries to create a scenario that leads to false sharing, if the underlying memory allocator is allocating dynamic memory to multiple threads on the same cache lines.
+Ideally, a memory allocator should distance the dynamic memory region of one thread from another.
+Having multiple threads allocating small objects simultaneously can cause a memory allocator to allocate objects on the same cache line, if it is not distancing the memory among different threads.
 
 \VRef[Figure]{fig:benchThrashFig} shows the pseudo code for the cache-thrash micro-benchmark.
 First, it creates K worker threads.
-Each worker thread allocates an object and intensively read/write it for M times to invalidate cache lines frequently to slow down other threads who might be sharing this cache line with it.
+Each worker thread allocates an object and intensively reads/writes it M times, possibly invalidating cache lines and interfering with other threads sharing the same cache line.
 Each thread repeats this for N times.
-Main thread measures the total time taken to for all worker threads to complete.
+The main thread measures the total time taken for all worker threads to complete.
 Worker threads sharing cache lines with each other will take longer.
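The hazard being tested — small back-to-back allocations landing on one cache line — can be observed directly by comparing allocation addresses; this sketch checks address proximity from a single thread only (the actual micro-benchmark allocates from multiple threads and times the interference), and the 64-byte line size is an assumption.

```python
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

CACHE_LINE = 64   # assumed cache-line size; the real size is architecture-specific

def same_cache_line(p, q, line=CACHE_LINE):
    """True if addresses p and q fall on the same cache line."""
    return p // line == q // line

# two small back-to-back allocations; many allocators place these adjacently
a = libc.malloc(8)
b = libc.malloc(8)
shared = same_cache_line(a, b)
libc.free(a)
libc.free(b)
```

If an allocator hands such adjacent blocks to *different* threads, the active false-sharing scenario above arises; an allocator that distances per-thread memory avoids it.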
 
@@ -297,12 +295,12 @@
 The adjustment knobs for cache access scenarios are:
 \begin{description}[itemsep=0pt,parsep=0pt]
-\item[threadN:]
-sets number of threads, K.
-\item[cacheIt:]
-iterations for cache benchmark, N.
-\item[cacheRep:]
-repetitions for cache benchmark, M.
-\item[cacheObj:]
-object size for cache benchmark.
+\item[thread:]
+number of threads (K).
+\item[iterations:]
+iterations of cache benchmark (N).
+\item[cacheRW:]
+repetitions of reads/writes to object (M).
+\item[size:]
+object size.
 \end{description}
 
@@ -310,18 +308,20 @@
 \subsection{Cache Scratch}
 
-The cache scratch benchmark measures allocator induced passive false sharing in an allocator.
-An allocator can unintentionally induce false sharing depending upon its management of the freed objects as described in \VRef[Figure]{f:AllocatorInducedPassiveFalseSharing}.
-If thread Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
-If Thread$_1$ passes these object to thread Thread$_2$, then both threads may share the same cache line but this scenerio is not induced by the allocator;
-instead, the program induced this situation.
-Now if Thread$_2$ frees this object and then allocate an object of the same size, the allocator may return the same object, which is on a cache line shared with thread Thread$_1$.
-Now this false sharing is being caused by the memory allocator although it was started by the program.
-
-The cache-scratch main-thread induces false sharing and creates a scenerio that should make the memory allocator preserve the program-induced false sharing if it does not return a freed object to its owner thread and, instead, re-uses it instantly.
+The cache-scratch micro-benchmark measures allocator-induced passive false-sharing as illustrated in \VRef{s:AllocatorInducedPassiveFalseSharing}.
+As for cache thrash, if memory is allocated for multiple threads on the same cache line, this can significantly slow down program performance.
+In this scenario, the false sharing is caused by the memory allocator, although it is initiated by the program sharing an object.
+
+% An allocator can unintentionally induce false sharing depending upon its management of the freed objects.
+% If thread Thread$_1$ allocates multiple objects together, they may be allocated on the same cache line by the memory allocator.
+% If Thread$_1$ passes these object to thread Thread$_2$, then both threads may share the same cache line but this scenerio is not induced by the allocator;
+% instead, the program induced this situation.
+% Now if Thread$_2$ frees this object and then allocate an object of the same size, the allocator may return the same object, which is on a cache line shared with thread Thread$_1$.
+
+Cache scratch tries to create a scenario that makes the memory allocator preserve the program-induced false sharing, if it does not return a freed object to its owner thread and, instead, re-uses it instantly.
 An allocator using object ownership, as described in section \VRef{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
-If the object is returned to the thread who owns it or originally allocated it, then the thread Thread$_2$ gets a new object that is less likely to be on the same cache line as Thread$_1$.
+If the object is returned to the thread that owns it, then the new object another thread receives is less likely to be on the same cache line.
 
 \VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
-First, it allocates K dynamic objects together, one for each of the K worker threads, possibly causing memory allocator to allocate these objects on the same cache-line.
+First, it allocates K dynamic objects together, one for each of the K worker threads, possibly causing the memory allocator to allocate these objects on the same cache line.
 Then it creates K worker threads and passes an object from the K allocated objects to each of the K threads.
 Each worker thread frees the object passed by the main thread.
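The allocator behaviour this benchmark probes — instantly re-using a freed block for the next same-size request instead of returning it to its owner — can be checked with a short @ctypes@ probe; the function name is hypothetical and the outcome is allocator-dependent, so it is reported rather than assumed.

```python
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

def reuses_freed_block(size=32):
    """Free a block, then request the same size; report whether the
    allocator handed back the identical address (instant re-use)."""
    p = libc.malloc(size)
    libc.free(p)
    q = libc.malloc(size)
    libc.free(q)
    return p == q
```

When instant re-use occurs across threads, a worker that frees and re-allocates inherits the original cache line, preserving the program-induced false sharing; ownership-based allocators break this pattern by returning the block to its owner.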
@@ -335,5 +335,5 @@
 \begin{lstlisting}
 Main Thread
-	malloc N objects for each worker thread
+	malloc N objects $for$ each worker $thread$
 	create worker threads and pass N objects to each worker
 	...
@@ -365,13 +365,13 @@
 The main thread measures the total time taken for all the workers to complete.
 
-Similar to benchmark cache thrash in section \VRef{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.\\
+Similar to benchmark cache thrash in section \VRef{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.
 \begin{description}[itemsep=0pt,parsep=0pt]
 \item[threads:]
-number of threads, K.
+number of threads (K).
 \item[iterations:]
-iterations for cache benchmark, N.
-\item[repetitions:]
-repetitions for cache benchmark, M.
-\item[objsize:]
-object size for cache benchmark.
-\end{description}
+iterations of cache benchmark (N).
+\item[cacheRW:]
+repetitions of reads/writes to object (M).
+\item[size:]
+object size.
+\end{description}
