Index: doc/theses/mubeen_zulfiqar_MMath/background.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/background.tex	(revision 41870a563239dced1a0ac5c7e6ad7bac319e3f7a)
+++ doc/theses/mubeen_zulfiqar_MMath/background.tex	(revision 05ffb7b232cd4ee5f90866b34af47c55ed4e4f88)
@@ -1,3 +1,22 @@
-\chapter{Background}
+\begin{comment}
+====================
+Writing Points:
+\begin{itemize}
+\item
+Classification of benchmarks.
+\item
+Literature review of current benchmarks.
+\item
+Features and limitations.
+\item
+Literature review of current memory allocators.
+\item
+Breakdown of memory allocation techniques.
+\item
+Features and limitations.
+\end{itemize}
+\end{comment}
+
+\chapter[Background]{Background\footnote{Part of this chapter draws from similar background work in~\cite{wasik.thesis} with many updates.}}
 
 
@@ -162,5 +181,5 @@
 
 A multi-threaded memory-allocator does not run any threads itself, but is used by a multi-threaded program.
-In addition to single-threaded design issues of locality and fragmentation, a multi-threaded allocator may be simultaneously accessed by multiple threads, and hence, must deal with concurrency issues such as ymutual exclusion, false sharing, and additional forms of heap blowup.
+In addition to single-threaded design issues of locality and fragmentation, a multi-threaded allocator may be simultaneously accessed by multiple threads, and hence, must deal with concurrency issues such as mutual exclusion, false sharing, and additional forms of heap blowup.
 
 
@@ -231,5 +250,5 @@
 \label{s:MultiThreadedMemoryAllocatorFeatures}
 
-By analyzing a suite of existing allocators (see \VRef{s:ExistingAllocators}), the following salient features were identified:
+The following features are used in the construction of multi-threaded memory-allocators:
 \begin{list}{\arabic{enumi}.}{\usecounter{enumi}\topsep=0.5ex\parsep=0pt\itemsep=0pt}
 \item multiple heaps
@@ -260,6 +279,7 @@
 The spectrum ranges from multiple threads using a single heap, denoted as T:1 (see \VRef[Figure]{f:SingleHeap}), to multiple threads sharing multiple heaps, denoted as T:H (see \VRef[Figure]{f:SharedHeaps}), to one thread per heap, denoted as 1:1 (see \VRef[Figure]{f:PerThreadHeap}), which is almost back to a single-threaded allocator.
 
-In the T:1 model, all threads allocate and deallocate objects from one heap.
-Memory is obtained from the freed objects or reserved memory in the heap, or from the operating system (OS);
+
+\paragraph{T:1 model} where all threads allocate and deallocate objects from one heap.
+Memory is obtained from the freed objects, or reserved memory in the heap, or from the operating system (OS);
 the heap may also return freed memory to the operating system.
 The arrows indicate the direction memory conceptually moves for each kind of operation: allocation moves memory along the path from the heap/operating-system to the user application, while deallocation moves memory along the path from the application back to the heap/operating-system.
@@ -289,12 +309,11 @@
 \end{figure}
 
-In the T:H model, each thread allocates storage from several heaps depending on certain criteria, with the goal of reducing contention by spreading allocations/deallocations across the heaps.
+
+\paragraph{T:H model} where each thread allocates storage from several heaps depending on certain criteria, with the goal of reducing contention by spreading allocations/deallocations across the heaps.
 The decision on when to create a new heap and which heap a thread allocates from depends on the allocator design.
 The performance goal is to reduce the ratio of heaps to threads.
 In general, locking is required, since more than one thread may concurrently access a heap during its lifetime, but contention is reduced because fewer threads access a specific heap.
-Two examples of this approach are:
-\begin{description}
-\item[heap pool:]
-Multiple heaps are managed in a pool, starting with a single or a fixed number of heaps that increase\-/decrease depending on contention\-/space issues.
+
+For example, multiple heaps are managed in a pool, starting with a single or a fixed number of heaps that increase\-/decrease depending on contention\-/space issues.
 At creation, a thread is associated with a heap from the pool.
 When the thread attempts an allocation and its associated heap is locked (contention), it scans for an unlocked heap in the pool.
@@ -304,23 +323,10 @@
 While the heap-pool approach often minimizes the number of extant heaps, the worse case can result in more heaps than threads;
 \eg if the number of threads is large at startup with many allocations creating a large number of heaps and then the number of threads reduces.
-\item[kernel threads:]
-Each kernel thread (CPU) executing an application has its own heap.
-A thread allocates/deallocates from/to the heap of the kernel thread on which it is executing.
-Special precautions must be taken to handle or prevent the case where a thread is preempted during allocation/deallocation and restarts execution on a different kernel thread~\cite{Dice02}.
-\end{description}
-
-In the 1:1 model (thread heaps), each thread has its own heap, which eliminates contention and locking because no thread accesses another thread's heap.
-An additional benefit of thread heaps is improved locality due to better memory layout.
-As each thread only allocates from its heap, all objects for a thread are more consolidated in the storage area for that heap, better utilizing each CPUs cache and accessing fewer pages.
-In contrast, the T:H model spreads each thread's objects over a larger area in different heaps.
-Thread heaps can also eliminate allocator-induced active false-sharing, if memory is acquired so it does not overlap at crucial boundaries with memory for another thread's heap.
-For example, assume page boundaries coincide with cache line boundaries, then if a thread heap always acquires pages of memory, no two threads share a page or cache line unless pointers are passed among them.
-Hence, allocator-induced active false-sharing in \VRef[Figure]{f:AllocatorInducedActiveFalseSharing} cannot occur because the memory for thread heaps never overlaps.
 
 Threads using multiple heaps need to determine the specific heap to access for an allocation/deallocation, \ie association of thread to heap.
 A number of techniques are used to establish this association.
 The simplest approach is for each thread to have a pointer to its associated heap (or to administrative information that points to the heap), and this pointer changes if the association changes.
-For threading systems with thread-local/specific storage, the heap pointer/data is created using this mechanism;
-otherwise, the heap routines must use approaches like hashing the thread's stack-pointer or thread-id to find its associated heap.
+For threading systems with thread-local storage, the heap pointer is created using this mechanism;
+otherwise, the heap routines must simulate thread-local storage using approaches like hashing the thread's stack-pointer or thread-id to find its associated heap.
 
 The storage management for multiple heaps is more complex than for a single heap (see \VRef[Figure]{f:AllocatorComponents}).
@@ -361,11 +367,61 @@
 In general, the cost is minimal since the majority of memory operations are completed without the use of the global heap.
 
-For thread heaps, when a kernel/user-thread terminates, there are two options for handling its heap.
+
+\paragraph{1:1 model (thread heaps)} where each thread has its own heap, which eliminates most contention and locking because threads seldom accesses another thread's heap (see ownership in \VRef{s:Ownership}).
+An additional benefit of thread heaps is improved locality due to better memory layout.
+As each thread only allocates from its heap, all objects for a thread are consolidated in the storage area for that heap, better utilizing each CPUs cache and accessing fewer pages.
+In contrast, the T:H model spreads each thread's objects over a larger area in different heaps.
+Thread heaps can also eliminate allocator-induced active false-sharing, if memory is acquired so it does not overlap at crucial boundaries with memory for another thread's heap.
+For example, assume page boundaries coincide with cache line boundaries, then if a thread heap always acquires pages of memory, no two threads share a page or cache line unless pointers are passed among them.
+Hence, allocator-induced active false-sharing in \VRef[Figure]{f:AllocatorInducedActiveFalseSharing} cannot occur because the memory for thread heaps never overlaps.
+
+When a thread terminates, there are two options for handling its heap.
 First is to free all objects in the heap to the global heap and destroy the thread heap.
-Second is to place the thread heap on a list of available heaps and reuse it for a new kernel/user thread in the future.
+Second is to place the thread heap on a list of available heaps and reuse it for a new thread in the future.
 Destroying the thread heap immediately may reduce external fragmentation sooner, since all free objects are freed to the global heap and may be reused by other threads.
 Alternatively, reusing thread heaps may improve performance if the inheriting thread makes similar allocation requests as the thread that previously held the thread heap.
 
-As multiple heaps are a key feature for a multi-threaded allocator, all further discussion assumes multiple heaps with a global heap to eliminate direct interaction with the operating system.
+
+\subsection{User-Level Threading}
+
+It is possible to use any of the heap models with user-level (M:N) threading.
+However, an important goal of user-level threading is for fast operations (creation/termination/context-switching) by not interacting with the operating system, which allows the ability to create large numbers of high-performance interacting threads ($>$ 10,000).
+It is difficult to retain this goal, if the user-threading model is directly involved with the heap model.
+\VRef[Figure]{f:UserLevelKernelHeaps} shows that virtually all user-level threading systems use whatever kernel-level heap-model provided by the language runtime.
+Hence, a user thread allocates/deallocates from/to the heap of the kernel thread on which it is currently executing.
+
+\begin{figure}
+\centering
+\input{UserKernelHeaps}
+\caption{User-Level Kernel Heaps}
+\label{f:UserLevelKernelHeaps}
+\end{figure}
+
+Adopting this model results in a subtle problem with shared heaps.
+With kernel threading, an operation that is started by a kernel thread is always completed by that thread.
+For example, if a kernel thread starts an allocation/deallocation on a shared heap, it always completes that operation with that heap even if preempted.
+Any correctness locking associated with the shared heap is preserved across preemption.
+
+However, this correctness property is not preserved for user-level threading.
+A user thread can start an allocation/deallocation on one kernel thread, be preempted (time slice), and continue running on a different kernel thread to complete the operation~\cite{Dice02}.
+When the user thread continues on the new kernel thread, it may have pointers into the previous kernel-thread's heap and hold locks associated with it.
+To get the same kernel-thread safety, time slicing must be disabled/\-enabled around these operations, so the user thread cannot jump to another kernel thread.
+However, eagerly disabling/enabling time-slicing on the allocation/deallocation fast path is expensive, because preemption is rare (10--100 milliseconds).
+Instead, techniques exist to lazily detect this case in the interrupt handler, abort the preemption, and return to the operation so it can complete atomically.
+Occasionally ignoring a preemption should be benign.
+
+
+\begin{figure}
+\centering
+\subfigure[Ownership]{
+	\input{MultipleHeapsOwnership}
+} % subfigure
+\hspace{0.25in}
+\subfigure[No Ownership]{
+	\input{MultipleHeapsNoOwnership}
+} % subfigure
+\caption{Heap Ownership}
+\label{f:HeapsOwnership}
+\end{figure}
 
 
@@ -380,17 +436,4 @@
 For the 1:1 thread:heap relationship, a thread only allocates from its own heap, and without ownership, a thread only frees objects to its own heap, which means the heap is private to its owner thread and does not require any locking, called a \newterm{private heap}.
 For the T:1/T:H models with or without ownership or the 1:1 model with ownership, a thread may free objects to different heaps, which makes each heap publicly accessible to all threads, called a \newterm{public heap}.
-
-\begin{figure}
-\centering
-\subfigure[Ownership]{
-	\input{MultipleHeapsOwnership}
-} % subfigure
-\hspace{0.25in}
-\subfigure[No Ownership]{
-	\input{MultipleHeapsNoOwnership}
-} % subfigure
-\caption{Heap Ownership}
-\label{f:HeapsOwnership}
-\end{figure}
 
 \VRef[Figure]{f:MultipleHeapStorageOwnership} shows the effect of ownership on storage layout.
@@ -417,8 +460,11 @@
 The disadvantage of ownership is deallocating to another task's heap so heaps are no longer private and require locks to provide safe concurrent access.
 
-Object ownership can be immediate or delayed, meaning objects may be returned to the owner heap immediately at deallocation or after some delay.
-A thread may delay the return by storing objects it does not own on a separate free list.
-Delaying can improve performance by batching objects for return to their owner heap and possibly reallocating these objects if storage runs out on the current heap.
-However, reallocation can result in passive false-sharing.
+Object ownership can be immediate or delayed, meaning free objects may be batched on a separate free list either by the returning or receiving thread.
+While the returning thread can batch objects, batching across multiple heaps is complex and there is no obvious time when to push back to the owner heap.
+It is better for returning threads to immediately return to the receiving thread's batch list as the receiving thread has better knowledge when to incorporate the batch list into its free pool.
+Batching leverages the fact that most allocation patterns use the contention-free fast-path so locking on the batch list is rare for both the returning and receiving threads.
+
+It is possible for heaps to steal objects rather than return them and reallocating these objects when storage runs out on a heap.
+However, stealing can result in passive false-sharing.
 For example, in \VRef[Figure]{f:AllocatorInducedPassiveFalseSharing}, Object$_2$ may be deallocated to Task$_2$'s heap initially.
 If Task$_2$ reallocates Object$_2$ before it is returned to its owner heap, then passive false-sharing may occur.
@@ -428,12 +474,10 @@
 \label{s:ObjectContainers}
 
-One approach for managing objects places headers/trailers around individual objects, meaning memory adjacent to the object is reserved for object-management information, as shown in \VRef[Figure]{f:ObjectHeaders}.
-However, this approach leads to poor cache usage, since only a portion of the cache line is holding useful information from the program's perspective.
-Spatial locality is also negatively affected.
-While the header and object are together in memory, they are generally not accessed together;
+Bracketing every allocation with headers/trailers can result in significant internal fragmentation, as shown in \VRef[Figure]{f:ObjectHeaders}.
+Especially if the headers contain redundant management information, \eg object size may be the same for many objects because programs only allocate a small set of object sizes.
+As well, it can result in poor cache usage, since only a portion of the cache line is holding useful information from the program's perspective.
+Spatial locality can also be negatively affected leading to poor cache locality~\cite{Feng05}:
+while the header and object are together in memory, they are generally not accessed together;
 \eg the object is accessed by the program when it is allocated, while the header is accessed by the allocator when the object is free.
-This difference in usage patterns can lead to poor cache locality~\cite{Feng05}.
-Additionally, placing headers on individual objects can lead to redundant management information.
-For example, if a header stores only the object size, then all objects with the same size have identical headers.
 
 \begin{figure}
@@ -443,5 +487,4 @@
 	\label{f:ObjectHeaders}
 } % subfigure
-\\
 \subfigure[Object Container]{
 	\input{Container}
@@ -452,5 +495,5 @@
 \end{figure}
 
-An alternative approach for managing objects factors common header/trailer information to a separate location in memory and organizes associated free storage into blocks called \newterm{object containers} (\newterm{superblocks} in~\cite{Berger00}), as in \VRef[Figure]{f:ObjectContainer}.
+An alternative approach factors common header/trailer information to a separate location in memory and organizes associated free storage into blocks called \newterm{object containers} (\newterm{superblocks} in~\cite{Berger00}), as in \VRef[Figure]{f:ObjectContainer}.
 The header for the container holds information necessary for all objects in the container;
 a trailer may also be used at the end of the container.
@@ -714,52 +757,2 @@
 Implementing lock-free operations for more complex data-structures (queue~\cite{Valois94}/deque~\cite{Sundell08}) is more complex.
 Michael~\cite{Michael04} and Gidenstam \etal \cite{Gidenstam05} have created lock-free variations of the Hoard allocator.
-
-
-\noindent
-====================
-
-Writing Points:
-\begin{itemize}
-\item
-Classification of benchmarks.
-\item
-Literature review of current benchmarks.
-\item
-Features and limitations.
-\item
-Literature review of current memory allocators.
-\item
-Breakdown of memory allocation techniques.
-\item
-Features and limitations.
-\end{itemize}
-
-\noindent
-====================
-
-% FIXME: cite wasik
-\cite{wasik.thesis}
-
-\section{Existing Memory Allocators}
-With dynamic allocation being an important feature of C, there are many stand-alone memory allocators that have been designed for different purposes. For this thesis, we chose 7 of the most popular and widely used memory allocators.
-
-\paragraph{dlmalloc}
-dlmalloc (FIX ME: cite allocator) is a thread-safe allocator that is single threaded and single heap. dlmalloc maintains free-lists of different sizes to store freed dynamic memory. (FIX ME: cite wasik)
-
-\paragraph{hoard}
-Hoard (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and using a heap layer framework. It has per-thread heaps that have thread-local free-lists, and a global shared heap. (FIX ME: cite wasik)
-
-\paragraph{jemalloc}
-jemalloc (FIX ME: cite allocator) is a thread-safe allocator that uses multiple arenas. Each thread is assigned an arena. Each arena has chunks that contain contagious memory regions of same size. An arena has multiple chunks that contain regions of multiple sizes.
-
-\paragraph{ptmalloc}
-ptmalloc (FIX ME: cite allocator) is a modification of dlmalloc. It is a thread-safe multi-threaded memory allocator that uses multiple heaps. ptmalloc heap has similar design to dlmalloc's heap.
-
-\paragraph{rpmalloc}
-rpmalloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses per-thread heap. Each heap has multiple size-classes and each size-class contains memory regions of the relevant size.
-
-\paragraph{tbb malloc}
-tbb malloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses private heap for each thread. Each private-heap has multiple bins of different sizes. Each bin contains free regions of the same size.
-
-\paragraph{tc malloc}
-tcmalloc (FIX ME: cite allocator) is a thread-safe allocator. It uses per-thread cache to store free objects that prevents contention on shared resources in multi-threaded application. A central free-list is used to refill per-thread cache when it gets empty.
Index: doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex	(revision 41870a563239dced1a0ac5c7e6ad7bac319e3f7a)
+++ doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex	(revision 05ffb7b232cd4ee5f90866b34af47c55ed4e4f88)
@@ -216,2 +216,28 @@
 \paragraph{Relevant Knobs}
 *** FIX ME: Insert Relevant Knobs
+
+
+
+\section{Existing Memory Allocators}
+With dynamic allocation being an important feature of C, there are many stand-alone memory allocators that have been designed for different purposes. For this thesis, we chose 7 of the most popular and widely used memory allocators.
+
+\paragraph{dlmalloc}
+dlmalloc (FIX ME: cite allocator) is a thread-safe allocator that is single threaded and single heap. dlmalloc maintains free-lists of different sizes to store freed dynamic memory. (FIX ME: cite wasik)
+
+\paragraph{hoard}
+Hoard (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and using a heap layer framework. It has per-thread heaps that have thread-local free-lists, and a global shared heap. (FIX ME: cite wasik)
+
+\paragraph{jemalloc}
+jemalloc (FIX ME: cite allocator) is a thread-safe allocator that uses multiple arenas. Each thread is assigned an arena. Each arena has chunks that contain contagious memory regions of same size. An arena has multiple chunks that contain regions of multiple sizes.
+
+\paragraph{ptmalloc}
+ptmalloc (FIX ME: cite allocator) is a modification of dlmalloc. It is a thread-safe multi-threaded memory allocator that uses multiple heaps. ptmalloc heap has similar design to dlmalloc's heap.
+
+\paragraph{rpmalloc}
+rpmalloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses per-thread heap. Each heap has multiple size-classes and each size-class contains memory regions of the relevant size.
+
+\paragraph{tbb malloc}
+tbb malloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses private heap for each thread. Each private-heap has multiple bins of different sizes. Each bin contains free regions of the same size.
+
+\paragraph{tc malloc}
+tcmalloc (FIX ME: cite allocator) is a thread-safe allocator. It uses per-thread cache to store free objects that prevents contention on shared resources in multi-threaded application. A central free-list is used to refill per-thread cache when it gets empty.
Index: doc/theses/mubeen_zulfiqar_MMath/intro.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/intro.tex	(revision 41870a563239dced1a0ac5c7e6ad7bac319e3f7a)
+++ doc/theses/mubeen_zulfiqar_MMath/intro.tex	(revision 05ffb7b232cd4ee5f90866b34af47c55ed4e4f88)
@@ -96,4 +96,7 @@
 
 \item
+Provide additional heap wrapper functions in \CFA to provide a complete orthogonal set of allocation operations and properties.
+
+\item
 Provide additional query operations to access information about an allocation:
 \begin{itemize}