Timestamp:
Feb 26, 2022, 11:14:03 AM (3 years ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
ADT, ast-experimental, enum, master, pthread-emulation, qualifiedEnum
Children:
028404f, 9c5aef9, eb3bc52
Parents:
41870a5
Message:

more editing of Mubeen's thesis

Location:
doc/theses/mubeen_zulfiqar_MMath
Files:
3 edited

  • doc/theses/mubeen_zulfiqar_MMath/background.tex

    r41870a5 r05ffb7b  
    1 \chapter{Background}
     1\begin{comment}
     2====================
     3Writing Points:
     4\begin{itemize}
     5\item
     6Classification of benchmarks.
     7\item
     8Literature review of current benchmarks.
     9\item
     10Features and limitations.
     11\item
     12Literature review of current memory allocators.
     13\item
     14Breakdown of memory allocation techniques.
     15\item
     16Features and limitations.
     17\end{itemize}
     18\end{comment}
     19
     20\chapter[Background]{Background\footnote{Part of this chapter draws from similar background work in~\cite{wasik.thesis} with many updates.}}
    221
    322
     
    162181
    163182A multi-threaded memory-allocator does not run any threads itself, but is used by a multi-threaded program.
    164 In addition to single-threaded design issues of locality and fragmentation, a multi-threaded allocator may be simultaneously accessed by multiple threads, and hence, must deal with concurrency issues such as ymutual exclusion, false sharing, and additional forms of heap blowup.
     183In addition to single-threaded design issues of locality and fragmentation, a multi-threaded allocator may be simultaneously accessed by multiple threads, and hence, must deal with concurrency issues such as mutual exclusion, false sharing, and additional forms of heap blowup.
    165184
    166185
     
    231250\label{s:MultiThreadedMemoryAllocatorFeatures}
    232251
    233 By analyzing a suite of existing allocators (see \VRef{s:ExistingAllocators}), the following salient features were identified:
     252The following features are used in the construction of multi-threaded memory-allocators:
    234253\begin{list}{\arabic{enumi}.}{\usecounter{enumi}\topsep=0.5ex\parsep=0pt\itemsep=0pt}
    235254\item multiple heaps
     
    260279The spectrum ranges from multiple threads using a single heap, denoted as T:1 (see \VRef[Figure]{f:SingleHeap}), to multiple threads sharing multiple heaps, denoted as T:H (see \VRef[Figure]{f:SharedHeaps}), to one thread per heap, denoted as 1:1 (see \VRef[Figure]{f:PerThreadHeap}), which is almost back to a single-threaded allocator.
    261280
    262 In the T:1 model, all threads allocate and deallocate objects from one heap.
    263 Memory is obtained from the freed objects or reserved memory in the heap, or from the operating system (OS);
     281
     282\paragraph{T:1 model} where all threads allocate and deallocate objects from one heap.
     283Memory is obtained from freed objects or reserved memory in the heap, or from the operating system (OS);
    264284the heap may also return freed memory to the operating system.
    265285The arrows indicate the direction memory conceptually moves for each kind of operation: allocation moves memory along the path from the heap/operating-system to the user application, while deallocation moves memory along the path from the application back to the heap/operating-system.
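
A minimal sketch of the T:1 model (illustrative, not from the thesis): every thread funnels through one lock protecting the single heap, where heap_alloc/heap_free are hypothetical stand-ins for the heap's internal routines.

\begin{verbatim}
#include <pthread.h>
#include <stddef.h>

extern void * heap_alloc( size_t );   // hypothetical internal heap routines
extern void heap_free( void * );

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

void * t1_malloc( size_t size ) {
    pthread_mutex_lock( &heap_lock );   // serialize all threads on one heap
    void * p = heap_alloc( size );      // free objects, reserved memory, or OS
    pthread_mutex_unlock( &heap_lock );
    return p;
}

void t1_free( void * p ) {
    pthread_mutex_lock( &heap_lock );
    heap_free( p );                     // storage returns to the single heap
    pthread_mutex_unlock( &heap_lock );
}
\end{verbatim}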
     
    289309\end{figure}
    290310
    291 In the T:H model, each thread allocates storage from several heaps depending on certain criteria, with the goal of reducing contention by spreading allocations/deallocations across the heaps.
     311
     312\paragraph{T:H model} where each thread allocates storage from several heaps depending on certain criteria, with the goal of reducing contention by spreading allocations/deallocations across the heaps.
    292313The decision on when to create a new heap and which heap a thread allocates from depends on the allocator design.
    293314The performance goal is to reduce the ratio of heaps to threads.
    294315In general, locking is required, since more than one thread may concurrently access a heap during its lifetime, but contention is reduced because fewer threads access a specific heap.
    295 Two examples of this approach are:
    296 \begin{description}
    297 \item[heap pool:]
    298 Multiple heaps are managed in a pool, starting with a single or a fixed number of heaps that increase\-/decrease depending on contention\-/space issues.
     316
     317For example, multiple heaps are managed in a pool, starting with a single or a fixed number of heaps that increase\-/decrease depending on contention\-/space issues.
    299318At creation, a thread is associated with a heap from the pool.
    300319When the thread attempts an allocation and its associated heap is locked (contention), it scans for an unlocked heap in the pool.
     
    304323While the heap-pool approach often minimizes the number of extant heaps, the worst case can result in more heaps than threads;
    305324\eg a large number of threads at startup can create many heaps, which persist after the number of threads decreases.
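
The scan can be sketched as follows (illustrative names and fixed pool size; a real allocator grows/shrinks the pool under contention/space pressure):

\begin{verbatim}
#include <pthread.h>

#define NHEAPS 8
typedef struct Heap { pthread_mutex_t lock; /* free lists ... */ } Heap;
extern Heap pool[NHEAPS];        // heap pool, assumed initialized
static __thread int my_heap;     // index of the thread's associated heap

Heap * acquire_heap( void ) {
    for ( ;; ) {
        // fast path: try the associated heap first
        if ( pthread_mutex_trylock( &pool[my_heap].lock ) == 0 )
            return &pool[my_heap];
        // contention: scan the pool for any unlocked heap
        for ( int i = 0; i < NHEAPS; i += 1 ) {
            if ( pthread_mutex_trylock( &pool[i].lock ) == 0 ) {
                my_heap = i;     // re-associate with the uncontended heap
                return &pool[i];
            }
        }
    }
}
\end{verbatim}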
    306 \item[kernel threads:]
    307 Each kernel thread (CPU) executing an application has its own heap.
    308 A thread allocates/deallocates from/to the heap of the kernel thread on which it is executing.
    309 Special precautions must be taken to handle or prevent the case where a thread is preempted during allocation/deallocation and restarts execution on a different kernel thread~\cite{Dice02}.
    310 \end{description}
    311 
    312 In the 1:1 model (thread heaps), each thread has its own heap, which eliminates contention and locking because no thread accesses another thread's heap.
    313 An additional benefit of thread heaps is improved locality due to better memory layout.
    314 As each thread only allocates from its heap, all objects for a thread are more consolidated in the storage area for that heap, better utilizing each CPUs cache and accessing fewer pages.
    315 In contrast, the T:H model spreads each thread's objects over a larger area in different heaps.
    316 Thread heaps can also eliminate allocator-induced active false-sharing, if memory is acquired so it does not overlap at crucial boundaries with memory for another thread's heap.
    317 For example, assume page boundaries coincide with cache line boundaries, then if a thread heap always acquires pages of memory, no two threads share a page or cache line unless pointers are passed among them.
    318 Hence, allocator-induced active false-sharing in \VRef[Figure]{f:AllocatorInducedActiveFalseSharing} cannot occur because the memory for thread heaps never overlaps.
    319325
    320326Threads using multiple heaps need to determine the specific heap to access for an allocation/deallocation, \ie association of thread to heap.
    321327A number of techniques are used to establish this association.
    322328The simplest approach is for each thread to have a pointer to its associated heap (or to administrative information that points to the heap), and this pointer changes if the association changes.
    323 For threading systems with thread-local/specific storage, the heap pointer/data is created using this mechanism;
    324 otherwise, the heap routines must use approaches like hashing the thread's stack-pointer or thread-id to find its associated heap.
     329For threading systems with thread-local storage, the heap pointer is created using this mechanism;
     330otherwise, the heap routines must simulate thread-local storage using approaches like hashing the thread's stack-pointer or thread-id to find its associated heap.
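
Both association mechanisms can be sketched as follows (illustrative; the hash fallback assumes the thread id casts to an integer, which holds on Linux but is not guaranteed by POSIX):

\begin{verbatim}
#include <pthread.h>
#include <stddef.h>

typedef struct Heap Heap;
extern Heap * acquire_or_create_heap( void );  // hypothetical helper
extern Heap * heap_table[64];                  // hypothetical fixed table

static __thread Heap * my_heap;    // with thread-local storage

Heap * heap_via_tls( void ) {
    if ( my_heap == NULL ) my_heap = acquire_or_create_heap();
    return my_heap;
}

Heap * heap_via_hash( void ) {     // without TLS: hash the thread id
    return heap_table[ (unsigned long)pthread_self() % 64 ];
}
\end{verbatim}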
    325331
    326332The storage management for multiple heaps is more complex than for a single heap (see \VRef[Figure]{f:AllocatorComponents}).
     
    361367In general, the cost is minimal since the majority of memory operations are completed without the use of the global heap.
    362368
    363 For thread heaps, when a kernel/user-thread terminates, there are two options for handling its heap.
     369
     370\paragraph{1:1 model (thread heaps)} where each thread has its own heap, which eliminates most contention and locking because threads seldom access another thread's heap (see ownership in \VRef{s:Ownership}).
     371An additional benefit of thread heaps is improved locality due to better memory layout.
     372As each thread only allocates from its heap, all objects for a thread are consolidated in the storage area for that heap, better utilizing each CPU's cache and accessing fewer pages.
     373In contrast, the T:H model spreads each thread's objects over a larger area in different heaps.
     374Thread heaps can also eliminate allocator-induced active false-sharing, if memory is acquired so it does not overlap at crucial boundaries with memory for another thread's heap.
     375For example, assume page boundaries coincide with cache-line boundaries; then, if a thread heap always acquires whole pages of memory, no two threads share a page or cache line unless pointers are passed among them.
     376Hence, allocator-induced active false-sharing in \VRef[Figure]{f:AllocatorInducedActiveFalseSharing} cannot occur because the memory for thread heaps never overlaps.
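
For example, a thread heap can guarantee non-overlap by acquiring storage only in whole, page-aligned chunks directly from the operating system (a sketch using mmap):

\begin{verbatim}
#include <sys/mman.h>
#include <unistd.h>
#include <stddef.h>

// extend a thread heap with page-aligned storage never shared with
// another heap, so no cross-heap cache-line sharing can occur
void * heap_extend( size_t size ) {
    size_t page = (size_t)sysconf( _SC_PAGESIZE );
    size = ( size + page - 1 ) & ~( page - 1 );   // round up to whole pages
    void * chunk = mmap( NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );
    return chunk == MAP_FAILED ? NULL : chunk;
}
\end{verbatim}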
     377
     378When a thread terminates, there are two options for handling its heap.
    364379First is to free all objects in the heap to the global heap and destroy the thread heap.
    365 Second is to place the thread heap on a list of available heaps and reuse it for a new kernel/user thread in the future.
     380Second is to place the thread heap on a list of available heaps and reuse it for a new thread in the future.
    366381Destroying the thread heap immediately may reduce external fragmentation sooner, since all free objects are freed to the global heap and may be reused by other threads.
    367382Alternatively, reusing thread heaps may improve performance if the inheriting thread makes similar allocation requests as the thread that previously held the thread heap.
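
The two termination policies can be sketched as follows (hypothetical helper routines):

\begin{verbatim}
typedef struct Heap Heap;
extern void free_all_to_global( Heap * );  // hypothetical helpers
extern void destroy_heap( Heap * );
extern void push_reuse_list( Heap * );

void thread_heap_terminate( Heap * h, int reuse ) {
    if ( reuse ) push_reuse_list( h );     // option 2: next thread inherits
    else { free_all_to_global( h ); destroy_heap( h ); }  // option 1
}
\end{verbatim}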
    368383
    369 As multiple heaps are a key feature for a multi-threaded allocator, all further discussion assumes multiple heaps with a global heap to eliminate direct interaction with the operating system.
     384
     385\subsection{User-Level Threading}
     386
     387It is possible to use any of the heap models with user-level (M:N) threading.
     388However, an important goal of user-level threading is fast operations (creation/termination/context-switching) that avoid interacting with the operating system, making it possible to create large numbers of high-performance interacting threads ($>$ 10,000).
     389It is difficult to retain this goal, if the user-threading model is directly involved with the heap model.
     390\VRef[Figure]{f:UserLevelKernelHeaps} shows that virtually all user-level threading systems use whatever kernel-level heap-model is provided by the language runtime.
     391Hence, a user thread allocates/deallocates from/to the heap of the kernel thread on which it is currently executing.
     392
     393\begin{figure}
     394\centering
     395\input{UserKernelHeaps}
     396\caption{User-Level Kernel Heaps}
     397\label{f:UserLevelKernelHeaps}
     398\end{figure}
     399
     400Adopting this model results in a subtle problem with shared heaps.
     401With kernel threading, an operation that is started by a kernel thread is always completed by that thread.
     402For example, if a kernel thread starts an allocation/deallocation on a shared heap, it always completes that operation with that heap even if preempted.
     403Any correctness locking associated with the shared heap is preserved across preemption.
     404
     405However, this correctness property is not preserved for user-level threading.
     406A user thread can start an allocation/deallocation on one kernel thread, be preempted (time slice), and continue running on a different kernel thread to complete the operation~\cite{Dice02}.
     407When the user thread continues on the new kernel thread, it may have pointers into the previous kernel-thread's heap and hold locks associated with it.
     408To get the same kernel-thread safety, time slicing must be disabled/\-enabled around these operations, so the user thread cannot jump to another kernel thread.
     409However, eagerly disabling/enabling time-slicing on the allocation/deallocation fast path is expensive, because preemption is rare (10--100 milliseconds).
     410Instead, techniques exist to lazily detect this case in the interrupt handler, abort the preemption, and return to the operation so it can complete atomically.
     411Occasionally ignoring a preemption should be benign.
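
A sketch of the lazy technique (assumed runtime hooks, not the thesis implementation): the fast path only sets a thread-local flag, and the time-slice interrupt-handler defers preemption while the flag is set.

\begin{verbatim}
#include <signal.h>
#include <stddef.h>

extern void * heap_alloc_current_kt( size_t );  // hypothetical helpers
extern void yield_to_scheduler( void );

static __thread volatile sig_atomic_t in_alloc;  // inside allocator?

void * ut_malloc( size_t size ) {
    in_alloc = 1;                    // cheap store, no kernel interaction
    void * p = heap_alloc_current_kt( size );
    in_alloc = 0;
    return p;
}

void timeslice_handler( int sig ) {  // runtime's preemption handler
    (void)sig;
    if ( in_alloc ) return;          // abort preemption; operation completes
    yield_to_scheduler();            // otherwise, context switch as usual
}
\end{verbatim}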
     412
     413
     414\begin{figure}
     415\centering
     416\subfigure[Ownership]{
     417        \input{MultipleHeapsOwnership}
     418} % subfigure
     419\hspace{0.25in}
     420\subfigure[No Ownership]{
     421        \input{MultipleHeapsNoOwnership}
     422} % subfigure
     423\caption{Heap Ownership}
     424\label{f:HeapsOwnership}
     425\end{figure}
    370426
    371427
     
    380436For the 1:1 thread:heap relationship, a thread only allocates from its own heap, and without ownership, a thread only frees objects to its own heap, which means the heap is private to its owner thread and does not require any locking, called a \newterm{private heap}.
    381437For the T:1/T:H models with or without ownership or the 1:1 model with ownership, a thread may free objects to different heaps, which makes each heap publicly accessible to all threads, called a \newterm{public heap}.
    382 
    383 \begin{figure}
    384 \centering
    385 \subfigure[Ownership]{
    386         \input{MultipleHeapsOwnership}
    387 } % subfigure
    388 \hspace{0.25in}
    389 \subfigure[No Ownership]{
    390         \input{MultipleHeapsNoOwnership}
    391 } % subfigure
    392 \caption{Heap Ownership}
    393 \label{f:HeapsOwnership}
    394 \end{figure}
    395438
    396439\VRef[Figure]{f:MultipleHeapStorageOwnership} shows the effect of ownership on storage layout.
     
    417460The disadvantage of ownership is that deallocating to another task's heap means heaps are no longer private, and hence require locks to provide safe concurrent access.
    418461
    419 Object ownership can be immediate or delayed, meaning objects may be returned to the owner heap immediately at deallocation or after some delay.
    420 A thread may delay the return by storing objects it does not own on a separate free list.
    421 Delaying can improve performance by batching objects for return to their owner heap and possibly reallocating these objects if storage runs out on the current heap.
    422 However, reallocation can result in passive false-sharing.
     462Object ownership can be immediate or delayed, meaning free objects may be batched on a separate free list either by the returning or receiving thread.
     463While the returning thread can batch objects, batching across multiple heaps is complex and there is no obvious time when to push back to the owner heap.
     464It is better for returning threads to push objects immediately onto the receiving thread's batch list, as the receiving thread knows best when to incorporate the batch list into its free pool.
     465Batching leverages the fact that most allocation patterns use the contention-free fast path, so locking on the batch list is rare for both the returning and receiving threads.
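
A sketch of delayed ownership with a per-heap batch list (illustrative names): non-owning threads push freed objects under the batch lock, and the owning thread merges the batch into its free pool at its convenience.

\begin{verbatim}
#include <pthread.h>
#include <stddef.h>

typedef struct Free { struct Free * next; } Free;
typedef struct Heap {
    pthread_mutex_t batch_lock;
    Free * batch;                    // objects returned by non-owning threads
    /* owner's free lists ... */
} Heap;

void batch_free( Heap * owner, void * p ) {   // returning thread
    Free * node = p;
    pthread_mutex_lock( &owner->batch_lock ); // rarely contended
    node->next = owner->batch;
    owner->batch = node;
    pthread_mutex_unlock( &owner->batch_lock );
}

Free * take_batch( Heap * h ) {               // receiving (owner) thread
    pthread_mutex_lock( &h->batch_lock );
    Free * list = h->batch;
    h->batch = NULL;
    pthread_mutex_unlock( &h->batch_lock );
    return list;                              // merge into owner's free pool
}
\end{verbatim}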
     466
     467It is possible for heaps to steal objects rather than return them, reallocating these objects when storage runs out on a heap.
     468However, stealing can result in passive false-sharing.
    423469For example, in \VRef[Figure]{f:AllocatorInducedPassiveFalseSharing}, Object$_2$ may be deallocated to Task$_2$'s heap initially.
    424470If Task$_2$ reallocates Object$_2$ before it is returned to its owner heap, then passive false-sharing may occur.
     
    428474\label{s:ObjectContainers}
    429475
    430 One approach for managing objects places headers/trailers around individual objects, meaning memory adjacent to the object is reserved for object-management information, as shown in \VRef[Figure]{f:ObjectHeaders}.
    431 However, this approach leads to poor cache usage, since only a portion of the cache line is holding useful information from the program's perspective.
    432 Spatial locality is also negatively affected.
    433 While the header and object are together in memory, they are generally not accessed together;
     476Bracketing every allocation with headers/trailers can result in significant internal fragmentation, as shown in \VRef[Figure]{f:ObjectHeaders},
     477especially if the headers contain redundant management information; \eg the object size is repeated in every header when programs allocate only a small set of object sizes.
     478As well, it can result in poor cache usage, since only a portion of the cache line is holding useful information from the program's perspective.
     479Spatial locality can also be negatively affected leading to poor cache locality~\cite{Feng05}:
     480while the header and object are together in memory, they are generally not accessed together;
    434481\eg the object is accessed by the program when it is allocated, while the header is accessed by the allocator when the object is free.
    435 This difference in usage patterns can lead to poor cache locality~\cite{Feng05}.
    436 Additionally, placing headers on individual objects can lead to redundant management information.
    437 For example, if a header stores only the object size, then all objects with the same size have identical headers.
    438482
    439483\begin{figure}
     
    443487        \label{f:ObjectHeaders}
    444488} % subfigure
    445 \\
    446489\subfigure[Object Container]{
    447490        \input{Container}
     
    452495\end{figure}
    453496
    454 An alternative approach for managing objects factors common header/trailer information to a separate location in memory and organizes associated free storage into blocks called \newterm{object containers} (\newterm{superblocks} in~\cite{Berger00}), as in \VRef[Figure]{f:ObjectContainer}.
     497An alternative approach factors common header/trailer information to a separate location in memory and organizes associated free storage into blocks called \newterm{object containers} (\newterm{superblocks} in~\cite{Berger00}), as in \VRef[Figure]{f:ObjectContainer}.
    455498The header for the container holds information necessary for all objects in the container;
    456499a trailer may also be used at the end of the container.
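
The two layouts can be contrasted with illustrative struct definitions (not the thesis data structures):

\begin{verbatim}
#include <stddef.h>

struct ObjectHeader {            // per-object: a header brackets every object
    size_t size;                 // duplicated in every same-size object
    /* object data follows */
};

struct Container {               // container/superblock: one header for all
    size_t object_size;          // factored out, common to all objects
    struct Container * next;     // link for the heap's container lists
    /* array of same-size object slots follows; aligning containers lets
       an object's container be found by masking the object's address */
};
\end{verbatim}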
     
    714757Implementing lock-free operations for more complex data-structures (queue~\cite{Valois94}/deque~\cite{Sundell08}) is significantly harder.
    715758Michael~\cite{Michael04} and Gidenstam \etal \cite{Gidenstam05} have created lock-free variations of the Hoard allocator.
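
For example, a lock-free push onto a free-list is only a few lines with C11 atomics (a sketch; pop must additionally guard against the ABA problem, \eg with a version counter, which is part of why queues and deques are harder):

\begin{verbatim}
#include <stdatomic.h>

typedef struct Node { struct Node * next; } Node;
static _Atomic(Node *) top;        // top of the lock-free free-list

void lf_push( Node * n ) {
    n->next = atomic_load( &top );
    // retry until top is swung from n->next to n; on failure the CAS
    // reloads the current top into n->next
    while ( ! atomic_compare_exchange_weak( &top, &n->next, n ) ) {}
}
\end{verbatim}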
    716 
    717 
    718 \noindent
    719 ====================
    720 
    721 Writing Points:
    722 \begin{itemize}
    723 \item
    724 Classification of benchmarks.
    725 \item
    726 Literature review of current benchmarks.
    727 \item
    728 Features and limitations.
    729 \item
    730 Literature review of current memory allocators.
    731 \item
    732 Breakdown of memory allocation techniques.
    733 \item
    734 Features and limitations.
    735 \end{itemize}
    736 
    737 \noindent
    738 ====================
    739 
    740 % FIXME: cite wasik
    741 \cite{wasik.thesis}
    742 
    743 \section{Existing Memory Allocators}
    744 With dynamic allocation being an important feature of C, there are many stand-alone memory allocators that have been designed for different purposes. For this thesis, we chose 7 of the most popular and widely used memory allocators.
    745 
    746 \paragraph{dlmalloc}
    747 dlmalloc (FIX ME: cite allocator) is a thread-safe allocator that is single threaded and single heap. dlmalloc maintains free-lists of different sizes to store freed dynamic memory. (FIX ME: cite wasik)
    748 
    749 \paragraph{hoard}
    750 Hoard (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and using a heap layer framework. It has per-thread heaps that have thread-local free-lists, and a global shared heap. (FIX ME: cite wasik)
    751 
    752 \paragraph{jemalloc}
    753 jemalloc (FIX ME: cite allocator) is a thread-safe allocator that uses multiple arenas. Each thread is assigned an arena. Each arena has chunks that contain contagious memory regions of same size. An arena has multiple chunks that contain regions of multiple sizes.
    754 
    755 \paragraph{ptmalloc}
    756 ptmalloc (FIX ME: cite allocator) is a modification of dlmalloc. It is a thread-safe multi-threaded memory allocator that uses multiple heaps. ptmalloc heap has similar design to dlmalloc's heap.
    757 
    758 \paragraph{rpmalloc}
    759 rpmalloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses per-thread heap. Each heap has multiple size-classes and each size-class contains memory regions of the relevant size.
    760 
    761 \paragraph{tbb malloc}
    762 tbb malloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses private heap for each thread. Each private-heap has multiple bins of different sizes. Each bin contains free regions of the same size.
    763 
    764 \paragraph{tc malloc}
    765 tcmalloc (FIX ME: cite allocator) is a thread-safe allocator. It uses per-thread cache to store free objects that prevents contention on shared resources in multi-threaded application. A central free-list is used to refill per-thread cache when it gets empty.
  • doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex

    r41870a5 r05ffb7b  
    216216\paragraph{Relevant Knobs}
    217217*** FIX ME: Insert Relevant Knobs
     218
     219
     220
     221\section{Existing Memory Allocators}
     222Because dynamic allocation is an important feature of C, many stand-alone memory allocators have been designed for different purposes. For this thesis, we chose 7 of the most popular and widely used memory allocators.
     223
     224\paragraph{dlmalloc}
     225dlmalloc (FIX ME: cite allocator) is a thread-safe allocator with a single heap, \ie all threads serialize on one heap. dlmalloc maintains free-lists of different sizes to store freed dynamic memory. (FIX ME: cite wasik)
     226
     227\paragraph{hoard}
     228Hoard (FIX ME: cite allocator) is a thread-safe, multi-threaded allocator built using a heap-layer framework. It has per-thread heaps with thread-local free-lists, and a global shared heap. (FIX ME: cite wasik)
     229
     230\paragraph{jemalloc}
     231jemalloc (FIX ME: cite allocator) is a thread-safe allocator that uses multiple arenas. Each thread is assigned an arena. Each arena has chunks that contain contiguous memory regions of the same size. An arena has multiple chunks containing regions of multiple sizes.
     232
     233\paragraph{ptmalloc}
     234ptmalloc (FIX ME: cite allocator) is a modification of dlmalloc. It is a thread-safe, multi-threaded memory allocator that uses multiple heaps. ptmalloc's heap has a similar design to dlmalloc's heap.
     235
     236\paragraph{rpmalloc}
     237rpmalloc (FIX ME: cite allocator) is a thread-safe, multi-threaded allocator that uses per-thread heaps. Each heap has multiple size-classes, and each size-class contains memory regions of the relevant size.
     238
     239\paragraph{tbb malloc}
     240tbb malloc (FIX ME: cite allocator) is a thread-safe, multi-threaded allocator that uses a private heap for each thread. Each private heap has multiple bins of different sizes, and each bin contains free regions of the same size.
     241
     242\paragraph{tc malloc}
     243tcmalloc (FIX ME: cite allocator) is a thread-safe allocator. It uses a per-thread cache to store free objects, preventing contention on shared resources in multi-threaded applications. A central free-list refills the per-thread cache when it becomes empty.
  • doc/theses/mubeen_zulfiqar_MMath/intro.tex

    r41870a5 r05ffb7b  
    9696
    9797\item
     98Provide additional heap wrapper functions in \CFA to create a complete orthogonal set of allocation operations and properties.
     99
     100\item
    98101Provide additional query operations to access information about an allocation:
    99102\begin{itemize}