Changeset 05ffb7b
- Timestamp: Feb 26, 2022, 11:14:03 AM (3 years ago)
- Branches: ADT, ast-experimental, enum, master, pthread-emulation, qualifiedEnum
- Children: 028404f, 9c5aef9, eb3bc52
- Parents: 41870a5
- Location: doc/theses/mubeen_zulfiqar_MMath
- Files: 3 edited
Legend:
  Unmodified lines are unmarked
+ Added lines are marked "+"
- Removed lines are marked "-"
doc/theses/mubeen_zulfiqar_MMath/background.tex
r41870a5 r05ffb7b

- \chapter{Background}
+ \begin{comment}
+ ====================
+ Writing Points:
+ \begin{itemize}
+ \item
+ Classification of benchmarks.
+ \item
+ Literature review of current benchmarks.
+ \item
+ Features and limitations.
+ \item
+ Literature review of current memory allocators.
+ \item
+ Breakdown of memory allocation techniques.
+ \item
+ Features and limitations.
+ \end{itemize}
+ \end{comment}
+
+ \chapter[Background]{Background\footnote{Part of this chapter draws from similar background work in~\cite{wasik.thesis} with many updates.}}

…

  A multi-threaded memory-allocator does not run any threads itself, but is used by a multi-threaded program.
- In addition to single-threaded design issues of locality and fragmentation, a multi-threaded allocator may be simultaneously accessed by multiple threads, and hence, must deal with concurrency issues such as ymutual exclusion, false sharing, and additional forms of heap blowup.
+ In addition to single-threaded design issues of locality and fragmentation, a multi-threaded allocator may be simultaneously accessed by multiple threads, and hence, must deal with concurrency issues such as mutual exclusion, false sharing, and additional forms of heap blowup.

…

  \label{s:MultiThreadedMemoryAllocatorFeatures}

- By analyzing a suite of existing allocators (see \VRef{s:ExistingAllocators}), the following salient features were identified:
+ The following features are used in the construction of multi-threaded memory-allocators:
  \begin{list}{\arabic{enumi}.}{\usecounter{enumi}\topsep=0.5ex\parsep=0pt\itemsep=0pt}
  \item multiple heaps

…

  The spectrum ranges from multiple threads using a single heap, denoted as T:1 (see \VRef[Figure]{f:SingleHeap}), to multiple threads sharing multiple heaps, denoted as T:H (see \VRef[Figure]{f:SharedHeaps}), to one thread per heap, denoted as 1:1 (see \VRef[Figure]{f:PerThreadHeap}), which is almost back to a single-threaded allocator.

- In the T:1 model, all threads allocate and deallocate objects from one heap.
- Memory is obtained from the freed objects or reserved memory in the heap, or from the operating system (OS);
+ \paragraph{T:1 model} where all threads allocate and deallocate objects from one heap.
+ Memory is obtained from the freed objects, or reserved memory in the heap, or from the operating system (OS);
  the heap may also return freed memory to the operating system.
  The arrows indicate the direction memory conceptually moves for each kind of operation: allocation moves memory along the path from the heap/operating-system to the user application, while deallocation moves memory along the path from the application back to the heap/operating-system.
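[Editor's note: to make the T:1 model concrete, the following is a minimal C sketch of one shared heap whose free list and OS reserve are guarded by a single lock. The header layout, first-fit search, and use of sbrk as the OS memory source are illustrative assumptions, not the design of any allocator discussed in the chapter.]

\begin{verbatim}
// T:1 sketch: all threads allocate/deallocate through one heap and one lock.
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>                             // sbrk: simple OS memory source

typedef struct Header { size_t size; struct Header * next; } Header;

static struct {
    pthread_mutex_t lock;
    Header * free_list;                         // freed objects, reused before asking the OS
} heap = { PTHREAD_MUTEX_INITIALIZER, NULL };

void * t1_alloc( size_t size ) {
    pthread_mutex_lock( &heap.lock );           // contention point for all threads
    Header ** link = &heap.free_list, * blk = heap.free_list;
    while ( blk != NULL && blk->size < size ) { // first fit among freed objects
        link = &blk->next;  blk = blk->next;
    }
    if ( blk != NULL ) *link = blk->next;       // reuse freed storage
    else {                                      // otherwise obtain memory from the OS
        blk = sbrk( sizeof(Header) + size );    // error handling elided
        blk->size = size;
    }
    pthread_mutex_unlock( &heap.lock );
    return blk + 1;                             // object storage follows its header
}

void t1_free( void * addr ) {
    Header * blk = (Header *)addr - 1;
    pthread_mutex_lock( &heap.lock );
    blk->next = heap.free_list;                 // push onto the shared free list
    heap.free_list = blk;
    pthread_mutex_unlock( &heap.lock );
}
\end{verbatim}

Even this toy version shows why the T:H and 1:1 models exist: the single lock makes every memory operation a serialization point for the whole program.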
…

  \end{figure}

- In the T:H model, each thread allocates storage from several heaps depending on certain criteria, with the goal of reducing contention by spreading allocations/deallocations across the heaps.
+ \paragraph{T:H model} where each thread allocates storage from several heaps depending on certain criteria, with the goal of reducing contention by spreading allocations/deallocations across the heaps.
  The decision on when to create a new heap and which heap a thread allocates from depends on the allocator design.
  The performance goal is to reduce the ratio of heaps to threads.
  In general, locking is required, since more than one thread may concurrently access a heap during its lifetime, but contention is reduced because fewer threads access a specific heap.
- Two examples of this approach are:
- \begin{description}
- \item[heap pool:]
- Multiple heaps are managed in a pool, starting with a single or a fixed number of heaps that increase\-/decrease depending on contention\-/space issues.
+
+ For example, multiple heaps are managed in a pool, starting with a single or a fixed number of heaps that increase\-/decrease depending on contention\-/space issues.
  At creation, a thread is associated with a heap from the pool.
  When the thread attempts an allocation and its associated heap is locked (contention), it scans for an unlocked heap in the pool.

…

  While the heap-pool approach often minimizes the number of extant heaps, the worst case can result in more heaps than threads;
  \eg if the number of threads is large at startup with many allocations creating a large number of heaps and then the number of threads reduces.
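[Editor's note: the heap-pool scan just described can be implemented with a try-lock loop. The C sketch below is illustrative only: the pool size, growth policy, and Heap contents are hypothetical, races on pool growth are handled only coarsely, and pool_init is assumed to run once at startup.]

\begin{verbatim}
// Heap-pool (T:H) selection sketch: try the associated heap, then scan
// for any unlocked heap, then try to grow the pool.
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_HEAPS 64

typedef struct Heap { pthread_mutex_t lock; /* free lists, reserve memory ... */ } Heap;

static Heap pool[MAX_HEAPS];
static _Atomic size_t nheaps = 1;                 // pool starts with one heap
static __thread Heap * my_heap = &pool[0];        // thread's current association

void pool_init( void ) {                          // called once at program start
    for ( size_t i = 0; i < MAX_HEAPS; i += 1 )
        pthread_mutex_init( &pool[i].lock, NULL );
}

Heap * acquire_heap( void ) {                     // returns a heap with its lock held
    if ( pthread_mutex_trylock( &my_heap->lock ) == 0 ) return my_heap;
    size_t n = atomic_load( &nheaps );
    if ( n > MAX_HEAPS ) n = MAX_HEAPS;           // clamp transient overshoot
    for ( size_t i = 0; i < n; i += 1 )           // contention: scan for an unlocked heap
        if ( pthread_mutex_trylock( &pool[i].lock ) == 0 )
            return my_heap = &pool[i];            // re-associate with this heap
    size_t i = atomic_fetch_add( &nheaps, 1 );    // every heap busy: grow the pool
    if ( i < MAX_HEAPS ) {
        pthread_mutex_lock( &pool[i].lock );
        return my_heap = &pool[i];
    }
    atomic_store( &nheaps, MAX_HEAPS );           // pool full: undo overshoot
    pthread_mutex_lock( &my_heap->lock );         // block on the original association
    return my_heap;
}
\end{verbatim}

A thread that acquires a different heap than the one it started with keeps the new association, which is how the ratio of heaps to threads adapts to contention.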
- \item[kernel threads:]
- Each kernel thread (CPU) executing an application has its own heap.
- A thread allocates/deallocates from/to the heap of the kernel thread on which it is executing.
- Special precautions must be taken to handle or prevent the case where a thread is preempted during allocation/deallocation and restarts execution on a different kernel thread~\cite{Dice02}.
- \end{description}
-
- In the 1:1 model (thread heaps), each thread has its own heap, which eliminates contention and locking because no thread accesses another thread's heap.
- An additional benefit of thread heaps is improved locality due to better memory layout.
- As each thread only allocates from its heap, all objects for a thread are more consolidated in the storage area for that heap, better utilizing each CPUs cache and accessing fewer pages.
- In contrast, the T:H model spreads each thread's objects over a larger area in different heaps.
- Thread heaps can also eliminate allocator-induced active false-sharing, if memory is acquired so it does not overlap at crucial boundaries with memory for another thread's heap.
- For example, assume page boundaries coincide with cache line boundaries, then if a thread heap always acquires pages of memory, no two threads share a page or cache line unless pointers are passed among them.
- Hence, allocator-induced active false-sharing in \VRef[Figure]{f:AllocatorInducedActiveFalseSharing} cannot occur because the memory for thread heaps never overlaps.

  Threads using multiple heaps need to determine the specific heap to access for an allocation/deallocation, \ie association of thread to heap.
  A number of techniques are used to establish this association.
  The simplest approach is for each thread to have a pointer to its associated heap (or to administrative information that points to the heap), and this pointer changes if the association changes.
- For threading systems with thread-local/specific storage, the heap pointer/data is created using this mechanism;
- otherwise, the heap routines must use approaches like hashing the thread's stack-pointer or thread-id to find its associated heap.
+ For threading systems with thread-local storage, the heap pointer is created using this mechanism;
+ otherwise, the heap routines must simulate thread-local storage using approaches like hashing the thread's stack-pointer or thread-id to find its associated heap.
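[Editor's note: both association techniques from the preceding hunk can be sketched in a few lines of C. The names (new_heap, heap_table) and the hash granularity are hypothetical placeholders, and the stack-pointer trick assumes thread stacks occupy distinct address ranges.]

\begin{verbatim}
// Thread-to-heap association sketches.
#include <stdint.h>
#include <stddef.h>

typedef struct Heap Heap;
extern Heap * new_heap( void );                   // hypothetical heap constructor

// 1. Thread-local storage: the runtime keeps the heap pointer directly.
static __thread Heap * tls_heap = NULL;

Heap * heap_of_thread( void ) {
    if ( tls_heap == NULL ) tls_heap = new_heap(); // lazily associate on first use
    return tls_heap;
}

// 2. No TLS: simulate it by hashing an address on the thread's stack into a
//    table of heaps; distinct stacks differ in their high-order address bits.
#define TABLE_SIZE 128
extern Heap * heap_table[TABLE_SIZE];             // assumed filled at startup

Heap * heap_by_stack_hash( void ) {
    int marker;                                   // address approximates the stack pointer
    uintptr_t sp = (uintptr_t)&marker;
    return heap_table[ (sp >> 16) % TABLE_SIZE ]; // hash at coarse stack granularity
}
\end{verbatim}

Hashing the thread id works the same way when the threading system exposes an integral id.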

  The storage management for multiple heaps is more complex than for a single heap (see \VRef[Figure]{f:AllocatorComponents}).

…

  In general, the cost is minimal since the majority of memory operations are completed without the use of the global heap.

- For thread heaps, when a kernel/user-thread terminates, there are two options for handling its heap.
+ \paragraph{1:1 model (thread heaps)} where each thread has its own heap, which eliminates most contention and locking because threads seldom access another thread's heap (see ownership in \VRef{s:Ownership}).
+ An additional benefit of thread heaps is improved locality due to better memory layout.
+ As each thread only allocates from its heap, all objects for a thread are consolidated in the storage area for that heap, better utilizing each CPU's cache and accessing fewer pages.
+ In contrast, the T:H model spreads each thread's objects over a larger area in different heaps.
+ Thread heaps can also eliminate allocator-induced active false-sharing, if memory is acquired so it does not overlap at crucial boundaries with memory for another thread's heap.
+ For example, assume page boundaries coincide with cache line boundaries, then if a thread heap always acquires pages of memory, no two threads share a page or cache line unless pointers are passed among them.
+ Hence, allocator-induced active false-sharing in \VRef[Figure]{f:AllocatorInducedActiveFalseSharing} cannot occur because the memory for thread heaps never overlaps.
+
+ When a thread terminates, there are two options for handling its heap.
  First is to free all objects in the heap to the global heap and destroy the thread heap.
- Second is to place the thread heap on a list of available heaps and reuse it for a new kernel/user thread in the future.
+ Second is to place the thread heap on a list of available heaps and reuse it for a new thread in the future.
  Destroying the thread heap immediately may reduce external fragmentation sooner, since all free objects are freed to the global heap and may be reused by other threads.
  Alternatively, reusing thread heaps may improve performance if the inheriting thread makes similar allocation requests as the thread that previously held the thread heap.

- As multiple heaps are a key feature for a multi-threaded allocator, all further discussion assumes multiple heaps with a global heap to eliminate direct interaction with the operating system.
+
+ \subsection{User-Level Threading}
+
+ It is possible to use any of the heap models with user-level (M:N) threading.
+ However, an important goal of user-level threading is fast operations (creation/termination/context-switching) achieved by not interacting with the operating system, which allows the creation of large numbers of high-performance interacting threads ($>$ 10,000).
+ It is difficult to retain this goal if the user-threading model is directly involved with the heap model.
+ \VRef[Figure]{f:UserLevelKernelHeaps} shows that virtually all user-level threading systems use whatever kernel-level heap-model is provided by the language runtime.
+ Hence, a user thread allocates/deallocates from/to the heap of the kernel thread on which it is currently executing.
+
+ \begin{figure}
+ \centering
+ \input{UserKernelHeaps}
+ \caption{User-Level Kernel Heaps}
+ \label{f:UserLevelKernelHeaps}
+ \end{figure}
+
+ Adopting this model results in a subtle problem with shared heaps.
+ With kernel threading, an operation that is started by a kernel thread is always completed by that thread.
+ For example, if a kernel thread starts an allocation/deallocation on a shared heap, it always completes that operation with that heap even if preempted.
+ Any correctness locking associated with the shared heap is preserved across preemption.
+
+ However, this correctness property is not preserved for user-level threading.
+ A user thread can start an allocation/deallocation on one kernel thread, be preempted (time slice), and continue running on a different kernel thread to complete the operation~\cite{Dice02}.
+ When the user thread continues on the new kernel thread, it may have pointers into the previous kernel-thread's heap and hold locks associated with it.
+ To get the same kernel-thread safety, time slicing must be disabled/\-enabled around these operations, so the user thread cannot jump to another kernel thread.
+ However, eagerly disabling/enabling time-slicing on the allocation/deallocation fast path is expensive, because preemption is rare (10--100 milliseconds).
+ Instead, techniques exist to lazily detect this case in the interrupt handler, abort the preemption, and return to the operation so it can complete atomically.
+ Occasionally ignoring a preemption should be benign.
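[Editor's note: a common form of the lazy technique is a pair of thread-local flags checked by the time-slice interrupt handler. The sketch below is illustrative: the runtime hooks (heap_alloc, current_heap, take_time_slice) are hypothetical, and a real runtime must ensure the flags are per-kernel-thread and signal-safe.]

\begin{verbatim}
// Lazy preemption deferral around the allocation fast path (sketch).
#include <signal.h>
#include <stddef.h>

typedef struct Heap Heap;
extern void * heap_alloc( Heap *, size_t );           // hypothetical allocator internals
extern Heap * current_heap( void );
extern void take_time_slice( void );                  // switch to another user thread

static __thread volatile sig_atomic_t in_alloc;       // operation in progress
static __thread volatile sig_atomic_t pending_slice;  // preemption arrived meanwhile

void * um_malloc( size_t size ) {
    in_alloc = 1;                        // one cheap store on the fast path
    void * addr = heap_alloc( current_heap(), size ); // cannot migrate mid-operation
    in_alloc = 0;
    if ( pending_slice ) {               // take the deferred preemption now
        pending_slice = 0;
        take_time_slice();
    }
    return addr;
}

void time_slice_handler( int sig ) {     // periodic interrupt (e.g., SIGALRM)
    (void)sig;
    if ( in_alloc ) {                    // user thread is inside the allocator:
        pending_slice = 1;               // abort this preemption and retry later
        return;
    }
    take_time_slice();
}
\end{verbatim}

The handler only sets a flag when it loses the race, so the fast path stays two thread-local stores.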
+
+ \begin{figure}
+ \centering
+ \subfigure[Ownership]{
+ \input{MultipleHeapsOwnership}
+ } % subfigure
+ \hspace{0.25in}
+ \subfigure[No Ownership]{
+ \input{MultipleHeapsNoOwnership}
+ } % subfigure
+ \caption{Heap Ownership}
+ \label{f:HeapsOwnership}
+ \end{figure}

…

  For the 1:1 thread:heap relationship, a thread only allocates from its own heap, and without ownership, a thread only frees objects to its own heap, which means the heap is private to its owner thread and does not require any locking, called a \newterm{private heap}.
  For the T:1/T:H models with or without ownership or the 1:1 model with ownership, a thread may free objects to different heaps, which makes each heap publicly accessible to all threads, called a \newterm{public heap}.
-
- \begin{figure}
- \centering
- \subfigure[Ownership]{
- \input{MultipleHeapsOwnership}
- } % subfigure
- \hspace{0.25in}
- \subfigure[No Ownership]{
- \input{MultipleHeapsNoOwnership}
- } % subfigure
- \caption{Heap Ownership}
- \label{f:HeapsOwnership}
- \end{figure}
-
  \VRef[Figure]{f:MultipleHeapStorageOwnership} shows the effect of ownership on storage layout.

…

  The disadvantage of ownership is deallocating to another task's heap so heaps are no longer private and require locks to provide safe concurrent access.

- Object ownership can be immediate or delayed, meaning objects may be returned to the owner heap immediately at deallocation or after some delay.
- A thread may delay the return by storing objects it does not own on a separate free list.
- Delaying can improve performance by batching objects for return to their owner heap and possibly reallocating these objects if storage runs out on the current heap.
- However, reallocation can result in passive false-sharing.
+ Object ownership can be immediate or delayed, meaning free objects may be batched on a separate free list either by the returning or receiving thread.
+ While the returning thread can batch objects, batching across multiple heaps is complex and there is no obvious time when to push back to the owner heap.
+ It is better for returning threads to push objects immediately onto the receiving thread's batch list, as the receiving thread knows best when to incorporate the batch list into its free pool.
+ Batching leverages the fact that most allocation patterns use the contention-free fast-path, so locking on the batch list is rare for both the returning and receiving threads.
+
+ It is possible for heaps to steal objects rather than return them, reallocating these objects when storage runs out on a heap.
+ However, stealing can result in passive false-sharing.
  For example, in \VRef[Figure]{f:AllocatorInducedPassiveFalseSharing}, Object$_2$ may be deallocated to Task$_2$'s heap initially.
  If Task$_2$ reallocates Object$_2$ before it is returned to its owner heap, then passive false-sharing may occur.
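[Editor's note: the batch-list scheme in the preceding hunk can be sketched as a per-heap "remote" free list. The layout and names below are illustrative assumptions, not the thesis design.]

\begin{verbatim}
// Delayed ownership via a batch ("remote") free list per heap.
#include <pthread.h>
#include <stddef.h>

typedef struct FreeNode { struct FreeNode * next; } FreeNode;

typedef struct Heap {
    FreeNode * local;                   // owner-only free list: no locking
    pthread_mutex_t remote_lock;        // guards the batch list
    FreeNode * remote;                  // objects freed by non-owner threads
} Heap;

void remote_free( Heap * owner, void * addr ) {  // called by a non-owner thread
    FreeNode * node = addr;
    pthread_mutex_lock( &owner->remote_lock );   // rarely contended in practice
    node->next = owner->remote;
    owner->remote = node;
    pthread_mutex_unlock( &owner->remote_lock );
}

void * owner_alloc( Heap * heap ) {              // owner-only fast path
    if ( heap->local == NULL ) {                 // local pool dry: adopt the batch
        pthread_mutex_lock( &heap->remote_lock );
        heap->local = heap->remote;
        heap->remote = NULL;
        pthread_mutex_unlock( &heap->remote_lock );
    }
    FreeNode * node = heap->local;
    if ( node != NULL ) heap->local = node->next;
    return node;                                 // NULL => fall back to global heap/OS
}
\end{verbatim}

Only the rare refill path and the remote push touch the lock, matching the observation that most allocation patterns stay on the contention-free fast path.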
435 This difference in usage patterns can lead to poor cache locality~\cite{Feng05}.436 Additionally, placing headers on individual objects can lead to redundant management information.437 For example, if a header stores only the object size, then all objects with the same size have identical headers.438 482 439 483 \begin{figure} … … 443 487 \label{f:ObjectHeaders} 444 488 } % subfigure 445 \\446 489 \subfigure[Object Container]{ 447 490 \input{Container} … … 452 495 \end{figure} 453 496 454 An alternative approach f or managing objects factors common header/trailer information to a separate location in memory and organizes associated free storage into blocks called \newterm{object containers} (\newterm{superblocks} in~\cite{Berger00}), as in \VRef[Figure]{f:ObjectContainer}.497 An alternative approach factors common header/trailer information to a separate location in memory and organizes associated free storage into blocks called \newterm{object containers} (\newterm{superblocks} in~\cite{Berger00}), as in \VRef[Figure]{f:ObjectContainer}. 455 498 The header for the container holds information necessary for all objects in the container; 456 499 a trailer may also be used at the end of the container. … … 714 757 Implementing lock-free operations for more complex data-structures (queue~\cite{Valois94}/deque~\cite{Sundell08}) is more complex. 715 758 Michael~\cite{Michael04} and Gidenstam \etal \cite{Gidenstam05} have created lock-free variations of the Hoard allocator. 716 717 718 \noindent719 ====================720 721 Writing Points:722 \begin{itemize}723 \item724 Classification of benchmarks.725 \item726 Literature review of current benchmarks.727 \item728 Features and limitations.729 \item730 Literature review of current memory allocators.731 \item732 Breakdown of memory allocation techniques.733 \item734 Features and limitations.735 \end{itemize}736 737 \noindent738 ====================739 740 % FIXME: cite wasik741 \cite{wasik.thesis}742 743 \section{Existing Memory Allocators}744 With dynamic allocation being an important feature of C, there are many stand-alone memory allocators that have been designed for different purposes. For this thesis, we chose 7 of the most popular and widely used memory allocators.745 746 \paragraph{dlmalloc}747 dlmalloc (FIX ME: cite allocator) is a thread-safe allocator that is single threaded and single heap. dlmalloc maintains free-lists of different sizes to store freed dynamic memory. (FIX ME: cite wasik)748 749 \paragraph{hoard}750 Hoard (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and using a heap layer framework. It has per-thread heaps that have thread-local free-lists, and a global shared heap. (FIX ME: cite wasik)751 752 \paragraph{jemalloc}753 jemalloc (FIX ME: cite allocator) is a thread-safe allocator that uses multiple arenas. Each thread is assigned an arena. Each arena has chunks that contain contagious memory regions of same size. An arena has multiple chunks that contain regions of multiple sizes.754 755 \paragraph{ptmalloc}756 ptmalloc (FIX ME: cite allocator) is a modification of dlmalloc. It is a thread-safe multi-threaded memory allocator that uses multiple heaps. ptmalloc heap has similar design to dlmalloc's heap.757 758 \paragraph{rpmalloc}759 rpmalloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses per-thread heap. 

…

  Implementing lock-free operations for more complex data-structures (queue~\cite{Valois94}/deque~\cite{Sundell08}) is more complex.
  Michael~\cite{Michael04} and Gidenstam \etal \cite{Gidenstam05} have created lock-free variations of the Hoard allocator.
-
-
- \noindent
- ====================
-
- Writing Points:
- \begin{itemize}
- \item
- Classification of benchmarks.
- \item
- Literature review of current benchmarks.
- \item
- Features and limitations.
- \item
- Literature review of current memory allocators.
- \item
- Breakdown of memory allocation techniques.
- \item
- Features and limitations.
- \end{itemize}
-
- \noindent
- ====================
-
- % FIXME: cite wasik
- \cite{wasik.thesis}
-
- \section{Existing Memory Allocators}
- With dynamic allocation being an important feature of C, there are many stand-alone memory allocators that have been designed for different purposes. For this thesis, we chose 7 of the most popular and widely used memory allocators.
-
- \paragraph{dlmalloc}
- dlmalloc (FIX ME: cite allocator) is a thread-safe allocator that is single threaded and single heap. dlmalloc maintains free-lists of different sizes to store freed dynamic memory. (FIX ME: cite wasik)
-
- \paragraph{hoard}
- Hoard (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and using a heap layer framework. It has per-thread heaps that have thread-local free-lists, and a global shared heap. (FIX ME: cite wasik)
-
- \paragraph{jemalloc}
- jemalloc (FIX ME: cite allocator) is a thread-safe allocator that uses multiple arenas. Each thread is assigned an arena. Each arena has chunks that contain contagious memory regions of same size. An arena has multiple chunks that contain regions of multiple sizes.
-
- \paragraph{ptmalloc}
- ptmalloc (FIX ME: cite allocator) is a modification of dlmalloc. It is a thread-safe multi-threaded memory allocator that uses multiple heaps. ptmalloc heap has similar design to dlmalloc's heap.
-
- \paragraph{rpmalloc}
- rpmalloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses per-thread heap. Each heap has multiple size-classes and each size-class contains memory regions of the relevant size.
-
- \paragraph{tbb malloc}
- tbb malloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses private heap for each thread. Each private-heap has multiple bins of different sizes. Each bin contains free regions of the same size.
-
- \paragraph{tc malloc}
- tcmalloc (FIX ME: cite allocator) is a thread-safe allocator. It uses per-thread cache to store free objects that prevents contention on shared resources in multi-threaded application. A central free-list is used to refill per-thread cache when it gets empty.
doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex
r41870a5 r05ffb7b

  \paragraph{Relevant Knobs}
  *** FIX ME: Insert Relevant Knobs
+
+
+
+ \section{Existing Memory Allocators}
+ With dynamic allocation being an important feature of C, there are many stand-alone memory allocators that have been designed for different purposes. For this thesis, we chose 7 of the most popular and widely used memory allocators.
+
+ \paragraph{dlmalloc}
+ dlmalloc (FIX ME: cite allocator) is a thread-safe allocator that is single-threaded and single-heap. dlmalloc maintains free lists of different sizes to store freed dynamic memory. (FIX ME: cite wasik)
+
+ \paragraph{hoard}
+ Hoard (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses a heap-layer framework. It has per-thread heaps with thread-local free lists, and a global shared heap. (FIX ME: cite wasik)
+
+ \paragraph{jemalloc}
+ jemalloc (FIX ME: cite allocator) is a thread-safe allocator that uses multiple arenas. Each thread is assigned an arena. Each arena has chunks that contain contiguous memory regions of the same size. An arena has multiple chunks that contain regions of multiple sizes.
+
+ \paragraph{ptmalloc}
+ ptmalloc (FIX ME: cite allocator) is a modification of dlmalloc. It is a thread-safe, multi-threaded memory allocator that uses multiple heaps. ptmalloc's heap has a similar design to dlmalloc's heap.
+
+ \paragraph{rpmalloc}
+ rpmalloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses a per-thread heap. Each heap has multiple size-classes and each size-class contains memory regions of the relevant size.
+
+ \paragraph{tbb malloc}
+ tbb malloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses a private heap for each thread. Each private heap has multiple bins of different sizes. Each bin contains free regions of the same size.
+
+ \paragraph{tc malloc}
+ tcmalloc (FIX ME: cite allocator) is a thread-safe allocator. It uses a per-thread cache to store free objects, which prevents contention on shared resources in multi-threaded applications. A central free list is used to refill the per-thread cache when it gets empty.
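[Editor's note: the tcmalloc description above names the two structural pieces, per-thread caches and a central free list. The C sketch below illustrates that general pattern only; it is not tcmalloc's implementation, and the batch size and refill policy are arbitrary assumptions.]

\begin{verbatim}
// Per-thread cache + central free-list pattern (illustrative sketch).
#include <pthread.h>
#include <stddef.h>

typedef struct FreeNode { struct FreeNode * next; } FreeNode;

static pthread_mutex_t central_lock = PTHREAD_MUTEX_INITIALIZER;
static FreeNode * central_list;                 // shared, lock protected
static __thread FreeNode * thread_cache;        // private, no locking needed

enum { BATCH = 32 };                            // objects moved per refill

void * cached_alloc( void ) {
    if ( thread_cache == NULL ) {               // cache empty: refill from central list
        pthread_mutex_lock( &central_lock );
        for ( int i = 0; i < BATCH && central_list != NULL; i += 1 ) {
            FreeNode * node = central_list;
            central_list = node->next;
            node->next = thread_cache;
            thread_cache = node;
        }
        pthread_mutex_unlock( &central_lock );
    }
    FreeNode * node = thread_cache;             // common case: no lock taken
    if ( node != NULL ) thread_cache = node->next;
    return node;                                // NULL => grow from the OS (elided)
}

void cached_free( void * addr ) {
    FreeNode * node = addr;
    node->next = thread_cache;                  // free to the private cache
    thread_cache = node;                        // real designs bound the cache and
}                                               // return surplus to the central list
\end{verbatim}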
doc/theses/mubeen_zulfiqar_MMath/intro.tex
r41870a5 r05ffb7b


  \item
+ Provide additional heap wrapper functions in \CFA to provide a complete orthogonal set of allocation operations and properties.
+
+ \item
  Provide additional query operations to access information about an allocation:
  \begin{itemize}