Changeset a114743
Timestamp: Mar 30, 2022, 10:11:02 PM
Branches: ADT, ast-experimental, enum, master, pthread-emulation, qualifiedEnum
Children: ee3da78
Parents: 65781a8
Location: doc/theses/mubeen_zulfiqar_MMath
Files: 2 edited
Legend:
  Unmodified (shown indented)
+ Added
- Removed
doc/theses/mubeen_zulfiqar_MMath/background.tex
r65781a8 → ra114743

…
  \VRef[Figure]{f:AllocatorComponents} shows the two important data components for a memory allocator, management and storage, collectively called the \newterm{heap}.
  The \newterm{management data} is a data structure located at a known memory address and contains all information necessary to manage the storage data.
- The management data starts with fixed-sized information in the static-data memory that flows into the dynamic-allocation memory.
+ The management data starts with fixed-sized information in the static-data memory that references components in the dynamic-allocation memory.
  The \newterm{storage data} is composed of allocated and freed objects, and \newterm{reserved memory}.
  Allocated objects (white) are variable sized, and allocated and maintained by the program;
…
  \label{f:AllocatorComponents}
  \end{figure}
- Freed objects (light grey) are memory deallocated by the program, which are linked into one or more lists facilitating easy location for new allocations.
+ Freed objects (light grey) represent memory deallocated by the program, which are linked into one or more lists facilitating easy location of new allocations.
  Often the free list is chained internally so it does not consume additional storage, \ie the link fields are placed at known locations in the unused memory blocks.
  Reserved memory (dark grey) is one or more blocks of memory obtained from the operating system but not yet allocated to the program;
…
  Fragmentation is memory requested from the operating system but not used by the program;
  hence, allocated objects are not fragmentation.
- \VRef[Figure]{f:InternalExternalFragmentation}) shows fragmentation is divided into two forms: internal or external.
+ \VRef[Figure]{f:InternalExternalFragmentation} shows fragmentation is divided into two forms: internal or external.
…
  An allocator should strive to keep internal management information to a minimum.
- \newterm{External fragmentation} is all memory space reserved from the operating system but not allocated to the program~\cite{Wilson95,Lim98,Siebert00}, which includes freed objects, all external management data, and reserved memory.
+ \newterm{External fragmentation} is all memory space reserved from the operating system but not allocated to the program~\cite{Wilson95,Lim98,Siebert00}, which includes all external management data, freed objects, and reserved memory.
  This memory is problematic in two ways: heap blowup and highly fragmented memory.
  \newterm{Heap blowup} occurs when memory freed by the program is not reused for future allocations leading to potentially unbounded external fragmentation growth~\cite{Berger00}.
…
  \end{figure}
- For a single-threaded memory allocator, three basic approaches for controlling fragmentation have been identified~\cite{Johnstone99}.
+ For a single-threaded memory allocator, three basic approaches for controlling fragmentation are identified~\cite{Johnstone99}.
  The first approach is a \newterm{sequential-fit algorithm} with one list of free objects that is searched for a block large enough to fit a requested object size.
  Different search policies determine the free object selected, \eg the first free object large enough or closest to the requested size.
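For illustration, a first-fit search over an internally chained free list might look like the following minimal sketch; the names (free_node, free_head, first_fit) and node layout are assumptions for the example, not the thesis's data structures.

    #include <stddef.h>

    // Link fields live inside the unused blocks themselves (internal chaining).
    typedef struct free_node {
    	size_t size;             // usable bytes in this free block
    	struct free_node *next;  // next block on the free list
    } free_node;

    static free_node *free_head;  // head of the single free list

    // First-fit policy: take the first block large enough for the request.
    static free_node *first_fit( size_t request ) {
    	for ( free_node **p = &free_head; *p != NULL; p = &(*p)->next ) {
    		if ( (*p)->size >= request ) {
    			free_node *found = *p;
    			*p = found->next;  // unlink; caller may split off any excess
    			return found;
    		}
    	}
    	return NULL;               // no fit: fall back to reserved memory / OS
    }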
…
  The second approach is a \newterm{segregated} or \newterm{binning algorithm} with a set of lists for different sized freed objects.
- When an object is allocated, the requested size is rounded up to the nearest bin-size, possibly with spacing after the object.
+ When an object is allocated, the requested size is rounded up to the nearest bin-size, often leading to spacing after the object.
  A binning algorithm is fast at finding free memory of the appropriate size and allocating it, since the first free object on the free list is used.
  The fewer bin-sizes, the fewer lists need to be searched and maintained;
…
  Temporal locality commonly occurs during an iterative computation with a fixed set of disjoint variables, while spatial locality commonly occurs when traversing an array.
- Hardware takes advantage of temporal and spatial locality through multiple levels of caching (\ie memory hierarchy).
+ Hardware takes advantage of temporal and spatial locality through multiple levels of caching, \ie memory hierarchy.
  When an object is accessed, the memory physically located around the object is also cached with the expectation that the current and nearby objects will be referenced within a short period of time.
  For example, entire cache lines are transferred between memory and cache and entire virtual-memory pages are transferred between disk and memory.
…
  There are a number of ways a memory allocator can degrade locality by increasing the working set.
- For example, a memory allocator may access multiple free objects before finding one to satisfy an allocation request (\eg sequential-fit algorithm).
+ For example, a memory allocator may access multiple free objects before finding one to satisfy an allocation request, \eg sequential-fit algorithm.
  If there are a (large) number of objects accessed in very different areas of memory, the allocator may perturb the program's memory hierarchy causing multiple cache or page misses~\cite{Grunwald93}.
  Another way locality can be degraded is by spatially separating related data.
…
  A multi-threaded memory-allocator does not run any threads itself, but is used by a multi-threaded program.
- In addition to single-threaded design issues of locality and fragmentation, a multi-threaded allocator may be simultaneously accessed by multiple threads, and hence, must deal with concurrency issues such as mutual exclusion, false sharing, and additional forms of heap blowup.
+ In addition to single-threaded design issues of fragmentation and locality, a multi-threaded allocator is simultaneously accessed by multiple threads, and hence, must deal with concurrency issues such as mutual exclusion, false sharing, and additional forms of heap blowup.
…
  Second is when multiple threads contend for a shared resource simultaneously, and hence, some threads must wait until the resource is released.
  Contention can be reduced in a number of ways:
+ \begin{itemize}[itemsep=0pt]
+ \item
  using multiple fine-grained locks versus a single lock, spreading the contention across a number of locks;
+ \item
  using trylock and generating new storage if the lock is busy, yielding a classic space versus time tradeoff;
+ \item
  using one of the many lock-free approaches for reducing contention on basic data-structure operations~\cite{Oyama99}.
- However, all of these approaches have degenerate cases where contention occurs.
+ \end{itemize}
+ However, all of these approaches have degenerate cases where program contention is high, which occurs outside of the allocator.
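A minimal sketch of the trylock tradeoff from the list above: if the shared heap's lock is busy, new storage is generated instead of waiting. The pthread calls are real; heap_alloc_locked and bump_alloc are hypothetical helpers.

    #include <pthread.h>
    #include <stddef.h>

    extern pthread_mutex_t heap_lock;               // protects the shared heap
    extern void *heap_alloc_locked( size_t size );  // hypothetical: allocate holding the lock
    extern void *bump_alloc( size_t size );         // hypothetical: carve from private storage

    // Space-versus-time tradeoff: never wait, at the cost of extra storage.
    void *try_alloc( size_t size ) {
    	if ( pthread_mutex_trylock( &heap_lock ) == 0 ) {  // 0 => acquired
    		void *p = heap_alloc_locked( size );
    		pthread_mutex_unlock( &heap_lock );
    		return p;
    	}
    	return bump_alloc( size );  // lock busy: generate new storage instead
    }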
…
  \label{s:MultipleHeaps}
- A single-threaded allocator has at most one thread and heap, while a multi-threaded allocator has potentially multiple threads and heaps.
+ A multi-threaded allocator has potentially multiple threads and heaps.
  The multiple threads cause complexity, and multiple heaps are a mechanism for dealing with the complexity.
  The spectrum ranges from multiple threads using a single heap, denoted as T:1 (see \VRef[Figure]{f:SingleHeap}), to multiple threads sharing multiple heaps, denoted as T:H (see \VRef[Figure]{f:SharedHeaps}), to one thread per heap, denoted as 1:1 (see \VRef[Figure]{f:PerThreadHeap}), which is almost back to a single-threaded allocator.
…
  An alternative implementation is for all heaps to share one reserved memory, which requires a separate lock for the reserved storage to ensure mutual exclusion when acquiring new memory.
  Because multiple threads can allocate/free/reallocate adjacent storage, all forms of false sharing may occur.
- Other storage-management options are to use @mmap@ to set aside (large) areas of virtual memory for each heap and suballocate each heap's storage within that area.
+ Other storage-management options are to use @mmap@ to set aside (large) areas of virtual memory for each heap and suballocate each heap's storage within that area, pushing part of the storage management complexity back to the operating system.
…
  \begin{figure}
…
- \paragraph{1:1 model (thread heaps)} where each thread has its own heap, which eliminates most contention and locking because threads seldom accesses another thread's heap (see ownership in \VRef{s:Ownership}).
+ \paragraph{1:1 model (thread heaps)} where each thread has its own heap eliminating most contention and locking because threads seldom access another thread's heap (see ownership in \VRef{s:Ownership}).
  An additional benefit of thread heaps is improved locality due to better memory layout.
  As each thread only allocates from its heap, all objects for a thread are consolidated in the storage area for that heap, better utilizing each CPU's cache and accessing fewer pages.
…
  Second is to place the thread heap on a list of available heaps and reuse it for a new thread in the future.
  Destroying the thread heap immediately may reduce external fragmentation sooner, since all free objects are freed to the global heap and may be reused by other threads.
- Alternatively, reusing thread heaps may improve performance if the inheriting thread makes similar allocation requests as the thread that previously held the thread heap.
+ Alternatively, reusing thread heaps may improve performance if the inheriting thread makes similar allocation requests as the thread that previously held the thread heap because any unfreed storage is immediately accessible.
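The 1:1 model maps naturally onto thread-local storage, as in this sketch; Heap, create_heap, and heap_malloc are hypothetical placeholders.

    #include <stddef.h>

    typedef struct Heap Heap;                          // hypothetical per-thread heap
    extern Heap *create_heap( void );                  // hypothetical: construct a heap
    extern void *heap_malloc( Heap *h, size_t size );  // hypothetical: allocate from h

    // One heap per thread: the fast path needs no locking because no other
    // thread allocates from this heap.
    static _Thread_local Heap *my_heap;

    void *thread_malloc( size_t size ) {
    	if ( my_heap == NULL )
    		my_heap = create_heap();  // lazily built on first allocation
    	return heap_malloc( my_heap, size );
    }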
…
  However, an important goal of user-level threading is for fast operations (creation/termination/context-switching) by not interacting with the operating system, which allows the ability to create large numbers of high-performance interacting threads ($>$ 10,000).
  It is difficult to retain this goal, if the user-threading model is directly involved with the heap model.
- \VRef[Figure]{f:UserLevelKernelHeaps} shows that virtually all user-level threading systems use whatever kernel-level heap-model provided by the language runtime.
+ \VRef[Figure]{f:UserLevelKernelHeaps} shows that virtually all user-level threading systems use whatever kernel-level heap-model is provided by the language runtime.
  Hence, a user thread allocates/deallocates from/to the heap of the kernel thread on which it is currently executing.
…
  Adopting this model results in a subtle problem with shared heaps.
  With kernel threading, an operation that is started by a kernel thread is always completed by that thread.
- For example, if a kernel thread starts an allocation/deallocation on a shared heap, it always completes that operation with that heap even if preempted.
- Any correctness locking associated with the shared heap is preserved across preemption.
+ For example, if a kernel thread starts an allocation/deallocation on a shared heap, it always completes that operation with that heap even if preempted, \ie any locking correctness associated with the shared heap is preserved across preemption.
  However, this correctness property is not preserved for user-level threading.
…
  However, eagerly disabling/enabling time-slicing on the allocation/deallocation fast path is expensive, because preemption is rare (10--100 milliseconds).
  Instead, techniques exist to lazily detect this case in the interrupt handler, abort the preemption, and return to the operation so it can complete atomically.
- Occasionally ignoring a preemption should be benign.
+ Occasionally ignoring a preemption should be benign, but a persistent lack of preemption can result in both short and long term starvation.
…
  \newterm{Ownership} defines which heap an object is returned-to on deallocation.
- If a thread returns an object to the heap it was originally allocated from, the heap has ownership of its objects.
- Alternatively, a thread can return an object to the heap it is currently allocating from, which can be any heap accessible during a thread's lifetime.
+ If a thread returns an object to the heap it was originally allocated from, a heap has ownership of its objects.
+ Alternatively, a thread can return an object to the heap it is currently associated with, which can be any heap accessible during a thread's lifetime.
  \VRef[Figure]{f:HeapsOwnership} shows an example of multiple heaps (minus the global heap) with and without ownership.
  Again, the arrows indicate the direction memory conceptually moves for each kind of operation.
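Ownership is commonly implemented by recording the owning heap in a header preceding each object, as in this sketch; the header layout and field names are assumptions.

    #include <pthread.h>

    typedef struct Heap Heap;
    typedef struct header {
    	Heap *owner;          // heap the object was allocated from
    	struct header *next;  // link while the object sits on a free list
    } header;

    struct Heap {
    	header *free_list;            // local free list: owner access only
    	header *remote_list;          // objects freed to this heap by other threads
    	pthread_mutex_t remote_lock;  // protects remote_list
    };

    static _Thread_local Heap *my_heap;  // as in the 1:1 sketch above

    // With ownership, deallocation returns the object to its originating heap.
    void owned_free( void *addr ) {
    	header *h = (header *)addr - 1;  // header precedes the object
    	Heap *owner = h->owner;
    	if ( owner == my_heap ) {        // local free: no synchronization
    		h->next = owner->free_list;
    		owner->free_list = h;
    	} else {                         // remote free: owner's remote list
    		pthread_mutex_lock( &owner->remote_lock );
    		h->next = owner->remote_list;
    		owner->remote_list = h;
    		pthread_mutex_unlock( &owner->remote_lock );
    	}
    }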
…
  Only with the 1:1 model and ownership is active and passive false-sharing avoided (see \VRef{s:Ownership}).
  Passive false-sharing may still occur, if delayed ownership is used.
+ Finally, a completely free container can become reserved storage and be reset to allocate objects of a new size or freed to the global heap.
…
  \begin{figure}
…
  \caption{Free-list Structure with Container Ownership}
  \end{figure}
- A fragmented heap has multiple containers that may be partially or completely free.
- A completely free container can become reserved storage and be reset to allocate objects of a new size.
- When a heap reaches a threshold of free objects, it moves some free storage to the global heap for reuse to prevent heap blowup.
- Without ownership, when a heap frees objects to the global heap, individual objects must be passed, and placed on the global-heap's free-list.
- Containers cannot be freed to the global heap unless completely free because
  When a container changes ownership, the ownership of all objects within it change as well.
…
  Note, once the object is freed by Task$_1$, no more false sharing can occur until the container changes ownership again.
  To prevent this form of false sharing, container movement may be restricted to when all objects in the container are free.
- One implementation approach that increases the freedom to return a free container to the operating system involves allocating containers using a call like @mmap@, which allows memory at an arbitrary address to be returned versus only storage at the end of the contiguous @sbrk@ area.
+ One implementation approach that increases the freedom to return a free container to the operating system involves allocating containers using a call like @mmap@, which allows memory at an arbitrary address to be returned versus only storage at the end of the contiguous @sbrk@ area, again pushing storage management complexity back to the operating system.
…
  \begin{figure}
…
  \end{figure}
- As mentioned, an implementation may have only one heap deal with the global heap, so the other heap can be simplified.
+ As mentioned, an implementation may have only one heap interact with the global heap, so the other heap can be simplified.
  For example, if only the private heap interacts with the global heap, the public heap can be reduced to a lock-protected free-list of objects deallocated by other threads due to ownership, called a \newterm{remote free-list}.
  To avoid heap blowup, the private heap allocates from the remote free-list when it reaches some threshold or it has no free storage.
…
  An allocation buffer is reserved memory (see~\VRef{s:AllocatorComponents}) not yet allocated to the program, and is used for allocating objects when the free list is empty.
  That is, rather than requesting new storage for a single object, an entire buffer is requested from which multiple objects are allocated later.
- Both any heap may use an allocation buffer, resulting in allocation from the buffer before requesting objects (containers) from the global heap or operating system, respectively.
+ Any heap may use an allocation buffer, resulting in allocation from the buffer before requesting objects (containers) from the global heap or operating system, respectively.
  The allocation buffer reduces contention and the number of global/operating-system calls.
  For coalescing, a buffer is split into smaller objects by allocations, and recomposed into larger buffer areas during deallocations.
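Splitting a buffer across many requests makes the common-case allocation a pointer bump, as in this sketch; request_buffer and the buffer size are assumptions, and requests larger than a buffer are ignored for brevity.

    #include <stddef.h>

    extern void *request_buffer( size_t size );  // hypothetical: reserved memory from global heap/OS

    enum { BUFFER_SIZE = 64 * 1024 };

    static _Thread_local char *buf_next;  // next free byte in the current buffer
    static _Thread_local char *buf_end;   // one past the end of the current buffer

    // Bump allocation: one global/OS request serves many object allocations.
    void *buffer_alloc( size_t size ) {   // assumes size <= BUFFER_SIZE
    	size = ( size + 15 ) & ~(size_t)15;            // keep 16-byte alignment
    	if ( buf_next == NULL || (size_t)( buf_end - buf_next ) < size ) {
    		buf_next = request_buffer( BUFFER_SIZE );  // refill buffer
    		buf_end = buf_next + BUFFER_SIZE;
    	}
    	void *p = buf_next;
    	buf_next += size;                              // the entire fast path
    	return p;
    }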
- Allocation buffers are useful initially when there are no freed objects in a heap because many allocations usually occur when a thread starts.
+ Allocation buffers are useful initially when there are no freed objects in a heap because many allocations usually occur when a thread starts (simple bump allocation).
  Furthermore, to prevent heap blowup, objects should be reused before allocating a new allocation buffer.
- Thus, allocation buffers are often allocated more frequently at program/thread start, and then their use often diminishes.
+ Thus, allocation buffers are often allocated more frequently at program/thread start, and then allocations often diminish.
  Using an allocation buffer with a thread heap avoids active false-sharing, since all objects in the allocation buffer are allocated to the same thread.
…
  \label{s:LockFreeOperations}
- A lock-free algorithm guarantees safe concurrent-access to a data structure, so that at least one thread can make progress in the system, but an individual task has no bound to execution, and hence, may starve~\cite[pp.~745--746]{Herlihy93}.
- % A wait-free algorithm puts a finite bound on the number of steps any thread takes to complete an operation, so an individual task cannot starve
+ A \newterm{lock-free algorithm} guarantees safe concurrent-access to a data structure, so that at least one thread makes progress, but an individual task has no execution bound and may starve~\cite[pp.~745--746]{Herlihy93}.
+ (A \newterm{wait-free algorithm} puts a bound on the number of steps any thread takes to complete an operation to prevent starvation.)
  Lock-free operations can be used in an allocator to reduce or eliminate the use of locks.
- Locks are a problem for high contention or if the thread holding the lock is preempted and other threads attempt to use that lock.
- With respect to the heap, these situations are unlikely unless all threads makes extremely high use of dynamic-memory allocation, which can be an indication of poor design.
+ While locks and lock-free data-structures often have equal performance, lock-free has the advantage of not holding a lock across preemption so other threads can continue to make progress.
+ With respect to the heap, these situations are unlikely unless all threads make extremely high use of dynamic-memory allocation, which can be an indication of poor design.
  Nevertheless, lock-free algorithms can reduce the number of context switches, since a thread does not yield/block while waiting for a lock;
- on the other hand, a thread may busy-wait for an unbounded period.
+ on the other hand, a thread may busy-wait for an unbounded period holding a processor.
  Finally, lock-free implementations have greater complexity and hardware dependency.
  Lock-free algorithms can be applied most easily to simple free-lists, \eg remote free-list, to allow lock-free insertion and removal from the head of a stack.
- Implementing lock-free operations for more complex data-structures (queue~\cite{Valois94}/deque~\cite{Sundell08}) is more complex.
+ Implementing lock-free operations for more complex data-structures (queue~\cite{Valois94}/deque~\cite{Sundell08}) is correspondingly more complex.
  Michael~\cite{Michael04} and Gidenstam \etal \cite{Gidenstam05} have created lock-free variations of the Hoard allocator.
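Lock-free insertion at the head of a stack, the easy case noted above, is a compare-and-swap retry loop; a C11-atomics sketch of push (pop additionally has to handle the ABA problem):

    #include <stdatomic.h>

    typedef struct node {
    	struct node *next;  // link stored in the freed object itself
    } node;

    static _Atomic(node *) top;  // top of the lock-free free list (Treiber stack)

    // At least one thread always succeeds the CAS, but an individual thread
    // may retry indefinitely, i.e., lock-free but not wait-free.
    void lf_push( node *n ) {
    	node *old = atomic_load_explicit( &top, memory_order_relaxed );
    	do {
    		n->next = old;  // old is refreshed by a failed CAS
    	} while ( !atomic_compare_exchange_weak_explicit(
    			&top, &old, n, memory_order_release, memory_order_relaxed ) );
    }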
doc/theses/mubeen_zulfiqar_MMath/intro.tex
r65781a8 → ra114743

…
  Attempts have been made to perform quasi garbage collection in C/\CC~\cite{Boehm88}, but it is a compromise.
  This thesis only examines dynamic memory-management with \emph{explicit} deallocation.
- While garbage collection and compaction are not part of this work, many of the results are applicable to the allocation phase in any memory-management approach.
+ While garbage collection and compaction are not part of this work, many of the work's results are applicable to the allocation phase in any memory-management approach.
  Most programs use a general-purpose allocator, often the one provided implicitly by the programming-language's runtime.
…
  \begin{enumerate}[leftmargin=*]
  \item
- Implementation of a new stand-alone concurrent low-latency memory-allocator ($\approx$1,200 lines of code) for C/\CC programs using kernel threads (1:1 threading), and specialized versions of the allocator for programming languages \uC and \CFA using user-level threads running over multiple kernel threads (M:N threading).
+ Implementation of a new stand-alone concurrent low-latency memory-allocator ($\approx$1,200 lines of code) for C/\CC programs using kernel threads (1:1 threading), and specialized versions of the allocator for the programming languages \uC and \CFA using user-level threads running over multiple kernel threads (M:N threading).
  \item
- Adopt returning of @nullptr@ for a zero-sized allocation, rather than an actual memory address, both of which can be passed to @free@.
+ Adopt @nullptr@ return for a zero-sized allocation, rather than an actual memory address, which can be passed to @free@.
  \item
- Extended the standard C heap functionality by preserving with each allocation its original request size versus the amount allocated, if an allocation is zero fill, and the allocation alignment.
+ Extend the standard C heap functionality by preserving with each allocation:
+ \begin{itemize}[itemsep=0pt]
+ \item
+ its request size plus the amount allocated,
+ \item
+ whether an allocation is zero fill,
+ \item
+ and allocation alignment.
+ \end{itemize}
  \item
- Use the zero fill and alignment as \emph{sticky} properties for @realloc@, to realign existing storage, or preserve existing zero-fill and alignment when storage is copied.
+ Use the preserved zero fill and alignment as \emph{sticky} properties for @realloc@ to zero-fill and align when storage is extended or copied.
  Without this extension, it is unsafe to @realloc@ storage initially allocated with zero-fill/alignment as these properties are not preserved when copying.
  This silent generation of a problem is unintuitive to programmers and difficult to locate because it is transient.
…
  @resize( oaddr, alignment, size )@ re-purpose an old allocation with new alignment but \emph{without} preserving fill.
  \item
- @realloc( oaddr, alignment, size )@ same as previous @realloc@ but adding or changing alignment.
+ @realloc( oaddr, alignment, size )@ same as @realloc@ but adding or changing alignment.
  \item
  @aalloc( dim, elemSize )@ same as @calloc@ except memory is \emph{not} zero filled.
…
  \item
- Provide additional heap wrapper functions in \CFA to provide a complete orthogonal set of allocation operations and properties.
+ Provide additional heap wrapper functions in \CFA creating an orthogonal set of allocation operations and properties.
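The extended routines in the list above could be exercised as in this sketch. In \CFA they overload the C names, which plain C cannot do, so the three-parameter prototypes here are illustrative only and the real declarations come from the allocator's headers.

    #include <stddef.h>

    // Illustrative prototypes matching the list above (sketch only).
    void *aalloc( size_t dim, size_t elemSize );                  // calloc without zero fill
    void *resize( void *oaddr, size_t alignment, size_t size );   // re-purpose, fill not preserved
    void *realloc( void *oaddr, size_t alignment, size_t size );  // realloc plus (re)alignment
    void free( void *addr );

    void example( void ) {
    	int *a = aalloc( 100, sizeof( int ) );      // 100 uninitialized ints
    	a = realloc( a, 64, 200 * sizeof( int ) );  // grow to 200 ints at 64-byte alignment;
    	                                            // sticky zero-fill/alignment carry over
    	free( a );
    }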
…
  \item
  @malloc_size( addr )@ returns the size of the memory allocation pointed-to by @addr@.
  \item
- @malloc_usable_size( addr )@ returns the usable size of the memory pointed-to by @addr@, i.e., the bin size containing the allocation, where @malloc_size( addr )@ $\le$ @malloc_usable_size( addr )@.
+ @malloc_usable_size( addr )@ returns the usable (total) size of the memory pointed-to by @addr@, i.e., the bin size containing the allocation, where @malloc_size( addr )@ $\le$ @malloc_usable_size( addr )@.
  \end{itemize}
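A short sketch of the relationship between the two query routines above; @malloc_usable_size@ matches the glibc routine of the same name, while @malloc_size@ is the extension described in the list, so its prototype here is taken from that description.

    #include <stdio.h>
    #include <stdlib.h>
    #include <assert.h>

    extern size_t malloc_size( void *addr );         // extension: original request size
    extern size_t malloc_usable_size( void *addr );  // total size of the containing bin

    int main( void ) {
    	char *p = malloc( 100 );  // request 100 bytes
    	// the bin holding the allocation is at least as large as the request
    	assert( malloc_size( p ) <= malloc_usable_size( p ) );
    	printf( "request %zu, usable %zu\n", malloc_size( p ), malloc_usable_size( p ) );
    	free( p );
    }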