Changes in / [13cdc8c:bdfd0bd]


Location:
doc/theses/mubeen_zulfiqar_MMath
Files:
1 deleted
3 edited

  • doc/theses/mubeen_zulfiqar_MMath/Makefile

    r13cdc8c rbdfd0bd  
    36 36   # File Dependencies #
    37 37
    38      - %.dvi : ${TeXSRC} ${FigSRC:%.fig=%.tex} ${PicSRC:%.fig=%.pstex} ${BIBSRC} Makefile | ${Build}
         38 + ${Build}/%.dvi : ${TeXSRC} ${FigSRC:%.fig=%.tex} ${PicSRC:%.fig=%.pstex} ${BIBSRC} Makefile | ${Build}
    39 39   	${LaTeX} ${BASE}
    40 40   	${BibTeX} ${Build}/${BASE}
  • doc/theses/mubeen_zulfiqar_MMath/background.tex

    r13cdc8c rbdfd0bd  
    34 34   \VRef[Figure]{f:AllocatorComponents} shows the two important data components for a memory allocator, management and storage, collectively called the \newterm{heap}.
    35 35   The \newterm{management data} is a data structure located at a known memory address and contains all information necessary to manage the storage data.
    36      - The management data starts with fixed-sized information in the static-data memory that references components in the dynamic-allocation memory.
         36 + The management data starts with fixed-sized information in the static-data memory that flows into the dynamic-allocation memory.
    37 37   The \newterm{storage data} is composed of allocated and freed objects, and \newterm{reserved memory}.
    38 38   Allocated objects (white) are variable sized, and allocated and maintained by the program;
     
    44 44   \label{f:AllocatorComponents}
    45 45   \end{figure}
    46      - Freed objects (light grey) represent memory deallocated by the program, which are linked into one or more lists facilitating easy location of new allocations.
         46 + Freed objects (light grey) are memory deallocated by the program, which are linked into one or more lists facilitating easy location for new allocations.
    47 47   Often the free list is chained internally so it does not consume additional storage, \ie the link fields are placed at known locations in the unused memory blocks.
    48 48   Reserved memory (dark grey) is one or more blocks of memory obtained from the operating system but not yet allocated to the program;
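The internal free-list chaining described in this hunk can be sketched in C. This is an illustrative sketch only (the names are not the thesis's implementation): the link field is laid over the first bytes of each freed block, so the list consumes no storage beyond the unused blocks themselves.

```c
#include <stddef.h>

// The first bytes of each freed block are reused as the link field, so the
// free list consumes no storage beyond the unused blocks themselves.
typedef struct FreeNode { struct FreeNode *next; } FreeNode;

static FreeNode *free_list = NULL;

// Chain a freed block onto the intrusive free list.
static void free_block(void *block) {
    FreeNode *node = (FreeNode *)block;   // link field placed inside the block
    node->next = free_list;
    free_list = node;
}

// Reuse a block from the free list; NULL when the list is empty.
static void *alloc_block(void) {
    FreeNode *node = free_list;
    if (node != NULL) free_list = node->next;
    return node;
}
```

Allocation then pops the most recently freed block first, a LIFO policy many allocators use because the block is likely still cached.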
     
    81 81   Fragmentation is memory requested from the operating system but not used by the program;
    82 82   hence, allocated objects are not fragmentation.
    83      - \VRef[Figure]{f:InternalExternalFragmentation} shows fragmentation is divided into two forms: internal or external.
         83 + \VRef[Figure]{f:InternalExternalFragmentation} shows fragmentation is divided into two forms: internal or external.
    84 84
    85 85   \begin{figure}
     
    96 96   An allocator should strive to keep internal management information to a minimum.
    97 97
    98      - \newterm{External fragmentation} is all memory space reserved from the operating system but not allocated to the program~\cite{Wilson95,Lim98,Siebert00}, which includes all external management data, freed objects, and reserved memory.
         98 + \newterm{External fragmentation} is all memory space reserved from the operating system but not allocated to the program~\cite{Wilson95,Lim98,Siebert00}, which includes freed objects, all external management data, and reserved memory.
    99 99   This memory is problematic in two ways: heap blowup and highly fragmented memory.
    100 100 \newterm{Heap blowup} occurs when memory freed by the program is not reused for future allocations leading to potentially unbounded external fragmentation growth~\cite{Berger00}.
     
    125 125 \end{figure}
    126 126
    127     - For a single-threaded memory allocator, three basic approaches for controlling fragmentation are identified~\cite{Johnstone99}.
        127 + For a single-threaded memory allocator, three basic approaches for controlling fragmentation have been identified~\cite{Johnstone99}.
    128 128 The first approach is a \newterm{sequential-fit algorithm} with one list of free objects that is searched for a block large enough to fit a requested object size.
    129 129 Different search policies determine the free object selected, \eg the first free object large enough or closest to the requested size.
     
    132 132
    133 133 The second approach is a \newterm{segregated} or \newterm{binning algorithm} with a set of lists for different sized freed objects.
    134     - When an object is allocated, the requested size is rounded up to the nearest bin-size, often leading to spacing after the object.
        134 + When an object is allocated, the requested size is rounded up to the nearest bin-size, possibly with spacing after the object.
    135 135 A binning algorithm is fast at finding free memory of the appropriate size and allocating it, since the first free object on the free list is used.
    136 136 The fewer bin-sizes, the fewer lists need to be searched and maintained;
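The bin-size rounding in this hunk can be sketched in C. Power-of-two bins with a 16-byte minimum are one common choice (an assumption for illustration, not necessarily the thesis's bin spacing); real allocators often add finer spacings to shrink the spacing left after the object.

```c
#include <stddef.h>

// Round a request up to the nearest bin size.  Power-of-two bins with a
// 16-byte minimum are assumed here for illustration; the gap between the
// request and the returned bin size is internal fragmentation.
static size_t bin_size(size_t request) {
    size_t size = 16;
    while (size < request) size <<= 1;
    return size;
}
```

For example, a 100-byte request falls in the 128-byte bin, leaving 28 bytes of spacing after the object.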
     
    158 158 Temporal locality commonly occurs during an iterative computation with a fixed set of disjoint variables, while spatial locality commonly occurs when traversing an array.
    159 159
    160     - Hardware takes advantage of temporal and spatial locality through multiple levels of caching, \ie memory hierarchy.
        160 + Hardware takes advantage of temporal and spatial locality through multiple levels of caching (\ie memory hierarchy).
    161 161 When an object is accessed, the memory physically located around the object is also cached with the expectation that the current and nearby objects will be referenced within a short period of time.
    162 162 For example, entire cache lines are transferred between memory and cache and entire virtual-memory pages are transferred between disk and memory.
     
    171 171
    172 172 There are a number of ways a memory allocator can degrade locality by increasing the working set.
    173     - For example, a memory allocator may access multiple free objects before finding one to satisfy an allocation request, \eg sequential-fit algorithm.
        173 + For example, a memory allocator may access multiple free objects before finding one to satisfy an allocation request (\eg sequential-fit algorithm).
    174 174 If there are a (large) number of objects accessed in very different areas of memory, the allocator may perturb the program's memory hierarchy causing multiple cache or page misses~\cite{Grunwald93}.
    175 175 Another way locality can be degraded is by spatially separating related data.
     
    181 181
    182 182 A multi-threaded memory-allocator does not run any threads itself, but is used by a multi-threaded program.
    183     - In addition to single-threaded design issues of fragmentation and locality, a multi-threaded allocator is simultaneously accessed by multiple threads, and hence, must deal with concurrency issues such as mutual exclusion, false sharing, and additional forms of heap blowup.
        183 + In addition to single-threaded design issues of locality and fragmentation, a multi-threaded allocator may be simultaneously accessed by multiple threads, and hence, must deal with concurrency issues such as mutual exclusion, false sharing, and additional forms of heap blowup.
    184 184
    185 185
     
    192 192 Second is when multiple threads contend for a shared resource simultaneously, and hence, some threads must wait until the resource is released.
    193 193 Contention can be reduced in a number of ways:
    194     - \begin{itemize}[itemsep=0pt]
    195     - \item
    196 194 using multiple fine-grained locks versus a single lock, spreading the contention across a number of locks;
    197     - \item
    198 195 using trylock and generating new storage if the lock is busy, yielding a classic space versus time tradeoff;
    199     - \item
    200 196 using one of the many lock-free approaches for reducing contention on basic data-structure operations~\cite{Oyama99}.
    201     - \end{itemize}
    202     - However, all of these approaches have degenerate cases where program contention is high, which occurs outside of the allocator.
        197 + However, all of these approaches have degenerate cases where contention occurs.
    203 198
    204 199
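The trylock technique in this hunk can be sketched with POSIX threads. This is one possible shape only: `shared_free_list_pop` is a hypothetical helper left as a comment, and falling back to `malloc` stands in for generating new storage.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

// Try the shared heap first; if its lock is busy, take fresh storage
// instead of waiting -- the classic space versus time tradeoff.
static void *try_alloc(size_t size) {
    void *p = NULL;
    if (pthread_mutex_trylock(&heap_lock) == 0) {
        // p = shared_free_list_pop(size);   // hypothetical: reuse freed storage
        pthread_mutex_unlock(&heap_lock);
    }
    if (p == NULL)
        p = malloc(size);   // lock busy or nothing reusable: new storage
    return p;
}
```

The space cost is that storage freed by other threads may sit unused while contended threads keep requesting new memory.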
     
    280 275 \label{s:MultipleHeaps}
    281 276
    282     - A multi-threaded allocator has potentially multiple threads and heaps.
        277 + A single-threaded allocator has at most one thread and heap, while a multi-threaded allocator has potentially multiple threads and heaps.
    283 278 The multiple threads cause complexity, and multiple heaps are a mechanism for dealing with the complexity.
    284 279 The spectrum ranges from multiple threads using a single heap, denoted as T:1 (see \VRef[Figure]{f:SingleHeap}), to multiple threads sharing multiple heaps, denoted as T:H (see \VRef[Figure]{f:SharedHeaps}), to one thread per heap, denoted as 1:1 (see \VRef[Figure]{f:PerThreadHeap}), which is almost back to a single-threaded allocator.
     
    344 339 An alternative implementation is for all heaps to share one reserved memory, which requires a separate lock for the reserved storage to ensure mutual exclusion when acquiring new memory.
    345 340 Because multiple threads can allocate/free/reallocate adjacent storage, all forms of false sharing may occur.
    346     - Other storage-management options are to use @mmap@ to set aside (large) areas of virtual memory for each heap and suballocate each heap's storage within that area, pushing part of the storage management complexity back to the operating system.
        341 + Other storage-management options are to use @mmap@ to set aside (large) areas of virtual memory for each heap and suballocate each heap's storage within that area.
    347 342
    348 343 \begin{figure}
     
    373 368
    374 369
    375     - \paragraph{1:1 model (thread heaps)} where each thread has its own heap eliminating most contention and locking because threads seldom access another thread's heap (see ownership in \VRef{s:Ownership}).
        370 + \paragraph{1:1 model (thread heaps)} where each thread has its own heap, which eliminates most contention and locking because threads seldom access another thread's heap (see ownership in \VRef{s:Ownership}).
    376 371 An additional benefit of thread heaps is improved locality due to better memory layout.
    377 372 As each thread only allocates from its heap, all objects for a thread are consolidated in the storage area for that heap, better utilizing each CPU's cache and accessing fewer pages.
     
    385 380 Second is to place the thread heap on a list of available heaps and reuse it for a new thread in the future.
    386 381 Destroying the thread heap immediately may reduce external fragmentation sooner, since all free objects are freed to the global heap and may be reused by other threads.
    387     - Alternatively, reusing thread heaps may improve performance if the inheriting thread makes similar allocation requests as the thread that previously held the thread heap because any unfreed storage is immediately accessible..
        382 + Alternatively, reusing thread heaps may improve performance if the inheriting thread makes similar allocation requests as the thread that previously held the thread heap.
    388 383
    389 384
     
    393 388 However, an important goal of user-level threading is for fast operations (creation/termination/context-switching) by not interacting with the operating system, which allows the ability to create large numbers of high-performance interacting threads ($>$ 10,000).
    394 389 It is difficult to retain this goal, if the user-threading model is directly involved with the heap model.
    395     - \VRef[Figure]{f:UserLevelKernelHeaps} shows that virtually all user-level threading systems use whatever kernel-level heap-model is provided by the language runtime.
        390 + \VRef[Figure]{f:UserLevelKernelHeaps} shows that virtually all user-level threading systems use whatever kernel-level heap-model is provided by the language runtime.
    396 391 Hence, a user thread allocates/deallocates from/to the heap of the kernel thread on which it is currently executing.
    397 392
     
    405 400 Adopting this model results in a subtle problem with shared heaps.
    406 401 With kernel threading, an operation that is started by a kernel thread is always completed by that thread.
    407     - For example, if a kernel thread starts an allocation/deallocation on a shared heap, it always completes that operation with that heap even if preempted, \ie any locking correctness associated with the shared heap is preserved across preemption.
        402 + For example, if a kernel thread starts an allocation/deallocation on a shared heap, it always completes that operation with that heap even if preempted.
        403 + Any locking correctness associated with the shared heap is preserved across preemption.
    408 404
    409 405 However, this correctness property is not preserved for user-level threading.
     
    413 409 However, eagerly disabling/enabling time-slicing on the allocation/deallocation fast path is expensive, because preemption is rare (10--100 milliseconds).
    414 410 Instead, techniques exist to lazily detect this case in the interrupt handler, abort the preemption, and return to the operation so it can complete atomically.
    415     - Occasionally ignoring a preemption should be benign, but a persistent lack of preemption can result in both short and long term starvation.
        411 + Occasionally ignoring a preemption should be benign.
    416 412
    417 413
     
    434 430
    435 431 \newterm{Ownership} defines which heap an object is returned-to on deallocation.
    436     - If a thread returns an object to the heap it was originally allocated from, a heap has ownership of its objects.
    437     - Alternatively, a thread can return an object to the heap it is currently associated with, which can be any heap accessible during a thread's lifetime.
        432 + If a thread returns an object to the heap it was originally allocated from, the heap has ownership of its objects.
        433 + Alternatively, a thread can return an object to the heap it is currently allocating from, which can be any heap accessible during a thread's lifetime.
    438 434 \VRef[Figure]{f:HeapsOwnership} shows an example of multiple heaps (minus the global heap) with and without ownership.
    439 435 Again, the arrows indicate the direction memory conceptually moves for each kind of operation.
     
    543 539 Only with the 1:1 model and ownership is active and passive false-sharing avoided (see \VRef{s:Ownership}).
    544 540 Passive false-sharing may still occur, if delayed ownership is used.
    545     - Finally, a completely free container can become reserved storage and be reset to allocate objects of a new size or freed to the global heap.
    546 541
    547 542 \begin{figure}
     
    558 553 \caption{Free-list Structure with Container Ownership}
    559 554 \end{figure}
        555 +
        556 + A fragmented heap has multiple containers that may be partially or completely free.
        557 + A completely free container can become reserved storage and be reset to allocate objects of a new size.
        558 + When a heap reaches a threshold of free objects, it moves some free storage to the global heap for reuse to prevent heap blowup.
        559 + Without ownership, when a heap frees objects to the global heap, individual objects must be passed, and placed on the global-heap's free-list.
        560 + Containers cannot be freed to the global heap unless completely free because
    560 561
    561 562 When a container changes ownership, the ownership of all objects within it change as well.
     
    568 569 Note, once the object is freed by Task$_1$, no more false sharing can occur until the container changes ownership again.
    569 570 To prevent this form of false sharing, container movement may be restricted to when all objects in the container are free.
    570     - One implementation approach that increases the freedom to return a free container to the operating system involves allocating containers using a call like @mmap@, which allows memory at an arbitrary address to be returned versus only storage at the end of the contiguous @sbrk@ area, again pushing storage management complexity back to the operating system.
        571 + One implementation approach that increases the freedom to return a free container to the operating system involves allocating containers using a call like @mmap@, which allows memory at an arbitrary address to be returned versus only storage at the end of the contiguous @sbrk@ area.
    571 572
    572 573 \begin{figure}
     
    699 700 \end{figure}
    700 701
    701     - As mentioned, an implementation may have only one heap interact with the global heap, so the other heap can be simplified.
        702 + As mentioned, an implementation may have only one heap deal with the global heap, so the other heap can be simplified.
    702 703 For example, if only the private heap interacts with the global heap, the public heap can be reduced to a lock-protected free-list of objects deallocated by other threads due to ownership, called a \newterm{remote free-list}.
    703 704 To avoid heap blowup, the private heap allocates from the remote free-list when it reaches some threshold or it has no free storage.
     
    720 721 An allocation buffer is reserved memory (see~\VRef{s:AllocatorComponents}) not yet allocated to the program, and is used for allocating objects when the free list is empty.
    721 722 That is, rather than requesting new storage for a single object, an entire buffer is requested from which multiple objects are allocated later.
    722     - Any heap may use an allocation buffer, resulting in allocation from the buffer before requesting objects (containers) from the global heap or operating system, respectively.
        723 + Any heap may use an allocation buffer, resulting in allocation from the buffer before requesting objects (containers) from the global heap or operating system, respectively.
    723 724 The allocation buffer reduces contention and the number of global/operating-system calls.
    724 725 For coalescing, a buffer is split into smaller objects by allocations, and recomposed into larger buffer areas during deallocations.
    725 726
    726     - Allocation buffers are useful initially when there are no freed objects in a heap because many allocations usually occur when a thread starts (simple bump allocation).
        727 + Allocation buffers are useful initially when there are no freed objects in a heap because many allocations usually occur when a thread starts.
    727 728 Furthermore, to prevent heap blowup, objects should be reused before allocating a new allocation buffer.
    728     - Thus, allocation buffers are often allocated more frequently at program/thread start, and then allocations often diminish.
        729 + Thus, allocation buffers are often allocated more frequently at program/thread start, and then their use often diminishes.
    729 730
    730 731 Using an allocation buffer with a thread heap avoids active false-sharing, since all objects in the allocation buffer are allocated to the same thread.
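The bump allocation mentioned in the removed line of this hunk can be sketched in C: objects are carved off the front of a buffer by advancing an offset, with no free list consulted (an illustrative sketch; the static array stands in for a buffer reserved from the operating system).

```c
#include <stddef.h>

enum { BUFFER_SIZE = 4096 };
static char buffer[BUFFER_SIZE];   // stand-in for reserved memory from the OS
static size_t bump = 0;            // offset of the next free byte

// Bump allocation: carve objects off the front of the buffer; no free list
// is searched, which is why buffers are fast when a thread starts allocating.
static void *buffer_alloc(size_t size) {
    size = (size + 15) & ~(size_t)15;            // keep carvings 16-byte spaced
    if (bump + size > BUFFER_SIZE) return NULL;  // buffer exhausted
    void *p = &buffer[bump];
    bump += size;
    return p;
}
```

Note there is no corresponding free: storage is only recovered when the whole buffer is recomposed, matching the coalescing remark above.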
     
    745 746 \label{s:LockFreeOperations}
    746 747
    747     - A \newterm{lock-free algorithm} guarantees safe concurrent-access to a data structure, so that at least one thread makes progress, but an individual task has no execution bound and may starve~\cite[pp.~745--746]{Herlihy93}.
    748     - (A \newterm{wait-free algorithm} puts a bound on the number of steps any thread takes to complete an operation to prevent starvation.)
        748 + A lock-free algorithm guarantees safe concurrent-access to a data structure, so that at least one thread can make progress in the system, but an individual task has no bound to execution, and hence, may starve~\cite[pp.~745--746]{Herlihy93}.
        749 + % A wait-free algorithm puts a finite bound on the number of steps any thread takes to complete an operation, so an individual task cannot starve
    749 750 Lock-free operations can be used in an allocator to reduce or eliminate the use of locks.
    750     - While locks and lock-free data-structures often have equal performance, lock-free has the advantage of not holding a lock across preemption so other threads can continue to make progress.
    751     - With respect to the heap, these situations are unlikely unless all threads make extremely high use of dynamic-memory allocation, which can be an indication of poor design.
        751 + Locks are a problem for high contention or if the thread holding the lock is preempted and other threads attempt to use that lock.
        752 + With respect to the heap, these situations are unlikely unless all threads make extremely high use of dynamic-memory allocation, which can be an indication of poor design.
    752 753 Nevertheless, lock-free algorithms can reduce the number of context switches, since a thread does not yield/block while waiting for a lock;
    753     - on the other hand, a thread may busy-wait for an unbounded period holding a processor.
        754 + on the other hand, a thread may busy-wait for an unbounded period.
    754 755 Finally, lock-free implementations have greater complexity and hardware dependency.
    755 756 Lock-free algorithms can be applied most easily to simple free-lists, \eg remote free-list, to allow lock-free insertion and removal from the head of a stack.
    756     - Implementing lock-free operations for more complex data-structures (queue~\cite{Valois94}/deque~\cite{Sundell08}) is correspondinglyy more complex.
        757 + Implementing lock-free operations for more complex data-structures (queue~\cite{Valois94}/deque~\cite{Sundell08}) is more complex.
    757 758 Michael~\cite{Michael04} and Gidenstam \etal \cite{Gidenstam05} have created lock-free variations of the Hoard allocator.
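The lock-free insertion at the head of a free-list stack mentioned in this hunk can be sketched with C11 atomics. Removal is deliberately omitted: a naive CAS-based pop suffers the classic ABA problem and needs version tags or hazard pointers, illustrating the extra complexity the text notes.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node { struct Node *next; } Node;
static _Atomic(Node *) top = NULL;

// Lock-free push onto the head of the free-list stack.  On a failed
// compare-exchange, `old` is refreshed with the current top and the
// loop retries -- at least one thread always succeeds.
static void lockfree_push(Node *n) {
    Node *old = atomic_load(&top);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&top, &old, n));
}
```

This is exactly the shape of a remote free-list insert: a remote thread pushes a freed object without taking the owning heap's lock.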
  • doc/theses/mubeen_zulfiqar_MMath/intro.tex

    r13cdc8c rbdfd0bd  
    48 48   Attempts have been made to perform quasi garbage collection in C/\CC~\cite{Boehm88}, but it is a compromise.
    49 49   This thesis only examines dynamic memory-management with \emph{explicit} deallocation.
    50      - While garbage collection and compaction are not part this work, many of the work's results are applicable to the allocation phase in any memory-management approach.
         50 + While garbage collection and compaction are not part of this work, many of the results are applicable to the allocation phase in any memory-management approach.
    51 51
    52 52   Most programs use a general-purpose allocator, often the one provided implicitly by the programming-language's runtime.
     
    65 65   \begin{enumerate}[leftmargin=*]
    66 66   \item
    67      - Implementation of a new stand-lone concurrent low-latency memory-allocator ($\approx$1,200 lines of code) for C/\CC programs using kernel threads (1:1 threading), and specialized versions of the allocator for the programming languages \uC and \CFA using user-level threads running over multiple kernel threads (M:N threading).
    68      -
    69      - \item
    70      - Adopt @nullptr@ return for a zero-sized allocation, rather than an actual memory address, which can be passed to @free@.
    71      -
    72      - \item
    73      - Extend the standard C heap functionality by preserving with each allocation:
    74      - \begin{itemize}[itemsep=0pt]
    75      - \item
    76      - its request size plus the amount allocated,
    77      - \item
    78      - whether an allocation is zero fill,
    79      - \item
    80      - and allocation alignment.
    81      - \end{itemize}
    82      -
    83      - \item
    84      - Use the preserved zero fill and alignment as \emph{sticky} properties for @realloc@ to zero-fill and align when storage is extended or copied.
         67 + Implementation of a new stand-alone concurrent low-latency memory-allocator ($\approx$1,200 lines of code) for C/\CC programs using kernel threads (1:1 threading), and specialized versions of the allocator for programming languages \uC and \CFA using user-level threads running over multiple kernel threads (M:N threading).
         68 +
         69 + \item
         70 + Adopt returning of @nullptr@ for a zero-sized allocation, rather than an actual memory address, both of which can be passed to @free@.
         71 +
         72 + \item
         73 + Extend the standard C heap functionality by preserving with each allocation its original request size versus the amount allocated, if an allocation is zero fill, and the allocation alignment.
         74 +
         75 + \item
         76 + Use the zero fill and alignment as \emph{sticky} properties for @realloc@, to realign existing storage, or preserve existing zero-fill and alignment when storage is copied.
    85 77   Without this extension, it is unsafe to @realloc@ storage initially allocated with zero-fill/alignment as these properties are not preserved when copying.
    86 78   This silent generation of a problem is unintuitive to programmers and difficult to locate because it is transient.
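The hazard motivating the sticky properties can be demonstrated with standard C (assuming a C11 libc): `aligned_alloc` guarantees the requested alignment, but standard `realloc` only guarantees fundamental alignment, so a copy may silently lose the stronger alignment.

```c
#include <stdlib.h>
#include <stdint.h>

// C11 aligned_alloc guarantees 64-byte alignment here, but standard realloc
// only guarantees fundamental alignment, so the (possibly moved) storage may
// no longer be 64-byte aligned -- the transient problem sticky properties fix.
static int demo_alignment_hazard(void) {
    void *p = aligned_alloc(64, 64);           // 64-byte aligned storage
    if (p == NULL) return -1;
    int aligned_before = (uintptr_t)p % 64 == 0;   // guaranteed: 1
    void *q = realloc(p, 4096);                // alignment NOT preserved
    if (q == NULL) { free(p); return -1; }
    free(q);
    return aligned_before;
}
```

The same loss applies to zero fill: `realloc` of `calloc`'d storage does not zero the extension, which is why the extension must be remembered per allocation.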
     
    94 86   @resize( oaddr, alignment, size )@ re-purpose an old allocation with new alignment but \emph{without} preserving fill.
    95 87   \item
    96      - @realloc( oaddr, alignment, size )@ same as @realloc@ but adding or changing alignment.
         88 + @realloc( oaddr, alignment, size )@ same as previous @realloc@ but adding or changing alignment.
    97 89   \item
    98 90   @aalloc( dim, elemSize )@ same as @calloc@ except memory is \emph{not} zero filled.
     
    104 96
    105 97  \item
    106     - Provide additional heap wrapper functions in \CFA creating an orthogonal set of allocation operations and properties.
         98 + Provide additional heap wrapper functions in \CFA to create a complete orthogonal set of allocation operations and properties.
    107 99
    108 100 \item
     
    117 109 @malloc_size( addr )@ returns the size of the memory allocation pointed-to by @addr@.
    118 110 \item
    119     - @malloc_usable_size( addr )@ returns the usable (total) size of the memory pointed-to by @addr@, i.e., the bin size containing the allocation, where @malloc_size( addr )@ $\le$ @malloc_usable_size( addr )@.
        111 + @malloc_usable_size( addr )@ returns the usable size of the memory pointed-to by @addr@, i.e., the bin size containing the allocation, where @malloc_size( addr )@ $\le$ @malloc_usable_size( addr )@.
    120 112 \end{itemize}
    121 113
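The usable-size relation in this hunk can be observed with glibc's existing @malloc_usable_size@ (declared in `<malloc.h>`, glibc-specific); @malloc_size@ returning the original request size is the thesis's extension and is not assumed here.

```c
#include <malloc.h>   // glibc-specific: malloc_usable_size
#include <stdlib.h>

// On glibc, malloc_usable_size reports the size of the chunk/bin holding
// the allocation, so it is never less than the requested size; the
// difference is the internal fragmentation after the object.
static size_t usable_for(size_t request) {
    void *p = malloc(request);
    if (p == NULL) return 0;
    size_t usable = malloc_usable_size(p);
    free(p);
    return usable;
}
```

This is the inequality stated above: the request size is a lower bound on the usable (bin) size.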