# Changeset fb6691a for doc/theses

Timestamp:
May 20, 2022, 2:48:24 PM
Branches:
Children:
598dc68
Parents:
25fa20a
Message:

final proofread of Mubeen's MMath thesis

Location:
doc/theses/mubeen_zulfiqar_MMath
Files:
6 edited

• ## doc/theses/mubeen_zulfiqar_MMath/allocator.tex

Need to prevent preemption during a dynamic memory operation because of the \newterm{serially-reusable problem}.
\begin{quote}
A sequence of code that is guaranteed to run to completion before being invoked to accept another input is called serially-reusable code.~\cite{SeriallyReusable}\label{p:SeriallyReusable}
\end{quote}
If a KT is preempted during an allocation operation, the operating system can schedule another KT on the same CPU, which can begin an allocation operation before the previous operation associated with this CPU has completed, invalidating heap correctness.
(See \VRef[Figure]{f:THSharedHeaps} but with a heap bucket per KT and no bucket or local-pool lock.)
Hence, immediately after a KT starts, its heap is created and just before a KT terminates, its heap is (logically) deleted.
Heaps are uncontended for a KT's memory operations as every KT has its own thread-local heap, modulo operations on the global pool and ownership.

Problems:

Each heap uses segregated free-buckets that have free objects distributed across 91 different sizes from 16 to 4M.
All objects in a bucket are of the same size.
The number of buckets used is determined dynamically depending on the crossover point from @sbrk@ to @mmap@ allocation using @mallopt( M_MMAP_THRESHOLD )@, \ie small objects managed by the program and large objects managed by the operating system.
Each free bucket of a specific size has the following two lists:
\end{center}
The storage between @E@ and @H@ is chained onto the appropriate free list for future allocations.
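The per-KT heap arrangement above can be sketched in C. This is a minimal illustration under assumed names (@Heap@, @currentHeap@), not llheap's actual code: each kernel thread lazily creates a private heap on first use, so the fastpath touches only thread-local state and needs no lock.

```c
#include <stdlib.h>
#include <stddef.h>

// Hypothetical per-KT heap; names and layout are illustrative, not llheap's.
typedef struct Heap {
	void * freeBuckets[91];          // segregated free lists, sizes 16 bytes to 4M
	struct Heap * next;              // link onto global heap list (statistics mode)
} Heap;

static _Thread_local Heap * myHeap = NULL;   // one heap per kernel thread

Heap * currentHeap( void ) {         // lazy creation just after the KT starts
	if ( myHeap == NULL ) myHeap = (Heap *)calloc( 1, sizeof(Heap) );
	return myHeap;                   // uncontended: no other KT shares this heap
}
```

Because @myHeap@ is thread-local, two calls from the same kernel thread return the same heap, while each new kernel thread transparently gets its own.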
The same approach is used for sufficiently large free blocks, where @E@ is the start of the free block, and any unused storage before @H@ or after the allocated object becomes free storage.
In this approach, the aligned address @A@ is the same as the allocated storage address @P@, \ie @P@ $=$ @A@ for all allocation routines, which simplifies deallocation.
However, if there are a large number of aligned requests, this approach leads to memory fragmentation from the small free areas around the aligned object.
\end{description}

The second field remembers the request size versus the allocation (bucket) size, \eg request 42 bytes which is rounded up to 64 bytes.
Since programmers think in request sizes rather than allocation sizes, the request size allows better generation of statistics or errors and also helps in memory management.

The low-order 3 bits of the first field are \emph{unused} for any stored values as these values are 16-byte aligned by default, whereas the second field may use all of its bits.
The 3 unused bits are used to represent mapped allocation, zero filled, and alignment, respectively.
Note, the alignment bit is not used in the normal header and the zero-filled/mapped bits are not used in the fake header.

To locate all statistic counters, heaps are linked together in statistics mode, and this list is locked and traversed to sum all counters across heaps.
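The bit-packing of the header's low-order bits described above can be sketched as follows; the exact bit assignments are illustrative (an assumption), chosen only to show the technique of reusing the free low bits of a 16-byte-aligned value.

```c
#include <stddef.h>

// Because the first header field's values are 16-byte aligned, the low-order
// 3 bits are free to store flags.  Bit positions here are illustrative.
enum { MAPPED_BIT = 0x1, ZEROED_BIT = 0x2, ALIGNED_BIT = 0x4, FLAG_MASK = 0x7 };

size_t packField( size_t alignedValue, size_t flags ) {
	return alignedValue | flags;                 // low 3 bits of alignedValue are 0
}
size_t fieldValue( size_t field ) { return field & ~ (size_t)FLAG_MASK; }
int isMapped( size_t field ) { return ( field & MAPPED_BIT ) != 0; }
int isZeroed( size_t field ) { return ( field & ZEROED_BIT ) != 0; }
```

Unpacking is a single mask operation, so the flags cost no extra header space.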
Note, the list is locked to prevent errors traversing an active list;
the statistics counters are not locked and can flicker during accumulation.
\VRef[Figure]{f:StatiticsOutput} shows an example of statistics output, which covers all allocation operations and information about deallocating storage not owned by a thread.
No other memory allocator studied provides as comprehensive statistical information.

\label{s:UserlevelThreadingSupport}
The serially-reusable problem (see \VPageref{p:SeriallyReusable}) occurs for kernel threads in the ``T:H'' model, where H = number of CPUs, and for user threads in the ``1:1'' model, where llheap uses the ``1:1'' model.
The solution is to prevent interrupts that can result in a CPU or KT change during operations that are logically critical sections, such as starting a memory operation on one KT and completing it on another.
Locking these critical sections negates any attempt for a quick fastpath and results in high contention.
For user-level threading, the serially-reusable problem appears with time slicing for preemptable scheduling, as the signal handler context switches to another user-level thread.
Without time slicing, a user thread performing a long computation can prevent the execution of (starve) other threads.
To prevent starvation for a memory-allocation-intensive thread, \ie the time slice always triggers in an allocation critical-section for one thread so the thread never gets time sliced, a thread-local \newterm{rollforward} flag is set in the signal handler when it aborts a time slice.
The rollforward flag is tested at the end of each allocation funnel routine (see \VPageref{p:FunnelRoutine}), and if set, it is reset and a voluntary yield (context switch) is performed to allow other threads to execute.

llheap uses two techniques to detect when execution is in an allocation operation or a routine called from an allocation operation, to abort any time slice during this period.
On the slowpath, when executing expensive operations like @sbrk@ or @mmap@, interrupts are disabled/enabled by setting kernel-thread-local flags so the signal handler aborts immediately.
On the fastpath, disabling/enabling interrupts is too expensive as accessing kernel-thread-local storage can be expensive and not user-thread-safe.
For example, the ARM processor stores the thread-local pointer in a coprocessor register that cannot perform atomic base-displacement addressing.
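The rollforward mechanism above can be sketched in C. This is a hedged illustration with invented names (@inAlloc@, @rollforward@, @allocEpilogue@), not llheap's code, and the real signal handler and context switch are reduced to comments.

```c
#include <stddef.h>

// Sketch of the rollforward technique: the time-slice handler aborts if an
// allocation is in progress, and the allocator yields voluntarily on exit.
static _Thread_local volatile int inAlloc = 0;      // set during an allocation critical section
static _Thread_local volatile int rollforward = 0;  // deferred time slice pending

void timeSliceHandler( void ) {      // stands in for the real signal handler
	if ( inAlloc ) {
		rollforward = 1;             // abort the time slice, remember to yield later
		return;
	}
	// otherwise: context switch to another user-level thread
}

void allocEpilogue( void ) {         // end of each allocation funnel routine
	inAlloc = 0;
	if ( rollforward ) {
		rollforward = 0;
		// voluntary yield (context switch) would happen here
	}
}
```

The flag test on the exit path is a plain thread-local load, so the fastpath pays almost nothing when no time slice was aborted.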
Hence, there is a window between loading the kernel-thread-local pointer from the coprocessor register into a normal register and adding the displacement when a time slice can move a thread.
The fast technique (with lower run-time cost) is to define a special code section and place all non-interruptible routines in this section.

Programs can be statically or dynamically linked.
\item
The order in which the linker schedules startup code is poorly supported, so it cannot be controlled entirely.
\item
Knowing a KT's start and end independently from the KT code is difficult.

The problem is getting initialization done before the first allocator call.
However, there does not seem to be a mechanism to tell either the static or dynamic loader to first perform initialization code before any calls to a loaded library.
Also, initialization code of other libraries and the run-time environment may call memory allocation routines such as \lstinline{malloc}.
This compounds the situation as there is no mechanism to tell either the static or dynamic loader to first perform the initialization code of the memory allocator before any other initialization that may involve a dynamic memory allocation call.
As a result, calls to allocation routines occur without initialization.
To deal with this problem, it is necessary to put a conditional initialization check along the allocation fastpath to trigger initialization (singleton pattern).

\paragraph{\lstinline{void * aalloc( size_t dim, size_t elemSize )}}
extends @calloc@ for allocating a dynamic array of objects without calculating the total size of the array explicitly, but \emph{without} zero-filling the memory.
@aalloc@ is significantly faster than @calloc@, which is the only alternative given by the standard memory-allocation routines.

\noindent\textbf{Usage}

\paragraph{\lstinline{T * alloc( ... )} or \lstinline{T * alloc( size_t dim, ... )}}
is overloaded with a variable number of specific allocation operations, or an integer dimension parameter followed by a variable number of specific allocation operations.
These allocation operations can be passed as named arguments when calling the \lstinline{alloc} routine.
A call without parameters returns a dynamically allocated object of type @T@ (@malloc@).
A call with only the dimension (dim) parameter returns a dynamically allocated array of objects of type @T@ (@aalloc@).
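An @aalloc@-style routine can be sketched as follows; @my_aalloc@ is a stand-in name (this is not llheap's implementation). The dimension and element size are multiplied with overflow checking, as @calloc@ must do, but without @calloc@'s zero-filling.

```c
#include <stdlib.h>
#include <stdint.h>

// Hypothetical aalloc-style routine: array allocation with overflow
// checking but no zero fill.
void * my_aalloc( size_t dim, size_t elemSize ) {
	if ( elemSize != 0 && dim > SIZE_MAX / elemSize ) return NULL;  // dim * elemSize overflows
	return malloc( dim * elemSize );  // unlike calloc, storage is not zeroed
}
```

Skipping the zero fill is where the speedup over @calloc@ comes from: for large arrays that the program fully initializes anyway, zeroing is pure overhead.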
• ## doc/theses/mubeen_zulfiqar_MMath/background.tex

The \newterm{storage data} is composed of allocated and freed objects, and \newterm{reserved memory}.
Allocated objects (light grey) are variable sized, and are allocated and maintained by the program;
\ie only the memory allocator knows the location of allocated storage, not the program.
\begin{figure}[h]
\centering
if there are multiple reserved blocks, they are also chained together, usually internally.

In some allocator designs, allocated and freed objects have additional management data embedded within them.
\VRef[Figure]{f:AllocatedObject} shows an allocated object with a header, trailer, and alignment padding and spacing around the object.
The header contains information about the object, \eg size, type, etc.

\VRef[Figure]{f:MemoryFragmentation} shows an example of how a small block of memory fragments as objects are allocated and deallocated over time.
Blocks of free memory become smaller and non-contiguous making them less useful in serving allocation requests.
Memory is highly fragmented when most free blocks are unusable because of their sizes.
For example, \VRef[Figure]{f:Contiguous} and \VRef[Figure]{f:HighlyFragmented} have the same quantity of external fragmentation, but \VRef[Figure]{f:HighlyFragmented} is highly fragmented.
If there is a request to allocate a large object, \VRef[Figure]{f:Contiguous} is more likely to be able to satisfy it with existing free memory, while \VRef[Figure]{f:HighlyFragmented} likely has to request more memory from the operating system.

For example, multiple heaps are managed in a pool, starting with a single or a fixed number of heaps that increase\-/decrease depending on contention\-/space issues.
At creation, a thread is associated with a heap from the pool.
In some implementations of this model, when the thread attempts an allocation and its associated heap is locked (contention), it scans for an unlocked heap in the pool.
If an unlocked heap is found, the thread changes its association and uses that heap.
If all heaps are locked, the thread may create a new heap, use it, and then place the new heap into the pool;

Multiple heaps increase external fragmentation as the ratio of heaps to threads increases, which can lead to heap blowup.
The external fragmentation experienced by a program with a single heap is now multiplied by the number of heaps, since each heap manages its own free storage and allocates its own reserved memory.
Additionally, objects freed by one heap cannot be reused by other threads without increasing the cost of the memory operations, except indirectly by returning free memory to the operating system, which can be expensive.
Depending on how the operating system provides dynamic storage to an application, returning storage may be difficult or impossible, \eg the contiguous @sbrk@ area in Unix.
In the worst case, a program in which objects are allocated from one heap but deallocated to another heap means these freed objects are never reused.

Bracketing every allocation with headers/trailers can result in significant internal fragmentation, as shown in \VRef[Figure]{f:ObjectHeaders}.
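The heap-pool scanning described above can be sketched in C. All names, the fixed pool size, and the use of a spin flag as the heap lock are assumptions of this sketch, not any particular allocator's design: a thread first retries its associated heap, then scans for any unlocked heap with a trylock, changing its association on success.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NHEAPS 4                                 // illustrative fixed pool size

typedef struct PoolHeap {
	atomic_flag lock;                            // set = heap in use by some thread
} PoolHeap;

static PoolHeap pool[NHEAPS];
static _Thread_local PoolHeap * assoc = NULL;    // this thread's associated heap

void poolInit( void ) {
	for ( int i = 0; i < NHEAPS; i += 1 ) atomic_flag_clear( &pool[i].lock );
}
static int tryLock( PoolHeap * h ) { return ! atomic_flag_test_and_set( &h->lock ); }
void unlockHeap( PoolHeap * h ) { atomic_flag_clear( &h->lock ); }

PoolHeap * acquireHeap( void ) {
	if ( assoc != NULL && tryLock( assoc ) ) return assoc;  // uncontended case
	for ( int i = 0; i < NHEAPS; i += 1 ) {                 // scan for an unlocked heap
		if ( tryLock( &pool[i] ) ) { assoc = &pool[i]; return assoc; }
	}
	return NULL;  // all locked: a real allocator might create and add a new heap here
}
```

Under contention the scan spreads threads across heaps, which is exactly the trade-off discussed above: fewer lock waits at the cost of more heaps and hence more external fragmentation.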
Especially if the headers contain redundant management information, then storing that information is a waste of storage, \eg object size may be the same for many objects because programs only allocate a small set of object sizes.
As well, it can result in poor cache usage, since only a portion of the cache line is holding useful information from the program's perspective.
Spatial locality can also be negatively affected leading to poor cache locality~\cite{Feng05}:

With local free-lists in containers, as in \VRef[Figure]{f:LocalFreeListWithinContainers}, the container is simply removed from one heap's free list and placed on the new heap's free list.
Thus, when using local free-lists, the operation of moving containers is reduced from $O(N)$ to $O(1)$.
However, there is the additional storage cost in the header, which increases the header size, and therefore internal fragmentation.
\begin{figure}