Changeset fb6691a
- Timestamp: May 20, 2022, 2:48:24 PM (2 years ago)
- Branches: ADT, ast-experimental, master, pthread-emulation, qualifiedEnum
- Children: 598dc68
- Parents: 25fa20a
- Location: doc/theses/mubeen_zulfiqar_MMath
- Files: 6 edited
doc/theses/mubeen_zulfiqar_MMath/allocator.tex
--- r25fa20a
+++ rfb6691a
@@ -109,5 +109,5 @@
 Need to prevent preemption during a dynamic memory operation because of the \newterm{serially-reusable problem}.
 \begin{quote}
-A sequence of code that is guaranteed to run to completion before being invoked to accept another input is called serially-reusable code.~\cite{SeriallyReusable}
+A sequence of code that is guaranteed to run to completion before being invoked to accept another input is called serially-reusable code.~\cite{SeriallyReusable}\label{p:SeriallyReusable}
 \end{quote}
 If a KT is preempted during an allocation operation, the operating system can schedule another KT on the same CPU, which can begin an allocation operation before the previous operation associated with this CPU has completed, invalidating heap correctness.
@@ -138,5 +138,5 @@
 (See \VRef[Figure]{f:THSharedHeaps} but with a heap bucket per KT and no bucket or local-pool lock.)
 Hence, immediately after a KT starts, its heap is created and just before a KT terminates, its heap is (logically) deleted.
-\PAB{Heaps are uncontended for a KTs memory operations as every KT has its own thread-local heap which is not shared with any other KT (modulo operations on the global pool and ownership).}
+Heaps are uncontended for a KTs memory operations as every KT has its own thread-local heap, modulo operations on the global pool and ownership.

 Problems:
@@ -269,5 +269,5 @@

 Each heap uses segregated free-buckets that have free objects distributed across 91 different sizes from 16 to 4M.
-\PAB{All objects in a bucket are of the same size.}
+All objects in a bucket are of the same size.
 The number of buckets used is determined dynamically depending on the crossover point from @sbrk@ to @mmap@ allocation using @mallopt( M_MMAP_THRESHOLD )@, \ie small objects managed by the program and large objects managed by the operating system.
 Each free bucket of a specific size has the following two lists:
@@ -401,5 +401,5 @@
 \end{center}
 The storage between @E@ and @H@ is chained onto the appropriate free list for future allocations.
-\PAB{The same approach is used for sufficiently large free blocks}, where @E@ is the start of the free block, and any unused storage before @H@ or after the allocated object becomes free storage.
+The same approach is used for sufficiently large free blocks, where @E@ is the start of the free block, and any unused storage before @H@ or after the allocated object becomes free storage.
 In this approach, the aligned address @A@ is the same as the allocated storage address @P@, \ie @P@ $=$ @A@ for all allocation routines, which simplifies deallocation.
 However, if there are a large number of aligned requests, this approach leads to memory fragmentation from the small free areas around the aligned object.
@@ -488,5 +488,5 @@
 \end{description}
 The second field remembers the request size versus the allocation (bucket) size, \eg request 42 bytes which is rounded up to 64 bytes.
-\PAB{Since programmers think in request sizes rather than allocation sizes, the request size allows better generation of statistics or errors and also helps in memory management.}
+Since programmers think in request sizes rather than allocation sizes, the request size allows better generation of statistics or errors and also helps in memory management.

 \begin{figure}
@@ -497,5 +497,5 @@
 \end{figure}

-\PAB{The low-order 3-bits of the first field are \emph{unused} for any stored values as these values are 16-byte aligned by default, whereas the second field may use all of its bits.}
+The low-order 3-bits of the first field are \emph{unused} for any stored values as these values are 16-byte aligned by default, whereas the second field may use all of its bits.
 The 3 unused bits are used to represent mapped allocation, zero filled, and alignment, respectively.
 Note, the alignment bit is not used in the normal header and the zero-filled/mapped bits are not used in the fake header.
@@ -515,5 +515,5 @@
 To locate all statistic counters, heaps are linked together in statistics mode, and this list is locked and traversed to sum all counters across heaps.
 Note, the list is locked to prevent errors traversing an active list;
-\PAB{the statistics counters are not locked and can flicker during accumulation.}
+the statistics counters are not locked and can flicker during accumulation.
 \VRef[Figure]{f:StatiticsOutput} shows an example of statistics output, which covers all allocation operations and information about deallocating storage not owned by a thread.
 No other memory allocator studied provides as comprehensive statistical information.
@@ -558,18 +558,17 @@
 \label{s:UserlevelThreadingSupport}

-The serially-reusable problem (see \VRef{s:AllocationFastpath}) occurs for kernel threads in the ``T:H model, H = number of CPUs'' model and for user threads in the ``1:1'' model, where llheap uses the ``1:1'' model.
-\PAB{The solution is to prevent interrupts that can result in CPU or KT change during operations that are logically critical sections such as moving free storage from public heap to the private heap.}
+The serially-reusable problem (see \VPageref{p:SeriallyReusable}) occurs for kernel threads in the ``T:H model, H = number of CPUs'' model and for user threads in the ``1:1'' model, where llheap uses the ``1:1'' model.
+The solution is to prevent interrupts that can result in a CPU or KT change during operations that are logically critical sections such as starting a memory operation on one KT and completing it on another.
 Locking these critical sections negates any attempt for a quick fastpath and results in high contention.
 For user-level threading, the serially-reusable problem appears with time slicing for preemptable scheduling, as the signal handler context switches to another user-level thread.
 Without time slicing, a user thread performing a long computation can prevent the execution of (starve) other threads.
-\PAB{To prevent starvation for a memory-allocation-intensive thread, \ie the time slice always triggers in an allocation critical-section for one thread so the thread never gets time sliced, a thread-local \newterm{rollforward} flag is set in the signal handler when it aborts a time slice.}
+To prevent starvation for a memory-allocation-intensive thread, \ie the time slice always triggers in an allocation critical-section for one thread so the thread never gets time sliced, a thread-local \newterm{rollforward} flag is set in the signal handler when it aborts a time slice.
 The rollforward flag is tested at the end of each allocation funnel routine (see \VPageref{p:FunnelRoutine}), and if set, it is reset and a volunteer yield (context switch) is performed to allow other threads to execute.

 llheap uses two techniques to detect when execution is in an allocation operation or routine called from allocation operation, to abort any time slice during this period.
-On the slowpath when executing expensive operations, like @sbrk@ or @mmap@,
-\PAB{interrupts are disabled/enabled by setting kernel-thread-local flags so the signal handler aborts immediately.}
-\PAB{On the fastpath, disabling/enabling interrupts is too expensive as accessing kernel-thread-local storage can be expensive and not user-thread-safe.}
+On the slowpath when executing expensive operations, like @sbrk@ or @mmap@, interrupts are disabled/enabled by setting kernel-thread-local flags so the signal handler aborts immediately.
+On the fastpath, disabling/enabling interrupts is too expensive as accessing kernel-thread-local storage can be expensive and not user-thread-safe.
 For example, the ARM processor stores the thread-local pointer in a coprocessor register that cannot perform atomic base-displacement addressing.
-\PAB{Hence, there is a window between loading the kernel-thread-local pointer from the coprocessor register into a normal register and adding the displacement when a time slice can move a thread.}
+Hence, there is a window between loading the kernel-thread-local pointer from the coprocessor register into a normal register and adding the displacement when a time slice can move a thread.

 The fast technique (with lower run time cost) is to define a special code section and places all non-interruptible routines in this section.
@@ -589,5 +588,5 @@
 Programs can be statically or dynamically linked.
 \item
-\PAB{The order in which the linker schedules startup code is poorly supported so cannot be controlled entirely.}
+The order in which the linker schedules startup code is poorly supported so it cannot be controlled entirely.
 \item
 Knowing a KT's start and end independently from the KT code is difficult.
@@ -607,6 +606,6 @@
 The problem is getting initialization done before the first allocator call.
 However, there does not seem to be mechanism to tell either the static or dynamic loader to first perform initialization code before any calls to a loaded library.
-\PAB{Also, initialization code of other libraries and run-time envoronment may call memory allocation routines such as \lstinline{malloc}.
-So, this creates an even more difficult situation as there is no mechanism to tell either the static or dynamic loader to first perform initialization code of memory allocator before any other initialization that may involve a dynamic memory allocation call.}
+Also, initialization code of other libraries and the run-time environment may call memory allocation routines such as \lstinline{malloc}.
+This compounds the situation as there is no mechanism to tell either the static or dynamic loader to first perform the initialization code of the memory allocator before any other initialization that may involve a dynamic memory allocation call.
 As a result, calls to allocation routines occur without initialization.
 To deal with this problem, it is necessary to put a conditional initialization check along the allocation fastpath to trigger initialization (singleton pattern).
@@ -740,5 +739,5 @@
 \paragraph{\lstinline{void * aalloc( size_t dim, size_t elemSize )}}
 extends @calloc@ for allocating a dynamic array of objects without calculating the total size of array explicitly but \emph{without} zero-filling the memory.
-@aalloc@ is significantly faster than @calloc@, \PAB{which is the only alternative given by the memory allocation routines}.
+@aalloc@ is significantly faster than @calloc@, which is the only alternative given by the standard memory-allocation routines.

 \noindent\textbf{Usage}
@@ -935,5 +934,5 @@
 \paragraph{\lstinline{T * alloc( ... )} or \lstinline{T * alloc( size_t dim, ... )}}
 is overloaded with a variable number of specific allocation operations, or an integer dimension parameter followed by a variable number of specific allocation operations.
-\PAB{These allocation operations can be passed as positional arguments when calling \lstinline{alloc} routine.}
+These allocation operations can be passed as named arguments when calling the \lstinline{alloc} routine.
 A call without parameters returns a dynamically allocated object of type @T@ (@malloc@).
 A call with only the dimension (dim) parameter returns a dynamically allocated array of objects of type @T@ (@aalloc@).
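The rollforward mechanism rewritten in the hunks above (new lines 558-574) is easier to follow in code. Below is a minimal C sketch of the technique, assuming hypothetical names (try_time_slice, funnel_malloc, bucket_pop, in_alloc, rollforward) that are not llheap's actual identifiers, and with a kernel-level sched_yield standing in for the user-level yield the thesis describes.

{{{#!c
#include <stdbool.h>
#include <stddef.h>
#include <sched.h>                              // sched_yield

// Illustrative thread-local flags; llheap's real identifiers differ.
static __thread volatile bool in_alloc = false;     // inside allocation critical-section
static __thread volatile bool rollforward = false;  // time slice deferred during allocation

// Called from the time-slice signal handler: if this thread is inside an
// allocation operation, abort the context switch and record the deferral.
bool try_time_slice( void ) {
	if ( in_alloc ) { rollforward = true; return false; }
	return true;                                // safe to context switch
}

// Stand-in fastpath: thread-local bump-pointer arena, 16-byte aligned,
// with no exhaustion check -- sketch only, not llheap's bucket lookup.
static void * bucket_pop( size_t size ) {
	static __thread char arena[64 * 1024];
	static __thread size_t next = 0;
	void * addr = &arena[next];
	next += ( size + 15 ) & ~(size_t)15;
	return addr;
}

// Allocation funnel routine: every allocation entry point passes through here.
void * funnel_malloc( size_t size ) {
	in_alloc = true;                            // enter serially-reusable code
	void * addr = bucket_pop( size );
	in_alloc = false;
	if ( rollforward ) {                        // time slice deferred meanwhile?
		rollforward = false;
		sched_yield();                          // volunteer the yield now
	}
	return addr;
}
}}}

Testing the flag once at the end of the funnel routine keeps the fastpath cheap while still guaranteeing the deferred context switch eventually happens, which is the starvation argument the hunk makes.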
doc/theses/mubeen_zulfiqar_MMath/background.tex
--- r25fa20a
+++ rfb6691a
@@ -37,5 +37,5 @@
 The \newterm{storage data} is composed of allocated and freed objects, and \newterm{reserved memory}.
 Allocated objects (light grey) are variable sized, and are allocated and maintained by the program;
-\PAB{\ie only the memory allocator knows the location of allocated storage, not the program.}
+\ie only the memory allocator knows the location of allocated storage, not the program.
 \begin{figure}[h]
 \centering
@@ -49,5 +49,5 @@
 if there are multiple reserved blocks, they are also chained together, usually internally.

-\PAB{In some allocator designs, allocated and freed objects have additional management data embedded within them.}
+In some allocator designs, allocated and freed objects have additional management data embedded within them.
 \VRef[Figure]{f:AllocatedObject} shows an allocated object with a header, trailer, and alignment padding and spacing around the object.
 The header contains information about the object, \eg size, type, etc.
@@ -104,5 +104,5 @@
 \VRef[Figure]{f:MemoryFragmentation} shows an example of how a small block of memory fragments as objects are allocated and deallocated over time.
 Blocks of free memory become smaller and non-contiguous making them less useful in serving allocation requests.
-\PAB{Memory is highly fragmented when most free blocks are unusable because of their sizes.}
+Memory is highly fragmented when most free blocks are unusable because of their sizes.
 For example, \VRef[Figure]{f:Contiguous} and \VRef[Figure]{f:HighlyFragmented} have the same quantity of external fragmentation, but \VRef[Figure]{f:HighlyFragmented} is highly fragmented.
 If there is a request to allocate a large object, \VRef[Figure]{f:Contiguous} is more likely to be able to satisfy it with existing free memory, while \VRef[Figure]{f:HighlyFragmented} likely has to request more memory from the operating system.
@@ -328,5 +328,5 @@
 For example, multiple heaps are managed in a pool, starting with a single or a fixed number of heaps that increase\-/decrease depending on contention\-/space issues.
 At creation, a thread is associated with a heap from the pool.
-\PAB{In some implementations of this model, when the thread attempts an allocation and its associated heap is locked (contention), it scans for an unlocked heap in the pool.}
+In some implementations of this model, when the thread attempts an allocation and its associated heap is locked (contention), it scans for an unlocked heap in the pool.
 If an unlocked heap is found, the thread changes its association and uses that heap.
 If all heaps are locked, the thread may create a new heap, use it, and then place the new heap into the pool;
@@ -361,5 +361,5 @@
 Multiple heaps increase external fragmentation as the ratio of heaps to threads increases, which can lead to heap blowup.
 The external fragmentation experienced by a program with a single heap is now multiplied by the number of heaps, since each heap manages its own free storage and allocates its own reserved memory.
-\PAB{Additionally, objects freed by one heap cannot be reused by other threads without increasing the cost of the memory operations, except indirectly by returning free memory to the operating system, which can be expensive.}
+Additionally, objects freed by one heap cannot be reused by other threads without increasing the cost of the memory operations, except indirectly by returning free memory to the operating system, which can be expensive.
 Depending on how the operating system provides dynamic storage to an application, returning storage may be difficult or impossible, \eg the contiguous @sbrk@ area in Unix.
 In the worst case, a program in which objects are allocated from one heap but deallocated to another heap means these freed objects are never reused.
@@ -485,5 +485,5 @@

 Bracketing every allocation with headers/trailers can result in significant internal fragmentation, as shown in \VRef[Figure]{f:ObjectHeaders}.
-Especially if the headers contain redundant management information \PAB{then storing that information is a waste of storage}, \eg object size may be the same for many objects because programs only allocate a small set of object sizes.
+Especially if the headers contain redundant management information, then storing that information is a waste of storage, \eg object size may be the same for many objects because programs only allocate a small set of object sizes.
 As well, it can result in poor cache usage, since only a portion of the cache line is holding useful information from the program's perspective.
 Spatial locality can also be negatively affected leading to poor cache locality~\cite{Feng05}:
@@ -660,5 +660,5 @@
 With local free-lists in containers, as in \VRef[Figure]{f:LocalFreeListWithinContainers}, the container is simply removed from one heap's free list and placed on the new heap's free list.
 Thus, when using local free-lists, the operation of moving containers is reduced from $O(N)$ to $O(1)$.
-\PAB{The cost that we have to pay for it is to add information to a header, which increases the header size, and therefore internal fragmentation.}
+However, there is the additional storage cost in the header, which increases the header size, and therefore internal fragmentation.

 \begin{figure}
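The header/trailer cost discussed in the hunks at lines 485-489 and 660-664 can be made concrete. The struct below is a hypothetical per-object header, not a layout taken from the thesis or llheap; it only shows why bracketing every allocation with management data inflates small objects.

{{{#!c
#include <stdint.h>

// Hypothetical per-object header; field names and widths are illustrative.
typedef struct header {
	uint32_t size;        // allocation (bucket) size
	uint32_t flags;       // management bits, e.g. zero-filled / mapped
} header_t;               // 8 bytes of management data per object

// The allocator returns the address just past the header ...
static inline void * object_of( header_t * hdr ) { return hdr + 1; }
// ... and recovers the header from the program's pointer on deallocation.
static inline header_t * header_of( void * obj ) { return (header_t *)obj - 1; }
}}}

For an 8-byte object this header alone doubles the footprint, which is the internal-fragmentation and cache-usage cost the first hunk describes; moving such fields into a per-container header, as in the second hunk, amortizes the cost across all objects in the container at the price of a larger container header.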
doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex
--- r25fa20a
+++ rfb6691a
@@ -48,5 +48,5 @@
 There is no interaction among threads, \ie no object sharing.
 Each thread repeatedly allocates 100,000 \emph{8-byte} objects then deallocates them in the order they were allocated.
-\PAB{Execution time of the benchmark evaluates its efficiency.}
+The execution time of the benchmark evaluates its efficiency.


@@ -75,5 +75,5 @@
 \label{s:ChurnBenchmark}

-The churn benchmark measures the runtime speed of an allocator in a multi-threaded scenerio, where each thread extensively allocates and frees dynamic memory.
+The churn benchmark measures the runtime speed of an allocator in a multi-threaded scenario, where each thread extensively allocates and frees dynamic memory.
 Only @malloc@ and @free@ are used to eliminate any extra cost, such as @memcpy@ in @calloc@ or @realloc@.
 Churn simulates a memory intensive program and can be tuned to create different scenarios.
@@ -133,5 +133,5 @@
 When threads share a cache line, frequent reads/writes to their cache-line object causes cache misses, which cause escalating delays as cache distance increases.

-Cache thrash tries to create a scenerio that leads to false sharing, if the underlying memory allocator is allocating dynamic memory to multiple threads on the same cache lines.
+Cache thrash tries to create a scenario that leads to false sharing, if the underlying memory allocator is allocating dynamic memory to multiple threads on the same cache lines.
 Ideally, a memory allocator should distance the dynamic memory region of one thread from another.
 Having multiple threads allocating small objects simultaneously can cause a memory allocator to allocate objects on the same cache line, if its not distancing the memory among different threads.
@@ -201,5 +201,5 @@
 Cache scratch tries to create a scenario that leads to false sharing and should make the memory allocator preserve the program-induced false sharing, if it does not return a freed object to its owner thread and, instead, re-uses it instantly.
 An allocator using object ownership, as described in section \VRef{s:Ownership}, is less susceptible to allocator-induced passive false-sharing.
-\PAB{If the object is returned to the thread that owns it, then the new object that the thread gets is less likely to be on the same cache line.}
+If the object is returned to the thread that owns it, then the new object that the thread gets is less likely to be on the same cache line.

 \VRef[Figure]{fig:benchScratchFig} shows the pseudo code for the cache-scratch micro-benchmark.
@@ -245,5 +245,5 @@

 Similar to benchmark cache thrash in section \VRef{sec:benchThrashSec}, different cache access scenarios can be created using the following command-line arguments.
-\begin{description}[itemsep=0pt,parsep=0pt]
+\begin{description}[topsep=0pt,itemsep=0pt,parsep=0pt]
 \item[threads:]
 number of threads (K).
@@ -259,7 +259,8 @@
 \subsection{Speed Micro-Benchmark}
 \label{s:SpeedMicroBenchmark}
+\vspace*{-4pt}

 The speed benchmark measures the runtime speed of individual and sequences of memory allocation routines:
-\begin{enumerate}[itemsep=0pt,parsep=0pt]
+\begin{enumerate}[topsep=-5pt,itemsep=0pt,parsep=0pt]
 \item malloc
 \item realloc
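The single-threaded loop described in the first hunk (lines 48-52) is simple enough to sketch. This is a plausible reconstruction in C, not the benchmark's actual source: the real harness takes its thread count from the command line and times the run, and the THREADS constant below is made up.

{{{#!c
#include <stdlib.h>
#include <pthread.h>

enum { NOBJS = 100000, OBJ_SIZE = 8, THREADS = 4 };   // THREADS is illustrative

// Per-thread body: allocate 100,000 8-byte objects, then free them in
// allocation order, with no object sharing between threads.
static void * worker( void * arg ) {
	(void)arg;
	void ** objs = malloc( NOBJS * sizeof( void * ) );  // bookkeeping only
	for ( int i = 0; i < NOBJS; i += 1 ) objs[i] = malloc( OBJ_SIZE );
	for ( int i = 0; i < NOBJS; i += 1 ) free( objs[i] );
	free( objs );
	return NULL;
}

int main( void ) {
	pthread_t t[THREADS];
	for ( int i = 0; i < THREADS; i += 1 ) pthread_create( &t[i], NULL, worker, NULL );
	for ( int i = 0; i < THREADS; i += 1 ) pthread_join( t[i], NULL );
}   // wall-clock execution time is the reported metric
}}}

Linking the same binary against each allocator under test (for example via LD_PRELOAD) and timing the run gives the comparison the chapter reports; because objects are freed in allocation order and never shared, the result isolates raw allocate/free cost.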
doc/theses/mubeen_zulfiqar_MMath/figures/Header.fig
--- r25fa20a
+++ rfb6691a
[xfig source for the header-layout figure; the raw coordinate diff is omitted. The revision repositions the flag-bit arrows and their labels (``0/1'', ``zero filled'', ``mapped allocation'') and adds an ``alignment (fake header)'' label for the third unused header bit. Surviving text labels: bucket pointer, mapped size, next free block, union, request size, 4/8-bytes.]
doc/theses/mubeen_zulfiqar_MMath/performance.tex
--- r25fa20a
+++ rfb6691a
@@ -91,5 +91,5 @@

 Each micro-benchmark is configured and run with each of the allocators,
-\PAB{The less time an allocator takes to complete a benchmark the better so lower in the graphs is better, except for the Memory micro-benchmark graphs.}
+The less time an allocator takes to complete a benchmark the better so lower in the graphs is better, except for the Memory micro-benchmark graphs.
 All graphs use log scale on the Y-axis, except for the Memory micro-benchmark (see \VRef{s:MemoryMicroBenchmark}).

doc/theses/mubeen_zulfiqar_MMath/uw-ethesis-frontpgs.tex
--- r25fa20a
+++ rfb6691a
@@ -108,5 +108,5 @@
 % D E C L A R A T I O N   P A G E
 % -------------------------------
-% The following is a sample Delaration Page as provided by the GSO
+% The following is a sample Declaration Page as provided by the GSO
 % December 13th, 2006. It is designed for an electronic thesis.
 \begin{center}\textbf{Author's Declaration}\end{center}
@@ -141,5 +141,5 @@
 The C allocation API is also extended with @resize@, advanced @realloc@, @aalloc@, @amemalign@, and @cmemalign@ so programmers do not make mistakes writing theses useful allocation operations.
 llheap is embedded into the \uC and \CFA runtime systems, both of which have user-level threading.
-\PAB{The ability to use \CFA's advanced type-system (and possibly \CC's too) to have one allocation routine with advanced memory operations as positional arguments shows how far the allocation API can be pushed, which increases safety and greatly simplifies programmer's use of dynamic allocation.}
+The ability to use \CFA's advanced type-system (and possibly \CC's too) to combine advanced memory operations into one allocation routine using named arguments shows how far the allocation API can be pushed, which increases safety and greatly simplifies programmer's use of dynamic allocation.

 The llheap allocator also provides comprehensive statistics for all allocation operations, which are invaluable in understanding and debugging a program's dynamic behaviour.
@@ -162,5 +162,7 @@
 I would like to thank all the people who made this thesis possible.

-I would like to acknowledge Peter A. Buhr for his assistance and support throughtout the process.
+I would like to acknowledge Peter A. Buhr for his assistance and support throughout the process.
 It would have been impossible without him.
+
+I would like to acknowledge Gregor Richards and Trevor Brown for reading my thesis quickly and giving me great feedback on my work.
 Also, I would say thanks to my team members at PLG especially Thierry, Michael, and Andrew for their input.
@@ -195,8 +197,8 @@
 % L I S T   O F   T A B L E S
 % ---------------------------
-\addcontentsline{toc}{chapter}{List of Tables}
-\listoftables
-\cleardoublepage
-\phantomsection   % allows hyperref to link to the correct page
+% \addcontentsline{toc}{chapter}{List of Tables}
+% \listoftables
+% \cleardoublepage
+% \phantomsection   % allows hyperref to link to the correct page

 % Change page numbering back to Arabic numerals