Changeset 45200b67 for doc/theses


Ignore:
Timestamp:
Apr 28, 2022, 12:57:45 PM (2 years ago)
Author:
m3zulfiq <m3zulfiq@…>
Branches:
ADT, ast-experimental, master, pthread-emulation, qualifiedEnum
Children:
73a57af2
Parents:
7b9391a1 (diff), e82a6e4f (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc

Location:
doc/theses/mubeen_zulfiqar_MMath
Files:
2 edited

  • doc/theses/mubeen_zulfiqar_MMath/conclusion.tex

    r7b9391a1 r45200b67  
    3636
    3737Starting a micro-benchmark test-suite for comparing allocators, rather than relying on a suite of arbitrary programs, has been an interesting challenge.
    38 The current micro-benchmark allows some understand of allocator implementation properties without actually looking at the implementation.
     38The current micro-benchmarks allow some understanding of allocator implementation properties without actually looking at the implementation.
    3939For example, the memory micro-benchmark quickly identified how several of the allocators work at the global level.
    4040It was not possible to show how the micro-benchmarks' adjustment knobs were used to tune to an interesting test point.
     
    5252
    5353After llheap is made available on GitHub, interacting with its users to locate problems and improvements will make llheap a more robust memory allocator.
     54As well, the \uC and \CFA projects, which have adopted llheap as their memory allocator, will provide additional feedback.
  • doc/theses/mubeen_zulfiqar_MMath/performance.tex

    r7b9391a1 r45200b67  
    9292Each micro-benchmark is configured and run with each of the allocators.
    9393The less time an allocator takes to complete a benchmark the better, so lower in the graphs is better.
     94All graphs use log scale on the Y-axis, except for the Memory micro-benchmark (see \VRef{s:MemoryMicroBenchmark}).
    9495
    9596%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     
    139140\end{figure}
    140141
     142\paragraph{Assessment}
    141143All allocators did well in this micro-benchmark, except for \textsf{dl} on the ARM.
    142 \textsf{dl}'s performace decreases and the difference with the other allocators starts increases as the number of worker threads increase.
    143 \textsf{je} was the fastest, although there is not much difference between \textsf{je} and rest of the allocators.
    144 
    145 llheap is slightly slower because it uses ownership, where many of the allocations have remote frees, which requires locking.
    146 When llheap is compiled without ownership, its performance is the same as the other allocators (not shown).
     144\textsf{dl} is the slowest, indicating a small bottleneck relative to the other allocators.
     145\textsf{je} is the fastest, with only a small benefit over the other allocators.
     146% llheap is slightly slower because it uses ownership, where many of the allocations have remote frees, which requires locking.
     147% When llheap is compiled without ownership, its performance is the same as the other allocators (not shown).
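The commented-out text above refers to llheap's ownership design, where remote frees require locking. As an illustration only (not llheap's actual code, and the names @remote_free_push@ and @drain_remote@ are hypothetical), a remote free under ownership might be sketched as:

```c
#include <pthread.h>
#include <stddef.h>

/* Sketch: with ownership, an object freed by a non-owner thread is pushed
   onto the owner heap's remote-free list, which needs a lock; the owner's
   own frees go to a local list with no locking (the fast path). */
typedef struct node { struct node *next; } node;

typedef struct {
    pthread_mutex_t lock;   /* protects remote_free */
    node *remote_free;      /* frees arriving from other threads */
    node *local_free;       /* owner-only fast path, no lock needed */
} heap;

/* Called by a non-owner thread: locked push onto the owner's remote list. */
void remote_free_push(heap *h, node *obj) {
    pthread_mutex_lock(&h->lock);
    obj->next = h->remote_free;
    h->remote_free = obj;
    pthread_mutex_unlock(&h->lock);
}

/* Called by the owner: drain remote frees back into the local free list. */
void drain_remote(heap *h) {
    pthread_mutex_lock(&h->lock);
    node *r = h->remote_free;
    h->remote_free = NULL;
    pthread_mutex_unlock(&h->lock);
    while (r != NULL) {
        node *next = r->next;
        r->next = h->local_free;
        h->local_free = r;
        r = next;
    }
}
```

The locked push on every remote free is the extra cost the ownership discussion refers to; draining in batches amortizes the owner's locking.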
    147148
    148149%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     
    182183\end{figure}
    183184
     185\paragraph{Assessment}
    184186All allocators did well in this micro-benchmark, except for \textsf{dl} and \textsf{pt3}.
    185 \textsf{dl} uses a single heap for all threads so it is understable that it is generating so much active false-sharing.
    186 Requests from different threads will be dealt with sequientially by a single heap using locks which can allocate objects to different threads on the same cache line.
    187 \textsf{pt3} uses multiple heaps but it is not exactly per-thread heap.
    188 So, it is possible that multiple threads using one heap can get objects allocated on the same cache line which might be causing active false-sharing.
    189 Rest of the memory allocators generate little or no active false-sharing.
     187\textsf{dl} uses a single heap for all threads so it is understandable that it generates so much active false-sharing.
     188Requests from different threads are dealt with sequentially by the single heap (using a single lock), which can allocate objects to different threads on the same cache line.
     189\textsf{pt3} uses the T:H model, so multiple threads can use one heap, but the active false-sharing is less than \textsf{dl}.
     190The rest of the memory allocators generate little or no active false-sharing.
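To make the mechanism concrete: active false-sharing occurs when the allocator itself places objects for different threads on the same cache line. A minimal helper for detecting this condition (assuming a 64-byte cache line, which is typical but architecture dependent) is:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed cache-line size; 64 bytes on most current x64 and ARM cores. */
#define CACHE_LINE 64

/* Two addresses induce false-sharing when writes through them hit the
   same cache line, i.e., they fall in the same CACHE_LINE-sized window. */
bool on_same_cache_line(const void *a, const void *b) {
    return ((uintptr_t)a / CACHE_LINE) == ((uintptr_t)b / CACHE_LINE);
}
```

With a single sequential heap like \textsf{dl}, back-to-back small requests from different threads often return adjacent addresses, so this predicate is frequently true for objects handed to different threads.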
    190191
    191192%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     
    224225\end{figure}
    225226
    226 This micro-benchmark divided the allocators in 2 groups.
    227 First is the group of best performers \textsf{llh}, \textsf{je}, and \textsf{rp}.
    228 These memory alloctors generate little or no passive false-sharing and their performance difference is negligible.
    229 Second is the group of the low performers which includes rest of the memory allocators.
    230 These memory allocators seem to preserve program-induced passive false-sharing.
    231 \textsf{hrd}'s performance keeps getting worst as the number of threads increase.
    232 
    233 Interestingly, allocators such as \textsf{hrd} and \textsf{glc} were among the best performers in micro-benchmark cache thrash as described in section \ref{sec:cache-thrash-perf}.
    234 But, these allocators were among the low performers in this micro-benchmark.
    235 It tells us that these allocators do not actively produce false-sharing but they may preserve program-induced passive false sharing.
     227\paragraph{Assessment}
     228This micro-benchmark divides the allocators into two groups.
     229First is the high-performer group: \textsf{llh}, \textsf{je}, and \textsf{rp}.
     230These memory allocators generate little or no passive false-sharing and their performance difference is negligible.
     231Second is the low-performer group, which includes the rest of the memory allocators.
     232These memory allocators have significant program-induced passive false-sharing, with \textsf{hrd} being the worst-performing allocator.
     233All of the allocators in this group share heaps among threads at some level.
     234
     235Interestingly, allocators such as \textsf{hrd} and \textsf{glc} performed well in the cache-thrash micro-benchmark (see \VRef{sec:cache-thrash-perf}).
     236However, these allocators are among the low performers in cache scratch.
     237This suggests these allocators do not actively produce false-sharing but do preserve program-induced passive false-sharing.
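The passive case can be sketched with a toy model (not a real allocator): one cache line backs two objects; thread 0 keeps one half live, allocates the other half, and passes it to thread 1; without ownership, thread 1's free keeps the object on thread 1's own list, so thread 1's next allocation reuses memory that shares a line with thread 0's live object. The @model_*@ names below are illustrative:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64
#define NTHREADS 2

/* One aligned cache line backing two 32-byte objects. */
static _Alignas(CACHE_LINE) char line_storage[CACHE_LINE];

/* Per-thread free lists with NO ownership: a freed object stays with
   the freeing thread rather than returning to its originating heap. */
static void *free_list[NTHREADS];

void model_free(int thread, void *p) { free_list[thread] = p; }

/* Allocation reuses the calling thread's free list first. */
void *model_alloc(int thread) {
    void *p = free_list[thread];
    free_list[thread] = NULL;
    return p;
}

/* Thread 0 keeps the first half of the line live... */
void *thread0_live_object(void) { return &line_storage[0]; }
/* ...and the second half is allocated by thread 0, then passed to thread 1. */
void *object_passed_to_thread1(void) { return &line_storage[CACHE_LINE / 2]; }
```

After thread 1 frees and re-allocates the passed object, threads 0 and 1 write to the same cache line: the allocator did not create the sharing, but it preserved it.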
    236238
    237239%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     
    288290\end{itemize}
    289291
    290 All allocators did well in this micro-benchmark across all allocation chains, except for \textsf{dl} and \textsf{pt3}.
    291 \textsf{dl} performed the lowest overall and its performce kept getting worse with increasing number of threads.
    292 \textsf{dl} uses a single heap with a global lock that can become a bottleneck.
    293 Multiple threads doing memory allocation in parallel can create contention on \textsf{dl}'s single heap.
    294 \textsf{pt3} which is a modification of \textsf{dl} for multi-threaded applications does not use per-thread heaps and may also have similar bottlenecks.
    295 
    296 There's a sudden increase in program completion time of chains that include \textsf{calloc} and all allocators perform relatively slower in these chains including \textsf{calloc}.
    297 \textsf{calloc} uses \textsf{memset} to set the allocated memory to zero.
    298 \textsf{memset} is a slow routine which takes a long time compared to the actual memory allocation.
    299 So, a major part of the time is taken for \textsf{memset} in performance of chains that include \textsf{calloc}.
    300 But the relative difference among the different memory allocators running the same chain of memory allocation operations still gives us an idea of theor relative performance.
     292\paragraph{Assessment}
     293This micro-benchmark divides the allocators into two groups: with and without @calloc@.
     294@calloc@ uses @memset@ to set the allocated memory to zero, which dominates the cost of the allocation chain (a large increase in completion time) and levels performance across the allocators.
     295But the difference among the allocators in a @calloc@ chain still gives an idea of their relative performance.
     296
     297All allocators did well in this micro-benchmark across all allocation chains, except for \textsf{dl}, \textsf{pt3}, and \textsf{hrd}.
     298Again, the low-performing allocators share heaps among threads, so contention causes completion time to increase with the number of threads.
     299Furthermore, chains with @free@ can trigger coalescing, which slows the fast path.
     300The high-performing allocators all show low latency across the allocation chains, \ie there are no performance spikes as the chain lengthens that might be caused by contention and/or coalescing.
     301Low latency is important for applications that are sensitive to unknown execution delays.
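The @calloc@ cost described above can be sketched as follows; this is a simplification, since real allocators can skip the @memset@ for freshly @mmap@'d pages, which the operating system already zero-fills:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of why calloc chains are slower: calloc is essentially malloc
   plus a memset over the whole allocation, and the memset typically
   costs more than the allocation itself. */
void *calloc_sketch(size_t n, size_t size) {
    if (size != 0 && n > SIZE_MAX / size) return NULL;  /* overflow guard */
    size_t bytes = n * size;
    void *p = malloc(bytes);
    if (p != NULL) memset(p, 0, bytes);                 /* dominant cost */
    return p;
}
```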
    301302
    302303%speed-3-malloc.eps
     
    414415\newpage
    415416\subsection{Memory Micro-Benchmark}
     417\label{s:MemoryMicroBenchmark}
    416418
    417419This experiment is run with the following two configurations for each allocator.
     
    522524The Y-axis shows the memory usage in bytes.
    523525
    524 For the experiment, at a certain time in the program's life, the difference between the memory requested by the benchmark (\textit{current\_req\_mem(B)}) and the memory that the process has received from system (\textit{heap}, \textit{mmap}) should be minimum.
     526For this experiment, the difference between the memory requested by the benchmark (\textit{current\_req\_mem(B)}) and the memory that the process has received from the system (\textit{heap}, \textit{mmap}) should be minimal.
    525527This difference is the memory overhead caused by the allocator and shows the level of fragmentation in the allocator.
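The overhead measure described above amounts to a simple computation, sketched here with hypothetical parameter names:

```c
#include <stddef.h>

/* Allocator overhead at one point in the program's life: memory obtained
   from the system (sbrk/heap area plus mmap'd space) minus the memory the
   benchmark has actually requested.  The closer to zero, the less
   fragmentation the allocator introduces. */
size_t allocator_overhead(size_t heap_bytes, size_t mmap_bytes,
                          size_t requested_bytes) {
    size_t obtained = heap_bytes + mmap_bytes;
    return obtained > requested_bytes ? obtained - requested_bytes : 0;
}
```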
    526528
     529\paragraph{Assessment}
    527530First, the differences in the shape of the curves between architectures (top ARM, bottom x64) are small; the differences are in the amount of memory used.
    528531Hence, it is possible to focus on either the top or bottom graph.
    529 The heap curve is remains zero for 4 memory allocators: \textsf{hrd}, \textsf{je}, \textsf{pt3}, and \textsf{rp}.
    530 These memory allocators are not using the sbrk area, instead they only use mmap to get memory from the system.
    531 
    532 \textsf{hrd}, and \textsf{tbb} have higher memory footprint than the others as they use more total dynamic memory.
    533 One reason for that can be the usage of superblocks as both of these memory allocators create superblocks where each block contains objects of the same size.
     532
     533Second, the heap curve is 0 for four memory allocators: \textsf{hrd}, \textsf{je}, \textsf{pt3}, and \textsf{rp}, indicating these memory allocators only use @mmap@ to get memory from the system and ignore the @sbrk@ area.
     534
     535The total dynamic memory is higher for \textsf{hrd} and \textsf{tbb} than the other allocators.
     536The main reason is the use of superblocks (see \VRef{s:ObjectContainers}) containing objects of the same size.
    534537These superblocks are maintained throughout the life of the program.
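A superblock can be sketched as a fixed-size chunk carved into equal-size slots handed out by bump-pointer; the structure below is illustrative, not the actual \textsf{hrd} or \textsf{tbb} design:

```c
#include <stddef.h>

#define SUPERBLOCK_SIZE 4096   /* assumed chunk size, e.g. one page */

/* A superblock holds only objects of one size; because such blocks are
   kept for the life of the program, partially used blocks raise the
   allocator's memory footprint. */
typedef struct {
    char storage[SUPERBLOCK_SIZE];
    size_t object_size;   /* every object in this block has this size */
    size_t next;          /* bump offset of the next free slot */
} superblock;

void superblock_init(superblock *sb, size_t object_size) {
    sb->object_size = object_size;
    sb->next = 0;
}

void *superblock_alloc(superblock *sb) {
    if (sb->next + sb->object_size > SUPERBLOCK_SIZE) return NULL; /* full */
    void *p = &sb->storage[sb->next];
    sb->next += sb->object_size;
    return p;
}
```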
    535538
    536 \textsf{pt3} is the only memory allocator for which the total dynamic memory goes down in the second half of the program lifetime when the memory is freed by the benchmark program.
    537 It makes pt3 the only memory allocator that gives memory back to operating system as it is freed by the program.
     539\textsf{pt3} is the only memory allocator where the total dynamic memory goes down in the second half of the program lifetime, when the memory is freed by the benchmark program.
     540This makes \textsf{pt3} the only memory allocator that returns memory to the operating system as it is freed by the program.
    538541
    539542% FOR 1 THREAD