Changeset 999faf1 for doc/theses/thierry_delisle_PhD/thesis
- Timestamp:
- Jul 29, 2022, 2:18:52 PM (2 years ago)
- Branches:
- ADT, ast-experimental, master, pthread-emulation
- Children:
- 9d2609fa
- Parents:
- 8f09242
- Location:
- doc/theses/thierry_delisle_PhD/thesis/text
- Files:
- 2 edited
doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex
r8f09242 r999faf1

\chapter{Macro-Benchmarks}\label{macrobench}
The previous chapter demonstrated that the scheduler achieves its performance goal in small and controlled scenarios.
The next step is to demonstrate that this stays true in more realistic and complete scenarios.
This chapter presents two flavours of webserver that demonstrate \CFA performs competitively with production environments.

Webservers were chosen because they offer fairly simple applications that are still useful as standalone products.
Furthermore, webservers are generally amenable to parallelization since their workloads are mostly homogeneous.
They therefore offer a stringent performance benchmark for \CFA.
Indeed, existing solutions are likely to have close to optimal performance, while the homogeneity of the workloads means the additional fairness is not needed.

\section{Memcached}
Memcached~\cit{memcached} is an in-memory key-value store that is used in many production environments, \eg \cit{Berk Atikoglu et al., Workload Analysis of a Large-Scale Key-Value Store, SIGMETRICS 2012}.
This server also has the notable added benefit that there exists a full-featured front-end for performance testing, called @mutilate@~\cit{mutilate}.
Experimenting on memcached allows for a simple test of the \CFA runtime as a whole: it exercises the scheduler, the idle-sleep mechanism, as well as the \io subsystem for sockets.
This experiment does not exercise the \io subsystem with regard to disk operations.
The experiments compare three different variations of memcached:
\begin{itemize}
\item \emph{vanilla}: the official release of memcached, version~1.6.9.
\item \emph{fibre}: a modification of vanilla that uses the thread-per-connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}.
\item \emph{cfa}: a modification of the fibre variation that replaces the libfibre runtime with \CFA.
\end{itemize}

\subsection{Benchmark Environment}
…
The network route uses one Mellanox SX1012 10/40 Gigabit Ethernet cluster switch.

\subsection{Throughput}
\begin{figure}
	\centering
	\input{result.memcd.rate.qps.pstex_t}
	\caption[Memcached Benchmark: Throughput]{Memcached Benchmark: Throughput\smallskip\newline Desired vs.\ actual request rate for 15360 connections. Target QPS is the request rate the clients attempt to maintain and Actual QPS is the rate at which the server is able to respond.}
	\label{fig:memcd:rate:qps}
\end{figure}
Figure~\ref{fig:memcd:rate:qps} shows the throughput results for all three memcached variations.
This experiment is done by having the clients establish 15360 total connections, which persist for the duration of the experiment.
The clients then send requests, attempting to follow a desired request rate.
The servers respond to this rate as best they can; the gap between the desired rate, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', and the actual rate, ``Actual QPS'', measures how far the server falls behind.
The results show that \CFA achieves equivalent throughput even when the server starts to reach saturation.
Only then does it start to fall behind slightly.
This is a demonstration of the \CFA runtime achieving its performance goal.
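To make the distinction between target and actual rate concrete, the following sketch shows a minimal rate-limited load generator of the kind this methodology assumes; it is only an illustration, not @mutilate@'s implementation, and the @send_request@ helper is a hypothetical stand-in for issuing one request and reading its reply.
\begin{lstlisting}[language=C]
#include <stdint.h>
#include <stdio.h>
#include <time.h>

// Hypothetical stand-in: issue one request and wait for its reply.
static void send_request( void ) { /* get/set over the socket */ }

static uint64_t now_ns( void ) {
	struct timespec ts;
	clock_gettime( CLOCK_MONOTONIC, &ts );
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

// Rate-limited generation: requests are scheduled at a fixed target rate.
// Once the server saturates, the achieved rate ("Actual QPS") falls below
// the target because replies no longer keep up with the schedule.
static uint64_t run_rate_limited( uint64_t target_qps, uint64_t duration_ns ) {
	const uint64_t period = 1000000000ull / target_qps; // gap between requests
	uint64_t start = now_ns(), next = start, sent = 0;
	while ( now_ns() - start < duration_ns ) {
		if ( now_ns() >= next ) {   // time for the next scheduled request
			send_request();
			sent += 1;
			next += period;          // schedule against the plan, not the reply
		}
	}
	return sent;                    // divide by elapsed seconds for Actual QPS
}

int main( void ) {
	uint64_t sent = run_rate_limited( 10000, 1000000000ull ); // 10k QPS for 1 second
	printf( "sent %llu requests\n", (unsigned long long)sent );
	return 0;
}
\end{lstlisting}
In the actual experiments, @mutilate@ drives this load over the 15360 persistent connections; the sketch only illustrates why the actual rate falls below the target once the server saturates.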
\subsection{Tail Latency}
\begin{figure}
	\centering
	\input{result.memcd.rate.99th.pstex_t}
	\caption[Memcached Benchmark: 99th Percentile Latency]{Memcached Benchmark: 99th Percentile Latency\smallskip\newline 99th percentile of the response latency as a function of the \emph{desired} request rate for 15360 connections.}
	\label{fig:memcd:rate:tail}
\end{figure}
Another important performance metric is \newterm{tail} latency.
Since many web applications rely on a combination of different requests made in parallel, the latency of the slowest response, \ie the tail latency, can dictate overall performance.
Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same memcached experiment.
As expected, the latency starts low and increases as the server gets close to saturation, at which point it increases dramatically.
Note that the figure shows the \emph{target} request rate; the actual response rate is given in Figure~\ref{fig:memcd:rate:qps}, as this is the same underlying experiment.

\subsection{Update Rate}
\begin{figure}
	\centering
…
\end{figure}


\section{Static Web-Server}
There are two aspects of the \io subsystem the memcached experiment does not exercise: accepting new connections and interacting with disks.
On the other hand, static webservers, \ie servers that offer static webpages, do stress disk \io since they serve files from disk\footnote{Dynamic webservers, which construct pages as they are sent, are not as interesting since the construction of the pages does not exercise the runtime in a meaningfully different way.}.
The static webserver experiments compare NGINX with a custom webserver developed for this experiment.

\subsection{Benchmark Environment}
Unlike the memcached experiment, the webserver runs in a more heterogeneous environment.
The server runs Ubuntu 20.04.4 LTS on top of Linux Kernel 5.13.0-52.
It has an AMD Opteron(tm) Processor 6380 running at 2.50GHz.
These CPUs have only 8 \glspl{hthrd} enabled via GRUB, which is sufficient to achieve line rate.
These CPUs each have 64 KiB, 256 KiB and 8 MB of L1, L2 and L3 cache respectively.

The client machines each have two 2.8 GHz Xeon CPUs and four one-gigabit Ethernet cards.
Each client machine runs two copies of the workload generator.
They run a 2.6.11-1 SMP Linux kernel, which permits each client load-generator to run on a separate CPU.
Since the clients outnumber the server 8-to-1, this is more than sufficient to ensure the clients do not become the bottleneck.

\todo{switch}

\subsection{Throughput}
\begin{figure}
	\centering
	\input{result.swbsrv.25gb.pstex_t}
	\caption[Static Webserver Benchmark: Throughput]{Static Webserver Benchmark: Throughput\smallskip\newline }
	\label{fig:swbsrv}
\end{figure}

Networked ZIPF

Nginx: 5Gb still good, 4Gb starts to suffer

Cforall: 10Gb too high, 4Gb too low
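As background for the static-file workload above, the following is a minimal sketch of a conventional thread-per-connection static-file server; it is neither NGINX nor the custom webserver used in these experiments, it omits HTTP parsing, and the hard-coded path stands in for the requested file.
\begin{lstlisting}[language=C]
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

// One thread per accepted connection: open the requested file and push it
// to the socket with sendfile, which copies data directly from the page cache.
static void * serve( void * arg ) {
	int conn = (int)(intptr_t)arg;
	const char * path = "index.html";      // stand-in for the parsed request path
	int fd = open( path, O_RDONLY );
	if ( fd >= 0 ) {
		struct stat st;
		fstat( fd, &st );
		off_t off = 0;
		while ( off < st.st_size ) {
			ssize_t n = sendfile( conn, fd, &off, st.st_size - off );
			if ( n <= 0 ) break;            // client gone or transient error
		}
		close( fd );
	}
	close( conn );
	return NULL;
}

int main( void ) {
	int srv = socket( AF_INET, SOCK_STREAM, 0 );
	struct sockaddr_in addr = { 0 };
	addr.sin_family = AF_INET;
	addr.sin_port = htons( 8080 );
	addr.sin_addr.s_addr = htonl( INADDR_ANY );
	bind( srv, (struct sockaddr *)&addr, sizeof(addr) );
	listen( srv, 128 );
	for ( ;; ) {
		int conn = accept( srv, NULL, NULL );
		if ( conn < 0 ) continue;
		pthread_t t;
		pthread_create( &t, NULL, serve, (void *)(intptr_t)conn );
		pthread_detach( t );
	}
}
\end{lstlisting}
Serving files this way makes both the accept path and the disk \io path explicit, which are exactly the two aspects the memcached experiment leaves unexercised.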
doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex
r8f09242 r999faf1

The most basic evaluation of any ready queue is to measure the latency needed to push and pop one element from it.
Since these two operations also describe a @yield@ operation, many systems use @yield@ as the most basic benchmark.
However, yielding can be treated as a special case by optimizing it away since the number of ready \glspl{at} does not change.
Not all systems perform this optimization, but those that do have an artificial performance benefit because the yield becomes a \emph{nop}.
For this reason, I chose a different first benchmark, called the \newterm{Cycle Benchmark}.
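Before the results, a minimal sketch of the ring structure may help; the assumption here (the full benchmark description is elided from this excerpt) is that the Cycle benchmark arranges \ats in rings of 5, where each \at blocks until its predecessor readies it and then readies its successor. The sketch uses kernel threads and semaphores as stand-ins for \ats and park/unpark, so it only illustrates the shape of the benchmark, not the code actually measured.
\begin{lstlisting}[language=C]
#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>
#include <stdio.h>

// Sketch of one cycle: a ring of tasks where each task blocks until its
// predecessor readies it, then readies its successor.  Kernel threads and
// semaphores stand in for user-level tasks and park/unpark.
enum { RING = 5, ROUNDS = 10000 };

static sem_t slot[RING];
static long  handoffs[RING];

static void * member( void * arg ) {
	int id = (int)(intptr_t)arg, next = (id + 1) % RING;
	for ( int i = 0; i < ROUNDS; i += 1 ) {
		sem_wait( &slot[id] );     // "park" until the previous member runs
		handoffs[id] += 1;
		sem_post( &slot[next] );   // "unpark" the next member of the ring
	}
	return NULL;
}

int main( void ) {
	pthread_t t[RING];
	for ( int i = 0; i < RING; i += 1 ) sem_init( &slot[i], 0, 0 );
	for ( int i = 0; i < RING; i += 1 )
		pthread_create( &t[i], NULL, member, (void *)(intptr_t)i );
	sem_post( &slot[0] );          // inject the token that travels around the ring
	long total = 0;
	for ( int i = 0; i < RING; i += 1 ) {
		pthread_join( t[i], NULL );
		total += handoffs[i];
	}
	printf( "total handoffs: %ld\n", total );
	return 0;
}
\end{lstlisting}
The benchmark then scales the number of such rings per \proc, 100 or 1 in the figures that follow.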
…

\subsection{Results}
\begin{figure}
	\subfloat[][Throughput, 100 cycles per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.cycle.jax.ops.pstex_t}
		}
		\label{fig:cycle:jax:ops}
	}
	\subfloat[][Throughput, 1 cycle per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.cycle.low.jax.ops.pstex_t}
		}
		\label{fig:cycle:jax:low:ops}
	}

	\subfloat[][Scalability, 100 cycles per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.cycle.jax.ns.pstex_t}
		}
		\label{fig:cycle:jax:ns}
	}
	\subfloat[][Scalability, 1 cycle per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.cycle.low.jax.ns.pstex_t}
		}
		\label{fig:cycle:jax:low:ns}
	}
	\caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput and Scalability as a function of \proc count, with 5 \ats per cycle and different numbers of cycles per \proc. For Throughput, higher is better; for Scalability, lower is better.}
	\label{fig:cycle:jax}
\end{figure}

\begin{figure}
	\subfloat[][Throughput, 100 cycles per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.cycle.nasus.ops.pstex_t}
		}
		\label{fig:cycle:nasus:ops}
	}
	\subfloat[][Throughput, 1 cycle per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.cycle.low.nasus.ops.pstex_t}
		}
		\label{fig:cycle:nasus:low:ops}
	}

	\subfloat[][Scalability, 100 cycles per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.cycle.nasus.ns.pstex_t}
		}
	}
	\subfloat[][Scalability, 1 cycle per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.cycle.low.nasus.ns.pstex_t}
		}
		\label{fig:cycle:nasus:low:ns}
	}
	\caption[Cycle Benchmark on AMD]{Cycle Benchmark on AMD\smallskip\newline Throughput and Scalability as a function of \proc count, with 5 \ats per cycle and different numbers of cycles per \proc. For Throughput, higher is better; for Scalability, lower is better.}
	\label{fig:cycle:nasus}
\end{figure}

Figures~\ref{fig:cycle:jax} and~\ref{fig:cycle:nasus} show the throughput as a function of \proc count on Intel and AMD respectively, where each cycle has 5 \ats.
The graphs show traditional throughput on the top row and \newterm{scalability} on the bottom row.
The scalability plots use the same data, but the Y axis shows the time spent per operation per \proc, \ie the number of \procs divided by the throughput.
In this representation, perfect scalability appears as a horizontal line, \eg if doubling the number of \procs doubles the throughput, the line stays flat.
The left column shows results for 100 cycles per \proc, enough cycles to always keep every \proc busy.
The right column shows results for only 1 cycle per \proc, where the ready queues are expected to be near empty at all times.
The distinction is meaningful because the idle-sleep subsystem is expected to matter only in the right column, where spurious effects can cause a \proc to run out of work temporarily.
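Stated as a formula (the symbols $T$ and $P$ are introduced here only for clarity, they do not appear in the surrounding text), the scalability metric plotted on the bottom row is
\[
	\mathrm{scalability}(P) = \frac{P}{T(P)}
\]
where $T(P)$ is the measured throughput, in operations per second, with $P$ \procs.
A flat curve therefore means each additional \proc contributes the same amount of useful work, while an upward slope means the per-operation cost grows with the \proc count.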
The performance goal of \CFA is to obtain equivalent performance to other, less fair, schedulers, and that is what the results show.
Figures~\ref{fig:cycle:jax:ops} and~\ref{fig:cycle:jax:ns} show very good throughput and scalability for all runtimes.
The experimental setup prioritizes running on two \glspl{hthrd} per core before running on multiple sockets.
The effect of that setup is seen from 25 to 48 \procs, which run on 24 cores with 2 \glspl{hthrd} per core.
This effect is repeated from 73 to 96 \procs, where it happens on the second CPU.
When running only a single cycle, most runtimes achieve lower throughput because of the idle-sleep mechanism.
In Figure~\ref{fig:cycle:jax:ops} and \ref{fig:cycle:jax:ns}

Figure~\ref{fig:cycle:nasus} shows effectively the same story happening on AMD as on Intel.
The performance bumps due to cache topology happen at different locations and there is a little more variability.
However, in all cases \CFA is still competitive with the other runtimes.


\section{Yield}
…

\subsection{Results}
\begin{figure}
	\subfloat[][Throughput, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.yield.jax.ops.pstex_t}
		}
		\label{fig:yield:jax:ops}
	}
	\subfloat[][Throughput, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.yield.low.jax.ops.pstex_t}
		}
		\label{fig:yield:jax:low:ops}
	}

	\subfloat[][Scalability, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.yield.jax.ns.pstex_t}
		}
		\label{fig:yield:jax:ns}
	}
	\subfloat[][Scalability, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.yield.low.jax.ns.pstex_t}
		}
		\label{fig:yield:jax:low:ns}
	}
	\caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput and Scalability as a function of \proc count, using 100 \ats and 1 \at per \proc. For Throughput, higher is better; for Scalability, lower is better.}
	\label{fig:yield:jax}
\end{figure}

\begin{figure}
	\subfloat[][Throughput, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.yield.nasus.ops.pstex_t}
		}
		\label{fig:yield:nasus:ops}
	}
	\subfloat[][Throughput, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.yield.low.nasus.ops.pstex_t}
		}
		\label{fig:yield:nasus:low:ops}
	}

	\subfloat[][Scalability, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.yield.nasus.ns.pstex_t}
		}
	}
	\subfloat[][Scalability, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.yield.low.nasus.ns.pstex_t}
		}
		\label{fig:yield:nasus:low:ns}
	}
	\caption[Yield Benchmark on AMD]{Yield Benchmark on AMD\smallskip\newline Throughput and Scalability as a function of \proc count, using 100 \ats and 1 \at per \proc. For Throughput, higher is better; for Scalability, lower is better.}
	\label{fig:yield:nasus}
\end{figure}

Figure~\ref{fig:yield:jax} shows the throughput as a function of \proc count, where each run uses 100 \ats per \proc.
It is fairly obvious why I claim this benchmark is more artificial: the throughput is dominated by the mechanism used to handle the @yield@.
\CFA does not have special handling for @yield@ and achieves very similar performance to the cycle benchmark.
Libfibre uses the fact that @yield@ does not change the number of ready fibres and bypasses the idle-sleep mechanism entirely, producing significantly better throughput.
Go puts yielding goroutines on a secondary global ready-queue, giving them lower priority.
The result is that multiple \glspl{hthrd} contend for the global queue and performance suffers drastically.
Based on the scalability, Tokio obtains the same poor performance, and therefore it likely handles @yield@ in a similar fashion.
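For reference, the measured kernel of this benchmark is essentially a tight loop of yields per \at; the stand-in sketch below uses kernel threads and @sched_yield@ instead of \ats and the runtime's @yield@, so its absolute numbers are meaningless, but it shows what ``one operation'' means in the figures above.
\begin{lstlisting}[language=C]
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>

// Stand-in sketch of the yield benchmark kernel: every task repeatedly
// yields and counts how many yields it completed.  sched_yield on kernel
// threads replaces the user-level yield actually being measured.
enum { TASKS = 4, YIELDS = 100000 };

static long completed[TASKS];

static void * worker( void * arg ) {
	int id = (int)(intptr_t)arg;
	for ( long i = 0; i < YIELDS; i += 1 ) {
		sched_yield();            // hand the processor back to the scheduler
		completed[id] += 1;
	}
	return NULL;
}

int main( void ) {
	pthread_t t[TASKS];
	for ( int i = 0; i < TASKS; i += 1 )
		pthread_create( &t[i], NULL, worker, (void *)(intptr_t)i );
	long total = 0;
	for ( int i = 0; i < TASKS; i += 1 ) {
		pthread_join( t[i], NULL );
		total += completed[i];
	}
	printf( "total yields: %ld\n", total ); // divided by elapsed time gives throughput
	return 0;
}
\end{lstlisting}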
When the number of \ats is reduced to 1 per \proc, the cost of idle sleep also comes into play in a significant way.
If anything causes a \at migration, where two \ats end up on the same ready queue, work stealing starts occurring and causes every \at to shuffle around.
In the process, several \procs can go to sleep transiently if they fail to find where the \ats were shuffled to.
In \CFA, spurious bursts of latency can trick a \proc into helping, triggering this effect.
However, since user-level threading with an equal number of \ats and \procs is a somewhat degenerate case, especially when context switching very often, this result is not particularly meaningful and is only included for completeness.

Again, Figure~\ref{fig:yield:nasus} shows effectively the same story happening on AMD as on Intel.
\CFA fares slightly better with many \ats per \proc, but the performance is satisfactory on both architectures.

Since \CFA obtains the same satisfactory performance as in the previous benchmark, this is still a success, albeit a less meaningful one.


\section{Churn}