% source: doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex
% Last change on this file was 511a9368, checked in by Thierry Delisle <tdelisle@…>, 8 days ago:
% Filled in eval section for existing results.
% Except update ratio which will be redone.

\chapter{Macro-Benchmarks}\label{macrobench}
The previous chapter demonstrated that the scheduler achieves its performance goal in small and controlled scenarios.
The next step is to demonstrate that this remains true in more realistic and complete scenarios.
This chapter presents two flavours of webservers that demonstrate that \CFA performs competitively with production environments.

Webservers were chosen because they offer fairly simple applications that are still useful as standalone products.
Furthermore, webservers are generally amenable to parallelisation since their workloads are mostly homogeneous.
They therefore offer a stringent performance benchmark for \CFA.
Indeed, existing solutions are likely to have close to optimal performance, while the homogeneity of the workloads means the additional fairness is not needed.

\section{Memcached}
Memcached~\cit{memcached} is an in-memory key-value store that is used in many production environments, \eg \cit{Berk Atikoglu et al., Workload Analysis of a Large-Scale Key-Value Store,
SIGMETRICS 2012}.
This server also has the notable added benefit that there exists a full-featured front-end for performance testing called @mutilate@~\cit{mutilate}.
Experimenting on memcached allows for a simple test of the \CFA runtime as a whole: it exercises the scheduler, the idle-sleep mechanism, as well as the \io subsystem for sockets.
This experiment does not exercise the \io subsystem with regard to disk operations.

\subsection{Benchmark Environment}
These experiments are run on a cluster of homogeneous Supermicro SYS-6017R-TDF compute nodes with the following characteristics.
The server runs Ubuntu 20.04.3 LTS on top of Linux Kernel 5.11.0-34.
Each node has 2 Intel(R) Xeon(R) CPU E5-2620 v2 CPUs running at 2.10GHz.
These CPUs have 6 cores per CPU and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}.
The CPUs each have 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches respectively.
Each node is connected to the network through a Mellanox 10 Gigabit Ethernet port.
The network route uses 1 Mellanox SX1012 10/40 Gigabit Ethernet cluster switch.

\subsection{Memcached with threads per connection}
Comparing against memcached using a user-level runtime only really makes sense if the server actually uses this threading model.
Indeed, evaluating a user-level runtime with 1 \at per \proc is not meaningful since it does not exercise the runtime; it simply adds some overhead to the underlying OS scheduler.

One approach is to use a webserver that uses a thread-per-connection model, where each incoming connection is served by a single \at in a strict 1-to-1 pairing.
This model adds flexibility to the implementation, as the serving logic can now block on user-level primitives without affecting other connections.

Memcached is not built according to a thread-per-connection model, but there exists a port of it that is, which was built for libfibre in \cite{DBLP:journals/pomacs/KarstenB20}.
Therefore, this version can be compared both to the original version and to a port to the \CFA runtime.

As such, this memcached experiment compares 3 different variations of memcached:
\begin{itemize}
 \item \emph{vanilla}: the official release of memcached, version~1.6.9.
 \item \emph{fibre}: a modification of vanilla which uses the thread-per-connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}.
 \item \emph{cfa}: a modification of the fibre webserver that replaces the libfibre runtime with \CFA.
\end{itemize}

\subsection{Throughput} \label{memcd:tput}
\begin{figure}
        \centering
        \input{result.memcd.rate.qps.pstex_t}
        \caption[Memcached Benchmark: Throughput]{Memcached Benchmark: Throughput\smallskip\newline Desired vs Actual request rate for 15360 connections. Target QPS is the request rate that the clients are attempting to maintain and Actual QPS is the rate at which the server is able to respond.}
        \label{fig:memcd:rate:qps}
\end{figure}
Figure~\ref{fig:memcd:rate:qps} shows the result for the throughput of all three webservers.
This experiment is done by having the clients establish 15360 total connections, which persist for the duration of the experiment.
The clients then send requests, attempting to follow a desired request rate.
The servers respond to the desired rate as best they can, and the figure shows the difference between the desired rate, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', and the actual rate, ``Actual QPS''.
The results show that \CFA achieves equivalent throughput even when the server starts to reach saturation.
Only then does it start to fall behind slightly.
This is a demonstration of the \CFA runtime achieving its performance goal.

\subsection{Tail Latency}
\begin{figure}
        \centering
        \input{result.memcd.rate.99th.pstex_t}
        \caption[Memcached Benchmark : 99th Percentile Latency]{Memcached Benchmark : 99th Percentile Latency\smallskip\newline 99th Percentile of the response latency as a function of \emph{desired} request rate for 15360 connections.}
        \label{fig:memcd:rate:tail}
\end{figure}
Another important performance metric is \newterm{tail} latency.
Since many web applications rely on a combination of different requests made in parallel, the latency of the slowest response, \ie tail latency, can dictate overall performance.
Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same memcached experiment.
As is expected, the latency starts low and increases as the server gets close to saturation, at which point the latency increases dramatically.
Note that the figure shows the \emph{target} request rate; the actual response rate is given in Figure~\ref{fig:memcd:rate:qps}, as this is the same underlying experiment.
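
As an illustration of the metric, the following C sketch computes a percentile from a set of latency samples using the nearest-rank method.
The function name @percentile@ and the choice of the nearest-rank definition are assumptions made for this example; the method @mutilate@ actually uses to report the 99th percentile may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Three-way comparison of two doubles, for use with qsort. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: nearest-rank percentile of n latency samples.
 * pct is a whole percentage in [1, 100]; the array is sorted in place. */
double percentile(double *samples, size_t n, unsigned pct) {
    assert(n > 0 && pct >= 1 && pct <= 100);
    qsort(samples, n, sizeof samples[0], cmp_double);
    /* rank = ceil(pct / 100 * n), computed with integer arithmetic */
    size_t rank = (pct * n + 99) / 100;
    return samples[rank - 1];
}
```

Under this definition, with 100 samples the 99th percentile is the 99th smallest sample, so the single slowest response is excluded while the second slowest is still reported.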

\subsection{Update rate}
\begin{figure}
        \centering
        \input{result.memcd.updt.qps.pstex_t}
        \caption[Churn Benchmark : Throughput on Intel]{Churn Benchmark : Throughput on Intel\smallskip\newline Description}
        \label{fig:memcd:updt:qps}
\end{figure}

\begin{figure}
        \centering
        \input{result.memcd.updt.lat.pstex_t}
        \caption[Churn Benchmark : Throughput on Intel]{Churn Benchmark : Throughput on Intel\smallskip\newline Description}
        \label{fig:memcd:updt:lat}
\end{figure}



\section{Static Web-Server}
The memcached experiment does not exercise two aspects of the \io subsystem: accepting new connections and interacting with disks.
On the other hand, static webservers, servers that offer static webpages, do stress disk \io since they serve files from disk\footnote{Dynamic webservers, which construct pages as they are sent, are not as interesting since the construction of the pages does not exercise the runtime in a meaningfully different way.}.
The static webserver experiments compare NGINX with a custom webserver developed for this experiment.

\subsection{\CFA webserver}
Unlike the memcached experiment, the webserver experiment relies on a custom-designed webserver.
It is a simple thread-per-connection webserver where a fixed number of \ats are created upfront.
Each \at calls @accept@, through @io_uring@, on the listening port and handles the incoming connection once accepted.
Most of the implementation is fairly straightforward; however, the inclusion of file \io introduces a new challenge that had to be hacked around.

Normally, webservers use @sendfile@~\cit{sendfile} to send files over the socket.
@io_uring@ does not support @sendfile@; it supports @splice@~\cit{splice} instead, which is strictly more powerful.
However, because of how Linux implements file \io, see Subsection~\ref{ononblock}, @io_uring@'s implementation must delegate calls to @splice@ to worker threads inside the kernel.
As of Linux 5.13, @io_uring@ caps the number of these worker threads to @RLIMIT_NPROC@ and therefore, when tens of thousands of splice requests are made, it can create tens of thousands of \glspl{kthrd}.
Such a high number of \glspl{kthrd} is more than Linux can handle in this scenario, so performance suffers significantly.
For this reason, the \CFA webserver calls @sendfile@ directly.
This approach works up to a certain point, but once the server approaches saturation, it leads to a new problem.

When the saturation point of the server is attained, latency increases and inevitably some client connections time out.
As these clients close their connections, the server must close these sockets without delay so the OS can reclaim the resources used by these connections.
Indeed, until they are closed on the server end, the connections linger in the CLOSE-WAIT TCP state~\cit{RFC793} and the TCP buffers are preserved.
However, this poses a problem when using blocking @sendfile@ calls.
The calls can block if they do not have sufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.
Since blocking in calls to @sendfile@ blocks the \proc rather than the \at, this prevents other connections from closing their sockets.
This leads to a vicious cycle where timeouts lead to @sendfile@ calls running out of resources, which leads to more timeouts.

Normally, this is addressed by marking the sockets as non-blocking and using @epoll@ to wait for sockets to have sufficient resources.
However, since @io_uring@ respects non-blocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely.
For this reason, the \CFA webserver sets and resets the @O_NONBLOCK@ flag before and after any calls to @sendfile@.
Normally, @epoll@ would also be used when these calls to @sendfile@ return @EAGAIN@, but since this would not help in the evaluation of the \CFA runtime, the \CFA webserver simply yields and retries in these cases.

It is important to state that in Linux 5.15 @io_uring@ introduces the ability for users to limit the number of worker threads that are created, through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
However, as of writing this document, Ubuntu does not have a stable release of Linux 5.15.
There exist versions of the kernel that are currently under testing, but these caused unrelated but nevertheless prohibitive issues in this experiment.
Presumably, the new kernel would remove the need for the hack described above, as it would allow connections in the CLOSE-WAIT state to be closed even while the calls to @splice@/@sendfile@ are underway.
However, since this could not be tested, this is purely conjecture at this point.

\subsection{Benchmark Environment}
Unlike the memcached experiment, the webserver experiment is run on a more heterogeneous environment.
The server runs Ubuntu 20.04.4 LTS on top of Linux Kernel 5.13.0-52.
It has an AMD Opteron(tm) Processor 6380 running at 2.50GHz.
Only 8 \glspl{hthrd} are enabled by grub, which is sufficient to achieve line rate.
Each CPU has 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches respectively.
The kernel is set up to limit the memory to 25 GB.

The client machines each have two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards.
Each client machine runs two copies of the workload generator.
They run a 2.6.11-1 SMP Linux kernel, which permits each client load-generator to run on a separate CPU.
Since the clients outnumber the server 8-to-1, this is more than sufficient to generate enough load that the clients do not become the bottleneck.

\todo{switch}

\subsection{Throughput}
\begin{figure}
        \subfloat[][Throughput]{
                \input{result.swbsrv.25gb.pstex_t}
                \label{fig:swbsrv:ops}
        }

        \subfloat[][Rate of Errors]{
                \input{result.swbsrv.25gb.err.pstex_t}
                \label{fig:swbsrv:err}
        }
        \caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline Throughput vs request rate for short-lived connections.}
        \label{fig:swbsrv}
\end{figure}
Figure~\ref{fig:swbsrv} shows the results comparing \CFA to NGINX in terms of throughput.
It demonstrates that the \CFA webserver described above is able to match the performance of NGINX up to and beyond the saturation point of the machine.
Furthermore, Figure~\ref{fig:swbsrv:err} shows the rate of errors, a gross approximation of tail latency, where \CFA achieves notably fewer errors once the machine reaches saturation.