\chapter{Macro-Benchmarks}\label{macrobench}
The previous chapter demonstrated that the scheduler achieves its performance goals in small and controlled scenarios.
The next step is to demonstrate that this remains true in more realistic and complete scenarios.
This chapter presents two flavours of webserver that demonstrate \CFA performs competitively with production environments.

Webservers were chosen because they are fairly simple applications that are still useful as standalone products.
Furthermore, webservers are generally amenable to parallelisation since their workloads are mostly homogeneous.
They therefore offer a stringent performance benchmark for \CFA.
Indeed, existing solutions are likely to have close-to-optimal performance, while the homogeneity of the workloads means the additional fairness is not needed.
This means there is very little room to absorb the extra cost of fairness.
As such, these experiments should highlight the cost of fairness in realistic scenarios.

\section{Memcached}
Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}.
This server also has the notable benefit that there exists a full-featured front-end for performance testing called @mutilate@~\cite{GITHUB:mutilate}.
Experimenting on memcached allows for a simple test of the \CFA runtime as a whole: it exercises the scheduler, the idle-sleep mechanism, and the \io subsystem for sockets.
Note that this experiment does not exercise the \io subsystem with regards to disk operations.

\subsection{Benchmark Environment}
These experiments are run on a cluster of homogeneous Supermicro SYS-6017R-TDF compute nodes with the following characteristics:
The server runs Ubuntu 20.04.3 LTS on top of Linux Kernel 5.11.0-34.
Each node has 2 Intel(R) Xeon(R) CPU E5-2620 v2 processors running at 2.10GHz.
Each CPU has 6 cores and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd} per node.
Each CPU has 384 KB, 3 MB and 30 MB of L1, L2 and L3 cache respectively.
Each node is connected to the network through a Mellanox 10 Gigabit Ethernet port.
The network route uses a single Mellanox SX1012 10/40 Gigabit Ethernet cluster switch.

\subsection{Memcached with threads per connection}
Comparing against memcached using a user-level runtime only really makes sense if the server actually uses this threading model.
Indeed, evaluating a user-level runtime with 1 \at per \proc is not meaningful since it does not exercise the runtime; it simply adds some overhead to the underlying OS scheduler.

One approach is to use a webserver that uses a thread-per-connection model, where each incoming connection is served by a single \at in a strict 1-to-1 pairing.
This model adds flexibility to the implementation, as the serving logic can now block on user-level primitives without affecting other connections, as sketched below.

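As an illustration, the following is a minimal sketch of this model, assuming a hypothetical @spawn_user_thread@ primitive and a hypothetical @handle_request@ helper; it is not the code of any of the servers compared here.
\begin{lstlisting}
// Minimal sketch of the thread-per-connection model (names hypothetical).
// One user-level thread is spawned per accepted connection; the handler
// can block on user-level primitives without affecting other connections.
#include <sys/socket.h>
#include <unistd.h>

void handle_request(char * buf, ssize_t len, int fd);  // hypothetical
void spawn_user_thread(void (*fn)(int), int arg);      // hypothetical runtime primitive

void handle_connection(int fd) {
	char buf[4096];
	ssize_t len;
	while ((len = recv(fd, buf, sizeof(buf), 0)) > 0) {
		handle_request(buf, len, fd); // may block on user-level locks
	}
	close(fd);
}

void acceptor(int listen_fd) {
	for (;;) {
		int fd = accept(listen_fd, 0, 0);
		if (fd >= 0) spawn_user_thread(handle_connection, fd); // 1-to-1 pairing
	}
}
\end{lstlisting}
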
Memcached is not built according to a thread-per-connection model, but such a port exists, built for libfibre in \cite{DBLP:journals/pomacs/KarstenB20}.
Therefore this version can be compared both to the original version and to a port to the \CFA runtime.

As such, this memcached experiment compares 3 different variations of memcached:
\begin{itemize}
 \item \emph{vanilla}: the official release of memcached, version~1.6.9.
 \item \emph{fibre}: a modification of vanilla using the thread-per-connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}.
 \item \emph{cfa}: a modification of the fibre webserver that replaces the libfibre runtime with \CFA.
\end{itemize}

\subsection{Throughput} \label{memcd:tput}
\begin{figure}
	\centering
	\input{result.memcd.rate.qps.pstex_t}
	\caption[Memcached Benchmark: Throughput]{Memcached Benchmark: Throughput\smallskip\newline Desired vs Actual query rate for 15360 connections. Target QPS is the query rate that the clients are attempting to maintain and Actual QPS is the rate at which the server is able to respond.}
	\label{fig:memcd:rate:qps}
\end{figure}
This experiment is done by having the clients establish 15360 total connections, which persist for the duration of the experiment.
The clients then send queries, attempting to follow a desired query rate, and the server responds as best it can.
Figure~\ref{fig:memcd:rate:qps} shows the difference between the desired rate, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', and the actual rate, ``Actual QPS'', for all three webservers.
As with the experiments in the previous chapter, 15 runs were measured for each rate and the graph shows all datapoints.
The solid line represents the median while the dashed and dotted lines represent the maximum and minimum respectively.
For rates below 500K queries per second, all three webservers easily keep up with the desired rate, resulting in all datapoints being perfectly overlapped.
Beyond this limit, individual runs become visible and all three servers begin to distinguish themselves, with vanilla memcached generally achieving better throughput while \CFA and libfibre fight for second place.
Overall, however, the performance of all three servers is very similar, especially considering that at 500K the server has reached saturation, which is discussed further in the next section.

\subsection{Tail Latency}
\begin{figure}
	\centering
	\input{result.memcd.rate.99th.pstex_t}
	\caption[Memcached Benchmark : 99th Percentile Latency]{Memcached Benchmark : 99th Percentile Latency\smallskip\newline 99th Percentile of the response latency as a function of \emph{desired} query rate for 15360 connections. }
	\label{fig:memcd:rate:tail}
\end{figure}
Another important performance metric is \newterm{tail} latency.
Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate overall performance.
Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same memcached experiment.
Again, each series is made of 15 runs with the median, maximum and minimum highlighted with lines.
As expected, the latency starts low and increases as the server nears saturation, at which point the latency increases dramatically.
Because of this dramatic increase, the Y axis is presented using a log scale.
Note that the figure shows the \emph{target} query rate; the actual response rate is given in Figure~\ref{fig:memcd:rate:qps}, as this is the same underlying experiment.

For all three servers the saturation point is reached before 500K queries per second, which is when throughput starts to diverge among the webservers.
In this experiment, the three webservers are much more distinguishable than in the throughput experiment.
Vanilla achieves the lowest latency mostly across the board, followed by libfibre and \CFA.
However, all three webservers achieve microsecond latencies and the increases in latency mostly follow each other.

\subsection{Update rate}
Since memcached is effectively a simple database, an aspect that can significantly affect performance is writes.
The information cached by memcached can be written to concurrently with other queries.
It could therefore be interesting to see how this update rate affects performance.
\begin{figure}
	\subfloat[][\CFA: Throughput]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.memcd.forall.qps.pstex_t}
		}
		\label{fig:memcd:updt:forall:qps}
	}
	\subfloat[][\CFA: Latency]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.memcd.forall.lat.pstex_t}
		}
		\label{fig:memcd:updt:forall:lat}
	}

	\subfloat[][LibFibre: Throughput]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.memcd.fibre.qps.pstex_t}
		}
		\label{fig:memcd:updt:fibre:qps}
	}
	\subfloat[][LibFibre: Latency]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.memcd.fibre.lat.pstex_t}
		}
		\label{fig:memcd:updt:fibre:lat}
	}

	\subfloat[][Vanilla: Throughput]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.memcd.vanilla.qps.pstex_t}
		}
		\label{fig:memcd:updt:vanilla:qps}
	}
	\subfloat[][Vanilla: Latency]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.memcd.vanilla.lat.pstex_t}
		}
		\label{fig:memcd:updt:vanilla:lat}
	}
	\caption[Throughput and Latency results at different update rates (percentage of writes).]{Throughput and Latency results at different update rates (percentage of writes).\smallskip\newline Description}
	\label{fig:memcd:updt}
\end{figure}
Figure~\ref{fig:memcd:updt} shows the results for the same experiment as the throughput and latency experiments, but with multiple update rates.
Each experiment was repeated with an update percentage of 3\%, 5\%, 10\% and 50\%.
The previous experiments were run with a 3\% update rate.
Overall, this experiment mostly demonstrates that the performance of memcached is affected very little by the update rate.
I believe this is because the underlying locking pattern is actually fairly similar.
Indeed, since values can be much bigger than what the server can read atomically, a lock must be acquired while the value is read.
These results suggest that memcached does not use a readers-writer lock to protect each value and instead relies on having a sufficient number of keys to limit contention.
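If this conjecture holds, reads and writes acquire the same per-bucket lock, so the read/write ratio has little effect on contention. The following sketch illustrates such key-striped locking; it is an assumption about the locking pattern, not memcached's actual code, and @hash@, @find@ and @copy_value@ are hypothetical helpers.
\begin{lstlisting}
// Hypothetical sketch of key-striped locking: readers also take the
// bucket mutex because values are too big to be read atomically.
#include <pthread.h>
#include <stddef.h>
#define NBUCKETS 1024

struct entry; // key, value, next; details omitted
struct bucket {
	pthread_mutex_t lock;
	struct entry * head;
} table[NBUCKETS];

unsigned hash(const char * key);                     // hypothetical
struct entry * find(struct bucket *, const char *);  // hypothetical
void copy_value(struct entry *, char *, size_t);     // hypothetical

void get(const char * key, char * out, size_t max) {
	struct bucket * b = &table[hash(key) % NBUCKETS];
	pthread_mutex_lock(&b->lock);
	struct entry * e = find(b, key);
	if (e) copy_value(e, out, max); // copy while holding the lock
	pthread_mutex_unlock(&b->lock);
}
\end{lstlisting}
With enough keys, concurrent operations mostly land on different buckets, so contention stays low regardless of the update rate.
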
In the end, this shows yet again that \CFA achieves equivalent performance.


\section{Static Web-Server}
The memcached experiment does not exercise two aspects of the \io subsystem: accepting new connections and interacting with disks.
On the other hand, static webservers, servers that offer static webpages, do stress disk \io since they serve files from disk\footnote{Dynamic webservers, which construct pages as they are sent, are not as interesting since the construction of the pages does not exercise the runtime in a meaningfully different way.}.
The static webserver experiments compare NGINX~\cit{nginx} with a custom webserver developed for this experiment.

\subsection{\CFA webserver}
Unlike the memcached experiment, the webserver experiment relies on a custom-designed webserver.
It is a simple thread-per-connection webserver where a fixed number of \ats are created upfront.
Each \at calls @accept@, through @io_uring@, on the listening port and handles the incoming connection once accepted, as sketched below.
Most of the implementation is fairly straightforward; however, the inclusion of file \io introduces a new challenge that had to be hacked around.

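The overall structure of each \at is sketched below; the names are illustrative, and in the actual server the blocking calls are routed through @io_uring@ by the \CFA runtime rather than invoked as plain system calls.
\begin{lstlisting}
// Sketch of the per-thread structure of the webserver. A fixed number
// of these workers are created upfront and loop on accept.
#include <sys/socket.h>
#include <unistd.h>

void serve_connection(int fd); // hypothetical: parse request, send file

void worker(int listen_fd) {
	for (;;) {
		int fd = accept(listen_fd, 0, 0); // blocks this user-level thread only
		if (fd < 0) continue;
		serve_connection(fd);
		close(fd);
	}
}
\end{lstlisting}
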
Normally, webservers use @sendfile@~\cite{MAN:sendfile} to send files over a socket.
@io_uring@ does not support @sendfile@; it supports @splice@~\cite{MAN:splice} instead, which is strictly more powerful.
However, because of how Linux implements file \io, see Subsection~\ref{ononblock}, @io_uring@'s implementation must delegate calls to @splice@ to worker threads inside the kernel.
As of Linux 5.13, @io_uring@ caps the number of these worker threads at @RLIMIT_NPROC@ and therefore, when tens of thousands of splice requests are made, it can create tens of thousands of \glspl{kthrd}.
Such a high number of \glspl{kthrd} is more than Linux can handle in this scenario, so performance suffers significantly.
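For illustration, the @splice@-based path that the server avoids would look roughly as follows with liburing; this is only a sketch, since file-to-socket @splice@ must stage data through a pipe, and error and partial-transfer handling is omitted.
\begin{lstlisting}
// Sketch of the splice path through io_uring (avoided by the server).
// Each chunk needs two linked SQEs: file -> pipe, then pipe -> socket.
#include <liburing.h>

void queue_splice(struct io_uring * ring, int file, int pipe_w,
		int pipe_r, int sock, unsigned nbytes) {
	struct io_uring_sqe * sqe = io_uring_get_sqe(ring);
	io_uring_prep_splice(sqe, file, 0, pipe_w, -1, nbytes, 0);
	sqe->flags |= IOSQE_IO_LINK; // run the second splice after the first
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_splice(sqe, pipe_r, -1, sock, -1, nbytes, 0);
	io_uring_submit(ring);
}
\end{lstlisting}
Each such @splice@ operation is the kind of request that gets delegated to a kernel worker thread, which is precisely the source of the \gls{kthrd} explosion described above.
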
For this reason, the \CFA webserver calls @sendfile@ directly.
This approach works up to a certain point, but once the server approaches saturation, it leads to a new problem.

When the saturation point of the server is reached, latency increases and inevitably some client connections time out.
As these clients close their connections, the server must close its end of these sockets without delay so the OS can reclaim the resources used by the connections.
Indeed, until they are closed on the server end, the connections linger in the CLOSE-WAIT TCP state~\cite{rfc:tcp} and the TCP buffers are preserved.
However, this poses a problem when using blocking @sendfile@ calls.
These calls can block if they do not have sufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.
Since blocking in @sendfile@ blocks the \proc rather than the \at, this prevents other connections from closing their sockets.
This leads to a vicious cycle where timeouts lead to @sendfile@ calls running out of resources, which lead to more timeouts.

Normally, this is addressed by marking the sockets as non-blocking and using @epoll@ to wait for sockets to have sufficient resources.
However, since @io_uring@ respects non-blocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely.
For this reason, the \CFA webserver sets and resets the @O_NONBLOCK@ flag before and after any call to @sendfile@.
Normally @epoll@ would also be used when these calls to @sendfile@ return @EAGAIN@, but since this would not help in the evaluation of the \CFA runtime, the \CFA webserver simply yields and retries in these cases.

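A simplified sketch of this workaround follows; it assumes a @yield@ routine that yields the current \at, and callers must still handle partial sends.
\begin{lstlisting}
// Sketch of the O_NONBLOCK workaround around sendfile (simplified).
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>

void yield(void); // assumed: yields the current user-level thread

ssize_t send_file(int sock, int file, off_t * off, size_t count) {
	int flags = fcntl(sock, F_GETFL);
	fcntl(sock, F_SETFL, flags | O_NONBLOCK); // set non-blocking
	ssize_t ret;
	for (;;) {
		ret = sendfile(sock, file, off, count);
		if (ret >= 0 || errno != EAGAIN) break;
		yield(); // yield this user-level thread, then retry
	}
	fcntl(sock, F_SETFL, flags); // reset to blocking
	return ret;
}
\end{lstlisting}
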
It is important to state that Linux 5.15 introduces the ability for @io_uring@ users to limit the number of worker threads created, through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
However, as of writing this document, Ubuntu does not have a stable release of Linux 5.15.
Versions of the kernel currently under testing exist, but these caused unrelated but nevertheless prohibitive issues in this experiment.
Presumably, the new kernel would remove the need for the hack described above, as it would allow connections in the CLOSE-WAIT state to be closed even while calls to @splice@/@sendfile@ are underway.
However, since this could not be tested, it is purely a conjecture at this point.

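For reference, on a 5.15 kernel the limit can presumably be set through liburing as sketched below; this is untested here for the reasons given above, and the chosen values are arbitrary.
\begin{lstlisting}
// Sketch: capping io_uring worker threads on Linux 5.15+ via liburing.
#include <liburing.h>

void cap_io_workers(struct io_uring * ring) {
	// values[0] caps bounded workers (e.g., file I/O),
	// values[1] caps unbounded workers (e.g., socket I/O).
	unsigned int values[2] = { 64, 64 };
	io_uring_register_iowq_max_workers(ring, values);
}
\end{lstlisting}
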
\subsection{Benchmark Environment}
Unlike the memcached experiment, the webserver runs in a more heterogeneous environment.
The server runs Ubuntu 20.04.4 LTS on top of Linux Kernel 5.13.0-52.
It has an AMD Opteron(tm) Processor 6380 running at 2.50GHz.
Only 8 \glspl{hthrd} of this CPU are enabled by grub, which is sufficient to achieve line rate.
The CPU has 64 KB, 256 KB and 8 MB of L1, L2 and L3 cache respectively.
The kernel is set up to limit the memory to 25 GB.

The client machines each have two 2.8 GHz Xeon CPUs and four one-gigabit Ethernet cards.
Each client machine runs two copies of the workload generator.
They run a 2.6.11-1 SMP Linux kernel, which permits each client load-generator to run on a separate CPU.
Since the clients outnumber the server 8-to-1, this is more than sufficient to generate load without the clients becoming the bottleneck.

\todo{switch}

\subsection{Throughput}
To measure the throughput of both webservers, each server is loaded with over 30,000 files totalling over 4.5 GB.
Each client runs httperf~\cit{httperf}, which establishes a connection, makes an HTTP request for one or more files, closes the connection and repeats the process.
The connections and requests are made according to a Zipfian distribution~\cite{zipf}.
Throughput is measured by aggregating the results from httperf across all clients.
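Under a Zipfian distribution, the probability of a request for the $k$th most popular of the $N$ files is
\[
	p(k) = \frac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}},
\]
for some exponent $s$ (classically $s = 1$), so a small number of popular files receive most of the requests while the long tail of files is still touched occasionally.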
\begin{figure}
	\subfloat[][Throughput]{
		\input{result.swbsrv.25gb.pstex_t}
		\label{fig:swbsrv:ops}
	}

	\subfloat[][Rate of Errors]{
		\input{result.swbsrv.25gb.err.pstex_t}
		\label{fig:swbsrv:err}
	}
	\caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline Throughput vs request rate for short-lived connections.}
	\label{fig:swbsrv}
\end{figure}
Figure~\ref{fig:swbsrv} shows the results comparing \CFA to NGINX in terms of throughput.
These results are fairly straightforward.
Both servers achieve the same throughput until around 57,500 requests per second.
Since the clients are asking for the same files, the fact that the throughput matches exactly is expected as long as both servers are able to serve the desired rate.
Once the saturation point is reached, both servers are still very close.
NGINX achieves slightly better throughput.
However, Figure~\ref{fig:swbsrv:err} shows the rate of errors, a gross approximation of tail latency, where \CFA achieves notably fewer errors once the machine reaches saturation.
This suggests that \CFA is slightly fairer and that NGINX may sacrifice some fairness for improved throughput.
It demonstrates that the \CFA webserver described above is able to match the performance of NGINX up to and beyond the saturation point of the machine.

\subsection{Disk Operations}
The throughput experiment was run using a server with 25 GB of memory, which was sufficient to hold the entire fileset in addition to all the code and data needed to run the webserver and the rest of the machine.
Previous work like \cit{Cite Ashif's stuff} demonstrates that an interesting follow-up experiment is to rerun the same throughput experiment but allowing significantly less memory on the machine.
If the machine is constrained enough, it forces the OS to evict files from the file cache and causes calls to @sendfile@ to read from disk.
However, what these low-memory experiments mostly demonstrate is how the memory footprint of the webserver affects performance.
Since what I aim to evaluate in this thesis is the runtime of \CFA, I decided to forgo experiments on a low-memory server.
The implementation of the webserver itself is simply too impactful for such experiments to be an interesting evaluation of the underlying runtime.