source: doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

Last change on this file was ddcaff6, checked in by Thierry Delisle <tdelisle@…>, 18 months ago
Last corrections to my thesis... hopefully

1	\chapter{Macro-Benchmarks}\label{macrobench}
2	The previous chapter demonstrated that the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios.
3	The next step is to demonstrate that this performance holds in more realistic and complete scenarios.
4	Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate that \CFA performs competitively compared to web servers used in production environments.
5
6	Web servers are chosen because they offer fairly simple applications that perform complex I/O, both network and disk, and are useful as standalone products.
7	Furthermore, web servers are generally amenable to parallelization since their workloads are mostly homogeneous.
8	Therefore, web servers offer a stringent performance benchmark for \CFA.
9	Indeed, existing web servers have close to optimal performance, while the homogeneity of the workload means fairness may not be a problem.
10	As such, these experiments should highlight the overhead due to any \CFA fairness cost in realistic scenarios.
11
12	The most obvious performance metric for web servers is throughput.
13	This metric generally measures the speed at which the server answers and, relatedly, how fast clients can send requests before the server can no longer keep up.
14	Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service?
15	Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception.
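To make this effect concrete, consider a sketch under a simplifying assumption that is not part of the experiments in this chapter: a page issues $n$ queries in parallel, and each query independently exceeds some latency threshold with probability $p$.
The page is slow whenever at least one of its queries is slow:
\begin{equation*}
P(\mbox{page is slow}) = 1 - (1 - p)^n .
\end{equation*}
For example, if the threshold is the 99th percentile latency, so $p = 0.01$, and $n = 100$, then $1 - 0.99^{100} \approx 0.63$, \ie a latency seen by only 1\% of individual queries affects roughly 63\% of pages.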
16
17	\section{Memcached}
18	Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}.
19	The Memcached server is so popular there exists a full-featured front-end for performance testing, called @mutilate@~\cite{GITHUB:mutilate}.
20	Experimenting on Memcached allows for a simple test of the \CFA runtime as a whole, exercising the scheduler, the idle-sleep mechanism, as well as the \io subsystem for sockets.
21	Note that this experiment does not exercise the \io subsystem with regard to disk operations because Memcached is an in-memory server.
22
23	\subsection{Benchmark Environment}
24	The Memcached experiments are run on a cluster of homogeneous Supermicro SYS-6017R-TDF compute nodes with the following characteristics.
25	\begin{itemize}
26	\item
27	The server runs Ubuntu 20.04.3 LTS on top of Linux Kernel 5.11.0-34.
28	\item
29	Each node has 2 Intel(R) Xeon(R) CPU E5-2620 v2 running at 2.10GHz.
30	\item
31	Each CPU has 6 cores and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}.
32	\item
33	The machine is configured to run each server on 12 dedicated \glspl{hthrd} and uses 6 of the remaining \glspl{hthrd} for software interrupt handling~\cite{wiki:softirq}, resulting in a maximum CPU utilization of 75\% (18 / 24 \glspl{hthrd}).
34	\item
35	A CPU has 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches, respectively.
36	\item
37	The compute nodes are connected to the network through a Mellanox 10 Gigabit Ethernet port.
38	\item
39	Network routing is performed by a Mellanox SX1012 10/40 Gigabit Ethernet switch.
40	\end{itemize}
41
42	\subsection{Memcached threading}\label{memcd:thrd}
43	Memcached can be built to use multiple threads in addition to its @libevent@ subsystem to handle requests.
44	When enabled, the threading implementation operates as follows~\cite[\S~16.2.2.8]{MemcachedThreading}:
45	\begin{itemize}
46	\item
47	Threading is handled by wrapping functions within the code to provide basic protection from updating the same global structures at the same time.
48	\item
49	Each thread uses its own instance of the @libevent@ to help improve performance.
50	\item
51	TCP/IP connections are handled with a single thread listening on the TCP/IP socket.
52	Each connection is then distributed to one of the active threads on a simple round-robin basis.
53	Each connection then operates solely within this thread while the connection remains open.
54	\item
55	For UDP connections, all the threads listen to a single UDP socket for incoming requests.
56	Threads that are currently dealing with another request ignore the incoming packet.
57	One of the remaining, non-busy, threads reads the request and sends the response.
58	This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request.
59	\end{itemize}
60	Here, Memcached is based on an event-based web server architecture~\cite{Pai99Flash}, using \gls{kthrd}ing to run multiple largely independent event engines, and if needed, spinning up additional kernel threads to handle blocking I/O.
61	Alternative web server architectures are:
62	\begin{itemize}
63	\item
64	pipeline~\cite{Welsh01}, where the event engine is subdivided into multiple stages connected with asynchronous buffers, and the final stage has multiple threads to handle blocking I/O.
65	\item
66	thread-per-connection~\cite{apache,Behren03}, where each incoming connection is served by a single \at in a strict 1-to-1 pairing, using the thread stack to hold the event state and folding the event engine implicitly into the threading runtime with its nonblocking I/O mechanism.
67	\end{itemize}
68	Both pipelining and thread-per-connection add flexibility to the implementation, as the serving logic can now block without halting the event engine~\cite{Harji12}.
69
70	However, \gls{kthrd}ing in Memcached is not amenable to this work, which is based on \gls{uthrding}.
71	While it is feasible to layer one user thread per kernel thread, it is not meaningful as it fails to exercise the user runtime;
72	it simply adds extra scheduling overhead over the kernel threading.
73	Hence, there is no direct way to compare Memcached using a kernel-level runtime with a user-level runtime.
74
75	Fortunately, there exists a recent port of Memcached to \gls{uthrding} based on the libfibre~\cite{DBLP:journals/pomacs/KarstenB20} \gls{uthrding} library.
76	This port did all of the heavy lifting, making it straightforward to replace the libfibre user-threading with the \gls{uthrding} in \CFA.
77	It is now possible to compare the original kernel-threading Memcached with both user-threading runtimes in libfibre and \CFA.
78
79	As such, this Memcached experiment compares 3 different variations of Memcached:
80	\begin{itemize}
81	\item \emph{vanilla}: the official release of Memcached, version~1.6.9.
82	\item \emph{fibre}: a modification of vanilla using the thread-per-connection model on top of the libfibre runtime.
83	\item \emph{cfa}: a modification of the fibre web server that replaces the libfibre runtime with \CFA.
84	\end{itemize}
85
86	\subsection{Throughput} \label{memcd:tput}
87	This experiment is done by having the clients establish 15,360 total connections, which persist for the duration of the experiment.
88	The clients then send read and write queries with 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible.
89	Figure~\ref{fig:memcd:rate:qps} shows, for all three web servers, the actual rate achieved, ``Actual QPS'', at different client rates, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd''.
90
91	Like the experimental setup in Chapter~\ref{microbench}, each experiment is run 15 times, and for each client rate, the measured web server rate is plotted.
92	The solid line represents the median while the dashed and dotted lines represent the maximum and minimum respectively.
93	For rates below 500K queries per second, all three web servers match the client rate.
94	Beyond 500K, the web servers cannot match the client rate.
95	During this interval, vanilla Memcached achieves the highest web server throughput, with libfibre and \CFA achieving slightly lower but very similar throughput.
96	Overall the performance of all three web servers is very similar, especially considering that at 500K the servers have reached saturation, which is discussed more in the next section.
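
The way each plotted point is derived from its 15 runs can be sketched as follows.
This is only an illustrative C fragment, not the script used to produce the figures; the names @summary@ and @summarize@ are hypothetical.
\begin{lstlisting}[language=C]
#include <stdlib.h>

/* The three values plotted per client rate:
   dotted line (min), solid line (median), dashed line (max). */
struct summary { double min, median, max; };

/* Ascending comparison for qsort. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts `runs` in place and returns its min, median and max.
   Assumes n >= 1 (n is 15 in the experiments described here). */
static struct summary summarize(double *runs, size_t n) {
    struct summary s;
    qsort(runs, n, sizeof runs[0], cmp_double);
    s.min = runs[0];
    s.max = runs[n - 1];
    s.median = (n % 2 == 1)
        ? runs[n / 2]
        : (runs[n / 2 - 1] + runs[n / 2]) / 2.0;
    return s;
}
\end{lstlisting}
With an odd number of runs, such as 15, the median is simply the 8th smallest measurement, so no averaging of two values occurs.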
97
98	\begin{figure}
99	\centering
100	\resizebox{0.83\linewidth}{!}{\input{result.memcd.rate.qps.pstex_t}}
101	\caption[Memcached Benchmark: Throughput]{Memcached Benchmark: Throughput\smallskip\newline Desired vs Actual query rate for 15,360 connections. Target QPS is the query rate that the clients are attempting to maintain and Actual QPS is the rate at which the server can respond.}
102	\label{fig:memcd:rate:qps}
103	%\end{figure}
104	\bigskip
105	%\begin{figure}
106	\centering
107	\resizebox{0.83\linewidth}{!}{\input{result.memcd.rate.99th.pstex_t}}
108	\caption[Memcached Benchmark: 99th Percentile Latency]{Memcached Benchmark: 99th Percentile Latency\smallskip\newline 99th Percentile of the response latency as a function of \emph{desired} query rate for 15,360 connections. }
109	\label{fig:memcd:rate:tail}
110	\end{figure}
111
112	\subsection{Tail Latency}
113	Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same Memcached experiment.
114
115	Again, each experiment is run 15 times with the median, maximum and minimum plotted with different lines.
116	As expected, the latency starts low and increases as the server gets close to saturation, at which point the latency increases dramatically because the web servers cannot keep up with the query rate, so client requests are disproportionately delayed.
117	Because of this dramatic increase, the Y-axis is presented using a log scale.
118	Note that the graph shows the \emph{target} query rate; the actual response rate is given in Figure~\ref{fig:memcd:rate:qps} as this is the same underlying experiment.
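
For reference, a 99th percentile can be computed from a set of latency samples as sketched below.
The sketch assumes the nearest-rank definition of a percentile; the load generator may use a different definition or a histogram approximation, and the name @percentile@ is hypothetical.
\begin{lstlisting}[language=C]
#include <stdlib.h>

/* Ascending comparison for qsort. */
static int cmp_latency(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile: the smallest sample such that at least
   pct percent of all samples are less than or equal to it.
   Sorts `samples` in place; assumes n >= 1 and 0 < pct <= 100. */
static double percentile(double *samples, size_t n, unsigned pct) {
    size_t rank;
    qsort(samples, n, sizeof samples[0], cmp_latency);
    rank = (n * pct + 99) / 100;   /* ceil(n * pct / 100) */
    if (rank == 0) rank = 1;
    return samples[rank - 1];
}
\end{lstlisting}
Under this definition, the 99th percentile of 100 samples is the 99th smallest sample, so a single extreme outlier does not affect it, whereas two or more do.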
119
120	For all three servers, the saturation point is reached before 500K queries per second, which is when throughput starts to decline among the web servers.
121	In this experiment, all three web servers are much more distinguishable than in the throughput experiment.
122	Vanilla Memcached achieves the lowest latency until 600K, after which all the web servers are struggling to respond to client requests.
123	\CFA begins to decline at 600K, indicating some bottleneck after saturation.
124	Overall, all three web servers achieve microsecond latencies and the increases in latency mostly follow each other.
125
126	\subsection{Update rate}
127	Since Memcached is effectively a simple database, the cache information can be written to concurrently by multiple queries.
128	And since writes can significantly affect performance, it is interesting to see how varying the update rate affects performance.
129	Figure~\ref{fig:memcd:updt} shows the results for the same experiment as the throughput and latency experiment but increasing the update percentage to 5\%, 10\% and 50\%, respectively, versus the original 3\% update percentage.
130
131	\begin{figure}
132	\hspace{-15pt}
133	\subfloat[][\CFA: Throughput]{
134	\resizebox{0.5\linewidth}{!}{
135	\input{result.memcd.forall.qps.pstex_t}
136	}
137	\label{fig:memcd:updt:forall:qps}
138	}
139	\subfloat[][\CFA: Latency]{
140	\resizebox{0.52\linewidth}{!}{
141	\input{result.memcd.forall.lat.pstex_t}
142	}
143	\label{fig:memcd:updt:forall:lat}
144	}
145
146	\hspace{-15pt}
147	\subfloat[][LibFibre: Throughput]{
148	\resizebox{0.5\linewidth}{!}{
149	\input{result.memcd.fibre.qps.pstex_t}
150	}
151	\label{fig:memcd:updt:fibre:qps}
152	}
153	\subfloat[][LibFibre: Latency]{
154	\resizebox{0.52\linewidth}{!}{
155	\input{result.memcd.fibre.lat.pstex_t}
156	}
157	\label{fig:memcd:updt:fibre:lat}
158	}
159
160	\hspace{-15pt}
161	\subfloat[][Vanilla: Throughput]{
162	\resizebox{0.5\linewidth}{!}{
163	\input{result.memcd.vanilla.qps.pstex_t}
164	}
165	\label{fig:memcd:updt:vanilla:qps}
166	}
167	\subfloat[][Vanilla: Latency]{
168	\resizebox{0.52\linewidth}{!}{
169	\input{result.memcd.vanilla.lat.pstex_t}
170	}
171	\label{fig:memcd:updt:vanilla:lat}
172	}
173	\caption[Throughput and Latency results at different update rates (percentage of writes).]{Throughput and Latency results at different update rates (percentage of writes).\smallskip\newline On the left, throughput as Desired vs Actual query rate.
174	Target QPS is the query rate that the clients are attempting to maintain and Actual QPS is the rate at which the server can respond.
175	On the right, tail latency, \ie 99th Percentile of the response latency as a function of \emph{desired} query rate.
176	For throughput, higher is better; for tail latency, lower is better.
177	Each series represents 15 independent runs: the dashed lines are the maximums of each series, the solid lines are the medians and the dotted lines are the minimums.
178	All runs have 15,360 client connections.}
179	\label{fig:memcd:updt}
180	\end{figure}
181
182	In the end, this experiment mostly demonstrates that the performance of Memcached is affected very little by the update rate.
183	Indeed, since values read/written can be bigger than what can be read/written atomically, a lock must be acquired while the value is read.
184	Hence, I believe the underlying locking pattern for reads and writes is fairly similar, if not the same.
185	These results suggest Memcached does not attempt to optimize reads/writes using a readers-writer lock to protect each value and instead just relies on having a sufficient number of keys to limit contention.
186	Overall, the update experiment shows that \CFA achieves equivalent performance.
187
188	\section{Static Web-Server}
189	The Memcached experiment does not exercise two key aspects of the \io subsystem: accept\-ing new connections and interacting with disks.
190	On the other hand, a web server servicing static web pages does stress both accepting connections and disk \io by accepting tens of thousands of client requests per second where these requests return static data serviced from the file-system cache or disk.\footnote{
191	Web servers servicing dynamic requests, which read from multiple locations and construct a response, are not as interesting since creating the response takes more time and does not exercise the runtime in a meaningfully different way.}
192	The static web server experiment compares NGINX~\cite{nginx} with a custom \CFA-based web server developed for this experiment.
193	NGINX is a high-performance, \emph{full-service}, event-driven web server.
194	It can handle both static and dynamic web content, as well as serve as a reverse proxy and a load balancer~\cite{reese2008nginx}.
195	This wealth of capabilities comes with a variety of potential configurations, dictating available features and performance.
196	The NGINX server runs a master process that performs operations such as reading configuration files, binding to ports, and controlling worker processes.
197	In comparison, the custom \CFA web server was developed specifically with this experiment in mind.
198	However, nothing seems to indicate that NGINX suffers from the increased flexibility.
199	When tuned for performance, NGINX appears to achieve the performance that the underlying hardware can achieve.
200
201	\subsection{NGINX threading}
202	When running as a static web server, NGINX uses an event-driven architecture to service incoming requests.
203	Incoming connections are assigned a \emph{stackless} HTTP state machine and worker processes can handle thousands of these state machines.
204	For the following experiment, NGINX is configured to use @epoll@ to listen for events on these state machines and have each worker process independently accept new connections.
205	Because of the realities of Linux (Subsection~\ref{ononblock}), NGINX also maintains a pool of auxiliary threads to handle blocking \io.
206	The configuration can set the number of worker processes desired, as well as the size of the auxiliary pool.
207	However, for the following experiments, NGINX is configured to let the master process decide the appropriate number of threads.
208
209	\subsection{\CFA web server}
210	The \CFA web server is a straightforward thread-per-connection web server, where a fixed number of \ats are created upfront.
211	Each \at calls @accept@, through @io_uring@, on the listening port and handles the incoming connection once accepted.
212	Most of the implementation is fairly straightforward;
213	however, the inclusion of file \io found an @io_uring@ problem that required an unfortunate workaround.
214
215	Normally, web servers use @sendfile@~\cite{MAN:sendfile} to send files over a socket because it performs a direct move in the kernel from the file-system cache to the NIC, eliminating reading/writing the file into the web server.
216	While @io_uring@ does not support @sendfile@, it does support @splice@~\cite{MAN:splice}, which is strictly more powerful.
217	However, because of how Linux implements file \io, see Subsection~\ref{ononblock}, @io_uring@ must delegate splice calls to worker threads \emph{inside} the kernel.
218	As of Linux 5.13, @io_uring@ had no mechanism to restrict the number of worker threads, and therefore, when tens of thousands of splice requests are made, it correspondingly creates tens of thousands of internal \glspl{kthrd}.
219	Such a high number of \glspl{kthrd} slows Linux significantly.
220	Rather than abandon the experiment, the \CFA web server was switched to @sendfile@.
221
222	Starting with \emph{blocking} @sendfile@, \CFA achieves acceptable performance until saturation is reached.
223	At saturation, latency increases and client connections begin to time out.
224	As these clients close their connection, the server must close its corresponding side without delay so the OS can reclaim the resources used by these connections.
225	Indeed, until the server connection is closed, the connection lingers in the CLOSE-WAIT TCP state~\cite{rfc:tcp} and the TCP buffers are preserved.
226	However, this poses a problem using blocking @sendfile@ calls:
227	when @sendfile@ blocks, the \proc rather than the \at blocks, preventing other connections from closing their sockets.
228	The call can block if there is insufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.\footnote{
229	\lstinline{sendfile} can always block even in nonblocking mode if the file to be sent is not in the file-system cache, because Linux does not provide nonblocking disk I/O.}
230	This effect results in a self-reinforcing feedback loop where more timeouts lead to more @sendfile@ calls running out of resources.
231
232	Normally, this problem is addressed by using @select@/@epoll@ to wait for sockets to have sufficient resources.
233	However, since @io_uring@ does not support @sendfile@ but does respect non\-blocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely:
234	all calls simply immediately return @EAGAIN@ and all asynchronicity is lost.
235
236	Switching the entire \CFA runtime to @epoll@ for this experiment is unrealistic and does not help in the evaluation of the \CFA runtime.
237	For this reason, the \CFA web server sets and resets the @O_NONBLOCK@ flag before and after any calls to @sendfile@.
238	However, when the nonblocking @sendfile@ returns @EAGAIN@, the \CFA server cannot block the \at because its I/O subsystem uses @io_uring@.
239	Therefore, the \at spins on @sendfile@: whenever the call returns @EAGAIN@, the \at yields and then retries.
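
The set, retry and reset pattern described above can be sketched in plain C as follows.
This is a minimal illustration rather than the actual \CFA web server code: the function name @send_file_retry@ is hypothetical, and @sched_yield@ stands in for the user-level yield of the \CFA runtime.
\begin{lstlisting}[language=C]
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/sendfile.h>
#include <sys/types.h>

/* Sends up to `count` bytes from the start of `file_fd` over `sock_fd`.
   Returns the number of bytes sent, or -1 on error with errno set.
   Hypothetical helper for illustration only. */
static ssize_t send_file_retry(int sock_fd, int file_fd, size_t count) {
    off_t offset = 0;
    size_t sent = 0;
    ssize_t ret = 0;
    int saved_errno;

    /* Set O_NONBLOCK just before the sendfile calls. */
    int flags = fcntl(sock_fd, F_GETFL);
    if (flags < 0) return -1;
    if (fcntl(sock_fd, F_SETFL, flags | O_NONBLOCK) < 0) return -1;

    while (sent < count) {
        ret = sendfile(sock_fd, file_fd, &offset, count - sent);
        if (ret > 0) { sent += (size_t)ret; continue; }
        if (ret == 0) break;            /* end of file reached */
        if (errno == EAGAIN || errno == EINTR) {
            sched_yield();              /* stand-in for a user-thread yield */
            continue;                   /* retry the sendfile */
        }
        break;                          /* unrecoverable error */
    }

    /* Reset the socket to its original flags after the calls. */
    saved_errno = errno;
    fcntl(sock_fd, F_SETFL, flags);
    errno = saved_errno;
    return (ret < 0) ? -1 : (ssize_t)sent;
}
\end{lstlisting}
The yield lets other ready threads run, including ones waiting to close timed-out connections, while the socket has no buffer space, at the cost of busy polling instead of truly blocking the thread.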
240
241	Interestingly, Linux 5.15 @io_uring@ introduces the ability to limit the number of worker threads that are created through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
242	Presumably, this limit would prevent the explosion of \glspl{kthrd}, which justified using @sendfile@ over @io_uring@ and @splice@.
243	However, recall from Section~\ref{iouring} that @io_uring@ maintains two pools of workers: bounded workers and unbounded workers.
244	For a web server, the unbounded workers should handle accepts and reads on sockets, and the bounded workers should handle reading files from disk.
245	This setup allows fine-grained control over the number of workers needed for each operation type and presumably leads to good performance.
246
247	However, @io_uring@ must contend with another reality of Linux: the versatility of @splice@.
248	Indeed, @splice@ can be used both for reading and writing to or from any type of file descriptor.
249	This generality makes it ambiguous which pool @io_uring@ should delegate @splice@ calls to.
250	In the case of splicing from a socket to a pipe, @splice@ behaves like an unbounded operation, but when splicing from a regular file to a pipe, @splice@ becomes a bounded operation.
251	To make things more complicated, @splice@ can read from a pipe and write to a regular file.
252	In this case, the read is an unbounded operation but the write is a bounded one.
253	This leaves @io_uring@ in a difficult situation, where it is hard to delegate @splice@ operations to the appropriate type of worker.
254	Since there is little or no context available to @io_uring@, it seems to always delegate @splice@ operations to the unbounded workers.
255	This decision is unfortunate for this specific experiment since it prevents the web server from limiting the number of parallel calls to @splice@ without affecting the performance of @read@ or @accept@.
256	For this reason, the @sendfile@ approach described above is still the most performant solution in Linux 5.15.
257
258	One possible workaround is to create more @io_uring@ instances so @splice@ operations can be issued to a different instance than the @read@ and @accept@ operations.
259	However, I do not believe this solution is appropriate in general;
260	it simply replaces my current web server hack with a different, equivalent hack.
261
262	\subsection{Benchmark Environment}
263	Unlike the Memcached experiment, the web server experiment is run in a heterogeneous environment.
264	\begin{itemize}
265	\item
266	The server runs Ubuntu 20.04.4 LTS on top of Linux Kernel 5.13.0-52.
267	\item
268	The server computer has four AMD Opteron\texttrademark{} Processor 6380 with 16 cores running at 2.5GHz, for a total of 64 \glspl{hthrd}.
269	\item
270	The computer is booted with only 8 CPUs enabled, which is sufficient to achieve line rate.
271	\item
272	Both servers are set up with enough parallelism to achieve 100\% CPU utilization, which happens at higher request rates.
273	\item
274	Each CPU has 64 KB, 256 KB and 8 MB of L1, L2 and L3 caches respectively.
275	\item
276	The computer is booted with only 25GB of memory to restrict the file-system cache.
277	\end{itemize}
278	There are 8 client machines.
279	\begin{itemize}
280	\item
281	A client runs a 2.6.11-1 SMP Linux kernel, which permits each client load generator to run on a separate CPU.
282	\item
283	It has two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards.
284	\item
285	Network routing is performed by an HP 2530 10 Gigabit Ethernet switch.
286	\item
287	A client machine runs two copies of the workload generator.
288	\end{itemize}
289	The clients and network are sufficiently provisioned to drive the server to saturation and beyond.
290	Hence, any server effects are attributable solely to the runtime system and web server.
291	Finally, without restricting the server hardware resources, it is impossible to determine if a runtime system or the web server using it has any specific design restrictions, \eg using space to reduce time.
292	Trying to determine these restrictions with large numbers of processors or memory simply means running equally large experiments, which take longer and are harder to set up.
293
294	\subsection{Throughput}
295	To measure web server throughput, the server computer is loaded with 21,600 files, sharded across 650 directories, occupying about 2.2GB of disk, distributed over the server's 4-drive RAID-5 array to achieve high throughput for disk I/O.
296	The clients run httperf~\cite{httperf} to request a set of static files.
297	The httperf load generator is used with session files to simulate a large number of users and to implement a partially open-loop system.
298	This permits httperf to produce overload conditions, generate multiple requests from persistent HTTP/1.1 connections, and include both active and inactive off periods to model browser processing times and user think times~\cite{Barford98}.
299
300	The experiments are run with 16 clients, each running a copy of httperf (one copy per CPU), requiring a set of 16 log files with requests conforming to a Zipf distribution.
301	This distribution is representative of users accessing static data through a web browser.
302	Each request reads a file name from its trace, establishes a connection, performs an HTTP GET request for the file name, receives the file data, closes the connection, and repeats the process.
303	Some trace elements have multiple file names that are read across a persistent connection.
304	A client times out if the server does not complete a request within 10 seconds.
305
306	An experiment consists of running a server with request rates ranging from 10,000 to 70,000 requests per second;
307	each rate takes about 5 minutes to complete.
308	There are 20 seconds of idle time between rates and between experiments to allow connections in the TIME-WAIT state to clear.
309	Server throughput is measured both at peak and after saturation (\ie after peak).
310	Peak indicates the level of client requests the server can handle and after peak indicates if a server degrades gracefully.
311	Throughput is measured by aggregating the results from httperf for all the clients.
312
313	This experiment can be done for two workload scenarios by reconfiguring the server with different amounts of memory: 25 GB and 2.5 GB.
314	The two workloads correspond to in-memory and disk-I/O respectively.
315	Due to the Zipf distribution, only a small amount of memory is needed to service a significant percentage of requests.
316	Table~\ref{t:CumulativeMemory} shows the cumulative memory required to satisfy the specified percentage of requests; \eg 95\% of the requests come from 126.5 MB of the file set and 95\% of the requests are for files less than or equal to 51,200 bytes.
317	Interestingly, with 2.5 GB of memory, significant disk-I/O occurs.
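
The concentration of requests under a Zipf distribution can be illustrated with the following sketch.
It assumes the textbook form where the request frequency of a file is proportional to $1/\mathit{rank}$ (exponent 1); the exponent of the actual request traces is not stated here, and the name @zipf_top_share@ is hypothetical.
\begin{lstlisting}[language=C]
#include <stddef.h>

/* Fraction of requests that target the k most popular of n files,
   assuming request frequency proportional to 1/rank
   (Zipf with exponent 1, an assumption made for illustration). */
static double zipf_top_share(size_t k, size_t n) {
    double top = 0.0, all = 0.0;
    for (size_t rank = 1; rank <= n; rank++) {
        double weight = 1.0 / (double)rank;
        all += weight;
        if (rank <= k) top += weight;
    }
    return (n == 0) ? 0.0 : top / all;
}
\end{lstlisting}
Under this assumed exponent, @zipf_top_share(216, 21600)@, \ie the most popular 1\% of a file set of the size used here, accounts for roughly 56\% of requests, which is consistent in spirit with a small amount of memory servicing most requests.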
318
319	\begin{table}
320	\caption{Cumulative memory for requests by file size}
321	\label{t:CumulativeMemory}
322	\begin{tabular}{r|rrrrrrrr}
323	\% Requests & 10 & 30 & 50 & 70 & 80 & 90 & \textbf{95} & 100 \\
324	Memory (MB) & 0.5 & 1.5 & 8.4 & 12.2 & 20.1 & 94.3 & \textbf{126.5} & 2,291.6 \\
325	File Size (B) & 409 & 716 & 4,096 & 5,120 & 7,168 & 40,960 & \textbf{51,200} & 921,600
326	\end{tabular}
327	\end{table}
328
329	\begin{figure}
330	\centering
331	\subfloat[][Throughput]{
332	\resizebox{0.85\linewidth}{!}{\input{result.swbsrv.25gb.pstex_t}}
333	\label{fig:swbsrv:ops}
334	}
335
336	\subfloat[][Rate of Errors]{
337	\resizebox{0.85\linewidth}{!}{\input{result.swbsrv.25gb.err.pstex_t}}
338	\label{fig:swbsrv:err}
339	}
340	\caption[Static web server Benchmark: Throughput]{Static web server Benchmark: Throughput\smallskip\newline Throughput vs request rate for short-lived connections.}
341	\label{fig:swbsrv}
342	\end{figure}
343
344	Figure~\ref{fig:swbsrv} shows the results comparing \CFA to NGINX in terms of throughput.
345	These results are fairly straightforward.
346	Both servers achieve the same throughput until around 57,500 requests per second.
347	Since the clients are asking for the same files, the fact that the throughput matches exactly is expected as long as both servers are able to serve the request rate.
348	Once the saturation point is reached, both servers are still very close.
349	NGINX achieves slightly better throughput.
350	However, Figure~\ref{fig:swbsrv:err} shows the rate of errors, a gross approximation of tail latency, where \CFA achieves notably fewer errors once the servers reach saturation.
351	This suggests \CFA is slightly fairer with less throughput, while NGINX sacrifices fairness for more throughput.
352	This experiment demonstrates that the \CFA web server is able to match the performance of NGINX up to and beyond the saturation point of the machine.
353
354	\subsection{Disk Operations}
355	With 25GB of memory, the entire experimental file-set plus the web server and OS fit in memory.
356	If memory is constrained, the OS must evict files from the file cache, which causes @sendfile@ to read from disk.\footnote{
357	For the in-memory experiments, the file-system cache was warmed by running an experiment three times before measuring started to ensure all files are in the file-system cache.}
358	web servers can behave very differently once file I/O begins and increases.
359	Hence, prior work~\cite{Harji10} suggests running both kinds of experiments to test overall web server performance.
360
361	However, after reducing memory to 2.5GB, the problem with @splice@ and @io_uring@ rears its ugly head again.
362	Indeed, in the in-memory configuration, replacing @splice@ with calls to @sendfile@ works because the bounded side basically never blocks.
363	Like @splice@, @sendfile@ is in a situation where the read side requires bounded blocking, \eg reading from a regular file, while the write side requires unbounded blocking, \eg blocking until the socket is available for writing.
364	The unbounded side can be handled by yielding when it returns @EAGAIN@, as mentioned above, but this trick does not work for the bounded side.
365	The only solution for the bounded side is to spawn more threads and let these handle the blocking.
366
367	Supporting this case in the web server would require creating more \procs or creating a dedicated thread pool.
368	However, I felt this kind of modification moves too far away from my goal of evaluating the \CFA runtime, \ie it begins writing another runtime system;
369	hence, I decided to forgo experiments on low-memory performance.
