Timestamp: Nov 24, 2022, 3:41:44 PM
Author: Thierry Delisle <tdelisle@…>
Branches: ADT, ast-experimental, master
Children: dacd8e6e
Parents: 82a90d4
Message: Last corrections to my thesis... hopefully
File: 1 edited
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

--- doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex (r82a90d4)
+++ doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex (rddcaff6)
@@ -1,6 +1,6 @@
 \chapter{Macro-Benchmarks}\label{macrobench}
-The previous chapter demonstrated the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios.
+The previous chapter demonstrated that the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios.
 The next step is to demonstrate performance stays true in more realistic and complete scenarios.
-Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate \CFA performs competitively compared to web servers used in production environments.
+Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate that \CFA performs competitively compared to web servers used in production environments.
 
 Web servers are chosen because they offer fairly simple applications that perform complex I/O, both network and disk, and are useful as standalone products.
     
@@ -10,4 +10,9 @@
 As such, these experiments should highlight the overhead due to any \CFA fairness cost in realistic scenarios.
 
+The most obvious performance metric for web servers is throughput.
+This metric generally measures the speed at which the server answers and, relatedly, how fast clients can send requests before the server can no longer keep up.
+Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service?
+Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception.
+
 \section{Memcached}
 Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}.
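
The tail-latency metric added above is a percentile of the per-request latency distribution; the results later in the chapter use the 99th percentile. As a minimal sketch, assuming nearest-rank-style selection over an in-memory sample array (the sample values below are illustrative, not measurements), the figure can be computed as:

#include <stdio.h>
#include <stdlib.h>

// Comparator for qsort over doubles.
static int cmp_double( const void * a, const void * b ) {
	double x = *(const double *)a, y = *(const double *)b;
	return ( x > y ) - ( x < y );
}

// Return the p-th percentile (0 < p <= 100) of n latency samples,
// using a simple nearest-rank style index into the sorted array.
static double percentile( double * samples, size_t n, double p ) {
	qsort( samples, n, sizeof(double), cmp_double );
	size_t idx = (size_t)( ( p / 100.0 ) * (double)( n - 1 ) + 0.5 );
	return samples[idx];
}

int main( void ) {
	// Illustrative per-request latencies in milliseconds: mostly fast,
	// one straggler that dominates the tail.
	double lat[] = { 0.8, 1.1, 0.9, 12.5, 1.0, 0.7, 1.2, 0.95 };
	size_t n = sizeof(lat) / sizeof(lat[0]);
	printf( "p99 latency: %.2f ms\n", percentile( lat, n, 99.0 ) );
	return 0;
}

Production load generators typically avoid storing every sample and use fixed-bucket histograms instead, trading a small quantization error for constant memory.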
     
@@ -26,4 +31,6 @@
 Each CPU has 6 cores and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}.
 \item
+The machine is configured to run each server on 12 dedicated \glspl{hthrd} and uses 6 of the remaining \glspl{hthrd} for software-interrupt handling~\cite{wiki:softirq}, resulting in a maximum CPU utilization of 75\% (18 / 24 \glspl{hthrd}).
+\item
 A CPU has 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches, respectively.
 \item
     
@@ -47,5 +54,5 @@
 \item
 For UDP connections, all the threads listen to a single UDP socket for incoming requests.
-Threads that are not currently dealing with another request ignore the incoming packet.
+Threads that are currently dealing with another request ignore the incoming packet.
 One of the remaining, non-busy, threads reads the request and sends the response.
 This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request.
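
The UDP scheme in this hunk works because every idle worker blocks in a receive on the same socket and the kernel hands each datagram to exactly one of them; a thread busy with another request is simply not parked in the receive call, so it never sees the packet. A minimal sketch of that pattern using plain pthreads follows; the port, buffer size, worker count, and echo-style reply are illustrative assumptions, not the benchmark servers' actual settings:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define NWORKERS 4   // illustrative worker count

// Every worker blocks in recvfrom() on the SAME socket; the kernel wakes
// exactly one waiting thread per datagram.  A thread busy servicing a
// request is not parked here, so it never receives the packet.
static void * worker( void * arg ) {
	int fd = *(int *)arg;
	char buf[1500];
	for ( ;; ) {
		struct sockaddr_in client;
		socklen_t len = sizeof(client);
		ssize_t n = recvfrom( fd, buf, sizeof(buf), 0, (struct sockaddr *)&client, &len );
		if ( n < 0 ) continue;
		// ... parse and service the request here ...
		sendto( fd, buf, (size_t)n, 0, (const struct sockaddr *)&client, len ); // echo as placeholder
	}
	return NULL;
}

int main( void ) {
	int fd = socket( AF_INET, SOCK_DGRAM, 0 );
	struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(11211),
	                            .sin_addr.s_addr = htonl(INADDR_ANY) };
	bind( fd, (struct sockaddr *)&addr, sizeof(addr) );
	pthread_t tid[NWORKERS];
	for ( int i = 0; i < NWORKERS; i += 1 ) pthread_create( &tid[i], NULL, worker, &fd );
	for ( int i = 0; i < NWORKERS; i += 1 ) pthread_join( tid[i], NULL );
	close( fd );
	return 0;
}

The increased CPU \gls{load} mentioned in the hunk comes from exactly this structure: a wakeup may race with another thread consuming the datagram, so threads can wake, find nothing to do, and go back to sleep.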
     
@@ -79,5 +86,5 @@
 \subsection{Throughput} \label{memcd:tput}
 This experiment is done by having the clients establish 15,360 total connections, which persist for the duration of the experiment.
-The clients then send read and write queries with only 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible.
+The clients then send read and write queries with 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible.
 Figure~\ref{fig:memcd:rate:qps} shows the 3 server versions at different client rates, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', and the actual rate, ``Actual QPS'', for all three web servers.
 
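
Here ``attempting to follow a desired query rate'' describes open-loop load generation: send times sit on a fixed time grid regardless of when responses arrive, so an overloaded server falls behind the target rather than silently throttling the client. A minimal sketch of such pacing, where send_query() is a hypothetical stand-in for one client query:

#include <time.h>

static void send_query( void ) {
	// placeholder: a real client would issue one read/write query here
}

// Open-loop pacing: each send is scheduled on an absolute-time grid,
// independent of response times.
static void run_at_rate( long target_qps, long duration_s ) {
	struct timespec next;
	clock_gettime( CLOCK_MONOTONIC, &next );
	long period_ns = 1000000000L / target_qps;
	for ( long i = 0; i < target_qps * duration_s; i += 1 ) {
		clock_nanosleep( CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL ); // wait for the slot
		send_query();
		next.tv_nsec += period_ns;   // advance the grid, not "now"
		if ( next.tv_nsec >= 1000000000L ) { next.tv_sec += 1; next.tv_nsec -= 1000000000L; }
	}
}

int main( void ) {
	run_at_rate( 1000, 1 );   // illustrative: 1000 QPS for 1 second
	return 0;
}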
     
@@ -104,10 +111,8 @@
 
 \subsection{Tail Latency}
-Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service?
-Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception.
 Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same Memcached experiment.
 
 Again, each experiment is run 15 times with the median, maximum and minimum plotted with different lines.
-As expected, the latency starts low and increases as the server gets close to saturation, at which point, the latency increases dramatically because the web servers cannot keep up with the connection rate so client requests are disproportionally delayed.
+As expected, the latency starts low and increases as the server gets close to saturation, at which point the latency increases dramatically because the web servers cannot keep up with the connection rate, so client requests are disproportionally delayed.
 Because of this dramatic increase, the Y-axis is presented using a log scale.
 Note that the graph shows the \emph{target} query rate, the actual response rate is given in Figure~\ref{fig:memcd:rate:qps} as this is the same underlying experiment.
     
@@ -186,14 +191,17 @@
 web servers servicing dynamic requests, which read from multiple locations and construct a response, are not as interesting since creating the response takes more time and does not exercise the runtime in a meaningfully different way.}
 The static web server experiment compares NGINX~\cite{nginx} with a custom \CFA-based web server developed for this experiment.
-
-\subsection{NGINX threading}
 NGINX is a high-performance, \emph{full-service}, event-driven web server.
 It can handle both static and dynamic web content, as well as serve as a reverse proxy and a load balancer~\cite{reese2008nginx}.
 This wealth of capabilities comes with a variety of potential configurations, dictating available features and performance.
 The NGINX server runs a master process that performs operations such as reading configuration files, binding to ports, and controlling worker processes.
-When running as a static web server, it uses an event-driven architecture to service incoming requests.
+In comparison, the custom \CFA web server was developed specifically with this experiment in mind.
+However, nothing seems to indicate that NGINX suffers from this increased flexibility.
+When tuned for performance, NGINX appears to achieve the performance that the underlying hardware can achieve.
+
+\subsection{NGINX threading}
+When running as a static web server, NGINX uses an event-driven architecture to service incoming requests.
 Incoming connections are assigned a \emph{stackless} HTTP state machine and worker processes can handle thousands of these state machines.
 For the following experiment, NGINX is configured to use @epoll@ to listen for events on these state machines and have each worker process independently accept new connections.
-Because of the realities of Linux, see Subsection~\ref{ononblock}, NGINX also maintains a pool of auxiliary threads to handle blocking \io.
+Because of the realities of Linux (Subsection~\ref{ononblock}), NGINX also maintains a pool of auxiliary threads to handle blocking \io.
 The configuration can set the number of worker processes desired, as well as the size of the auxiliary pool.
 However, for the following experiments, NGINX is configured to let the master process decide the appropriate number of threads.
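
The worker structure in this hunk is the classic readiness loop: the listening socket and each accepted connection are registered with an epoll instance, and the worker services whichever descriptors become ready without blocking on any one of them. A minimal sketch of that loop, with error handling pared down; handle_request() is a hypothetical placeholder, and real NGINX workers layer the stackless HTTP state machines and the auxiliary blocking-I/O pool on top of this:

#include <fcntl.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

// Hypothetical placeholder: a real worker parses the HTTP request on fd
// and sends the static file back (possibly via the auxiliary pool).
static void handle_request( int fd ) { (void)fd; }

int main( void ) {
	int lfd = socket( AF_INET, SOCK_STREAM, 0 );
	struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(8080),
	                            .sin_addr.s_addr = htonl(INADDR_ANY) };
	bind( lfd, (struct sockaddr *)&addr, sizeof(addr) );
	listen( lfd, SOMAXCONN );

	int ep = epoll_create1( 0 );
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
	epoll_ctl( ep, EPOLL_CTL_ADD, lfd, &ev );   // watch the listening socket

	struct epoll_event events[MAX_EVENTS];
	for ( ;; ) {
		int nready = epoll_wait( ep, events, MAX_EVENTS, -1 );
		for ( int i = 0; i < nready; i += 1 ) {
			if ( events[i].data.fd == lfd ) {                 // new connection
				int cfd = accept( lfd, NULL, NULL );
				fcntl( cfd, F_SETFL, O_NONBLOCK );            // never block the loop
				struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
				epoll_ctl( ep, EPOLL_CTL_ADD, cfd, &cev );
			} else {                                          // connection is readable
				handle_request( events[i].data.fd );
			}
		}
	}
}

The auxiliary thread pool mentioned in the hunk exists because disk reads on Linux can block even on descriptors marked nonblocking, which would stall this loop; handing such work to helper threads keeps the readiness loop responsive.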
     
@@ -262,4 +270,6 @@
 The computer is booted with only 8 CPUs enabled, which is sufficient to achieve line rate.
 \item
+Both servers are set up with enough parallelism to achieve 100\% CPU utilization, which happens at higher request rates.
+\item
 Each CPU has 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches respectively.
 \item