Timestamp: Nov 24, 2022, 3:41:44 PM
Author: Thierry Delisle <tdelisle@…>
Branches: ADT, ast-experimental, master
Children: dacd8e6e
Parents: 82a90d4
Message: Last corrections to my thesis... hopefully
File: 1 edited
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

--- doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex (r82a90d4)
+++ doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex (rddcaff6)
@@ -1,6 +1,6 @@
 \chapter{Macro-Benchmarks}\label{macrobench}
-The previous chapter demonstrated the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios.
+The previous chapter demonstrated that the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios.
 The next step is to demonstrate performance stays true in more realistic and complete scenarios.
-Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate \CFA performs competitively compared to web servers used in production environments.
+Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate that \CFA performs competitively compared to web servers used in production environments.
 
 Web servers are chosen because they offer fairly simple applications that perform complex I/O, both network and disk, and are useful as standalone products.
     
@@ -10,4 +10,9 @@
 As such, these experiments should highlight the overhead due to any \CFA fairness cost in realistic scenarios.
 
+The most obvious performance metric for web servers is throughput.
+This metric generally measures the speed at which the server answers and, relatedly, how fast clients can send requests before the server can no longer keep up.
+Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service?
+Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception.
+
 \section{Memcached}
 Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}.
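
The tail-latency metric added above is a percentile of the per-request latency distribution; the results later in the chapter use the 99th percentile. As a minimal sketch, assuming nearest-rank-style selection over an in-memory sample array (the sample values below are illustrative, not measurements), the figure can be computed as:

#include <stdio.h>
#include <stdlib.h>

// Comparator for qsort over doubles.
static int cmp_double( const void * a, const void * b ) {
	double x = *(const double *)a, y = *(const double *)b;
	return ( x > y ) - ( x < y );
}

// Return the p-th percentile (0 < p <= 100) of n latency samples,
// using a simple nearest-rank style index into the sorted array.
static double percentile( double * samples, size_t n, double p ) {
	qsort( samples, n, sizeof(double), cmp_double );
	size_t idx = (size_t)( ( p / 100.0 ) * (double)( n - 1 ) + 0.5 );
	return samples[idx];
}

int main( void ) {
	// Illustrative per-request latencies in milliseconds: mostly fast,
	// one straggler that dominates the tail.
	double lat[] = { 0.8, 1.1, 0.9, 12.5, 1.0, 0.7, 1.2, 0.95 };
	size_t n = sizeof(lat) / sizeof(lat[0]);
	printf( "p99 latency: %.2f ms\n", percentile( lat, n, 99.0 ) );
	return 0;
}

Production load generators typically avoid storing every sample and use fixed-bucket histograms instead, trading a small quantization error for constant memory.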
     
@@ -26,4 +31,6 @@
 Each CPU has 6 cores and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}.
 \item
+The machine is configured to run each server on 12 dedicated \glspl{hthrd} and uses 6 of the remaining \glspl{hthrd} for software-interrupt handling~\cite{wiki:softirq}, resulting in a maximum CPU utilization of 75\% (18 / 24 \glspl{hthrd}).
+\item
 A CPU has 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches, respectively.
 \item
     
@@ -47,5 +54,5 @@
 \item
 For UDP connections, all the threads listen to a single UDP socket for incoming requests.
-Threads that are not currently dealing with another request ignore the incoming packet.
+Threads that are currently dealing with another request ignore the incoming packet.
 One of the remaining, non-busy, threads reads the request and sends the response.
 This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request.
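
The UDP scheme in this hunk works because every idle worker blocks in a receive on the same socket and the kernel hands each datagram to exactly one of them; a thread busy with another request is simply not parked in the receive call, so it never sees the packet. A minimal sketch of that pattern using plain pthreads follows; the port, buffer size, worker count, and echo-style reply are illustrative assumptions, not the benchmark servers' actual settings:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

#define NWORKERS 4   // illustrative worker count

// Every worker blocks in recvfrom() on the SAME socket; the kernel wakes
// exactly one waiting thread per datagram.  A thread busy servicing a
// request is not parked here, so it never receives the packet.
static void * worker( void * arg ) {
	int fd = *(int *)arg;
	char buf[1500];
	for ( ;; ) {
		struct sockaddr_in client;
		socklen_t len = sizeof(client);
		ssize_t n = recvfrom( fd, buf, sizeof(buf), 0, (struct sockaddr *)&client, &len );
		if ( n < 0 ) continue;
		// ... parse and service the request here ...
		sendto( fd, buf, (size_t)n, 0, (const struct sockaddr *)&client, len ); // echo as placeholder
	}
	return NULL;
}

int main( void ) {
	int fd = socket( AF_INET, SOCK_DGRAM, 0 );
	struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(11211),
	                            .sin_addr.s_addr = htonl(INADDR_ANY) };
	bind( fd, (struct sockaddr *)&addr, sizeof(addr) );
	pthread_t tid[NWORKERS];
	for ( int i = 0; i < NWORKERS; i += 1 ) pthread_create( &tid[i], NULL, worker, &fd );
	for ( int i = 0; i < NWORKERS; i += 1 ) pthread_join( tid[i], NULL );
	close( fd );
	return 0;
}

The increased CPU \gls{load} mentioned in the hunk comes from exactly this structure: a wakeup may race with another thread consuming the datagram, so threads can wake, find nothing to do, and go back to sleep.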
     
@@ -79,5 +86,5 @@
 \subsection{Throughput} \label{memcd:tput}
 This experiment is done by having the clients establish 15,360 total connections, which persist for the duration of the experiment.
-The clients then send read and write queries with only 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible.
+The clients then send read and write queries with 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible.
 Figure~\ref{fig:memcd:rate:qps} shows the 3 server versions at different client rates, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', and the actual rate, ``Actual QPS'', for all three web servers.
 
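
Here ``attempting to follow a desired query rate'' describes open-loop load generation: send times sit on a fixed time grid regardless of when responses arrive, so an overloaded server falls behind the target rather than silently throttling the client. A minimal sketch of such pacing, where send_query() is a hypothetical stand-in for one client query:

#include <time.h>

static void send_query( void ) {
	// placeholder: a real client would issue one read/write query here
}

// Open-loop pacing: each send is scheduled on an absolute-time grid,
// independent of response times.
static void run_at_rate( long target_qps, long duration_s ) {
	struct timespec next;
	clock_gettime( CLOCK_MONOTONIC, &next );
	long period_ns = 1000000000L / target_qps;
	for ( long i = 0; i < target_qps * duration_s; i += 1 ) {
		clock_nanosleep( CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL ); // wait for the slot
		send_query();
		next.tv_nsec += period_ns;   // advance the grid, not "now"
		if ( next.tv_nsec >= 1000000000L ) { next.tv_sec += 1; next.tv_nsec -= 1000000000L; }
	}
}

int main( void ) {
	run_at_rate( 1000, 1 );   // illustrative: 1000 QPS for 1 second
	return 0;
}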
     
@@ -104,10 +111,8 @@
 
 \subsection{Tail Latency}
-Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service?
-Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception.
 Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same Memcached experiment.
 
 Again, each experiment is run 15 times with the median, maximum and minimum plotted with different lines.
-As expected, the latency starts low and increases as the server gets close to saturation, at which point, the latency increases dramatically because the web servers cannot keep up with the connection rate so client requests are disproportionally delayed.
+As expected, the latency starts low and increases as the server gets close to saturation, at which point the latency increases dramatically because the web servers cannot keep up with the connection rate, so client requests are disproportionally delayed.
 Because of this dramatic increase, the Y-axis is presented using a log scale.
 Note that the graph shows the \emph{target} query rate, the actual response rate is given in Figure~\ref{fig:memcd:rate:qps} as this is the same underlying experiment.
     
@@ -186,14 +191,17 @@
 web servers servicing dynamic requests, which read from multiple locations and construct a response, are not as interesting since creating the response takes more time and does not exercise the runtime in a meaningfully different way.}
 The static web server experiment compares NGINX~\cite{nginx} with a custom \CFA-based web server developed for this experiment.
-
-\subsection{NGINX threading}
 NGINX is a high-performance, \emph{full-service}, event-driven web server.
 It can handle both static and dynamic web content, as well as serve as a reverse proxy and a load balancer~\cite{reese2008nginx}.
 This wealth of capabilities comes with a variety of potential configurations, dictating available features and performance.
 The NGINX server runs a master process that performs operations such as reading configuration files, binding to ports, and controlling worker processes.
-When running as a static web server, it uses an event-driven architecture to service incoming requests.
+In comparison, the custom \CFA web server was developed specifically with this experiment in mind.
+However, nothing seems to indicate that NGINX suffers from this increased flexibility.
+When tuned for performance, NGINX appears to achieve the performance that the underlying hardware can achieve.
+
+\subsection{NGINX threading}
+When running as a static web server, NGINX uses an event-driven architecture to service incoming requests.
 Incoming connections are assigned a \emph{stackless} HTTP state machine and worker processes can handle thousands of these state machines.
 For the following experiment, NGINX is configured to use @epoll@ to listen for events on these state machines and have each worker process independently accept new connections.
-Because of the realities of Linux, see Subsection~\ref{ononblock}, NGINX also maintains a pool of auxiliary threads to handle blocking \io.
+Because of the realities of Linux (Subsection~\ref{ononblock}), NGINX also maintains a pool of auxiliary threads to handle blocking \io.
 The configuration can set the number of worker processes desired, as well as the size of the auxiliary pool.
 However, for the following experiments, NGINX is configured to let the master process decide the appropriate number of threads.
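
The worker structure in this hunk is the classic readiness loop: the listening socket and each accepted connection are registered with an epoll instance, and the worker services whichever descriptors become ready without blocking on any one of them. A minimal sketch of that loop, with error handling pared down; handle_request() is a hypothetical placeholder, and real NGINX workers layer the stackless HTTP state machines and the auxiliary blocking-I/O pool on top of this:

#include <fcntl.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

// Hypothetical placeholder: a real worker parses the HTTP request on fd
// and sends the static file back (possibly via the auxiliary pool).
static void handle_request( int fd ) { (void)fd; }

int main( void ) {
	int lfd = socket( AF_INET, SOCK_STREAM, 0 );
	struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(8080),
	                            .sin_addr.s_addr = htonl(INADDR_ANY) };
	bind( lfd, (struct sockaddr *)&addr, sizeof(addr) );
	listen( lfd, SOMAXCONN );

	int ep = epoll_create1( 0 );
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
	epoll_ctl( ep, EPOLL_CTL_ADD, lfd, &ev );   // watch the listening socket

	struct epoll_event events[MAX_EVENTS];
	for ( ;; ) {
		int nready = epoll_wait( ep, events, MAX_EVENTS, -1 );
		for ( int i = 0; i < nready; i += 1 ) {
			if ( events[i].data.fd == lfd ) {                 // new connection
				int cfd = accept( lfd, NULL, NULL );
				fcntl( cfd, F_SETFL, O_NONBLOCK );            // never block the loop
				struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
				epoll_ctl( ep, EPOLL_CTL_ADD, cfd, &cev );
			} else {                                          // connection is readable
				handle_request( events[i].data.fd );
			}
		}
	}
}

The auxiliary thread pool mentioned in the hunk exists because disk reads on Linux can block even on descriptors marked nonblocking, which would stall this loop; handing such work to helper threads keeps the readiness loop responsive.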
     
@@ -262,4 +270,6 @@
 The computer is booted with only 8 CPUs enabled, which is sufficient to achieve line rate.
 \item
+Both servers are set up with enough parallelism to achieve 100\% CPU utilization, which happens at higher request rates.
+\item
 Each CPU has 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches respectively.
 \item