Timestamp:
Aug 23, 2022, 6:40:54 AM (2 years ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
ADT, ast-experimental, master, pthread-emulation
Children:
8baa40aa
Parents:
4fee301 (diff), 94eff4c (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc

File:
1 edited

  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

    r4fee301 r0c40bfe  
    11\chapter{Macro-Benchmarks}\label{macrobench}
    2 The previous chapter has demonstrated that the scheduler achieves its performance goal in small and controlled scenario.
    3 The next step is then to demonstrate that this stays true in more realistic and complete scenarios.
    4 This chapter presents two flavours of webservers that demonstrate that \CFA performs competitively with production environments.
    5 
    6 Webservers where chosen because they offer fairly simple applications that are still useful as standalone products.
    7 Furthermore, webservers are generally amenable to parallelisation since their workloads are mostly homogenous.
    8 They therefore offer a stringent performance benchmark for \CFA.
    9 Indeed existing solutions are likely to have close to optimal performance while the homogeneity of the workloads mean the additional fairness is not needed.
     2The previous chapter demonstrated the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios.
     3The next step is to demonstrate that this performance holds in more realistic and complete scenarios.
     4Therefore, this chapter exercises both \at and I/O scheduling using two flavours of webservers that demonstrate \CFA performs competitively with production environments.
     5
     6Webservers are chosen because they offer fairly simple applications that perform complex I/O, both network and disk, and are useful as standalone products.
     7Furthermore, webservers are generally amenable to parallelization since their workloads are mostly homogeneous.
     8Therefore, webservers offer a stringent performance benchmark for \CFA.
     9Indeed, existing webservers have close to optimal performance, while the homogeneity of the workload means fairness may not be a problem.
     10As such, these experiments should highlight any \CFA fairness cost (overhead) in realistic scenarios.
    1011
    1112\section{Memcached}
    12 Memcached~\cite{memcached} is an in memory key-value store that is used in many production environments, \eg \cite{atikoglu2012workload}.
    13 This also server also has the notable added benefit that there exists a full-featured front-end for performance testing called @mutilate@~\cite{GITHUB:mutilate}.
    14 Experimenting on memcached allows for a simple test of the \CFA runtime as a whole, it will exercise the scheduler, the idle-sleep mechanism, as well the \io subsystem for sockets.
    15 This experiment does not exercise the \io subsytem with regards to disk operations.
     13Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}.
     14In fact, the Memcached server is so popular there exists a full-featured front-end for performance testing, called @mutilate@~\cite{GITHUB:mutilate}.
     15Experimenting on Memcached allows for a simple test of the \CFA runtime as a whole, exercising the scheduler, the idle-sleep mechanism, as well as the \io subsystem for sockets.
     16Note, this experiment does not exercise the \io subsystem with regards to disk operations because Memcached is an in-memory server.
    1617
    1718\subsection{Benchmark Environment}
    18 These experiments are run on a cluster of homogenous Supermicro SYS-6017R-TDF compute nodes with the following characteristics:
     19The Memcached experiments are run on a cluster of homogeneous Supermicro SYS-6017R-TDF compute nodes with the following characteristics.
     20\begin{itemize}
     21\item
    1922The server runs Ubuntu 20.04.3 LTS on top of Linux Kernel 5.11.0-34.
     23\item
    2024Each node has 2 Intel(R) Xeon(R) CPU E5-2620 v2 running at 2.10GHz.
     25\item
    2126These CPUs have 6 cores per CPU and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}.
    22 The cpus each have 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches respectively.
     27\item
     28The CPUs each have 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches respectively.
     29\item
    2330Each node is connected to the network through a Mellanox 10 Gigabit Ethernet port.
    24 The network route uses 1 Mellanox SX1012 10/40 Gigabit Ethernet cluster switch.
    25 
    26 \subsection{Memcached with threads per connection}
    27 Comparing against memcached using a user-level runtime only really make sense if the server actually uses this threading model.
    28 Indeed, evaluating a user-level runtime with 1 \at per \proc is not meaningful since it does not exercise the runtime, it simply adds some overhead to the underlying OS scheduler.
    29 
    30 One approach is to use a webserver that uses a thread-per-connection model, where each incoming connection is served by a single \at in a strict 1-to-1 pairing.
    31 This models adds flexibility to the implementation, as the serving logic can now block on user-level primitives without affecting other connections.
    32 
    33 Memcached is not built according to a thread-per-connection model, but there exists a port of it that is, which was built for libfibre in \cite{DBLP:journals/pomacs/KarstenB20}.
    34 Therefore this version can both be compared to the original version and to a port to the \CFA runtime.
    35 
    36 As such, this memcached experiment compares 3 different varitions of memcached:
    37 \begin{itemize}
    38  \item \emph{vanilla}: the official release of memcached, version~1.6.9.
    39  \item \emph{fibre}: a modification of vanilla which uses the thread per connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}.
     31\item
     32Network routing is performed by a Mellanox SX1012 10/40 Gigabit Ethernet switch.
     33\end{itemize}
     34
     35\subsection{Memcached threading}
     36Memcached can be built to use multiple threads in addition to its @libevent@ subsystem to handle requests.
     37When enabled, the threading implementation operates as follows~\cite{https://docs.oracle.com/cd/E17952_01/mysql-5.6-en/ha-memcached-using-threads.html}:
     38\begin{itemize}
     39\item
     40Threading is handled by wrapping functions within the code to provide basic protection from updating the same global structures at the same time.
     41\item
     42Each thread uses its own instance of @libevent@ to help improve performance.
     43\item
     44TCP/IP connections are handled with a single thread listening on the TCP/IP socket.
     45Each connection is then distributed to one of the active threads on a simple round-robin basis, as sketched after this list.
     46Each connection then operates solely within this thread while the connection remains open.
     47\item
     48For UDP connections, all the threads listen to a single UDP socket for incoming requests.
     49Threads that are not currently dealing with another request ignore the incoming packet.
     50One of the remaining, nonbusy, threads reads the request and sends the response.
     51This implementation can lead to increased CPU load as threads wake from sleep to potentially process the request.
     52\end{itemize}
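To make the TCP dispatch concrete, the following C sketch illustrates the pattern described in the list; it is not Memcached's code. The @worker@, @dispatcher@ and notify-pipe names are hypothetical, each worker's @libevent@ loop is reduced to a comment, and error handling is omitted.
\begin{lstlisting}[language=C]
// Hypothetical sketch (not Memcached's code) of the round-robin dispatch:
// one thread accepts TCP connections and hands each accepted socket to a
// worker over a per-worker notify pipe; the worker owns the connection
// until it is closed.
#include <pthread.h>
#include <unistd.h>
#include <sys/socket.h>

#define NWORKERS 4
static int notify[NWORKERS][2];     // [0] worker read end, [1] dispatcher write end

static void * worker( void * arg ) {
	int id = *(int *)arg;
	for ( ;; ) {
		int conn;
		if ( read( notify[id][0], &conn, sizeof(conn) ) != sizeof(conn) ) break;
		// ... register conn with this worker's libevent instance and serve it ...
	}
	return NULL;
}

static void dispatcher( int listen_fd ) {
	int next = 0;
	for ( ;; ) {
		int conn = accept( listen_fd, NULL, NULL );
		if ( conn < 0 ) continue;                        // error handling elided
		write( notify[next][1], &conn, sizeof(conn) );   // round-robin assignment
		next = (next + 1) % NWORKERS;
	}
}

static void start( int listen_fd ) {
	static int ids[NWORKERS];
	pthread_t tids[NWORKERS];
	for ( int i = 0; i < NWORKERS; i += 1 ) {
		pipe( notify[i] );
		ids[i] = i;
		pthread_create( &tids[i], NULL, worker, &ids[i] );
	}
	dispatcher( listen_fd );                             // never returns
}
\end{lstlisting}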
     53Here, Memcached is based on an event-based webserver architecture~\cite{Pai99Flash}, using \gls{kthrd}ing to run multiple (largely) independent event engines, and if needed, spinning up additional kernel threads to handle blocking I/O.
     54Alternative webserver architectures are:
     55\begin{itemize}
     56\item
     57pipeline~\cite{Welsh01}, where the event engine is subdivided into multiple stages and the stages are connected with asynchronous buffers, where the final stage has multiple threads to handle blocking I/O.
     58\item
     59thread-per-connection~\cite{apache,Behren03}, where each incoming connection is served by a single \at in a strict 1-to-1 pairing, using the thread stack to hold the event state and folding the event engine implicitly into the threading runtime with its nonblocking I/O mechanism.
     60\end{itemize}
     61Both pipelining and thread-per-connection add flexibility to the implementation, as the serving logic can now block without halting the event engine~\cite{Harji12}.
     62
     63However, \gls{kthrd}ing in Memcached is not amenable to this work, which is based on \gls{uthrding}.
     64While it is feasible to layer one user thread per kernel thread, it is not meaningful as it fails to exercise the user runtime;
     65it simply adds extra scheduling overhead over the kernel threading.
     66Hence, there is no direct way to compare Memcached using a kernel-level runtime with a user-level runtime.
     67
     68Fortunately, there exists a recent port of Memcached to \gls{uthrding} based on the libfibre~\cite{DBLP:journals/pomacs/KarstenB20} \gls{uthrding} library.
     69This port did all of the heavy-lifting, making it straightforward to replace the libfibre user-threading with the \gls{uthrding} in \CFA.
     70It is now possible to compare the original kernel-threading Memcached with both user-threading runtimes in libfibre and \CFA.
     71
     72As such, this Memcached experiment compares 3 different variations of Memcached:
     73\begin{itemize}
     74 \item \emph{vanilla}: the official release of Memcached, version~1.6.9.
     75 \item \emph{fibre}: a modification of vanilla using the thread-per-connection model on top of the libfibre runtime.
    4076 \item \emph{cfa}: a modification of the fibre webserver that replaces the libfibre runtime with \CFA.
    4177\end{itemize}
    4278
    4379\subsection{Throughput} \label{memcd:tput}
     80This experiment is done by having the clients establish 15,360 total connections, which persist for the duration of the experiment.
     81The clients then send read and write queries with only 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible.
     82Figure~\ref{fig:memcd:rate:qps} plots the desired client rate, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', against the achieved rate, ``Actual QPS'', for all three webservers.
     83
     84Like the experimental setup in Chapter~\ref{microbench}, each experiment is run 15 times, and for each client rate, the measured webserver rate is plotted.
     85The solid line represents the median while the dashed and dotted lines represent the maximum and minimum respectively.
     86For rates below 500K queries per second, all three webservers match the client rate.
     87Beyond 500K, the webservers cannot match the client rate.
     88During this interval, vanilla Memcached achieves the highest webserver throughput, with libfibre and \CFA delivering slightly lower but very similar throughput.
     89Overall, the performance of all three webservers is very similar, especially considering that at 500K the servers have reached saturation, which is discussed further in the next section.
     90
    4491\begin{figure}
    4592        \centering
    46         \input{result.memcd.rate.qps.pstex_t}
    47         \caption[Memcached Benchmark: Throughput]{Memcached Benchmark: Throughput\smallskip\newline Desired vs Actual request rate for 15360 connections. Target QPS is the request rate that the clients are attempting to maintain and Actual QPS is the rate at which the server is able to respond.}
     93        \resizebox{0.83\linewidth}{!}{\input{result.memcd.rate.qps.pstex_t}}
     94        \caption[Memcached Benchmark: Throughput]{Memcached Benchmark: Throughput\smallskip\newline Desired vs Actual query rate for 15,360 connections. Target QPS is the query rate that the clients are attempting to maintain and Actual QPS is the rate at which the server is able to respond.}
    4895        \label{fig:memcd:rate:qps}
    49 \end{figure}
    50 Figure~\ref{fig:memcd:rate:qps} shows the result for the throughput of all three webservers.
    51 This experiment is done by having the clients establish 15360 total connections, which persist for the duration of the experiments.
    52 The clients then send requests, attempting to follow a desired request rate.
    53 The servers respond to the desired rate as best they can and the difference between desired rate, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', and the actual rate, ``Actual QPS''.
    54 The results show that \CFA achieves equivalent throughput even when the server starts to reach saturation.
    55 Only then does it start to fall behind slightly.
    56 This is a demonstration of the \CFA runtime achieving its performance goal.
    57 
    58 \subsection{Tail Latency}
    59 \begin{figure}
     96%\end{figure}
     97\bigskip
     98%\begin{figure}
    6099        \centering
    61         \input{result.memcd.rate.99th.pstex_t}
    62         \caption[Memcached Benchmark : 99th Percentile Lantency]{Memcached Benchmark : 99th Percentile Lantency\smallskip\newline 99th Percentile of the response latency as a function of \emph{desired} request rate for 15360 connections. }
     100        \resizebox{0.83\linewidth}{!}{\input{result.memcd.rate.99th.pstex_t}}
     101        \caption[Memcached Benchmark: 99th Percentile Latency]{Memcached Benchmark: 99th Percentile Latency\smallskip\newline 99th Percentile of the response latency as a function of \emph{desired} query rate for 15,360 connections.}
    63102        \label{fig:memcd:rate:tail}
    64103\end{figure}
    65 Another important performance metric to look at is \newterm{tail} latency.
    66 Since many web applications rely on a combination of different requests made in parallel, the latency of the slowest response, \ie tail latency, can dictate overall performance.
    67 Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same experiment memcached experiment.
    68 As is expected, the latency starts low and increases as the server gets close to saturation, point at which the latency increses dramatically.
    69 Note that the figure shows \emph{target} request rate, the actual response rate is given in Figure~\ref{fig:memcd:rate:qps} as this is the same underlying experiment.
     104
     105\subsection{Tail Latency}
     106Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie whether some requests wait longer than others for service.
     107Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception.
     108Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same Memcached experiment.
     109
     110Again, each experiment is run 15 times with the median, maximum and minimum plotted with different lines.
     111As expected, the latency starts low and increases as the server gets close to saturation, at which point the latency increases dramatically because the webservers cannot keep up with the connection rate, so client requests are disproportionately delayed.
     112Because of this dramatic increase, the Y axis is presented using a log scale.
     113Note that the graph shows \emph{target} query rate, the actual response rate is given in Figure~\ref{fig:memcd:rate:qps} as this is the same underlying experiment.
     114
     115For all three servers, the saturation point is reached before 500K queries per second, which is when throughput starts to decline among the webservers.
     116In this experiment, all three webservers are much more distinguishable than in the throughput experiment.
     117Vanilla Memcached achieves the lowest latency until 600K, after which all the webservers are struggling to respond to client requests.
     118\CFA begins to decline at 600K, indicating some bottleneck after saturation.
     119Overall, all three webservers achieve micro-second latencies and the increases in latency mostly follow each other.
    70120
    71121\subsection{Update rate}
     122Since Memcached is effectively a simple database, the information that is cached can be written to concurrently by multiple queries.
     123Since writes can significantly affect performance, it is interesting to see how varying the update rate affects the results.
     124Figure~\ref{fig:memcd:updt} shows the results for the same experiment as the throughput and latency experiment but increasing the update percentage to 5\%, 10\% and 50\%, respectively, versus the original 3\% update percentage.
     125
    72126\begin{figure}
    73         \centering
    74         \subfloat[][Throughput]{
    75                 \input{result.memcd.forall.qps.pstex_t}
    76         }
    77 
    78         \subfloat[][Latency]{
    79                 \input{result.memcd.forall.lat.pstex_t}
    80         }
    81         \caption[forall Latency results at different update rates]{forall Latency results at different update rates\smallskip\newline Description}
    82         \label{fig:memcd:updt:forall}
     127        \subfloat[][\CFA: Throughput]{
     128                \resizebox{0.5\linewidth}{!}{
     129                        \input{result.memcd.forall.qps.pstex_t}
     130                }
     131                \label{fig:memcd:updt:forall:qps}
     132        }
     133        \subfloat[][\CFA: Latency]{
     134                \resizebox{0.5\linewidth}{!}{
     135                        \input{result.memcd.forall.lat.pstex_t}
     136                }
     137                \label{fig:memcd:updt:forall:lat}
     138        }
     139
     140        \subfloat[][LibFibre: Throughput]{
     141                \resizebox{0.5\linewidth}{!}{
     142                        \input{result.memcd.fibre.qps.pstex_t}
     143                }
     144                \label{fig:memcd:updt:fibre:qps}
     145        }
     146        \subfloat[][LibFibre: Latency]{
     147                \resizebox{0.5\linewidth}{!}{
     148                        \input{result.memcd.fibre.lat.pstex_t}
     149                }
     150                \label{fig:memcd:updt:fibre:lat}
     151        }
     152
     153        \subfloat[][Vanilla: Throughput]{
     154                \resizebox{0.5\linewidth}{!}{
     155                        \input{result.memcd.vanilla.qps.pstex_t}
     156                }
     157                \label{fig:memcd:updt:vanilla:qps}
     158        }
     159        \subfloat[][Vanilla: Latency]{
     160                \resizebox{0.5\linewidth}{!}{
     161                        \input{result.memcd.vanilla.lat.pstex_t}
     162                }
     163                \label{fig:memcd:updt:vanilla:lat}
     164        }
     165        \caption[Throughput and Latency results at different update rates (percentage of writes).]{Throughput and Latency results at different update rates (percentage of writes).\smallskip\newline Description}
     166        \label{fig:memcd:updt}
    83167\end{figure}
    84168
    85 \begin{figure}
    86         \centering
    87         \subfloat[][Throughput]{
    88                 \input{result.memcd.fibre.qps.pstex_t}
    89         }
    90 
    91         \subfloat[][Latency]{
    92                 \input{result.memcd.fibre.lat.pstex_t}
    93         }
    94         \caption[fibre Latency results at different update rates]{fibre Latency results at different update rates\smallskip\newline Description}
    95         \label{fig:memcd:updt:fibre}
    96 \end{figure}
    97 
    98 \begin{figure}
    99         \centering
    100         \subfloat[][Throughput]{
    101                 \input{result.memcd.vanilla.qps.pstex_t}
    102         }
    103 
    104         \subfloat[][Latency]{
    105                 \input{result.memcd.vanilla.lat.pstex_t}
    106         }
    107         \caption[vanilla Latency results at different update rates]{vanilla Latency results at different update rates\smallskip\newline Description}
    108         \label{fig:memcd:updt:vanilla}
    109 \end{figure}
    110 
    111 
     169In the end, this experiment mostly demonstrates that the performance of Memcached is affected very little by the update rate.
     170Indeed, since values read/written can be bigger than what can be read/written atomically, a lock must be acquired while the value is read.
     171Hence, I believe the underlying locking pattern for reads and writes is fairly similar, if not the same.
     172These results suggest Memcached does not attempt to optimize reads/writes using a readers-writer lock to protect each value and instead just relies on having a sufficient number of keys to limit contention.
     173Regardless, the update experiment shows that \CFA achieves equivalent performance.
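To make the locking pattern under discussion concrete, the following C sketch shows a per-item mutex protecting a multi-word value; the item layout and function names are hypothetical and this is not Memcached's implementation.
\begin{lstlisting}[language=C]
#include <pthread.h>
#include <string.h>

// Hypothetical cache item: the value spans multiple words, so it cannot be
// read or written atomically and must be protected by a lock.
typedef struct {
	pthread_mutex_t lock;
	size_t len;
	char value[1024];
} item_t;

// With a plain mutex, the read path and the write path acquire the same
// exclusive lock, so a 3% and a 50% update mix exercise the lock identically.
size_t item_read( item_t * it, char * buf, size_t max ) {
	pthread_mutex_lock( &it->lock );
	size_t n = it->len < max ? it->len : max;
	memcpy( buf, it->value, n );
	pthread_mutex_unlock( &it->lock );
	return n;
}

void item_write( item_t * it, const char * buf, size_t n ) {
	if ( n > sizeof(it->value) ) n = sizeof(it->value);  // clamp to the item buffer
	pthread_mutex_lock( &it->lock );
	memcpy( it->value, buf, n );
	it->len = n;
	pthread_mutex_unlock( &it->lock );
}
\end{lstlisting}
Replacing the mutex with a @pthread_rwlock_t@, taking @pthread_rwlock_rdlock@ in the read path and @pthread_rwlock_wrlock@ in the write path, would allow concurrent readers; the results above suggest Memcached forgoes such an optimization and instead relies on the large key space to keep per-item contention low.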
    112174
    113175\section{Static Web-Server}
    114 The memcached experiment has two aspects of the \io subsystem it does not exercise, accepting new connections and interacting with disks.
    115 On the other hand, static webservers, servers that offer static webpages, do stress disk \io since they serve files from disk\footnote{Dynamic webservers, which construct pages as they are sent, are not as interesting since the construction of the pages do not exercise the runtime in a meaningfully different way.}.
    116 The static webserver experiments will compare NGINX~\cit{nginx} with a custom webserver developped for this experiment.
     176The Memcached experiment does not exercise two key aspects of the \io subsystem: accept\-ing new connections and interacting with disks.
     177On the other hand, a webserver servicing static web-pages does stress both accepting connections and disk \io by accepting tens of thousands of client requests per second where these requests return static data serviced from the file-system cache or disk.\footnote{
     178Webservers servicing dynamic requests, which read from multiple locations and construct a response, are not as interesting since creating the response takes more time and does not exercise the runtime in a meaningfully different way.}
     179The static webserver experiment compares NGINX~\cite{nginx} with a custom \CFA-based webserver developed for this experiment.
    117180
    118181\subsection{\CFA webserver}
    119 Unlike the memcached experiment, the webserver experiment relies on a custom designed webserver.
    120 It is a simple thread-per-connection webserver where a fixed number of \ats are created upfront.
    121 Each of the \at calls @accept@, through @io_uring@, on the listening port and handle the incomming connection once accepted.
    122 Most of the implementation is fairly straight forward however the inclusion of file \io introduces a new challenge that had to be hacked around.
    123 
    124 Normally, webservers use @sendfile@\cite{MAN:sendfile} to send files over the socket.
    125 @io_uring@ does not support @sendfile@, it supports @splice@\cite{MAN:splice} instead, which is strictly more powerful.
    126 However, because of how linux implements file \io, see Subsection~\ref{ononblock}, @io_uring@'s implementation must delegate calls to splice to worker threads inside the kernel.
    127 As of Linux 5.13, @io_uring@ caps the numer of these worker threads to @RLIMIT_NPROC@ and therefore, when tens of thousands of splice requests are made, it can create tens of thousands of \glspl{kthrd}.
    128 Such a high number of \glspl{kthrd} is more than Linux can handle in this scenario so performance suffers significantly.
    129 For this reason, the \CFA webserver calls @sendfile@ directly.
    130 This approach works up to a certain point, but once the server approaches saturation, it leads to a new problem.
    131 
    132 When the saturation point of the server is attained, latency will increase and inevitably some client connections will timeout.
    133 As these clients close there connections, the server must close these sockets without delay so the OS can reclaim the resources used by these connections.
    134 Indeed, until they are closed on the server end, the connection will linger in the CLOSE-WAIT tcp state~\cite{rfc:tcp} and the tcp buffers will be preserved.
    135 However, this poses a problem using blocking @sendfile@ calls.
    136 The calls can block if they do not have suffcient memory, which can be caused by having too many connections in the CLOSE-WAIT state.
    137 Since blocking in calls to @sendfile@ blocks the \proc rather than the \at, this prevents other connections from closing their sockets.
    138 This leads to a vicious cycle where timeouts lead to @sendfile@ calls running out of resources, which lead to more timeouts.
    139 
    140 Normally, this is address by marking the sockets as non-blocking and using @epoll@ to wait for sockets to have sufficient resources.
    141 However, since @io_uring@ respects non-blocking semantics marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely.
     182The \CFA webserver is a straightforward thread-per-connection webserver, where a fixed number of \ats are created upfront (tuning parameter).
     183Each \at calls @accept@, through @io_uring@, on the listening port and handles the incoming connection once accepted.
     184Most of the implementation is fairly straightforward;
     185however, the inclusion of file \io exposed an @io_uring@ problem that required an unfortunate workaround.
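As a point of reference, the following C sketch shows the thread-per-connection structure just described; it is not the \CFA webserver's code. In the real server, the threads are \CFA \ats and the blocking calls are routed through @io_uring@ by the runtime, so the sketch uses ordinary blocking socket calls and a hypothetical @handle_connection@ routine.
\begin{lstlisting}[language=C]
#include <pthread.h>
#include <unistd.h>
#include <sys/socket.h>

#define NTHREADS 64                          // tuning parameter: number of server threads

extern void handle_connection( int conn );   // hypothetical: read request, send reply/file

// Each server thread loops forever: accept one connection, serve it to
// completion, close it, and go back to accepting.  In the CFA server these
// are user threads and accept/read/write are submitted through io_uring.
static void * acceptor( void * arg ) {
	int listen_fd = *(int *)arg;
	for ( ;; ) {
		int conn = accept( listen_fd, NULL, NULL );
		if ( conn < 0 ) continue;            // error handling elided
		handle_connection( conn );
		close( conn );
	}
	return NULL;
}

static void start_server( int listen_fd ) {
	static pthread_t tids[NTHREADS];
	for ( int i = 0; i < NTHREADS; i += 1 )
		pthread_create( &tids[i], NULL, acceptor, &listen_fd );
}
\end{lstlisting}
The important property is that each connection is serviced by a single thread from accept to close, holding the connection state on that thread's stack.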
     186
     187Normally, webservers use @sendfile@~\cite{MAN:sendfile} to send files over a socket because it performs a direct move in the kernel from the file-system cache to the NIC, eliminating reading/writing the file into the webserver.
     188While @io_uring@ does not support @sendfile@, it does support @splice@~\cite{MAN:splice}, which is strictly more powerful.
     189However, because of how Linux implements file \io, see Subsection~\ref{ononblock}, @io_uring@ must delegate splice calls to worker threads inside the kernel.
     190As of Linux 5.13, @io_uring@ had no mechanism to restrict the number of worker threads, and therefore, when tens of thousands of splice requests are made, it correspondingly creates tens of thousands of internal \glspl{kthrd}.
     191Such a high number of \glspl{kthrd} slows Linux significantly.
     192Rather than abandon the experiment, the \CFA webserver was switched to nonblocking @sendfile@.
     193However, when the nonblocking @sendfile@ returns @EAGAIN@, the \CFA server cannot block the \at because its I/O subsystem uses @io_uring@.
     194Therefore, the \at must spin performing the @sendfile@ and yield if the call returns @EAGAIN@.
     195This workaround works up to the saturation point, when other problems occur.
     196
     197At saturation, latency increases so some client connections timeout.
     198As these clients close their connection, the server must close its corresponding side without delay so the OS can reclaim the resources used by these connections.
     199Indeed, until the server connection is closed, the connection lingers in the CLOSE-WAIT TCP state~\cite{rfc:tcp} and the TCP buffers are preserved.
     200However, this poses a problem using nonblocking @sendfile@ calls:
     201the call can still block if there is insufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.\footnote{
     202\lstinline{sendfile} can always block even in nonblocking mode if the file to be sent is not in the file-system cache, because Linux does not provide nonblocking disk I/O.}
     203When @sendfile@ blocks, the \proc rather than the \at blocks, preventing other connections from closing their sockets.
     205This effect results in a feedback loop where more timeouts lead to more @sendfile@ calls running out of resources, which in turn causes more timeouts.
     205
     206Normally, this is addressed by using @select@/@epoll@ to wait for sockets to have sufficient resources.
     207However, since @io_uring@ respects nonblocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely.
    142208For this reason, the \CFA webserver sets and resets the @O_NONBLOCK@ flag before and after any calls to @sendfile@.
    143209Normally @epoll@ would also be used when these calls to @sendfile@ return @EAGAIN@, but since this would not help in the evaluation of the \CFA runtime, the \CFA webserver simply yields and retries in these cases.
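The following C sketch shows the essence of this workaround; it is illustrative rather than the server's actual code, and @this_thread_yield@ is a hypothetical stand-in for the runtime's yield.
\begin{lstlisting}[language=C]
#include <fcntl.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/sendfile.h>

extern void this_thread_yield( void );   // hypothetical stand-in for the runtime's yield

// Send 'count' bytes of 'file_fd', starting at 'offset', over 'sock'.  The
// socket is temporarily marked nonblocking so a full socket buffer returns
// EAGAIN instead of blocking the processor; the user thread then yields and
// retries the call.
static int send_file_nonblock( int sock, int file_fd, off_t offset, size_t count ) {
	int flags = fcntl( sock, F_GETFL );
	fcntl( sock, F_SETFL, flags | O_NONBLOCK );    // set nonblocking
	while ( count > 0 ) {
		ssize_t sent = sendfile( sock, file_fd, &offset, count );
		if ( sent > 0 ) { count -= sent; continue; }
		if ( sent < 0 && errno == EAGAIN ) {       // no socket-buffer space: yield and retry
			this_thread_yield();
			continue;
		}
		break;                                     // real error or EOF: handling elided
	}
	fcntl( sock, F_SETFL, flags );                 // reset the original flags
	return count == 0 ? 0 : -1;
}
\end{lstlisting}
As the footnote above notes, even with @O_NONBLOCK@ set, @sendfile@ can still block when the file is not in the file-system cache, because Linux provides no nonblocking disk read; the workaround only avoids blocking on a full socket buffer.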
    144210
    145 It is important to state that in Linux 5.15 @io_uring@ introduces the ability for users to limit the number of worker threads that are created, through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
     211Interestingly, Linux 5.15 @io_uring@ introduces the ability to limit the number of worker threads that are created, through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
    146212However, as of writing this document, Ubuntu does not have a stable release of Linux 5.15.
    147213There exist versions of the kernel currently under testing, but these caused unrelated but nevertheless prohibitive issues in this experiment.
     
    150216
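For completeness, the following sketch shows how such a limit could be applied on a 5.15 kernel; it was not used in these experiments. It assumes a liburing release that exposes @io_uring_register_iowq_max_workers@; otherwise the same registration can be made through the raw @io_uring_register@ system call.
\begin{lstlisting}[language=C]
#include <liburing.h>

// Cap the number of kernel worker threads io_uring may spawn for this ring
// (Linux 5.15 and later).  values[0] limits workers for bounded work (e.g.
// regular-file I/O) and values[1] limits workers for unbounded work.
static int cap_iowq_workers( struct io_uring * ring, unsigned bounded, unsigned unbounded ) {
	unsigned values[2] = { bounded, unbounded };
	return io_uring_register_iowq_max_workers( ring, values );
}
\end{lstlisting}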
    151217\subsection{Benchmark Environment}
    152 Unlike the memcached experiment, the webserver run on a more heterogenous environment.
     218Unlike the Memcached experiment, the webserver experiment is run in a heterogeneous environment.
     219\begin{itemize}
     220\item
    153221The server runs Ubuntu 20.04.4 LTS on top of Linux Kernel 5.13.0-52.
    154 It has an AMD Opteron(tm) Processor 6380 running at 2.50GHz.
    155 These CPUs has only 8 \glspl{hthrd} enabled by grub, which is sufficient to achieve line rate.
    156 This cpus each have 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches respectively.
    157 The kernel is setup to limit the memory at 25Gb.
    158 
    159 The client machines each have two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards.
    160 Each client machine runs two copies of the workload generator.
    161 They run a 2.6.11-1 SMP Linux kernel, which permits each client load-generator to run on a separate CPU.
    162 Since the clients outnumber the server 8-to-1, this is plenty sufficient to generate enough load for the clients not to become the bottleneck.
    163 
     222\item
     223It has an AMD Opteron(tm) Processor 6380 running at 2.5GHz.
     224\item
     225Each CPU has 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches respectively.
     226\item
     227The computer is booted with only 8 CPUs enabled, which is sufficient to achieve line rate.
     228\item
     229The computer is booted with only 25GB of memory to restrict the file-system cache.
     230\end{itemize}
     231There are 8 client machines.
     232\begin{itemize}
     233\item
     234A client runs a 2.6.11-1 SMP Linux kernel, which permits each client load-generator to run on a separate CPU.
     235\item
     236It has two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards.
     237\item
    164238\todo{switch}
     239\item
     240A client machine runs two copies of the workload generator.
     241\end{itemize}
     242The clients and network are sufficiently provisioned to drive the server to saturation and beyond.
     243Hence, any server effects are attributable solely to the runtime system and webserver.
     244Finally, without restricting the server hardware resources, it is impossible to determine if a runtime system or the webserver using it has any specific design restrictions, \eg using space to reduce time.
     245Trying to determine these restrictions with large numbers of processors or amounts of memory simply means running equally large experiments, which take longer and are harder to set up.
    165246
    166247\subsection{Throughput}
    167 To measure the throughput of both webservers, each server is loaded with over 30,000 files making over 4.5 Gigabytes in total.
    168 Each client runs httperf~\cit{httperf} which establishes a connection, does an http request for one or more files, closes the connection and repeats the process.
    169 The connections and requests are made according to a Zipfian distribution~\cite{zipf}.
     248To measure webserver throughput, the server computer is loaded with 21,600 files, sharded across 650 directories, occupying about 2.2GB of disk, distributed over the server's 4-drive RAID-5 array to achieve high throughput for disk I/O.
     249The clients run httperf~\cite{httperf} to request a set of static files.
     250The httperf load-generator is used with session files to simulate a large number of users and to implement a partially open-loop system.
     251This permits httperf to produce overload conditions, generate multiple requests from persistent HTTP/1.1 connections, and include both active and inactive off periods to model browser processing times and user think times~\cite{Barford98}.
     252
     253The experiments are run with 16 clients, each running a copy of httperf (one copy per CPU), requiring a set of 16 log files with requests conforming to a Zipf distribution.
     254This distribution is representative of users accessing static data through a web-browser.
     255Each client reads a file name from its trace, establishes a connection, performs an HTTP get-request for the file name, receives the file data, closes the connection, and repeats the process.
     256Some trace elements have multiple file names that are read across a persistent connection.
     257A client times out if the server does not complete a request within 10 seconds.
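Since the request mix follows a Zipf distribution, the following C sketch shows one standard way such a trace can be generated, by inverse-transform sampling over precomputed cumulative Zipf probabilities; it illustrates the distribution only and is not the tool used to build the trace files.
\begin{lstlisting}[language=C]
#include <stdlib.h>
#include <math.h>

// Zipf sampler over 'n' ranks with exponent 's' (s = 1 is the classic case):
// precompute the cumulative probabilities once, then map a uniform random
// number to a rank by binary search (inverse-transform sampling).
typedef struct { int n; double * cdf; } zipf_t;

static zipf_t zipf_init( int n, double s ) {
	zipf_t z = { n, malloc( n * sizeof(double) ) };
	double norm = 0.0;
	for ( int k = 1; k <= n; k += 1 ) norm += 1.0 / pow( k, s );
	double cum = 0.0;
	for ( int k = 1; k <= n; k += 1 ) {
		cum += (1.0 / pow( k, s )) / norm;
		z.cdf[k - 1] = cum;
	}
	return z;
}

static int zipf_sample( const zipf_t * z ) {      // returns a rank in [1, n]
	double u = (double)rand() / RAND_MAX;
	int lo = 0, hi = z->n - 1;
	while ( lo < hi ) {
		int mid = (lo + hi) / 2;
		if ( z->cdf[mid] < u ) lo = mid + 1; else hi = mid;
	}
	return lo + 1;                                // rank 1 is the most popular file
}
\end{lstlisting}
A trace generator can then map each sampled rank to a file name, so the most popular files dominate the request stream.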
     258
     259An experiment consists of running a server with request rates ranging from 10,000 to 70,000 requests per second;
     260each rate takes about 5 minutes to complete.
     261There is a 20-second idle period between rates and between experiments to allow connections in the TIME-WAIT state to clear.
     262Server throughput is measured both at peak and after saturation (\ie after peak).
     263Peak indicates the level of client requests the server can handle and after peak indicates if a server degrades gracefully.
    170264Throughput is measured by aggregating the results from httperf of all the clients.
     265
     266Two workload scenarios are created by reconfiguring the server with different amounts of memory: 4 GB and 2 GB.
     267The two workloads correspond to in-memory (4 GB) and disk-I/O (2 GB).
     268Due to the Zipf distribution, only a small amount of memory is needed to service a significant percentage of requests.
     269Table~\ref{t:CumulativeMemory} shows the cumulative memory required to satisfy the specified percentage of requests; \eg 95\% of the requests come from 126.5 MB of the file set and 95\% of the requests are for files less than or equal to 51,200 bytes.
     270Interestingly, with 2 GB of memory, significant disk-I/O occurs.
     271
     272\begin{table}
     273\caption{Cumulative memory for requests by file size}
     274\label{t:CumulativeMemory}
     275\begin{tabular}{r|rrrrrrrr}
     276\% Requests   & 10 & 30 & 50 & 70 & 80 & 90 & \textbf{95} & 100 \\
     277Memory (MB)   & 0.5 & 1.5 & 8.4 & 12.2 & 20.1 & 94.3 & \textbf{126.5} & 2,291.6 \\
     278File Size (B) & 409 & 716 & 4,096 & 5,120 & 7,168 & 40,960 & \textbf{51,200} & 921,600
     279\end{tabular}
     280\end{table}
     281
     282Figure~\ref{fig:swbsrv} shows the results comparing \CFA to NGINX in terms of throughput.
     283These results are fairly straightforward.
     284Both servers achieve the same throughput until around 57,500 requests per second.
     285Since the clients are asking for the same files, the fact that the throughput matches exactly is expected as long as both servers are able to serve the desired rate.
     286Once the saturation point is reached, both servers are still very close.
     287NGINX achieves slightly better throughput.
     288However, Figure~\ref{fig:swbsrv:err} shows the rate of errors, a gross approximation of tail latency, where \CFA achieves notably fewer errors once the machine reaches saturation.
     289This suggests that \CFA is slightly more fair and that NGINX may slightly sacrifice some fairness for improved throughput.
     290It demonstrates that the \CFA webserver described above is able to match the performance of NGINX up to and beyond the saturation point of the machine.
     291
    171292\begin{figure}
    172293        \subfloat[][Throughput]{
    173                 \input{result.swbsrv.25gb.pstex_t}
     294                \resizebox{0.85\linewidth}{!}{\input{result.swbsrv.25gb.pstex_t}}
    174295                \label{fig:swbsrv:ops}
    175296        }
    176297
    177298        \subfloat[][Rate of Errors]{
    178                 \input{result.swbsrv.25gb.err.pstex_t}
     299                \resizebox{0.85\linewidth}{!}{\input{result.swbsrv.25gb.err.pstex_t}}
    179300                \label{fig:swbsrv:err}
    180301        }
     
    182303        \label{fig:swbsrv}
    183304\end{figure}
    184 Figure~\ref{fig:swbsrv} shows the results comparing \CFA to NGINX in terms of throughput.
    185 These results are fairly straight forward.
    186 Both servers achieve the same throughput until around 57,500 requests per seconds.
    187 Since the clients are asking for the same files, the fact that the throughput matches exactly is expected as long as both servers are able to serve the desired rate.
    188 Once the saturation point is reached, both servers are still very close.
    189 NGINX achieves slightly better throughtput.
    190 However, Figure~\ref{fig:swbsrv:err} shows the rate of errors, a gross approximation of tail latency, where \CFA achives notably fewet errors once the machine reaches saturation.
    191 This suggest that \CFA is slightly more fair and NGINX may sloghtly sacrifice some fairness for improved throughtput.
    192 It demonstrate that the \CFA webserver described above is able to match the performance of NGINX up-to and beyond the saturation point of the machine.
    193305
    194306\subsection{Disk Operations}
    195 The throughput was made using a server with 25gb of memory, this was sufficient to hold the entire fileset in addition to all the code and data needed to run the webserver and the reste of the machine.
     307The throughput experiments used a server with 25 GB of memory, which is sufficient to hold the entire file set in addition to all the code and data needed to run the webserver and the rest of the machine.
    196308Previous work like \cit{Cite Ashif's stuff} demonstrates that an interesting follow-up experiment is to rerun the same throughput experiment but with significantly less memory on the machine.
    197309If the machine is constrained enough, it will force the OS to evict files from the file cache and cause calls to @sendfile@ to have to read from disk.
    198310However, what these low-memory experiments demonstrate is how the memory footprint of the webserver affects performance.
    199 However, since what I am to evaluate in this thesis is the runtime of \CFA, I diceded to forgo experiments on low memory server.
     311However, since what I aim to evaluate in this thesis is the \CFA runtime, I decided to forgo experiments on a low-memory server.
    200312The implementation of the webserver itself is simply too impactful to be an interesting evaluation of the underlying runtime.