Changeset 511a9368


Timestamp:
Aug 5, 2022, 12:49:49 PM (2 years ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, ast-experimental, master, pthread-emulation
Children:
8040286, 878be178
Parents:
1c4f063
Message:

Filled in eval section for existing results.
Except update ratio which will be redone.

File:
1 edited

  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

    r1c4f063 r511a9368  
    1515Experimenting on memcached allows for a simple test of the \CFA runtime as a whole: it exercises the scheduler, the idle-sleep mechanism, as well as the \io subsystem for sockets.
    1616This experiment does not exercise the \io subsystem with regard to disk operations.
    17 The experiments compare 3 different varitions of memcached:
    18 \begin{itemize}
    19  \item \emph{vanilla}: the official release of memcached, version~1.6.9.
    20  \item \emph{fibre}: a modification of vanilla which uses the thread per connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}.
    21  \item \emph{cfa}: a modification of the fibre webserver that replaces the libfibre runtime with \CFA.
    22 \end{itemize}
    2317
    2418\subsection{Benchmark Environment}
     
    3125The network route uses a single Mellanox SX1012 10/40 Gigabit Ethernet cluster switch.
    3226
    33 \subsection{Throughput}
     27\subsection{Memcached with threads per connection}
     28Comparing against memcached using a user-level runtime only really makes sense if the server actually uses this threading model.
     29Indeed, evaluating a user-level runtime with 1 \at per \proc is not meaningful, since it does not exercise the runtime; it simply adds some overhead to the underlying OS scheduler.
     30
     31One approach is to use a server built on a thread-per-connection model, where each incoming connection is served by a single \at in a strict 1-to-1 pairing.
     32This model adds flexibility to the implementation, as the serving logic can now block on user-level primitives without affecting other connections.
     33
     34Memcached is not built according to a thread-per-connection model, but such a port exists, built for libfibre in \cite{DBLP:journals/pomacs/KarstenB20}.
     35Therefore, this version can be compared both to the original version and to a port to the \CFA runtime.
     36
     37As such, this memcached experiment compares 3 different variations of memcached:
     38\begin{itemize}
     39 \item \emph{vanilla}: the official release of memcached, version~1.6.9.
     40 \item \emph{fibre}: a modification of vanilla which uses the thread-per-connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}.
     41 \item \emph{cfa}: a modification of the fibre memcached that replaces the libfibre runtime with \CFA.
     42\end{itemize}
     43
     44\subsection{Throughput} \label{memcd:tput}
    3445\begin{figure}
    3546        \centering
     
    8192The static webserver experiments compare NGINX with a custom webserver developed for this experiment.
    8293
     94\subsection{\CFA webserver}
     95Unlike the memcached experiment, the webserver experiment relies on a custom-designed webserver.
     96It is a simple thread-per-connection webserver where a fixed number of \ats are created upfront.
     97Each of these \ats calls @accept@, through @io_uring@, on the listening port and handles the incoming connection once accepted.
     98Most of the implementation is fairly straightforward; however, the inclusion of file \io introduces a new challenge that had to be worked around.
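
To make this structure concrete, the following is a minimal sketch of the thread-per-connection accept loop, written in plain C with pthreads and blocking sockets rather than \CFA \ats and @io_uring@; the worker count, port, and request handling are placeholders.

/* Sketch only: plain-C stand-in for the \CFA thread-per-connection server.
 * A fixed number of worker threads is created upfront; each repeatedly
 * accepts a connection on the shared listening socket and serves it. */
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define NWORKERS 64                 /* placeholder: fixed number of serving threads */

static int listen_fd;

static void * worker(void * arg) {
    (void)arg;
    for (;;) {
        int conn = accept(listen_fd, NULL, NULL);   /* block until a connection arrives */
        if (conn < 0) continue;
        /* ... read the request and send the response here ... */
        close(conn);                /* strict 1-to-1: this thread owns the connection */
    }
    return NULL;
}

int main(void) {
    listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);    /* placeholder port */
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, SOMAXCONN);

    pthread_t tids[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}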
     99
     100Normally, webservers use @sendfile@\cit{sendfile} to send files over the socket.
     101@io_uring@ does not support @sendfile@; it supports @splice@\cit{splice} instead, which is strictly more powerful.
     102However, because of how Linux implements file \io (see Subsection~\ref{ononblock}), @io_uring@'s implementation must delegate calls to @splice@ to worker threads inside the kernel.
     103As of Linux 5.13, @io_uring@ caps the number of these worker threads at @RLIMIT_NPROC@ and therefore, when tens of thousands of @splice@ requests are made, it can create tens of thousands of \glspl{kthrd}.
     104Such a high number of \glspl{kthrd} is more than Linux can handle in this scenario, so performance suffers significantly.
     105For this reason, the \CFA webserver calls @sendfile@ directly.
     106This approach works up to a certain point, but once the server approaches saturation, it leads to a new problem.
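
As a rough illustration of this direct approach, the following is a minimal sketch, again in plain C rather than \CFA, of serving a file over a connected socket with a blocking @sendfile@ call; the helper name and the absence of error handling are for brevity only.

/* Sketch only: send a static file over a connected socket using sendfile(2),
 * as the \CFA webserver does instead of io_uring + splice.
 * 'conn' is a connected socket; error handling is trimmed. */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

static int send_static_file(int conn, const char * path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    fstat(fd, &st);
    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t sent = sendfile(conn, fd, &offset, st.st_size - offset);
        if (sent <= 0) break;       /* a blocking sendfile can stall here under memory pressure */
    }
    close(fd);
    return offset == st.st_size ? 0 : -1;
}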
     107
     108When the saturation point of the server is attained, latency increases and inevitably some client connections time out.
     109As these clients close their connections, the server must close its end of these sockets without delay so the OS can reclaim the resources used by these connections.
     110Indeed, until it is closed on the server end, a connection lingers in the CLOSE-WAIT TCP state~\cit{RFC793} and its TCP buffers are preserved.
     111However, this poses a problem when using blocking @sendfile@ calls.
     112The calls can block if they do not have sufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.
     113Since blocking in calls to @sendfile@ blocks the \proc rather than the \at, this prevents other connections from closing their sockets.
     114This leads to a vicious cycle where timeouts lead to @sendfile@ calls running out of resources, which leads to more timeouts.
     115
     116Normally, this is addressed by marking the sockets as non-blocking and using @epoll@ to wait for sockets to have sufficient resources.
     117However, since @io_uring@ respects non-blocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely.
     118For this reason, the \CFA webserver sets and resets the @O_NONBLOCK@ flag before and after any calls to @sendfile@.
     119Normally @epoll@ would also be used when these calls to @sendfile@ return @EAGAIN@, but since this would not help in the evaluation of the \CFA runtime, the \CFA webserver simply yields and retries in these cases.
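
A minimal sketch of this workaround, once more in plain C, is shown below; @sched_yield@ stands in for the user-level yield performed by the \CFA webserver and is only a placeholder for that primitive.

/* Sketch only: temporarily mark the socket non-blocking around sendfile(2),
 * and on EAGAIN yield and retry instead of falling back to epoll. */
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/sendfile.h>

static ssize_t sendfile_yield_retry(int conn, int fd, off_t * offset, size_t count) {
    int flags = fcntl(conn, F_GETFL, 0);
    fcntl(conn, F_SETFL, flags | O_NONBLOCK);   /* set O_NONBLOCK before the call */
    ssize_t ret;
    for (;;) {
        ret = sendfile(conn, fd, offset, count);
        if (ret >= 0 || errno != EAGAIN) break;
        sched_yield();                          /* placeholder for a user-level yield, then retry */
    }
    fcntl(conn, F_SETFL, flags);                /* reset the original flags afterwards */
    return ret;
}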
     120
     121It is important to note that in Linux 5.15, @io_uring@ introduces the ability for users to limit the number of worker threads that are created, through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
     122However, as of writing this document, Ubuntu does not have a stable release of Linux 5.15.
     123There exist versions of the kernel that are currently under testing, but these caused unrelated but nevertheless prohibitive issues in this experiment.
     124Presumably, the new kernel would remove the need for the hack described above, as it would allow connections in the CLOSE-WAIT state to be closed even while the calls to @splice@/@sendfile@ are underway.
     125However, since this could not be tested, this is purely a conjecture at this point.
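
For reference, and assuming liburing~2.1 or newer on such a kernel, capping the worker count would presumably look something like the sketch below; the chosen limits are placeholders and this could not be validated on the experimental setup.

/* Sketch only (liburing >= 2.1, Linux >= 5.15): cap the number of io_uring
 * worker threads.  values[0] limits workers for bounded (e.g. file) work,
 * values[1] for unbounded work; a value of 0 leaves that limit unchanged. */
#include <liburing.h>

static int cap_iowq_workers(struct io_uring * ring, unsigned bounded, unsigned unbounded) {
    unsigned values[2] = { bounded, unbounded };
    /* wraps io_uring_register(fd, IORING_REGISTER_IOWQ_MAX_WORKERS, values, 2) */
    return io_uring_register_iowq_max_workers(ring, values);
}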
     126
    83127\subsection{Benchmark Environment}
    84128Unlike the memcached experiment, the webserver runs in a more heterogeneous environment.
     
    87131These CPUs have only 8 \glspl{hthrd} enabled by grub, which is sufficient to achieve line rate.
    88132These CPUs each have 64 KB, 256 KiB and 8 MB of L1, L2 and L3 cache respectively.
     133The kernel is set up to limit the memory to 25~GB.
    89134
    90135The client machines each have two 2.8 GHz Xeon CPUs and four one-gigabit Ethernet cards.
     
    95140\todo{switch}
    96141
    97 
    98 
    99142\subsection{Throughput}
    100143\begin{figure}
    101         \centering
    102         \input{result.swbsrv.25gb.pstex_t}
    103         \caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline }
     144        \subfloat[][Throughput]{
     145                \input{result.swbsrv.25gb.pstex_t}
     146                \label{fig:swbsrv:ops}
     147        }
     148
     149        \subfloat[][Rate of Errors]{
     150                \input{result.swbsrv.25gb.err.pstex_t}
     151                \label{fig:swbsrv:err}
     152        }
     153        \caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline Throughput vs request rate for short-lived connections.}
    104154        \label{fig:swbsrv}
    105155\end{figure}
    106 
    107 Networked ZIPF
    108 
    109 Nginx : 5Gb still good, 4Gb starts to suffer
    110 
    111 Cforall : 10Gb too high, 4 Gb too low
     156Figure~\ref{fig:swbsrv} shows the results comparing \CFA to NGINX in terms of throughput.
     157It demonstrates that the \CFA webserver described above is able to match the performance of NGINX up to and beyond the saturation point of the machine.
     158Furthermore, Figure~\ref{fig:swbsrv:err} shows the rate of errors, a gross approximation of tail latency, where \CFA achieves notably fewer errors once the machine reaches saturation.