Changeset 511a9368 for doc/theses/thierry_delisle_PhD/thesis
- Timestamp:
- Aug 5, 2022, 12:49:49 PM (2 years ago)
- Branches:
- ADT, ast-experimental, master, pthread-emulation
- Children:
- 8040286, 878be178
- Parents:
- 1c4f063
- File:
-
- 1 edited
doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex
r1c4f063 r511a9368
 15  15 Experimenting on memcached allows for a simple test of the \CFA runtime as a whole: it exercises the scheduler, the idle-sleep mechanism, as well as the \io subsystem for sockets.
 16  16 This experiment does not exercise the \io subsystem with regard to disk operations.
 17     The experiments compare 3 different variations of memcached:
 18     \begin{itemize}
 19     \item \emph{vanilla}: the official release of memcached, version~1.6.9.
 20     \item \emph{fibre}: a modification of vanilla which uses the thread-per-connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}.
 21     \item \emph{cfa}: a modification of the fibre variant that replaces the libfibre runtime with \CFA.
 22     \end{itemize}
 23  17
 24  18 \subsection{Benchmark Environment}
… …
 31  25 The network route uses 1 Mellanox SX1012 10/40 Gigabit Ethernet cluster switch.
 32  26
 33     \subsection{Throughput}
     27 \subsection{Memcached with threads per connection}
     28 Comparing against memcached using a user-level runtime only makes sense if the server actually uses this threading model.
     29 Indeed, evaluating a user-level runtime with 1 \at per \proc is not meaningful since it does not exercise the runtime; it simply adds some overhead to the underlying OS scheduler.
     30
     31 One approach is to use a server with a thread-per-connection model, where each incoming connection is served by a single \at in a strict 1-to-1 pairing.
     32 This model adds flexibility to the implementation, as the serving logic can now block on user-level primitives without affecting other connections.
     33
     34 Memcached is not built according to a thread-per-connection model, but there exists a port of it that is, which was built for libfibre in~\cite{DBLP:journals/pomacs/KarstenB20}.
     35 Therefore, this version can be compared both to the original version and to a port to the \CFA runtime.
     36
     37 As such, this memcached experiment compares 3 different variations of memcached:
     38 \begin{itemize}
     39 \item \emph{vanilla}: the official release of memcached, version~1.6.9.
     40 \item \emph{fibre}: a modification of vanilla which uses the thread-per-connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}.
     41 \item \emph{cfa}: a modification of the fibre variant that replaces the libfibre runtime with \CFA.
     42 \end{itemize}
     43
     44 \subsection{Throughput} \label{memcd:tput}
 34  45 \begin{figure}
 35  46 \centering
… …
 81  92 The static webserver experiments compare NGINX with a custom webserver developed for this experiment.
 82  93
     94 \subsection{\CFA webserver}
     95 Unlike the memcached experiment, the webserver experiment relies on a custom-designed webserver.
     96 It is a simple thread-per-connection webserver where a fixed number of \ats are created up front.
     97 Each \at calls @accept@, through @io_uring@, on the listening port and handles the incoming connection once accepted.
     98 Most of the implementation is fairly straightforward; however, the inclusion of file \io introduces a new challenge that had to be hacked around.
     99
    100 Normally, webservers use @sendfile@\cit{sendfile} to send files over the socket.
    101 @io_uring@ does not support @sendfile@; it supports @splice@\cit{splice} instead, which is strictly more powerful.
    102 However, because of how Linux implements file \io, see Subsection~\ref{ononblock}, @io_uring@'s implementation must delegate calls to @splice@ to worker threads inside the kernel.
    103 As of Linux 5.13, @io_uring@ caps the number of these worker threads at @RLIMIT_NPROC@; therefore, when tens of thousands of @splice@ requests are made, it can create tens of thousands of \glspl{kthrd}.
    104 Such a high number of \glspl{kthrd} is more than Linux can handle in this scenario, so performance suffers significantly.
    105 For this reason, the \CFA webserver calls @sendfile@ directly.
    106 This approach works up to a certain point, but once the server approaches saturation, it leads to a new problem.
    107
    108 When the saturation point of the server is attained, latency increases and inevitably some client connections time out.
    109 As these clients close their connections, the server must close these sockets without delay so the OS can reclaim the resources used by these connections.
    110 Indeed, until it is closed on the server end, a connection lingers in the CLOSE-WAIT TCP state~\cit{RFC793} and its TCP buffers are preserved.
    111 However, this poses a problem when using blocking @sendfile@ calls.
    112 The calls can block if they do not have sufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.
    113 Since blocking in calls to @sendfile@ blocks the \proc rather than the \at, this prevents other connections from closing their sockets.
    114 This leads to a vicious cycle where timeouts lead to @sendfile@ calls running out of resources, which leads to more timeouts.
    115
    116 Normally, this is addressed by marking the sockets as non-blocking and using @epoll@ to wait for sockets to have sufficient resources.
    117 However, since @io_uring@ respects non-blocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely.
    118 For this reason, the \CFA webserver sets and resets the @O_NONBLOCK@ flag before and after any calls to @sendfile@.
    119 Normally, @epoll@ would also be used when these calls to @sendfile@ return @EAGAIN@, but since this would not help in the evaluation of the \CFA runtime, the \CFA webserver simply yields and retries in these cases.
    120
    121 It is important to state that in Linux 5.15 @io_uring@ introduces the ability for users to limit the number of worker threads that are created, through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
    122 However, as of this writing, Ubuntu does not have a stable release of Linux 5.15.
    123 There exist versions of the kernel that are currently under testing, but these caused unrelated but nevertheless prohibitive issues in this experiment.
    124 Presumably, the new kernel would remove the need for the hack described above, as it would allow connections in the CLOSE-WAIT state to be closed even while the calls to @splice@/@sendfile@ are underway.
    125 However, since this could not be tested, this is purely conjecture at this point.
    126
 83 127 \subsection{Benchmark Environment}
 84 128 Unlike the memcached experiment, the webserver runs on a more heterogeneous environment.
… …
 87 131 These CPUs have only 8 \glspl{hthrd} enabled by grub, which is sufficient to achieve line rate.
 88 132 These CPUs each have 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches respectively.
    133 The kernel is set up to limit the memory to 25Gb.
 89 134
 90 135 The client machines each have two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards.
… …
 95 140 \todo{switch}
 96 141
 97
 98
 99 142 \subsection{Throughput}
100 143 \begin{figure}
101     \centering
102     \input{result.swbsrv.25gb.pstex_t}
103     \caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline }
    144 \subfloat[][Throughput]{
    145 \input{result.swbsrv.25gb.pstex_t}
    146 \label{fig:swbsrv:ops}
    147 }
    148
    149 \subfloat[][Rate of Errors]{
    150 \input{result.swbsrv.25gb.err.pstex_t}
    151 \label{fig:swbsrv:err}
    152 }
    153 \caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline Throughput vs request rate for short-lived connections.}
104 154 \label{fig:swbsrv}
105 155 \end{figure}
106
107     Networked ZIPF
108
109     Nginx : 5Gb still good, 4Gb starts to suffer
110
111     Cforall : 10Gb too high, 4 Gb too low
    156 Figure~\ref{fig:swbsrv} shows the results comparing \CFA to nginx in terms of throughput.
    157 It demonstrates that the \CFA webserver described above is able to match the performance of nginx up to and beyond the saturation point of the machine.
    158 Furthermore, Figure~\ref{fig:swbsrv:err} shows the rate of errors, a gross approximation of tail latency, where \CFA achieves notably fewer errors once the machine reaches saturation.
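
The added \CFA webserver subsection describes setting and resetting O_NONBLOCK around each sendfile call, then yielding and retrying on EAGAIN. The C sketch below illustrates that pattern only; it is not the thesis's implementation. The helper names set_nonblock and send_file_yielding are hypothetical, the transfer is assumed to start at file offset 0, and sched_yield() stands in for the runtime's user-level yield, which is not shown in the changeset.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set or clear O_NONBLOCK on fd.
 * Returns 0 on success, -1 on error (errno set by fcntl). */
static int set_nonblock(int fd, int enable) {
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0) return -1;
    flags = enable ? (flags | O_NONBLOCK) : (flags & ~O_NONBLOCK);
    return fcntl(fd, F_SETFL, flags);
}

/* Hypothetical helper: send up to count bytes of file_fd, starting at
 * offset 0, over out_fd. O_NONBLOCK is set only for the duration of the
 * transfer and cleared afterwards. On EAGAIN the caller yields and
 * retries instead of blocking the underlying kernel thread.
 * Returns the number of bytes sent, or -1 on error. */
static ssize_t send_file_yielding(int out_fd, int file_fd, size_t count) {
    assert(out_fd >= 0 && file_fd >= 0);
    off_t offset = 0;      /* advanced by sendfile as bytes are sent */
    size_t left = count;

    if (set_nonblock(out_fd, 1) < 0) return -1;
    while (left > 0) {
        ssize_t ret = sendfile(out_fd, file_fd, &offset, left);
        if (ret > 0) { left -= (size_t)ret; continue; }
        if (ret == 0) break;               /* end of input file */
        if (errno == EAGAIN || errno == EINTR) {
            sched_yield();                 /* assumption: stand-in for a user-level yield */
            continue;
        }
        int saved = errno;                 /* real error: restore the flag, keep errno */
        set_nonblock(out_fd, 0);
        errno = saved;
        return -1;
    }
    if (set_nonblock(out_fd, 0) < 0) return -1;
    return (ssize_t)(count - left);
}
```

A caller would invoke send_file_yielding(client_fd, file_fd, file_size) once per response. In a user-level runtime, the yield gives other user-level threads on the same processor a chance to run, for example to close sockets whose clients have timed out.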