Index: doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex
===================================================================
--- doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex	(revision 1c4f063145498fc1a68d66a0c2f2d4d6bf9a7a53)
+++ doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex	(revision 511a936813b37ad7d2002f0d594c479aed4dd644)
@@ -15,10 +15,4 @@
 Experimenting on memcached allows for a simple test of the \CFA runtime as a whole, it will exercise the scheduler, the idle-sleep mechanism, as well the \io subsystem for sockets.
 This experiment does not exercise the \io subsytem with regards to disk operations.
-The experiments compare 3 different varitions of memcached:
-\begin{itemize}
- \item \emph{vanilla}: the official release of memcached, version~1.6.9.
- \item \emph{fibre}: a modification of vanilla which uses the thread per connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}.
- \item \emph{cfa}: a modification of the fibre webserver that replaces the libfibre runtime with \CFA.
-\end{itemize}
 
 \subsection{Benchmark Environment}
@@ -31,5 +25,22 @@
 The network route uses 1 Mellanox SX1012 10/40 Gigabit Ethernet cluster switch.
 
-\subsection{Throughput}
+\subsection{Memcached with threads per connection}
+Comparing against memcached using a user-level runtime only makes sense if the server actually uses this threading model.
+Indeed, evaluating a user-level runtime with 1 \at per \proc is not meaningful since it does not exercise the runtime; it simply adds some overhead to the underlying OS scheduler.
+
+One approach is to use a webserver that uses a thread-per-connection model, where each incoming connection is served by a single \at in a strict 1-to-1 pairing.
+This model adds flexibility to the implementation, as the serving logic can now block on user-level primitives without affecting other connections.
+
+Memcached is not built according to a thread-per-connection model, but there exists a port of it that is, built for libfibre in~\cite{DBLP:journals/pomacs/KarstenB20}.
+Therefore, this version can be compared both to the original version and to a port to the \CFA runtime.
+
+As such, this memcached experiment compares 3 different variations of memcached:
+\begin{itemize}
+ \item \emph{vanilla}: the official release of memcached, version~1.6.9.
+ \item \emph{fibre}: a modification of vanilla which uses the thread-per-connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}.
+ \item \emph{cfa}: a modification of the fibre webserver that replaces the libfibre runtime with \CFA.
+\end{itemize}
+
+\subsection{Throughput} \label{memcd:tput}
 \begin{figure}
 	\centering
@@ -81,4 +92,37 @@
 The static webserver experiments will compare NGINX with a custom webserver developped for this experiment.
 
+\subsection{\CFA webserver}
+Unlike the memcached experiment, the webserver experiment relies on a custom-designed webserver.
+It is a simple thread-per-connection webserver where a fixed number of \ats are created upfront.
+Each \at calls @accept@, through @io_uring@, on the listening port and handles the incoming connection once accepted.
+Most of the implementation is fairly straightforward; however, the inclusion of file \io introduces a new challenge that had to be hacked around.
+
+Normally, webservers use @sendfile@\cit{sendfile} to send files over the socket.
+@io_uring@ does not support @sendfile@; instead, it supports @splice@\cit{splice}, which is strictly more powerful.
+However, because of how Linux implements file \io, see Subsection~\ref{ononblock}, @io_uring@'s implementation must delegate calls to @splice@ to worker threads inside the kernel.
+As of Linux 5.13, @io_uring@ caps the number of these worker threads at @RLIMIT_NPROC@ and therefore, when tens of thousands of @splice@ requests are made, it can create tens of thousands of \glspl{kthrd}.
+Such a high number of \glspl{kthrd} is more than Linux can handle in this scenario, so performance suffers significantly.
+For this reason, the \CFA webserver calls @sendfile@ directly.
+This approach works up to a certain point, but once the server approaches saturation, it leads to a new problem.
+
+When the saturation point of the server is attained, latency will increase and inevitably some client connections will time out.
+As these clients close their connections, the server must close these sockets without delay so the OS can reclaim the resources used by these connections.
+Indeed, until it is closed on the server end, a connection will linger in the CLOSE-WAIT TCP state~\cit{RFC793} and its TCP buffers will be preserved.
+However, this poses a problem when using blocking @sendfile@ calls.
+The calls can block if they do not have sufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.
+Since blocking in calls to @sendfile@ blocks the \proc rather than the \at, this prevents other connections from closing their sockets.
+This leads to a vicious cycle where timeouts lead to @sendfile@ calls running out of resources, which leads to more timeouts.
+
+Normally, this is addressed by marking the sockets as non-blocking and using @epoll@ to wait for sockets to have sufficient resources.
+However, since @io_uring@ respects non-blocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely.
+For this reason, the \CFA webserver sets and resets the @O_NONBLOCK@ flag before and after any calls to @sendfile@.
+Normally, @epoll@ would also be used when these calls to @sendfile@ return @EAGAIN@, but since this would not help in the evaluation of the \CFA runtime, the \CFA webserver simply yields and retries in these cases.
+
+It is important to state that in Linux 5.15 @io_uring@ introduces the ability for users to limit the number of worker threads that are created, through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
+However, as of writing this document, Ubuntu does not have a stable release of Linux 5.15.
+There exist versions of the kernel that are currently under testing, but these caused unrelated but nevertheless prohibitive issues in this experiment.
+Presumably, the new kernel would remove the need for the hack described above, as it would allow connections in the CLOSE-WAIT state to be closed even while the calls to @splice@/@sendfile@ are underway.
+However, since this could not be tested, this is purely conjecture at this point.
+
 \subsection{Benchmark Environment}
 Unlike the memcached experiment, the webserver run on a more heterogenous environment.
@@ -87,4 +131,5 @@
 These CPUs has only 8 \glspl{hthrd} enabled by grub, which is sufficient to achieve line rate.
 This cpus each have 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches respectively.
+The kernel is set up to limit the memory to 25Gb.
 
 The client machines each have two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards.
@@ -95,17 +140,19 @@
 \todo{switch}
 
-
-
 \subsection{Throughput}
 \begin{figure}
-	\centering
-	\input{result.swbsrv.25gb.pstex_t}
-	\caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline }
+	\subfloat[][Throughput]{
+		\input{result.swbsrv.25gb.pstex_t}
+		\label{fig:swbsrv:ops}
+	}
+
+	\subfloat[][Rate of Errors]{
+		\input{result.swbsrv.25gb.err.pstex_t}
+		\label{fig:swbsrv:err}
+	}
+	\caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline Throughput vs request rate for short-lived connections.}
 	\label{fig:swbsrv}
 \end{figure}
-
-Networked ZIPF
-
-Nginx : 5Gb still good, 4Gb starts to suffer
-
-Cforall : 10Gb too high, 4 Gb too low
+Figure~\ref{fig:swbsrv} shows the results comparing \CFA to nginx in terms of throughput.
+It demonstrates that the \CFA webserver described above is able to match the performance of nginx up to and beyond the saturation point of the machine.
+Furthermore, Figure~\ref{fig:swbsrv:err} shows the rate of errors, a gross approximation of tail latency, where \CFA achieves notably fewer errors once the machine reaches saturation.