Changeset fc6c410
Timestamp: Aug 25, 2022, 11:52:00 AM
Branches: ADT, ast-experimental, master, pthread-emulation
Children: d2f09e4
Parents: 82b9e956
Location: doc/theses/thierry_delisle_PhD/thesis
Files: 3 edited
doc/theses/thierry_delisle_PhD/thesis/local.bib
r82b9e956 rfc6c410
969 969 }
970 970
971     @misc{xkcd:cloud,
972         author = "Randall Munroe",
973         title = "908: The Cloud",
974         year = "2011",
975         month = "June",
976         howpublished = "\href{https://xkcd.com/908/}",
977         note = "[Online; accessed 25-August-2022]"
978     }
979
971 980 @misc{go:safepoints,
972 981     author = "The Go Programming Language",
doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex
r82b9e956 rfc6c410
8 8 Therefore, webservers offer a stringent performance benchmark for \CFA.
9 9 Indeed, existing webservers have close to optimal performance, while the homogeneity of the workload means fairness may not be a problem.
10 As such, these experiments should highlight any \CFA fairness cost (overhead)in realistic scenarios.
10 As such, these experiments should highlight the overhead due to any \CFA fairness cost in realistic scenarios.
11 11
12 12 \section{Memcached}
… …
33 33 \end{itemize}
34 34
35 \subsection{Memcached threading}
35 \subsection{Memcached threading}\label{memcd:thrd}
36 36 Memcached can be built to use multiple threads in addition to its @libevent@ subsystem to handle requests.
37 37 When enabled, the threading implementation operates as follows~\cite{https://docs.oracle.com/cd/E17952_01/mysql-5.6-en/ha-memcached-using-threads.html}:
… …
49 49 Threads that are not currently dealing with another request ignore the incoming packet.
50 50 One of the remaining, nonbusy, threads reads the request and sends the response.
51 This implementation can lead to increased CPU load as threads wake from sleep to potentially process the request.
52 \end{itemize}
53 Here, Memcached is based on an event-based webserver architecture~\cite{Pai99Flash}, using \gls{kthrd}ing to run multiple (largely)independent event engines, and if needed, spinning up additional kernel threads to handle blocking I/O.
51 This implementation can lead to increased CPU load as threads wake from sleep to potentially process the request.
52 \end{itemize}
53 Here, Memcached is based on an event-based webserver architecture~\cite{Pai99Flash}, using \gls{kthrd}ing to run multiple largely independent event engines, and if needed, spinning up additional kernel threads to handle blocking I/O.
54 54 Alternative webserver architectures are:
55 55 \begin{itemize}
… …
179 179 The static webserver experiment compares NGINX~\cite{nginx} with a custom \CFA-based webserver developed for this experiment.
180 180
181 \subsection{NGINX threading}
182 Like memcached, NGINX can be made to use multiple \glspl{kthrd}.
183 It has a very similar architecture to the memcached architecture described in Section~\ref{memcd:thrd}, where multiple \glspl{kthrd} each run mostly independent network logic.
184 While it does not necessarily use a dedicated listening thread, each connection is arbitrarily assigned to one of the \newterm{worker} threads.
185 Each worker thread handles multiple connections exclusively, effectively dividing the connections into distinct sets.
186 Again, this is effectively the \emph{event-based server} approach.
187
188 \cit{https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/}
189
190
181 191 \subsection{\CFA webserver}
182 The \CFA webserver is a straightforward thread-per-connection webserver, where a fixed number of \ats are created upfront (tuning parameter).
192 The \CFA webserver is a straightforward thread-per-connection webserver, where a fixed number of \ats are created upfront.
183 193 Each \at calls @accept@, through @io_uring@, on the listening port and handles the incoming connection once accepted.
184 194 Most of the implementation is fairly straightforward;
… …
190 200 As of Linux 5.13, @io_uring@ had no mechanism to restrict the number of worker threads, and therefore, when tens of thousands of splice requests are made, it correspondingly creates tens of thousands of internal \glspl{kthrd}.
191 201 Such a high number of \glspl{kthrd} slows Linux significantly.
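The thread-per-connection structure described above can be illustrated with a short sketch. The following is plain C using liburing, not the \CFA implementation itself: it shows one connection-handling loop that submits @accept@ through @io_uring@ and then services the connection; @handle_request@ is a hypothetical helper standing in for the request parsing and file transfer.

// Sketch only: a single connection-handling loop, accept submitted through io_uring.
#include <liburing.h>
#include <sys/socket.h>
#include <netinet/in.h>

void handle_request(int conn_fd);            // hypothetical: parse request, send reply

void serve_connections(int listen_fd) {
    struct io_uring ring;
    io_uring_queue_init(64, &ring, 0);       // one small ring for this sketch

    for (;;) {
        struct sockaddr_in addr;
        socklen_t len = sizeof(addr);

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_accept(sqe, listen_fd, (struct sockaddr *)&addr, &len, 0);
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);      // a real runtime parks the thread here instead
        int conn_fd = cqe->res;              // new connection fd, or -errno on failure
        io_uring_cqe_seen(&ring, cqe);
        if (conn_fd < 0) continue;

        handle_request(conn_fd);
    }
}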
192 Rather than abandon the experiment, the \CFA webserver was switched to nonblocking @sendfile@.
193 However, when the nonblocking @sendfile@ returns @EAGAIN@, the \CFA server cannot block the \at because its I/O subsystem uses @io_uring@.
194 Therefore, the \at must spin performing the @sendfile@ and yield if the call returns @EAGAIN@.
195 This workaround works up to the saturation point, when other problems occur.
196
202 Rather than abandon the experiment, the \CFA webserver was switched to @sendfile@.
203
204 With a blocking @sendfile@, the \CFA webserver achieves acceptable performance until saturation is reached.
197 205 At saturation, latency increases so some client connections time out.
198 206 As these clients close their connection, the server must close its corresponding side without delay so the OS can reclaim the resources used by these connections.
199 207 Indeed, until the server connection is closed, the connection lingers in the CLOSE-WAIT TCP state~\cite{rfc:tcp} and the TCP buffers are preserved.
200 208 However, this poses a problem using nonblocking @sendfile@ calls:
201 the call can still block if there is insufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.\footnote{
209 when @sendfile@ blocks, the \proc rather than the \at blocks, preventing other connections from closing their sockets.
210 The call can block if there is insufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.\footnote{
202 211 \lstinline{sendfile} can always block even in nonblocking mode if the file to be sent is not in the file-system cache, because Linux does not provide nonblocking disk I/O.}
203 When @sendfile@ blocks, the \proc rather than the \at blocks, preventing other connections from closing their sockets.
204 212 This effect results in a negative feedback where more timeouts lead to more @sendfile@ calls running out of resources.
205 213
206 214 Normally, this is addressed by using @select@/@epoll@ to wait for sockets to have sufficient resources.
207 However, since @io_uring@ respects nonblocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely.
215 However, since @io_uring@ respects nonblocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely:
216 all calls would simply immediately return @EAGAIN@ and all asynchronicity would be lost.
217
208 218 For this reason, the \CFA webserver sets and resets the @O_NONBLOCK@ flag before and after any calls to @sendfile@.
219 However, when the nonblocking @sendfile@ returns @EAGAIN@, the \CFA server cannot block the \at because its I/O subsystem uses @io_uring@.
220 Therefore, the \at must spin performing the @sendfile@ and yield if the call returns @EAGAIN@.
209 221 Normally @epoll@ would also be used when these calls to @sendfile@ return @EAGAIN@, but since this would not help in the evaluation of the \CFA runtime, the \CFA webserver simply yields and retries in these cases.
210 222
211 223 Interestingly, Linux 5.15 @io_uring@ introduces the ability to limit the number of worker threads that are created, through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
212 However, as of writing this document Ubuntu does not have a stable release of Linux 5.15.
213 There exists versions of the kernel that are currently under testing, but these caused unrelated but nevertheless prohibitive issues in this experiment.
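The @O_NONBLOCK@ set/reset and the yield-and-retry on @EAGAIN@ described above amount to a small wrapper around @sendfile@. The following is a minimal C sketch of such a wrapper, not the actual \CFA code; @sched_yield@ stands in for the runtime's user-level yield and error handling is elided.

// Sketch: toggle O_NONBLOCK around sendfile, yield and retry on EAGAIN.
#include <fcntl.h>
#include <sys/sendfile.h>
#include <errno.h>
#include <sched.h>

off_t send_whole_file(int sock_fd, int file_fd, off_t size) {
    int flags = fcntl(sock_fd, F_GETFL);
    fcntl(sock_fd, F_SETFL, flags | O_NONBLOCK);    // set nonblocking for the transfer

    off_t off = 0;
    while (off < size) {
        ssize_t ret = sendfile(sock_fd, file_fd, &off, (size_t)(size - off));
        if (ret > 0) continue;                      // progress made, off was advanced
        if (ret < 0 && errno == EAGAIN) {           // socket not ready: yield and retry
            sched_yield();
            continue;
        }
        break;                                      // EOF or a real error
    }

    fcntl(sock_fd, F_SETFL, flags);                 // restore the original flags
    return off;
}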
214 Presumably, the new kernel would remove the need for the hack described above, as it would allow connections in the CLOSE-WAIT state to be closed even while the calls to @splice@/@sendfile@ are underway.
215 However, since this could not be tested, this is purely a conjecture at this point.
224 Presumably, this limit could prevent the explosion of \glspl{kthrd} which justified using @sendfile@ over @io_uring@ and @splice@.
225 However, recall from Section~\ref{iouring} that @io_uring@ maintains two pools of workers: bounded workers and unbounded workers.
226 In the particular case of the webserver, we would want the unbounded workers to handle accepts and reads on sockets, and bounded workers to handle reading the files from disk.
227 This would allow fine-grained control over the number of workers needed for each operation type and would presumably lead to good performance.
228 However, @io_uring@ must contend with another reality of Linux: the versatility of @splice@.
229 Indeed, @splice@ can be used both for reading and writing, to or from any type of file descriptor.
230 This makes it more ambiguous which pool @io_uring@ should delegate @splice@ calls to.
231 In the case of splicing from a socket to a pipe, @splice@ behaves like an unbounded operation, but when splicing from a regular file to a pipe, @splice@ becomes a bounded operation.
232 To make things more complicated, @splice@ can read from a pipe and write out to a regular file.
233 In this case, the read is an unbounded operation but the write is a bounded one.
234 This leaves @io_uring@ in a difficult situation, where it is hard to delegate @splice@ operations to the appropriate type of worker.
235 Since there is little to no context available to @io_uring@, I believe it makes the decision to always delegate @splice@ operations to the unbounded workers.
236 This is unfortunate for this specific experiment, since it prevents the webserver from limiting the number of calls to @splice@ happening in parallel without affecting the performance of @read@ or @accept@.
237 For this reason, the @sendfile@ approach described above is still the most performant solution in Linux 5.15.
238
239 Note that it could be possible to work around this problem, for example by creating more @io_uring@ instances so @splice@ operations can be issued to a different instance than the @read@ and @accept@ operations.
240 However, I do not believe this solution is appropriate in general; it simply replaces a hack in the webserver with a different, equivalent hack.
216 241
217 242 \subsection{Benchmark Environment}
… …
264 289 Throughput is measured by aggregating the results from httperf of all the clients.
265 290
266 Two workload scenarios are created by reconfiguring the server with different amounts of memory: 4 GB and 2 GB.
267 The two workloads correspond to in-memory (4 GB) and disk-I/O (2 GB).
291 This experiment can be done for two workload scenarios by reconfiguring the server with different amounts of memory: 25 GB and 2.5 GB.
292 The two workloads correspond to in-memory and disk-I/O respectively.
268 293 Due to the Zipf distribution, only a small amount of memory is needed to service a significant percentage of requests.
269 294 Table~\ref{t:CumulativeMemory} shows the cumulative memory required to satisfy the specified percentage of requests; e.g., 95\% of the requests come from 126.5 MB of the file set and 95\% of the requests are for files less than or equal to 51,200 bytes.
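For reference, the Linux 5.15 worker-thread limit discussed above is set through @IORING_REGISTER_IOWQ_MAX_WORKERS@. The following is a minimal C sketch using the liburing wrapper, assuming liburing 2.1 or later on a 5.15 kernel: the first value caps the bounded workers (regular-file and block-device operations), the second caps the unbounded workers (socket operations and @IOSQE_ASYNC@), and a value of zero leaves the corresponding limit unchanged.

// Sketch: capping io_uring's io-wq worker pools (assumes liburing >= 2.1, Linux >= 5.15).
#include <liburing.h>
#include <stdio.h>

int cap_workers(struct io_uring *ring, unsigned bounded, unsigned unbounded) {
    unsigned values[2] = { bounded, unbounded };
    int ret = io_uring_register_iowq_max_workers(ring, values);
    if (ret < 0) return ret;                      // e.g. -EINVAL on older kernels
    // On success, the kernel writes the previous limits back into values[].
    printf("previous limits: bounded=%u unbounded=%u\n", values[0], values[1]);
    return 0;
}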
270 Interestingly, with 2 GB of memory, significant disk-I/O occurs.
295 Interestingly, with 2.5 GB of memory, significant disk-I/O occurs.
271 296
272 297 \begin{table}
… …
308 333 Previous work like \cit{Cite Ashif's stuff} demonstrates that an interesting follow-up experiment is to rerun the same throughput experiment but allowing significantly less memory on the machine.
309 334 If the machine is constrained enough, it will force the OS to evict files from the file cache and cause calls to @sendfile@ to have to read from disk.
310 However, what these low memory experiments demonstrate is how the memory footprint of the webserver affects the performance.
335 However, in this configuration, the problem with @splice@ and @io_uring@ rears its ugly head again.
336 Indeed, in the in-memory configuration, replacing @splice@ with calls to @sendfile@ works because the bounded side basically never blocks.
337 Like @splice@, @sendfile@ is in a situation where the read side requires bounded blocking, \eg reading from a regular file, while the write side requires unbounded blocking, \eg blocking until the socket is available for writing.
338 The unbounded side can be handled by yielding when it returns @EAGAIN@ as mentioned above, but this trick does not work for the bounded side.
339 The only solution for the bounded side is to spawn more threads and let these handle the blocking.
340
341 Supporting this case in the webserver would require creating more \procs or creating a dedicated thread-pool.
311 342 However, since what I aim to evaluate in this thesis is the runtime of \CFA, I decided to forgo experiments on a low-memory server.
312 343 The implementation of the webserver itself is simply too impactful to be an interesting evaluation of the underlying runtime.
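A dedicated thread-pool of the kind mentioned above, where blocking @sendfile@ calls are handed to a small set of kernel threads so they never stall the event-handling processors, could look roughly like the following C sketch. The job queue (@pop_job@, @job_done@) and the job type are hypothetical helpers, not part of the actual webserver.

// Sketch: a small pool of kernel threads dedicated to (possibly disk-blocking) sendfile calls.
#include <pthread.h>
#include <sys/sendfile.h>

struct send_job { int sock_fd, file_fd; off_t off; size_t len; };

struct send_job *pop_job(void);          // hypothetical: blocks until a job is queued
void job_done(struct send_job *job);     // hypothetical: wake the user-level thread waiting on the job

static void *sendfile_worker(void *arg) {
    (void)arg;
    for (;;) {
        struct send_job *job = pop_job();
        // Blocking here only stalls this dedicated kernel thread, not an event-handling processor.
        while (job->len > 0) {
            ssize_t ret = sendfile(job->sock_fd, job->file_fd, &job->off, job->len);
            if (ret <= 0) break;
            job->len -= (size_t)ret;
        }
        job_done(job);
    }
    return NULL;
}

void start_sendfile_pool(int nthreads) {
    for (int i = 0; i < nthreads; i++) {
        pthread_t tid;
        pthread_create(&tid, NULL, sendfile_worker, NULL);
    }
}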
doc/theses/thierry_delisle_PhD/thesis/text/io.tex
r82b9e956 rfc6c410
158 158 The parked \glspl{thrd} are then rescheduled by the event engine once the desired operation has completed.
159 159
160 \subsection{\lstinline{io_uring} in depth}
160 \subsection{\lstinline{io_uring} in depth}\label{iouring}
161 161 Before going into details on the design of my event engine, more details on @io_uring@ usage are presented, each important in the design of the engine.
162 162 Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance.
… …
216 216 This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time.
217 217
218 An important detail to keep in mind is that just like ``The cloud is just someone else's computer''\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads.
219 Indeed, asynchronous operations can require computation time to complete, which means that if this time is not taken from the thread that triggered the asynchronous operation, it must be taken from some other thread.
220 In this case, the @io_uring@ operations that cannot be handled directly in the system call must be delegated to some other \gls{kthrd}.
221 To this end, @io_uring@ maintains multiple \glspl{kthrd} inside the kernel that are not exposed to the user.
222 There are three kinds of operations that can need these \glspl{kthrd}:
223
224 \paragraph{Operations using} @IOSQE_ASYNC@.
225 This is a straightforward case: users can explicitly set the @IOSQE_ASYNC@ flag on an SQE to specify that it \emph{must} be delegated to a different \gls{kthrd}.
226
227 \paragraph{Bounded operations.}
228 This is also a fairly simple case. As mentioned earlier in this chapter, @O_NONBLOCK@ has no effect for regular files and block devices.
229 @io_uring@ must also take this reality into account by delegating operations on regular files and block devices.
230 In fact, @io_uring@ maintains a pool of \glspl{kthrd} dedicated to these operations, which are referred to as \newterm{bounded workers}.
231
232 \paragraph{Unbounded operations that must be retried.}
233 While operations like reads on sockets can return @EAGAIN@ instead of blocking the \gls{kthrd}, when these operations do return @EAGAIN@, they must be retried by @io_uring@ once the data is available on the socket.
234 Since this retry cannot necessarily be done in the system call, @io_uring@ must delegate these calls to a \gls{kthrd}.
235 @io_uring@ maintains a separate pool for these operations.
236 The \glspl{kthrd} in this pool are referred to as \newterm{unbounded workers}.
237 Unbounded workers are also responsible for handling operations using @IOSQE_ASYNC@.
238
239 @io_uring@ implicitly spawns and joins both the bounded and unbounded workers based on its evaluation of the needs of the workload.
240 This effectively encapsulates the work that is needed when using @epoll@.
241 Indeed, @io_uring@ does not change Linux's underlying handling of \io operations; it simply offers an asynchronous \glsxtrshort{api} on top of the existing system.
242
243
218 244 \subsection{Multiplexing \io: Submission}
219 245
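As described above, @IOSQE_ASYNC@ lets the user force an operation onto the worker pools rather than letting @io_uring@ first attempt it inline. The following is a minimal liburing sketch in C (error handling elided) of submitting a read with this flag set.

// Sketch: explicitly delegating a read to io_uring's worker pool with IOSQE_ASYNC.
// Without the flag, io_uring first tries the operation without blocking and only
// falls back to a bounded or unbounded worker when the operation would block.
#include <liburing.h>

int queue_async_read(struct io_uring *ring, int fd, void *buf, unsigned len, off_t off) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe) return -1;                          // submission queue is full
    io_uring_prep_read(sqe, fd, buf, len, off);
    io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);     // must run on an io-wq worker thread
    return io_uring_submit(ring);
}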