Changeset fc6c410


Ignore:
Timestamp:
Aug 25, 2022, 11:52:00 AM (18 months ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, ast-experimental, master, pthread-emulation
Children:
d2f09e4
Parents:
82b9e956
Message:

Added description of NGINX's threading model.
Added section to io.tex describing bounded and unbounded workers.
Added section in eval_macro describing the results for 5.15.
Re-wrote the last paragraph about disk experiments.
It should all be ready to read.

Location:
doc/theses/thierry_delisle_PhD/thesis
Files:
3 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/thierry_delisle_PhD/thesis/local.bib

    r82b9e956 rfc6c410  
    969969}
    970970
     971@misc{xkcd:cloud,
     972  author = "Randall Munroe",
     973  title = "908: The Cloud",
     974  year = "2011",
     975  month = "June",
     976  howpublished = "\href{https://xkcd.com/908/}",
     977  note = "[Online; accessed 25-August-2022]"
     978}
     979
    971980@misc{go:safepoints,
    972981  author = "The Go Programming Language",
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

    r82b9e956 rfc6c410  
    88Therefore, webservers offer a stringent performance benchmark for \CFA.
    99Indeed, existing webservers have close to optimal performance, while the homogeneity of the workload means fairness may not be a problem.
    10 As such, these experiments should highlight any \CFA fairness cost (overhead) in realistic scenarios.
     10As such, these experiments should highlight the overhead tue to any \CFA fairness cost in realistic scenarios.
    1111
    1212\section{Memcached}
     
    3333\end{itemize}
    3434
    35 \subsection{Memcached threading}
     35\subsection{Memcached threading}\label{memcd:thrd}
    3636Memcached can be built to use multiple threads in addition to its @libevent@ subsystem to handle requests.
    3737When enabled, the threading implementation operates as follows~\cite{https://docs.oracle.com/cd/E17952_01/mysql-5.6-en/ha-memcached-using-threads.html}:
     
    4949Threads that are not currently dealing with another request ignore the incoming packet.
    5050One of the remaining, nonbusy, threads reads the request and sends the response.
    51 This implementation can lead to increased CPU load as threads wake from sleep to potentially process the request. 
    52 \end{itemize}
    53 Here, Memcached is based on an event-based webserver architecture~\cite{Pai99Flash}, using \gls{kthrd}ing to run multiple (largely) independent event engines, and if needed, spinning up additional kernel threads to handle blocking I/O.
     51This implementation can lead to increased CPU load as threads wake from sleep to potentially process the request.
     52\end{itemize}
     53Here, Memcached is based on an event-based webserver architecture~\cite{Pai99Flash}, using \gls{kthrd}ing to run multiple largely independent event engines, and if needed, spinning up additional kernel threads to handle blocking I/O.
    5454Alternative webserver architecture are:
    5555\begin{itemize}
     
    179179The static webserver experiment compares NGINX~\cite{nginx} with a custom \CFA-based webserver developed for this experiment.
    180180
     181\subsection{NGINX threading}
     182Like memcached, NGINX can be makde to use multiple \glspl{kthrd}.
     183It has a very similar architecture to the memcached architecture decscribed in Section~\ref{memcd:thrd}, where multiple \glspl{kthrd} each run a mostly independent network logic.
     184While it does not necessarily use a dedicated listening thread, each connection is arbitrarily assigned to one of the \newterm{worker} threads.
     185Each worker threads handles multiple connections exclusively, effectively dividing the connections into distinct sets.
     186Again, this is effectively the \emph{event-based server} approach.
     187
     188\cit{https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/}
     189
     190
    181191\subsection{\CFA webserver}
    182 The \CFA webserver is a straightforward thread-per-connection webserver, where a fixed number of \ats are created upfront (tuning parameter).
     192The \CFA webserver is a straightforward thread-per-connection webserver, where a fixed number of \ats are created upfront.
    183193Each \at calls @accept@, through @io_uring@, on the listening port and handles the incoming connection once accepted.
    184194Most of the implementation is fairly straightforward;
     
    190200As of Linux 5.13, @io_uring@ had no mechanism to restrict the number of worker threads, and therefore, when tens of thousands of splice requests are made, it correspondingly creates tens of thousands of internal \glspl{kthrd}.
    191201Such a high number of \glspl{kthrd} slows Linux significantly.
    192 Rather than abandon the experiment, the \CFA webserver was switched to nonblocking @sendfile@.
    193 However, when the nonblocking @sendfile@ returns @EAGAIN@, the \CFA server cannot block the \at because its I/O subsystem uses @io_uring@.
    194 Therefore, the \at must spin performing the @sendfile@ and yield if the call returns @EAGAIN@.
    195 This workaround works up to the saturation point, when other problems occur.
    196 
     202Rather than abandon the experiment, the \CFA webserver was switched to @sendfile@.
     203
     204With a blocking @sendfile@ the \CFA achieves acceptable performance until saturation is reached.
    197205At saturation, latency increases so some client connections timeout.
    198206As these clients close their connection, the server must close its corresponding side without delay so the OS can reclaim the resources used by these connections.
    199207Indeed, until the server connection is closed, the connection lingers in the CLOSE-WAIT TCP state~\cite{rfc:tcp} and the TCP buffers are preserved.
    200208However, this poses a problem using nonblocking @sendfile@ calls:
    201 the call can still block if there is insufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.\footnote{
     209when @sendfile@ blocks, the \proc rather than the \at blocks, preventing other connections from closing their sockets.
     210The call can block if there is insufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.\footnote{
    202211\lstinline{sendfile} can always block even in nonblocking mode if the file to be sent is not in the file-system cache, because Linux does not provide nonblocking disk I/O.}
    203 When @sendfile@ blocks, the \proc rather than the \at blocks, preventing other connections from closing their sockets.
    204212This effect results in a negative feedback where more timeouts lead to more @sendfile@ calls running out of resources.
    205213
    206214Normally, this is address by using @select@/@epoll@ to wait for sockets to have sufficient resources.
    207 However, since @io_uring@ respects nonblocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely.
     215However, since @io_uring@ respects nonblocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely:
     216all calls would simply immediately return @EAGAIN@ and all asynchronicity would be lost.
     217
    208218For this reason, the \CFA webserver sets and resets the @O_NONBLOCK@ flag before and after any calls to @sendfile@.
     219However, when the nonblocking @sendfile@ returns @EAGAIN@, the \CFA server cannot block the \at because its I/O subsystem uses @io_uring@.
     220Therefore, the \at must spin performing the @sendfile@ and yield if the call returns @EAGAIN@.
    209221Normally @epoll@ would also be used when these calls to @sendfile@ return @EAGAIN@, but since this would not help in the evaluation of the \CFA runtime, the \CFA webserver simply yields and retries in these cases.
    210222
    211223Interestingly, Linux 5.15 @io_uring@ introduces the ability to limit the number of worker threads that are created, through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
    212 However, as of writing this document Ubuntu does not have a stable release of Linux 5.15.
    213 There exists versions of the kernel that are currently under testing, but these caused unrelated but nevertheless prohibitive issues in this experiment.
    214 Presumably, the new kernel would remove the need for the hack described above, as it would allow connections in the CLOSE-WAIT state to be closed even while the calls to @splice@/@sendfile@ are underway.
    215 However, since this could not be tested, this is purely a conjecture at this point.
     224Presumably, this limit could prevent the explosion of \glspl{kthrd} which justified using @sendfile@ over @io_uring@ and @splice@.
     225However, recall from Section~\ref{iouring} that @io_uring@ maintains two pools of workers: bounded workers and unbounded workers.
     226In the particular case of the webserver, we would want the unbounded workers to handle accepts and reads on socket and bounded workers to handle reading the files from disk.
     227This would allow fine grained countrol over the number of workers needed for each operation type and would presumably lead to good performance.
     228However, @io_uring@ must contend with another reality of Linux: the versatility of @splice@.
     229Indeed, @splice@ can be used both for reading and writing, to or from any type of file descriptor.
     230This makes it more ambiguous which pool @io_uring@ should delegate @splice@ calls to.
     231In the case of splicing from a socket to pipe, @splice@ will behave like an unbounded operation, but when splicing from a regular file to a pipe, @splice@ becomes a bounded operation.
     232To make things more complicated, @splice@ can read from a pipe and write out to a regular file.
     233In this case, the read is an unbounded operation but the write is a bounded one.
     234This leaves @io_uring@ in a difficult situation where it can be very difficult to delegate splice operations to the appropriate type of worker.
     235Since there is little to no context available to @io_uring@, I believe it makes the decision to always delegate @splice@ operations to the unbounded workers.
     236This is unfortunate for this specific experiment, since it prevents the webserver from limiting the number of calls to @splice@ happening in parallel without affecting the performance of @read@ or @accept@.
     237For this reason, the @sendfile@ approach described above is still the most performant solution in Linux 5.15.
     238
     239Note that it could be possible to workaround this problem, for example by creating more @io_uring@ instances so @splice@ operations can be issued to a different instance than the @read@ and @accept@ operations.
     240However, I do not believe this solution is appropriate in general, it simply replaces a hack in the webserver with a different, equivalent hack.
    216241
    217242\subsection{Benchmark Environment}
     
    264289Throughput is measured by aggregating the results from httperf of all the clients.
    265290
    266 Two workload scenarios are created by reconfiguring the server with different amounts of memory: 4 GB and 2 GB.
    267 The two workloads correspond to in-memory (4 GB) and disk-I/O (2 GB).
     291This experiment can be done for two workload scenarios by reconfiguring the server with different amounts of memory: 25 GB and 2.5 GB.
     292The two workloads correspond to in-memory and disk-I/O respectively.
    268293Due to the Zipf distribution, only a small amount of memory is needed to service a significant percentage of requests.
    269294Table~\ref{t:CumulativeMemory} shows the cumulative memory required to satisfy the specified percentage of requests; e.g., 95\% of the requests come from 126.5 MB of the file set and 95\% of the requests are for files less than or equal to 51,200 bytes.
    270 Interestingly, with 2 GB of memory, significant disk-I/O occurs.
     295Interestingly, with 2.5 GB of memory, significant disk-I/O occurs.
    271296
    272297\begin{table}
     
    308333Previous work like \cit{Cite Ashif's stuff} demonstrate that an interesting follow-up experiment is to rerun the same throughput experiment but allowing significantly less memory on the machine.
    309334If the machine is constrained enough, it will force the OS to evict files from the file cache and cause calls to @sendfile@ to have to read from disk.
    310 However, what these low memory experiments demonstrate is how the memory footprint of the webserver affects the performance.
     335However, in this configuration, the problem with @splice@ and @io_uring@ rears its ugly head again.
     336Indeed, in the in-memory configuration, replacing @splice@ with calls to @sendfile@ works because the bounded side basically never blocks.
     337Like @splice@, @sendfile@ is in a situation where the read side requires bounded blocking, \eg reading from a regular file, while the write side requires unbounded blocking, \eg blocking until the socket is available for writing.
     338The unbounded side can be handled by yielding when it returns @EAGAIN@ like mentioned above, but this trick does not work for the bounded side.
     339The only solution for the bounded side is to spawn more threads and let these handle the blocking.
     340
     341Supporting this case in the webserver would require creating more \procs or creating a dedicated thread-pool.
    311342However, since what I am to evaluate in this thesis is the runtime of \CFA, I decided to forgo experiments on low memory server.
    312343The implementation of the webserver itself is simply too impactful to be an interesting evaluation of the underlying runtime.
  • doc/theses/thierry_delisle_PhD/thesis/text/io.tex

    r82b9e956 rfc6c410  
    158158The parked \glspl{thrd} are then rescheduled by the event engine once the desired operation has completed.
    159159
    160 \subsection{\lstinline{io_uring} in depth}
     160\subsection{\lstinline{io_uring} in depth}\label{iouring}
    161161Before going into details on the design of my event engine, more details on @io_uring@ usage are presented, each important in the design of the engine.
    162162Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance.
     
    216216This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time.
    217217
     218An important detail to keep in mind is that just like ``The cloud is just someone else's computer''\cite{xkcd:cloud}, asynchronous operations are just operation using someone else's threads.
     219Indeed, asynchronous operation can require computation time to complete, which means that if this time is not taken from the thread that triggered the asynchronous operation, it must be taken from some other threads.
     220In this case, the @io_uring@ operations that cannot be handled directly in the system call must be delegated to some other \gls{kthrd}.
     221To this end, @io_uring@ maintains multiple \glspl{kthrd} inside the kernel that are not exposed to the user.
     222There are three kind of operations that can need the \glspl{kthrd}:
     223
     224\paragraph{Operations using} @IOSQE_ASYNC@.
     225This is a straightforward case, users can explicitly set the @IOSQE_ASYNC@ flag on an SQE to specify that it \emph{must} be delegated to a different \gls{kthrd}.
     226
     227\paragraph{Bounded operations.}
     228This is also a fairly simple case. As mentioned earlier in this chapter, [@O_NONBLOCK@] has no effect for regular files and block devices.
     229@io_uring@ must also take this reality into account by delegating operations on regular files and block devices.
     230In fact @io_uring@ maintains a pool of \glspl{kthrd} dedicated to these operations, which are referred to as \newterm{bounded workers}.
     231
     232\paragraph{Unbounded operations that must be retried.}
     233While operations like reads on sockets can return @EAGAIN@ instead of blocking the \gls{kthrd}, in the case these operations return @EAGAIN@ they must be retried by @io_uring@ once the data is available on the socket.
     234Since this retry cannot necessarily be done in the system call, @io_uring@ must delegate these calls to a \gls{kthrd}.
     235@io_uring@ maintains a separate pool for these operations.
     236The \glspl{kthrd} in this pool are referred to as \newterm{unbounded workers}.
     237Unbounded workers are also responsible of handling operations using @IOSQE_ASYNC@.
     238
     239@io_uring@ implicitly spawns and joins both the bounded and unbounded workers based on its evaluation of the needs of the workload.
     240This effectively encapsulates the work that is needed when using @epoll@.
     241Indeed, @io_uring@ does not change Linux's underlying handling of \io opeartions, it simply offers an asynchronous \glsxtrshort{api} on top of the existing system.
     242
     243
    218244\subsection{Multiplexing \io: Submission}
    219245
Note: See TracChangeset for help on using the changeset viewer.