Timestamp:
Jun 27, 2023, 4:45:40 PM (21 months ago)
Author:
caparsons <caparson@…>
Branches:
master
Children:
a1f0cb6
Parents:
917e1fd
Message:

first draft of full waituntil chapter and conclusion chapter. Lots of graph/plotting utilities cleanup. Reran all CFA actor benchmarks after recent changes. Small changes to actor.tex in performance section

Location:
doc/theses/colby_parsons_MMAth/text
Files:
1 added
2 edited

  • TabularUnified doc/theses/colby_parsons_MMAth/text/actors.tex

    r917e1fd r14e1053  
    55% ======================================================================
    66
    7 % C_TODO: add citations throughout chapter
    87Actors are an indirect concurrency feature that abstracts threading away from the programmer, instead providing \gls{actor}s and messages as the building blocks for concurrency; message passing means there is no shared data to protect, making actors amenable to a distributed environment.
    98Actors are another message-passing concurrency feature, similar to channels but with more abstraction, and are in the realm of \gls{impl_concurrency}, where programmers write concurrent code without dealing with explicit thread creation or interaction.
     
    423422Push operations are amortized $O(1)$ since pushes may cause doubling reallocations of the underlying dynamic-sized array (like \CC @vector@).
    424423
    425 % C_TODO: maybe make copy_queue diagram
    426 
    427424Since the copy queue is an array, envelopes are allocated first on the stack and then copied into the copy queue to persist until they are no longer needed.
    428425For many workloads, the copy queues grow to accommodate the average number of messages in flight, after which there are no further dynamic allocations.
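The amortized-$O(1)$ push described above can be sketched as follows (a minimal illustration with hypothetical names, not the actual \CFA copy-queue interface):
\begin{cfa}
struct copy_queue {
	envelope * buffer;	// dynamically-sized array of envelopes
	size_t count, capacity;
};
void push( copy_queue & q, envelope e ) {
	if ( q.count == q.capacity ) {	// full? double the array (amortized O(1))
		q.capacity *= 2;
		q.buffer = realloc( q.buffer, q.capacity * sizeof( envelope ) );
	}
	q.buffer[ q.count++ ] = e;	// copy the stack-allocated envelope so it persists
}
\end{cfa}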
     
    834831In another example, if the average gulp size is very high, it could indicate that the executor could use more queue sharding.
    835832
    836 % C_TODO cite poison pill messages and add languages
    837833Another included productivity feature is a group of poison-pill messages.
    838 Poison-pill messages are common across actor systems, including Akka and ProtoActor \cite{}.
     834Poison-pill messages are common across actor systems, and are used in the actor libraries Akka and ProtoActor~\cite{Akka,ProtoActor}.
    839835Poison-pill messages instruct an actor to terminate.
    840836In \CFA, due to the allocation of actors and lack of garbage collection, there needs to be a suite of poison-pills.
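As an illustration, sending a poison-pill might look as follows (the message and actor names here are hypothetical; the actual \CFA suite provides multiple pills to account for the possible actor-allocation strategies):
\begin{cfa}
worker_actor w;	// hypothetical user-defined actor
w | work_msg;	// normal message send
w | finished_msg;	// hypothetical poison-pill: w terminates after receipt
\end{cfa}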
     
    881877        & \multicolumn{1}{c|}{\CFA (100M)} & \multicolumn{1}{c|}{CAF (10M)} & \multicolumn{1}{c|}{Akka (100M)} & \multicolumn{1}{c|}{\uC (100M)} & \multicolumn{1}{c@{}}{ProtoActor (100M)} \\
    882878        \hline
    883         AMD             & \input{data/pykeSendStatic} \\
     879        AMD             & \input{data/nasusSendStatic} \\
    884880        \hline
    885         Intel   & \input{data/nasusSendStatic}
     881        Intel   & \input{data/pykeSendStatic}
    886882\end{tabular}
    887883
     
    894890        & \multicolumn{1}{c|}{\CFA (20M)} & \multicolumn{1}{c|}{CAF (2M)} & \multicolumn{1}{c|}{Akka (2M)} & \multicolumn{1}{c|}{\uC (20M)} & \multicolumn{1}{c@{}}{ProtoActor (2M)} \\
    895891        \hline
    896         AMD             & \input{data/pykeSendDynamic} \\
     892        AMD             & \input{data/nasusSendDynamic} \\
    897893        \hline
    898         Intel   & \input{data/nasusSendDynamic}
     894        Intel   & \input{data/pykeSendDynamic}
    899895\end{tabular}
    900896\end{table}
     
    916912The results from the static/dynamic send benchmarks are shown in Figures~\ref{t:StaticActorMessagePerformance} and \ref{t:DynamicActorMessagePerformance} respectively.
    917913\CFA leads the charts in both benchmarks, largely due to the copy queue removing the majority of the envelope allocations.
     914Additionally, the receive routine for each message sent in \CFA is statically known and is determined via a function-pointer cast, which incurs only a compile-time cost.
     915All the other systems use their virtual system to find the correct behaviour at message send.
     916This requires two virtual dispatch operations, which is an additional runtime send cost that \CFA does not have.
     917Note that Akka also statically checks message sends, but still uses its virtual system at runtime.
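The send-cost difference can be sketched abstractly (illustrative pseudocode, not the actual runtime code of any of these systems):
\begin{cfa}
// static (\CFA): the receive routine is known at compile time
void (*recv)( actor &, message & ) = (void (*)( actor &, message & ))my_receive;
recv( a, m );	// direct call at send; the cast costs nothing at runtime
// dynamic (other systems): behaviour located at send via virtual dispatch
a.vtable->lookup( m )->deliver( a, m );	// two virtual calls per send
\end{cfa}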
    918918In the static send benchmark, all systems except CAF have static send costs in the same ballpark, varying by only ${\sim}70$~ns.
    919919In the dynamic send benchmark all systems experience slower message sends, as expected due to the extra allocations.
     
    10841084
    10851085Figure~\ref{t:ExecutorMemory} shows the high memory watermark of the actor systems when running the executor benchmark on 48 cores.
    1086 \CFA has a high watermark relative to the other non-garbage collected systems \uC, and CAF.
     1086\CFA has a high watermark relative to the other non-garbage-collected systems, \uC and CAF.
    10871087This is a result of the copy-queue data structure, which over-allocates storage and does not clean up eagerly, whereas per-envelope allocation always allocates exactly the amount of storage needed.
     1088Despite having a higher watermark, the \CFA memory usage remains comparable to other non-garbage-collected systems.
    10881089
    10891090\subsection{Matrix Multiply}
  • TabularUnified doc/theses/colby_parsons_MMAth/text/waituntil.tex

    r917e1fd r14e1053  
    8080This enables fully expressive \gls{synch_multiplex} predicates.
    8181
    82 There are many other languages that provide \gls{synch_multiplex}, including Rust's @select!@ over futures~\cite{rust:select}, OCaml's @select@ over channels~\cite{ocaml:channe}, and C++14's @when_any@ over futures~\cite{cpp:whenany}.
     82There are many other languages that provide \gls{synch_multiplex}, including Rust's @select!@ over futures~\cite{rust:select}, OCaml's @select@ over channels~\cite{ocaml:channel}, and C++14's @when_any@ over futures~\cite{cpp:whenany}.
    8383Note that while C++14 and Rust provide \gls{synch_multiplex}, their implementations leave much to be desired, as both rely on busy polling to wait on multiple resources.
    8484
     
    9999All of the \gls{synch_multiplex} features mentioned so far are monomorphic, each supporting only one kind of resource: select(2) supports file descriptors, Go's select supports channel operations, \uC's select supports futures, and Ada's select supports monitor method calls.
    100100The waituntil statement in \CFA is polymorphic and provides \gls{synch_multiplex} over any objects that satisfy the trait in Figure~\ref{f:wu_trait}.
     101No other language provides a synchronous multiplexing tool polymorphic over resources like \CFA's waituntil.
     102All other languages tie their \gls{synch_multiplex} feature to a specific type of resource.
    101103
    102104\begin{figure}
     
    370372
    371373\subsection{Channel Benchmark}
    372 The channel microbenchmark compares \CFA's waituntil and Go's select, where the resource being waited on is a set of channels.
    373 
    374 %C_TODO explain benchmark
    375 
    376 %C_TODO show results
    377 
    378 %C_TODO discuss results
     374The channel multiplexing microbenchmarks compare \CFA's waituntil and Go's select, where the resource being waited on is a set of channels.
     375The basic structure of the microbenchmark has the number of cores split evenly between producer and consumer threads, \ie, with 8 cores there would be 4 producer threads and 4 consumer threads.
     376The number of clauses @C@ is also varied, with results shown with 2, 4, and 8 clauses.
     377Each clause has a respective channel that it operates on.
     378Each producer and consumer repeatedly waits to either produce or consume from one of the @C@ clauses and respective channels.
     379An example in \CFA syntax of the work loop in the consumer main with @C = 4@ clauses follows.
     380
     381\begin{cfa}
     382    for (;;)
     383        waituntil( val << chans[0] ) {} or waituntil( val << chans[1] ) {}
     384        or waituntil( val << chans[2] ) {} or waituntil( val << chans[3] ) {}
     385\end{cfa}
     386A successful consumption is counted as a channel operation, and the throughput of these operations is measured over 10 seconds.
     387The first microbenchmark measures the throughput of the producers and consumers synchronously waiting on the channels, and the second has the threads asynchronously wait on the channels.
     388The results are shown in Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} respectively.
     389
     390\begin{figure}
     391        \centering
     392    \captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize}
     393        \subfloat[AMD]{
     394                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_2.pgf}}
     395        }
     396        \subfloat[Intel]{
     397                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_2.pgf}}
     398        }
     399    \bigskip
     400
     401        \subfloat[AMD]{
     402                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_4.pgf}}
     403        }
     404        \subfloat[Intel]{
     405                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_4.pgf}}
     406        }
     407    \bigskip
     408
     409        \subfloat[AMD]{
     410                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_8.pgf}}
     411        }
     412        \subfloat[Intel]{
     413                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_8.pgf}}
     414        }
     415        \caption{The channel synchronous multiplexing benchmark comparing Go select and \CFA waituntil statement throughput (higher is better).}
     416        \label{f:select_contend_bench}
     417\end{figure}
     418
     419\begin{figure}
     420        \centering
     421    \captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize}
     422        \subfloat[AMD]{
     423                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_2.pgf}}
     424        }
     425        \subfloat[Intel]{
     426                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_2.pgf}}
     427        }
     428    \bigskip
     429
     430        \subfloat[AMD]{
     431                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_4.pgf}}
     432        }
     433        \subfloat[Intel]{
     434                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_4.pgf}}
     435        }
     436    \bigskip
     437
     438        \subfloat[AMD]{
     439                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_8.pgf}}
     440        }
     441        \subfloat[Intel]{
     442                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_8.pgf}}
     443        }
     444        \caption{The asynchronous multiplexing channel benchmark comparing Go select and \CFA waituntil statement throughput (higher is better).}
     445        \label{f:select_spin_bench}
     446\end{figure}
     447
     448Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} have similar results when comparing @select@ and @waituntil@.
     449In the AMD benchmarks, the performance is very similar as the number of cores scales.
     450The AMD machine has been observed to have higher cache-contention costs, which create a bottleneck on the channel locks, resulting in similar scaling for \CFA and Go.
     451At low cores, Go has significantly better performance, which is likely due to an optimization in their scheduler.
     452Go heavily optimizes thread handoffs on its local runqueue, which can result in very good performance for low numbers of threads that are parking/unparking each other~\cite{go:sched}.
     453In the Intel benchmarks, \CFA performs better than Go as the number of cores and the number of clauses scale.
     454This is likely due to Go's implementation choice of acquiring all channel locks when registering and unregistering channels on a @select@.
     455Go then must hold a lock for every channel in the @select@, so performance degrades as the number of channels increases.
     456In \CFA, races are consolidated without holding all locks, so it scales much better with both cores and clauses, since more work can occur in parallel.
     457This scalability difference is more significant on the Intel machine than the AMD machine since the Intel machine has been observed to have lower cache contention costs.
     458
     459The Go approach of holding all internal channel locks in the select has some additional drawbacks.
     460This approach results in some pathological cases where Go's system throughput on channels can greatly suffer.
     461Consider the case where there are two channels, @A@ and @B@.
     462There is a producer thread @P1@ and a consumer thread @C1@, each selecting on both @A@ and @B@.
     463Additionally, there is another producer and another consumer thread, @P2@ and @C2@, that are both operating solely on @B@.
     464Compared to \CFA, this setup results in significantly worse performance for Go, since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all the locks being acquired.
     465This case may not be as rare as it seems:
     466if the set of channels belonging to one select overlaps with the set of another select, the two selects lose the ability to operate in parallel.
     467The implementation in \CFA only ever holds a single lock at a time, resulting in better locking granularity.
     468Comparison of this pathological case is shown in Table~\ref{t:pathGo}.
     469The AMD results highlight the worst-case scenario for Go, since contention is more costly on this machine than on the Intel machine.
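For illustration, the two consumers in this scenario could be structured as follows in \CFA syntax (a sketch; producer loops are analogous):
\begin{cfa}
// C1: multiplexes over channels A and B
for (;;)
	waituntil( val << A ) {} or waituntil( val << B ) {}
// C2: operates solely on channel B
for (;;)
	val << B;	// in Go, contends with the select holding B's lock
\end{cfa}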
     470
     471\begin{table}[t]
     472\centering
     473\setlength{\extrarowheight}{2pt}
     474\setlength{\tabcolsep}{5pt}
     475
     476\caption{Throughput (channel operations per second) of \CFA and Go for a pathologically bad case for contention in Go's select implementation}
     477\label{t:pathGo}
     478\begin{tabular}{*{5}{r|}r}
     479    & \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c@{}}{Go} \\
     480    \hline
     481    AMD         & \input{data/nasus_Order} \\
     482    \hline
     483    Intel       & \input{data/pyke_Order}
     484\end{tabular}
     485\end{table}
     486
     487Another difference between Go and \CFA is the order of clause selection when multiple clauses are available.
     488Go selects a clause ``randomly'', whereas \CFA chooses clauses in the order they are listed~\cite{go:select}.
     489This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour, and even better performance in certain cases, such as the case shown in Table~\ref{t:pathGo}.
     490If \CFA did not have priorities, the performance difference in Table~\ref{t:pathGo} would be less significant, since @P1@ and @C1@ would compete to operate on @B@ more often under random selection.
    379491
    380492\subsection{Future Benchmark}
    381493The future benchmark compares \CFA's waituntil with \uC's @_Select@, with both utilities waiting on futures.
    382 
    383 %C_TODO explain benchmark
    384 
    385 %C_TODO show results
    386 
    387 %C_TODO discuss results
     494\CFA's @waituntil@ and \uC's @_Select@ have very similar semantics; however, @_Select@ can only wait on futures, whereas @waituntil@ is polymorphic.
     495They both support @and@ and @or@ operators, but the underlying implementation of the operators differs between @waituntil@ and @_Select@.
     496The @waituntil@ statement checks for statement completion using a predicate function, whereas the @_Select@ statement maintains a tree that represents the state of the internal predicate.
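To sketch the distinction (not the actual implementation of either system): for @waituntil(A) and waituntil(B) or waituntil(C)@, the \CFA approach conceptually re-evaluates a flat predicate over completion flags whenever a resource becomes available, whereas \uC instead updates the matching node in an internal @and@/@or@ tree and propagates completion upward.
\begin{cfa}
bool predicate( bool A, bool B, bool C ) {
	return (A && B) || C;	// statement completes when this holds
}
\end{cfa}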
     497
     498\begin{figure}
     499        \centering
     500        \subfloat[AMD Future Synchronization Benchmark]{
     501                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Future.pgf}}
     502                \label{f:futureAMD}
     503        }
     504        \subfloat[Intel Future Synchronization Benchmark]{
     505                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Future.pgf}}
     506                \label{f:futureIntel}
     507        }
     508        \caption{\CFA waituntil and \uC \_Select statement throughput synchronizing on a set of futures with varying wait predicates (higher is better).}
     510        \label{f:futurePerf}
     511\end{figure}
     512
     513This microbenchmark aims to measure the impact of various predicates on the performance of the @waituntil@ and @_Select@ statements.
     514This benchmark and section do not try to directly compare the @waituntil@ and @_Select@ statements, since the performance of futures in \CFA and \uC differs by a significant margin, making them incomparable.
     515Results of this benchmark are shown in Figure~\ref{f:futurePerf}.
     516Each set of columns is marked with a name representing the predicate for that set of columns.
     517The predicate names and the corresponding @waituntil@ statements are shown below:
     518
     519\begin{cfa}
     520#ifdef OR
     521waituntil( A ) { get( A ); }
     522or waituntil( B ) { get( B ); }
     523or waituntil( C ) { get( C ); }
     524#endif
     525#ifdef AND
     526waituntil( A ) { get( A ); }
     527and waituntil( B ) { get( B ); }
     528and waituntil( C ) { get( C ); }
     529#endif
     530#ifdef ANDOR
     531waituntil( A ) { get( A ); }
     532and waituntil( B ) { get( B ); }
     533or waituntil( C ) { get( C ); }
     534#endif
     535#ifdef ORAND
     536(waituntil( A ) { get( A ); }
     537or waituntil( B ) { get( B ); }) // brackets create higher precedence for or
     538and waituntil( C ) { get( C ); }
     539#endif
     540\end{cfa}
     541
     542In Figure~\ref{f:futurePerf}, the @OR@ column for \CFA is more performant than the other \CFA predicates, likely due to the special-casing of waituntil statements with only @or@ operators.
     543For both \uC and \CFA the @AND@ column is the least performant, which is expected since all three futures need to be fulfilled for each statement completion, unlike any of the other operators.
     544Interestingly, \CFA has lower variation across predicates on the AMD machine (excluding the special @OR@ case), whereas \uC has lower variation on the Intel machine.