Changeset 58fe85a for doc/theses

- Timestamp: Jan 7, 2021, 3:27:00 PM
- Branches: ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, master, new-ast-unique-expr, pthread-emulation, qualifiedEnum, stuck-waitfor-destruct
- Children: 2b4daf2, 64aeca0
- Parents: 3c64c668 (diff), eef8dfb (diff)
Note: this is a merge changeset; the changes displayed below correspond to the merge itself. Use the (diff) links above to see all the changes relative to each parent.
- Location: doc/theses
- Files: 64 added, 1 deleted, 4 edited, 13 moved
- andrew_beach_MMath/.gitignore (added)
- andrew_beach_MMath/Makefile (added)
- andrew_beach_MMath/cfalab.sty (added)
- andrew_beach_MMath/existing.tex (added)
- andrew_beach_MMath/features.tex (added)
- andrew_beach_MMath/future.tex (added)
- andrew_beach_MMath/glossaries.tex (added)
- andrew_beach_MMath/thesis-frontpgs.tex (added)
- andrew_beach_MMath/thesis.bib (added)
- andrew_beach_MMath/thesis.tex (added)
- andrew_beach_MMath/unwinding.tex (added)
- andrew_beach_MMath/uw-ethesis.cls (added)
- fangren_yu_COOP_S20/Makefile (added)
- fangren_yu_COOP_S20/Report.tex (added)
- fangren_yu_COOP_S20/cfa_developer_reference.pdf (added)
- fangren_yu_COOP_S20/figures/DeepNodeSharing.fig (added)
- fangren_yu_COOP_S20/figures/DeepNodeSharing.fig.bak (added)
- thierry_delisle_PhD/.gitignore (modified) (1 diff)
- thierry_delisle_PhD/code/readQ_example/Makefile (added)
- thierry_delisle_PhD/code/readQ_example/proto-gui/main.cpp (added)
- thierry_delisle_PhD/code/readQ_example/thrdlib/Makefile (added)
- thierry_delisle_PhD/code/readQ_example/thrdlib/cforall.hpp (added)
- thierry_delisle_PhD/code/readQ_example/thrdlib/fibre.hpp (added)
- thierry_delisle_PhD/code/readQ_example/thrdlib/pthread.hpp (added)
- thierry_delisle_PhD/code/readQ_example/thrdlib/thread.cpp (added)
- thierry_delisle_PhD/code/readQ_example/thrdlib/thread.hpp (added)
- thierry_delisle_PhD/code/readyQ_proto/Makefile (moved from doc/theses/thierry_delisle_PhD/code/Makefile)
- thierry_delisle_PhD/code/readyQ_proto/assert.hpp (moved from doc/theses/thierry_delisle_PhD/code/assert.hpp)
- thierry_delisle_PhD/code/readyQ_proto/bitbench/select.cpp (added)
- thierry_delisle_PhD/code/readyQ_proto/bts.cpp (added)
- thierry_delisle_PhD/code/readyQ_proto/bts_test.cpp (moved from doc/theses/thierry_delisle_PhD/code/bts_test.cpp)
- thierry_delisle_PhD/code/readyQ_proto/links.hpp (added)
- thierry_delisle_PhD/code/readyQ_proto/prefetch.cpp (moved from doc/theses/thierry_delisle_PhD/code/prefetch.cpp)
- thierry_delisle_PhD/code/readyQ_proto/process.sh (added)
- thierry_delisle_PhD/code/readyQ_proto/processor.hpp (moved from doc/theses/thierry_delisle_PhD/code/processor.hpp)
- thierry_delisle_PhD/code/readyQ_proto/processor_list.hpp (moved from doc/theses/thierry_delisle_PhD/code/processor_list.hpp)
- thierry_delisle_PhD/code/readyQ_proto/processor_list_fast.cpp (moved from doc/theses/thierry_delisle_PhD/code/processor_list_fast.cpp)
- thierry_delisle_PhD/code/readyQ_proto/processor_list_good.cpp (moved from doc/theses/thierry_delisle_PhD/code/processor_list_good.cpp)
- thierry_delisle_PhD/code/readyQ_proto/randbit.cpp (moved from doc/theses/thierry_delisle_PhD/code/randbit.cpp)
- thierry_delisle_PhD/code/readyQ_proto/relaxed_list.cpp (moved from doc/theses/thierry_delisle_PhD/code/relaxed_list.cpp) (25 diffs)
- thierry_delisle_PhD/code/readyQ_proto/relaxed_list.hpp (added)
- thierry_delisle_PhD/code/readyQ_proto/relaxed_list_layout.cpp (moved from doc/theses/thierry_delisle_PhD/code/relaxed_list_layout.cpp)
- thierry_delisle_PhD/code/readyQ_proto/runperf.sh (added)
- thierry_delisle_PhD/code/readyQ_proto/scale.sh (moved from doc/theses/thierry_delisle_PhD/code/scale.sh)
- thierry_delisle_PhD/code/readyQ_proto/snzi-packed.hpp (added)
- thierry_delisle_PhD/code/readyQ_proto/snzi.hpp (added)
- thierry_delisle_PhD/code/readyQ_proto/snzm.hpp (added)
- thierry_delisle_PhD/code/readyQ_proto/utils.hpp (moved from doc/theses/thierry_delisle_PhD/code/utils.hpp) (3 diffs)
- thierry_delisle_PhD/code/readyQ_proto/work_stealing.hpp (added)
- thierry_delisle_PhD/code/relaxed_list.hpp (deleted)
- thierry_delisle_PhD/comp_II/Makefile (modified) (5 diffs)
- thierry_delisle_PhD/comp_II/comp_II.tex (modified) (5 diffs)
- thierry_delisle_PhD/comp_II/comp_II_PAB.tex (added)
- thierry_delisle_PhD/comp_II/img/base.fig (added)
- thierry_delisle_PhD/comp_II/img/empty.fig (added)
- thierry_delisle_PhD/comp_II/img/emptybit.fig (added)
- thierry_delisle_PhD/comp_II/img/emptytls.fig (added)
- thierry_delisle_PhD/comp_II/img/emptytree.fig (added)
- thierry_delisle_PhD/comp_II/img/resize.fig (added)
- thierry_delisle_PhD/comp_II/img/system.fig (added)
- thierry_delisle_PhD/comp_II/local.bib (modified) (2 diffs)
- thierry_delisle_PhD/comp_II/presentation.tex (added)
- thierry_delisle_PhD/comp_II/presentationstyle.sty (added)
- thierry_delisle_PhD/thesis/Makefile (added)
- thierry_delisle_PhD/thesis/fig/base.fig (added)
- thierry_delisle_PhD/thesis/fig/empty.fig (added)
- thierry_delisle_PhD/thesis/fig/emptybit.fig (added)
- thierry_delisle_PhD/thesis/fig/emptytls.fig (added)
- thierry_delisle_PhD/thesis/fig/emptytree.fig (added)
- thierry_delisle_PhD/thesis/fig/fairness.py (added)
- thierry_delisle_PhD/thesis/fig/resize.fig (added)
- thierry_delisle_PhD/thesis/fig/system.fig (added)
- thierry_delisle_PhD/thesis/glossary.tex (added)
- thierry_delisle_PhD/thesis/local.bib (added)
- thierry_delisle_PhD/thesis/text/core.tex (added)
- thierry_delisle_PhD/thesis/text/existing.tex (added)
- thierry_delisle_PhD/thesis/text/front.tex (added)
- thierry_delisle_PhD/thesis/text/intro.tex (added)
- thierry_delisle_PhD/thesis/text/io.tex (added)
- thierry_delisle_PhD/thesis/text/practice.tex (added)
- thierry_delisle_PhD/thesis/text/runtime.tex (added)
- thierry_delisle_PhD/thesis/thesis.tex (added)
doc/theses/thierry_delisle_PhD/.gitignore (r3c64c668 -> r58fe85a)

      comp_II/build/
    + comp_II/img/*.fig.bak
      comp_II/comp_II.pdf
      comp_II/comp_II.ps
    + comp_II/presentation.pdf
    +
    + thesis/build/
    + thesis/fig/*.fig.bak
    + thesis/thesis.pdf
    + thesis/thesis.ps

      !Makefile
doc/theses/thierry_delisle_PhD/code/readyQ_proto/relaxed_list.cpp
r3c64c668 r58fe85a 1 #include "relaxed_list.hpp" 1 #if !defined(LIST_VARIANT_HPP) 2 #define LIST_VARIANT_HPP "relaxed_list.hpp" 3 #endif 4 5 #include LIST_VARIANT_HPP 6 #if !defined(LIST_VARIANT) 7 #error not variant selected 8 #endif 2 9 3 10 #include <array> … … 35 42 36 43 template<> 37 thread_local relaxed_list<Node>::TLS relaxed_list<Node>::tls = {};44 thread_local LIST_VARIANT<Node>::TLS LIST_VARIANT<Node>::tls = {}; 38 45 39 46 template<> 40 relaxed_list<Node> * relaxed_list<Node>::head = nullptr;47 std::atomic_uint32_t LIST_VARIANT<Node>::ticket = { 0 }; 41 48 42 49 #ifndef NO_STATS 43 50 template<> 44 relaxed_list<Node>::GlobalStats relaxed_list<Node>::global_stats = {};51 LIST_VARIANT<Node>::GlobalStats LIST_VARIANT<Node>::global_stats = {}; 45 52 #endif 46 53 … … 57 64 size_t valmax = 0; 58 65 size_t valmin = 100000000ul; 66 struct { 67 size_t val = 0; 68 size_t cnt = 0; 69 } comp; 70 struct { 71 size_t val = 0; 72 size_t cnt = 0; 73 } subm; 59 74 }; 60 75 … … 67 82 std::atomic_size_t valmax = { 0 }; 68 83 std::atomic_size_t valmin = { 100000000ul }; 84 struct { 85 std::atomic_size_t val = { 0 }; 86 std::atomic_size_t cnt = { 0 }; 87 } comp; 88 struct { 89 std::atomic_size_t val = { 0 }; 90 std::atomic_size_t cnt = { 0 }; 91 } subm; 69 92 }; 70 93 … … 96 119 global.crc_out += local.crc_out; 97 120 121 global.comp.val += local.comp.val; 122 global.comp.cnt += local.comp.cnt; 123 global.subm.val += local.subm.val; 124 global.subm.cnt += local.subm.cnt; 125 98 126 atomic_max(global.valmax, local.valmax); 99 127 atomic_min(global.valmin, local.valmin); 100 128 101 relaxed_list<Node>::stats_tls_tally();129 LIST_VARIANT<Node>::stats_tls_tally(); 102 130 } 103 131 … … 106 134 auto before = Clock::now(); 107 135 barrier.wait(0); 136 bool is_tty = isatty(STDOUT_FILENO); 108 137 109 138 while(true) { … … 115 144 break; 116 145 } 117 std::cout << "\r" << std::setprecision(4) << durr.count(); 118 std::cout.flush(); 146 if(is_tty) { 147 std::cout << "\r" << std::setprecision(4) << durr.count(); 148 std::cout.flush(); 149 } 119 150 } 120 151 … … 159 190 auto dur_nano = duration_cast<std::nano>(1.0); 160 191 192 if(global.valmax != 0) { 193 std::cout << "Max runs : " << global.valmax << "\n"; 194 std::cout << "Min runs : " << global.valmin << "\n"; 195 } 196 if(global.comp.cnt != 0) { 197 std::cout << "Submit count : " << global.subm.cnt << "\n"; 198 std::cout << "Submit average: " << ((double(global.subm.val)) / global.subm.cnt) << "\n"; 199 std::cout << "Complete count: " << global.comp.cnt << "\n"; 200 std::cout << "Complete avg : " << ((double(global.comp.val)) / global.comp.cnt) << "\n"; 201 } 161 202 std::cout << "Duration : " << duration << "s\n"; 162 203 std::cout << "ns/Op : " << ( dur_nano / ops_thread )<< "\n"; … … 164 205 std::cout << "Ops/sec : " << ops_sec << "\n"; 165 206 std::cout << "Total ops : " << ops << "(" << global.in << "i, " << global.out << "o, " << global.empty << "e)\n"; 166 if(global.valmax != 0) {167 std::cout << "Max runs : " << global.valmax << "\n";168 std::cout << "Min runs : " << global.valmin << "\n";169 }170 207 #ifndef NO_STATS 171 relaxed_list<Node>::stats_print(std::cout);208 LIST_VARIANT<Node>::stats_print(std::cout); 172 209 #endif 173 210 } … … 186 223 unsigned nslots, 187 224 local_stat_t & local, 188 relaxed_list<Node> & list225 LIST_VARIANT<Node> & list 189 226 ) { 190 227 while(__builtin_expect(!done.load(std::memory_order_relaxed), true)) { … … 224 261 std::cout << "Initializing "; 225 262 size_t npushed = 0; 226 relaxed_list<Node> list = { 
nthread *nqueues };263 LIST_VARIANT<Node> list = { nthread, nqueues }; 227 264 { 228 265 Node** all_nodes[nthread]; … … 310 347 unsigned nnodes, 311 348 local_stat_t & local, 312 relaxed_list<Node> & list349 LIST_VARIANT<Node> & list 313 350 ) { 314 351 Node * nodes[nnodes]; … … 354 391 std::cout << "Initializing "; 355 392 // List being tested 356 relaxed_list<Node> list = { nthread *nqueues };393 LIST_VARIANT<Node> list = { nthread, nqueues }; 357 394 { 358 395 enable_stats = true; … … 395 432 396 433 enable_stats = false; 434 } 435 436 print_stats(duration, nthread, global); 437 } 438 439 // ================================================================================================ 440 struct __attribute__((aligned(64))) Slot { 441 Node * volatile node; 442 }; 443 444 __attribute__((noinline)) void runProducer_body( 445 std::atomic<bool>& done, 446 Random & rand, 447 Slot * slots, 448 int nslots, 449 local_stat_t & local, 450 LIST_VARIANT<Node> & list 451 ) { 452 while(__builtin_expect(!done.load(std::memory_order_relaxed), true)) { 453 454 Node * node = list.pop(); 455 if(!node) { 456 local.empty ++; 457 continue; 458 } 459 460 local.crc_out += node->value; 461 local.out++; 462 463 if(node->id == 0) { 464 unsigned cnt = 0; 465 for(int i = 0; i < nslots; i++) { 466 Node * found = __atomic_exchange_n( &slots[i].node, nullptr, __ATOMIC_SEQ_CST ); 467 if( found ) { 468 local.crc_in += found->value; 469 local.in++; 470 cnt++; 471 list.push( found ); 472 } 473 } 474 475 local.crc_in += node->value; 476 local.in++; 477 list.push( node ); 478 479 local.comp.cnt++; 480 local.comp.val += cnt; 481 } 482 else { 483 unsigned len = 0; 484 while(true) { 485 auto off = rand.next(); 486 for(int i = 0; i < nslots; i++) { 487 Node * expected = nullptr; 488 int idx = (i + off) % nslots; 489 Slot & slot = slots[ idx ]; 490 if( 491 slot.node == nullptr && 492 __atomic_compare_exchange_n( &slot.node, &expected, node, false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST ) 493 ) { 494 local.subm.cnt++; 495 local.subm.val += len; 496 goto LOOP; 497 } 498 assert( expected != node ); 499 len++; 500 } 501 } 502 } 503 504 LOOP:; 505 } 506 } 507 508 void runProducer(unsigned nthread, unsigned nqueues, double duration, unsigned nnodes) { 509 std::cout << "Producer Benchmark" << std::endl; 510 511 // Barrier for synchronization 512 barrier_t barrier(nthread + 1); 513 514 // Data to check everything is OK 515 global_stat_t global; 516 517 // Flag to signal termination 518 std::atomic_bool done = { false }; 519 520 std::cout << "Initializing "; 521 522 int nslots = nnodes * 4; 523 Slot * slots = new Slot[nslots]; 524 std::cout << nnodes << " nodes (" << nslots << " slots)" << std::endl; 525 526 // List being tested 527 LIST_VARIANT<Node> list = { nthread, nqueues }; 528 { 529 Random rand(rdtscl()); 530 for(unsigned i = 0; i < nnodes; i++) { 531 Node * node = new Node(rand.next() % 100); 532 node->id = i; 533 global.crc_in += node->value; 534 list.push(node); 535 } 536 537 for(int i = 0; i < nslots; i++) { 538 slots[i].node = nullptr; 539 } 540 } 541 542 { 543 enable_stats = true; 544 545 std::thread * threads[nthread]; 546 unsigned i = 1; 547 for(auto & t : threads) { 548 t = new std::thread([&done, &list, &barrier, &global, slots, nslots](unsigned tid) { 549 Random rand(tid + rdtscl()); 550 551 local_stat_t local; 552 barrier.wait(tid); 553 554 // EXPERIMENT START 555 556 runProducer_body(done, rand, slots, nslots, local, list); 557 558 // EXPERIMENT END 559 560 barrier.wait(tid); 561 562 tally_stats(global, local); 563 }, 
i++); 564 } 565 566 waitfor(duration, barrier, done); 567 568 for(auto t : threads) { 569 t->join(); 570 delete t; 571 } 572 573 enable_stats = false; 574 } 575 576 { 577 while(Node * node = list.pop()) { 578 global.crc_out += node->value; 579 delete node; 580 } 581 582 for(int i = 0; i < nslots; i++) { 583 delete slots[i].node; 584 } 585 586 delete [] slots; 397 587 } 398 588 … … 410 600 unsigned nnodes, 411 601 local_stat_t & local, 412 relaxed_list<Node> & list602 LIST_VARIANT<Node> & list 413 603 ) { 414 604 Node * nodes[nnodes]; … … 470 660 471 661 // List being tested 472 relaxed_list<Node> list = { nthread *nqueues };662 LIST_VARIANT<Node> list = { nthread, nqueues }; 473 663 { 474 664 enable_stats = true; … … 521 711 print_stats(duration, nthread, global); 522 712 523 save_fairness(data_out.get(), 100, nthread, width, length, output);713 // save_fairness(data_out.get(), 100, nthread, width, length, output); 524 714 } 525 715 … … 547 737 Churn, 548 738 PingPong, 739 Producer, 549 740 Fairness, 550 741 NONE … … 577 768 case PingPong: 578 769 nnodes = 1; 579 nslots = 1;580 770 switch(argc - optind) { 581 771 case 0: break; … … 591 781 break; 592 782 default: 593 std::cerr << "'PingPong' benchmark doesn't accept more than 2 extra arguments" << std::endl; 783 std::cerr << "'PingPong' benchmark doesn't accept more than 1 extra arguments" << std::endl; 784 goto usage; 785 } 786 break; 787 case Producer: 788 nnodes = 32; 789 switch(argc - optind) { 790 case 0: break; 791 case 1: 792 try { 793 arg = optarg = argv[optind]; 794 nnodes = stoul(optarg, &len); 795 if(len != arg.size()) { throw std::invalid_argument(""); } 796 } catch(std::invalid_argument &) { 797 std::cerr << "Number of nodes must be a positive integer, was " << arg << std::endl; 798 goto usage; 799 } 800 break; 801 default: 802 std::cerr << "'Producer' benchmark doesn't accept more than 1 extra arguments" << std::endl; 594 803 goto usage; 595 804 } … … 662 871 break; 663 872 } 873 if(iequals(arg, "producer")) { 874 benchmark = Producer; 875 break; 876 } 664 877 if(iequals(arg, "fairness")) { 665 878 benchmark = Fairness; … … 702 915 std::cerr << "Usage: " << argv[0] << ": [options] -b churn [NNODES] [NSLOTS = NNODES]" << std::endl; 703 916 std::cerr << " or: " << argv[0] << ": [options] -b pingpong [NNODES]" << std::endl; 917 std::cerr << " or: " << argv[0] << ": [options] -b producer [NNODES]" << std::endl; 704 918 std::cerr << std::endl; 705 919 std::cerr << " -d, --duration=DURATION Duration of the experiment, in seconds" << std::endl; … … 714 928 715 929 std::cout << "Running " << nthreads << " threads (" << (nthreads * nqueues) << " queues) for " << duration << " seconds" << std::endl; 930 std::cout << "Relaxed list variant: " << LIST_VARIANT<Node>::name() << std::endl; 716 931 switch(benchmark) { 717 932 case Churn: … … 720 935 case PingPong: 721 936 runPingPong(nthreads, nqueues, duration, nnodes); 937 break; 938 case Producer: 939 runProducer(nthreads, nqueues, duration, nnodes); 722 940 break; 723 941 case Fairness: … … 801 1019 } 802 1020 803 void save_fairness(const int data[], int factor, unsigned nthreads, size_t columns, size_t rows, const std::string & output) {804 std::ofstream os(output);805 os << "<html>\n";806 os << "<head>\n";807 os << "<style>\n";808 os << "</style>\n";809 os << "</head>\n";810 os << "<body>\n";811 os << "<table style=\"width=100%\">\n";812 813 size_t idx = 0;814 for(size_t r = 0ul; r < rows; r++) {815 os << "<tr>\n";816 for(size_t c = 0ul; c < columns; c++) {817 os << "<td class=\"custom 
custom" << data[idx] << "\"></td>\n";818 idx++;819 }820 os << "</tr>\n";821 }822 823 os << "</table>\n";824 os << "</body>\n";825 os << "</html>\n";826 os << std::endl;827 }828 829 #include <png.h>830 #include <setjmp.h>1021 // void save_fairness(const int data[], int factor, unsigned nthreads, size_t columns, size_t rows, const std::string & output) { 1022 // std::ofstream os(output); 1023 // os << "<html>\n"; 1024 // os << "<head>\n"; 1025 // os << "<style>\n"; 1026 // os << "</style>\n"; 1027 // os << "</head>\n"; 1028 // os << "<body>\n"; 1029 // os << "<table style=\"width=100%\">\n"; 1030 1031 // size_t idx = 0; 1032 // for(size_t r = 0ul; r < rows; r++) { 1033 // os << "<tr>\n"; 1034 // for(size_t c = 0ul; c < columns; c++) { 1035 // os << "<td class=\"custom custom" << data[idx] << "\"></td>\n"; 1036 // idx++; 1037 // } 1038 // os << "</tr>\n"; 1039 // } 1040 1041 // os << "</table>\n"; 1042 // os << "</body>\n"; 1043 // os << "</html>\n"; 1044 // os << std::endl; 1045 // } 1046 1047 // #include <png.h> 1048 // #include <setjmp.h> 831 1049 832 1050 /* -
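The benchmark driver above now picks the ready-queue implementation at compile time: LIST_VARIANT_HPP names the header to include (defaulting to "relaxed_list.hpp") and LIST_VARIANT names the list template, presumably defined either on the compiler command line or inside the selected header. The relaxed_list.hpp implementation itself is not shown in this changeset view, so the following is only a hedged, mutex-based sketch (requiring C++17) of the structure this driver exercises: an array of strictly FIFO sub-queues, push onto one random sub-queue recording a timestamp, pop from the older head of two randomly probed sub-queues, as described in the comp_II.tex changes below. The class name, locking scheme and timestamp type are assumptions, and the statistics hooks the driver also touches (tls, ticket, global_stats, stats_tls_tally, stats_print) are omitted for brevity.

// relaxed_fifo_sketch.hpp -- illustrative sketch only, NOT the relaxed_list.hpp
// from this changeset.
#pragma once
#include <chrono>
#include <cstdint>
#include <deque>
#include <mutex>
#include <random>
#include <utility>
#include <vector>

template<typename node_t>
class relaxed_fifo_sketch {
	struct alignas(64) subqueue {                          // one cacheline per sub-queue to limit false sharing
		std::mutex lock;
		std::deque<std::pair<node_t *, uint64_t>> items;   // node + push timestamp
	};
	std::vector<subqueue> queues;

	static uint64_t now() {
		return std::chrono::steady_clock::now().time_since_epoch().count();
	}
	size_t pick() {                                        // cheap per-thread random index
		static thread_local std::minstd_rand rng(std::random_device{}());
		return rng() % queues.size();
	}
public:
	relaxed_fifo_sketch(unsigned nthread, unsigned nqueues) : queues(nthread * nqueues) {}

	void push(node_t * node) {                             // enqueue onto one random sub-queue
		subqueue & q = queues[pick()];
		std::lock_guard<std::mutex> g(q.lock);
		q.items.emplace_back(node, now());
	}

	node_t * pop() {                                       // take the oldest head of two random sub-queues
		size_t i = pick(), j = pick();
		if(i == j) return pop_one(queues[i]);
		std::scoped_lock g(queues[i].lock, queues[j].lock);   // scoped_lock avoids deadlock
		subqueue * victim = &queues[i];
		if(victim->items.empty() ||
		   (!queues[j].items.empty() &&
		    queues[j].items.front().second < victim->items.front().second))
			victim = &queues[j];
		if(victim->items.empty()) return nullptr;          // both probes empty: caller retries or idles
		node_t * node = victim->items.front().first;
		victim->items.pop_front();
		return node;
	}

	static const char * name() { return "relaxed FIFO sketch (mutex sub-queues)"; }

private:
	node_t * pop_one(subqueue & q) {
		std::lock_guard<std::mutex> g(q.lock);
		if(q.items.empty()) return nullptr;
		node_t * node = q.items.front().first;
		q.items.pop_front();
		return node;
	}
};

A variant shaped like this would presumably be compiled against the unchanged driver by pointing LIST_VARIANT_HPP at the header and LIST_VARIANT at the class name; the real prototypes in readyQ_proto/ (relaxed_list.hpp, work_stealing.hpp, snzi*.hpp) appear to provide the same surface plus the statistics members.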
doc/theses/thierry_delisle_PhD/code/readyQ_proto/utils.hpp
r3c64c668 r58fe85a 35 35 }; 36 36 37 // class Random { 38 // private: 39 // unsigned int seed; 40 // public: 41 // Random(int seed) { 42 // this->seed = seed; 43 // } 44 45 // /** returns pseudorandom x satisfying 0 <= x < n. **/ 46 // unsigned int next() { 47 // seed ^= seed << 6; 48 // seed ^= seed >> 21; 49 // seed ^= seed << 7; 50 // return seed; 51 // } 52 // }; 53 54 constexpr uint64_t extendedEuclidY(uint64_t a, uint64_t b); 55 constexpr uint64_t extendedEuclidX(uint64_t a, uint64_t b){ 56 return (b==0) ? 1 : extendedEuclidY(b, a - b * (a / b)); 57 } 58 constexpr uint64_t extendedEuclidY(uint64_t a, uint64_t b){ 59 return (b==0) ? 0 : extendedEuclidX(b, a - b * (a / b)) - (a / b) * extendedEuclidY(b, a - b * (a / b)); 60 } 61 37 62 class Random { 38 63 private: 39 unsigned int seed; 64 uint64_t x; 65 66 static constexpr const uint64_t M = 1ul << 48ul; 67 static constexpr const uint64_t A = 25214903917; 68 static constexpr const uint64_t C = 11; 69 static constexpr const uint64_t D = 16; 70 40 71 public: 41 Random(int seed) { 42 this->seed = seed; 72 static constexpr const uint64_t m = M; 73 static constexpr const uint64_t a = A; 74 static constexpr const uint64_t c = C; 75 static constexpr const uint64_t d = D; 76 static constexpr const uint64_t ai = extendedEuclidX(A, M); 77 public: 78 Random(unsigned int seed) { 79 this->x = seed * a; 43 80 } 44 81 45 82 /** returns pseudorandom x satisfying 0 <= x < n. **/ 46 83 unsigned int next() { 47 seed ^= seed << 6; 48 seed ^= seed >> 21; 49 seed ^= seed << 7; 50 return seed; 51 } 84 //nextx = (a * x + c) % m; 85 x = (A * x + C) & (M - 1); 86 return x >> D; 87 } 88 unsigned int prev() { 89 //prevx = (ainverse * (x - c)) mod m 90 unsigned int r = x >> D; 91 x = ai * (x - C) & (M - 1); 92 return r; 93 } 94 95 void set_raw_state(uint64_t _x) { 96 this->x = _x; 97 } 98 99 uint64_t get_raw_state() { 100 return this->x; 101 } 52 102 }; 53 103 … … 106 156 } 107 157 158 static inline unsigned rand_bit(unsigned rnum, size_t mask) __attribute__((artificial)); 108 159 static inline unsigned rand_bit(unsigned rnum, size_t mask) { 109 160 unsigned bit = mask ? rnum % __builtin_popcountl(mask) : 0; … … 143 194 #endif 144 195 } 196 197 struct spinlock_t { 198 std::atomic_bool ll = { false }; 199 200 inline void lock() { 201 while( __builtin_expect(ll.exchange(true),false) ) { 202 while(ll.load(std::memory_order_relaxed)) 203 asm volatile("pause"); 204 } 205 } 206 207 inline bool try_lock() { 208 return false == ll.exchange(true); 209 } 210 211 inline void unlock() { 212 ll.store(false, std::memory_order_release); 213 } 214 215 inline explicit operator bool() { 216 return ll.load(std::memory_order_relaxed); 217 } 218 }; 219 220 static inline bool bts(std::atomic_size_t & target, size_t bit ) { 221 //* 222 int result = 0; 223 asm volatile( 224 "LOCK btsq %[bit], %[target]\n\t" 225 :"=@ccc" (result) 226 : [target] "m" (target), [bit] "r" (bit) 227 ); 228 return result != 0; 229 /*/ 230 size_t mask = 1ul << bit; 231 size_t ret = target.fetch_or(mask, std::memory_order_relaxed); 232 return (ret & mask) != 0; 233 //*/ 234 } 235 236 static inline bool btr(std::atomic_size_t & target, size_t bit ) { 237 //* 238 int result = 0; 239 asm volatile( 240 "LOCK btrq %[bit], %[target]\n\t" 241 :"=@ccc" (result) 242 : [target] "m" (target), [bit] "r" (bit) 243 ); 244 return result != 0; 245 /*/ 246 size_t mask = 1ul << bit; 247 size_t ret = target.fetch_and(~mask, std::memory_order_relaxed); 248 return (ret & mask) != 0; 249 //*/ 250 } -
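The rewritten Random class above swaps the previous xorshift generator for a 48-bit linear congruential generator (the classic drand48 constants A = 25214903917, C = 11, M = 2^48) and gains a prev() operation: because the multiplier is odd, it has a modular inverse modulo 2^48, which the changeset computes at compile time with the extended Euclidean algorithm, so the state can be stepped backwards exactly. The snippet below is a self-contained sketch (not changeset code) that checks this reversibility with the same constants, using a short Newton iteration for the inverse instead of extended Euclid.

// Self-contained sketch of the reversible 48-bit LCG used by the new Random class.
#include <cassert>
#include <cstdint>
#include <cstdio>

static const uint64_t M = 1ull << 48, A = 25214903917ull, C = 11ull;

// Modular inverse of an odd multiplier modulo 2^48 via Newton iteration
// (the changeset derives the same value with the extended Euclidean algorithm).
static uint64_t inverse(uint64_t a) {
	uint64_t inv = 1;
	for(int i = 0; i < 6; i++) inv *= 2 - a * inv;   // doubles the number of correct low bits each step
	return inv & (M - 1);
}

int main() {
	const uint64_t Ai = inverse(A);
	assert(((A * Ai) & (M - 1)) == 1);               // Ai really is A^-1 mod 2^48

	uint64_t x = (42 * A) & (M - 1);                 // seeding as in the new class
	const uint64_t saved = x;

	for(int i = 0; i < 1000; i++) x = (A * x + C) & (M - 1);     // next(): step forward
	for(int i = 0; i < 1000; i++) x = (Ai * (x - C)) & (M - 1);  // prev(): step backward

	assert(x == saved);                              // the state is restored exactly
	printf("the 48-bit LCG is exactly reversible\n");
	return 0;
}

Since next() returns the top bits of the new state and prev() returns them before rewinding, a sequence drawn with next() can be replayed in reverse order by the same number of prev() calls.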
doc/theses/thierry_delisle_PhD/comp_II/Makefile
r3c64c668 r58fe85a 2 2 3 3 Build = build 4 Figures = figures4 Figures = img 5 5 Macros = ../../../LaTeXmacros 6 6 TeXLIB = .:${Macros}:${Build}:../../../bibliography: … … 12 12 13 13 ## Define the text source files. 14 15 SOURCES = ${addsuffix .tex, \16 comp_II \17 }18 19 14 FIGURES = ${addsuffix .tex, \ 15 emptybit \ 16 emptytree \ 17 emptytls \ 18 resize \ 20 19 } 21 20 22 21 PICTURES = ${addsuffix .pstex, \ 22 base \ 23 empty \ 24 system \ 23 25 } 24 26 … … 30 32 31 33 ## Define the documents that need to be made. 34 all: comp_II.pdf presentation.pdf 35 comp_II.pdf: ${FIGURES} ${PICTURES} 36 presentation.pdf: presentationstyle.sty base.dark.pstex empty.dark.pstex system.dark.pstex 32 37 33 DOCUMENT = comp_II.pdf 38 DOCUMENT = comp_II.pdf presentation.pdf 34 39 BASE = ${basename ${DOCUMENT}} 35 40 … … 45 50 # File Dependencies # 46 51 47 ${DOCUMENT} : ${BASE}.ps 52 %.pdf : build/%.ps | ${Build} 48 53 ps2pdf $< 49 54 50 ${BASE}.ps : ${BASE}.dvi 51 dvips $ {Build}/$< -o $@55 build/%.ps : build/%.dvi | ${Build} 56 dvips $< -o $@ 52 57 53 ${BASE}.dvi : Makefile ${GRAPHS} ${PROGRAMS} ${PICTURES} ${FIGURES} ${SOURCES} \ 54 ${Macros}/common.tex ${Macros}/indexstyle ../../../bibliography/pl.bib \ 55 local.bib glossary.tex | ${Build} 58 build/%.dvi : %.tex Makefile | ${Build} 56 59 # Must have *.aux file containing citations for bibtex 57 if [ ! -r ${basename $@}.aux ] ; then ${LaTeX} $ {basename $@}.tex; fi58 -${BibTeX} ${ Build}/${basename $@}60 if [ ! -r ${basename $@}.aux ] ; then ${LaTeX} $< ; fi 61 -${BibTeX} ${basename $@} 59 62 # Some citations reference others so run again to resolve these citations 60 ${LaTeX} $ {basename $@}.tex61 -${BibTeX} ${ Build}/${basename $@}63 ${LaTeX} $< 64 -${BibTeX} ${basename $@} 62 65 # Make index from *.aux entries and input index at end of document 63 makeglossaries -q -s ${Build}/${basename $@}.ist ${Build}/${basename $@}66 -makeglossaries -q -s ${basename $@}.ist ${basename $@} 64 67 # Run again to finish citations 65 ${LaTeX} $ {basename $@}.tex68 ${LaTeX} $< 66 69 67 70 ## Define the default recipes. … … 70 73 mkdir -p ${Build} 71 74 72 %.tex : %.fig${Build}75 %.tex : img/%.fig | ${Build} 73 76 fig2dev -L eepic $< > ${Build}/$@ 74 77 75 %.ps : %.fig | ${Build}78 %.ps : img/%.fig | ${Build} 76 79 fig2dev -L ps $< > ${Build}/$@ 77 80 78 %.pstex : %.fig | ${Build}81 %.pstex : img/%.fig | ${Build} 79 82 fig2dev -L pstex $< > ${Build}/$@ 83 fig2dev -L pstex_t -p ${Build}/$@ $< > ${Build}/$@_t 84 85 ## pstex with inverted colors 86 %.dark.pstex : img/%.fig Makefile | ${Build} 87 fig2dev -L pstex $< > ${Build}/$@ 88 sed -i 's/\/col-1 {0 setgray} bind def/\/col-1 {1 setgray} bind def/g' ${Build}/$@ 89 sed -i 's/\/col0 {0.000 0.000 0.000 srgb} bind def/\/col0 {1.000 1.000 1.000 srgb} bind def/g' ${Build}/$@ 90 sed -i 's/\/col7 {1.000 1.000 1.000 srgb} bind def/\/col7 {0.000 0.000 0.000 srgb} bind def/g' ${Build}/$@ 80 91 fig2dev -L pstex_t -p ${Build}/$@ $< > ${Build}/$@_t 81 92 -
doc/theses/thierry_delisle_PhD/comp_II/comp_II.tex
r3c64c668 r58fe85a 1 \documentclass[11pt,fullpage]{article} 1 \documentclass[11pt]{article} 2 \usepackage{fullpage} 2 3 \usepackage[T1]{fontenc} 3 4 \usepackage[utf8]{inputenc} 4 \usepackage{listings} % for code listings5 5 \usepackage{xspace} 6 6 \usepackage{xcolor} 7 7 \usepackage{graphicx} 8 \usepackage[hidelinks]{hyperref} 8 \usepackage{epic,eepic} 9 \usepackage{listings} % for code listings 9 10 \usepackage{glossaries} 10 11 \usepackage{textcomp} 11 \usepackage{geometry}12 13 12 % cfa macros used in the document 14 13 \input{common} 14 15 \setlist{topsep=6pt,parsep=0pt} % global reduce spacing between points 16 \newcommand{\uC}{$\mu$\CC} 17 \usepackage[hidelinks]{hyperref} 18 \setlength{\abovecaptionskip}{5pt plus 3pt minus 2pt} 19 \lstMakeShortInline$% % single-character for \lstinline 20 %\usepackage[margin=1in]{geometry} 21 %\usepackage{float} 22 15 23 \input{glossary} 16 24 … … 24 32 25 33 \author{ 26 \huge Thierry Delisle \ \27 \Large \ vspace*{0.1in} \texttt{tdelisle@uwaterloo.ca} \\34 \huge Thierry Delisle \vspace*{5pt} \\ 35 \Large \texttt{tdelisle@uwaterloo.ca} \vspace*{5pt} \\ 28 36 \Large Cheriton School of Computer Science \\ 29 37 \Large University of Waterloo … … 36 44 \begin{document} 37 45 \maketitle 46 \thispagestyle{empty} 38 47 \cleardoublepage 39 48 40 49 \newcommand{\cit}{\textsuperscript{[Citation Needed]}\xspace} 41 \newcommand{\TODO}{ ~\newline{\large\bf\color{red} TODO :}\xspace}50 \newcommand{\TODO}{{\large\bf\color{red} TODO: }\xspace} 42 51 43 52 % =============================================================================== … … 51 60 \section{Introduction} 52 61 \subsection{\CFA and the \CFA concurrency package} 53 \CFA\cit is a modern, polymorphic, non-object-oriented, backwards-compatible extension of the C programming language. It aims to add high productivity features while maintaning the predictible performance of C. As such concurrency in \CFA\cit aims to offer simple and safe high-level tools while still allowing performant code. Concurrent code is written in the syncrhonous programming paradigm but uses \glspl{uthrd} in order to achieve the simplicity and maintainability of synchronous programming without sacrificing the efficiency of asynchronous programing. As such the \CFA scheduler is a user-level scheduler that maps \glspl{uthrd} onto \glspl{kthrd}. 54 55 The goal of this research is to produce a scheduler that is simple to use and offers acceptable performance in all cases. Here simplicity does not refer to the API but to how much scheduling concerns programmers need to take into account when using the \CFA concurrency package. Therefore, the main goal of this proposal is as follows : 62 \CFA~\cite{Moss18} is a modern, polymorphic, non-object-oriented, concurrent, backwards-compatible extension of the C programming language. 63 It aims to add high-productivity features while maintaining the predictable performance of C. 64 As such, concurrency in \CFA~\cite{Delisle19} aims to offer simple and safe high-level tools while still allowing performant code. 65 \CFA concurrent code is written in the synchronous programming paradigm but uses \glspl{uthrd} to achieve the simplicity and maintainability of synchronous programming without sacrificing the efficiency of asynchronous programming. 66 As such, the \CFA \newterm{scheduler} is a preemptive user-level scheduler that maps \glspl{uthrd} onto \glspl{kthrd}. 
67 68 \subsection{Scheduling} 69 \newterm{Scheduling} occurs when execution switches from one thread to another, where the second thread is implicitly chosen by the scheduler. 70 This scheduling is an indirect handoff, as opposed to generators and coroutines that explicitly switch to the next generator and coroutine respectively. 71 The cost of switching between two threads for an indirect handoff has two components: 72 \begin{enumerate} 73 \item 74 the cost of actually context-switching, \ie changing the relevant registers to move execution from one thread to the other, 75 \item 76 and the cost of scheduling, \ie deciding which thread to run next among all the threads ready to run. 77 \end{enumerate} 78 The first cost is generally constant\footnote{Affecting the constant context-switch cost is whether it is done in one step, where the first thread schedules the second, or in two steps, where the first thread context switches to a third scheduler thread.}, while the scheduling cost can vary based on the system state. 79 Adding multiple \glspl{kthrd} does not fundamentally change the scheduler semantics or requirements, it simply adds new correctness requirements, \ie \newterm{linearizability}\footnote{Meaning however fast the CPU threads run, there is an equivalent sequential order that gives the same result.}, and a new dimension to performance: scalability, where scheduling cost also depends on contention. 80 The more threads switch, the more the administration cost of scheduling becomes noticeable. 81 It is therefore important to build a scheduler with the lowest possible cost and latency. 82 Another important consideration is \newterm{fairness}. 83 In principle, scheduling should give the illusion of perfect fairness, where all threads ready to run are running \emph{simultaneously}. 84 In practice, there can be advantages to unfair scheduling, similar to the express cash register at a grocery store. 85 While the illusion of simultaneity is easier to reason about, it can break down if the scheduler allows too much unfairness. 86 Therefore, the scheduler should offer as much fairness as needed to guarantee eventual progress, but use unfairness to help performance. 87 88 \subsection{Research Goal} 89 The goal of this research is to produce a scheduler that is simple for programmers to understand and offers good general performance. 90 Here understandability does not refer to the API but to how much scheduling concerns programmers need to take into account when writing a \CFA concurrent package. 91 Therefore, the main consequence of this goal is : 56 92 \begin{quote} 57 The \CFA scheduler should be \emph{viable} for anyworkload.93 The \CFA scheduler should be \emph{viable} for \emph{any} workload. 58 94 \end{quote} 59 95 60 This objective includes producing a scheduling strategy with minimal fairness guarantees, creating an abstraction layer over the operating system to handle kernel-threads spinning unnecessarily and hide blocking I/O operations and, writing sufficient library tools to allow developpers to properly use the scheduler. 61 62 % =============================================================================== 63 % =============================================================================== 64 65 \section{Scheduling for \CFA} 66 While the \CFA concurrency package doesn't have any particular scheduling needs beyond those of any concurrency package which uses \glspl{uthrd}, it is important that the default \CFA Scheduler be viable in general. 
Indeed, since the \CFA Scheduler does not target any specific workloads, it is unrealistic to demand that it use the best scheduling strategy in all cases. However, it should offer a viable ``out of the box'' solution for most scheduling problems so that programmers can quickly write performant concurrent without needed to think about which scheduling strategy is more appropriate for their workload. Indeed, only programmers with exceptionnaly high performance requirements should need to write their own scheduler. More specifically, two broad types of schedulering strategies should be avoided in order to avoid penalizing certain types of workloads : feedback-based and priority schedulers. 96 For a general-purpose scheduler, it is impossible to produce an optimal algorithm as that requires knowledge of the future behaviour of threads. 97 As such, scheduling performance is generally either defined by a best-case scenario, \ie a workload to which the scheduler is tailored, or a worst-case scenario, \ie the scheduler behaves no worse than \emph{X}. 98 For this proposal, the performance is evaluated using the second approach to allow \CFA programmers to rely on scheduling performance. 99 Because there is no optimal scheduler, ultimately \CFA may allow programmers to write their own scheduler; but that is not the subject of this proposal, which considers only the default scheduler. 100 As such, it is important that only programmers with exceptionally high performance requirements should need to write their own scheduler and replace the scheduler in this proposal. 101 102 To achieve the \CFA scheduling goal includes: 103 \begin{enumerate} 104 \item producing a scheduling strategy with sufficient fairness guarantees, 105 \item creating an abstraction layer over the operating system to handle kernel-threads spinning unnecessarily, 106 \item scheduling blocking I/O operations, 107 \item and writing sufficient library tools to allow developers to indirectly use the scheduler, either through tuning knobs in the default scheduler or replacing the default scheduler. 108 \end{enumerate} 109 110 % =============================================================================== 111 % =============================================================================== 112 113 \section{\CFA Scheduling} 114 To schedule user-level threads across all workloads, the scheduler has a number of requirements: 115 116 \paragraph{Correctness} As with any other concurrent data structure or algorithm, the correctness requirement is paramount. 117 The scheduler cannot allow threads to be dropped from the ready queue, \ie scheduled but never run, or be executed multiple times when only being scheduled once. 118 Since \CFA concurrency has no spurious wake up, this definition of correctness also means the scheduler should have no spurious wake up. 119 The \CFA scheduler must be correct. 120 121 \paragraph{Performance} The performance of a scheduler can generally be measured in terms of scheduling cost, scalability and latency. 122 \newterm{Scheduling cost} is the cost to switch from one thread to another, as mentioned above. 123 For compute-bound concurrent applications with little context switching, the scheduling cost is negligible. 124 For applications with high context-switch rates, scheduling cost can begin to dominating the cost. 125 \newterm{Scalability} is the cost of adding multiple kernel threads. 
126 It can increase the time for scheduling because of contention from the multiple threads accessing shared resources, \eg a single ready queue. 127 Finally, \newterm{tail latency} is service delay and relates to thread fairness. 128 Specifically, latency measures how long a thread waits to run once scheduled and is evaluated by the worst case. 129 The \CFA scheduler should offer good performance for all three metrics. 130 131 \paragraph{Fairness} Like performance, this requirement has several aspects : eventual progress, predictability and performance reliability. 132 \newterm{Eventual progress} guarantees every scheduled thread is eventually run, \ie prevent starvation. 133 As a hard requirement, the \CFA scheduler must guarantee eventual progress, otherwise the above-mentioned illusion of simultaneous execution is broken and the scheduler becomes much more complex to reason about. 134 \newterm{Predictability} and \newterm{reliability} mean similar workloads achieve similar performance so programmer execution intuition is respected. 135 For example, a thread that yields aggressively should not run more often than other threads. 136 While this is intuitive, it does not hold true for many work-stealing or feedback based schedulers. 137 The \CFA scheduler must guarantee eventual progress, should be predictable, and offer reliable performance. 138 139 \paragraph{Efficiency} Finally, efficient usage of CPU resources is also an important requirement and is discussed in depth towards the end of the proposal. 140 \newterm{Efficiency} means avoiding using CPU cycles when there are no threads to run (to conserve energy), and conversely, using as many available CPU cycles when the workload can benefit from it. 141 Balancing these two states is where the complexity lies. 142 The \CFA scheduler should be efficient with respect to the underlying (shared) computer. 143 144 \bigskip To achieve these requirements, I can reject two broad types of scheduling strategies : feedback-based and priority schedulers. 67 145 68 146 \subsection{Feedback-Based Schedulers} 69 Many operating systems use schedulers based on feadback loops in some form, they measure how much CPU a particular thread has used\footnote{Different metrics can be used to here but it is not relevant to the discussion.} and schedule threads based on this metric. These strategies are sensible for operating systems but rely on two assumptions on the workload : 70 71 \begin{enumerate} 72 \item Threads live long enough to be scheduled many times. 73 \item Cooperation among all threads is not simply infeasible, it is a security risk. 74 \end{enumerate} 75 76 While these two assumptions generally hold for operating systems, they may not for \CFA programs. In fact, \CFA uses \glspl{uthrd} which have the explicit goal of reducing the cost of threading primitives to allow many smaller threads. This can naturally lead to have threads with much shorter lifetime and only being scheduled a few times. Scheduling strategies based on feadback loops cannot be effective in these cases because they will not have the opportunity to measure the metrics that underlay the algorithm. Note that the problem of feadback loop convergence (reacting too slowly to scheduling events) is not specific to short lived threads but can also occur with threads that show drastic changes in scheduling event, e.g., threads running for long periods of time and then suddenly blocking and unblocking quickly and repeatedly. 
77 78 In the context of operating systems, these concerns can be overshadowed by a more pressing concern : security. When multiple users are involved, it is possible that some users are malevolent and try to exploit the scheduling strategy in order to achieve some nefarious objective. Security concerns mean that more precise and robust fairness metrics must be used. In the case of the \CFA scheduler, every thread runs in the same user-space and are controlled from the same user. It is then possible to safely ignore the possibility that threads are malevolent and assume that all threads will ignore or cooperate with each other. This allows for a much simpler fairness metric and in this proposal ``fairness'' will be considered as equal opportunities to run once scheduled. 79 80 Since feadback is not necessarily feasible within the lifetime of all threads and a simple fairness metric can be used, the scheduling strategy proposed for the \CFA runtime does not user per-threads feedback. Feedback loops in general are not rejected for secondary concerns like idle sleep, but no feedback loop is used to decide which thread to run next. 147 Many operating systems use schedulers based on feedback in some form, \eg measuring how much CPU a particular thread has used\footnote{Different metrics can be measured but it is not relevant to the discussion.} and schedule threads based on this metric. 148 These strategies are sensible for operating systems but rely on two assumptions for the workload: 149 150 \begin{enumerate} 151 \item Threads live long enough for useful feedback information to be gathered. 152 \item Threads belong to multiple users so fairness across users is important. 153 \end{enumerate} 154 155 While these two assumptions generally hold for operating systems, they may not for user-level threading. 156 Since \CFA has the explicit goal of allowing many smaller threads, this can naturally lead to threads with much shorter lifetimes that are only scheduled a few times. 157 Scheduling strategies based on feedback cannot be effective in these cases because there is no opportunity to measure the metrics that underlie the algorithm. 158 Note, the problem of \newterm{feedback convergence} (reacting too slowly to scheduling events) is not specific to short-lived threads but can also occur with threads that show drastic changes in scheduling, \eg threads running for long periods of time and then suddenly blocking and unblocking quickly and repeatedly. 159 160 In the context of operating systems, these concerns can be overshadowed by a more pressing concern : security. 161 When multiple users are involved, it is possible some users are malevolent and try to exploit the scheduling strategy to achieve some nefarious objective. 162 Security concerns mean more precise and robust fairness metrics must be used to guarantee fairness across processes created by users as well as threads created within a process. 163 In the case of the \CFA scheduler, every thread runs in the same user space and is controlled by the same user. 164 Fairness across users is therefore a given and it is then possible to safely ignore the possibility that threads are malevolent. 
165 This approach allows for a much simpler fairness metric, and in this proposal, \emph{fairness} is defined as: 166 \begin{quote} 167 When multiple threads are cycling through the system, the total ordering of threads being scheduled, \ie pushed onto the ready queue, should not differ much from the total ordering of threads being executed, \ie popped from the ready queue. 168 \end{quote} 169 170 Since feedback is not necessarily feasible within the lifetime of all threads and a simple fairness metric can be used, the scheduling strategy proposed for the \CFA runtime does not use per-threads feedback. 171 Feedback in general is not rejected for secondary concerns like idle sleep for kernel threads, but no feedback is used to decide which thread to run next. 81 172 82 173 \subsection{Priority Schedulers} 83 Another broad category of schedulers are priority schedulers. In these scheduling strategies threads have priorities and the runtime schedules the threads with the highest priority before scheduling other threads. Threads with equal priority are scheduled using a secondary strategy, often something simple like round-robin or FIFO. These priority mean that, as long as there is a thread with a higher priority that desires to run, a thread with a lower priority will not run. This possible starving of threads can dramatically increase programming complexity since starving threads and priority inversion (prioritising a lower priority thread) can both lead to serious problems, leaving programmers between a rock and a hard place. 84 85 An important observation to make is that threads do not need to have explicit priorities for problems to be possible. Indeed, any system with multiple ready-queues and attempts to exhaust one queue before accessing the other queues, could encounter starvation problems. A popular scheduling strategy that suffers from implicit priorities is work-stealing. Work-stealing is generally presented as follows : 86 87 \begin{itemize} 88 \item Each processor has a list of threads. 89 \end{itemize} 90 \begin{enumerate} 91 \item Run threads from ``this'' processor's list. 92 \item If ``this'' processor's list is empty, run threads from some other processor's list. 93 \end{enumerate} 94 95 In a loaded system\footnote{A loaded system is a system where threads are being run at the same rate they are scheduled}, if a thread does not yield or block for an extended period of time, threads on the same processor list will starve if no other processors can exhaust their list. 96 97 Since priorities can be complex to handle for programmers, the scheduling strategy proposed for the \CFA runtime does not use a strategy with either implicit or explicit thread priorities. 98 99 \subsection{Schedulers without feadback or priorities} 100 I claim that the ideal default scheduler for the \CFA runtime is a scheduler that offers good scalability and a simple fairness guarantee that is easy for programmers to reason about. The simplest fairness guarantee is to guarantee FIFO ordering, i.e., threads scheduled first will run first. However, enforcing FIFO ordering generally conflicts with scalability across multiple processors because of the additionnal synchronization. Thankfully, strict FIFO is not needed for scheduling. 
Since concurrency is inherently non-deterministic, fairness concerns in scheduling are only a problem if a thread repeatedly runs before another thread can run\footnote{This is because the non-determinism means that programmers must already handle ordering problems in order to produce correct code and already must rely on weak guarantees, for example that a specific thread will \emph{eventually} run.}. This need for unfairness to persist before problems occur means that the FIFO fairness guarantee can be significantly relaxed without causing problems. For this proposal, the target guarantee is that the \CFA scheduler guarantees \emph{probable} FIFO ordering, which is defined as follows : 101 \begin{itemize} 102 \item Given two threads $X$ and $Y$, the odds that thread $X$ runs $N$ times \emph{after} thread $Y$ is scheduled but \emph{before} it is run, decreases exponentially with regards to $N$. 103 \end{itemize} 104 105 While this is not a strong guarantee, the probability that problems persist for long period of times decreases exponentially, making persisting problems virtually impossible. 106 107 \subsection{Real-Time} 108 While the objective of this proposed scheduler is similar to the objective of real-time scheduling, this proposal is not a proposal for real-time scheduler and as such makes no attempt to offer either soft or hard guarantees on scheduling delays. 109 110 % =============================================================================== 111 % =============================================================================== 112 \section{Proposal} 113 114 \subsection{Ready-Queue} 115 Using trevor's paper\cit as basis, it is simple to build a relaxed FIFO list that is fast and scalable for loaded or overloaded systems. The described queue uses an array of underlying strictly FIFO queue. Pushing new data is done by selecting one of these underlying queues at random, recording a timestamp for the push and pushing to the selected queue. Popping is done by selecting two queues at random and popping from the queue for which the head has the oldest timestamp. In loaded or overloaded systems, it is higly likely that the queues is far from empty, e.i., several tasks are on each of the underlying queues. This means that selecting a queue at random to pop from is higly likely to yield a queue that is not empty. 116 117 When the ready queue is "more empty", i.e., several of the inner queues are empty, selecting a random queue for popping is less likely to yield a valid selection and more attempts need to be made, resulting in a performance degradation. In cases, with few elements on the ready queue and few processors running, performance can be improved by adding information to help processors find which inner queues are used. Preliminary performance tests indicate that with few processors, a bitmask can be used to identify which inner queues are currently in use. This is especially effective in the single-thread case, where the bitmask will always be up-to-date. Furthermore, modern x86 CPUs have a BMI2 extension which allow using the bitmask with very little overhead over directly accessing the readyqueue offerring decent performance even in cases with many empty inner queues. This technique does not solve the problem completely, it randomly attempts to find a block of 64 queues where at least one is used, instead of attempting to find a used queue. For systems with a large number of cores this does not completely solve the problem, but it is a fixed improvement. 
The size of the blocks are limited by the maximum size atomic instruction can operate on, therefore atomic instructions on large words would increase the 64 queues per block limit. 118 119 \TODO double check the next sentence 120 Preliminary result indicate that the bitmask approach with the BMI2 extension can lead to multi-threaded performance that is contention agnostic in the worst case. 121 This result suggests that the contention penalty and the increase performance for additionnal thread cancel each other exactly. This may indicate that a relatively small reduction in contention may tip the performance into positive scalling even for the worst case. It can be noted that in cases of high-contention, the use of the bitmask to find queues that are not empty is much less reliable. Indeed, if contention on the bitmask is high, it means it probably changes significantly between the moment it is read and the actual operation on the queues it represents. Furthermore, the objective of the bitmask is to avoid probing queues that are empty. Therefore, in cases where the bitmask is highly contented, it may be preferrable to probe queues randomly, either until contention decreases or until a prior prefetch of the bitmask completes. Ideally, the scheduler would be able to observe that the bitmask is highly contented and adjust its behaviour appropriately. However, I am not aware of any mechanism to query whether a cacheline is in cache or to run other instructions until a cacheline is fetch without blocking on the cacheline. As such, an alternative that may have a similar impact would be for each thread to have their own bitmask, which would be updated both after each scheduler action and after a certain number of failed probing. If the bitmask has little contention, the local bitmask will be mostly up-to-date and several threads won't need to contend as much on the global bitmask. If the bitmask has significant contention, then fetching it becomes more expensive and threads may as well probe randomly. This solution claims that probing randomly or against an out-of-date bitmask is equivalent. 122 123 In cases where this is insufficient, another approach is to use a hiearchical data structure. Creating a tree of nodes to reduce contention has been shown to work in similar cases\cit(SNZI: Scalable NonZero Indicators)\footnote{This particular paper seems to be patented in the US. How does that affect \CFA? Can I use it in my work?}. However, this approach may lead to poorer single-threaded performance due to the inherent pointer chasing, as such, it was not considered as the first approach but as a fallback in case the bitmask approach does not satisfy the performance goals. 124 125 Part of this performance relies on contention being low when there are few threads on the readyqueue. However, this can be assumed reliably if the system handles putting idle processors to sleep, which is addressed in section \ref{sleep}. 174 Another broad category of schedulers are priority schedulers. 175 In these scheduling strategies, threads have priorities and the runtime schedules the threads with the highest priority before scheduling other threads. 176 Threads with equal priority are scheduled using a secondary strategy, often something simple like round robin or FIFO. 177 A consequence of priority is that, as long as there is a thread with a higher priority that desires to run, a thread with a lower priority does not run. 
178 The potential for thread starvation dramatically increases programming complexity since starving threads and priority inversion (prioritizing a lower priority thread) can both lead to serious problems. 179 180 An important observation is that threads do not need to have explicit priorities for problems to occur. 181 Indeed, any system with multiple ready queues that attempts to exhaust one queue before accessing the other queues, essentially provides implicit priority, which can encounter starvation problems. 182 For example, a popular scheduling strategy that suffers from implicit priorities is work stealing. 183 \newterm{Work stealing} is generally presented as follows: 184 \begin{enumerate} 185 \item Each processor has a list of ready threads. 186 \item Each processor runs threads from its ready queue first. 187 \item If a processor's ready queue is empty, attempt to run threads from some other processor's ready queue. 188 \end{enumerate} 189 In a loaded system\footnote{A \newterm{loaded system} is a system where threads are being run at the same rate they are scheduled.}, if a thread does not yield, block, or preempt for an extended period of time, threads on the same processor's list starve if no other processors exhaust their list. 190 191 Since priorities can be complex for programmers to incorporate into their execution intuition, the \CFA scheduling strategy does not provided explicit priorities and attempts to eliminate implicit priorities. 192 193 \subsection{Schedulers without feedback or priorities} 194 This proposal conjectures that it is possible to construct a default scheduler for the \CFA runtime that offers good scalability and a simple fairness guarantee that is easy for programmers to reason about. 195 The simplest fairness guarantee is FIFO ordering, \ie threads scheduled first run first. 196 However, enforcing FIFO ordering generally conflicts with scalability across multiple processors because of the additional synchronization. 197 Thankfully, strict FIFO is not needed for sufficient fairness. 198 Since concurrency is inherently non-deterministic, fairness concerns in scheduling are only a problem if a thread repeatedly runs before another thread can run. 199 Some relaxation is possible because non-determinism means programmers already handle ordering problems to produce correct code and hence rely on weak guarantees, \eg that a thread \emph{eventually} runs. 200 Since some reordering does not break correctness, the FIFO fairness guarantee can be significantly relaxed without causing problems. 201 For this proposal, the target guarantee is that the \CFA scheduler provides \emph{probable} FIFO ordering, which allows reordering but makes it improbable that threads are reordered far from their position in total ordering. 202 203 The \CFA scheduler fairness is defined as follows: 204 \begin{quote} 205 Given two threads $X$ and $Y$, the odds that thread $X$ runs $N$ times \emph{after} thread $Y$ is scheduled but \emph{before} it is run, decreases exponentially with regard to $N$. 206 \end{quote} 207 While this is not a bounded guarantee, the probability that unfairness persist for long periods of times decreases exponentially, making persisting unfairness virtually impossible. 
208 209 % =============================================================================== 210 % =============================================================================== 211 \section{Proposal Details} 212 213 \subsection{Central Ready Queue} \label{sec:queue} 214 A central ready queue can be built from a FIFO queue, where user threads are pushed onto the queue when they are ready to run, and processors (kernel-threads acting as virtual processors) pop the user threads from the queue and execute them. 215 Alistarh \etal~\cite{alistarh2018relaxed} show it is straightforward to build a relaxed FIFO list that is fast and scalable for loaded or overloaded systems. 216 The described queue uses an array of underlying strictly FIFO queues as shown in Figure~\ref{fig:base}\footnote{For this section, the number of underlying queues is assumed to be constant. 217 Section~\ref{sec:resize} discusses resizing the array.}. 218 Pushing new data is done by selecting one of the underlying queues at random, recording a timestamp for the operation, and pushing to the selected queue. 219 Popping is done by selecting two queues at random and popping from the queue with the oldest timestamp. 220 A higher number of underlying queues leads to less contention on each queue and therefore better performance. 221 In a loaded system, it is highly likely the queues are non-empty, \ie several threads are on each of the underlying queues. 222 For this case, selecting a queue at random to pop from is highly likely to yield a queue with available items. 223 In Figure~\ref{fig:base}, ignoring the ellipsis, the chances of getting an empty queue is 2/7 per pick, meaning two random picks yield an item approximately 9 times out of 10. 224 225 \begin{figure} 226 \begin{center} 227 \input{base.pstex_t} 228 \end{center} 229 \caption{Loaded relaxed FIFO list base on an array of strictly FIFO lists. 230 A timestamp appears in each node and array cell.} 231 \label{fig:base} 232 \end{figure} 233 234 \begin{figure} 235 \begin{center} 236 \input{empty.pstex_t} 237 \end{center} 238 \caption{Underloaded relaxed FIFO list where the array contains many empty cells.} 239 \label{fig:empty} 240 \end{figure} 241 242 In an underloaded system, several of the queues are empty, so selecting a random queue for popping is less likely to yield a successful selection and more attempts are needed, resulting in a performance degradation. 243 Figure~\ref{fig:empty} shows an example with fewer elements, where the chances of getting an empty queue is 5/7 per pick, meaning two random picks yield an item only half the time. 244 Since the ready queue is not empty, the pop operation \emph{must} find an element before returning and therefore must retry. 245 Note, the popping kernel thread has no work to do, but CPU cycles are wasted both for available user and kernel threads during the pop operation as the popping thread is using a CPU. 246 Overall performance is therefore influenced by the contention on the underlying queues and pop performance is influenced by the item density. 247 248 This leads to four performance cases for the centralized ready queue, as depicted in Table~\ref{tab:perfcases}. 249 The number of processors (many or few) refers to the number of kernel threads \emph{actively} attempting to pop user threads from the queues, not the total number of kernel threads. 250 The number of threads (many or few) refers to the number of user threads ready to be run. 
248 This leads to four performance cases for the centralized ready queue, as depicted in Table~\ref{tab:perfcases}. 249 The number of processors (many or few) refers to the number of kernel threads \emph{actively} attempting to pop user threads from the queues, not the total number of kernel threads. 250 The number of threads (many or few) refers to the number of user threads ready to be run. 251 Many threads means the threads significantly outnumber the processors and most underlying queues have items; few threads means there are barely more threads than processors and most underlying queues are empty. 252 Cases with fewer threads than processors are discussed in Section~\ref{sec:sleep}. 253 254 \begin{table} 255 \begin{center} 256 \begin{tabular}{|r|l|l|} 257 \cline{2-3} 258 \multicolumn{1}{r|}{} & \multicolumn{1}{c|}{Many Processors} & \multicolumn{1}{c|}{Few Processors} \\ 259 \hline 260 Many Threads & A: good performance & B: good performance \\ 261 \hline 262 Few Threads & C: worst performance & D: poor performance \\ 263 \hline 264 \end{tabular} 265 \end{center} 266 \caption{Expected performance of the relaxed FIFO list in different cases.} 267 \label{tab:perfcases} 268 \end{table} 269 270 Performance can be improved in Table~\ref{tab:perfcases} case~D by adding information to help processors find which inner queues are used. 271 This addition aims to avoid the cost of retrying the pop operation but does not affect contention on the underlying queues and can incur some management cost for both push and pop operations. 272 The approach used to encode this information can vary in density and be either global or local. 273 \newterm{Density} means the information is either packed in a few cachelines or spread across several cachelines, and \newterm{local information} means each thread uses an independent copy instead of a single global, \ie common, source of information. 274 275 For example, Figure~\ref{fig:emptybit} shows a dense bitmask to identify which inner queues are currently in use. 276 This approach means processors can often find user threads in constant time, regardless of how many underlying queues are empty. 277 Furthermore, modern x86 CPUs have extended bit manipulation instructions (BMI2) that allow using the bitmask with very little overhead compared to the randomized selection approach for a filled ready queue, offering good performance even in cases with many empty inner queues. 278 However, this technique has its limits: with a single word\footnote{Word refers here to however many bits can be written atomically.} bitmask, the total number of underlying queues in the ready queue is limited to the number of bits in the word. 279 With a multi-word bitmask, this maximum limit can be increased arbitrarily, but it is not possible to check if the queue is empty by reading the bitmask atomically. 280 281 Finally, a dense bitmap, either single or multi-word, causes additional problems in Table~\ref{tab:perfcases} case C, because many processors are continuously scanning the bitmask to find the few available threads. 282 This increased contention on the bitmask(s) reduces performance because of cache misses after updates, and because the scanning processors racing to read and/or update that information increase the update frequency of the bitmask. 283 This increased update frequency means the information in the bitmask is more often stale before a processor can use it to find an item, \ie a mask read says there are available user threads but none are found on the selected queue.
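As a concrete illustration of the single-word bitmask in Figure~\ref{fig:emptybit} (again a sketch with invented names, not the prototype), the mask can be maintained with atomic or/and operations and consulted with population-count and count-trailing-zeros instructions.
\begin{verbatim}
// Illustrative sketch: single-word bitmask marking which underlying queues are non-empty.
// All names are invented for exposition; the actual prototype may differ.
#include <stdatomic.h>
#include <stdlib.h>

#define NSUB 64                                   // at most one bit per underlying queue

static _Atomic unsigned long long mask;           // bit i set => sub-queue i likely has items

void mark_nonempty( unsigned i ) { atomic_fetch_or ( &mask, 1ull << i ); }
void mark_empty   ( unsigned i ) { atomic_fetch_and( &mask, ~(1ull << i) ); }

// Return the index of a candidate non-empty sub-queue, or -1 if the mask reads as empty.
// The snapshot can be stale, so the caller must still handle a failed pop on that queue.
int find_candidate(void) {
    unsigned long long snap = atomic_load( &mask );
    if ( snap == 0 ) return -1;
    int nth = rand() % __builtin_popcountll( snap );   // pick a random set bit to spread contention
    // (with BMI2, _tzcnt_u64(_pdep_u64(1ull << nth, snap)) selects the nth set bit without a loop)
    for ( ;; ) {
        int bit = __builtin_ctzll( snap );              // lowest remaining set bit
        if ( nth == 0 ) return bit;
        nth -= 1;
        snap &= snap - 1;                               // clear that bit and continue
    }
}
\end{verbatim}
The snapshot read from the mask is only a hint for selecting a queue: a racing pop may already have emptied the chosen queue, so the caller must still tolerate a failed pop, which is the staleness problem described above.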
284 285 \begin{figure} 286 \begin{center} 287 {\resizebox{0.73\textwidth}{!}{\input{emptybit}}} 288 \end{center} 289 \vspace*{-5pt} 290 \caption{Underloaded queue with added bitmask to indicate which array cells have items.} 291 \label{fig:emptybit} 292 \begin{center} 293 {\resizebox{0.73\textwidth}{!}{\input{emptytree}}} 294 \end{center} 295 \vspace*{-5pt} 296 \caption{Underloaded queue with added binary search tree to indicate which array cells have items.} 297 \label{fig:emptytree} 298 \begin{center} 299 {\resizebox{0.9\textwidth}{!}{\input{emptytls}}} 300 \end{center} 301 \vspace*{-5pt} 302 \caption{Underloaded queue with added per-processor bitmask to indicate which array cells have items.} 303 \label{fig:emptytls} 304 \end{figure} 305 306 Figure~\ref{fig:emptytree} shows an approach using a hierarchical tree data-structure to reduce contention, which has been shown to work in similar cases~\cite{ellen2007snzi}. 307 However, this approach may lead to poorer performance in Table~\ref{tab:perfcases} case~B due to the inherent pointer-chasing cost and the already low contention in that case. 308 309 Figure~\ref{fig:emptytls} shows an approach using dense information, similar to the bitmap, but has each thread keep its own independent copy of it. 310 While this approach can offer good scalability \emph{and} low latency, the freshness of the information can become a problem. 311 In the simple cases, local copies can become stale and end up not being useful for the pop operation. 312 A more serious problem is that reliable information is necessary for some parts of this algorithm to be correct. 313 As mentioned in this section, processors must know \emph{reliably} whether the list is empty or not to decide if they can return \texttt{NULL} or if they must keep looking during a pop operation. 314 Section~\ref{sec:sleep} discusses another case where reliable information is required for the algorithm to be correct. 315 316 There is a fundamental tradeoff among these approaches. 317 Dense global information about empty underlying queues helps zero-contention cases at the cost of the high-contention case. 318 Sparse global information helps high-contention cases but increases latency in zero-contention cases to read and ``aggregate'' the information\footnote{Hierarchical structures, \eg binary search tree, effectively aggregate information but follow pointer chains, learning information at each node. 319 Similarly, other sparse schemes need to read multiple cachelines to acquire all the information needed.}. 320 Finally, dense local information has both the advantages of low latency in zero-contention cases and scalability in high-contention cases. 321 However, the information can become stale, making it difficult to use when correctness must be ensured. 322 The fact that these solutions have fundamental limits suggests to me that a better solution exists, one that combines these properties in an interesting way. 323 Also, the lock discussed in Section~\ref{sec:resize} allows for solutions that adapt to the number of processors, which could also prove useful. 126 324 127 325 \paragraph{Objectives and Existing Work} 128 How much scalability is actually needed is highly debatable; libfibre\cit has compared favorably to other schedulers in webserver tests\cit and uses a single atomic counter in its scheduling algorithm, similarly to the proposed bitmask. As such, the single atomic instruction on a shared cacheline may be sufficiently performant.
129 130 I have built a prototype of this ready-queue (including the bitmask and BMI2 usage, but not the sharded bitmask) and ran performance experiments on it, but it is difficult to compare this prototype to a thread scheduler as the prototype is used as a data-queue. I have also integrated this prototype into the \CFA runtime, but have not yet created performance experiments to compare results. I believe that the bitmask approach is currently one of the larger risks of the proposal; early tests lead me to believe it may work, but it is not clear that the contention problem can be overcome. The worst-case scenario is a case where the number of processors and the number of ready threads are similar, yet scheduling events are very frequent. Fewer threads should lead to the Idle Sleep mechanism reducing contention, while having many threads ready leads to optimal performance. It is difficult to evaluate the likelihood of this worst-case scenario in real workloads. I believe frequent scheduling events suggest a more ``bursty'' workload where new work is finely divided among many threads which race to completion. This type of workload would only see a peak of contention close to the end of the work, but no sustained contention. Very fine-grained pipelines are less ``bursty''; these may lead to more sustained contention. However, they could also easily benefit from a direct hand-off strategy which would circumvent the problem entirely. 131 132 \subsection{Dynamic Resizing} 133 The \CFA runtime system currently handles dynamically adding and removing processors from clusters at any time. Since this is part of the existing design, the proposed scheduler must also support this behaviour. However, dynamically resizing the clusters is considered a rare event associated with setup, teardown and major configuration changes. This assumption is made both in the design of the proposed scheduler as well as in the original design of the \CFA runtime system. As such, the proposed scheduler must honor the correctness of this behaviour but does not have any performance objectives with regard to resizing a cluster. How long adding or removing processors takes and how much this disrupts the performance of other threads is considered a secondary concern since it should be amortized over long periods of time. This description effectively matches the description of a Reader-Writer lock: infrequent but invasive updates among frequent (mostly) read operations. In the case of the Ready-Queue described above, read operations are operations that push or pop from the ready-queue but do not invalidate any references to the ready queue data structures. Writes, on the other hand, would add or remove inner queues, invalidating references to the array of inner queues in the process. Therefore, the current proposed approach to this problem is to add a per-cluster Reader-Writer lock around the ready queue to prevent restructuring of the ready-queue data structure while threads are being pushed or popped. 134 135 There are possible alternatives to the Reader-Writer lock solution. This problem is effectively a memory reclamation problem and as such there is a large body of research on the subject. However, the RWlock solution is simple and can be leveraged to solve other problems (e.g. processor ordering and memory reclamation of threads) which makes it an attractive solution. 326 327 How much scalability is actually needed is highly debatable.
328 \emph{libfibre}~\cite{libfibre} has compared favourably to other schedulers in webserver tests~\cite{Karsten20} and uses a single atomic counter in its scheduling algorithm similarly to the proposed bitmask. 329 As such, the single atomic instruction on a shared cacheline may be sufficiently performant. 330 331 I have built a prototype of this ready queue in the shape of a data queue, \ie nodes on the queue are structures with a single $int$ representing a thread and intrusive data fields. 332 Using this prototype, preliminary performance experiments confirm the expected performance in Table~\ref{tab:perfcases}. 333 However, these experiments only offer a hint at the actual performance of the scheduler since threads are involved in more complex operations, \eg threads are not independent of each other: when a thread blocks some other thread must intervene to wake it. 334 335 I have also integrated this prototype into the \CFA runtime, but have not yet created performance experiments to compare results, as creating one-to-one comparisons between the prototype and the \CFA runtime will be complex. 336 337 \subsection{Dynamic Resizing} \label{sec:resize} 338 339 \begin{figure} 340 \begin{center} 341 \input{system.pstex_t} 342 \end{center} 343 \caption{Global structure of the \CFA runtime system.} 344 \label{fig:system} 345 \end{figure} 346 347 The \CFA runtime system groups processors together as \newterm{clusters}, as shown in Figure~\ref{fig:system}. 348 Threads on a cluster are always scheduled on one of the processors of the cluster. 349 Currently, the runtime handles dynamically adding and removing processors from clusters at any time. 350 Since this feature is part of the existing design, the proposed scheduler must also support this behaviour. 351 However, dynamically resizing a cluster is considered a rare event associated with setup, teardown, and major configuration changes. 352 This assumption is made both in the design of the proposed scheduler as well as in the original design of the \CFA runtime system. 353 As such, the proposed scheduler must honour the correctness of this behaviour but does not have any performance objectives with regard to resizing a cluster. 354 That is, the time to add or remove processors and how much this disrupts the performance of other threads is considered a secondary concern since it should be amortized over long periods of time. 355 However, as mentioned in Section~\ref{sec:queue}, contention on the underlying queues can have a direct impact on performance. 356 The number of underlying queues must therefore be adjusted as the number of processors grows or shrinks. 357 Since the underlying queues are stored in a dense array, changing the number of queues requires resizing the array, and expanding the array requires moving it, which can introduce memory reclamation problems if not done correctly. 358 359 \begin{figure} 360 \begin{center} 361 \input{resize} 362 \end{center} 363 \caption{Copy of data structure shown in Figure~\ref{fig:base}.} 364 \label{fig:base2} 365 \end{figure} 366 367 It is important to note how the array is used in this case. 368 While the array cells are modified by every push and pop operation, the array itself, \ie the pointer that would change when resized, is only read during these operations. 369 Therefore the use of this pointer can be described as frequent reads and infrequent writes. 370 This description effectively matches the description of a reader-writer lock: infrequent but invasive updates among frequent read operations. 371 In the case of the ready queue described above, read operations are operations that push or pop from the ready queue but do not invalidate any references to the ready queue data structures. 372 Writes, on the other hand, would add or remove inner queues, invalidating references to the array of inner queues in the process. 373 Therefore, the current proposed approach to this problem is to add a per-cluster reader-writer lock around the ready queue to prevent restructuring of the ready-queue data-structure while threads are being pushed or popped. 374
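A minimal sketch of how such a per-cluster lock could guard the array is shown below; the names are invented and a pthread reader-writer lock stands in for the custom lock mentioned later, purely to keep the example self-contained.
\begin{verbatim}
// Illustrative sketch: per-cluster reader-writer lock protecting the array of sub-queues.
// Invented names; pthread_rwlock_t stands in for the custom lock described in the text,
// and pthread_rwlock_init(&cl->resize_lock, NULL) is assumed to have been called.
#include <pthread.h>
#include <stdlib.h>

typedef struct { pthread_mutex_t lock; void * head, * tail; unsigned long long ts; } subqueue; // as earlier
typedef struct thread_desc thread_desc;                      // opaque here
extern void subqueue_push( subqueue * q, thread_desc * t );  // push as in the earlier sketch

typedef struct {
    pthread_rwlock_t resize_lock;
    subqueue * sub;                   // dense array of underlying FIFO queues
    unsigned nsub;                    // current number of underlying queues
} cluster;

// Push (and pop, not shown) are "readers": they use the array but never change the pointer.
void ready_push( cluster * cl, thread_desc * t ) {
    pthread_rwlock_rdlock( &cl->resize_lock );
    subqueue_push( &cl->sub[ rand() % cl->nsub ], t );
    pthread_rwlock_unlock( &cl->resize_lock );
}

// Adding or removing processors is a "writer": it may reallocate and move the array.
void cluster_resize( cluster * cl, unsigned new_nsub ) {
    pthread_rwlock_wrlock( &cl->resize_lock );     // excludes all concurrent push/pop
    cl->sub = realloc( cl->sub, new_nsub * sizeof(subqueue) );
    // ... initialize any new cells or redistribute items from removed cells ...
    cl->nsub = new_nsub;
    pthread_rwlock_unlock( &cl->resize_lock );
}
\end{verbatim}
The write lock makes the reallocation-and-move safe because no push or pop can hold a reference to the old array while it is replaced; the engineering challenge, discussed below, is making the read side cheap enough that it does not itself become a bottleneck.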
375 There are possible alternatives to the reader-writer lock solution. 376 This problem is effectively a memory reclamation problem and as such there is a large body of research on the subject~\cite{brown2015reclaiming, michael2004hazard}. 377 However, the reader-writer lock solution is simple and can be leveraged to solve other problems (\eg processor ordering and memory reclamation of threads), which makes it an attractive solution. 136 378 137 379 \paragraph{Objectives and Existing Work} 138 The lock must offer scalability and performance on par with the actual ready-queue in order not to introduce a new bottleneck. I have already built a lock that fits the desired requirements and preliminary testing shows scalability and performance that exceed the target. As such, I do not consider this lock to be a risk for this project. 139 140 \subsection{Idle Sleep} \label{sleep} 141 As mentioned above, idle sleep is the process of putting processors to sleep while they do not have threads to execute. In this context, processors are kernel-threads and sleeping refers to asking the kernel to block a thread. This can be achieved with either thread synchronization operations like pthread\_cond\_wait or using signal operations like sigsuspend. 142 143 Support for idle sleep broadly involves calling the operating system to block the kernel thread, but also handling the race between going to sleep and being woken up, and handling which kernel thread should sleep or wake up. 144 145 When a processor decides to sleep, there is a race that occurs between it signalling that it will go to sleep (so other processors can find sleeping processors) and actually blocking the kernel thread. This is equivalent to the classic problem of missing signals when using condition variables: the ``sleepy'' processor indicates that it will sleep but has not yet gone to sleep; if another processor attempts to wake it up, the waking-up operation may claim nothing needs to be done and the signal will have been missed. In cases where threads are scheduled from processors on the current cluster, losing signals is not necessarily critical, because at least some processors on the cluster are awake. Individual processors always finish scheduling threads before looking for new work, which means that the last processor to go to sleep cannot miss threads scheduled from inside the cluster (if they do, that demonstrates the ready-queue is not linearizable). However, this guarantee does not hold if threads are scheduled from outside the cluster, either due to an external event like timers and I/O, or due to a thread migrating from a different cluster.
In this case, missed signals can lead to the cluster deadlocking where it should not\footnote{Clusters ``should'' never deadlock, but for this proposal, a case where a \CFA user \emph{actually} writes \CFA code that leads to a deadlock is considered a deadlock that ``should'' happen.}. Therefore, it is important that the scheduling of threads include a mechanism where signals \emph{cannot} be missed. For performance reasons, it can be advantageous to have a secondary mechanism that allows signals to be missed in cases where it cannot lead to a deadlock. To be safe, this process must include a ``handshake'' where it is guaranteed that either~: the sleepy processor notices that a thread was scheduled after it signalled its intent to block, or code scheduling threads will see the intent to sleep before scheduling and be able to wake up the processor. This matter is complicated by the fact that pthread offers few tools to implement this solution and offers no guarantee of ordering of threads waking up for most of these tools. 146 147 Another issue is trying to avoid kernel threads sleeping and waking frequently. A possible partial solution is to order the processors so that the one which most recently went to sleep is woken up. This allows other sleeping processors to reach deeper sleep states (when these are available) while keeping ``hot'' processors warmer. Note that while this generally means organising the processors in a stack, I believe that the unique index provided by the Reader-Writer lock can be reused to strictly order the waking of processors, causing a LIFO-like waking order. While a strict LIFO stack is probably better, using the processor index could prove useful and offer a sufficiently LIFO ordering. 148 149 Finally, another important aspect of Idle Sleep is when processors should make the decision to sleep and when it is appropriate for sleeping processors to be woken up. Processors that are unnecessarily awake lead to unnecessary contention and power consumption, while too many sleeping processors can lead to sub-optimal throughput. Furthermore, transitions from sleeping to awake and vice-versa also add unnecessary latency. There is already a wealth of research on the subject and I do not plan to implement a novel idea for the Idle Sleep heuristic in this project. 380 The lock must offer scalability and performance on par with the actual ready queue in order not to introduce a new bottleneck. 381 I have already built a lock that fits the desired requirements and preliminary testing shows scalability and performance that exceed the target. 382 As such, I do not consider this lock to be a risk for this project. 383 384 \subsection{Idle Sleep} \label{sec:sleep} 385 386 \newterm{Idle sleep} is the process of putting processors to sleep when they have no threads to execute. 387 In this context, processors are kernel threads and sleeping refers to asking the kernel to block a thread. 388 This operation can be achieved with either thread synchronization operations like $pthread_cond_wait$ or using signal operations like $sigsuspend$.
389 The goals of putting idle processors to sleep are: 390 \begin{enumerate} 391 \item 392 reduce contention on the ready queue, since the otherwise idle processors generally contend trying to pop items from the queue, 393 \item 394 give back unneeded CPU time associated with a process to other user processes executing on the computer, 395 \item 396 and reduce energy consumption in cases where more idle kernel-threads translate into idle CPUs, which can cycle down. 397 \end{enumerate} 398 Support for idle sleep broadly involves calling the operating system to block the kernel thread, handling the race between the blocking thread and the waking thread, and handling which kernel thread should sleep or wake up. 399 400 When a processor decides to sleep, there is a race that occurs between it signalling that it is going to sleep (so other processors can find sleeping processors) and actually blocking the kernel thread. 401 This operation is equivalent to the classic problem of missing signals when using condition variables: the ``sleepy'' processor indicates its intention to block but has not yet gone to sleep when another processor attempts to wake it up. 402 The waking-up operation sees the blocked process and signals it, but the blocking process is racing to sleep so the signal is missed. 403 In cases where kernel threads are managed as processors on the current cluster, losing signals is not necessarily critical, because at least some processors on the cluster are awake and eventually check the ready queue for more threads. 404 Individual processors always finish scheduling user threads before looking for new work, which means that the last processor to go to sleep cannot miss threads scheduled from inside the cluster (if they do, that demonstrates the ready queue is not linearizable). 405 However, this guarantee does not hold if threads are scheduled from outside the cluster, either due to an external event like timers and I/O, or due to a user (or kernel) thread migrating from a different cluster. 406 In this case, missed signals can lead to the cluster deadlocking\footnote{Clusters should only deadlock in cases where a \CFA programmer \emph{actually} writes \CFA code that leads to a deadlock.}. 407 Therefore, it is important that the scheduling of threads include a mechanism where signals \emph{cannot} be missed. 408 For performance reasons, it can be advantageous to have a secondary mechanism that allows signals to be missed in cases where it cannot lead to a deadlock. 409 To be safe, this process must include a ``handshake'' where it is guaranteed that either: 410 \begin{enumerate} 411 \item 412 the sleeping processor notices that a user thread is scheduled after the sleeping processor signalled its intent to block, or 413 \item 414 code scheduling threads sees the intent to sleep before scheduling and is able to wake up the processor. 415 \end{enumerate} 416 This matter is complicated by the fact that pthreads and Linux offer few tools to implement this solution and no guarantee of ordering of threads waking up for most of these tools. 417
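The following sketch shows one way such a handshake could be structured, using a per-processor flag, a mutex, and a condition variable; all names are invented and the pthread primitives are used only to keep the example self-contained, not because they are the mechanism the \CFA runtime must use.
\begin{verbatim}
// Illustrative handshake sketch (invented names): the sleepy processor re-checks the
// ready queue *after* publishing its intent to sleep, and the scheduling code publishes
// the new thread *before* checking for sleepers, so at least one side observes the other.
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;            // assumed initialized with PTHREAD_MUTEX_INITIALIZER
    pthread_cond_t  wake;            // assumed initialized with PTHREAD_COND_INITIALIZER
    bool intent_to_sleep;
} processor;

extern bool ready_queue_nonempty(void);      // provided by the ready queue

void try_sleep( processor * p ) {
    pthread_mutex_lock( &p->lock );
    p->intent_to_sleep = true;               // 1. publish intent to sleep
    if ( ! ready_queue_nonempty() ) {        // 2. re-check for work scheduled before the intent
        pthread_cond_wait( &p->wake, &p->lock );  // 3. block; a push after step 1 wakes us below
        // a spurious wake-up is benign: the caller simply rechecks the ready queue
    }
    p->intent_to_sleep = false;
    pthread_mutex_unlock( &p->lock );
}

// Called by scheduling code *after* the new thread is already visible on the ready queue.
void notify_after_push( processor * p ) {
    pthread_mutex_lock( &p->lock );
    if ( p->intent_to_sleep ) {
        pthread_cond_signal( &p->wake );     // the intent was published, so the signal is not lost
    }
    pthread_mutex_unlock( &p->lock );
}
\end{verbatim}
This is only the safe path of the handshake; as noted above, a cheaper secondary mechanism that tolerates missed signals can be layered on top for cases where a miss cannot cause a deadlock.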
418 Another important issue is avoiding kernel threads sleeping and waking frequently because there is a significant operating-system cost. 419 This scenario happens when a program oscillates between high and low activity, needing most and then few processors. 420 A possible partial solution is to order the processors so that the one which most recently went to sleep is woken up. 421 This allows other sleeping processors to reach deeper sleep states (when these are available) while keeping ``hot'' processors warmer. 422 Note that while this generally means organizing the processors in a stack, I believe that the unique index provided in my reader-writer lock can be reused to strictly order the waking processors, causing a mostly LIFO order. 423 While a strict LIFO stack is probably better, the processor index could prove useful for other reasons, while still offering a sufficiently LIFO ordering. 424 425 A final important aspect of idle sleep is when processors should make the decision to sleep and when it is appropriate for sleeping processors to be woken up. 426 Processors that are unnecessarily unblocked lead to unnecessary contention, CPU usage, and power consumption, while too many sleeping processors can lead to suboptimal throughput. 427 Furthermore, transitions from sleeping to awake and vice versa also add unnecessary latency. 428 There is already a wealth of research on the subject~\cite{schillings1996engineering, wiki:thunderherd} and I may use an existing approach for the idle-sleep heuristic in this project, \eg~\cite{Karsten20}. 150 429 151 430 \subsection{Asynchronous I/O} 152 The final aspect of this proposal is asynchronous I/O. Without it, user threads that execute I/O operations will block the underlying kernel thread. This leads to poor throughput; it would be preferable to block the user-thread and reuse the underlying kernel-thread to run other ready threads. This requires intercepting the user-threads' calls to I/O operations, redirecting them to an asynchronous I/O interface, and handling the multiplexing between the synchronous and asynchronous API. As such, these are the three components needed to implement support for asynchronous I/O: an OS abstraction layer over the asynchronous interface, an event-engine to (de)multiplex the operations, and a synchronous interface for users to use. None of these components currently exist in \CFA and I will need to build all three for this project. 153 154 \paragraph{OS Abstraction} 155 One fundamental part of this work is converting blocking I/O operations into non-blocking ones. This relies on having an underlying asynchronous I/O interface to which to direct the I/O operations. While there exist many different APIs for asynchronous I/O, it is not part of this proposal to create a novel API, simply to use an existing one that is sufficient. uC++ uses \texttt{select} as its interface, which handles pipes and sockets. It entails significant complexity and has performance problems, which makes it a less interesting alternative. Another interface which has recently become popular\cit is \texttt{epoll}. However, epoll also does not handle the file system and seems to have problems with Linux pipes and \texttt{TTY}s\cit. A very recent alternative that must still be investigated is \texttt{io\_uring}. It claims to address some of the issues with \texttt{epoll} but is too recent to be confident that it does. Finally, a popular cross-platform alternative is \texttt{libuv}, which offers asynchronous sockets and asynchronous file system operations (among other features). However, as a full-featured library it includes much more than what is needed and could conflict with other features of \CFA unless significant efforts are made to merge them together. 156 157 \paragraph{Event-Engine} 158 Lying on top of the asynchronous interface layer is the event-engine.
This engine is responsible for multiplexing (batching) the synchronous I/O requests into an asynchronous I/O request and demultiplexing the results onto appropriate blocked threads. This can be straightforward for the simple cases, but can become quite complex. Decisions that will need to be made include: whether to poll from a separate kernel thread or a regularly scheduled user thread, what should be the ordering used when results satisfy many requests, how to handle threads waiting for multiple operations, etc. 431 432 The final aspect of this proposal is asynchronous I/O. 433 Without it, user threads that execute I/O operations block the underlying kernel thread, which leads to poor throughput. 434 It is preferable to block the user thread performing the I/O and reuse the underlying kernel-thread to run other ready user threads. 435 This approach requires intercepting user-thread calls to I/O operations, redirecting them to an asynchronous I/O interface, and handling the multiplexing/demultiplexing between the synchronous and asynchronous API. 436 As such, there are three components needed to implement support for asynchronous I/O: 437 \begin{enumerate} 438 \item 439 an OS abstraction layer over the asynchronous interface, 440 \item 441 an event-engine to (de)multiplex the operations, 442 \item 443 and a synchronous interface for users. 444 \end{enumerate} 445 None of these components currently exist in \CFA and I will need to build all three for this project. 446 447 \paragraph{OS Asynchronous Abstraction} 448 One fundamental part for converting blocking I/O operations into non-blocking is having an underlying asynchronous I/O interface to direct the I/O operations. 449 While there exist many different APIs for asynchronous I/O, it is not part of this proposal to create a novel API. 450 It is sufficient to make one work in the complex context of the \CFA runtime. 451 \uC uses $select$~\cite{select} as its interface, which handles ttys, pipes and sockets, but not disk. 452 $select$ entails significant complexity and is being replaced in UNIX operating systems, which makes it a less interesting alternative. 453 Another popular interface is $epoll$~\cite{epoll}, which is supposed to be cheaper than $select$. 454 However, $epoll$ also does not handle the file system and anecdotal evidence suggests it has problems with Linux pipes and ttys. 455 A popular cross-platform alternative is $libuv$~\cite{libuv}, which offers asynchronous sockets and asynchronous file system operations (among other features). 456 However, as a full-featured library it includes much more than I need and could conflict with other features of \CFA unless significant effort is made to merge them together. 457 A very recent alternative that I am investigating is $io_uring$~\cite{io_uring}. 458 It claims to address some of the issues with $epoll$ and my early investigation suggests that the claim is accurate. 459 $io_uring$ uses a much more general approach where system calls are registered to a queue and later executed by the kernel, rather than relying on system calls to support returning an error instead of blocking. 460 I believe this approach allows for fewer problems, \eg the manpage for $open$~\cite{open} states: 461 \begin{quote} 462 Note that [the $O_NONBLOCK$ flag] has no effect for regular files and block devices; 463 that is, I/O operations will (briefly) block when device activity is required, regardless of whether $O_NONBLOCK$ is set.
464 Since $O_NONBLOCK$ semantics might eventually be implemented, applications should not depend upon blocking behaviour when specifying this flag for regular files and block devices. 465 \end{quote} 466 This makes approaches based on $select$/$epoll$ less reliable since they may not work for every file descriptor. 467 For this reason, I plan to use $io_uring$ as the OS abstraction for the \CFA runtime unless further work encounters a fatal problem. 468 However, only a small subset of the features are available in Ubuntu as of April 2020~\cite{wiki:ubuntu-linux}, which will limit performance comparisons. 469 I do not believe this will affect the comparison result. 470 471 \paragraph{Event Engine} 472 Above the OS asynchronous abstraction is the event engine. 473 This engine is responsible for multiplexing (batching) the synchronous I/O requests into asynchronous I/O requests and demultiplexing the results to appropriate blocked user threads. 474 This step can be straightforward for simple cases, but becomes quite complex when there are thousands of user threads performing both reads and writes, possibly on overlapping file descriptors. 475 Decisions that need to be made include: 476 \begin{enumerate} 477 \item 478 whether to poll from a separate kernel thread or a regularly scheduled user thread, 479 \item 480 what should be the ordering used when results satisfy many requests, 481 \item 482 how to handle threads waiting for multiple operations, etc. 483 \end{enumerate} 159 484 160 485 \paragraph{Interface} 161 Finally, for these components to be available, it is necessary to expose them through a synchronous interface. This can be a novel interface but it is preferable to attempt to intercept the existing POSIX interface in order to be compatible with existing code. This will allow C programs written using this interface to be transparently converted to \CFA with minimal effort. Where this is not applicable, a novel interface will be created to fill the gaps. 486 Finally, for these non-blocking I/O components to be available, it is necessary to expose them through a synchronous interface because that is the \CFA concurrent programming style. 487 The interface can be novel but it is preferable to match the existing POSIX interface when possible to be compatible with existing code. 488 Matching allows C programs written using this interface to be transparently converted to \CFA with minimal effort. 489 Where new functionality is needed, I will add novel interface extensions to fill gaps and provide advanced features. 162 490 163 491 … … 165 493 % =============================================================================== 166 494 \section{Discussion} 167 495 I believe that runtime systems and scheduling are still open topics. 496 Many ``state of the art'' production frameworks still use single-threaded event loops because of performance considerations, \eg~\cite{nginx-design}, and, to my knowledge, no widely available system language offers modern threading facilities. 497 I believe the proposed work offers a novel runtime and scheduling package, where existing work only offers fragments that users must assemble themselves when possible.
168 498 169 499 % =============================================================================== 170 500 % =============================================================================== 171 501 \section{Timeline} 172 173 174 \cleardoublepage 502 \begin{center} 503 \begin{tabular}{ | r @{--} l | p{4in} | } 504 \hline May 2020 & October 2020 & Creation of the performance benchmark. \\ 505 \hline November 2020 & March 2021 & Completion of the implementation. \\ 506 \hline March 2021 & April 2021 & Final Performance experiments. \\ 507 \hline May 2021 & August 2021 & Thesis writing and defence. \\ 508 \hline 509 \end{tabular} 510 \end{center} 175 511 176 512 % B I B L I O G R A P H Y 177 513 % ----------------------------- 178 \addcontentsline{toc}{chapter}{Bibliography} 514 \cleardoublepage 515 \phantomsection % allows hyperref to link to the correct page 516 \addcontentsline{toc}{section}{\refname} 179 517 \bibliographystyle{plain} 180 518 \bibliography{pl,local} 519 520 % G L O S S A R Y 521 % ----------------------------- 181 522 \cleardoublepage 182 523 \phantomsection % allows hyperref to link to the correct page 183 184 % G L O S S A R Y 185 % ----------------------------- 186 \addcontentsline{toc}{chapter}{Glossary} 524 \addcontentsline{toc}{section}{Glossary} 187 525 \printglossary 188 \cleardoublepage189 \phantomsection % allows hyperref to link to the correct page190 526 191 527 \end{document} -
doc/theses/thierry_delisle_PhD/comp_II/local.bib
r3c64c668 r58fe85a 76 76 77 77 @article{finkel1987dib, 78 title={DIB —a distributed implementation of backtracking},78 title={DIB-a distributed implementation of backtracking}, 79 79 author={Finkel, Raphael and Manber, Udi}, 80 80 journal={ACM Transactions on Programming Languages and Systems (TOPLAS)}, … … 221 221 organization={ACM} 222 222 } 223 224 % =============================================================================== 225 % Algorithms 226 % =============================================================================== 227 @article{michael2004hazard, 228 title={Hazard pointers: Safe memory reclamation for lock-free objects}, 229 author={Michael, Maged M}, 230 journal={IEEE Transactions on Parallel and Distributed Systems}, 231 volume={15}, 232 number={6}, 233 pages={491--504}, 234 year={2004}, 235 publisher={IEEE} 236 } 237 238 @inproceedings{brown2015reclaiming, 239 title={Reclaiming memory for lock-free data structures: There has to be a better way}, 240 author={Brown, Trevor Alexander}, 241 booktitle={Proceedings of the 2015 ACM Symposium on Principles of Distributed Computing}, 242 pages={261--270}, 243 year={2015} 244 } 245 246 % Trevor's relaxed FIFO list 247 @inproceedings{alistarh2018relaxed, 248 title={Relaxed schedulers can efficiently parallelize iterative algorithms}, 249 author={Alistarh, Dan and Brown, Trevor and Kopinsky, Justin and Nadiradze, Giorgi}, 250 booktitle={Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing}, 251 pages={377--386}, 252 year={2018} 253 } 254 255 % Scalable counters which only support is !0 256 @inproceedings{ellen2007snzi, 257 title={SNZI: Scalable nonzero indicators}, 258 author={Ellen, Faith and Lev, Yossi and Luchangco, Victor and Moir, Mark}, 259 booktitle={Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing}, 260 pages={13--22}, 261 year={2007} 262 } 263 264 % =============================================================================== 265 % Linux Man Pages 266 % =============================================================================== 267 @manual{open, 268 key = "open", 269 title = "open(2) Linux User's Manual", 270 year = "2020", 271 month = "February", 272 } 273 274 @manual{epoll, 275 key = "epoll", 276 title = "epoll(7) Linux User's Manual", 277 year = "2019", 278 month = "March", 279 } 280 281 @manual{select, 282 key = "select", 283 title = "select(7) Linux User's Manual", 284 year = "2019", 285 month = "March", 286 } 287 288 @misc{io_uring, 289 title = {Efficient IO with io\_uring}, 290 author = {Axboe, Jens}, 291 year = "2019", 292 month = "March", 293 version = {0,4}, 294 howpublished = {\url{https://kernel.dk/io_uring.pdf}} 295 } 296 297 @misc{libuv, 298 key = "libuv", 299 title = {libuv}, 300 howpublished = {\url{https://github.com/libuv/libuv}} 301 } 302 303 % =============================================================================== 304 % MISC 305 % =============================================================================== 306 307 @misc{nginx-design, 308 key = "nginx", 309 title={Inside {NGINX}: How We Designed for Performance \& Scale}, 310 howpublished= {\href{https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale} 311 {https://\-www.nginx.com/\-blog/\-inside\--nginx\--how\--we\--designed\--for\--performance\--scale}}, 312 } 313 314 @article{schillings1996engineering, 315 title={Be engineering insights: Benaphores}, 316 author={Schillings, Benoit}, 317 journal={Be Newsletters}, 318 volume={1}, 319 number={26}, 
320 year={1996} 321 } 322 323 @misc{wiki:thunderherd, 324 author = "{Wikipedia contributors}", 325 title = "Thundering herd problem --- {W}ikipedia{,} The Free Encyclopedia", 326 year = "2020", 327 howpublished = {\href{https://en.wikipedia.org/wiki/Thundering_herd_problem} 328 {https://\-en.wikipedia.org/\-wiki/\-Thundering\_herd\_problem}}, 329 note = "[Online; accessed 14-April-2020]" 330 } 331 332 @misc{wiki:ubuntu-linux, 333 author = "{Wikipedia contributors}", 334 title = "Ubuntu version history: Table of versions --- {W}ikipedia{,} The Free Encyclopedia", 335 year = "2020", 336 howpublished = {\href{https://en.wikipedia.org/wiki/Ubuntu_version_history\#Table_of_versions} 337 {https://\-en.wikipedia.org/\-wiki/\-Ubuntu\_version\_history\#Table\_of\_versions}}, 338 note = "[Online; accessed 15-April-2020]" 339 }