Compiler Applications to Query Processing

Summary: This presentation has three main themes, presented in order: (1) Query Interpretation, (2) Query Compilation / Code Generation, and (3) an overview of the paper "Query Compilation Without Regrets", which tries to reconcile the two.

Detailed outline:
0:00 Video intro
0:41 Basic SQL
1:32 An explanation of joins
3:29 Introduction to query plans
4:28 A query plan with a selection operator
5:08 A query plan with joins
5:41 A query plan featuring all the operators
6:22 Introducing query execution and processing models
7:18 Introducing the Iterator/Volcano model
8:03 A scan-and-projection plan in the Iterator model
9:25 Adding a projection in the Iterator model
9:47 What is a pipeline?
11:10 Joins are pipeline splitters
13:07 What interpretation means in query processing
16:10 Vectorized interpretation
16:46 Introducing query compilation / code generation
18:02 It should be possible to compile query plans
19:28 Background on the code generation scheme
20:19 Basics of the scheme: emit, produce, consume
21:15 Overview of the scheme code
23:05 Applying the scheme step-by-step
26:16 The scheme is complete (generic)
27:08 Handling joins
28:05 Applying the scheme to a plan with a join
31:06 Benefits and drawbacks of Query Compilation
31:49 Introducing Query Compilation Without Regrets
32:37 Introducing tracing
33:56 Running a program to execute vs generate code
36:42 Projection, selection, and scan in Nautilus
37:27 Tracing in Nautilus
40:51 Takeaways

Notes:
16:34: I'll admit I kind of botched this explanation. Here's a great paper on the topic if you're interested: https://15721.courses.cs.cmu.edu/spri...
19:30: Paper: https://www.vldb.org/pvldb/vol4/p539-...
31:40: You may be wondering why one would want to generate LLVM IR.
Some reasons that are still relevant are: (1) You can do low-level optimizations that are hard to do in C. (2) It may seem that generating C is easier, but given the abysmal state of the tooling surrounding C and C's bad compilation model, it's probably just as easy to generate LLVM IR. (3) Generally you get faster compilation times by going through LLVM IR. This is easy to see if the compiler we use is Clang, because Clang needs to go through LLVM IR anyway, and given that Clang is not particularly fast nowadays, it will probably be slower than the domain-specific LLVM IR generator of a query compiler. Also, note that C/C++ don't have a faster binary format, while LLVM IR does (bitcode).
Note that the time budget for compiling code when executing queries is pretty slim, because the compilation happens when the query is issued (i.e., it is online), not ahead of time. (OTOH, the operators used in interpretation don't have this problem, because they are compiled once and then composed during query execution.)
Finally, the above doesn't mean that compiling C is slow. But compiling C through the popular optimizing compilers like Clang/LLVM and GCC _is_. If you don't care about using these compilers, you can get much faster compilation from compilers like the Tiny C Compiler (tcc). Possibly even MSVC with optimizations on compiles faster than GCC and Clang nowadays.
31:50: Paper: https://dl.acm.org/doi/pdf/10.1145/36...
33:10: This is a misleading explanation of tracing. The traces are not created offline, but rather as the program runs, i.e., it's a just-in-time (JIT) compilation technique. For example, TraceMonkey basically keeps a counter, and once a path has been visited multiple times (actually 2, IIRC), it generates a trace for that path. Also, there are practical reasons why it makes sense for the trace to be (almost) straight-line code.
35:12: I know... When this explanation started, you didn't expect me to spend that much time explaining stack machines and the arcane SpiderMonkey bytecode.
Neither did I :)
35:18: This replacement of operators is something we cannot do in C++, as I explain later, because we cannot, e.g., replace the implementation of an `if`.
39:17: Paper: https://intimeand.space/docs/buildit.pdf

One thing I forgot to mention in the video is that it is heavily inspired by Andy Pavlo's lectures. However, the clarifications in this video were necessary for me.

The original presentation was given at the Compiler Meetup: https://compiler-meetup.cs.illinois.edu/

You can find more content from me on my website: https://sbaziotis.com/
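
A small illustration of the 35:18 note, sketched in Python rather than C++ to keep it short. It is not code from the video, Nautilus, or BuildIt; every name in it (Trace, Traced, query_expr, trace_once) is made up for this example. The idea: overloaded operators can record what the program does instead of only doing it, but an `if` merely asks the value for a plain boolean (`__bool__` here, playing the role that `operator bool` would play in C++), so a single run only ever sees the branch it actually took.

```python
# Hypothetical sketch: all names below are invented for illustration.

class Trace:
    """Collects emitted statements and hands out fresh temporary names."""

    def __init__(self):
        self.stmts = []
        self.count = 0

    def fresh(self):
        name = f"t{self.count}"
        self.count += 1
        return name


def _raw(x):
    """Concrete value of x, whether it is traced or a plain constant."""
    return x.value if isinstance(x, Traced) else x


class Traced:
    """A value that records every operator applied to it."""

    def __init__(self, name, value, trace):
        self.name = name      # symbolic name used in the emitted code
        self.value = value    # concrete value observed during this run
        self.trace = trace    # shared Trace object

    def _binop(self, op, other, result):
        rhs = other.name if isinstance(other, Traced) else repr(other)
        out = Traced(self.trace.fresh(), result, self.trace)
        self.trace.stmts.append(f"{out.name} = {self.name} {op} {rhs}")
        return out

    def __add__(self, other):
        return self._binop("+", other, self.value + _raw(other))

    def __mul__(self, other):
        return self._binop("*", other, self.value * _raw(other))

    def __gt__(self, other):
        return self._binop(">", other, self.value > _raw(other))

    def __bool__(self):
        # `if` only wants a bool. We can note which way this run went,
        # but we cannot make `if` itself record both branches.
        self.trace.stmts.append(f"# branch on {self.name} took {self.value}")
        return bool(self.value)


def query_expr(price):
    # Ordinary-looking code; called with a Traced argument it emits a trace.
    total = price * 2 + 1
    if total > 10:
        return total + 100
    return total


def trace_once(concrete_price):
    """Run query_expr once on a traced input and return the recorded statements."""
    trace = Trace()
    query_expr(Traced("price", concrete_price, trace))
    return trace.stmts


if __name__ == "__main__":
    for stmt in trace_once(3):
        print(stmt)
```

Running `trace_once(3)` records the multiplication, the addition, and the comparison, but nothing from the `total + 100` branch, because that branch never executed. Seeing it requires another run that actually takes it, e.g. `trace_once(5)`.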