IR types and transformation passes
This section explains the various IR types in asterius and, hopefully, presents a clear picture of how information flows from Haskell to WebAssembly. (There's a similar section in jsffi.md which explains the implementation details of JSFFI.)
Everything starts from Cmm, or more specifically, "raw" Cmm which satisfies:
- All calls are tail calls; parameters are passed in global registers like R1 or on the stack.
- All info tables are converted to binary data segments.
See the Cmm module in the ghc package to get started on Cmm.
Asterius obtains in-memory raw Cmm via:
- cmmToRawCmmHook in our custom GHC fork. This allows us to lay our fingers on the Cmm generated by compiling either Haskell modules or .cmm files (which are in
- There is some abstraction in ghc-toolkit: the compiler logic is actually in the Compiler datatype as some callbacks, and ghc-toolkit converts them to hooks, frontend plugins and
There is one minor annoyance with the Cmm types in GHC (or any other GHC IR type): it's very hard to serialize/deserialize them without setting up complicated contexts related to package databases, etc. To experiment with new backends, it's reasonable to marshal to a custom serializable IR first.
Pre-linking expression IR
We then marshal raw Cmm to an expression IR defined in Asterius.Types. Each compilation unit (Haskell module or .cmm file) maps to one AsteriusModule, and each AsteriusModule is serialized to a .asterius_o object file which will be deserialized at link time. Since we serialize/deserialize a structured expression IR faithfully, it's possible to perform aggressive LTO by traversing/rewriting IR at link time, and that's what we're doing right now.
The expression IR is mostly a Haskell modeling of a subset of binaryen's expression IR, with some additions:
- Unresolved related variants, which allow us to use a symbol as an expression. At link time, the symbols are rewritten to absolute addresses.
- Unresolved locals/globals. At link time, unresolved locals are laid out to wasm locals, and unresolved globals (which are really just Cmm global regs) become fields in the global Capability's
- EmitErrorMessage, as a placeholder for emitting a string error message and then trapping. At link time, such error messages are collected into an "error message pool", and the wasm code just calls an error message reporting function with an array index.
- Null. We're civilized, educated functional programmers and should really be using Maybe Expression in some fields instead of adding a Null constructor, but this is just handy. Blame me.
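To make the additions listed above more concrete, here is a minimal sketch of an expression type with these variants, together with a link-time rewrite. Everything in it is an assumption made for illustration: the constructor set is far smaller than the real one in Asterius.Types, and the names resolve, messagePool and reportError are hypothetical.

```haskell
import Data.List (elemIndex, nub)
import qualified Data.Map.Strict as Map
import Data.Word (Word64)

-- Hypothetical, heavily simplified stand-in for the real expression type.
data Expression
  = ConstI64 Word64           -- an already resolved constant
  | Unresolved String         -- a symbol used as an expression
  | EmitErrorMessage String   -- placeholder: report a message, then trap
  | Call String [Expression]  -- call a function by name
  | Block [Expression]
  | Null                      -- "no expression here"
  deriving (Eq, Show)

-- Collect every error message mentioned in an expression.
errorMessages :: Expression -> [String]
errorMessages (EmitErrorMessage msg) = [msg]
errorMessages (Call _ args) = concatMap errorMessages args
errorMessages (Block es) = concatMap errorMessages es
errorMessages _ = []

-- The "error message pool": each distinct message gets one array index.
messagePool :: Expression -> [String]
messagePool = nub . errorMessages

-- Rewrite symbols to absolute addresses, and error messages to a call to a
-- (hypothetical) reporting function that takes the message's pool index.
resolve :: Map.Map String Word64 -> [String] -> Expression -> Either String Expression
resolve symTable pool = go
  where
    go (Unresolved sym) =
      maybe (Left ("unknown symbol: " ++ sym)) (Right . ConstI64) (Map.lookup sym symTable)
    go (EmitErrorMessage msg) =
      case elemIndex msg pool of
        Just i -> Right (Call "reportError" [ConstI64 (fromIntegral i)])
        Nothing -> Left ("message not in pool: " ++ msg)
    go (Call f args) = Call f <$> traverse go args
    go (Block es) = Block <$> traverse go es
    go e = Right e
```

After resolve succeeds, the result contains neither Unresolved nor EmitErrorMessage, which is the kind of guarantee a post-linking IR can rely on.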
It's possible to encounter things we can't handle in Cmm (unsupported primops, etc). So AsteriusModule also contains compile-time error messages for when something isn't supported. These errors are not reported at compile time; instead, they are deferred to runtime error messages. (Ideally they would be reported at link time, but that turns out to be hard.)
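The deferral idea can be sketched as follows. This is only an illustration under assumptions: CmmOp and Expr are hypothetical miniature languages, not the real GHC or asterius types, and marshalDeferred is a made-up name.

```haskell
-- Hypothetical miniature source language.
data CmmOp
  = CmmLit Int
  | CmmAdd CmmOp CmmOp
  | CmmUnsupported String -- stands for e.g. an unsupported primop

-- Hypothetical miniature target language.
data Expr
  = Const Int
  | Add Expr Expr
  | EmitErrorMessage String
  deriving (Eq, Show)

-- Marshalling either succeeds or yields a compile-time error message.
marshal :: CmmOp -> Either String Expr
marshal (CmmLit n) = Right (Const n)
marshal (CmmAdd a b) = Add <$> marshal a <*> marshal b
marshal (CmmUnsupported what) = Left ("unsupported: " ++ what)

-- Instead of aborting compilation, turn the error into code that reports
-- the message if it is ever executed.
marshalDeferred :: CmmOp -> Expr
marshalDeferred = either EmitErrorMessage id . marshal
```

With this shape, a module that merely contains an unsupported construct still compiles and links; the message only surfaces if the offending code path actually runs.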
The symbols are simply converted to Z-encoded strings that also contain module prefixes, and they are assumed to be unique across different compilation units.
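For readers unfamiliar with Z-encoding, the sketch below shows the general shape of such symbols. It covers only a handful of the escapes GHC uses, so treat it as a partial illustration; mkSymbol is a hypothetical helper, not a function from asterius or GHC.

```haskell
-- A partial Z-encoding: only a few common escapes are handled, every other
-- character is passed through unchanged (the real encoding covers more).
zEncodeChar :: Char -> String
zEncodeChar c = case c of
  'z' -> "zz"
  'Z' -> "ZZ"
  '.' -> "zi"
  '_' -> "zu"
  '-' -> "zm"
  '$' -> "zd"
  '\'' -> "zq"
  '#' -> "zh"
  _ -> [c]

zEncode :: String -> String
zEncode = concatMap zEncodeChar

-- Hypothetical helper: module prefix, entity name and a kind suffix,
-- joined by underscores.
mkSymbol :: String -> String -> String -> String
mkSymbol modName entity suffix =
  zEncode modName ++ "_" ++ zEncode entity ++ "_" ++ suffix
```

For example, mkSymbol "Main" "main" "closure" gives Main_main_closure, and mkSymbol "GHC.Base" "map" "info" gives GHCziBase_map_info. Because the module prefix is part of the string, two modules defining the same name still get distinct symbols.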
There is an AsteriusStore type in Asterius.Types. It's an immutable data structure that maps symbols to underlying entities in the expression IR for every single module, and is a critical component of the linker.
Modeling the store as a self-contained data structure makes it pleasant to write linker logic, at the cost of exploding RAM usage. So we implemented a poor man's KV store in Asterius.Store which performs lazy-loading of modules: when initializing the store, we only load the symbols, not the actual modules; a module is deserialized only when it is "requested" for the first time.
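One way to picture the lazy-loading behaviour is the pure sketch below, which leans on Haskell's laziness rather than on explicit I/O. It is not how Asterius.Store is written: the Store record, mkStore, lookupEntity and the use of read as a stand-in for object file deserialization are all assumptions for illustration.

```haskell
import qualified Data.Map.Lazy as Map

type Symbol = String
type ModuleName = String

-- Hypothetical stand-in for a deserialized module: symbol -> entity body.
newtype Module = Module (Map.Map Symbol String)
  deriving (Show)

data Store = Store
  { symbolIndex :: Map.Map Symbol ModuleName  -- built when the store is created
  , moduleThunks :: Map.Map ModuleName Module -- each value is decoded on first use
  }

-- Build a store from (module name, exported symbols, serialized payload).
-- The payload is only parsed when the corresponding map value is forced.
mkStore :: [(ModuleName, [Symbol], String)] -> Store
mkStore units =
  Store
    { symbolIndex = Map.fromList [(s, m) | (m, syms, _) <- units, s <- syms]
    , moduleThunks =
        Map.fromList [(m, Module (Map.fromList (read payload))) | (m, _, payload) <- units]
    }

-- Requesting a symbol forces deserialization of its module, and only that one.
lookupEntity :: Symbol -> Store -> Maybe String
lookupEntity sym store = do
  m <- Map.lookup sym (symbolIndex store)
  Module entities <- Map.lookup m (moduleThunks store)
  Map.lookup sym entities
```

Since Data.Map.Lazy does not force its values, a module whose symbols are never requested is never parsed, which is where the memory savings come from.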
AsteriusStore supports merging. It's a handy operation, since we can first initialize a "global" store that represents the standard libraries, then build another store from compiling user input, merge the two, and start linking from the resulting store.
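Merging fits naturally into a Semigroup/Monoid shape. The sketch below uses a hypothetical, flattened Store type; whether the real AsteriusStore exposes such instances is an assumption here.

```haskell
import qualified Data.Map.Strict as Map

type Symbol = String

-- Hypothetical simplified store: symbol -> entity.
newtype Store = Store (Map.Map Symbol String)
  deriving (Eq, Show)

-- Merging is a map union. Map.union is left-biased, but since symbols are
-- assumed to be unique across compilation units, clashes should not occur.
instance Semigroup Store where
  Store a <> Store b = Store (Map.union a b)

instance Monoid Store where
  mempty = Store Map.empty
```

With this in place, the workflow described above is just globalStore <> userStore, and a list of per-module stores can be combined with mconcat.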
Post-linking expression IR
At link time, we take
AsteriusStore which contains everything (standard libraries and user input code), then performs live-code discovery: starting from a "root symbol set" (something like
Main_main_closure), iteratively fetch the entity from the store, traverse the AST and collect new symbols. When we reach a fixpoint, that fixpoint is the outcome of dependency analysis, representing a self-contained wasm module.
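The live-code discovery described above is a classic worklist iteration. Here is a minimal sketch, assuming a hypothetical store that already maps each symbol to the symbols its entity mentions (the real linker obtains these by traversing the AST); liveSymbols is a made-up name.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type Symbol = String

-- Hypothetical store: each symbol maps to the symbols its entity refers to.
type Store = Map.Map Symbol [Symbol]

-- Starting from the root set, keep following references until no new
-- symbols show up. The result is the fixpoint: everything reachable.
liveSymbols :: Store -> Set.Set Symbol -> Set.Set Symbol
liveSymbols store = go Set.empty
  where
    go seen frontier
      | Set.null frontier = seen
      | otherwise =
          let seen' = Set.union seen frontier
              -- Symbols missing from the store are treated as leaves here;
              -- a real linker would want to report them.
              deps =
                Set.fromList
                  [d | s <- Set.toList frontier, d <- Map.findWithDefault [] s store]
           in go seen' (Set.difference deps seen')
```

The iteration terminates because the seen set only grows and is bounded by the set of symbols mentioned in the store; cyclic references are harmless since already seen symbols never re-enter the frontier.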
We then do some rewriting work on the self-contained module: making symbol tables, rewriting symbols to absolute addresses, using our own relooper to convert from control-flow graphs to structured control flow, etc. Most of the logic is in
The output of the linker is Module. It differs from AsteriusModule: although it shares quite a few datatypes with AsteriusModule (for example, Expression), it guarantees that some variants will not appear (for example,
Module is ready to be fed to a backend which emits real wasm binary code.
There are some useful linker byproducts. For example, there's LinkReport, which contains mappings from symbols to addresses; these are lost in the wasm binary code but are still useful for debugging.
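To give a feel for what a symbol-to-address mapping looks like, here is a small sketch that lays out entities contiguously in memory. The layout function, the base address and the 8-byte alignment are assumptions for illustration; the actual fields of LinkReport and the actual layout policy are not shown in this document.

```haskell
import qualified Data.Map.Strict as Map
import Data.Word (Word64)

type Symbol = String

-- Lay out (symbol, size in bytes) pairs one after another, starting at a
-- base address and rounding each size up to a multiple of 8 bytes.
layout :: Word64 -> [(Symbol, Word64)] -> Map.Map Symbol Word64
layout base entries = Map.fromList (zip (map fst entries) addrs)
  where
    align8 n = (n + 7) `div` 8 * 8
    -- Running sum of aligned sizes: the n-th element is the n-th address.
    addrs = scanl (+) base (map (align8 . snd) entries)
```

Keeping such a map around after linking lets one translate a raw address seen in a debugger or a trap message back to the symbol it came from.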
Generating binary code via binaryen
Once we have a Module (which is essentially just a Haskell modeling of the binaryen C API), we can invoke binaryen to validate it and generate wasm binary code. The low-level bindings are maintained in the binaryen package, and Asterius.Marshal contains the logic to call the imported functions to do the actual work.
Generating binary code via wasm-toolkit
We can also convert Module to the IR types of wasm-toolkit, which is our native Haskell wasm engine. It's now the default backend of ahc-link, but the binaryen backend can still be chosen by
- Common runtime which can be reused across different asterius compiled modules. It's in
- Stub code which contains specific information like error messages, etc.