Compiler IR: Midgard

One of the internal representations used by our compiler is called Midgard. It is a reference to the middle earth of Norse mythology, said to be in a central position between the other eight worlds. Fittingly, Midgard sits between the front-end and back-end of the compiler, isolating them from each other. The references to Norse mythology actually started with Asgard (for ASG, the Abstract Semantic Graph) and went on with Jotunheim (for Jot, our Just-In-Time code generator).

Midgard has an interesting implementation. It was designed to be an immutable directed acyclic graph, yet still make analysis and rewriting as easy as possible.

One problem with immutable data types is that, to perform a tiny rewriting in the middle of the structure, you need to rebuild large portions of it. If you have many types of nodes, each rewriting turns into a cloning party and every constructor is invited.

let constant_condition tree = 
  let rec aux = function 
    | If (Const true, ifTrue, ifFalse) -> aux ifTrue
    | If (Const false, ifTrue, ifFalse) -> aux ifFalse
    | If (cond, IfTrue, ifFalse) -> If (aux cond, aux ifTrue, aux ifFalse)
    | Add (a, b) -> Add (aux a, aux b)
    | Sub (a, b) -> Sub (aux a, aux b)
    | Mul (a, b) -> Mul (aux a, aux b)
    | Div (a, b) -> Div (aux a, aux b)
    | (* 50 more lines of cloning *)
  aux tree

Some nodes are very complex, including dictionaries of arrays of structures, where each structure contains sub-nodes, so even one clone function represents a large amount of boilerplate code.

Another thing. Midgard, like most useful IRs, is a graph and not a tree. Having variables is nice when you expect humans to write scripts, but they are messy to deal with—you need to actually match up each variable read with the corresponding write, which is a complex algorithm in itself—and so most IRs get rid of them altogether. In Midgard, every use of a variable points instead to the last expression that was assigned to that variable. If it is used twice, then two different parts of the script point at the same expression, and every rewriting pass will need to make sure that this expression is only visited once.

In our compiler, a « Midgard script » contains an array of nodes and an array of indices. Instead of referencing each other directly, nodes reference cells in the array of indices, which contain the index of cells in the array of nodes. In other words, to get the first sub-expression of If (a, b, c), one would access nodes.[indices.[a]]. Of course, these layers are properly abstracted away, so that actual users of Midgard simply write script.[a]. The two key elements here are that :

Each node in a Midgard script has its own unique integer identifier (its position in the node array) which can be used as a key in a dictionary to bind arbitrary data to that node.
It's possible to have the same node appear in two different scripts, no cloning required, and have its child nodes be different in each script by altering the array of nodes or array of indices appropriately.

How does this help ? For a quick example, let's determine which nodes in a script are compile-time boolean constants. First, with a typical pattern-matching approach :

let rec bool_const = function
  | Const t   -> Some t 
  | And (a,b) -> match (bool_const a, bool_const b) with 
                 | Some false, _  
                 | _         , Some false -> Some false
                 | Some true , Some true  -> Some true
                 | _                      -> None
  | Or (a,b)  -> match (bool_const a, bool_const b) with 
                 | Some true , _ 
                 | _         , Some true  -> Some true
                 | Some false, Some false -> Some false
                 | _                      -> None
  | _         -> None

There are several things wrong with this function. It's recursive, so a long-winded graph can cause a stack overflow—imagine the behaviour of A1 && A2 && .. && A1000, and remember that boolean constant propagation is the easiest kind of analysis you can perform on a script. Even worse, the function does not account for shared sub-expressions, so you can build a linear script that takes exponential time to analyze :

A1 = true 
A2 = A1 && A1  
... 
A100 = A99 && A99 // A100 is true, but it takes 2^100 recursive calls to prove it

How does Midgard solve this ?

let bool_const script = 
  script.Map (fun id map ->
    match script.[id] with 
      | Const t   -> Some t 
      | And (a,b) -> match (map.[a], map.[b]) with 
                     | Some false, _  
                     | _         , Some false -> Some false
                     | Some true , Some true  -> Some true
                     | _                      -> None
      | Or (a,b)  -> match (map.[a], map.[b]) with 
                     | Some true , _ 
                     | _         , Some true  -> Some true
                     | Some false, Some false -> Some false
                     | _                      -> None
      | _         -> None)

This looks suspiciously like the recursive function from before, where recursive calls have been replaced with map.

What happens here is that method script.Map visits all nodes in the script, starting with the leaves, and applies a function to each of them. That function receives two arguments : the identifier of the node and a map dictionary containing all the results returned by this function on previously visited nodes. When script.Map is done, it returns the completed dictionary map.

This iterative approach solves both the stack overflow problem, by eliminating recursion and traversing the graph in child-parent order, and the shared expression problem, by calling the function on each node exactly once. It is not perfect, however : it will evaluate the function for the entire script, whereas the recursive version could be called only on the expressions for which the answer mattered. In other words, it sacrifices best-case performance to improve worst-case performance and avoid stack overflows.

Let's go back to script rewriting. Most rewrites only touch a small portion of the script, leaving the rest unchanged. The parts that do not change are simply copied over from the old script by adding those nodes to the node array of the new script. The index array is then used to rewire the connections between nodes.

let propagate_bool_const script = 
  let const = bool_const script
  let builder = Rewriter script
  let _ = script.Map (fun id map -> 
    match const.[id] with
    | Some v -> builder.Add (Const v)
    | None -> match script.[id] with 
              | If (c, t, f) -> match const.[c] with
                                | Some true -> map.[t]
                                | Some false -> map.[f]
                                | None -> builder.Copy id    
              | _ -> builder.Copy id)
  builder.ToScript()

With annotations :

let propagate_bool_const script = 

  (* Preliminary analysis: the constant bool value of each node, if any *)

  let const = bool_const script

  (* A rewriter that copies nodes from 'script' *)

  let builder = Rewriter script

  (* Map node identifiers in 'script' to the corresponding identifiers in 'builder' *)

  let _ = script.Map (fun id map -> 
    match const.[id] with

  (* If 'node' evaluates to boolean constant, add a brand new node to 'builder'. 
     Any copied nodes that pointed to 'node' will point to the new node instead *)

    | Some v -> builder.Add (Const v)
    | None -> match script.[id] with 
              | If (c, t, f) -> match const.[c] with
 
  (* If the node is a conditional, and the condition evaluates to a constant in 'script', 
     replace the entire node with 't' or 'f'. Their index in 'builder' is stored in 'map' *)

                                | Some true -> map.[t]
                                | Some false -> map.[f]

  (* Otherwise, just copy over the node *)

                                | None -> builder.Copy map id    
              | _ -> builder.Copy map id)
  builder.ToScript()

The entire operation still requires copying over the entire script to a new one (such is the nature of immutable structures, after all), but individual nodes are not rewritten or cloned, they are simply referenced from the nodes array of the new script. In terms of memory, in addition to the array-of-nodes and array-of-indices, the only required allocations are the nodes that did not exist in the original script, and there are usually only a handful of these.