Reposted from: http://juliacomputing.com/blog/2016/04/04/inferenceconvergence.html
Adventures in Type Inference Convergence
Julia uses type inference to determine the types of program variables and generate fast, optimized code. But how does it really work? I recently redesigned the implementation of Julia’s type inference algorithm, and decided to blog what I’ve learned. If you’re already comfortable with these algorithms, I might not say anything new. But since only 26 people have committed any changes to Julia’s `inference.jl` file, I suspect that this is an unfamiliar topic for most people.
However, I’m not going to focus on explaining the typical algorithms, but instead focus on some of the implementation challenges.
A high-level view of type inference when approached as a dataflow problem (as Julia does) is that it involves running an “interpreter” on the program, but looking only at types instead of values. Unlike normal program execution, this process can be proven to always converge and terminate if certain criteria are met. A typical explanation of the algorithm might end with a statement such as “now iterate to convergence”. But this simple phrase masks a significant design complexity, which can impact the speed and precision of the algorithm. This is the component of the Julia algorithm that needed a significant redesign to address a particular recursion bug and to make inference on self-recursive functions faster.
Basic Algorithm
Many algorithms for type inference exist. For statically typed languages, the algorithm of choice is usually a flow-insensitive, unification-based algorithm like Hindley-Milner. Julia’s usage of dataflow analysis, however, is better suited for dynamic typing. This approach allows, for example, variables to have different types at different points in the program, adding extra flexibility and convenience. Unlike static typing systems, type inference is used as an optimization, rather than to guarantee correctness at compile time. To succeed, it does not need to prove a unique tightest bound on types, so it is allowed to widen (over-approximate) types heuristically. As we will see, these heuristics have an interesting design space.
I have illustrated Julia’s algorithm below for a simple `for` loop. In the animation below, `pc` is the “program counter” indicating the line currently being processed, and `pc´` is highlighted to indicate the lines that remain to be processed next:
In Julia’s dataflow algorithm, inference execution starts with the statements in a function and initializes several states:

- The list of current instruction pointers, or active lines, is set to the entry point of the function.
- The types of all variables in the function are set to `Union{}` (aka `Bottom`) or to the type of the corresponding argument.
- The function also has an estimated return type of `Union{}` (does not return).
Then the algorithm acts like a “virtual machine”, iterating the following process until there are no lines waiting to be examined:

- The virtual machine takes one statement from the list of current instruction pointers.
- The side-effects of this statement are computed. For example, a variable assignment will result in the type of that variable changing. If the expression is a call, then the virtual machine looks up or computes the current best guess at the return value of that call, given the types of the input arguments. By applying this recursively, a current guess of the types of all of the elements of that statement is produced.
- The next line of the function is then added to the list of currently active instruction pointers, and the types of all variables on that next line are merged with the types from the current line, modified by any changes made by the current statement. This is not an easy update to describe in text, since it merges two sets of types while taking into account the changes made in the previous statement. The best description I can give is that it should do exactly what you would expect: return the set of types possible for the variables on the next line, which is the same as the set of types on the current line, plus any changes made on the current line.
- For any `if` statement or other control operation (`while`, `goto`, etc.), the previous operation is performed for all of the possible next statements, instead of just the next one. For example, given an `if` statement, inference will take both the if-true and if-false paths through the function (and merge the results at the end).
- To avoid infinite looping, the next instruction is only added to the list of current instruction pointers if the type of any variable on that line changed. Otherwise, the algorithm loops repeatedly until all of the types stop changing. The specific algorithm used to widen the types at each iteration can have dramatic effects on the runtime of the algorithm in the worst cases. For this reason, Julia’s type-merging function typically tries to widen its estimate very quickly.
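The loop above can be condensed into a small worklist interpreter. The sketch below is illustrative Python, not Julia’s `inference.jl`: abstract types are modeled as frozensets of type names, the merge is plain set union, and each statement is a hand-written transfer function.

```python
# Minimal worklist sketch of dataflow type inference (illustrative, not Julia's code).
# Abstract "types" are frozensets of concrete type names; merging takes the union,
# which always produces a less specific (monotone) estimate.

def merge(a, b):
    """Join two variable-state dicts; returns (merged, changed-relative-to-a)."""
    out = dict(a)
    changed = False
    for var, types in b.items():
        new = out.get(var, frozenset()) | types
        if new != out.get(var, frozenset()):
            out[var] = new
            changed = True
    return out, changed

def infer(stmts, succs, entry_state):
    """stmts[i] is a transfer function state->state; succs[i] lists the next pcs."""
    states = {0: dict(entry_state)}   # best-known variable types at each pc
    work = [0]                        # list of current instruction pointers
    while work:
        pc = work.pop()
        post = stmts[pc](states[pc])  # apply the statement's side effects
        for nxt in succs[pc]:
            merged, changed = merge(states.get(nxt, {}), post)
            if changed or nxt not in states:
                states[nxt] = merged
                work.append(nxt)      # only revisit if some type actually widened
    return states

# Tiny example: x = 1; if cond: x = "s"; y = x
stmts = [
    lambda s: {**s, "x": frozenset({"Int"})},      # pc 0: x = 1
    lambda s: dict(s),                             # pc 1: branch (no effect)
    lambda s: {**s, "x": frozenset({"String"})},   # pc 2: x = "s"
    lambda s: {**s, "y": s.get("x", frozenset())}, # pc 3: y = x
]
succs = [[1], [2, 3], [3], []]
result = infer(stmts, succs, {})
```

At the merge point (pc 3), `x` ends up as the union `{Int, String}`, exactly the “merge the results at the end” step described above.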
As I shall discuss next, once convergence has been reached, inference is done. The return type and variable types are marked and cached for code generation and future reuse (for example, if the function is inlined). The function can also then potentially be better optimized now that all of the types are known.
Convergence
Now we come to the real challenge of type inference: reaching convergence.
The dataflow algorithm can be proven to converge if two conditions are met: monotonicity and finiteness.
This first requirement means that merging two types must result in a less specific type, and that knowing less about the type of an argument to a function call must not lead to knowing more about its return type. This condition is fairly easy to meet, since all we need to do is describe the behavior of language primitives in a well-behaved way.
The second requirement means that there must be a limit on the number of types that can be constructed. This condition is more interesting, since many useful type lattices, including Julia’s, do not obey it^{1}. Instead, for the purposes of reaching convergence, the lattice is “forced” to a finite height using a technique called “widening”. In widening, heuristics are applied to make types grow much faster than they would in a naive implementation of the algorithm, leading to faster convergence. This ensures that Julia’s type-inference algorithm is guaranteed to reach convergence eventually.
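As a toy illustration of widening (the bound and the `"Any"` top here are made up; Julia’s actual heuristics limit union length, tuple length, and nesting depth, as described in the footnotes): once a union estimate exceeds a fixed width, collapse it to the lattice top so that an estimate can only climb a bounded number of times.

```python
# Illustrative widening sketch; MAX_UNION and the "Any" top are toy stand-ins,
# not Julia's actual bounds or lattice.
MAX_UNION = 3
ANY = frozenset({"Any"})  # top element of this toy lattice

def widen(types):
    """Collapse an over-large union to top, forcing the lattice to finite height."""
    if len(types) > MAX_UNION:
        return ANY
    return types

def join(a, b):
    """Monotone join with widening applied after the union."""
    return widen(a | b)

small = join(frozenset({"Int"}), frozenset({"Float64"}))       # still precise
big = join(frozenset({"Int", "Float64", "String"}), frozenset({"Bool"}))  # widened
```

Because every chain of joins now reaches `Any` after at most `MAX_UNION` strict increases, the fixed-point iteration cannot loop forever on growing unions.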
That still leaves the practical challenge of determining how to reach convergence. In Julia, detecting convergence is handled in several parts: Julia has local convergence at the statement level within a function, convergence of `static_typeof` variables (comprehensions), and global convergence of all functions in the entire call graph.
Historical Perspective
Initially, Julia only had the first test. This made inference quite simple to implement as a recursive descent over a depth-first search of the call graph. At each call, the algorithm pauses inference on the current function to recur into the callee. Eventually a leaf function is reached (one which does not make any calls to an uninferred function), and then the whole stack can unwind, filling in the return type information as it goes, and continuing with iterating convergence on the current function.
There is a slight wrinkle with this plan, however, if recursion is encountered. If there is a recursion, there is no leaf function. Instead, when entering a new function, the call stack is first checked to see if this function had already been encountered. If so, the algorithm needed to take several actions. First, it marked the function as being recursive (`toprec`) and marked every intervening caller as depending on a recursive function (`rec`). Then, as the recursion unwound, those functions would record that their return type was an incomplete under-approximation of the real return type and discard their incomplete inference work (marked `true` in the `tfunc` cache). The topmost function in the recursion would then iterate this process until its return type reached a fixed point.
Over time, this test was made more complicated to handle cases where recursion in the call graph can form multiple loops. Until v0.5, the algorithm computed a cycle id for each recursion detected, to attempt to separate the dependency graph into distinct recursion loops. This was not entirely correct, because the graph is not always separable, which led to bug #9770 and then PR #9926 to compute it more correctly (I never attempted to prove it was fully correct, just that it was more correct on that issue). Unfortunately, that PR’s implementation also discarded and repeated significantly more incomplete work while handling recursion. The result was that inference became significantly slower on most tasks (such as inferring a single key handler for the REPL), so the PR was never merged. To overcome these limitations, a new approach was required to deal with this case.
I mentioned previously that cycles cause issues for the existing type inference algorithm. The root of the problem is that using recursion to solve for the return types makes it inherently difficult to pause and resume type inference to incorporate new information. This impairs computation of the simultaneous convergence of all of the functions involved in a cycle, and instead leads to a solution where only the topmost function in the call graph is converged at a time. The resulting rework is very time-expensive in cases with heavy recursion, such as running inference on the inference code itself.
The solution is to allow pausing and restarting inference on any part of a function at any time. This could still be written recursively, but once the state has been encapsulated in this way, it is easier to represent the algorithm as pure iteration. Indeed, this has obvious symmetries with inference within a function, where the working state is represented via an integer set of lines to revisit. This should not be particularly surprising, since there exists an isomorphism between statements and functions allowing one to be rewritten in terms of the other (e.g. replacing recursion with iteration, or inlining one function into another).
The remaining construct that affects convergence is the `static_typeof` expression. This construct, which appears in the lowered^{2} code for a comprehension, is an unusual complication in Julia’s type inference algorithm. It has required special care to make the results of comprehensions look predictable, even though the current design is not^{3}. That is intended to be fixed soon, so I’ve relegated an overview of this construct to the footnotes^{4}.
Corrected Convergence
The corrected convergence algorithm in PR #15300 works iteratively over all functions in the call graph, adding new functions to the queue as it comes across them, until all of them have collectively reached global convergence. When asked for the return type of a function, inference first checks the list in the `.tfunc` field of the `LambdaInfo` method definition for whether the return type has been computed previously or is currently being computed (the `.tfunc` list is a dictionary that maps from a call signature to either the in-process `InferenceState` object or a pair of inferred function body and return type). If the type signature is not found, it is added to the work queue to be inferred. In either case, the current best guess is returned. If inference has not finished with the function, an edge is also added to the inference state for these two functions (caller and callee), which tracks the dependency of the return type of the callee on that line number in the caller.
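A sketch of this lookup logic in Python follows; `InferenceState`, `lookup_return_type`, and the cache layout are hypothetical simplifications for illustration, not what `inference.jl` actually does.

```python
# Toy model of the .tfunc lookup: the cache maps a call signature either to a
# finished (body, rettype) pair or to an in-progress InferenceState; asking for
# an unfinished callee registers a caller<->callee edge. Names are made up.
class InferenceState:
    def __init__(self, sig):
        self.sig = sig
        self.bestguess = "Union{}"   # starts at Bottom: "does not return"
        self.edges = []              # callees whose return types we still depend on
        self.backedges = []          # (caller, line) pairs to re-activate on change

cache = {}      # signature -> InferenceState (in progress) or (body, rettype) (done)
workqueue = []  # states waiting to be processed

def lookup_return_type(caller, line, sig):
    entry = cache.get(sig)
    if isinstance(entry, tuple):            # fully inferred: (body, rettype)
        return entry[1]
    if entry is None:                       # never seen: queue it for inference
        entry = cache[sig] = InferenceState(sig)
        workqueue.append(entry)
    # still in progress: record the dependency in both directions
    caller.edges.append(entry)
    entry.backedges.append((caller, line))
    return entry.bestguess                  # current under-approximation
```

The backedge list is what later lets the algorithm re-mark exactly the affected caller lines when a callee’s return type improves.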
There are several partially-overlapping sets (or states) that a function can be in. I’ll list them all here for your reference, but I won’t fully explain what my terms mean until later:

1. active: any function that is waiting to be inferred (superset of 2–6)
2. in-workqueue: any function that is in the work queue (yeah, I know, very insightful), which means it is ready for more processing (3)
3. processing (one function): intermediate state after leaving the work queue, while inferring all active lines and before it is then moved to the appropriate follow-on state (4–7)
4. stuck: not in the work queue, with edges ≠ ∅ but active lines = ∅
5. stuck´: not in the work queue, with edges ≠ ∅ and active lines ≠ ∅
6. fixed-point: part of a cycle that may have collectively reached a fixed point
7. finished: done
Any number of functions could be added to the work queue before starting global inference; the process doesn’t necessarily need to be rooted in one caller. Once the loop is called to start inference on everything, one function is selected^{5} from the work queue for the algorithm to make progress on. It is important that the algorithm is always making progress to ensure that it will eventually terminate (more on this later). The algorithm then iterates over the instructions in the current function until it runs out of lines that can make progress (the algorithm could pause work earlier, but restarting frequently is less efficient and makes it harder to compute the “making progress” metric). Each time the algorithm reaches a call to another function that is not yet inferred, it adds an edge to the inference state for this function.
Once all active lines are processed, the algorithm determines whether it is completely finished by counting the number of unresolved edges. If the number is zero, all child functions have also finished inference and the current return type result is fully converged. In that case, the function can be removed from the inference queue and recorded in the `.tfunc` mapping (moved from the `active` set to the `finished` set). Then all parent functions are re-added to the work queue (by following the list of backedges) so they can check for their own convergence. This is then repeated until the entire queue is empty (the queue can also become empty due to a recursion cycle; I’ll get to that shortly). Alternatively, the item is moved to the `stuck` set until the functions that it called have been inferred and can provide their actual return type.
If all active lines have not been processed, the algorithm picks another function off of the work queue and repeats the process. Because each call edge between two functions is explicitly tracked, each time the return value of a function changes, the algorithm can easily re-mark the appropriate lines in all of its callers for revisiting by the type inference algorithm, by following the backedges (thereby moving the caller from `stuck` to `stuck´`).
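This backedge-driven invalidation can be sketched as follows; `Frame` and `update_return_type` are made-up names for illustration, not Julia’s API.

```python
# Toy sketch of re-marking caller lines via backedges when a callee's return
# type widens (illustrative names, not Julia's implementation).
class Frame:
    """Per-function inference frame."""
    def __init__(self, name):
        self.name = name
        self.active_lines = set()    # lines waiting to be (re)visited
        self.backedges = []          # (caller_frame, line) depending on our return type
        self.rettype = "Union{}"

def update_return_type(frame, new_rettype, workqueue):
    """If the return type changed, re-mark the calling line in every caller."""
    if new_rettype == frame.rettype:
        return
    frame.rettype = new_rettype
    for caller, line in frame.backedges:
        caller.active_lines.add(line)        # stuck -> stuck´: has active lines again
        if caller not in workqueue:
            workqueue.append(caller)         # caller is ready for more processing

f, g = Frame("f"), Frame("g")
g.backedges.append((f, 7))   # f calls g at line 7
wq = []
update_return_type(g, "Int", wq)
```

Only the lines that actually observe the changed callee are re-activated, which is what keeps the iteration cheap when most of the graph is already converged.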
Now comes the tricky part: dealing with cycles.
I mentioned earlier that the work queue may also become empty due to recursion. When this happens, all of the unresolved edges of the function that was just moved from the work queue to `stuck` contain cycles. There are several ways to deal with this situation, depending on how the implementation chooses to represent, detect, and resolve the cycle. The following is a description of the implementation proposed for Julia. In terms of efficiency, this design optimizes for the case where most of the functions in the graph converge quickly (versus the possibility of needing to iterate the cycle many times before reaching a fixed point). Since most Julia code is written to have predictable, “type-stable” leaf return types (and inference widens type information very quickly), this is expected to work out well in practice.
When the work queue becomes empty, the algorithm marks the current function as having reached a `fixed-point` cycle. It then adds all of its non-`fixed-point` edges to the work queue and marks them as having reached a `fixed-point`. To guarantee convergence, the following invariant is maintained on this set at all times:

Any function marked as `fixed-point` must either:
a. be in the work queue set of functions (or `processing`), or
b. have all of its edges in the work queue and marked as `fixed-point`.

This makes it remarkably easy to track this property and maintain these invariants. If the return type of the function changes, it and all of its immediate backedges are marked as not being at a fixed point (moving them to `stuck´`). The backedges of each of those backedges are then recursively also unmarked (moving them to `stuck`). If a function marked as `fixed-point` is retired from `processing`, it adds all of its edges to `fixed-point`.
If the work queue is still empty at this point, then the edges of all functions in the `fixed-point` set are also in the `fixed-point` set, and the entire cycle has reached convergence (yay!) and can be moved en masse to the `finished` set.
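As a toy illustration (frame names as strings, made-up helper name), this convergence test for a cycle reduces to a subset check over the tracked edges: the cycle is done exactly when no unresolved edge escapes the fixed-point set.

```python
# Illustrative check of the cycle-convergence condition, not Julia's code.
def cycle_converged(fixedpoint_frames, edges):
    """edges[f] is the set of unresolved callees of frame f.
    The whole cycle has converged when no edge leaves the fixed-point set."""
    return all(edges[f] <= fixedpoint_frames for f in fixedpoint_frames)

# f and g call each other and nothing else: converged.
fp = {"f", "g"}
closed = {"f": {"g"}, "g": {"f"}}
# Same cycle, but g still depends on an outside function k: not converged.
leaky = {"f": {"g"}, "g": {"f", "k"}}
```

In the real algorithm this property is not recomputed from scratch; the invariant above maintains it incrementally as functions move between states.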
Finally, if the work queue is still empty, then the `active` set contained multiple independent call graphs (and all of the remaining ones contain cycles). A random one gets moved to `processing` to kick-start continued cycle resolution.
Once the `active` set is empty, inference is done. It is now the compiler’s job to decide what to do with all this information to maximize speed.
But wait! There’s one more complication: in addition to cycles through a function with the same signature, because each method is specialized on the inferred type signature, there can be a cycle formed by recursively calling a method where the type signature changes at each call.
In this case, various heuristics are required to ensure that inference can still terminate. The current heuristics place a limit on the complexity of the type signature^{1} in the recursion to force it to be finite. But a full discussion of these heuristics will have to wait for a future blog post.
Addendum: Speed
After writing this post, testing of the algorithm above showed that it was unacceptably slower and more memory-hungry than the existing code. So, after all of that work to remove all recursion and make it a simple work queue, I made one final tweak: I made it recursive. This made its behavior considerably more complex to analyze, but it pays for its complexity with the speed benefit of being able to infer the common case (non-cyclic functions) without creating a deferred edge. The fallback, however, is still the corrected looping algorithm given previously. This change, however, demonstrates one advantage of the states chosen above for the algorithm: it is highly decoupled, so that only the act of making and finalizing an edge needs to be synchronized. The rest of the work can be executed in any order; in the future, it could even be run multithreaded.
References
My thanks to @carnaval for correcting my convergence algorithm, encouraging me to simplify my state transitions, and generally for teaching me about this algorithm.
Also, his paper on static analysis capabilities for Julia helped me better understand some of the future features that may be desirable for Julia’s inference algorithm, and how those would affect the convergence criteria.
[M02] Markus Mohnen. A Graph-Free Approach to Dataflow Analysis. In R. Horspool, editor, Compiler Construction, volume 2304 of Lecture Notes in Computer Science, pages 185–213. Springer Berlin / Heidelberg, 2002.
[CC77] Patrick Cousot and Radhia Cousot. Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints. POPL 1977.

1. There are a finite number of user types defined, but the type parameters can be arbitrarily nested in depth, and there are tuples and unions, which (in addition to depth) can have arbitrary length. This permits the following definitions that together characterize the size of a type:

   `depth(type) = 1 + maximum(depth, type.parameters)`
   `length(type) = length(type.parameters)`

   Limiting the depth of a `Type` can be achieved by replacing some type parameters with a `TypeVar` of its supertype: `Type{Type{Type{Type{Int}}}} => Type{Type{T <: Type}}`. Limiting the length of a `Union` can be achieved by replacing several elements with their common supertype: `Union{Int, Float32, Float64} => Union{Int, FloatingPoint}`. Limiting the length of a `Tuple` can be achieved by replacing the tail types with a `Vararg`: `Tuple{Int, Int, Int} => Tuple{Int, Vararg{Int}}`. ↩
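Under a toy representation of types as `(name, parameters)` pairs, these two measures can be computed directly. This is an illustrative Python sketch of the definitions above, not Julia’s implementation.

```python
# Toy depth/length measures over a nested-type representation:
# a type is a (name, params) pair, where params is a tuple of types.
def depth(t):
    """1 for a leaf type; otherwise 1 + the deepest parameter."""
    name, params = t
    if not params:
        return 1
    return 1 + max(depth(p) for p in params)

def length(t):
    """Number of immediate type parameters."""
    return len(t[1])

INT = ("Int", ())
# Type{Type{Type{Int}}} in this toy encoding:
t = ("Type", (("Type", (("Type", (INT,)),)),))
```

A depth-limiting heuristic would compare `depth(t)` against a fixed bound and replace too-deep parameters with a `TypeVar`, exactly as sketched in the transformations above.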


2. The “lowered” form of code is a simplified representation of the body of the function which is more efficient to analyze. ↩

3. The element type of the `Array` returned from a comprehension is defined as the result that type inference was able to compute. This makes the correctness of the expected return type for a comprehension dependent on the behavior and quality of type inference on the containing function. This is not desirable, since the behavior of the language spec is supposed to be independent of the result of the type-inference optimization. ↩

4. The `static_typeof` expression is an unusual operation that allows using the expected type of a variable prior to that variable’s actual definition. This odd behavior has added another dimension to the complexity of reaching convergence. This construct has been used to implement comprehensions. For example, the code:

       c = [f(x) for x in iter]

   is handled by inference by first representing it as:

       c = Array{$(Expr(:static_typeof, :fx)), 1}(0)
       for x in iter
           fx = f(x)
           $(Expr(:type_goto, :fx))
           push!(c, fx)
       end

   Typically, inference (in Julia) is done as a flow-sensitive analysis. However, `static_typeof` explicitly violates that condition. Indeed, the placeholder variable `fx` may not ever be defined (the type of that variable is thus the empty set: `Union{}`).

   Note: inference actually uses a more simplified representation than this. I’ve taken the liberty here to illustrate the approximate result of just this one transform in quasi-Julia syntax to preserve clarity. The fully lowered^{2} form that `code_lowered(func, argtypes)` returns performs a similar transform while simultaneously simplifying the expression form to permit easier analysis.

   The constraint that makes this computationally feasible is that no assignment to `fx` may occur after any usage of `c` in any execution path (or, equivalently: there is no execution path in which a usage of `c` occurs before the assignment to `fx`). This structurally prohibits the type of `fx` from depending recursively on the type of `c`.

   One issue remains, which is that initially the type of `fx` is `Union{}`. Normally, this would signal that an error was thrown on that line and stop inference from proceeding to the next statement. For a `static_typeof` statement, the first iteration must instead proceed as if that statement will be computable in the future, then resume back at that point after the type information is known. However, this can cause the initial type inference result for `c` to end up being `Union{Array{Union{}}, Array{T}}`, even though the actual type of `c` would be `Array{T}`. This type over-approximation can be seen, for example, in the type inference result of a nested comprehension. In order to narrow this type, inference restarts while preserving only the `static_typeof` variable inputs. This step is then iterated until the types of these variables remain unchanged across one of these restart iterations. ↩
5. For the correctness of the algorithm, it doesn’t matter which function is selected. Conceptually, it could be imagined that a function is selected by calling `rand`. In practice, the algorithm tries to arrange the queue such that there is likely progress that can be made on the next item. But the invariants required by this implementation of the algorithm even allow that the function could be selected from `stuck` or `stuck´` instead. ↩