Julia Computing has recently completed an initial implementation of the STAC-A2 benchmark. STAC-A2 is an industry standard benchmark suite for testing compute-intensive analytic workloads for options pricing and risk management. The primary application defined in the STAC-A2 benchmark consists of the computation of various “Greeks”, derivatives of the calculus variety, for a particular type of exotic option, derivatives of the financial variety. The main benchmark includes a Monte Carlo simulation of the underlying security using the Andersen QE formulation of the Heston stochastic volatility model followed by American-exercise pricing of the option using the Longstaff and Schwartz method. The particular option being evaluated is a lookback, best-of option pricing a basket of securities. The benchmark tests scaling of an implementation by discretizing the simulation over a cube of time-steps, Monte Carlo paths, and number of assets in the current basket.
The initial Julia implementation of the STAC-A2 benchmark is a single process, single threaded implementation of the benchmark using Julia 0.4.5 on an Amazon Web Services (AWS) Elastic Compute Cloud (EC2) r3.2xlarge server. This initial Julia implementation mirrors the sequential algorithms of the example STAC-A2 reference model. Access is available to the unaudited report, configuration disclosure, and code for the current Julia implementation to STAC members with a premium subscription.
Recommender systems play a pivotal role in various business settings like e-commerce sites, social media platforms, and other platforms involving user interaction with other users or products. Recommender systems provide valuable insights to gain actionable
intelligence on these users.
Large Scale Recommender systems help in unraveling the latent information in the complex relational data between users and items.However mapping the users space to the items space to predict the interaction is a challenge. Inferring actionable information from a variety of data sources collected either implicitly like click patterns, browser history etc, or explicitly like ratings of books and movies, is what well-designed recommender systems do consistently well.
Matrix Factorizations
Depending on the source of information on the users and the items, there are a variety of techniques to build recommender systems, each with a unique mathematical approach. Linear algebra and matrix factorisations are important to certain types of recommenders where user ratings are available and it is most ideal to apply methods like svd in such cases.
In matrix factorization the users and items are mapped onto a joint latent factor space of reduced dimension f, and the inner product of the user vector with the item vector gives the corresponding interaction. Dimensionality reduction is mainly about a more compact representation of the large training data which is obtained by matrix factorization. We want to quantify the nature or the characteristics of the movies defined by a certain number of aspects (factors), i.e., we are trying to generalize the information (independent and unrelated ratings matrix) in a concise and descriptive way.
Example :
Let us consider a simple example to figure out how matrix factorization helps in predicting the likelihood of a user liking a movie or not.
For sake of brevity, we have couple of users, Joe and Jane and couple of movies, Titanic and Troll 2. The users and the movies are characterized based on certain number of factors as show in the below tables.
Factors/Movies
Titanic
Troll 2
Romance
4
1
Comedy
2
4
Box Office success
5
2
Drama
3
2
Horror
1
4
Factors/Movies
Joe
Jane
Romance
4
1
Comedy
2
4
Box Office success
5
2
Drama
3
2
Horror
1
4
Consider Joe to be characterized by vector [4 2 5 3 1], which suggests that Joe likes Romance and big hit movies and not so much horror or comedy. Similarly Jane likes comedy horror and she is not very particular about box office success of the movies, neither is she a big fan of romance movies.
The movies Titanic, is a popular romance movie, where as the movie Troll 2, is not so popular and horror comedy. It is intuitively obvious that Joe will end up liking Titanic and Jane will like Troll 2. This is based on how the users and movies score on the 5 factors. Using Cosine distance as shown in the below table, confirms this.
Factors/Movies
Joe
Jane
Titanic
0.94
0.67
Troll 2
0.50
0.97
With large rating data matrix, like in the NETFLIX dataset which had around 20 thousand movies and 0.5 million users, mapping all the users and the movies in the above way is impossible. This is where matrix factorization helps in factoring the Rating matrix into user matrix and movie matrix.
Alternating Least Squares
Let be the user feature matrix where and , and let be the item or
movie feature matrix, where and . Here is the number of factors, i.e., the reduced dimension or the lower rank, which is determined by cross validation. The predictions can be calculated for any user-movie combination,
, as .
Here we minimize the loss function of and as the condition in the iterative process of obtaining these matrices. Let us start by considering the loss due to a single prediction in terms of squared error:
begin{equation}
mathcal{L}^2(r,{u},{m})=(r-<{u},{m}>)^2.
end{equation}
Based on the above equation generalizing it for the whole data set, the
empirical total loss as:
begin{equation}
mathcal{L}^{emp}(R,U,M)=frac{1}{n} sum_{(i,j) in
I}mathcal{L}^2(r_{ij},{u_i},{m_j}),
end{equation}
where is the known ratings dataset having ratings.
Julia recommender system
The package RecSys.jl is a package for recommender systems in Julia, it can currently work with explicit ratings data. For preparing the input create an object of ALSWR type. This takes two input parameters, firstly input file location, and second optional input is the variable par which specifies the type of parallelism. The parallelism is about how the data is shared/distributed across the processing units. When par=ParShemm the data is present at one location and is shared across the processing units, when par=ParChunk the data is distributed across the processing units as chunks. For this report only sequential timings were captured, i.e., with nprocs=1.
The call to the function to create a model is train(rec, 10, 10) where 10 is the number of iterations to run and 10 is the number of factors.
Performance Analysis
Sequential version
The sequential performance of the ALS algorithm is tested on Apache Spark and Julia. The scala example code shown in the mentioned link was run with rank = 10 and iterations = 10. The timing of the ALS.train() function is recorded in order to analyse the core computational part only. For the same parameters in Julia, the timings for the computationally intensive train() function is captured.
The algorithm took around 500 seconds to train on the NETFLIX dataset on a single processor, which is good for data as large as 1 billion ratings.
The below table also summarises the performance(single processor) on various other datasets like the Movielens and lastFM.
Parameters/Datasets
Size (No. of interactions)
Factorization time (in secs)
Movielens
20 Million
119
Last.fm
0.5 Billion
2913
The NETFLIX dataset is not available publicly anymore, however datasets for movielens and lastfm can be downloaded. Please refer the dataset specific julia example scripts in the examples/ directory for more details on how to model the recommender system for the respective datasets.
Parallel version
Parallelism is made possible in Julia mainly 2 ways, a). Multiprocessing and b). Multithreading. The multithreading development is onging. However the multiprocessing based parallel processing in Julia is mature and mainly based around Tasks which are concurrent function calls. The implementation details are not covered here, the following graph summarises the performance of parallel ALS implementation in Julia and Spark,
In the above graph, Julia Distr breaks up the problem and uses Julia’s distributed computing capabilities. Julia Shared uses shared memory through mmap arrays. Julia MT is the multithreading version of the ALS. While multi-threading in Julia is nascent, it already gives parallel speedups. There are several planned improvements to Julia’s multi-threading that we expect will make the multi-threaded ALS faster than the other parallel implementations.
The experiments were conducted by invoking spark with flags --master local[1]. The experiments were conducted on a 30 core Intel Xeon machine with 132 GB memory and 2 hyperthreads per core.
Credits to Tanmay KM for contributing towards the parallel implementation of the package.
Apart from methods to model the data and check for accuracy, there are also abilities to make recommendations for users who have not interacted with items, by picking the most likely items the user would interact with. Hence in RecSys.jl we have a fast, scalable and accurate recommender system which can be used to for end to end system. Currently we are working on a demo of such a recommender system with a UI interface too implemented in Julia.
Julia uses type inference to determine the types of program variables and generate fast, optimized code. But how does it really work? I recently redesigned the implementation of Julia’s type inference algorithm, and decided to blog what I’ve learned. If you’re already comfortable with these algorithms, I might not say anything new. But since only 26 people have committed any changes to Julia’s inference.jl file, I suspect that this is an unfamiliar topic for most people.
However, I’m not going to focus on explaining the typical algorithms, but instead focus on some of the implementation challenges.
A high level view of type inference when approached as a data-flow problem (as Julia does) is that it involves running an “interpreter” on the program, but only looking at types instead of values. Unlike normal program execution, this process can be proven to always converge and terminate if certain criteria are met. A typical explanation of the algorithm might end with a statement such as “now iterate to convergence”. But this simple phrase masks a significant design complexity, which can impact the speed and precision of the algorithm. This is the component of the Julia algorithm that needed a significant redesign to address a particular recursion bug and make inference on self-recursive functions faster.
Basic Algorithm
Many algorithms for type inference exist. For statically typed languages, the algorithm of choice is usually a flow-insensitive unification-based algorithm like Hindley-Milner. Julia’s usage of data-flow analysis, however, is better suited for dynamic typing. This approach allows, for example, variables to have different types at different points in the program, adding extra flexibility and convenience. Unlike static typing systems, type inference is used as an optimization, rather than to guarantee correctness at compile time. To succeed, it does not need to prove a unique tightest bound on types, so it is allowed to widen (over-approximate) types heuristically. As we will see, these heuristics have an interesting design space.
I have illustrated Julia’s algorithm below for a simple for loop. In the animation below, pc is the “program counter” indicating the line currently being processed and pc´ is highlighted to indicate the lines that remain to be processed next:
function sum(list)
total = 0
for item in list
total += item
end
return total
end
function sum(list::Vector{Float64}) # <--pc
#{list::Vector{Float64}; total::#undef; item::#undef}
total = 0
#unreached
for item in list
#unreached
total += item
#unreached
end
#unreached
return total
#unreached
end
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}total = 0
#{list::Vector{Float64}; total::#undef; item::#undef}
for item in list
#unreached
total += item
#unreached
end
#unreached
return total
#unreached
end
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}total = 0 # <--pc
#{list::Vector{Float64}; total::#undef; item::#undef}; total::Int64
for item in list
#unreached
total += item
#unreached
end
#unreached
return total
#unreached
end
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}
total = 0
#{list::Vector{Float64}; total::#undef; item::#undef}for item in list
#{list::Vector{Float64}; total::Int64; item::#undef}
total += item
#unreached
end; {list::Vector{Float64}
#unreached
return total
#unreached
end
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}
total = 0
#{list::Vector{Float64}; total::#undef; item::#undef}for item in list <--pc
#{list::Vector{Float64}; total::Int64; item::#undef}; item::Float64
total += item
#unreached
end; {list::Vector{Float64}
#unreached
return total
#unreached
end
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}
total = 0
#{list::Vector{Float64}; total::#undef; item::#undef}
for item in list
#{list::Vector{Float64}; total::Int64; item::#undef}total += item
#{list::Vector{Float64}; total::Int64; item::Float64}end
#{list::Vector{Float64}; total::Int64; item::#undef}
return total
#unreached
end
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}
total = 0
#{list::Vector{Float64}; total::#undef; item::#undef}
for item in list
#{list::Vector{Float64}; total::Int64; item::#undef}total += item # <--pc
#{list::Vector{Float64}; total::Int64; item::Float64}; total::Float64end
#{list::Vector{Float64}; total::Int64; item::#undef}
return total
#unreached
end
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}
total = 0
#{list::Vector{Float64}; total::#undef; item::#undef}
for item in list
#{list::Vector{Float64}; total::Int64; item::#undef}
total += item
#{list::Vector{Float64}; total::Int64; item::Float64}end
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
return total
#unreached
end
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}
total = 0
#{list::Vector{Float64}; total::#undef; item::#undef}
for item in list
#{list::Vector{Float64}; total::Int64; item::#undef}
total += item
#{list::Vector{Float64}; total::Int64; item::Float64}end # <--pc
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
return total
#unreached
end
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}
total = 0
#{list::Vector{Float64}; total::#undef; item::#undef}for item in list
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
total += item
#{list::Vector{Float64}; total::Int64; item::Float64}
end
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}return total
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
end
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}
total = 0
#{list::Vector{Float64}; total::#undef; item::#undef}for item in list
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
total += item
#{list::Vector{Float64}; total::Int64; item::Float64}
end
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}return total # <--pc
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
end::Union{Float64, Int64}
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}
total = 0
#{list::Vector{Float64}; total::#undef; item::#undef}for item in list
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
total += item
#{list::Vector{Float64}; total::Int64; item::Float64}
end
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
return total
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
end::Union{Float64, Int64}
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}
total = 0
#{list::Vector{Float64}; total::#undef; item::#undef}for item in list # <--pc
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}; item::Float64
total += item
#{list::Vector{Float64}; total::Int64; item::Float64}
end
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
return total
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
end::Union{Float64, Int64}
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}
total = 0
#{list::Vector{Float64}; total::#undef; item::#undef}
for item in list
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}total += item
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Float64}
end
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
return total
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
end::Union{Float64, Int64}
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}
total = 0
#{list::Vector{Float64}; total::#undef; item::#undef}
for item in list
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}total += item # <--pc
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Float64}; total::Float64
end
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
return total
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
end::Union{Float64, Int64}
function sum(list::Vector{Float64})
#{list::Vector{Float64}; total::#undef; item::#undef}
total = 0
#{list::Vector{Float64}; total::#undef; item::#undef}
for item in list
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
total += item
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Float64}
end
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
return total
#{list::Vector{Float64}; total::Union{Float64, Int64}; item::Union{Float64, #undef}}
end::Union{Float64, Int64}
function sum(list::Vector{Float64})
total = 0::Int
for item::Float64 in list::Vector{Float64}
total = total::Union{Float64, Int64} + item::Float64
end
return total::Union{Float64, Int64}
end::Union{Float64, Int64}
In Julia’s data-flow algorithm, inference execution starts with the statements in a function and initializes several states:
The list of the current instruction pointers or active lines is set to be the entry point of the function.
The types of all variables in the function are set to be Union{} (aka Bottom) or the type of the corresponding argument.
The function also has an estimated return type of Union{} (does not return).
Then the algorithm acts like a “virtual machine”, iterating the following process until there are no lines waiting to be examined:
The virtual machine takes one statement from the list of current instruction pointers.
The side-effects of this statement are computed. For example, a variable assignment will result in the type of that variable changing. If the expression is a call, then the virtual machine looks up or computes the current best guess at the return value of that call, given the types of the input arguments. By applying this recursively, a current guess of the types of all of the elements of that statement is produced.
The next line of the function is then added to the list of currently active instruction pointers, and the types of all variables on that next line are merged with the types from the current line, modified by any changes made by the current statement. This is not an easy update to describe in text, since it is merging two sets of types while taking into account the changes made in the previous statement. The best description I can give is that it should do exactly what you would expect: return the set of types possible for the variables on the next line, which is the same as the set of types on the current line, plus any changes made on the current line.
For any if statement or other control operation (while, goto, etc.), the previous operation is performed for all of the possible next statements, instead of just the next one. For example, given an if-statement, inference will take both the if-true and if-false paths through the function (and merge the results at the end).
To avoid infinite looping, the next instruction is only added to the list of current instruction pointers if the type of any variable on that line changed. Otherwise, the algorithm loops repeatedly until all of the types stop changing. The specific algorithm used to widen the types at each iteration can have dramatic effects on the runtime of the algorithm in the worst-cases. For this reason, Julia’s type-merging function typically tries to widen its estimate very quickly.
As I shall discuss next, once convergence has been reached, inference is done. The return type and variable types are marked and cached for code generation and future reuse (for example, if the function is inlined). The function can also then potentially be better optimized now that all of the types are known.
Convergence
Now we come to the real challenge of type inference: reaching convergence.
The data-flow algorithm can be proven to converge if two conditions are met: monotonicity and finiteness.
This first requirement means that merging two types must result in a less specific type, and that knowing less about type for an argument to a function call does not lead to knowing more about its return type. This condition is fairly easy to meet, since all we need to do is describe the behavior of language primitives in a well-behaved way.
The second requirement means that there must be limit on number of types that can be constructed. This condition is more interesting, since many useful type lattices, including Julia’s, do not obey it1. Instead, for the purposes of reaching convergence, the lattice is “forced” to finite height using a technique called “widening”. In widening, heuristics are applied to make types grow much faster than they would in a naive implementation of the algorithm, leading to faster convergence. This ensures that Julia’s type-inference algorithm is guaranteed to reach convergence eventually.
That still leaves the practical challenge of determining how to reach convergence. In Julia, detecting convergence is handled in several parts: Julia has local convergence at the statement level within a function, convergence of static_typeof variables (comprehensions), and global convergence of all functions in the entire call graph.
Historical Perspective
Initially, Julia only had the first test. This made inference quite simple to implement as a recursive descent over the depth-first search on the call graph. At each call, the algorithm pauses inference on the current function to recur into the callee. Eventually a leaf function is reached (one which does not make any calls to an uninferred function) and then the whole stack can unravel, filling in the return type information as it goes, and continuing with iterating convergence on the current function.
There is a slight wrinkle with this plan however if recursion is encountered. If there is a recursion, there is no leaf function. Instead, when entering a new function, the call stack is first checked to see if this function had already been encountered. Then the algorithm needed to take several actions. First, it marked the function as being recursive (toprec) and marked every intervening caller as depending on a recursive function (rec). Then as the recursion unwound, those functions would record that their return type was an incomplete, under-approximation of the real return type and discard their incomplete inference work (marked true in the tfunc cache). The topmost function in the recursion would then iterate this process until its return type reached a fixed-point.
Over time, this test was made more complicated to handle cases where recursion in the call graph can form multiple loops. Until v0.5, the algorithm computed a cycle-id for each recursion detected, to attempt to separate the dependency graph into separate recursion loops. This was not entirely correct because the graph is not always separable, which led to bug #9770 and then PR #9926 to compute it more correctly (I never attempted to prove it was fully correct, just that it was more correct on that issue). Unfortunately, that PR implementation also discarded and repeated significantly more incomplete work while handling recursion. The result was that inference became significantly slower on most tasks (such as inferring a single key handler for the repl), so the PR was never merged. To overcome these limitations, a new approach is required to deal with this case.
I mentioned previously that cycles cause issues for the existing type inference algorithm. The root of the problem is that using recursion to solve for the return types makes it inherently difficult to pause and resume type inference to incorporate new information. This impairs computation of the simultaneous convergence of all of the functions involved a cycle and instead led to a solution where only the top-most function in the call-graph is converged at a time. The resulting rework is very time-expensive in cases with heavy recursion, such as running inference on the inference code itself.
The solution is allow pausing and restarting inference on any part of a function at any time. This could still be written recursively, but once the state has been encapsulated in this way, it is easier to represent the algorithm as pure iteration. Indeed, this has obvious symmetries to inference within a function, where the working state is represented via an integer set of lines to revisit. This should not be particularly surprising, since there exists an isomorphism between statements and functions allowing one to be rewritten in terms of the other (e.g. replacing recursion with iteration or inlining one function into another).
The remaining construct that affects convergence is the static_typeof expression. This construct appears in the lowered2 code for a comprehension is an unusual complication in Julia’s type inference algorithm. It has required special care to make it look like the result of comprehensions are predictable, even though the current design is not3. That is intended to be fixed soon, so I’ve relegated an overview of this construct to the footnotes4.
Corrected Convergence
The corrected convergence algorithm in PR #15300 works iteratively over all functions in the call graph, adding new functions to the queue as it comes across them, until all of them have collectively reached global convergence. When asked for the return type of a function, inference first checks the list in the .tfunc field of the LambdaInfo method definition for whether the return type has been computed previously or is currently being computed (the .tfunc list is a dictionary that maps from a call signature to either the in-process InferenceState object or a pair of inferred function body and return type). If the type signature is not found, it is added to the work queue to be inferred. In any case, the current best-guess is returned. If inference has not finished with the function, an edge is also added to the inference state for these two functions (caller and callee), which tracks the dependency of the return type of the callee on that line number in the caller.
There are several partially-overlapping sets (or states) that a function can be in. I’ll list them all here for your reference, but I won’t fully explain what my terms mean until later:
active: any function that is waiting to be inferred (superset of 2-6)
in-work-queue: any function that is the work queue (yeah, I know, very insightful), which means it is ready for more processing (3)
processing (one function): intermediate state after leaving the work queue while inferring all active lines and before it is then moved to the appropriate follow-on state (4-7)
stuck: not in the work-queue, with edges ≠ ∅ but active lines = ∅
stuck´: not in the work-queue, with edges ≠ ∅ and active lines ≠ ∅
fixed-point: part of a cycle that may have collectively reached a fixed point
finished: done
Any number of functions could be added to the work queue before starting global inference; the process doesn’t necessarily need to be rooted in one caller. Once the loop is called to start inference on everything, one function is selected5 from the work queue for the algorithm to make progress on. It is important that the algorithm is always making progress to ensure that it will eventually terminate (more on this later). The algorithm then iterates over the instructions in the current function until it runs out of lines that can make progress (the algorithm could pause work earlier, but restarting frequently is less efficient and makes it harder to compute the “making progress” metric). Each time the algorithm reaches a call to another function that is not yet inferred, it adds an edge to the inference state for this function.
Once all active lines are processed, the algorithm determines if it is completely finished by counting the number of unresolved edges. If the number is zero, all child functions have also finished inference and the current return type result is fully converged. In that case, the function can be removed from the inference queue and recorded in the .tfunc mapping (moved from the active set to the finished set). Then all parent functions are re-added to the work queue (by following the list of back-edges) so they can check for their own convergence. This is then repeated until the entire queue is empty (*the queue can also become empty due to a recursion cycle – I’ll get to that shortly). Alternatively, the item is moved to the stuck set until the functions that it called have been inferred and can provide their actual return type.
If all active lines have not been processed, the algorithm picks another function off of the work queue and repeats the process. Because each call-edge between two functions is explicitly tracked, each time the return value of a function changes the algorithm can easily re-mark the appropriate lines in all of its callers for revisiting by the type inference algorithm, by following the back-edges (thereby moving the caller from stuck to stuck´).
Now comes the tricky part: dealing with cycles.
I mentioned earlier that the work queue may also become empty due to recursion. When this happens, all of the unresolved edges of the function that was just moved – from work queue to stuck – contain cycles. There are several ways to deal with this situation, depending on how the implementation chooses to represent, detect, and resolve the cycle. The following is a description of the implementation proposed for Julia. In terms of efficiency, this design optimizes for the case where most of the functions in the graph converge quickly (versus the possibility of needing to iterate the cycle many times before reaching a fixed point). Since most Julia code is written to have predictable “type-stable” leaf return types (and inference widens type information very quickly), this is expected to work out well in practice.
When the work queue becomes empty, the algorithm marks the current function as having reached a fixed-point cycle. It then adds all of its non-fixed-point edges to the work queue and marks them as having reached a fixed-point. To guarantee convergence, the following invariant is maintained on this set at all times:
Any function marked as fixed-point must either:
a. Be in the work queue set of functions (or processing)
b. Have all of its edges in the workqueue and marked as fixed-point
This makes it remarkably easy to track this property and maintain these invariants. If the return type of the function changes, it and all of its immediate back-edges are marked as not being at a fixed point (moving them to stuck´). The back-edges of each of those back-edges are then recursively also unmarked (moving them to stuck). If a function marked as fixed-point is retired from processing, it adds all of its edges to fixed-point.
If the work queue is still empty at this point, then the edges of all functions in the fixed-point set are also in the fixed-point set, and the entire cycle has reached convergence (yay!), and can be moved en masse to the finished set.
Finally, if the work queue is still empty, then the active set contained multiple independent call graphs (and all of the remaining ones contain cycles). A random one gets moved to processing to kickstart continued cycle resolution.
Once the active set is empty, inference is done. It is now the compiler’s job to decide what to do with all this information to maximize speed.
But wait! There’s one more complication: in addition to cycles on a function with the same signature, because we specialize each signature on the inferred type signature there can be a cycle for recursively calling a method where the type signature changes at each call.
In this case, various heuristics are required to ensure this can still terminate. The current heuristics place a limit on the complexity of the type signature1 in the recursion to force it to to be finite. But a full discussion of these heuristics will have to wait for a future blog post.
Addendum: Speed
After writing this post, testing of the algorithm above showed that it was unacceptably slower and more memory-hungry than the existing code. So after all of that work to remove all recursion and make it a simple work queue, I made one final tweak: I made it recursive. This made its behavior considerably more complex to analyze, but it pays for its complexity in the speed benefit of being able to infer the common case (non-cyclic functions) without creating a deferred edge. The fall-back, however, is still the corrected looping algorithm given previously. This change however demonstrates one advantage of the states chosen above for the algorithm: it is highly decoupled, so that only the act of making and finalizing an edge needs to be synchronized. The rest of the work can be executed in any order – in the future, it could even be run multi-threaded.
References
My thanks to @carnaval for correcting my convergence algorithm, encouraging me to simplify my state transitions, and generally for teaching me about this algorithm.
Also, his paper on static analysis capabilities for Julia helped me better understand some of the future features that may be desirable for Julia’s inference algorithm, and how those would affect the convergence criteria.
[M02] Markus Mohnen. A Graph-Free Approach to Data-flow Analysis. In R. Horspool, editor, Compiler Construction, volume 2304 of Lecture Notes in Computer Science, pages 185–213. Springer Berlin / Heidelberg, 2002.
[CC77] Patrick Cousot, Radhia Cousot – Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints (SIGPLAN 1977)
There are a finite number of user types defined, but the type parameters can be arbitrarily nested in depth and there are tuples and unions, which (in addition to depth), can have arbitrary length. This permits the following definitions that together characterize the size of a type:
The “lowered” form of code is a simplified representation of the body of the function which is more efficient to analyze. ↩
The element type of the Array returned from a comprehension is defined as the result that type-inference was able to compute. This makes the correctness of the expected return type for a comprehension dependent on the behavior and quality of type-inference on the containing function. This is not desirable, since the behavior of the language spec is supposed to be independent of the result of the type-inference optimization. ↩
The static_typeof expressions are an unusual operation that allows using the expected type of a variable prior to that variable’s actual definition. This odd behavior has added another dimension to the complexity of reaching convergence.
This construct has been used to implement comprehensions. For example, the code:
c = [f(x) for x in iter]
is handled by inference by first representing it as:
c = Array{$(Expr(:static_typeof, :fx),1}(0)
for x in iter
fx = f(x)
$(Expr(:type_goto, :fx))
push!(c, fx)
end
Typically, inference (in Julia) is done as a flow-sensitive analysis. However, static_typeof explicitly violates that condition. Indeed, the placeholder variable fx may not ever be defined (the type of that variable is thus the empty set: Union{}).
Note: inference actually uses a more simplified representation than this. I’ve taken the liberty here to illustrate the approximate result of just this one transform in quasi-Julia syntax to preserve clarity. The fully lowered2 form that code_lowered(func, argtypes) returns performs a similar transform while simultaneously simplifying the expression form to permit easier analysis.
The constraint that makes this computationally feasible is that no assignment to fx may occur after any usage of c in any execution path (or equivalently: there is no execution path in which a usage of c occurs before the assignment to fx). This structurally prohibits the type of fx from depending recursively on the type of c.
One issue remains, which is that initially the type of fx is Union{}. Normally, this would signal that an error was thrown on that line and stop inference from proceeding to the next statement. For a static_typeof statement, the first iteration must instead proceed as if that statement will be computable in the future, then resume back at that point after the type information is known. However, this can cause the initial type inference result for c to end up being Union{Array{Union{}}, Array{T}}, even though the actual type of c would be Array{T}. This type over-approximation can be seen, for example, in the type inference result of a nested comprehension. In order to narrow this type, inference restarts while preserving only the static_typeof variable inputs. This step is then iterated until the types of these variables remain unchanged across one of these restart iterations. ↩
For the correctness of the algorithm, it doesn’t matter which function is selected. Conceptually, it could be imagined that a function is selected by calling rand. In practice, the algorithm tries to arrange the queue such that there is likely progress that can be made on the next item. But the invariants required by this implementation of the algorithm even allow that the function could be selected from stuck or stuck´ instead. ↩