Generated constant propagation; hold on to your britches!

By: Jacob Quinn

Re-posted from: https://quinnj.home.blog/2019/06/20/generated-constant-propagation-hold-on-to-your-britches/

In a recent JuliaLang Discourse post, the OP was looking for a simple way to convert their Vector{CustomType} to a DataFrame, treating each CustomType element as a “row”, with the fields of CustomType as fields in the row. The post piqued my interest, given my work on the Tables.jl interface over the last year. Tables.jl is geared specifically towards creating generic access patterns to “table”-like sources, providing Tables.rows and Tables.columns as the two ways to access table data from any source. Tables.columns returns a “property-accessible object” of iterators, while Tables.rows returns the dual: an iterator of “property-accessible objects” (a property-accessible object here being any object that supports both propertynames(obj) and getproperty(obj, prop), which includes all custom structs by default).

I decided to respond by showing how simple it can be to “hook in” to the Tables.jl world of interop; my code looked like:

using DataFrames, Random, Tables

struct Mine
    a::Int
    b::Float64
    c::Matrix{Float64}
end

Tables.istable(::Type{Vector{Mine}}) = true
Tables.rowaccess(::Type{Vector{Mine}}) = true
Tables.rows(x::Vector{Mine}) = x
Tables.schema(x::Vector{Mine}) = Tables.Schema((:a, :b, :c), Tuple{Int, Float64, Matrix{Float64}})

Random.seed!(999)
v = [Mine(rand(1:10), rand(), rand(2,2)) for i ∈ 1:10^6];
df = DataFrame(v)

But when I did a quick benchmark of the conversion from Vector{Mine} to DataFrame, it was 3x-5x slower than the other methods already proposed in the thread. Huh? I started digging. One thing I noticed was that by ***not*** defining Tables.schema, the code was actually a bit faster! That led me to suspect something was up in our schema-based row-to-column code.
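For context, the benchmark itself was nothing fancy (a sketch using BenchmarkTools; the exact timings from the original experiment aren’t reproduced here):

using BenchmarkTools

# with the Tables.schema method defined as above, the schema-aware
# row-to-column path is used
@btime DataFrame($v);

# running the same benchmark in a fresh session *without* the Tables.schema
# definition exercises the unknown-schema fallback, which turned out faster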

Part of the challenge with this functionality is wrestling with how to provide a fast, type-stable method for building up strongly-typed columns from a set of rows with an arbitrary number of columns and types. Just iterating over a row in a for-loop, accessing each field programmatically, can be disastrous: because each extracted field value in the hot for-loop could be one of several types, the compiler can’t do much more than create a Core.Box to put arbitrary values in, then pull them back out for inserting into the vector we’re building up. Yuck, yuck, yuck; if you see Core.Boxes in your @code_typed output, or values inferred as Any, it’s an indication the compiler just couldn’t figure out anything more specific, and in a hot for-loop called over and over and over again, any hope of performance is toast.
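To make that concrete, here’s a naive, non-generated rows-to-columns conversion for the Mine struct above (a sketch; naive_columns is just an illustrative name):

function naive_columns(rows::Vector{Mine})
    cols = (Int[], Float64[], Matrix{Float64}[])
    for row in rows
        for i = 1:3
            val = getfield(row, i)   # runtime index: val is not concretely inferred
            push!(cols[i], val)      # so every push! is a dynamic dispatch
        end
    end
    return cols
end

@code_warntype naive_columns(v)   # highlights the non-concrete types in the loop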

Enter meta-programming: Tables.jl tries to judiciously employ @generated functions to help deal with the issue here. Specifically, when doing this rows-to-columns conversion with a known schema, Tables.eachcolumn can take the column names/types as input and generate a custom function with the aforementioned “hot for-loop” unrolled. That is, instead of:

for rownumber = 1:number_of_rows
    row = rows[rownumber]
    for i = 1:ncols
        column[i][rownumber] = getfield(row, i)
    end
end

For a specific table input’s schema, the generated code looks like:

for rownumber = 1:number_of_rows
    row = rows[rownumber]
    column[1][rownumber] = getfield(row, 1)
    column[2][rownumber] = getfield(row, 2)
    # etc.
end

What makes this code ***really*** powerful is the compiler’s ability to do “constant propagation”: while compiling, when it encounters getfield(row, 1), it can realize, “hey, I know that row is a Mine, and they’re calling getfield with the constant 1, and I happen to know that field 1 of Mine is an Int64, so I’ll throw that information into dataflow analysis so things can get further inlined/optimized”. It’s the difference between the earlier scenario, where the compiler has to treat the naive for-loop values as Any, and this one, where it knows the exact types to expect; that knowledge lets the whole operation boil down to a simple “load” then “store”, with no boxing/unboxing and no runtime dynamic dispatch on types.
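You can watch constant propagation at work directly (an illustration; get_a and get_i are just demo names):

# literal field index: the compiler knows field 1 of Mine is an Int64
get_a(row::Mine) = getfield(row, 1)
@code_typed get_a(Mine(1, 2.0, rand(2,2)))       # inferred return type: Int64

# runtime field index: the best inference can do is the union of all field types
get_i(row::Mine, i::Int) = getfield(row, i)
@code_typed get_i(Mine(1, 2.0, rand(2,2)), 1)    # Union{Float64, Int64, Matrix{Float64}}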

Back to the original issue: why was this code slower than the unknown-schema case? Well, it turns out our generated code was actually trying to be extra clever: it calculated a run-length encoding of consecutive types in the schema, then generated a mini for-loop for each run of identically-typed columns; i.e. the generated code looked like:

for rownumber = 1:number_of_rows
    row = rows[rownumber]
    # columns 1 & 2 are Int64
    for col = 1:2
        column[col][rownumber] = getfield(row, col)
    end
    # column 3 is a Float64
    for col = 3:3
        column[col][rownumber] = getfield(row, col)
    end
    # columns 4 & 5 are Int64 again
    for col = 4:5
        column[col][rownumber] = getfield(row, col)
    end
    # etc.
end

While this is clever, and can generate less code when schema types tend to be “sticky” (i.e. lots of the same type consecutively), the problem is in the getfield(row, col) call. Even though each getfield call ***should*** be type-stable, as calculated from our run-length encoding, the compiler can’t prove it: col is now a runtime loop variable rather than a literal constant, so constant propagation can’t kick in, and inference falls back to the union of all the row’s field types. That means none of the super-inlining/optimizations happen, so things end up being slower.
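Here’s a tiny demonstration of the problem (a hypothetical struct; even though both fields in the "run" are Int64, the loop variable defeats constant propagation):

struct Row
    a::Int64
    b::Int64
    c::Float64
end

# sum over the Int64 "run" (fields 1 and 2): col is a runtime value, so
# getfield(row, col) infers as Union{Float64, Int64}, not Int64
function sum_int_run(row::Row)
    s = 0
    for col = 1:2
        s += getfield(row, col)
    end
    return s
end

@code_warntype sum_int_run(Row(1, 2, 3.0))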

So what’s the conclusion here? What can be done? The answer was pretty simple: instead of trying to be extra clever with run-length encodings, if the schema is small enough, just generate the getfield calls directly for each column. The current heuristic is that for schemas with fewer than 100 columns, the code is generated fully unrolled, and for larger schemas, we fall back to the run-length encoding approach (since it’s still more efficient than the initial naive approach).
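A minimal sketch of that heuristic (not Tables.jl’s actual implementation, and the large-schema branch here is a plain loop rather than the real run-length encoding; fillrow! and the cutoff constant are illustrative):

@generated function fillrow!(columns, row, rownumber)
    N = fieldcount(row)   # inside @generated, `row` is bound to the argument's type
    if N < 100
        # fully unrolled: each getfield gets a literal index the compiler
        # can constant-propagate to a concrete field type
        exprs = [:(columns[$i][rownumber] = getfield(row, $i)) for i = 1:N]
        return Expr(:block, exprs..., :(return nothing))
    else
        # very wide schemas: generate a loop to keep code size bounded
        return quote
            for i = 1:$N
                columns[i][rownumber] = getfield(row, i)
            end
            return nothing
        end
    end
end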

Hope you enjoyed a little dive under the hood of Tables.jl and some of the challenges the data ecosystem faces in making the most of Julia’s powerful type system. Ping me on twitter with any thoughts or questions.

Cheers.

Tricksy Tuple Types…

By: Jacob Quinn

Re-posted from: https://quinnj.home.blog/2019/06/19/tricksy-tuple-types/

Some of you may be aware of my obsession with JSON libraries, and it’s true: there’s something about simple data formats that sends my brain into endless brainstorming of ways to optimize reading, writing, and object-mapping in the Julia language. JSON3.jl is my latest attempt at a couple of new ideas for JSON <=> Julia fun. The package is almost ready for a public release, and I promise I’ll talk through some of the fun ideas going on there, but today I just wanted to point out a tricky performance issue that took a bit of sleuthing to track down.

Here’s the scenario: we have a string of JSON like {"a": 1}, super simple right? In the standard Julia JSON.jl library, you just call JSON.parse(str) and get back a Dict{String, Any}. In JSON3.jl, we have a similar “plain parse” option, JSON3.read(str), which returns a custom JSON3.Object type that I can talk about in another post in more detail. Another option in JSON3.jl is JSON3.read(str, Dict{String, Any}), i.e. we can specify the type we’d like to parse from any string of JSON. While doing some quick benchmarking to make sure things looked reasonable, I noticed that JSON3.read(str, Dict{String, Any}) was about 2x slower than both JSON.parse and JSON3.read(str, Dict{String, Int}). Hmmm, what’s going on here??
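For concreteness, the flavors being compared look like this (a quick sketch):

using JSON, JSON3

str = """{"a": 1}"""

JSON.parse(str)                      # Dict{String, Any}
JSON3.read(str)                      # JSON3.Object
JSON3.read(str, Dict{String, Any})   # parse into the requested type
JSON3.read(str, Dict{String, Int})   # same, but with concrete value types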

I first turned to profiling, and used the wonderful StatProfilerHTML.jl package to inspect my profiling results. That’s when I noticed around 40% of the time was spent on a seemingly simple line of code:
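(The snippet below is a reconstruction from the typed IR shown further down; pos, isint, int, and float are illustrative names, not the actual ones in JSON3’s source.)

# return the new parsing position along with the parsed number, which can
# be either an Int64 or a Float64
return pos, ifelse(isint, int, float)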

Hmmmm……a return statement with a simple ifelse call? Seems fishy. Luckily, there’s a fun little project called Cthulhu.jl, which pairs debugger-style “stepping” with Julia’s unparalleled code inspection tools (@code_lowered, @code_typed, @code_llvm, etc.). As I “descended into madness” to take a look at the @code_typed of this line of code, I found this:

%1865 = (JSON3.ifelse)(%1864, %1857, %1851)::Union{Float64, Int64}
%1866 = (Core.tuple)(%1853, %1865)::Tuple{Int64,Union{Float64, Int64}}

Ruh-roh Shaggy…….the issue here is this Tuple{Int64,Union{Float64,Int64}} return type. It’s not concrete, and it leads to worse type inference in later code that tries to access the tuple’s second element. It’s also undesirable because we know the value must be either an Int64 or a Float64, so ideally we’d structure things so that code generation can do a single branch and generate nice clean code the rest of the way down. If we change the code to:
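(Again a reconstruction matching the IR below, with illustrative names.) The fix splits the ifelse into an explicit branch, so each return site has a concrete tuple type:

if isint
    return pos, int      # Tuple{Int64, Int64}
else
    return pos, float    # Tuple{Int64, Float64}
end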

Let’s take another cthulhic descent and check out the generated code:

%1863 = (%1857 === %1862)::Bool
│ │ @ float.jl:484 within `==' @ float.jl:482
│ │┌ @ bool.jl:40 within `&'
│ ││ %1864 = (Base.and_int)(%1861, %1863)::Bool
│ └└
└──── goto #691 if not %1864
@ /Users/jacobquinn/.julia/dev/JSON3/src/structs.jl:330 within `read' @ /Users/jacobquinn/.julia/dev/JSON3/src/structs.jl:99
690 ─ %1866 = (Core.tuple)(%1853, %1857)::Tuple{Int64,Int64}
└──── goto #693
@ /Users/jacobquinn/.julia/dev/JSON3/src/structs.jl:330 within `read' @ /Users/jacobquinn/.julia/dev/JSON3/src/structs.jl:101
691 ─ %1868 = (Core.tuple)(%1853, %1851)::Tuple{Int64,Float64}
└──── goto #693

Ah, much better! Though there are a few more steps, we can now see we’re getting what we’re after: the return type will be Tuple{Int64,Int64} or Tuple{Int64,Float64} instead of Tuple{Int64,Union{Int64,Float64}}. And the final performance results? Faster than JSON.jl!

Thanks for reading and I’ll try to get things polished up in JSON3.jl soon so you can take it for a spin.

Feel free to follow me on twitter, ask questions, or discuss this post there 🙂

Cheers.