
Data Structure Design in CSV.jl

By: Jacob Quinn

Re-posted from: https://quinnj.home.blog/2020/06/07/data-structure-design-in-csv-jl/

I’m currently working through a refactor of CSV.jl internals and once again am faced with some tricky questions surrounding which data structures to use. In csv reading, as in many things, there are trade-offs between performance/efficiency and convenience: we want to read delimited files as fast as possible, but we also want the most natural, standard data structures. Some of the complexity in these decisions comes from the csv spec (or lack thereof!) itself, but to fully understand that, let’s talk through the primary task of csv reading.

The aim for CSV.jl is to take plain text data (columns separated by a delimiter, rows separated by newlines), parse the text values into native typed values, and output the data in columnar data structures, like a native Julia Vector. Unfortunately, csv files contain no metadata about how many rows or columns there are, what data types the columns hold, or whether they contain null values. This lack of metadata means we have to use tricks like guessing the number of rows in a file, “detecting” data types, and staying flexible in case we detect one type and need to “promote” to a different data type later. The complexities are compounded when considering multi-threaded parsing, because you either have to play the synchronization dance between threads, or give each thread a chunk of the file and “merge” their results afterwards (the current CSV.jl approach).

Given these processing considerations, here are some of the data structure questions I’m currently pondering:

Vector{Union{T, Missing}} vs. SentinelVector{T}: Base Julia includes an optimization that allows Arrays with isbits Union elements to be stored inline, which means they can be extremely efficient when working with, for example, Union{Float64, Missing}. I’ve been prototyping a new SentinelArrays.jl package that allows wrapping any AbstractArray and designating a “sentinel” value of the array’s element type that should be returned as a different “special value”, like missing. For Float64, for example, we can use a non-standard NaN bit pattern as a sentinel for missing, and this works quite well: we’re basically working with a plain Vector{Float64} in terms of storage, but get the advantage of representing Union{Float64, Missing} through our sentinel.
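
To illustrate the idea, here’s a hand-rolled sketch (not the actual SentinelArrays.jl API, and with an arbitrarily-chosen sentinel bit pattern):

# A hand-rolled sketch of the sentinel idea, read side only:
# storage is a plain Vector{Float64}, and one specific non-standard
# NaN bit pattern is reported as missing on access.
struct SentinelVec <: AbstractVector{Union{Float64, Missing}}
    data::Vector{Float64}
end

const SENTINEL = 0x7ffdeadbeefdead0  # a NaN bit pattern ordinary math won't produce

Base.size(v::SentinelVec) = size(v.data)
function Base.getindex(v::SentinelVec, i::Int)
    x = v.data[i]
    return reinterpret(UInt64, x) == SENTINEL ? missing : x
end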

The pros of Vector{Union{T, Missing}} are that it’s Base Julia, built-in, standard, and much of the data ecosystem has become familiar with the type and integrated it well. The cons are that it takes up a little extra space (1 extra byte per element, signaling whether that element holds a Float64 or a missing), and that there currently isn’t a way to “truncate” the array, like convert(Vector{Float64}, A::Vector{Union{Float64, Missing}}), without copying data. This is desirable because while parsing, we need to assume Union{T, Missing} in case missing values are encountered, but we might finish parsing and know that there were in fact none, in which case we’d like to just return Vector{Float64}. With a SentinelVector, this is trivial: we can just “unwrap” the underlying array and return that, given no sentinels were used while parsing. The disadvantage of SentinelArrays is just that they’re non-standard, though the Julia ecosystem has evolved pretty well to rely on the general AbstractArray interface instead of needing specific array types.
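
Continuing the sketch above, the copy-free “truncation” is just a field access:

# "Truncating" under the sentinel layout sketched above: if no sentinel
# was ever written during parsing, just hand back the raw storage.
unwrap(v::SentinelVec) = v.data  # a plain Vector{Float64}, no copy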

The second question is: how to return/represent String columns? String columns are a bit of an odd duck with respect to parsing because you’re not really “parsing” anything, just noting the start/end positions of each cell in the csv file. And indeed, one representation of a String column is just that: a custom array type that holds on to the original file byte buffer, where each “element” is just the byte position and length of a cell. Upon indexing, the String can be fully materialized, but otherwise this lazy-materializing structure provides a lot of efficiency. The disadvantage of this approach is needing to hold on to the original file buffer, which has caused confusion for users in the past when they try to modify/delete the file after parsing and get errors saying the file is still in use. Another disadvantage is trying to support the full AbstractArray interface with this lazy structure, particularly with regard to mutating operations (push!, append!, etc.). The WeakRefStrings.jl package provides a structure that can mostly be used for this kind of task, but there are a few internal mismatches with how the positions/lengths are represented. The alternative of just materializing a full Vector{String} has the advantage of being extremely standard and easy to work with, but it’s expensive: each string cell in the file must be copied/allocated, even if it never ends up getting used.
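
Here’s a rough sketch of what such a lazy column could look like (the position/length packing is hypothetical, and not how WeakRefStrings.jl represents things):

# Sketch of a lazy string column: hold the file's byte buffer plus a
# packed (position, length) per cell; a String is only allocated on indexing.
struct LazyStringVec <: AbstractVector{String}
    buf::Vector{UInt8}        # original file bytes
    poslens::Vector{UInt64}   # upper 48 bits: byte position; lower 16: length
end

Base.size(v::LazyStringVec) = size(v.poslens)
function Base.getindex(v::LazyStringVec, i::Int)
    pl = v.poslens[i]
    pos = Int(pl >> 16)
    len = Int(pl & 0xffff)
    return String(v.buf[pos:pos + len - 1])  # materialize only this cell
end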

The third data structure question involves multi-threaded parsing: how to best return/represent columns when multiple threads are involved? As noted earlier, CSV.jl currently chunks up large files and lets each thread process a chunk separately: detecting types, guessing rows, the full parsing task. After each thread has finished, a “merge” operation is performed where columns are promoted together and recoded to ensure that each has a consistent type. CSV.jl currently defines its own CSV.Column2 type (admittedly not the greatest name 😜) that “chains” each thread’s column together into a single “column”. This seems to work pretty well in practice, with iteration still being extremely efficient, but there’s a gotcha if you try to do

for i = 1:length(column)
    x = column[i]
    # do stuff with x
end

That is, linear getindex operations are slower (each lookup is O(log N)) because the chained column has to determine which underlying array i belongs to.
The most obvious alternative is to just vcat all the thread columns together, but my worry there is that we’d essentially double the required memory for parsing, if only for the brief operation of appending the columns together.
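
For reference, the chained idea boils down to something like this sketch (CSV.Column2’s internals differ, but the lookup trade-off is the same):

# Sketch of a "chained" column: one logical array backed by each thread's
# result, with getindex binary-searching the chunk boundaries.
struct ChainedVec{T} <: AbstractVector{T}
    chunks::Vector{Vector{T}}
    offsets::Vector{Int}  # cumulative chunk lengths, e.g. [1000, 2000, 3000]
end

ChainedVec(chunks::Vector{Vector{T}}) where {T} =
    ChainedVec{T}(chunks, cumsum(length.(chunks)))

Base.size(v::ChainedVec) = (isempty(v.offsets) ? 0 : last(v.offsets),)
function Base.getindex(v::ChainedVec, i::Int)
    j = searchsortedfirst(v.offsets, i)  # O(log #chunks) on every lookup
    prev = j == 1 ? 0 : v.offsets[j - 1]
    return v.chunks[j][i - prev]
end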

Anyway, these are some of the questions I’ve been sleeping on for the past few days; I don’t have firm commitments to choose one solution over the other, but with the other internal refactoring, I thought it would be good to think through the ideals and see if we can improve things somehow. My ulterior motive in writing this all up in a blog post is to hopefully elicit some discussion/ideas around ways we can accomplish some of the things I’ve discussed here. As always, feel free to ping me on Twitter or the #data julialang slack channel to chat.

Tables.jl 1.0 Release

By: Jacob Quinn

Re-posted from: https://quinnj.home.blog/2020/02/12/tables-jl-1-0-release/

Since its inception on a couple of whiteboards at JuliaCon 2018 in London, the Tables.jl package has grown to become a foundational set of interfaces for data packages in Julia, even accumulating 74 direct dependents in the General registry as of this writing (Feb 2020)!

So why the 1.0 release? An official 1.0 release can signal stability and maturity, particularly in the case of an “interface package” like Tables.jl. With very little change in API since the package started, we figured it was time to send that signal out to the broader community: hey! this package can be pretty useful and we’re not going to break it (intentionally)! It also gives us the opportunity to polish up a few APIs, iron out a few wrinkles, and clean up a lot of documentation.

For those who aren’t familiar, the Tables.jl package is about providing powerful interfaces for “table types” in all their shapes, sizes, and formats, so they can talk with each other seamlessly, and most importantly, without direct knowledge of each other! At the highest level, it provides the Tables.rows and Tables.columns functions, which cover the two most common access patterns for table data (row-by-row, or by entire columns). Table types become valid “sources” by implementing access to their data via Tables.rows or Tables.columns (or both!), and “sinks” can then operate on any source by calling Tables.rows or Tables.columns and processing the resulting data appropriately. A key feature for sinks is the “orientation agnostic” nature of Tables.rows and Tables.columns; i.e. sinks don’t need to worry whether the source is row- or column-oriented by nature, since performant, generic fallbacks are provided both ways. That is, calling Tables.rows on a source that only defined Tables.columns falls back to a generic lazy row iteration over the input columns. And vice versa: calling Tables.columns on a source that only defined Tables.rows will process the input rows to build up columns. This works because it’s work the sink would have had to do anyway; if the sink really needs columns to do its work, it would have had to turn rows into columns itself, so Tables.jl provides that with the best, community-sourced implementation possible.
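
As a small illustration, here’s a toy sink (describe_table is made up, but Tables.columns, Tables.columnnames, and Tables.getcolumn are the real 1.0 API):

using Tables

# A made-up "sink" that only asks for columns; thanks to the generic
# fallbacks it works against any source, row- or column-oriented.
function describe_table(source)
    cols = Tables.columns(source)
    for nm in Tables.columnnames(cols)
        col = Tables.getcolumn(cols, nm)
        println(nm, ": ", eltype(col), " with ", length(col), " values")
    end
end

# NamedTuples of vectors are valid column-oriented sources out of the box:
describe_table((a = [1, 2, 3], b = ["x", "y", "z"]))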

Another core concept, now clarified in the 1.0 release, is the interface for accessing data on an individual row, as well as on a set of columns. It turns out that viewing a “row” as an ordered set of named column values, and “columns” as an ordered set of named columns, lends itself naturally to a common interface, simplified implementations, and the ability to provide powerful default functionality. Check out the new docs, particularly about Tables.AbstractRow and Tables.AbstractColumns, to learn more. You can also check out the Discourse announcement, which is geared a little more towards 1.0 release notes and an upgrade guide.
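
For a flavor of what implementing the row interface involves, here’s a toy sketch (see the docs for the authoritative set of required methods):

# Toy sketch of a custom row hooking into Tables.AbstractRow; note the use
# of getfield, since AbstractRow overloads getproperty for column access.
struct MyRow <: Tables.AbstractRow
    names::Vector{Symbol}
    values::Vector{Any}
end

Tables.columnnames(r::MyRow) = getfield(r, :names)
Tables.getcolumn(r::MyRow, i::Int) = getfield(r, :values)[i]
Tables.getcolumn(r::MyRow, nm::Symbol) =
    getfield(r, :values)[findfirst(==(nm), getfield(r, :names))]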

A recent popular blog post highlighted features of Julia that lend themselves to such widespread composability between packages, and Tables.jl is a powerful example of this. Through just a few simple APIs, it allows a DataFrame to be serialized to disk in the binary feather format, read back, converted to JSON to be sent over the web, and loaded into a MySQL database. All without any of those packages knowing about each other, and without needing a single intermediary table type to convert between. Applications and processing tasks can feel free to leverage any number of in-memory representations, disk formats, and databases to handle table manipulations.

Here’s to more integrations, more composability, and more table types in the future!

BYO-Closures For Performance

By: Jacob Quinn

Re-posted from: https://quinnj.home.blog/2019/12/31/byo-closures-for-performance/

Some may be familiar with the idea of closures, which, in short, are local functions that capture state from their enclosing context. Closures are, of course, supported in Julia, where they often just look like anonymous functions; the docs mention that “functions in Julia are first-class objects”, which means you can think of them as being defined in the language itself. Indeed, we could take the example of the exponent method for IEEEFloats in Base, which is defined (slightly abbreviated) as:

function exponent(x::T) where T<:IEEEFloat
    xs = reinterpret(Unsigned, x) & ~sign_mask(T)
    k = Int(xs >> significand_bits(T))
    if k == 0 # x is subnormal
        m = leading_zeros(xs) - exponent_bits(T)
        k = 1 - m
    end
    return k - exponent_bias(T)
end

And think of this method definition being “lowered” to:

struct exponentFunction <: Function
end

function (f::exponentFunction)(x::T) where T<:IEEEFloat
    xs = reinterpret(Unsigned, x) & ~sign_mask(T)
    k = Int(xs >> significand_bits(T))
    if k == 0 # x is subnormal
        m = leading_zeros(xs) - exponent_bits(T)
        k = 1 - m
    end
    return k - exponent_bias(T)
end

const exponent = exponentFunction()

So here we’re defining a struct exponentFunction, which is a subtype of Function, the abstract type all function types inherit from (you can check this yourself by querying supertype(typeof(Base.exponent))). Then we’re defining a method with some unusual syntax, function (f::exponentFunction)(x::T), to make instances of exponentFunction callable, like exponentFunction()(3.14). Finally, we declare our const exponent to just be an instance of our exponentFunction; such a callable object is often known as a “functor”. (Functors are covered in more detail in the Julia manual.)

Ok, so why start off a blog post going over a bunch of stuff in the JuliaLang docs manual? Well, in a recent refactoring, I ran into a decently well-known performance issue with closures; a few different solutions have been suggested, but none quite fit my use-case. Now, I have to admit to not fully understanding the fundamental language issue causing the performance hit here; what I do understand from my own refactorings/use is that when you try to use a variable that gets captured as closure state after the closure definition/use, the variable’s value ends up being put into a Core.Box (which massively affects performance, because the variable’s inferred type is then essentially Any and every use relies on runtime/dynamic dispatch).

Part of my aforementioned refactoring involved moving some common code into higher-order functions that applied functor arguments to each field of a struct, for example, which meant my code was now relying on closures passed to those higher-order functions. Luckily, I tend to use my favorite JuliaLang feature, its code-inspection tools (see @code_typed, for example), to see what core functions are getting inferred/lowered to, and I noticed a bunch of red flags in the form of Core.Box for key variables. Digging a little further, it became clear that I was a victim of issue #15276 and would need to figure out a solution. One solution suggested in the issue thread was to use let blocks, rebinding closure-captured state variables to make it explicit which variables will be captured. In my case, however, I needed the variables to be updated within the closure and then needed those updated values afterwards (my parsing functions pass the current parsing state down into the closures and then need to pass it along to parse the next object).
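
To make the problem concrete, here’s a minimal, generic illustration of the boxing issue and the let-block workaround (not my actual parsing code):

# Classic illustration of the closure-boxing issue (julia issue #15276):
# reassigning a captured variable after the closure is created forces it
# into a Core.Box, so its inferred type degrades to Any.
function boxed()
    x = 1
    f = () -> x + 1   # x is captured here...
    x = 2             # ...and reassigned here, so x gets boxed
    return f()
end

# The let-block workaround introduces a new binding that is never
# reassigned from the closure's point of view, so no box is needed:
function unboxed()
    x = 1
    f = let x = x     # fresh binding, initialized to x's current value
        () -> x + 1
    end
    x = 2             # no longer affects the closure's x
    return f()
end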

So my solution? Well, an extremely unique feature of the Julia language is how much of the language is written in itself, and how many major constructs are true, first-class citizens of the language. So I decided to write my own closure object!

mutable struct StructClosure{T, KW}
    buf::T      # buffer being parsed
    pos::Int    # current parsing position; updated as values are read
    len::Int    # length of the buffer
    kw::KW      # keyword options threaded through to readvalue
end

@inline function (f::StructClosure)(i, nm, TT)
    pos_i, x_i = readvalue(f.buf, f.pos, f.len, TT; f.kw...)
    f.pos = pos_i  # mutate our "captured state" so the caller sees the new position
    return x_i
end

So, similar to our exponent example before, we define a StructClosure functor object, which this time has a few fields representing the closure-captured state variables we need access to inside our actual function code. Also note that we made our functor a mutable struct because, in our function, we actually want to update our position variable after we’ve read a value (f.pos = pos_i).

We end up using our home-grown closure like:

@inline function read(::Struct, buf, pos, len, b, ::Type{T}; kw...) where {T}
    if b != UInt8('{')
        error = ExpectedOpeningObjectChar
        @goto invalid
    end
    pos += 1
    @eof
    b = getbyte(buf, pos)
    @wh
    if b == UInt8('}')
        pos += 1
        return pos, T()
    elseif b != UInt8('"')
        error = ExpectedOpeningQuoteChar
        @goto invalid
    end
    pos += 1
    @eof
    c = StructClosure(buf, pos, len, kw)
    x = StructTypes.construct(c, T)
    return c.pos, x

@label invalid
    invalid(error, buf, pos, T)
end

So we first create an instance of our closure, c = StructClosure(buf, pos, len, kw), and then pass it to the higher-order function: x = StructTypes.construct(c, T). Finally, you’ll notice how we return our closure’s updated state at the end with return c.pos, x. How’s the performance? Back in line with our fully-unrolled, pre-higher-order-function code. Ultimately, this felt like a pretty simple, even clever solution to clean up my code and use some common, well-tested higher-order functions to do some fancier code unrolling.

As always, hit me up on Twitter with any comments or questions and I’m happy to discuss further.