Author Archives: Jacob Quinn

Partition all the datas!

By: Jacob Quinn

Re-posted from: https://quinnj.home.blog/2020/11/13/partition-all-the-datas/

It’s been a while, but I wanted to put up a quick post around some new functionality in the Tables.jl package; as of the v1.1 release, a new function Tables.partitions has been introduced into the Tables.jl interfaces. Tables.partitions adds a “batch processing” layer to the already established Tables.rows/Tables.columns interfaces to allow input source tables to signal that they have natural “batches” or partitions. If a source overloads Tables.partitions, each partition/iteration should return a valid “table”, aka an object that satisfies the rows or columns interfaces.

Why partitions? There are certain workflows and algorithms that work best on “batches” of data, employing the popular map-reduce methodology, or, in the case of extremely large datasets, simply processing the table chunk by chunk.
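To make that concrete, here’s a minimal, hedged sketch (the function name and column are just illustrative) of a consumer that processes any Tables.jl source one partition at a time; for a non-partitioned table, Tables.partitions simply yields the table itself once:

using Tables

# Sum one column of any Tables.jl source, partition by partition,
# without ever materializing the whole table at once.
function partitioned_sum(source, col::Symbol)
    total = 0.0
    for part in Tables.partitions(source)      # each partition is itself a valid table
        cols = Tables.columns(part)            # so the columns interface still applies
        total += sum(skipmissing(Tables.getcolumn(cols, col)))
    end
    return total
end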

What are some concrete uses already out there? Here are a few examples:

  • In the Arrow.jl package, data is naturally stored in “record batches”. The package provides the Arrow.Stream type to allow iterating record batches from arrow data. Arrow.write also allows writing input table partitions out as separate record batches, using separate threads to write each batch.
  • In the CSV.jl package, the CSV.Chunks type is provided for processing extremely large files in “chunks”, where each iteration is a CSV.File object. Because CSV.File is a valid Tables.jl table, CSV.Chunks naturally satisfies the Tables.partitions interface by returning itself. CSV.write also now supports the partition=true keyword argument, which allows writing out multiple files at once using multiple threads, with input partitions written out to separate files.

There have also been efforts to provide several convenience functions that make working with partitions easier. For example:

  • Tables.partitioner: basically a lazy version of map, where each element returned must satisfy the Tables.jl source interface. This allows for easy “partitioning” of multiple tables, where each table is a partition. For example, Tables.partitioner(CSV.File, array_of_csv_filenames) will return an object where Tables.partitions applies CSV.File to each element of the array_of_csv_filenames array. This allows treating a bunch of csv files as a single “long” table, as long as each csv file has the same schema (see the sketch after this list).
  • TableOperations.joinpartitions: can take a partitioned input table, and return a single “table” that satisfies the Tables.columns interface; it does this by taking the columns of each partition and “appending” them together using the ChainedVector array type from the SentinelArrays.jl package.
  • TableOperations.makepartitions: takes an input Tables.jl-compatible table source, along with an integer number of rows, and returns an object that implements Tables.partitions, where each partition contains the specified number of rows.
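As a quick illustration of Tables.partitioner (the file names here are hypothetical, and each file is assumed to share the same schema), the partitioned source can be fed straight into a partition-aware sink like Arrow.write:

using Tables, CSV, Arrow

csv_files = ["data_2020_01.csv", "data_2020_02.csv", "data_2020_03.csv"]

# Lazily treat the files as one long, partitioned table: CSV.File is applied
# to each file name only when that partition is iterated.
parts = Tables.partitioner(CSV.File, csv_files)

# Arrow.write accepts any Tables.jl source; here each csv file becomes
# its own record batch in the resulting arrow file.
Arrow.write("combined.arrow", parts)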

Anyway, it’s exciting to already hear about people adopting these behaviors/functionality and enjoying the benefits of partitioning data seamlessly across packages. As always, feel free to ping me in the JuliaLang #data channel, or hit me up on twitter.

Data Structure Design in CSV.jl

By: Jacob Quinn

Re-posted from: https://quinnj.home.blog/2020/06/07/data-structure-design-in-csv-jl/

I’m currently working through a refactor of CSV.jl internals and once again am faced with some tricky questions surrounding which data structures to use. In csv reading, like many things, there are trade-offs between performance/efficiency and convenience: we want to read delimited files as fast as possible, but we also want the most natural, standard data structures. Some of the complexity in these decisions comes from the csv spec (or lack thereof!) itself, but to fully understand that, let’s talk through the primary task of csv reading.

The aim of CSV.jl is to take plain text data (columns separated by a delimiter, rows separated by newlines), parse the text values into natively typed values, and output the data in columnar data structures, like a native Julia Vector. Unfortunately, csv files do not contain any metadata indicating how many rows or columns there are, what data types the columns will be, or whether they will have null values. This lack of metadata means we have to use tricks like guessing the # of rows in a file, “detecting” data types, and staying flexible in case we detect one type and need to “promote” to a different data type later. The complexities are compounded when considering multi-threaded parsing, because you either have to play the synchronization dance between threads, or give each thread a chunk of the file and “merge” their results afterwards (the current CSV.jl approach).
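To give a flavor of what “detecting” and “promoting” types means in practice, here’s a simplified sketch (not CSV.jl’s actual implementation) of per-cell type detection and column-type promotion:

using Dates

# Try progressively "wider" types until one parses; an empty cell implies Missing.
function detectcell(str::AbstractString)
    isempty(str)                       && return Missing
    tryparse(Int, str)     !== nothing && return Int
    tryparse(Float64, str) !== nothing && return Float64
    tryparse(Date, str)    !== nothing && return Date
    return String
end

# Promote a column's running type as new cell types are detected.
promotecolumn(T::Type, S::Type) = Union{T, S} === Union{Int, Float64} ? Float64 : Union{T, S}

detectcell("3.14")              # Float64
promotecolumn(Int, Float64)     # Float64: an Int column promoted after seeing a float
promotecolumn(Int, Missing)     # Union{Int, Missing}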

Given these processing considerations, here are some of the data structure questions I’m currently pondering:

Vector{Union{T, Missing}} vs. SentinelVector{T}: Base Julia includes an optimization that allows Arrays with isbits Union elements to be stored inline, which means they can be extremely efficient when working with, for example, Union{Float64, Missing}. I’ve been prototyping a new SentinelArrays.jl package that allows wrapping any AbstractArray and specifying a special “sentinel” value of the array type that should return a different “special value”, like missing. For Float64, for example, we can use a non-standard NaN bit pattern as a sentinel for missing, and this works quite well: we’re basically working with a plain Vector{Float64} in terms of storage, but get the advantage of representing Union{Float64, Missing} through our sentinel.
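To illustrate the idea (this is just a conceptual sketch, not SentinelArrays.jl’s actual API), a sentinel vector can store plain Float64s and reserve one non-standard NaN bit pattern to stand in for missing:

# A NaN payload that normal computation will never produce.
const SENTINEL = 0x7ff8dead00000000

struct NaNSentinelVector <: AbstractVector{Union{Float64, Missing}}
    data::Vector{Float64}     # storage is just a plain Vector{Float64}
end

Base.size(v::NaNSentinelVector) = size(v.data)
Base.getindex(v::NaNSentinelVector, i::Int) =
    reinterpret(UInt64, v.data[i]) == SENTINEL ? missing : v.data[i]
Base.setindex!(v::NaNSentinelVector, ::Missing, i::Int) = (v.data[i] = reinterpret(Float64, SENTINEL); v)
Base.setindex!(v::NaNSentinelVector, x::Real, i::Int) = (v.data[i] = x; v)

# If no sentinels were ever stored, "truncating" to Vector{Float64} is just unwrapping.
unwrap(v::NaNSentinelVector) =
    any(x -> reinterpret(UInt64, x) == SENTINEL, v.data) ? v : v.data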

The pros of Vector{Union{T, Missing}} are that it’s Base Julia: built-in, standard, and much of the data ecosystem has become familiar with the type and integrated it well. The cons are that it takes up a little extra space (1 extra byte per element, signaling whether the element is a Float64 or a missing), and that there currently isn’t a way to “truncate” the array like convert(Vector{Float64}, A::Vector{Union{Float64, Missing}}) without copying data. This is desirable because while parsing, we need to assume Union{T, Missing} in case missing values are encountered, but we might finish parsing and know that there were in fact none, in which case we’d like to just return Vector{Float64}. With a SentinelVector, this is trivial because we can just “unwrap” the underlying array and return that, given no sentinels were used while parsing. The disadvantage of SentinelArrays is just that they’re non-standard, though the Julia ecosystem has evolved pretty well to rely on the general AbstractArray interface instead of needing specific array types.

The 2nd question is how to return/represent String columns. String columns are a bit of an odd duck with respect to parsing because you’re not really “parsing” anything, just noting the start/end positions of each cell in the csv file. And indeed, one representation of a String column is just that: a custom array type that holds on to the original file byte buffer, where each “element” is just the byte position and length of a cell. Upon indexing, the String can be fully materialized, but otherwise this lazy-materializing structure provides a lot of efficiency. The disadvantage of this approach is needing to hold on to the original file buffer, which has caused confusion for users in the past when they try to modify/delete the file after parsing and get errors saying the file is still in use. Another disadvantage is trying to support the full AbstractArray interface with this lazy structure, particularly with regards to mutating operations (push!, append!, etc.). The WeakRefStrings.jl package provides a structure that can mostly be used for this kind of task, but there are a few internal mismatches with how the positions/lengths are represented. The alternative of just materializing a full Vector{String} has the advantage of being extremely standard and easy to work with, but it’s expensive, because each string cell in the file must be copied/allocated, even if it never ends up getting used.
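Here’s a hedged sketch of that lazy representation (a toy version, not WeakRefStrings.jl’s or CSV.jl’s actual types): the column holds the original file bytes plus a (position, length) pair per cell, and only materializes a String when indexed:

struct LazyStringVector <: AbstractVector{String}
    buffer::Vector{UInt8}      # the original file's bytes
    positions::Vector{Int}     # start byte of each cell
    lengths::Vector{Int}       # byte length of each cell
end

Base.size(v::LazyStringVector) = (length(v.positions),)
Base.getindex(v::LazyStringVector, i::Int) =
    String(v.buffer[v.positions[i]:v.positions[i] + v.lengths[i] - 1])

buf = Vector{UInt8}("hey,ho\nfoo,bar\n")
col1 = LazyStringVector(buf, [1, 8], [3, 3])   # first csv column: "hey", "foo"
col1[2]   # only now is the String "foo" actually allocated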

The 3rd data structure question involves multi-threaded parsing: how best to return/represent columns when multiple threads are involved? As noted earlier, CSV.jl currently chunks up large files and lets each thread process a chunk separately: detecting types, guessing rows, the full parsing task. After each thread has finished, a “merge” operation is performed where columns are promoted together, recoded, and checked to ensure each has a consistent type. CSV.jl currently defines its own CSV.Column2 type (admittedly not the greatest name 😜) that “chains” each thread’s column together into a single “column”. This seems to work pretty well in practice, with iteration still being extremely efficient, but there’s a gotcha if you try to do

for i = 1:length(column)
    x = column[i]
    # do stuff with x
end

That is, linear getindex operations are slower (O(log N) slow) because each access has to determine which underlying chained array i belongs to.
The most obvious alternative solution is to just vcat all the thread columns together, but my worry there is we’re essentially doubling the required memory for parsing, if only for the brief operation of appending columns together.
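For concreteness, here’s a stripped-down sketch of the chained-column idea (not CSV.Column2 itself) showing where the O(log N) cost comes from:

# Each thread's result is one chunk; cumulative lengths enable a binary search per getindex.
struct ChainedColumn{T} <: AbstractVector{T}
    chunks::Vector{Vector{T}}
    offsets::Vector{Int}       # cumulative chunk lengths
end

ChainedColumn(chunks::Vector{Vector{T}}) where {T} =
    ChainedColumn{T}(chunks, cumsum(length.(chunks)))

Base.size(c::ChainedColumn) = (last(c.offsets),)
function Base.getindex(c::ChainedColumn, i::Int)
    chunk = searchsortedfirst(c.offsets, i)    # the O(log N) lookup on every linear index
    prev = chunk == 1 ? 0 : c.offsets[chunk - 1]
    return c.chunks[chunk][i - prev]
end

col = ChainedColumn([[1, 2, 3], [4, 5]])
col[4]     # binary search finds chunk 2, local index 1 => 4
# A real implementation specializes iteration to walk chunk by chunk,
# which is why `for x in col` stays fast while indexing `col[i]` in a loop does not.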

Anyway, these are some of the questions I’ve been sleeping on for the past few days; I don’t have firm commitments to choose one solution over the other, but with the other internal refactoring, I thought it would be good to think through the ideals and see if we can improve things somehow. My ulterior motive in writing this all up in a blog post is to hopefully elicit some discussion/ideas around ways we can accomplish some of the things I’ve discussed here. As always, feel free to ping me on Twitter or the #data julialang slack channel to chat.

Tables.jl 1.0 Release

By: Jacob Quinn

Re-posted from: https://quinnj.home.blog/2020/02/12/tables-jl-1-0-release/

Since its inception on a couple of whiteboards at JuliaCon 2018 in London, the Tables.jl package has grown to become a foundational set of interfaces for data packages in Julia, even accumulating 74 direct dependents in the General registry as of this writing (Feb 2020)!

So why the 1.0 release? An official 1.0 release can signal stability and maturity, particularly in the case of an “interface package” like Tables.jl. With very little change in API since the package started, we figured it was time to send that signal out to the broader community: hey! this package can be pretty useful and we’re not going to break it (intentionally)! It also gives us the opportunity to polish up a few APIs, iron out a few wrinkles, and clean up a lot of documentation.

For those who aren’t familiar, the Tables.jl package is about providing powerful interfaces for “table types” in all their shapes, sizes, and formats to talk with each other seamlessly, and most importantly, without direct knowledge of each other! At the highest level, it provides the Tables.rows and Tables.columns functions, which cover two of the most common access patterns for table data everywhere (row-by-row, or by entire columns). Table types become valid “sources” by implementing access to their data via Tables.rows or Tables.columns (or both!), and “sinks” can then operate on any source by calling Tables.rows or Tables.columns and processing the resulting data appropriately. A key feature for sinks is the “orientation agnostic” nature of Tables.rows and Tables.columns; i.e. sinks don’t need to worry whether the source is row- or column-oriented by nature, since performant, generic fallbacks are provided both ways. That is, calling Tables.rows on a source that only defined Tables.columns falls back to a generic lazy row iteration over the input columns; and vice versa, calling Tables.columns on a source that only defined Tables.rows will process the input rows to build up columns. This works because it’s work the sink would have had to do anyway: if the sink really needs columns to do its job, it would have to turn rows into columns itself, so Tables.jl provides that with the best, community-sourced implementation possible.
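As a small, hedged example of the sink side (the function here is hypothetical), a sink can call Tables.columns on any source, row- or column-oriented, and get performant columns either way:

using Tables

# Compute the sum of every numeric column of any Tables.jl source.
function columnsums(source)
    cols = Tables.columns(source)     # works even if the source only defines Tables.rows
    return Dict(nm => sum(skipmissing(Tables.getcolumn(cols, nm)))
                for nm in Tables.columnnames(cols)
                if eltype(Tables.getcolumn(cols, nm)) <: Union{Missing, Number})
end

# A NamedTuple of vectors is itself a valid Tables.jl table:
columnsums((a = [1, 2, 3], b = [4.0, 5.0, missing], c = ["x", "y", "z"]))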

Another core concept, now clarified in the 1.0 release, is the interface for accessing data on an individual row, as well as on a set of columns. It turns out that viewing a “row” as an ordered set of named column values, and “columns” as an ordered set of named columns, lends itself naturally to a common interface, simplified implementations, and the ability to provide powerful default functionality. Check out the new docs, particularly about Tables.AbstractRow and Tables.AbstractColumns, to learn more. You can also check out the Discourse announcement, which is geared a little more towards 1.0 release notes and an upgrade guide.
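For a feel of how small that interface is, here’s a minimal, hedged sketch of a custom row type (the names and fields are just illustrative): subtyping Tables.AbstractRow and defining getcolumn/columnnames is enough to pick up the property-access, indexing, and iteration defaults.

using Tables

struct MyRow <: Tables.AbstractRow
    names::Vector{Symbol}
    values::Vector{Any}
end

# getfield is used because AbstractRow overloads getproperty for column access.
Tables.columnnames(r::MyRow) = getfield(r, :names)
Tables.getcolumn(r::MyRow, i::Int) = getfield(r, :values)[i]
Tables.getcolumn(r::MyRow, nm::Symbol) =
    getfield(r, :values)[findfirst(==(nm), getfield(r, :names))]

row = MyRow([:a, :b], [1, "x"])
row.a, row[:b], row[1]    # property, name, and index access all come from the defaults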

A recent popular blog post highlighted features of Julia that lend themselves to such widespread composability between packages, and Tables.jl is a powerful example of this. Through just a few simple APIs, it allows a DataFrame to be serialized to disk in the binary feather format, read back, converted to JSON to be sent over the web, and loaded into a mysql database. All without any of those packages knowing about each other, and without needing a single intermediary table type to convert between. Applications and processing tasks can feel free to leverage any number of in-memory representations, disk formats, and databases to handle table manipulations.
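A hedged sketch of part of that round trip (the package choices here are just one possibility, and the database step is omitted):

using DataFrames, Arrow, JSONTables

df = DataFrame(a = 1:3, b = ["x", "y", "z"])

Arrow.write("table.arrow", df)               # serialize to the binary arrow/feather-v2 format
tbl = Arrow.Table("table.arrow")             # read back: a valid Tables.jl source

json = JSONTables.arraytable(tbl)            # convert to a JSON string to send over the web
df2 = DataFrame(JSONTables.jsontable(json))  # and back into a DataFrame

# None of these packages know about each other; they only speak Tables.rows/Tables.columns.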

Here’s to more integrations, more composability, and more table types in the future!