Tag Archives: Statistics

Data Wrangling in Julia based on dplyr Flights Tutorials

By: Clinton Brownley

Re-posted from: https://cbrownley.wordpress.com/2017/11/29/data-wrangling-in-julia-based-on-dplyr-flights-tutorials/

A couple of my favorite tutorials for wrangling data in R with dplyr are Hadley Wickham’s dplyr package vignette and Kevin Markham’s dplyr tutorial. I enjoy the tutorials because they concisely illustrate how to use a small set of verb-based functions to carry out common data wrangling tasks.

I tend to use Python to wrangle data, but I’m exploring the Julia programming language so I thought creating a similar dplyr-based tutorial in Julia would be a fun way to examine Julia’s capabilities and syntax. Julia has several packages that make it easier to deal with tabular data, including DataFrames and DataFramesMeta.

The DataFrames package provides functions for reading and writing, split-apply-combining, reshaping, joining, sorting, querying, and grouping tabular data. The DataFramesMeta package provides a set of macros that are similar to dplyr’s verb-based functions in that they offer a more convenient, readable syntax for munging data and chaining together multiple operations.

Data

For this tutorial, let’s follow along with Kevin’s tutorial and use the hflights dataset. You can obtain the dataset from R with the following commands or simply download it here: hflights.csv

install.packages("hflights")
library(hflights)
write.csv(hflights, "hflights.csv")

Load packages and example dataset

To begin, let’s start the Julia REPL, load the DataFrames and DataFramesMeta packages, and load and inspect the hflights dataset:

using DataFrames
using DataFramesMeta

hflights = readtable("/Users/clinton/Documents/Julia/hflights.csv");
size(hflights)
names(hflights)
head(hflights)
describe(hflights)

The semicolon at the end of the readtable command prevents it from printing the dataset to the screen. The size command returns the number of rows and columns in the dataset. You can request just the number of rows with size(hflights, 1) or just the number of columns with size(hflights, 2). This dataset contains 227,496 rows and 21 columns. The names command lists the column headings. By default, the head command prints the header row and six data rows. You can specify the number of data rows to display by adding a second argument, e.g. head(hflights, 10). The describe command prints summary statistics for each column.

@where: Keep rows matching criteria

AND: All of the conditions must be true for the returned rows

# Julia DataFrames approach to view all flights on January 1
hflights[.&(hflights[:Month] .== 1, hflights[:DayofMonth] .== 1), :]

# DataFramesMeta approach
@where(hflights, :Month .== 1, :DayofMonth .== 1)

Julia’s DataFrames’ row filtering syntax is similar to R’s syntax. To specify multiple AND conditions, use “.&()” and place the filtering conditions, separated by commas, between the parentheses. Like dplyr’s filter function, DataFramesMeta’s @where macro simplifies the syntax and makes the command easier to read.

OR: One of the conditions must be true for the returned rows

# Julia DataFrames approach to view all flights where either AA or UA is the carrier
hflights[.|(hflights[:UniqueCarrier] .== "AA", hflights[:UniqueCarrier] .== "UA"), :]

# DataFramesMeta approach
@where(hflights, .|(:UniqueCarrier .== "AA", :UniqueCarrier .== "UA"))

To specify multiple OR conditions, use “.|()” and place the filtering conditions between the parentheses. Again, the DataFramesMeta approach is more concise.

SET: The values in a column are in a set of interest

# Julia DataFrames approach to view all flights where the carrier is in Set(["AA", "UA"])
carriers_set = Set(["AA", "UA"])
hflights[findin(hflights[:UniqueCarrier], carriers_set), :]

# DataFramesMeta approach
@where(hflights, findin(:UniqueCarrier, carriers_set))

To filter for rows where the values in a particular column are in a specific set of interest, create a Set with the values you’re interested in and then specify the column and your set of interest in the findin function.

PATTERN / REGULAR EXPRESSION: The values in a column match a pattern

# Julia DataFrames approach to view all flights where the carrier matches the regular expression r"AA|UA"
carriers_pattern = r"AA|UA"
hflights[[ismatch(carriers_pattern, String(carrier)) for carrier in hflights[:UniqueCarrier]], :]

# DataFramesMeta approach
@where(hflights, [ismatch(carriers_pattern, String(carrier)) for carrier in :UniqueCarrier])

To filter for rows where the values in a particular column match a pattern, create a regular expression and then use it in the ismatch function in an array comprehension.

@select: Pick columns by name

# Julia DataFrames approach to selecting columns
hflights[:, [:DepTime, :ArrTime, :FlightNum]]

# DataFramesMeta approach
@select(hflights, :DepTime, :ArrTime, :FlightNum)

Julia’s DataFrames’ syntax for selecting columns is similar to R’s syntax. Like dplyr’s select function, DataFramesMeta’s @select macro simplifies the syntax and makes the command easier to read.

# Julia DataFrames approach to selecting columns
# first three columns
hflights[:, 1:3]
# pattern / regular expression
heading_pattern = r"Taxi|Delay"
hflights[:, [ismatch(heading_pattern, String(name)) for name in names(hflights)]]
# startswith
hflights[:, filter(name -> startswith(String(name), "Arr"), names(hflights))]
# endswith
hflights[:, filter(name -> endswith(String(name), "Delay"), names(hflights))]
# contains
hflights[:, filter(name -> contains(String(name), "Month"), names(hflights))]

# AND conditions
hflights[:, filter(name -> startswith(String(name), "Arr") && endswith(String(name), "Delay"), names(hflights))]
# OR conditions
hflights[:, filter(name -> startswith(String(name), "Arr") || contains(String(name), "Cancel"), names(hflights))]

# DataFramesMeta approach
# first three columns
@select(hflights, 1:3)
# pattern / regular expression
heading_pattern = r"Taxi|Delay"
@select(hflights, [ismatch(heading_pattern, String(name)) for name in names(hflights)])
# startswith
@select(hflights, filter(name -> startswith(String(name), "Arr"), names(hflights)))
# endswith
@select(hflights, filter(name -> endswith(String(name), "Delay"), names(hflights)))
# contains
@select(hflights, filter(name -> contains(String(name), "Month"), names(hflights)))

# AND conditions
@select(hflights, filter(name -> startswith(String(name), "Arr") && endswith(String(name), "Delay"), names(hflights)))
# OR conditions
@select(hflights, filter(name -> startswith(String(name), "Arr") || contains(String(name), "Cancel"), names(hflights)))

# Kevin Markham's multiple select conditions example
# select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
# Julia Version of Kevin's Example
# Taxi or Delay in column heading
mask = [ismatch(r"Taxi|Delay", String(name)) for name in names(hflights)]
# Also include first three columns, i.e. Year, Month, DayofMonth
mask[1:3] = true
@select(hflights, mask)

These examples show that you can select columns by position and by name, and that you can combine multiple conditions with AND ("&&") or OR ("||"). Similar to filtering rows, you can select specific columns based on a pattern by using the ismatch function in an array comprehension. You can also use contains, startswith, and endswith in the filter function to select columns that contain, start with, or end with a specific text pattern.

“Chaining” or “Pipelining”

In R, dplyr provides, via the magrittr package, the %>% operator, which enables you to chain together multiple commands into a single data transformation pipeline in a very readable fashion. In Julia, the DataFramesMeta package provides the @linq macro and |> symbol to enable similar functionality. Alternatively, you can load the Lazy package and use an @> begin end block to chain together multiple commands.

# Chaining commands with DataFrameMeta’s @linq macro
@linq hflights[find(.!isna.(hflights[:,:DepDelay])), :] |>
@where(:DepDelay .> 60) |>
@select(:UniqueCarrier, :DepDelay)

# Chaining commands with Lazy’s @> begin end block
using Lazy
@> begin
hflights[find(.!isna.(hflights[:,:DepDelay])), :]
@where(:DepDelay .> 60)
@select(:UniqueCarrier, :DepDelay)
end

These two blocks of code produce the same result: a DataFrame containing carrier names and departure delays for which the departure delay is greater than 60. In each chain, the first expression is the input DataFrame, e.g. hflights. In these examples, I use find and .!isna. to start with a DataFrame that doesn’t contain NA values in the DepDelay column, because the commands fail when NAs are present. I prefer the @linq macro version over the @> begin end version because it’s so similar to the dplyr-magrittr syntax, but both versions are more succinct and readable than their non-chained counterparts. You can also assign the pipeline results to variables, as shown below.
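
For example, here is a minimal sketch of capturing the pipeline output in a variable (the delayed_flights name is just illustrative):

# Assign the chained result to a variable for later use
delayed_flights = @linq hflights[find(.!isna.(hflights[:,:DepDelay])), :] |>
@where(:DepDelay .> 60) |>
@select(:UniqueCarrier, :DepDelay)
size(delayed_flights)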

@orderby: Reorder rows

Both DataFrames and DataFramesMeta provide functions for sorting rows in a DataFrame by values in one or more columns. In the first pair of examples, we want to select the UniqueCarrier and DepDelay columns and then sort the results by the values in the DepDelay column in descending order. The last example shows how to sort by multiple columns with the @orderby macro.

# Julia DataFrames approach to sorting
sort(hflights[find(.!isna.(hflights[:,:DepDelay])), [:UniqueCarrier, :DepDelay]], cols=[order(:DepDelay, rev=true)])

# DataFramesMeta approach (add a minus sign before the column symbol for descending)
@linq hflights[find(.!isna.(hflights[:,:DepDelay])), :] |>
@select(:UniqueCarrier, :DepDelay) |>
@orderby(-:DepDelay)

# Sort hflights dataset by Month, descending, and then by DepDelay, ascending
@linq hflights |>
@orderby(-:Month, :DepDelay)

DataFrames provides the sort and sort! functions for ordering rows in a DataFrame; sort! reorders the rows in place. The DataFrames user guide provides additional examples of ordering rows, in ascending and descending order, based on multiple columns, as well as applying functions to columns, e.g. uppercase, before using the column for sorting.
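
Here is a brief sketch of those variations, assuming the order helper accepts the rev and by keyword arguments described in the user guide:

# Sort in place, descending by departure delay (sort! modifies its argument)
hflights_nona = hflights[find(.!isna.(hflights[:,:DepDelay])), :]
sort!(hflights_nona, cols=[order(:DepDelay, rev=true)])
# Apply a function to a column, e.g. lowercase, before using it for sorting
sort(hflights_nona, cols=[order(:UniqueCarrier, by=lowercase), :DepDelay])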

DataFramesMeta provides the @orderby macro for ordering rows in a DataFrame. Specify multiple column names in the @orderby macro to sort the rows by multiple columns. Use a minus sign before a column name to sort in descending order.

@transform: Add new variables

Creating new variables in Julia DataFrames is similar to creating new variables in Python and R. You specify a new column name in square brackets after the name of the DataFrame and assign it a collection of values, sometimes based on values in other columns. DataFramesMeta’s @transform macro simplifies the syntax and makes the transformation more readable.

# Julia DataFrames approach to creating new variable
hflights[:Speed] = hflights[:Distance] ./ hflights[:AirTime] .* 60
hflights[:, [:Distance, :AirTime, :Speed]]

# Delete the variable so we can recreate it with DataFramesMeta approach
delete!(hflights, :Speed)

# DataFramesMeta approach
@linq hflights |>
@select(:Distance, :AirTime) |>
@transform(Speed = :Distance ./ :AirTime .* 60) |>
@select(:Distance, :AirTime, :Speed)

# Save the new column in the original DataFrame
hflights = @linq hflights |>
@transform(Speed = :Distance ./ :AirTime .* 60)

The first code block illustrates how to create a new column in a DataFrame and assign it values based on values in other columns. The second code block shows you can use delete! to delete a column. The third example demonstrates the DataFramesMeta approach to creating a new column using the @transform macro. The last example shows how to save a new column in an existing DataFrame using the @transform macro by assigning the result of the transformation to the existing DataFrame.

@by: Reduce variables to values (Grouping and Summarizing)

dplyr provides group_by and summarise functions for grouping and summarising data. DataFrames and DataFramesMeta also support the split-apply-combine strategy with the by function and the @by macro, respectively. Here are Julia versions of Kevin’s summarise examples.

# Julia DataFrames approach to grouping and summarizing
by(hflights[complete_cases(hflights[[Symbol(name) for name in names(hflights)]]), :],
:Dest,
df -> DataFrame(meanArrDelay = mean(df[:ArrDelay])))

# DataFramesMeta approach
@linq hflights[complete_cases(hflights[[Symbol(name) for name in names(hflights)]]), :] |>
@by(:Dest, meanArrDelay = mean(:ArrDelay))

DataFrames and DataFramesMeta don’t have dplyr’s summarise_each function, but it’s easy to apply different functions to multiple columns inside the @by macro.

@linq hflights |>
@by(:UniqueCarrier,
meanCancelled = mean(:Cancelled), meanDiverted = mean(:Diverted))

@linq hflights[complete_cases(hflights[[Symbol(name) for name in names(hflights)]]), :] |>
@by(:UniqueCarrier,
minArrDelay = minimum(:ArrDelay), maxArrDelay = maximum(:ArrDelay),
minDepDelay = minimum(:DepDelay), maxDepDelay = maximum(:DepDelay))

DataFrames and DataFramesMeta also don’t have dplyr’s n and n_distinct functions, but you can count the number of rows in a group with size(df, 1) or nrow(df), and you can count the number of distinct values in a group with countmap.

# Group by Month and DayofMonth, count the number of flights, and sort descending
# Count the number of rows with size(df, 1)
sort(by(hflights, [:Month,:DayofMonth], df -> DataFrame(flight_count = size(df, 1))), cols=[order(:flight_count, rev=true)])

# Group by Month and DayofMonth, count the number of flights, and sort descending
# Count the number of rows with nrow(df)
sort(by(hflights, [:Month,:DayofMonth], df -> DataFrame(flight_count = nrow(df))), cols=[order(:flight_count, rev=true)])

# Split grouping and sorting into two separate operations
g = by(hflights, [:Month,:DayofMonth], df -> DataFrame(flight_count = nrow(df)))
sort(g, cols=[order(:flight_count, rev=true)])

# For each destination, count the total number of flights and the number of distinct planes
by(hflights[find(.!isna.(hflights[:,:TailNum])),:], :Dest) do df
DataFrame(flight_count = size(df,1), plane_count = length(keys(countmap(df[:,:TailNum]))))
end

While these examples reproduce the results in Kevin’s dplyr tutorial, they’re definitely not as succinct and readable as the dplyr versions. Grouping by multiple columns, summarizing with counts and distinct counts, and gracefully chaining these operations are areas where DataFrames and DataFramesMeta can improve.

Other useful convenience functions

Randomly sampling a fixed number or fraction of rows from a DataFrame can be a helpful operation. dplyr offers the sample_n and sample_frac functions to perform these operations. In Julia, StatsBase provides the sample function, which you can repurpose to achieve similar results.


using StatsBase
# randomly sample a fixed number of rows
hflights[sample(1:nrow(hflights), 5), :]
hflights[sample(1:size(hflights,1), 5), :]

# randomly sample a fraction of rows
hflights[sample(1:nrow(hflights), ceil(Int,0.0001*nrow(hflights))), :]
hflights[sample(1:size(hflights,1), ceil(Int,0.0001*size(hflights,1))), :]

Randomly sampling a fixed number of rows is fairly straightforward: use the sample function to randomly select a fixed number of rows, in this case five, from the DataFrame. Randomly sampling a fraction of rows is slightly more complicated: since the sample function takes an integer for the number of rows to return, you need to use the ceil function to convert the fraction of rows, in this case 0.0001*nrow(hflights), into an integer.

Conclusion

In R, dplyr sets a high bar for wrangling data well with succinct, readable code. In Julia, DataFrames and DataFramesMeta provide many useful functions and macros that produce similar results; however, some of the syntax isn’t as concise and clear as it is with dplyr, e.g. selecting columns in different ways and chaining together grouping and summarizing operations. These are areas where Julia’s packages can improve.

I enjoyed becoming more familiar with Julia by reproducing much of Kevin’s dplyr tutorial. It was also informative to see differences in functionality and readability between dplyr and Julia’s packages. I hope you enjoyed this tutorial and find it to be a useful reference for wrangling data in Julia.

#MonthOfJulia Day 26: Statistics

JuliaStats is a meta-project which consolidates various packages related to statistics and machine learning in Julia. Well worth taking a look if you plan on working in this domain.

julia> x = rand(10);
julia> mean(x)
0.5287191472784906
julia> std(x)
0.2885446536178459

Julia already has some builtin support for statistical operations, so additional packages are not strictly necessary. However, they do increase the scope and ease of possible operations, as we’ll see below. Let’s kick off by loading all the packages that we’ll be looking at today.

julia> using StatsBase, StatsFuns, StreamStats

StatsBase

The documentation for StatsBase can be found here. As the package name implies, it provides support for basic statistical operations in Julia.

High level summary statistics are generated by summarystats().

julia> summarystats(x)
Summary Stats:
Mean:         0.528719
Minimum:      0.064803
1st Quartile: 0.317819
Median:       0.529662
3rd Quartile: 0.649787
Maximum:      0.974760

Weighted versions of the mean, variance and standard deviation are implemented. There’re also geometric and harmonic means.

julia> w = WeightVec(rand(1:10, 10));      # A weight vector.
julia> mean(x, w)                          # Weighted mean.
0.48819933297961043
julia> var(x, w)                           # Weighted variance.
0.08303843715334995
julia> std(x, w)                           # Weighted standard deviation.
0.2881639067498738
julia> skewness(x, w)
0.11688162715805048
julia> kurtosis(x, w)
-0.9210456851144664
julia> mean_and_std(x, w)
(0.48819933297961043,0.2881639067498738)

There’s a weighted median as well as functions for calculating quantiles.

julia> median(x)                           # Median.
0.5296622773635412
julia> median(x, w)                        # Weighted median.
0.5729104703595038
julia> quantile(x)
5-element Array{Float64,1}:
 0.0648032
 0.317819 
 0.529662 
 0.649787 
 0.97476  
julia> nquantile(x, 8)
9-element Array{Float64,1}:
 0.0648032
 0.256172 
 0.317819 
 0.465001 
 0.529662 
 0.60472  
 0.649787 
 0.893513 
 0.97476  
julia> iqr(x)                              # Inter-quartile range.
0.3319677541313941

Sampling from a population is also catered for, with a range of algorithms which can be applied to the sampling procedure.

julia> sample(['a':'z'], 5)                # Sampling (with replacement).
5-element Array{Char,1}:
 'w'
 'x'
 'e'
 'e'
 'o'
julia> wsample(['T', 'F'], [5, 1], 10)     # Weighted sampling (with replacement).
10-element Array{Char,1}:
 'F'
 'T'
 'T'
 'T'
 'F'
 'T'
 'T'
 'T'
 'T'
 'T'

There’s also functionality for empirical estimation of distributions from histograms and a range of other interesting and useful goodies.
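
For example, assuming the fit(Histogram, …) and ecdf helpers (both exported by StatsBase), a quick sketch looks like this:

julia> h = fit(Histogram, x);              # histogram of x with automatically chosen bin edges
julia> e = ecdf(x);                        # empirical CDF, returned as a callable
julia> e(0.5);                             # proportion of values in x that are ≤ 0.5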

StatsFuns

The StatsFuns package provides constants and functions for statistical computing. The constants are by no means essential but certainly very handy. Take, for example, twoπ and sqrt2.
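
A quick REPL check of those two constants (the values are simply 2π and √2 as Float64s):

julia> twoπ
6.283185307179586
julia> sqrt2
1.4142135623730951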

There are some mildly exotic mathematical functions available like logistic, logit and softmax.

julia> logistic(-5)
0.0066928509242848554
julia> logistic(5)
0.9933071490757153
julia> logit(0.25)
-1.0986122886681098
julia> logit(0.75)
1.0986122886681096
julia> softmax([1, 3, 2, 5, 3])
5-element Array{Float64,1}:
 0.0136809
 0.101089 
 0.0371886
 0.746952 
 0.101089 

Finally there is a suite of functions relating to various statistical distributions. The functions for the Normal distribution are illustrated below, but there are also functions for the Beta, Binomial, Gamma, Hypergeometric and many other distributions. The function naming convention is consistent across all distributions.

julia> normpdf(0);                         # PDF
julia> normlogpdf(0);                      # log PDF
julia> normcdf(0);                         # CDF
julia> normccdf(0);                        # Complementary CDF
julia> normlogcdf(0);                      # log CDF
julia> normlogccdf(0);                     # log Complementary CDF
julia> norminvcdf(0.5);                    # inverse-CDF
julia> norminvccdf(0.99);                  # inverse-Complementary CDF
julia> norminvlogcdf(-0.693147180559945);  # inverse-log CDF
julia> norminvlogccdf(-0.693147180559945); # inverse-log Complementary CDF
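
Assuming the beta* family follows the same naming pattern (I’ve only verified the norm* functions above), the Beta equivalents would look like this:

julia> betapdf(2.0, 5.0, 0.3);             # Beta(2, 5) PDF at 0.3
julia> betacdf(2.0, 5.0, 0.3);             # Beta(2, 5) CDF at 0.3
julia> betainvcdf(2.0, 5.0, 0.5);          # Beta(2, 5) inverse-CDF (median)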

StreamStats

Finally, the StreamStats package supports calculating online statistics for a stream of data which is being continuously updated.

julia> average = StreamStats.Mean()
Online Mean
 * Mean: 0.000000
 * N:    0
julia> variance = StreamStats.Var()
Online Variance
 * Variance: NaN
 * N:        0
julia> for x in rand(10)
       	update!(average, x)
       	update!(variance, x)
       	@printf("x = %3.f: mean = %.3f | variance = %.3fn", x, state(average),
                                                                state(variance))
       end
x = 0.928564: mean = 0.929 | variance = NaN
x = 0.087779: mean = 0.508 | variance = 0.353
x = 0.253300: mean = 0.423 | variance = 0.198
x = 0.778306: mean = 0.512 | variance = 0.164
x = 0.566764: mean = 0.523 | variance = 0.123
x = 0.812629: mean = 0.571 | variance = 0.113
x = 0.760074: mean = 0.598 | variance = 0.099
x = 0.328495: mean = 0.564 | variance = 0.094
x = 0.303542: mean = 0.535 | variance = 0.090
x = 0.492716: mean = 0.531 | variance = 0.080

In addition to the mean and variance illustrated above, the package also supports online versions of min() and max(), and can be used to generate incremental confidence intervals for Bernoulli and Poisson processes.

That’s it for today. Check out the full code on github.

The post #MonthOfJulia Day 26: Statistics appeared first on Exegetic Analytics.

What’s Wrong with Statistics in Julia?

By: John Myles White

Re-posted from: http://www.johnmyleswhite.com/notebook/2014/11/29/whats-wrong-with-statistics-in-julia/

Introduction

Several months ago, I promised to write an updated version of my old post, “The State of Statistics in Julia”, that would describe how Julia’s support for statistical computing has evolved since December 2012.

I’ve kept putting off writing that post for several reasons, but the most important reason is that all of my attention for the last few months has been focused on what’s wrong with how Julia handles statistical computing. As such, the post I’ve decided to write isn’t a review of what’s already been done in Julia, but a summary of what’s being done right now to improve Julia’s support for statistical computing.

In particular, this post focuses on several big changes to the core data structures that are used in Julia to represent statistical data. These changes should all ship when Julia 0.4 is released.

What’s Wrong with Statistics in Julia Today?

The primary problem with statistical computing in Julia is that the current tools were all designed to emulate R. Unfortunately, R’s approach to statistical computing isn’t amenable to the kinds of static analysis techniques that Julia uses to produce efficient machine code.

In particular, the following differences between R and Julia have repeatedly created problems for developers:

  • In Julia, computations involving scalars are at least as important as computations involving vectors. In particular, iterative computations are first-class citizens in Julia. This implies that statistical libraries must allow developers to write efficient code that iterates over the elements of a vector in pure Julia. Because Julia’s compiler can only produce efficient machine code for computations that are type-stable, the representations of missing values, categorical values and ordinal values in Julia programs must all be type-stable. Whether a value is missing or not, its type must remain the same.
  • In Julia, almost all end-users will end up creating their own types. As such, any tools for statistical computing must be generic enough that they can be extended to arbitrary types with little to no effort. In contrast to R, which can heavily optimize its algorithms for a very small number of primitive types, Julia developers must ensure that their libraries are both highly performant and highly abstract.
  • Julia, like most mainstream languages, eagerly evaluates the arguments passed to functions. This implies that idioms from R which depend upon non-standard evaluation are not appropriate for Julia, although it is possible to emulate some forms of non-standard evaluation using macros. In addition, Julia doesn’t allow programmers to reify scope. This implies that idioms from R that require access to the caller’s scope are not appropriate for Julia.

The most important way in which these issues came up in the first generation of statistical libraries was in the representation of a single scalar missing value. In Julia 0.3, this concept is represented by the value NA, but that representation will be replaced when 0.4 is released. Most of this post will focus on the problems created by NA.

In addition to problems involving NA, there were also problems with how expressions were being passed to some functions. These problems have been resolved by removing the function signatures for statistical functions that involved passing expressions as arguments to those functions. A prototype package called DataFramesMeta, which uses macros to emulate some kinds of non-standard evaluation, is being developed by Tom Short.

Representing Missing Values

In Julia 0.3, missing values are represented by a singleton object, NA, of type NAtype. Thus, a variable x, which might be either a Float64 value or a missing value encoded as NA, will end up with type Union(Float64, NAtype). This Union type is a source of performance problems because it defeats the Julia compiler’s attempts to assign a unique concrete type to every variable.
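
As a minimal sketch of the problem (using the Julia 0.3-era DataArrays API; the typeof calls are the point, not the particular values):

using DataArrays
v = @data([1.0, NA, 3.0])
typeof(v[1])    # Float64
typeof(v[2])    # NAtype
# Any loop over v must be prepared to handle elements of type Union(Float64, NAtype),
# so the compiler cannot specialize the loop body on a single concrete type.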

We could remove this type-instability by ensuring that every type has a specific value, such as NaN, that signals missingness. This is the approach that both R and pandas take. It offers acceptable performance, but does so at the expense of generic handling of non-primitive types. Given Julia’s rampant usage of custom types, the sentinel values approach is not viable.

As such, we’re going to represent missing values in Julia 0.4 by borrowing some ideas from functional languages. In particular, we’ll be replacing the singleton object NA with a new parametric type Nullable{T}. Unlike NA, a Nullable object isn’t a direct scalar value. Rather, a Nullable object is a specialized container type that either contains one value or zero values. An empty Nullable container is taken to represent a missing value.

The Nullable approach to representing a missing scalar value offers two distinct improvements:

  • Nullable{T} provides radically better performance than Union(T, NA). In some benchmarks, I find that iterative constructs can be as much as 100x faster when using Nullable{Float64} instead of Union(Float64, NA). Alternatively, I’ve found that Nullable{Float64} is about 60% slower than using NaN to represent missing values, but involves a generic approach that trivially extends to arbitrary new types, including integers, dates, complex numbers, quaternions, etc…
  • Nullable{T} provides more type safety by requiring that all attempts to interact with potentially missing values explicitly indicate how missing values should be treated.

In a future blog post, I’ll describe how Nullable works in greater detail.
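
In the meantime, here is a minimal sketch of the basic Nullable API as it stands in Julia 0.4:

x = Nullable(1.0)          # a Nullable{Float64} holding one value
y = Nullable{Float64}()    # an empty Nullable{Float64}, i.e. a missing value
isnull(x)                  # false
isnull(y)                  # true
get(x)                     # 1.0
get(y, 0.0)                # 0.0 -- falls back to the supplied default because y is empty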

Categorical Values

In addition to revising the representation of missing values, I’ve also been working on revising our representation of categorical values. Working with categorical data in Julia has always been a little strange, because the main tool for representing categorical data, the PooledDataArray, has always occupied an awkward intermediate position between two incompatible objectives:

  • A container that keeps track of the unique values present in the container and uses this information to efficiently represent values as pointers to a pool of unique values.
  • A container that contains values of a categorical variable drawn from a well-defined universe of possible values. The universe can include values that are not currently present in the container.

These two goals come into severe tension when considering subsets of a PooledDataArray. The uniqueness constraint suggests that the pool should shrink, whereas the categorical variable definition suggests that the pool should be maintained without change. In Julia 0.4, we’re going to commit completely to the latter behavior and leave the problem of efficiently representing highly compressible data for another data structure.

We’ll also begin representing scalar values of categorical variables using custom types. The new CategoricalVariable and OrdinalVariable types that will ship with Julia 0.4 will further the efforts to put scalar computations on an equal footing with vector computations. This will be particularly notable for dealing with ordinal variables, which are not supported at all in Julia 0.3.

Metaprogramming

Many R functions employ non-standard evaluation as a mechanism for augmenting the current scope with the column names of a data.frame. In Julia, it’s often possible to emulate this behavior using macros. The in-progress DataFramesMeta package explores this alternative to non-standard evaluation. We will also be exploring other alternatives to non-standard evaluation in the future.
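
As a rough sketch of that approach (the DataFrame and column names here are purely illustrative), DataFramesMeta’s @with macro makes a DataFrame’s columns available by name inside the macro body:

using DataFrames, DataFramesMeta
df = DataFrame(a = 1:3, b = [2.0, 4.0, 6.0])
@with(df, :b ./ :a)    # :a and :b refer to df's columns inside the macro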

What’s Next

In the long-term future, I’m hoping to improve several other parts of Julia’s core statistical infrastructure. In particular, I’d like to replace DataFrames with a new type that no longer occupies a strange intermediate position between matrices and relational tables. I’ll write another post about those issues later.