Using Julia for Data Science (Part 02)

By: Posts on Cleyton Farias

Re-posted from: https://cleytonfar.github.io/posts/using-julia-for-data-science-part-02/

In the previous post,
we talked about what is Julia, how to install it and we have learned how to work
rightaway with tabular data. Now we are going to take one more step and learn new
tricks with the DataFramesMeta package.

The first thing that we need to do is to install the package.

using Pkg
Pkg.add("DataFramesMeta")

Once it is installed, let’s get started!

Important: In this post I am using the current stable release (v1.1.0). For
those who are using the long-term support release (v1.0.3), all the code will run just
fine.

Introduction to DataFramesMeta package

The DataFramesMeta is a package that provides a collection of metaprogramming
tools for DataFrames. But you may be wondering why you should worry about
metaprogramming? This package
offers some macros that can be used to improve performance and provide more
convenient syntax. But again, you might be asking: how is that useful?

Ok ok ok. So, let’s consider the example used in the previous post where we
had some random dataset and we wanted the rows which x1 and x2 are
both greater than or equal to their average:

using DataFrames
using Statistics
## Creating a random dataset with 10 rows and 5 columns:
foo = DataFrame(rand(10, 5));
## Creating the conditions:
cond1 = foo.x1 .>= mean(foo.x1);
cond2 = foo.x2 .>= mean(foo.x2);
## Subsetting:
foo[.&(cond1, cond2), :]
## 1×5 DataFrame
## │ Row │ x1      │ x2       │ x3       │ x4       │ x5       │
## │     │ Float64 │ Float64  │ Float64  │ Float64  │ Float64  │
## ├─────┼─────────┼──────────┼──────────┼──────────┼──────────┤
## │ 1   │ 0.96575 │ 0.971574 │ 0.228869 │ 0.813457 │ 0.807968 │

That is it!! But what if I told there is a way we could get the same result
typing much less code?

using DataFramesMeta
@where(foo, :x1 .>= mean(:x1), :x2 .>= mean(:x2))
## 1×5 DataFrame
## │ Row │ x1      │ x2       │ x3       │ x4       │ x5       │
## │     │ Float64 │ Float64  │ Float64  │ Float64  │ Float64  │
## ├─────┼─────────┼──────────┼──────────┼──────────┼──────────┤
## │ 1   │ 0.96575 │ 0.971574 │ 0.228869 │ 0.813457 │ 0.807968 │

Did you see that? Using the macro @where we achieved the same result as before
with just one line of code. That’s what DataFramesMeta package is all about:
a collection of “functions” begining with @ that simplifies some tasks when
working with DataFrames.

Main Features of DataFramesMeta:

In this post, we are going to explore what I considered to be the main tools of
the DataFramesMeta package. For more detail, please refer to the official documentation.

@with macro

@with is a macro expression that can be used with DataFrames that allows reference
columns as symbols in expressions. For those who are familiar with R language,
it works similarly to the with() function.

df = DataFrame(x = 1:3, y = [2, 1, 2])
## 3×2 DataFrame
## │ Row │ x     │ y     │
## │     │ Int64 │ Int64 │
## ├─────┼───────┼───────┤
## │ 1   │ 1     │ 2     │
## │ 2   │ 2     │ 1     │
## │ 3   │ 3     │ 2     │
x = [2, 1, 0];
## Taking the columm "y" from the df DataFrame and adding 1:
@with(df, :y .+ 1)
## 3-element Array{Int64,1}:
##  3
##  2
##  3
## Taking the column "x" from df and add it to the x variable:
@with(df, :x + x)
## 3-element Array{Int64,1}:
##  3
##  3
##  3

Also, we can reference the column by an expression and wrapped in cols() function:

colref = :x;
@with(df, cols(colref) .+ 1)
## 3-element Array{Int64,1}:
##  2
##  3
##  4

The use of cols() is very useful when we want to perform some task over a list
of column variables in a for or while loop.

If an expression is wrapped in ^(expr), then expr gets passed through untouched:

@with(df, df[:x .> 1, ^(:y)])
## 2-element Array{Int64,1}:
##  1
##  2

Later, I’ll show how to perform this same task by piping some macros using the
pipe |> symbol.

@where macro

@where is used when we want to get subsets of DataFrames according to
some criteria. It is similar to the filter() function from dplyr package
in R:

@where(df, :x .> 1)
## 2×2 DataFrame
## │ Row │ x     │ y     │
## │     │ Int64 │ Int64 │
## ├─────┼───────┼───────┤
## │ 1   │ 2     │ 1     │
## │ 2   │ 3     │ 2     │
@where(df, :x .> x, :y .== 1)
## 1×2 DataFrame
## │ Row │ x     │ y     │
## │     │ Int64 │ Int64 │
## ├─────┼───────┼───────┤
## │ 1   │ 2     │ 1     │

Notice that if there is more than one condition inside @where, the condition are
performed as condition1 AND condition2. But if you want an OR condition to
be performed, we can use the |() syntax:

@where(df, .|(:x .> x, :y .== 1))
## 2×2 DataFrame
## │ Row │ x     │ y     │
## │     │ Int64 │ Int64 │
## ├─────┼───────┼───────┤
## │ 1   │ 2     │ 1     │
## │ 2   │ 3     │ 2     │

@select macro

@select macro can be used to perform column selection in DataFrames. Again,
if you are familiar with the dplyr package in R, it works similarly to the
select() function.

## Select column x and return as a DataFrame:
@select(df, :x)
## 3×1 DataFrame
## │ Row │ x     │
## │     │ Int64 │
## ├─────┼───────┤
## │ 1   │ 1     │
## │ 2   │ 2     │
## │ 3   │ 3     │

Moreover, we can also mutate variables using @select. For instance, suppose that
we want the columns x, y, and a new column representing x times y:

@select(df, :x, :y, x_y = :x .* :y)
## 3×3 DataFrame
## │ Row │ x     │ y     │ x_y   │
## │     │ Int64 │ Int64 │ Int64 │
## ├─────┼───────┼───────┼───────┤
## │ 1   │ 1     │ 2     │ 2     │
## │ 2   │ 2     │ 1     │ 2     │
## │ 3   │ 3     │ 2     │ 6     │

Notice that the name of the new column is not referenced as symbol.

@transform macro

The DataFramesMeta also has a specific macro to perform mutation.
@transform is very useful when we want to add new columns based on keyword
argument:

@transform(df, newColumn = :x .^ 2 + 2 .* :x)
## 3×3 DataFrame
## │ Row │ x     │ y     │ newColumn │
## │     │ Int64 │ Int64 │ Int64     │
## ├─────┼───────┼───────┼───────────┤
## │ 1   │ 1     │ 2     │ 3         │
## │ 2   │ 2     │ 1     │ 8         │
## │ 3   │ 3     │ 2     │ 15        │

One more time, @transform works similarly to the mutate() function from
dplyr package in R.

@orderby macro

@orderby macro is used to sort the dataset according to a specific column.

using StatsBase # to use the sample() function
## unordered data set: shuffling the dataset
df = df[sample(1:nrow(df), nrow(df), replace = false), :]
## ordered data set:
@orderby(df, :x)
## 3×2 DataFrame
## │ Row │ x     │ y     │
## │     │ Int64 │ Int64 │
## ├─────┼───────┼───────┤
## │ 1   │ 1     │ 2     │
## │ 2   │ 2     │ 1     │
## │ 3   │ 3     │ 2     │

By default, the sort will be performed in ascending order. To sort
in descending order, just add a negative sign:

@orderby(df, -:x)
## 3×2 DataFrame
## │ Row │ x     │ y     │
## │     │ Int64 │ Int64 │
## ├─────┼───────┼───────┤
## │ 1   │ 3     │ 2     │
## │ 2   │ 2     │ 1     │
## │ 3   │ 1     │ 2     │















@linq macro

For me, this is the most useful macro in this package. @linq macro supports
chaining all of the functionality defined in other macros. In practice, it
means that we can chain a bunch of macro commands using the pipe |> syntax,
but overcoming its limitations. What do I mean about limitations of the pipe symbol?
Take the following as an example:

## This will not work:
df |>
    select(:x) |>
    transform(newColumn = :x .^2)

When you run this chunk, Julia will throw an error because it will not
recognize the select macro, but also it will not figure it out how pipe the
df dataset to the following functions. That’s when the @linq macro is very
useful:

## This will work:
@linq df |>
    select(:x) |>
    transform(newColumn = :x .^2)
## 3×2 DataFrame
## │ Row │ x     │ newColumn │
## │     │ Int64 │ Int64     │
## ├─────┼───────┼───────────┤
## │ 1   │ 1     │ 1         │
## │ 2   │ 2     │ 4         │
## │ 3   │ 3     │ 9         │

As we can see, the use of the @linq macro allows us to pipe the df dataset
to the first macro, and then pipe the result of this to the following macro and
so on. Moreover, chaining the individual macros makes the code looks cleaner and
more obvious with less noise from @ symbols.

Previously, we performed an operation using the @with macro where we subset
the rows with x greater than 1 and take only the column y:

@with(df, df[:x .> 1, ^(:y)])
## 2-element Array{Int64,1}:
##  1
##  2

We could get the same result using @linq and |>:

@linq df |>
    where(:x .> 1) |>
    select(:y)
## 2×1 DataFrame
## │ Row │ y     │
## │     │ Int64 │
## ├─────┼───────┤
## │ 1   │ 1     │
## │ 2   │ 2     │

Moreover, you can not only use @linq to chain macros, but also with any
function. For example, we can pipe a dataset to see the first 5 rows as well as
to the describe() function:

@linq foo |> 
    first(5)
## 5×5 DataFrame
## │ Row │ x1       │ x2        │ x3       │ x4       │ x5       │
## │     │ Float64  │ Float64   │ Float64  │ Float64  │ Float64  │
## ├─────┼──────────┼───────────┼──────────┼──────────┼──────────┤
## │ 1   │ 0.921889 │ 0.205641  │ 0.200573 │ 0.247493 │ 0.490941 │
## │ 2   │ 0.978455 │ 0.0355136 │ 0.320555 │ 0.174487 │ 0.739182 │
## │ 3   │ 0.012211 │ 0.349529  │ 0.88621  │ 0.344799 │ 0.023447 │
## │ 4   │ 0.52635  │ 0.614627  │ 0.857196 │ 0.52956  │ 0.542027 │
## │ 5   │ 0.555943 │ 0.599705  │ 0.981179 │ 0.414446 │ 0.116372 │
## Default behavior will omitt some columns:
@linq foo |> 
    describe
## 5×8 DataFrame. Omitted printing of 2 columns
## │ Row │ variable │ mean     │ min       │ median   │ max      │ nunique │
## │     │ Symbol   │ Float64  │ Float64   │ Float64  │ Float64  │ Nothing │
## ├─────┼──────────┼──────────┼───────────┼──────────┼──────────┼─────────┤
## │ 1   │ x1       │ 0.642087 │ 0.012211  │ 0.715696 │ 0.978455 │         │
## │ 2   │ x2       │ 0.391035 │ 0.0355136 │ 0.288137 │ 0.971574 │         │
## │ 3   │ x3       │ 0.522389 │ 0.184677  │ 0.365964 │ 0.981179 │         │
## │ 4   │ x4       │ 0.467295 │ 0.174487  │ 0.381865 │ 0.813457 │         │
## │ 5   │ x5       │ 0.478956 │ 0.023447  │ 0.516484 │ 0.875312 │         │
## This will show all columns:
@linq foo |> 
    describe |> 
    show(allcols = true)
## 5×8 DataFrame
## │ Row │ variable │ mean     │ min       │ median   │ max      │ nunique │
## │     │ Symbol   │ Float64  │ Float64   │ Float64  │ Float64  │ Nothing │
## ├─────┼──────────┼──────────┼───────────┼──────────┼──────────┼─────────┤
## │ 1   │ x1       │ 0.642087 │ 0.012211  │ 0.715696 │ 0.978455 │         │
## │ 2   │ x2       │ 0.391035 │ 0.0355136 │ 0.288137 │ 0.971574 │         │
## │ 3   │ x3       │ 0.522389 │ 0.184677  │ 0.365964 │ 0.981179 │         │
## │ 4   │ x4       │ 0.467295 │ 0.174487  │ 0.381865 │ 0.813457 │         │
## │ 5   │ x5       │ 0.478956 │ 0.023447  │ 0.516484 │ 0.875312 │         │
## 
## │ Row │ nmissing │ eltype   │
## │     │ Nothing  │ DataType │
## ├─────┼──────────┼──────────┤
## │ 1   │          │ Float64  │
## │ 2   │          │ Float64  │
## │ 3   │          │ Float64  │
## │ 4   │          │ Float64  │
## │ 5   │          │ Float64  │







Conclusion

The DataFramesMeta package presents a collection of useful macros that can be used to
perform data wrangling and make our code cleaner. As we saw, some of these macros
behave similarly to functions commonly used in R. Also, the use
of the pipe operator |> plus the @linq macro makes the experience even more
alike to the %>% operator from maggrittr package in R. Hence, if you come (like me) from a
R background, Julia can become a lot easier to learn. The following table
summarizes the equivalence among functions used in this post between the two
languages:

Julia R
@with with()
@select select()
@where filter()
@transform mutate()
@orderby arrange()/order()
@linq + |> %>%