Author Archives: Posts on Cleyton Farias

Using Julia for Data Science (Part 03): Plotting

By: Posts on Cleyton Farias

Re-posted from: https://cleytonfar.github.io/posts/using-julia-for-data-science-part-03/

Plotting in Julia is available through a set of packages, each one using its own
syntax. The most important examples are the Plots and Gadfly packages.
In this post, we will take a look at the basic functionalities of these libraries.

Before we start playing around, the first thing to do is to install the necessary
packages:

using Pkg
Pkg.add("Plots")
Pkg.add("GR")
Pkg.add("PyPlot")
Pkg.add("Gadfly")

Now let’s get started!!

The Plots package

The most basic plot that we can do is a line plot. We can plot a line by calling
the plot() function on two vectors:

using Plots
x = 1:10;
y = rand(10, 1);
plot(x, y)

In Plots, every column is treated as a series. Thus, we can plot multiple
lines by plotting a matrix of values where each column will be interpreted as a
different series:

y = rand(10, 2);
p = plot(x, y)

We can modify an existing plot by using the modifier function plot!(). For instance,
let’s add one more line to the previous plot:

z = rand(10);
## adding line z to plot p:
plot!(p, x, z) 

Notice that I specified the plot (p) to be modified in the last call. We could
just call plot!(x, z) and the plot p would still be modified, because the Plots
package applies the modifications to the latest plot.
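
To make this "latest plot" behavior concrete, here is a minimal sketch. It assumes the Plots function current(), which returns the most recent plot, and it peeks at series_list, an internal field of the plot object that I use only to count the series (the names same_plot and n_series are mine):

```julia
using Plots

x = 1:10
p = plot(x, rand(10, 2))   # two series, one per column
plot!(x, rand(10))         # no plot object given: the current plot is modified

same_plot = current() === p        # current() returns the most recent plot
n_series = length(p.series_list)   # internal field, used here only for counting
```

If everything went as expected, same_plot is true and n_series is 3, that is, the third line landed on p even though we never mentioned it.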

Plots Attributes

Not only do we want to make plots, but we also want to make them look nice, right?! So, in order
to do that, we can tweak the plot attributes. The Plots package follows a simple
rule with data vs attributes: positional arguments are input data, and keyword
arguments are attributes. For instance, calling plot(x, y, z) will produce a 3D plot, while
calling plot(x, y, attribute = value) will output a 2D plot with an attribute.
To illustrate this, let’s add a title and modify the legend labels for our previous plot:

p = plot(x, y, 
     title = "My beautiful Plot", ## adding a title
     label = ["1" "2"]) ## adding legend labels (a row vector: one label per series)

Additionally, we can use modifier functions to customize our plots. For example,
let’s say we wanted to add a label for the y-axis and x-axis. We could just add
the argument xlabel = "..." and ylabel = "..." on the last call, or we could
use the modifier functions xlabel!() and ylabel!():

xlabel!(p, "My customized x label")

ylabel!(p, "My customized y label")

Also, we can customize the line colors, as well as adding markers and even annotations
to the plot:

markershapes= [:circle :star5];
markercolors= [:orange :green];
plot(x, y,
     title = "My beautiful Plot",
     xlabel = "My customized x label",
     ylabel = "My customized y label",
     label = ["1" "2"],
     color = markercolors,
     shape = markershapes,
     annotation = [(4, .9, "Look at me!!")])

Of course, a data scientist cannot survive on line plots alone, right?! In Plots,
we can make other types of plots just by adjusting the seriestype = "..." attribute.
For instance, instead of a line plot, we can make a scatter plot:

x = rand(20);
y = rand(20);
plot(x, y, seriestype = :scatter, legend = false, color = [:blue])

Also, we can make a bar plot:

x = 1:10;
y = sin.(x);
plot(x, y, seriestype = :bar, legend = false)

and to make a histogram, we can do:

using LaTeXStrings
mathstring = L"X \sim \mathcal{N}(0,\,1)";
plot(randn(1000), seriestype = :histogram, legend = false, title = mathstring)

Notice that we can also add LaTeX notation in the plot using the functionalities
from the LaTeXStrings package.
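
As a side note, Plots also offers shorthand functions named after the series types, such as scatter(), bar() and histogram(). The sketch below compares the two spellings for a scatter plot. Indexing the plot as p[1][1][:seriestype] (first subplot, first series, attribute lookup) is just my way of inspecting the result, and the names st1 and st2 are mine:

```julia
using Plots

x = rand(20)
y = rand(20)

p1 = plot(x, y, seriestype = :scatter, legend = false)
p2 = scatter(x, y, legend = false)   # shorthand for seriestype = :scatter

# Look up the series type stored in the first series of each plot.
st1 = p1[1][1][:seriestype]
st2 = p2[1][1][:seriestype]
```

Both st1 and st2 should be :scatter, so which spelling to use is mostly a matter of taste.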

There is a large number of plot attributes we can tweak. This is just the tip of
the iceberg. For more detail, please refer to the official documentation.

Plot Backend

Now, let me tell you something:

Plots is not a plotting package!!

What??? That’s right!! Plots is what is called a metapackage. Its aim is to
bring many different plotting packages under a single API (interface). What do you mean by that, Cleyton?

Well… in Julia we have access to different plotting packages such as PyPlot (Python’s matplotlib),
Plotly, GR and some others.
Each one has different features, which can be very useful in certain situations.
However, each one also has its own syntax. So, in order to get the most from these packages,
you would have to learn their syntax.

That’s when Plots comes in handy! Instead of making you learn different syntaxes, the Plots
package gives you access to different plotting packages (called backends)
using one single syntax. Plots interprets your commands and then generates
the plots using another plotting library. This means you can use many
different plotting libraries, all with the Plots syntax, only by specifying which
backend you want to use. That’s it! Just like that!

Up until now, our plots were using the default backend. The default depends on which
plotting packages you have installed in Julia. Some common choices for backends (plotting packages)
are PyPlot and GR. To install these backends, simply use the standard Julia
installation Pkg.add("BackendPackage").

To specify which backend we want to use, we call the name of the backend
in lower case as a function:

x = 1:10;
y = rand(10, 2);
## specifying pyplot backend:
pyplot()
## Plots.PyPlotBackend()
plot(x, y, title = "using Pyplot", shape = :circle)

See?! Very easy! You can keep changing the backend back and forth just like that.
The choice of backend depends on the situation. Usually, I prefer to use Plotly
when I want to make interactive plots, GR to make simple and quick plots (for example, in an exploratory data analysis situation), and PyPlot otherwise.
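
If you lose track of which backend is active after switching back and forth, you can ask Plots. Here is a small sketch, assuming the GR backend is installed and using Plots.backend_name(), which reports the active backend as a symbol (the variable name active is mine):

```julia
using Plots

gr()                             # select the GR backend
active = Plots.backend_name()    # name of the active backend, as a Symbol
p = plot(1:10, rand(10, 2), title = "using GR", shape = :circle)
```

Here active should be :gr, and the plot p is produced by GR even though the plot() call itself did not change.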

In order to save the plots we use the savefig() command:

# saves the current plot:
savefig("myplot.png") 
# saves the plot from p:
savefig(p,"myplot.pdf") 

For more information on backends, please refer to the official documentation.

Recipe Libraries

Recipe libraries are extensions that we can use with the Plots framework.
They add more functionalities such as default interpretations for certain types,
new series types, and many others.

One of the most important recipe libraries is StatsPlots, a package comprising
a set of new statistical plot series for certain data types. We can install this
library using the Pkg.add("StatsPlots") command. The StatsPlots package
has a macro @df which allows you to plot a DataFrame directly by using the
column names. We can specify the column names either as a symbol (:column_name) or
as a string (“column_name”):

using StatsPlots
using DataFrames
## creating a random DataFrame
df = DataFrame(a = 1:10, b = rand(10), c = rand(10));
## Plotting using the @df macro specifying column names as symbols:
@df df plot(:a, [:b :c], color = [:red :blue])

We can also make a call for @df using the cols() utility function. This function
allows us to specify the column using a positional index:

@df df plot(:a, cols(2:3), color = [:red :blue])

StatsPlots also contains the corrplot() and cornerplot() functions to plot
the correlation among input variables:

@df df corrplot(cols(2:3))

@df df cornerplot(cols(2:3))

Of course, there are more functionalities in the StatsPlots library
than I have shown here. For more detail, please refer to the official documentation.

The Gadfly Package

Now, let me be honest with you: this is my favorite one!! Gadfly is another
package used to create beautiful plots in Julia. This package is an implementation
of the “grammar of graphics” style. For those who have R experience, this is
the same principle used in the wonderful ggplot2 package.

In order to start playing with Gadfly, we need some data. Let’s make use of the
RDatasets package, which gives us access to a list of the datasets available from R.

Pkg.add("RDatasets")

When used with a DataFrame, we can use the plot() function with the following syntax:

plot(data::DataFrame, x = :column_name, y = :column_name, geometry)

where the geometry argument is just the series type you want to plot: a line, point,
error bar, histogram, etc.
Notice something: Plots and Gadfly use the same name for the plotting function.
To avoid confusion in Julia about which plot() function to call, we can specify
from which package we want the call to be made by using the Gadfly.plot() syntax.
For those who have an R background, this syntax is equivalent to name_package::function_name()
in R.
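
One way to keep the two plot() functions apart is to load Gadfly with import instead of using. A minimal sketch with made-up data follows. With import, only the module name Gadfly is brought into scope, so every Gadfly name has to be written with its prefix, including Gadfly.Geom.point:

```julia
import Gadfly   # brings in only the module name, so nothing clashes with Plots.plot

# Toy data, just for illustration.
p = Gadfly.plot(x = [1, 2, 3], y = [4, 5, 6], Gadfly.Geom.point);
```

The price is more typing, so in the rest of this post I stick with using Gadfly plus the Gadfly.plot() prefix.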

Now, let’s use the iris dataset to start playing around with Gadfly:

using RDatasets
iris = dataset("datasets", "iris");
first(iris, 5)
## 5×5 DataFrame
## │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species      │
## │     │ Float64     │ Float64    │ Float64     │ Float64    │ Categorical… │
## ├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────────┤
## │ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa       │
## │ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa       │
## │ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa       │
## │ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ setosa       │
## │ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ setosa       │

First, let’s make a scatter plot using the SepalLength and SepalWidth variables.
To specify that we want a scatter plot, we must set the geometry element using
the Geom.point argument:

using Gadfly
Gadfly.plot(iris, x = :SepalLength, y = :SepalWidth, Geom.point)



We can keep adding geometries to produce more layers in the plot. For instance, we
can add lines to the plot just by adding the Geom.line argument:

Gadfly.plot(iris, x = :SepalLength, y = :SepalWidth, Geom.point, Geom.line)



Also, we can set the keyword argument color according to some variable to specify
how to color the points:

Gadfly.plot(iris, x = :SepalLength, y = :SepalWidth, color = :Species, Geom.point)



Gadfly has some special signatures to make plotting functions and expressions
more convenient:

Gadfly.plot((x,y) -> sin(x) + cos(y), 0, 2pi, 0, 2pi)



As you may have noticed, a call to Gadfly.plot() renders the image to your default
multimedia display, typically an internet browser. To be honest, I do not know why this is the
default behavior. To render the plot to a file instead, Gadfly supports creating SVG images out of the box.
The PNG, PDF, PS, and PGF formats require Julia’s bindings to cairo
and fontconfig, which can be installed with:

Pkg.add("Cairo")
Pkg.add("Fontconfig")

To save to a file, we use the draw() function on the chosen backend:

import Cairo, Fontconfig  ## needed for the PDF and PNG backends
p = Gadfly.plot((x,y) -> sin(x) + cos(y), 0, 2pi, 0, 2pi);
## saving to a pdf device:
draw(PDF("plot.pdf"), p)
## or to a png device
draw(PNG("plot.png"), p)

Geometries

Gadfly presents a lot of geometry format options. As we have seen, to plot more
geometries to a figure we can just add more geometry types. The most common ones are
Geom.line, Geom.point, Geom.bar, Geom.boxplot, Geom.histogram,
Geom.errorbar, Geom.density, etc.

We already saw Geom.line and Geom.point. So now let’s plot the other geometry types
in one figure using the gridstack() function:

p1 = Gadfly.plot(dataset("ggplot2", "diamonds"), x= :Price, Geom.histogram);
p2 = Gadfly.plot(dataset("HistData", "ChestSizes"), x = :Chest, y = :Count, Geom.bar);
p3 = Gadfly.plot(dataset("lattice", "singer"), x = :VoicePart, y = :Height, Geom.boxplot);
p4 = Gadfly.plot(dataset("ggplot2", "diamonds"), x = :Price, Geom.density);
gridstack([p1 p2; p3 p4])



Theme

We can tweak the plot appearance by using the Theme() function. Many parameters
controlling the appearance of plots can be overridden by passing this function
to plot() or setting the Theme as the current theme using push_theme().

For instance, we can change the label font, the label font size and the background color:

Gadfly.plot(x = rand(10), y = rand(10),
             Theme(major_label_font = "Hack",
                   minor_label_font = "Hack",
                   major_label_font_size = 16pt,
                   minor_label_font_size = 14pt,
                   background_color = "#bdbdbd"))



There are a lot of options we can tweak in Theme(). This just scratches the surface.
For the full list of options, see the Themes section of the Gadfly documentation.

Calling ggplot2

The Plots and Gadfly packages are the two main plotting packages for Julia.
Each one has different characteristics and its own syntax.

However, let’s say you have an R background and you are very used to the wonderful
ggplot2 package and would rather not learn another plotting system. Or it might
be the case that while you are still learning the Julia plotting system you have to create
very well crafted plots for your report, but you only know how to do it in ggplot2.

What if I told you there is a way to use Julia and still make plots using ggplot2 package?
Well, in order to do that we will use the RCall package. First of all, let’s install
this package:

Pkg.add("RCall")

RCall is a package that aims to facilitate communication between
the R and Julia languages. It allows the user to call R packages from within
Julia, providing the best of both worlds.

In order to call ggplot2 package from Julia, we use the @rlibrary syntax to
load the R package. Then, we can use R"" syntax to call the R command:

using RCall
@rlibrary ggplot2
gasoline = dataset("Ecdat", "Gasoline");

## notice that we use $name_dataset inside R"" command.
R"ggplot($gasoline, aes(x = Year, y = LGasPCar, color = Country)) +
  geom_line() + 
  geom_point() + 
  ggthemes::theme_economist_white(gray_bg = F) +    
  theme(panel.grid.major = element_line(colour = '#d9d9d9',
                                        size = rel(0.9),
                                        linetype='dashed'),
        legend.position = 'bottom',
        legend.direction = 'horizontal',
        legend.box = 'horizontal',
        legend.key.size = unit(1, 'cm'),
        plot.title = element_text(family= 'AvantGarde', hjust = 0.5),
        text = element_text(family =  'AvantGarde'),
        axis.title = element_text(size = 12),
        axis.text.x = element_text(angle = 0, hjust = 0.5),
        legend.text = element_text(size = 12),
        legend.title=element_text(face = 'bold', size = 12)) +
  labs(title = 'Gas Consumption over the years', x = '', y = '')"

That’s it!!! Now, you do not need to leave Julia in order to make your plots with ggplot2.
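
The $ interpolation is not specific to ggplot2. As a small sketch (assuming a working R installation behind RCall), we can send any Julia object into an R command and bring the result back with rcopy(). The names x, r_result and m are mine:

```julia
using RCall

x = [1.0, 2.0, 3.0]
r_result = R"mean($x)"   # $x copies the Julia vector into the R command
m = rcopy(r_result)      # convert the R value back into a Julia value
```

Here m holds the mean computed by R, which is 2.0.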

Conclusion

In this post we saw basic functionalities of the main packages from the Julia
plotting system. Plots and Gadfly stand out as the major players when it comes
to plotting in Julia.

The Plots package is not really a plotting package but rather an API to call other
plotting libraries using a common syntax. Its functionalities somewhat resemble
the ones from the base plotting system in R.

On the other hand, Gadfly is an implementation of the “grammar of graphics”
style found in the already consolidated ggplot2 package from R.
It resembles many of the functionalities found in ggplot2 and is highly customizable.

Which package is better depends on the case and, of course, on your preferences.
Personally, I am very satisfied with Gadfly because of the similarities with
ggplot2, but the Plots package offers some handy functionalities through
recipe libraries, for instance StatsPlots.

As an introduction to the topic, I hope this post helps you get a better understanding
of how to make well crafted plots in Julia. If you have any additional comments or
suggestions, please feel free to let me know!!

Using Julia for Data Science (Part 02)

By: Posts on Cleyton Farias

Re-posted from: https://cleytonfar.github.io/posts/using-julia-for-data-science-part-02/

In the previous post,
we talked about what Julia is and how to install it, and we learned how to work
right away with tabular data. Now we are going to take one more step and learn new
tricks with the DataFramesMeta package.

The first thing that we need to do is to install the package.

using Pkg
Pkg.add("DataFramesMeta")

Once it is installed, let’s get started!

Important: In this post I am using the current stable release (v1.1.0). For
those who are using the long-term support release (v1.0.3), all the code will run just
fine.

Introduction to DataFramesMeta package

The DataFramesMeta is a package that provides a collection of metaprogramming
tools for DataFrames. But you may be wondering why you should worry about
metaprogramming? This package
offers some macros that can be used to improve performance and provide more
convenient syntax. But again, you might be asking: how is that useful?

Ok ok ok. So, let’s consider the example used in the previous post where we
had some random dataset and we wanted the rows in which x1 and x2 are
both greater than or equal to their average:

using DataFrames
using Statistics
## Creating a random dataset with 10 rows and 5 columns:
foo = DataFrame(rand(10, 5));
## Creating the conditions:
cond1 = foo.x1 .>= mean(foo.x1);
cond2 = foo.x2 .>= mean(foo.x2);
## Subsetting:
foo[.&(cond1, cond2), :]
## 1×5 DataFrame
## │ Row │ x1      │ x2       │ x3       │ x4       │ x5       │
## │     │ Float64 │ Float64  │ Float64  │ Float64  │ Float64  │
## ├─────┼─────────┼──────────┼──────────┼──────────┼──────────┤
## │ 1   │ 0.96575 │ 0.971574 │ 0.228869 │ 0.813457 │ 0.807968 │

That is it!! But what if I told you there is a way we could get the same result
typing much less code?

using DataFramesMeta
@where(foo, :x1 .>= mean(:x1), :x2 .>= mean(:x2))
## 1×5 DataFrame
## │ Row │ x1      │ x2       │ x3       │ x4       │ x5       │
## │     │ Float64 │ Float64  │ Float64  │ Float64  │ Float64  │
## ├─────┼─────────┼──────────┼──────────┼──────────┼──────────┤
## │ 1   │ 0.96575 │ 0.971574 │ 0.228869 │ 0.813457 │ 0.807968 │

Did you see that? Using the macro @where we achieved the same result as before
with just one line of code. That’s what the DataFramesMeta package is all about:
a collection of “functions” beginning with @ that simplify some tasks when
working with DataFrames.

Main Features of DataFramesMeta:

In this post, we are going to explore what I consider to be the main tools of
the DataFramesMeta package. For more detail, please refer to the official documentation.

@with macro

@with is a macro that can be used with DataFrames and allows us to reference
columns as symbols in expressions. For those who are familiar with the R language,
it works similarly to the with() function.

df = DataFrame(x = 1:3, y = [2, 1, 2])
## 3×2 DataFrame
## │ Row │ x     │ y     │
## │     │ Int64 │ Int64 │
## ├─────┼───────┼───────┤
## │ 1   │ 1     │ 2     │
## │ 2   │ 2     │ 1     │
## │ 3   │ 3     │ 2     │
x = [2, 1, 0];
## Taking the column "y" from the df DataFrame and adding 1:
@with(df, :y .+ 1)
## 3-element Array{Int64,1}:
##  3
##  2
##  3
## Taking the column "x" from df and add it to the x variable:
@with(df, :x + x)
## 3-element Array{Int64,1}:
##  3
##  3
##  3

Also, we can reference a column through an expression wrapped in the cols() function:

colref = :x;
@with(df, cols(colref) .+ 1)
## 3-element Array{Int64,1}:
##  2
##  3
##  4

The use of cols() is very useful when we want to perform some task over a list
of column variables in a for or while loop.

If an expression is wrapped in ^(expr), then expr gets passed through untouched:

@with(df, df[:x .> 1, ^(:y)])
## 2-element Array{Int64,1}:
##  1
##  2

Later, I’ll show how to perform this same task by piping some macros using the
pipe |> symbol.

@where macro

@where is used when we want to get subsets of DataFrames according to
some criteria. It is similar to the filter() function from the dplyr package
in R:

@where(df, :x .> 1)
## 2×2 DataFrame
## │ Row │ x     │ y     │
## │     │ Int64 │ Int64 │
## ├─────┼───────┼───────┤
## │ 1   │ 2     │ 1     │
## │ 2   │ 3     │ 2     │
@where(df, :x .> x, :y .== 1)
## 1×2 DataFrame
## │ Row │ x     │ y     │
## │     │ Int64 │ Int64 │
## ├─────┼───────┼───────┤
## │ 1   │ 2     │ 1     │

Notice that if there is more than one condition inside @where, the conditions are
combined as condition1 AND condition2. But if you want an OR condition to
be performed, we can use the .|() syntax:

@where(df, .|(:x .> x, :y .== 1))
## 2×2 DataFrame
## │ Row │ x     │ y     │
## │     │ Int64 │ Int64 │
## ├─────┼───────┼───────┤
## │ 1   │ 2     │ 1     │
## │ 2   │ 3     │ 2     │
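
To make the AND and OR behavior above concrete, here is a sketch of the same two subsets written with plain DataFrames indexing and the broadcast operators .& and .|. It is only meant to show the logic, not how @where is implemented, and the names both and either are mine:

```julia
using DataFrames

df = DataFrame(x = 1:3, y = [2, 1, 2])
x = [2, 1, 0]

both   = df[(df.x .> x) .& (df.y .== 1), :]   # AND: every condition must hold
either = df[(df.x .> x) .| (df.y .== 1), :]   # OR: at least one condition holds
```

The first subset keeps only the row with x equal to 2, while the second keeps the rows with x equal to 2 and 3, matching the outputs shown above.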

@select macro

@select macro can be used to perform column selection in DataFrames. Again,
if you are familiar with the dplyr package in R, it works similarly to the
select() function.

## Select column x and return as a DataFrame:
@select(df, :x)
## 3×1 DataFrame
## │ Row │ x     │
## │     │ Int64 │
## ├─────┼───────┤
## │ 1   │ 1     │
## │ 2   │ 2     │
## │ 3   │ 3     │

Moreover, we can also mutate variables using @select. For instance, suppose that
we want the columns x, y, and a new column representing x times y:

@select(df, :x, :y, x_y = :x .* :y)
## 3×3 DataFrame
## │ Row │ x     │ y     │ x_y   │
## │     │ Int64 │ Int64 │ Int64 │
## ├─────┼───────┼───────┼───────┤
## │ 1   │ 1     │ 2     │ 2     │
## │ 2   │ 2     │ 1     │ 2     │
## │ 3   │ 3     │ 2     │ 6     │

Notice that the name of the new column is not referenced as a symbol.

@transform macro

DataFramesMeta also has a specific macro to perform mutation.
@transform is very useful when we want to add new columns based on keyword
arguments:

@transform(df, newColumn = :x .^ 2 + 2 .* :x)
## 3×3 DataFrame
## │ Row │ x     │ y     │ newColumn │
## │     │ Int64 │ Int64 │ Int64     │
## ├─────┼───────┼───────┼───────────┤
## │ 1   │ 1     │ 2     │ 3         │
## │ 2   │ 2     │ 1     │ 8         │
## │ 3   │ 3     │ 2     │ 15        │

One more time, @transform works similarly to the mutate() function from
the dplyr package in R.
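
For comparison, here is a rough sketch of the same mutation using only the DataFrames package. The variable out is a name I made up for the result, and I copy the table first so that, like with @transform, the original df stays untouched:

```julia
using DataFrames

df = DataFrame(x = 1:3, y = [2, 1, 2])

out = copy(df)                           # work on a copy, leave df as it is
out.newColumn = df.x .^ 2 .+ 2 .* df.x   # same formula as in the @transform call
```

The new column holds 3, 8 and 15, as in the output above, and df still has only its two original columns.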

@orderby macro

@orderby macro is used to sort the dataset according to a specific column.

using StatsBase # to use the sample() function
## unordered data set: shuffling the dataset
df = df[sample(1:nrow(df), nrow(df), replace = false), :]
## ordered data set:
@orderby(df, :x)
## 3×2 DataFrame
## │ Row │ x     │ y     │
## │     │ Int64 │ Int64 │
## ├─────┼───────┼───────┤
## │ 1   │ 1     │ 2     │
## │ 2   │ 2     │ 1     │
## │ 3   │ 3     │ 2     │

By default, the sort will be performed in ascending order. To sort
in descending order, just add a negative sign:

@orderby(df, -:x)
## 3×2 DataFrame
## │ Row │ x     │ y     │
## │     │ Int64 │ Int64 │
## ├─────┼───────┼───────┤
## │ 1   │ 3     │ 2     │
## │ 2   │ 2     │ 1     │
## │ 3   │ 1     │ 2     │
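
If you prefer to stay within plain DataFrames, the sort() function gives the same orderings. This is a sketch on a small table that I typed in an arbitrary row order, and the names ascending and descending are mine:

```julia
using DataFrames

df = DataFrame(x = [2, 3, 1], y = [1, 2, 2])   # rows in a shuffled order

ascending  = sort(df, :x)               # like @orderby(df, :x)
descending = sort(df, :x, rev = true)   # like @orderby(df, -:x)
```

The rev = true keyword plays the role of the negative sign in the @orderby call.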


@linq macro

For me, this is the most useful macro in this package. The @linq macro supports
chaining all of the functionality defined in the other macros. In practice, it
means that we can chain a bunch of macro commands using the pipe |> syntax
while overcoming its limitations. What do I mean by the limitations of the pipe symbol?
Take the following as an example:

## This will not work:
df |>
    select(:x) |>
    transform(newColumn = :x .^2)

When you run this chunk, Julia will throw an error: not only does it not
recognize select as a macro, it also cannot figure out how to pipe the
df dataset into the following functions. That’s when the @linq macro is very
useful:

## This will work:
@linq df |>
    select(:x) |>
    transform(newColumn = :x .^2)
## 3×2 DataFrame
## │ Row │ x     │ newColumn │
## │     │ Int64 │ Int64     │
## ├─────┼───────┼───────────┤
## │ 1   │ 1     │ 1         │
## │ 2   │ 2     │ 4         │
## │ 3   │ 3     │ 9         │

As we can see, the use of the @linq macro allows us to pipe the df dataset
to the first macro, and then pipe the result of this to the following macro and
so on. Moreover, chaining the individual macros makes the code look cleaner and
more obvious, with less noise from @ symbols.
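
To see the limitation of the plain pipe more clearly: |> simply passes its left-hand side to a one-argument function on its right. Without @linq, each step therefore has to be written as such a function. The following is only a sketch using anonymous functions and plain DataFrames indexing (the name result is mine):

```julia
using DataFrames

df = DataFrame(x = 1:3, y = [2, 1, 2])

# Each step is an anonymous function that receives the incoming table as d.
result = df |>
    (d -> d[d.x .> 1, :]) |>   # keep the rows with x greater than 1
    (d -> d[:, [:y]])          # keep only column y, still as a DataFrame
```

It works, but all those d -> ... wrappers are exactly the noise that @linq removes for us.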

Previously, we performed an operation using the @with macro where we subset
the rows with x greater than 1 and take only the column y:

@with(df, df[:x .> 1, ^(:y)])
## 2-element Array{Int64,1}:
##  1
##  2

We could get the same result using @linq and |>:

@linq df |>
    where(:x .> 1) |>
    select(:y)
## 2×1 DataFrame
## │ Row │ y     │
## │     │ Int64 │
## ├─────┼───────┤
## │ 1   │ 1     │
## │ 2   │ 2     │

Moreover, you can use @linq not only to chain macros, but also with any
function. For example, we can pipe a dataset to first() to see the first 5 rows, as well as
to the describe() function:

@linq foo |> 
    first(5)
## 5×5 DataFrame
## │ Row │ x1       │ x2        │ x3       │ x4       │ x5       │
## │     │ Float64  │ Float64   │ Float64  │ Float64  │ Float64  │
## ├─────┼──────────┼───────────┼──────────┼──────────┼──────────┤
## │ 1   │ 0.921889 │ 0.205641  │ 0.200573 │ 0.247493 │ 0.490941 │
## │ 2   │ 0.978455 │ 0.0355136 │ 0.320555 │ 0.174487 │ 0.739182 │
## │ 3   │ 0.012211 │ 0.349529  │ 0.88621  │ 0.344799 │ 0.023447 │
## │ 4   │ 0.52635  │ 0.614627  │ 0.857196 │ 0.52956  │ 0.542027 │
## │ 5   │ 0.555943 │ 0.599705  │ 0.981179 │ 0.414446 │ 0.116372 │
## Default behavior will omit some columns:
@linq foo |> 
    describe
## 5×8 DataFrame. Omitted printing of 2 columns
## │ Row │ variable │ mean     │ min       │ median   │ max      │ nunique │
## │     │ Symbol   │ Float64  │ Float64   │ Float64  │ Float64  │ Nothing │
## ├─────┼──────────┼──────────┼───────────┼──────────┼──────────┼─────────┤
## │ 1   │ x1       │ 0.642087 │ 0.012211  │ 0.715696 │ 0.978455 │         │
## │ 2   │ x2       │ 0.391035 │ 0.0355136 │ 0.288137 │ 0.971574 │         │
## │ 3   │ x3       │ 0.522389 │ 0.184677  │ 0.365964 │ 0.981179 │         │
## │ 4   │ x4       │ 0.467295 │ 0.174487  │ 0.381865 │ 0.813457 │         │
## │ 5   │ x5       │ 0.478956 │ 0.023447  │ 0.516484 │ 0.875312 │         │
## This will show all columns:
@linq foo |> 
    describe |> 
    show(allcols = true)
## 5×8 DataFrame
## │ Row │ variable │ mean     │ min       │ median   │ max      │ nunique │
## │     │ Symbol   │ Float64  │ Float64   │ Float64  │ Float64  │ Nothing │
## ├─────┼──────────┼──────────┼───────────┼──────────┼──────────┼─────────┤
## │ 1   │ x1       │ 0.642087 │ 0.012211  │ 0.715696 │ 0.978455 │         │
## │ 2   │ x2       │ 0.391035 │ 0.0355136 │ 0.288137 │ 0.971574 │         │
## │ 3   │ x3       │ 0.522389 │ 0.184677  │ 0.365964 │ 0.981179 │         │
## │ 4   │ x4       │ 0.467295 │ 0.174487  │ 0.381865 │ 0.813457 │         │
## │ 5   │ x5       │ 0.478956 │ 0.023447  │ 0.516484 │ 0.875312 │         │
## 
## │ Row │ nmissing │ eltype   │
## │     │ Nothing  │ DataType │
## ├─────┼──────────┼──────────┤
## │ 1   │          │ Float64  │
## │ 2   │          │ Float64  │
## │ 3   │          │ Float64  │
## │ 4   │          │ Float64  │
## │ 5   │          │ Float64  │


Conclusion

The DataFramesMeta package presents a collection of useful macros that can be used to
perform data wrangling and make our code cleaner. As we saw, some of these macros
behave similarly to functions commonly used in R. Also, the use
of the pipe operator |> plus the @linq macro makes the experience even more
similar to the %>% operator from the magrittr package in R. Hence, if you come (like me) from an
R background, Julia can become a lot easier to learn. The following table
summarizes the equivalence between the two languages for the functions used in
this post:

Julia          R
@with          with()
@select        select()
@where         filter()
@transform     mutate()
@orderby       arrange()/order()
@linq + |>     %>%

Using Julia for Data Science

By: Posts on Cleyton Farias

Re-posted from: https://cleytonfar.github.io/posts/using-julia-for-data-science/

In recent years, data science has become a hugely attractive field, with the
profession of data scientist topping the list of the best jobs in America.
And with all the hype that the field produces, one might ask: what does it take to
be a data scientist?

Well… that’s a good question. First of all, there are a lot of requirements. But one
of the most important ones is to learn how to work with datasets. And I am not
talking about playing with spreadsheets. I am talking about working with a
real programming language to get the job done with any dataset, no matter how
huge it is.

Hence, this post is the first part of a series about working with tabular data
with the Julia programming language. The aim of this post is not only
to give you a quick introduction to the language, but also to show you how
you can easily install Julia and work right away with datasets.

Why Julia?

Glad you asked! Julia is a high-level programming language released in 2012 by a
team of MIT researchers. Since its beginning, the aim has been to solve the so-called
two-language problem: the ease of use of interpreted languages
(Python, R, Matlab) vs the high performance of compiled languages (C, C++, Fortran).
According to its creators:

We want a language that’s open source, with a liberal license.
We want the speed of C with the dynamism of Ruby.
We want a language that’s homoiconic, with true macros like Lisp, but with obvious,
familiar mathematical notation like Matlab. We want something as usable for
general programming as Python, as easy for statistics as R, as natural for string
processing as Perl, as powerful for linear algebra as Matlab, as good at gluing
programs together as the shell.
Something that is dirt simple to learn, yet keeps the most serious hackers happy.
We want it interactive and we want it compiled. — julialang.org

Hence, Julia was born. By combining a JIT (Just In Time) compiler with
a multiple-dispatch system, plus the fact that most of its codebase is written
in Julia itself, Julia gave rise to a popular phrase in the
community:

“Walks like Python, runs like C.”

Installing Julia

To play around with Julia there are some options. One obvious way is to download
the official binaries from the site for your
specific platform (Windows, macOS, Linux, etc). At the
time of this writing, the current stable release is v1.1.0 and the
long-term support release is v1.0.3. Once you have downloaded and executed
the binaries, you will see the following window:



Another option is to use Julia in the browser on JuliaBox.com
with Jupyter notebooks. No installation is required – just point your browser
there, log in and start playing around.

Installing Packages

All the package management in Julia is performed by the Pkg package. To
install a given package we use Pkg.add("package_name"). In this tutorial we
are going to use some packages that are not pre-installed with Julia. To install
them, do the following:

using Pkg
Pkg.add("DataFrames")
Pkg.add("DataFramesMeta")
Pkg.add("CSV")

We installed three packages: DataFrames (which is the subject of this post),
DataFramesMeta (we will use some of its functionalities) and CSV (to read
and write CSV files).

Of course, there is more to package management in Julia than I have just shown.
A great introduction is presented in this video by Jane
Herriman. For more advanced usage, please refer to the official documentation.

Introduction to DataFrames in Julia

In Julia, tabular data is handled using the DataFrames package. Other packages,
such as CSV, are commonly used to read data into Julia and write data out of it.

A data frame is created using the DataFrame() function:

using DataFrames 
foo = DataFrame();
foo 
## 0×0 DataFrame

To use the functionalities of the package, let’s create some random data. I will
use the rand() function to generate random numbers, create a 100 x 10 array
and convert it to a data frame:

foo = DataFrame(rand(100, 10));
foo 
## 100×10 DataFrame. Omitted printing of 4 columns
## │ Row │ x1         │ x2       │ x3        │ x4       │ x5        │ x6       │
## │     │ Float64    │ Float64  │ Float64   │ Float64  │ Float64   │ Float64  │
## ├─────┼────────────┼──────────┼───────────┼──────────┼───────────┼──────────┤
## │ 1   │ 0.0193136  │ 0.466228 │ 0.790475  │ 0.805074 │ 0.51182   │ 0.201707 │
## │ 2   │ 0.986082   │ 0.33719  │ 0.309992  │ 0.117098 │ 0.792606  │ 0.102682 │
## │ 3   │ 0.00366356 │ 0.323071 │ 0.685271  │ 0.596414 │ 0.847368  │ 0.105035 │
## │ 4   │ 0.297846   │ 0.136907 │ 0.726739  │ 0.569452 │ 0.922995  │ 0.846519 │
## │ 5   │ 0.73245    │ 0.208294 │ 0.353801  │ 0.448741 │ 0.185897  │ 0.496741 │
## │ 6   │ 0.209719   │ 0.114021 │ 0.0662264 │ 0.463682 │ 0.628582  │ 0.130653 │
## │ 7   │ 0.341692   │ 0.608349 │ 0.946541  │ 0.589161 │ 0.418321  │ 0.295541 │
## ⋮
## │ 93  │ 0.714495   │ 0.661317 │ 0.954527  │ 0.209581 │ 0.107941  │ 0.233787 │
## │ 94  │ 0.680497   │ 0.101874 │ 0.872371  │ 0.596457 │ 0.669133  │ 0.740674 │
## │ 95  │ 0.909319   │ 0.182776 │ 0.343387  │ 0.142707 │ 0.0140866 │ 0.791679 │
## │ 96  │ 0.642578   │ 0.949993 │ 0.380511  │ 0.96358  │ 0.878766  │ 0.270409 │
## │ 97  │ 0.605148   │ 0.240233 │ 0.144059  │ 0.545245 │ 0.0463105 │ 0.188397 │
## │ 98  │ 0.0907523  │ 0.334278 │ 0.288403  │ 0.519876 │ 0.267965  │ 0.552448 │
## │ 99  │ 0.751681   │ 0.289301 │ 0.488135  │ 0.382877 │ 0.320208  │ 0.999445 │
## │ 100 │ 0.856248   │ 0.577105 │ 0.588476  │ 0.435958 │ 0.0163749 │ 0.337817 │

Maybe you have noticed the “;” at the end of a command. It turns out that in Julia,
contrary to many other languages, everything is an expression, so it returns a
value, and an interactive session (the REPL or a notebook) prints that value.
Hence, to suppress the printed output, we include the “;” at the end of the command.

To get the dimensions of a data frame, we can use the size() function. Also,
similarly to the R programming language, nrow() and ncol() are available to
get the number of rows and columns, respectively:

size(foo)
## (100, 10)
nrow(foo)
## 100
ncol(foo)
## 10

Another basic task when working with datasets is to get the names of the
variables contained in the table. We use the names() function to get the column
names:

names(foo)
## 10-element Array{Symbol,1}:
##  :x1 
##  :x2 
##  :x3 
##  :x4 
##  :x5 
##  :x6 
##  :x7 
##  :x8 
##  :x9 
##  :x10

To get a summary of the dataset in general, we can use the function describe():

describe(foo)
## 10×8 DataFrame. Omitted printing of 2 columns
## │ Row │ variable │ mean     │ min        │ median   │ max      │ nunique │
## │     │ Symbol   │ Float64  │ Float64    │ Float64  │ Float64  │ Nothing │
## ├─────┼──────────┼──────────┼────────────┼──────────┼──────────┼─────────┤
## │ 1   │ x1       │ 0.502457 │ 0.00190391 │ 0.508102 │ 0.993014 │         │
## │ 2   │ x2       │ 0.461593 │ 0.0143797  │ 0.465052 │ 0.949993 │         │
## │ 3   │ x3       │ 0.4659   │ 0.0180212  │ 0.409124 │ 0.978917 │         │
## │ 4   │ x4       │ 0.503142 │ 0.0130052  │ 0.508707 │ 0.986293 │         │
## │ 5   │ x5       │ 0.518394 │ 0.00177395 │ 0.502389 │ 0.994104 │         │
## │ 6   │ x6       │ 0.486075 │ 0.00543681 │ 0.475648 │ 0.999445 │         │
## │ 7   │ x7       │ 0.490961 │ 0.00366989 │ 0.482302 │ 0.996092 │         │
## │ 8   │ x8       │ 0.503405 │ 0.0180501  │ 0.525201 │ 0.985918 │         │
## │ 9   │ x9       │ 0.507343 │ 0.0327247  │ 0.533176 │ 0.990731 │         │
## │ 10  │ x10      │ 0.468541 │ 0.00622055 │ 0.470003 │ 0.996703 │         │

Note that there is a message indicating the omission of some columns. This is the
default display behavior for DataFrames when the table is wider than the screen.
To print all the columns, we use the show() function as follows:

show(describe(foo), allcols = true)

Manipulating Rows:

Subsetting rows in Julia can be a little odd in the beginning, but once you get used to it,
it becomes more logical. For example, suppose we want the rows where x1 is greater than or
equal to its average. We could do this as follows:

## Loading the Statistics package:
using Statistics
## Creating the conditional:
cond01 = foo[:x1] .>= mean(foo[:x1]);
## Subsetting the rows:
foo[cond01, :] 
## 51×10 DataFrame. Omitted printing of 4 columns
## │ Row │ x1       │ x2       │ x3        │ x4       │ x5        │ x6        │
## │     │ Float64  │ Float64  │ Float64   │ Float64  │ Float64   │ Float64   │
## ├─────┼──────────┼──────────┼───────────┼──────────┼───────────┼───────────┤
## │ 1   │ 0.986082 │ 0.33719  │ 0.309992  │ 0.117098 │ 0.792606  │ 0.102682  │
## │ 2   │ 0.73245  │ 0.208294 │ 0.353801  │ 0.448741 │ 0.185897  │ 0.496741  │
## │ 3   │ 0.716057 │ 0.325789 │ 0.193415  │ 0.813209 │ 0.232703  │ 0.314502  │
## │ 4   │ 0.538082 │ 0.932279 │ 0.101212  │ 0.363205 │ 0.979265  │ 0.274936  │
## │ 5   │ 0.693567 │ 0.78976  │ 0.123106  │ 0.566847 │ 0.492958  │ 0.798202  │
## │ 6   │ 0.794447 │ 0.405418 │ 0.0521367 │ 0.587886 │ 0.922298  │ 0.211156  │
## │ 7   │ 0.664186 │ 0.432662 │ 0.0431839 │ 0.810072 │ 0.963643  │ 0.678182  │
## ⋮
## │ 44  │ 0.85821  │ 0.484308 │ 0.899559  │ 0.754818 │ 0.252699  │ 0.0590497 │
## │ 45  │ 0.714495 │ 0.661317 │ 0.954527  │ 0.209581 │ 0.107941  │ 0.233787  │
## │ 46  │ 0.680497 │ 0.101874 │ 0.872371  │ 0.596457 │ 0.669133  │ 0.740674  │
## │ 47  │ 0.909319 │ 0.182776 │ 0.343387  │ 0.142707 │ 0.0140866 │ 0.791679  │
## │ 48  │ 0.642578 │ 0.949993 │ 0.380511  │ 0.96358  │ 0.878766  │ 0.270409  │
## │ 49  │ 0.605148 │ 0.240233 │ 0.144059  │ 0.545245 │ 0.0463105 │ 0.188397  │
## │ 50  │ 0.751681 │ 0.289301 │ 0.488135  │ 0.382877 │ 0.320208  │ 0.999445  │
## │ 51  │ 0.856248 │ 0.577105 │ 0.588476  │ 0.435958 │ 0.0163749 │ 0.337817  │

What if we want two conditionals? For example, we want the same condition as before
and/or the rows where x2 is greater than or equal to its average? Now things
become trickier. Let’s check how we could do this:

## Creating the second conditional:
cond02 = foo[:x2] .>= mean(foo[:x2]);
## Subsetting cond01 AND cond02:
foo[.&(cond01, cond02), :]
## 25×10 DataFrame. Omitted printing of 4 columns
## │ Row │ x1       │ x2       │ x3        │ x4        │ x5        │ x6        │
## │     │ Float64  │ Float64  │ Float64   │ Float64   │ Float64   │ Float64   │
## ├─────┼──────────┼──────────┼───────────┼───────────┼───────────┼───────────┤
## │ 1   │ 0.538082 │ 0.932279 │ 0.101212  │ 0.363205  │ 0.979265  │ 0.274936  │
## │ 2   │ 0.693567 │ 0.78976  │ 0.123106  │ 0.566847  │ 0.492958  │ 0.798202  │
## │ 3   │ 0.567098 │ 0.747233 │ 0.589314  │ 0.0677154 │ 0.630238  │ 0.357654  │
## │ 4   │ 0.976991 │ 0.648552 │ 0.32794   │ 0.36951   │ 0.846276  │ 0.117798  │
## │ 5   │ 0.553247 │ 0.615375 │ 0.122955  │ 0.440636  │ 0.283713  │ 0.734161  │
## │ 6   │ 0.849795 │ 0.703195 │ 0.232944  │ 0.668432  │ 0.686921  │ 0.788872  │
## │ 7   │ 0.530801 │ 0.825475 │ 0.644381  │ 0.15488   │ 0.669306  │ 0.151317  │
## ⋮
## │ 18  │ 0.665926 │ 0.943121 │ 0.438038  │ 0.921251  │ 0.82234   │ 0.761529  │
## │ 19  │ 0.987506 │ 0.946972 │ 0.0462434 │ 0.67867   │ 0.731762  │ 0.482322  │
## │ 20  │ 0.862284 │ 0.886346 │ 0.694874  │ 0.0166389 │ 0.386215  │ 0.527352  │
## │ 21  │ 0.855198 │ 0.650342 │ 0.0321678 │ 0.723076  │ 0.449779  │ 0.0364525 │
## │ 22  │ 0.85821  │ 0.484308 │ 0.899559  │ 0.754818  │ 0.252699  │ 0.0590497 │
## │ 23  │ 0.714495 │ 0.661317 │ 0.954527  │ 0.209581  │ 0.107941  │ 0.233787  │
## │ 24  │ 0.642578 │ 0.949993 │ 0.380511  │ 0.96358   │ 0.878766  │ 0.270409  │
## │ 25  │ 0.856248 │ 0.577105 │ 0.588476  │ 0.435958  │ 0.0163749 │ 0.337817  │
## Subsetting cond01 OR cond02:
foo[.|(cond01, cond02), :]
## 77×10 DataFrame. Omitted printing of 4 columns
## │ Row │ x1        │ x2       │ x3        │ x4       │ x5        │ x6        │
## │     │ Float64   │ Float64  │ Float64   │ Float64  │ Float64   │ Float64   │
## ├─────┼───────────┼──────────┼───────────┼──────────┼───────────┼───────────┤
## │ 1   │ 0.0193136 │ 0.466228 │ 0.790475  │ 0.805074 │ 0.51182   │ 0.201707  │
## │ 2   │ 0.986082  │ 0.33719  │ 0.309992  │ 0.117098 │ 0.792606  │ 0.102682  │
## │ 3   │ 0.73245   │ 0.208294 │ 0.353801  │ 0.448741 │ 0.185897  │ 0.496741  │
## │ 4   │ 0.341692  │ 0.608349 │ 0.946541  │ 0.589161 │ 0.418321  │ 0.295541  │
## │ 5   │ 0.413762  │ 0.644062 │ 0.495503  │ 0.96149  │ 0.249137  │ 0.592854  │
## │ 6   │ 0.129374  │ 0.663032 │ 0.0180212 │ 0.280431 │ 0.887136  │ 0.329406  │
## │ 7   │ 0.716057  │ 0.325789 │ 0.193415  │ 0.813209 │ 0.232703  │ 0.314502  │
## ⋮
## │ 70  │ 0.85821   │ 0.484308 │ 0.899559  │ 0.754818 │ 0.252699  │ 0.0590497 │
## │ 71  │ 0.714495  │ 0.661317 │ 0.954527  │ 0.209581 │ 0.107941  │ 0.233787  │
## │ 72  │ 0.680497  │ 0.101874 │ 0.872371  │ 0.596457 │ 0.669133  │ 0.740674  │
## │ 73  │ 0.909319  │ 0.182776 │ 0.343387  │ 0.142707 │ 0.0140866 │ 0.791679  │
## │ 74  │ 0.642578  │ 0.949993 │ 0.380511  │ 0.96358  │ 0.878766  │ 0.270409  │
## │ 75  │ 0.605148  │ 0.240233 │ 0.144059  │ 0.545245 │ 0.0463105 │ 0.188397  │
## │ 76  │ 0.751681  │ 0.289301 │ 0.488135  │ 0.382877 │ 0.320208  │ 0.999445  │
## │ 77  │ 0.856248  │ 0.577105 │ 0.588476  │ 0.435958 │ 0.0163749 │ 0.337817  │

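In Julia, instead of the syntax condition1 & condition2, which is more common in
other programming languages, we use the broadcast (element-wise) operators
.&(condition1, condition2) or .|(condition1, condition2) to combine several
conditions when filtering. The infix forms condition1 .& condition2 and
condition1 .| condition2 are equivalent.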
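
As a minimal sketch of how these element-wise operators behave, the example below uses two plain vectors with made-up values instead of DataFrame columns (the names x1, x2, both and either are only for illustration). It also shows why inline comparisons should be wrapped in parentheses:

```julia
# Two small vectors with made-up values, standing in for two columns
x1 = [0.9, 0.2, 0.7, 0.4]
x2 = [0.1, 0.8, 0.6, 0.3]

above1 = x1 .>= 0.5          # element-wise comparison gives a Boolean mask
above2 = x2 .>= 0.5

both   = above1 .& above2    # element-wise AND
either = above1 .| above2    # element-wise OR

# When the comparisons are written inline, each one needs its own parentheses,
# because & and | bind more tightly than comparison operators such as >=
both_inline = (x1 .>= 0.5) .& (x2 .>= 0.5)

# A mask can index any array; here it keeps the matching elements of x1
selected = x1[both]
```

The same masks can be used in the row position of a DataFrame, as in foo[cond01, :] above.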

Now, let’s say you have a DataFrame and you want to append rows to it.
There are a couple of ways of doing that. The first one is to use the [data1; data2]
syntax:

## Creating a DataFrame with 3 rows and 5 columns:
x = DataFrame(rand(3, 5));
## Let's add another line using [dataset1; dataset2] syntax:
[ x ; DataFrame(rand(1, 5)) ]
## 4×5 DataFrame
## │ Row │ x1       │ x2        │ x3        │ x4       │ x5        │
## │     │ Float64  │ Float64   │ Float64   │ Float64  │ Float64   │
## ├─────┼──────────┼───────────┼───────────┼──────────┼───────────┤
## │ 1   │ 0.722487 │ 0.0930212 │ 0.146     │ 0.486439 │ 0.0892853 │
## │ 2   │ 0.640469 │ 0.5902    │ 0.667832  │ 0.882527 │ 0.766987  │
## │ 3   │ 0.094589 │ 0.805257  │ 0.291809  │ 0.582878 │ 0.704144  │
## │ 4   │ 0.18066  │ 0.187027  │ 0.0440521 │ 0.077637 │ 0.884914  │

We could get the same result using the vcat() function. According to the
documentation, vcat() performs concatenation along dimension 1, which means
it will concatenate rows. The syntax would be:

## taking the first 2 rows and appending the third one
## (x[3:3, :] keeps the third row as a DataFrame):
vcat(x[1:2, :], x[3:3, :])

Another way to do that is using the function append!(). This function appends
the rows of one DataFrame to the end of another, modifying the first one in place.
Note that the column names must match exactly.

## Column names match
append!(x, DataFrame(rand(1, 5)))
## 4×5 DataFrame
## │ Row │ x1       │ x2        │ x3        │ x4       │ x5        │
## │     │ Float64  │ Float64   │ Float64   │ Float64  │ Float64   │
## ├─────┼──────────┼───────────┼───────────┼──────────┼───────────┤
## │ 1   │ 0.722487 │ 0.0930212 │ 0.146     │ 0.486439 │ 0.0892853 │
## │ 2   │ 0.640469 │ 0.5902    │ 0.667832  │ 0.882527 │ 0.766987  │
## │ 3   │ 0.094589 │ 0.805257  │ 0.291809  │ 0.582878 │ 0.704144  │
## │ 4   │ 0.492341 │ 0.823765  │ 0.0731187 │ 0.123074 │ 0.264452  │

Note that if the column names of the two DataFrames do not match, the append!()
function is going to throw an error. Although this kind of behavior is important
when we want to guard against possible side effects, we might also prefer not to
worry about building a second DataFrame and simply add the values as a new row.
In order to do this we can make use of the push!() function.

## providing an Array:
push!(x, rand(ncol(x)))
## 5×5 DataFrame
## │ Row │ x1       │ x2        │ x3        │ x4       │ x5        │
## │     │ Float64  │ Float64   │ Float64   │ Float64  │ Float64   │
## ├─────┼──────────┼───────────┼───────────┼──────────┼───────────┤
## │ 1   │ 0.722487 │ 0.0930212 │ 0.146     │ 0.486439 │ 0.0892853 │
## │ 2   │ 0.640469 │ 0.5902    │ 0.667832  │ 0.882527 │ 0.766987  │
## │ 3   │ 0.094589 │ 0.805257  │ 0.291809  │ 0.582878 │ 0.704144  │
## │ 4   │ 0.492341 │ 0.823765  │ 0.0731187 │ 0.123074 │ 0.264452  │
## │ 5   │ 0.632829 │ 0.357564  │ 0.09631   │ 0.198201 │ 0.924137  │
## providing a dictionary:
push!(x, Dict(:x1 => rand(),
              :x2 => rand(),
              :x3 => rand(),
              :x4 => rand(),
              :x5 => rand()))
## 6×5 DataFrame
## │ Row │ x1       │ x2        │ x3        │ x4       │ x5        │
## │     │ Float64  │ Float64   │ Float64   │ Float64  │ Float64   │
## ├─────┼──────────┼───────────┼───────────┼──────────┼───────────┤
## │ 1   │ 0.722487 │ 0.0930212 │ 0.146     │ 0.486439 │ 0.0892853 │
## │ 2   │ 0.640469 │ 0.5902    │ 0.667832  │ 0.882527 │ 0.766987  │
## │ 3   │ 0.094589 │ 0.805257  │ 0.291809  │ 0.582878 │ 0.704144  │
## │ 4   │ 0.492341 │ 0.823765  │ 0.0731187 │ 0.123074 │ 0.264452  │
## │ 5   │ 0.632829 │ 0.357564  │ 0.09631   │ 0.198201 │ 0.924137  │
## │ 6   │ 0.234059 │ 0.530488  │ 0.0448796 │ 0.565734 │ 0.262909  │

As we can see, push!() accepts either an array or a dictionary as the row
to append to a DataFrame.
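
The same two verbs exist for ordinary vectors in Base Julia, which may help to remember the difference: push!() adds a single item (one row, in the case of a DataFrame), while append!() adds the contents of another collection (another DataFrame). A small sketch with a plain vector:

```julia
v = [1, 2, 3]

push!(v, 4)          # adds one element to the end of v
append!(v, [5, 6])   # adds every element of another collection to the end of v
```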

So, there are at least 4 methods to add rows to a DataFrame. Which one to use?
Let’s see how fast each function is:

using BenchmarkTools
@btime [x ; DataFrame(rand(1, 5))];
@btime vcat(x, DataFrame(rand(1, 5)));
## note: append!() and push!() modify x in place, so x grows on every benchmark run
@btime append!(x, DataFrame(rand(1, 5)));
@btime push!(x, rand(ncol(x)));

Manipulating Columns:

One of the first things we would want to do when working with a dataset is selecting
some columns. In Julia, the syntax of selecting columns in DataFrames is similar to the one
used in Matlab/Octave. For instance, we can make use of the “:” symbol to represent
that we want all columns (or all rows) and/or a sequence of them:

## Taking all rows of the first 2 columns:
foo[:, 1:2]
## 100×2 DataFrame
## │ Row │ x1         │ x2       │
## │     │ Float64    │ Float64  │
## ├─────┼────────────┼──────────┤
## │ 1   │ 0.0193136  │ 0.466228 │
## │ 2   │ 0.986082   │ 0.33719  │
## │ 3   │ 0.00366356 │ 0.323071 │
## │ 4   │ 0.297846   │ 0.136907 │
## │ 5   │ 0.73245    │ 0.208294 │
## │ 6   │ 0.209719   │ 0.114021 │
## │ 7   │ 0.341692   │ 0.608349 │
## ⋮
## │ 93  │ 0.714495   │ 0.661317 │
## │ 94  │ 0.680497   │ 0.101874 │
## │ 95  │ 0.909319   │ 0.182776 │
## │ 96  │ 0.642578   │ 0.949993 │
## │ 97  │ 0.605148   │ 0.240233 │
## │ 98  │ 0.0907523  │ 0.334278 │
## │ 99  │ 0.751681   │ 0.289301 │
## │ 100 │ 0.856248   │ 0.577105 │
## Taking the first 10 rows of all columns:
foo[1:10, :]
## 10×10 DataFrame. Omitted printing of 4 columns
## │ Row │ x1         │ x2       │ x3        │ x4       │ x5       │ x6       │
## │     │ Float64    │ Float64  │ Float64   │ Float64  │ Float64  │ Float64  │
## ├─────┼────────────┼──────────┼───────────┼──────────┼──────────┼──────────┤
## │ 1   │ 0.0193136  │ 0.466228 │ 0.790475  │ 0.805074 │ 0.51182  │ 0.201707 │
## │ 2   │ 0.986082   │ 0.33719  │ 0.309992  │ 0.117098 │ 0.792606 │ 0.102682 │
## │ 3   │ 0.00366356 │ 0.323071 │ 0.685271  │ 0.596414 │ 0.847368 │ 0.105035 │
## │ 4   │ 0.297846   │ 0.136907 │ 0.726739  │ 0.569452 │ 0.922995 │ 0.846519 │
## │ 5   │ 0.73245    │ 0.208294 │ 0.353801  │ 0.448741 │ 0.185897 │ 0.496741 │
## │ 6   │ 0.209719   │ 0.114021 │ 0.0662264 │ 0.463682 │ 0.628582 │ 0.130653 │
## │ 7   │ 0.341692   │ 0.608349 │ 0.946541  │ 0.589161 │ 0.418321 │ 0.295541 │
## │ 8   │ 0.308353   │ 0.428978 │ 0.914878  │ 0.84873  │ 0.440174 │ 0.310166 │
## │ 9   │ 0.413762   │ 0.644062 │ 0.495503  │ 0.96149  │ 0.249137 │ 0.592854 │
## │ 10  │ 0.129374   │ 0.663032 │ 0.0180212 │ 0.280431 │ 0.887136 │ 0.329406 │

Also, we can select a column by using its name as a symbol or using the “.” operator:

## take the column x1 using "." operator:
foo.x1
## 100-element Array{Float64,1}:
##  0.019313572390828426
##  0.9860824001880526  
##  0.003663562628546835
##  0.2978463233159676  
##  0.7324498468154668  
##  0.2097185474768264  
##  0.34169153867123714 
##  0.30835315833846444 
##  0.41376236563754887 
##  0.1293737178707406  
##  ⋮                   
##  0.8582099391004219  
##  0.7144949034554522  
##  0.6804966837971145  
##  0.9093192833587018  
##  0.6425780404716646  
##  0.6051475800989663  
##  0.09075227070455938 
##  0.7516814773623635  
##  0.8562478916762768
## Take the column using "x1" as a symbol:
foo[:x1]
## 100-element Array{Float64,1}:
##  0.019313572390828426
##  0.9860824001880526  
##  0.003663562628546835
##  0.2978463233159676  
##  0.7324498468154668  
##  0.2097185474768264  
##  0.34169153867123714 
##  0.30835315833846444 
##  0.41376236563754887 
##  0.1293737178707406  
##  ⋮                   
##  0.8582099391004219  
##  0.7144949034554522  
##  0.6804966837971145  
##  0.9093192833587018  
##  0.6425780404716646  
##  0.6051475800989663  
##  0.09075227070455938 
##  0.7516814773623635  
##  0.8562478916762768

Notice that the returned value is an Array. To select one or more columns and return
them as a DataFrame, we use the double brackets syntax:

using DataFramesMeta
## take column x1 as DataFrame
@linq foo[[:x1]] |> first(5)
## 5×1 DataFrame
## │ Row │ x1         │
## │     │ Float64    │
## ├─────┼────────────┤
## │ 1   │ 0.0193136  │
## │ 2   │ 0.986082   │
## │ 3   │ 0.00366356 │
## │ 4   │ 0.297846   │
## │ 5   │ 0.73245    │
## Take columns x1 and x2:
@linq foo[[:x1, :x2]] |> first(5)
## 5×2 DataFrame
## │ Row │ x1         │ x2       │
## │     │ Float64    │ Float64  │
## ├─────┼────────────┼──────────┤
## │ 1   │ 0.0193136  │ 0.466228 │
## │ 2   │ 0.986082   │ 0.33719  │
## │ 3   │ 0.00366356 │ 0.323071 │
## │ 4   │ 0.297846   │ 0.136907 │
## │ 5   │ 0.73245    │ 0.208294 │

There are some new things here. The first() function shows just the first
rows of our dataset; similarly, last() shows the last rows. Also, you may have
noticed the use of the “|>” operator. This is the pipe operator in Julia. If you
are familiar with the R programming language, it works similarly to the “%>%”
operator from the magrittr package, but with some limitations. For example, we
cannot pipe into a specific argument of the subsequent function, which is why we
use @linq from the DataFramesMeta package. For now, just take these commands for
granted. In another post I will show how to use the metaprogramming tools for
DataFrames.
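
To make the limitation concrete, here is a small sketch in plain Julia (no DataFrames involved; the names squares and first_three are only for illustration). The pipe hands its left-hand side to a one-argument function, so a function that needs extra arguments has to be wrapped in an anonymous function:

```julia
# |> passes the value on its left as the single argument of the function on its right
squares = 1:10 |> collect |> (v -> v .^ 2)

# first(v, 3) takes two arguments, so we wrap it in an anonymous function
first_three = squares |> (v -> first(v, 3))
```

Macros such as @linq exist to avoid writing these wrappers by hand.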

Another common task is to add or alter columns in a DataFrame. For example, let’s
create a new column holding a sequence that starts at 1 and increases by 0.5, with
one value per row (so with 100 rows it ends at 50.5):

## To create a sequence, use the function range():
foo[:new_column] = range(1, step = 0.5, length = nrow(foo));
foo[:, :new_column]
## 100-element Array{Float64,1}:
##   1.0
##   1.5
##   2.0
##   2.5
##   3.0
##   3.5
##   4.0
##   4.5
##   5.0
##   5.5
##   ⋮  
##  46.5
##  47.0
##  47.5
##  48.0
##  48.5
##  49.0
##  49.5
##  50.0
##  50.5
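
As a quick check of what range() produces, the sketch below builds the same sequence outside the DataFrame (the names r and r100 are only for illustration). With a fixed length of 100 and a step of 0.5 the sequence stops at 50.5; if the goal were a sequence that runs up to 100, the stop keyword could be used instead, at the cost of no longer matching the number of rows:

```julia
# Same call as above, with the number of rows written out explicitly
r = range(1, step = 0.5, length = 100)

first(r)     # starting value
last(r)      # 1 + 0.5 * 99
length(r)    # one value per row

# Alternative: fix the end point instead of the length
r100 = range(1, stop = 100, step = 0.5)
length(r100) # no longer equal to the number of rows
```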

We can also add a column using the insertcols!() function. Its syntax allows us to
specify the position at which the column is inserted in the DataFrame:

## syntax: insertcols!(dataset, position, column_name => array)
insertcols!(foo, 2, :new_colum2 => range(1, step = 0.5, length = nrow(foo)));
first(foo, 3)
## 3×12 DataFrame. Omitted printing of 6 columns
## │ Row │ x1         │ new_colum2 │ x2       │ x3       │ x4       │ x5       │
## │     │ Float64    │ Float64    │ Float64  │ Float64  │ Float64  │ Float64  │
## ├─────┼────────────┼────────────┼──────────┼──────────┼──────────┼──────────┤
## │ 1   │ 0.0193136  │ 1.0        │ 0.466228 │ 0.790475 │ 0.805074 │ 0.51182  │
## │ 2   │ 0.986082   │ 1.5        │ 0.33719  │ 0.309992 │ 0.117098 │ 0.792606 │
## │ 3   │ 0.00366356 │ 2.0        │ 0.323071 │ 0.685271 │ 0.596414 │ 0.847368 │

Note the use of the “!” in the insertcols!() function. By convention, a trailing “!”
in a function name signals that the function modifies its argument in place
instead of returning a modified copy that would have to be assigned to a variable.
Many other functions follow this convention as well.
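
The convention is not specific to DataFrames. A minimal sketch with Base Julia’s sort() and sort!() (the variable names are only for illustration):

```julia
v = [3, 1, 2]

w = sort(v)               # returns a sorted copy; v itself is untouched
v_after_sort = copy(v)    # snapshot of v at this point, still in the original order

sort!(v)                  # sorts v in place; no assignment needed
```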

Ok… But what if you want to do the opposite, that is, to remove a column?
Well… it is just as easy as adding one. Just use the deletecols!() function:

deletecols!(foo, [:new_column, :new_colum2])
## 100×10 DataFrame. Omitted printing of 4 columns
## │ Row │ x1         │ x2       │ x3        │ x4       │ x5        │ x6       │
## │     │ Float64    │ Float64  │ Float64   │ Float64  │ Float64   │ Float64  │
## ├─────┼────────────┼──────────┼───────────┼──────────┼───────────┼──────────┤
## │ 1   │ 0.0193136  │ 0.466228 │ 0.790475  │ 0.805074 │ 0.51182   │ 0.201707 │
## │ 2   │ 0.986082   │ 0.33719  │ 0.309992  │ 0.117098 │ 0.792606  │ 0.102682 │
## │ 3   │ 0.00366356 │ 0.323071 │ 0.685271  │ 0.596414 │ 0.847368  │ 0.105035 │
## │ 4   │ 0.297846   │ 0.136907 │ 0.726739  │ 0.569452 │ 0.922995  │ 0.846519 │
## │ 5   │ 0.73245    │ 0.208294 │ 0.353801  │ 0.448741 │ 0.185897  │ 0.496741 │
## │ 6   │ 0.209719   │ 0.114021 │ 0.0662264 │ 0.463682 │ 0.628582  │ 0.130653 │
## │ 7   │ 0.341692   │ 0.608349 │ 0.946541  │ 0.589161 │ 0.418321  │ 0.295541 │
## ⋮
## │ 93  │ 0.714495   │ 0.661317 │ 0.954527  │ 0.209581 │ 0.107941  │ 0.233787 │
## │ 94  │ 0.680497   │ 0.101874 │ 0.872371  │ 0.596457 │ 0.669133  │ 0.740674 │
## │ 95  │ 0.909319   │ 0.182776 │ 0.343387  │ 0.142707 │ 0.0140866 │ 0.791679 │
## │ 96  │ 0.642578   │ 0.949993 │ 0.380511  │ 0.96358  │ 0.878766  │ 0.270409 │
## │ 97  │ 0.605148   │ 0.240233 │ 0.144059  │ 0.545245 │ 0.0463105 │ 0.188397 │
## │ 98  │ 0.0907523  │ 0.334278 │ 0.288403  │ 0.519876 │ 0.267965  │ 0.552448 │
## │ 99  │ 0.751681   │ 0.289301 │ 0.488135  │ 0.382877 │ 0.320208  │ 0.999445 │
## │ 100 │ 0.856248   │ 0.577105 │ 0.588476  │ 0.435958 │ 0.0163749 │ 0.337817 │

Now suppose that you do not want to delete a column, but just change its name.
For this task, I am afraid there is a function whose name is very difficult to
remember: rename(). The syntax is as follows:

## rename(dataFrame, :old_name => :new_name)
rename(foo, :x1 => :A1, :x2 => :A2)
## 100×10 DataFrame. Omitted printing of 4 columns
## │ Row │ A1         │ A2       │ x3        │ x4       │ x5        │ x6       │
## │     │ Float64    │ Float64  │ Float64   │ Float64  │ Float64   │ Float64  │
## ├─────┼────────────┼──────────┼───────────┼──────────┼───────────┼──────────┤
## │ 1   │ 0.0193136  │ 0.466228 │ 0.790475  │ 0.805074 │ 0.51182   │ 0.201707 │
## │ 2   │ 0.986082   │ 0.33719  │ 0.309992  │ 0.117098 │ 0.792606  │ 0.102682 │
## │ 3   │ 0.00366356 │ 0.323071 │ 0.685271  │ 0.596414 │ 0.847368  │ 0.105035 │
## │ 4   │ 0.297846   │ 0.136907 │ 0.726739  │ 0.569452 │ 0.922995  │ 0.846519 │
## │ 5   │ 0.73245    │ 0.208294 │ 0.353801  │ 0.448741 │ 0.185897  │ 0.496741 │
## │ 6   │ 0.209719   │ 0.114021 │ 0.0662264 │ 0.463682 │ 0.628582  │ 0.130653 │
## │ 7   │ 0.341692   │ 0.608349 │ 0.946541  │ 0.589161 │ 0.418321  │ 0.295541 │
## ⋮
## │ 93  │ 0.714495   │ 0.661317 │ 0.954527  │ 0.209581 │ 0.107941  │ 0.233787 │
## │ 94  │ 0.680497   │ 0.101874 │ 0.872371  │ 0.596457 │ 0.669133  │ 0.740674 │
## │ 95  │ 0.909319   │ 0.182776 │ 0.343387  │ 0.142707 │ 0.0140866 │ 0.791679 │
## │ 96  │ 0.642578   │ 0.949993 │ 0.380511  │ 0.96358  │ 0.878766  │ 0.270409 │
## │ 97  │ 0.605148   │ 0.240233 │ 0.144059  │ 0.545245 │ 0.0463105 │ 0.188397 │
## │ 98  │ 0.0907523  │ 0.334278 │ 0.288403  │ 0.519876 │ 0.267965  │ 0.552448 │
## │ 99  │ 0.751681   │ 0.289301 │ 0.488135  │ 0.382877 │ 0.320208  │ 0.999445 │
## │ 100 │ 0.856248   │ 0.577105 │ 0.588476  │ 0.435958 │ 0.0163749 │ 0.337817 │

Note that rename() returns a new DataFrame and leaves foo unchanged. To rename
the columns in place, we could use rename!() instead.

Let’s talk about missing values:

Missing values are represented in Julia with the special value missing. When an array
contains missing values, Julia automatically gives it an appropriate union element type:

x = [1.0, 2.0, missing]
## 3-element Array{Union{Missing, Float64},1}:
##  1.0     
##  2.0     
##   missing
typeof(x)
## Array{Union{Missing, Float64},1}
typeof.(x)
## 3-element Array{DataType,1}:
##  Float64
##  Float64
##  Missing

To check if a particular element in an array is missing, we use the ismissing()
function:

ismissing.([1.0, 2.0, missing])
## 3-element BitArray{1}:
##  false
##  false
##   true

It is important to notice that a comparison involving missing produces missing as a result:

missing == missing
## missing

isequal() and === can be used to produce a result of type Bool:

isequal(missing, missing)
## true
missing === missing
## true
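
The propagation also applies to logical operators, which follow three-valued logic: the result is known whenever the missing operand could not change it. The sketch below illustrates this with a few throwaway variable names; coalesce() is one way to turn a possibly-missing comparison into a plain Bool before using it in an if statement, which does not accept missing:

```julia
# Comparisons with missing propagate missing instead of returning a Bool
cmp = missing == 1

# Three-valued logic
a = true | missing      # true, whatever the missing value would be
b = false & missing     # false, whatever the missing value would be
c = true & missing      # missing, because the result depends on the unknown value

# coalesce() returns its first non-missing argument
d = coalesce(missing == missing, false)
```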

Other functions are available to work with missing values. For instance, if
we want an array with only the non-missing values, we can use the skipmissing() function:

x |> skipmissing |> collect
## 2-element Array{Float64,1}:
##  1.0
##  2.0

Here, we use the collect() function because skipmissing() returns an iterator rather than an array.
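
Because the result is an iterator, it can be passed directly to reductions such as sum() or mean() without collecting it first. A short sketch (the names s_raw, s and m are only for illustration):

```julia
using Statistics

x = [1.0, 2.0, missing]

s_raw = sum(x)                # missing propagates through the sum
s = sum(skipmissing(x))       # sum of the non-missing values only
m = mean(skipmissing(x))      # mean of the non-missing values only
```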

To replace the missing values with some other value we can use the
Missings.replace() function. For example, suppose we want to replace the missing
values with NaN:

Missings.replace(x, NaN) |> collect
## 3-element Array{Float64,1}:
##    1.0
##    2.0
##  NaN

We can also perform the same operation in other ways:

## Using coalesce() function:
coalesce.(x, NaN)
## 3-element Array{Float64,1}:
##    1.0
##    2.0
##  NaN
## Using recode() function:
recode(x, missing => NaN)
## 3-element Array{Float64,1}:
##    1.0
##    2.0
##  NaN

Until now, we have only talked about missing values in arrays. But what about missing
values in DataFrames? To start, let’s create a DataFrame with some missing values:

x = DataFrame(A = [1, missing, 3, 4], B = ["A", "B", missing, "C"])
## 4×2 DataFrame
## │ Row │ A       │ B       │
## │     │ Int64⍰  │ String⍰ │
## ├─────┼─────────┼─────────┤
## │ 1   │ 1       │ A       │
## │ 2   │ missing │ B       │
## │ 3   │ 3       │ missing │
## │ 4   │ 4       │ C       │

For some analysis, we would want only the rows with non-missing values. One way
to achieve this is making use of the completecases() function:

x[completecases(x), :]
## 2×2 DataFrame
## │ Row │ A      │ B       │
## │     │ Int64⍰ │ String⍰ │
## ├─────┼────────┼─────────┤
## │ 1   │ 1      │ A       │
## │ 2   │ 4      │ C       │

The completecases() function returns a Boolean array with the value true for
rows that contain no missing values and false otherwise. For those who are familiar
with R, this is the same behavior as the complete.cases() function from the stats package.
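
To see what such a mask looks like, the sketch below rebuilds it by hand from the two columns of x, written here as plain vectors A and B. This only illustrates the idea and is not how completecases() is implemented:

```julia
A = [1, missing, 3, 4]
B = ["A", "B", missing, "C"]

# A row is complete when neither column is missing in that row
complete = .!(ismissing.(A) .| ismissing.(B))

# Keep only the complete rows of each column
A_complete = A[complete]
B_complete = B[complete]
```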

Another option to return the rows with non-missing values of a DataFrame in Julia
is to use the dropmissing() function:

dropmissing(x)
## 2×2 DataFrame
## │ Row │ A     │ B      │
## │     │ Int64 │ String │
## ├─────┼───────┼────────┤
## │ 1   │ 1     │ A      │
## │ 2   │ 4     │ C      │

Again, for R users, this is the same behavior as the na.omit() function.

Merging DataFrames:

Often, we need to combine two or more DataFrames together based on some common
column(s) among them. For example, suppose we have two DataFrames:

df1 = DataFrame(x = 1:3, y = 4:6)
## 3×2 DataFrame
## │ Row │ x     │ y     │
## │     │ Int64 │ Int64 │
## ├─────┼───────┼───────┤
## │ 1   │ 1     │ 4     │
## │ 2   │ 2     │ 5     │
## │ 3   │ 3     │ 6     │
df2 = DataFrame(x = 1:3, z = 'd':'f', new = 11:13)
## 3×3 DataFrame
## │ Row │ x     │ z    │ new   │
## │     │ Int64 │ Char │ Int64 │
## ├─────┼───────┼──────┼───────┤
## │ 1   │ 1     │ 'd'  │ 11    │
## │ 2   │ 2     │ 'e'  │ 12    │
## │ 3   │ 3     │ 'f'  │ 13    │

which have the column x in common. To merge these two tables, we use the
join() function:

join(df1, df2, on = :x)
## 3×4 DataFrame
## │ Row │ x     │ y     │ z    │ new   │
## │     │ Int64 │ Int64 │ Char │ Int64 │
## ├─────┼───────┼───────┼──────┼───────┤
## │ 1   │ 1     │ 4     │ 'd'  │ 11    │
## │ 2   │ 2     │ 5     │ 'e'  │ 12    │
## │ 3   │ 3     │ 6     │ 'f'  │ 13    │

That’s it!! We have merged our DataFrames. But that’s only the default behavior of
the function (an inner join). There is more to explore. Essentially, join() takes 4 arguments:

  • DataFrame 1
  • DataFrame 2
  • on = the column(s) to be the key in merging;
  • kind = type of the merge (left, right, inner, outer, …)

The kind argument specifies the type of join we are interested in performing.
The definition of each one is as follows:

  • Inner: The output contains rows for values of the key that exist
    in BOTH the first (left) and second (right) arguments to
    join;

  • Left: The output contains rows for values of the key that exist in
    the first (left) argument to join, whether or not that value
    exists in the second (right) argument;

  • Right: The output contains rows for values of the key that exist in
    the second (right) argument to join, whether or not that
    value exists in the first (left) argument;

  • Outer: The output contains rows for values of the key that exist in
    the first (left) OR second (right) argument to join;

and here are the “strange” ones:

  • Semi: Like an inner join, but output is restricted to columns from
    the first (left) argument to join;

  • Anti: The output contains rows for values of the key that exist in
    the first (left) but NOT in the second (right) argument to
    join. As with semi joins, output is restricted to columns
    from the first (left) argument.

If you are familiar with SQL or with the join functions from the dplyr package in R,
these are the same concepts.

To illustrate how the different kinds of joins work, let’s create two more
DataFrames:

Names = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
## 2×2 DataFrame
## │ Row │ ID    │ Name     │
## │     │ Int64 │ String   │
## ├─────┼───────┼──────────┤
## │ 1   │ 20    │ John Doe │
## │ 2   │ 40    │ Jane Doe │
jobs = DataFrame(ID = [20, 60], Job = ["Lawyer", "Astronaut"])
## 2×2 DataFrame
## │ Row │ ID    │ Job       │
## │     │ Int64 │ String    │
## ├─────┼───────┼───────────┤
## │ 1   │ 20    │ Lawyer    │
## │ 2   │ 60    │ Astronaut │

In the Names and jobs DataFrames, the ID column is the key used to perform the
join. But notice that the two DataFrames do not contain the same ID values. Now
let’s perform the joins:

join(Names, jobs, on = :ID, kind = :inner)
## 1×3 DataFrame
## │ Row │ ID    │ Name     │ Job    │
## │     │ Int64 │ String   │ String │
## ├─────┼───────┼──────────┼────────┤
## │ 1   │ 20    │ John Doe │ Lawyer │
join(Names, jobs, on = :ID, kind = :left)
## 2×3 DataFrame
## │ Row │ ID    │ Name     │ Job     │
## │     │ Int64 │ String   │ String⍰ │
## ├─────┼───────┼──────────┼─────────┤
## │ 1   │ 20    │ John Doe │ Lawyer  │
## │ 2   │ 40    │ Jane Doe │ missing │
join(Names, jobs, on = :ID, kind = :right)
## 2×3 DataFrame
## │ Row │ ID    │ Name     │ Job       │
## │     │ Int64 │ String⍰  │ String    │
## ├─────┼───────┼──────────┼───────────┤
## │ 1   │ 20    │ John Doe │ Lawyer    │
## │ 2   │ 60    │ missing  │ Astronaut │
join(Names, jobs, on = :ID, kind = :outer)
## 3×3 DataFrame
## │ Row │ ID    │ Name     │ Job       │
## │     │ Int64 │ String⍰  │ String⍰   │
## ├─────┼───────┼──────────┼───────────┤
## │ 1   │ 20    │ John Doe │ Lawyer    │
## │ 2   │ 40    │ Jane Doe │ missing   │
## │ 3   │ 60    │ missing  │ Astronaut │

Semi and anti joins are less common. A semi join returns the rows
from the left which DO MATCH an ID from the right:

join(Names, jobs, on = :ID, kind = :semi)
## 1×2 DataFrame
## │ Row │ ID    │ Name     │
## │     │ Int64 │ String   │
## ├─────┼───────┼──────────┤
## │ 1   │ 20    │ John Doe │

An anti join returns the rows from the left which DO NOT MATCH any
ID from the right:

join(Names, jobs, on = :ID, kind = :anti)
## 1×2 DataFrame
## │ Row │ ID    │ Name     │
## │     │ Int64 │ String   │
## ├─────┼───────┼──────────┤
## │ 1   │ 40    │ Jane Doe │
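
One way to remember which rows each kind of join keeps is to look only at the key values and think in terms of set operations. The sketch below does this for the ID columns of Names and jobs using Base Julia. It assumes the keys are unique within each table and says nothing about the other columns, so it is a mental model rather than a description of how join() works internally:

```julia
names_ids = [20, 40]   # ID column of Names (left table)
jobs_ids  = [20, 60]   # ID column of jobs (right table)

inner_keys = intersect(names_ids, jobs_ids)   # keys present in both tables
outer_keys = union(names_ids, jobs_ids)       # keys present in either table
left_keys  = names_ids                        # every key of the left table
right_keys = jobs_ids                         # every key of the right table
semi_keys  = intersect(names_ids, jobs_ids)   # left keys that have a match
anti_keys  = setdiff(names_ids, jobs_ids)     # left keys that have no match
```

These key sets line up with the ID columns printed in the outputs above.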

Split-Apply-Combine:

Some common tasks involve splitting the data into groups, applying some function
to each of these groups and gathering the results to analyze later on. This is the
split-apply-combine strategy described in the paper “The Split-Apply-Combine
Strategy for Data Analysis” by Hadley Wickham, creator of many R packages,
including ggplot2 and dplyr.

The DataFrames package in Julia supports the Split-Apply-Combine strategy
through the by() function, which takes three arguments:

  • DataFrame;
  • one or more column names to split on;
  • a function or expression to apply to each subset;

To illustrate its usage, let’s make use of the RDatasets package, which gives
access to some preloaded well known datasets from R packages.

using RDatasets
foo = dataset("datasets", "iris");
first(foo, 5)
## 5×5 DataFrame
## │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species      │
## │     │ Float64     │ Float64    │ Float64     │ Float64    │ Categorical… │
## ├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────────┤
## │ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa       │
## │ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa       │
## │ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa       │
## │ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ setosa       │
## │ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ setosa       │

A simple task is to find how many rows of each “Species” there are in the
dataset. One way to do this is to apply the Split-Apply-Combine strategy: split
the data by the Species column, apply the nrow() function to each
resulting subset, and combine the results:

## Syntax: by(dataset, :name_column_to_split, name_function)
by(foo, :Species, nrow)
## 3×2 DataFrame
## │ Row │ Species      │ x1    │
## │     │ Categorical… │ Int64 │
## ├─────┼──────────────┼───────┤
## │ 1   │ setosa       │ 50    │
## │ 2   │ versicolor   │ 50    │
## │ 3   │ virginica    │ 50    │

We can also make use of an anonymous function:

by(foo, :Species, x -> DataFrame(N = nrow(x)))
## 3×2 DataFrame
## │ Row │ Species      │ N     │
## │     │ Categorical… │ Int64 │
## ├─────┼──────────────┼───────┤
## │ 1   │ setosa       │ 50    │
## │ 2   │ versicolor   │ 50    │
## │ 3   │ virginica    │ 50    │

One of the advantages of using an anonymous function inside the by() function is
that we can name the output columns and apply as many functions as we want:

using Statistics  ## provides mean() and std()
## Applying the count, mean and standard deviation functions:
by(foo, :Species, x -> DataFrame(N = nrow(x),
                                 avg_PetalLength = mean(x.PetalLength),
                                 std_PetalWidth = std(x.PetalWidth)))
## 3×4 DataFrame
## │ Row │ Species      │ N     │ avg_PetalLength │ std_PetalWidth │
## │     │ Categorical… │ Int64 │ Float64         │ Float64        │
## ├─────┼──────────────┼───────┼─────────────────┼────────────────┤
## │ 1   │ setosa       │ 50    │ 1.462           │ 0.105386       │
## │ 2   │ versicolor   │ 50    │ 4.26            │ 0.197753       │
## │ 3   │ virginica    │ 50    │ 5.552           │ 0.27465        │
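To make the three steps explicit, here is a minimal sketch of split-apply-combine written by hand in plain Julia, using only the Statistics standard library. It is not how by() works internally. The species and petal vectors are made-up sample values, not the real iris data:

```julia
using Statistics  # standard library; provides mean and std

# Made-up sample: group labels and one measurement per row (not the iris values)
species = ["a", "a", "b", "b", "b"]
petal   = [1.0, 3.0, 2.0, 4.0, 6.0]

# Split: collect the values belonging to each group
groups = Dict{String, Vector{Float64}}()
for (s, p) in zip(species, petal)
    push!(get!(groups, s, Float64[]), p)
end

# Apply + combine: compute the statistics per group and gather one row per group
result = [(Species = s, N = length(v), avg = mean(v), sd = std(v))
          for (s, v) in sort(collect(groups); by = first)]

foreach(println, result)
```

Each element of result plays the role of one row in the output of by(): the group key followed by the values computed for that group.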

Another way to use the Split-Apply-Combine strategy is the
aggregate() function, which also takes three arguments:

  • DataFrame;
  • one or more column names to split on;
  • one or more functions to be applied ON THE COLUMNS NOT USED TO SPLIT.

The difference between the by() and aggregate() functions is that in the
latter, the function(s) will be applied to every column not used in
the split part.

For instance, let’s say you want the average of each column for each Species.
Instead of using by() with an anonymous function and writing the name of every column,
we can do:

aggregate(foo, :Species, [mean])
## 3×5 DataFrame. Omitted printing of 1 columns
## │ Row │ Species      │ SepalLength_mean │ SepalWidth_mean │ PetalLength_mean │
## │     │ Categorical… │ Float64          │ Float64         │ Float64          │
## ├─────┼──────────────┼──────────────────┼─────────────────┼──────────────────┤
## │ 1   │ setosa       │ 5.006            │ 3.428           │ 1.462            │
## │ 2   │ versicolor   │ 5.936            │ 2.77            │ 4.26             │
## │ 3   │ virginica    │ 6.588            │ 2.974           │ 5.552            │

Note that Julia only displays the output that fits the screen. Pay
attention to the message “Omitted printing of 1 columns”. To
see all the columns, use the show() function as advised before.
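A note for readers on a more recent DataFrames release: to my knowledge, by() and aggregate() are no longer available from version 1.0 onwards, and the same results are obtained with groupby() followed by combine(). Assuming such a version is installed, the sketch below shows rough counterparts of the two calls used in this post. The df table and its values are made up for illustration:

```julia
using DataFrames, Statistics

# Made-up data, not the iris dataset
df = DataFrame(Species = ["a", "a", "b", "b"],
               Length  = [1.0, 3.0, 2.0, 4.0],
               Width   = [0.5, 1.5, 1.0, 3.0])

# Split the table by the Species column
gdf = groupby(df, :Species)

# Counterpart of by(df, :Species, nrow): count the rows of each group
counts = combine(gdf, nrow => :N)

# Counterpart of aggregate(df, :Species, [mean]):
# apply mean to every column not used for grouping
means = combine(gdf, names(df, Not(:Species)) .=> mean)

println(counts)
println(means)
```

The second call produces columns named Length_mean and Width_mean, following the same naming pattern as the aggregate() output above.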

Reading and Writing CSV files:

Last but not least, let’s see how to read and write CSV files in Julia.
Although this is not exactly handled by the DataFrames package, the task of
reading/writing CSV files is so natural when working with DataFrames that I will
show you the basics.

To read/write CSV files, we use the CSV package. To demonstrate its usage,
let’s work with the iris dataset: we write it to a CSV file on the local disk
and then read it back.

So, first we are going to write the foo object (which contains the iris dataset)
to a CSV file. To do this we will use the CSV.write() function. Some useful
arguments in CSV.write are:

  • delim : the file’s delimiter. Default ‘,’;
  • header : boolean indicating whether to write the colnames from the source;
  • colnames : column names to be written;
  • append : boolean indicating whether to append the data to an existing file;
  • missingstring : string that indicates how missing values will be represented.

using CSV
CSV.write("iris.csv", foo, missingstring = "NA")

To read a CSV file, we use the CSV.read() function. Some useful arguments are:

  • delim : a Char or String that indicates how columns are delimited in the file. Default ‘,’;
  • decimal : a Char indicating how decimals are separated in
    floats. Default ‘.’ ;
  • limit : indicates the total number of rows to read;
  • header : provide manually the names of the columns;
  • types : a Vector or Dict of types to be used for column types.

iris = CSV.read("iris.csv")

It is important to note that when loading in a DataFrame from a CSV, all columns
allow Missing by default.
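As a round-trip sketch, the snippet below writes a small made-up table and reads it back. It assumes a recent CSV.jl release, where CSV.read() expects the target type (here DataFrame) as its second argument. With the older version used in this post, the one-argument call shown above is enough. The df table and the tmp_path file are hypothetical:

```julia
using CSV, DataFrames

# Made-up table with one missing value
df = DataFrame(ID = [1, 2, 3], Name = ["a", missing, "c"])

# Write to a file in a temporary directory, encoding missing values as "NA"
tmp_path = joinpath(mktempdir(), "example.csv")
CSV.write(tmp_path, df; missingstring = "NA")

# Read it back, telling the parser that "NA" stands for a missing value
back = CSV.read(tmp_path, DataFrame; missingstring = "NA")

println(back)
```

Passing the same missingstring on both sides matters: without it on the reading side, "NA" would come back as ordinary text instead of missing.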

These are the basics of reading/writing CSV files in Julia. For more details,
refer to the official documentation.

Conclusion:

This post was a very small introduction to the DataFrames package in Julia.
After reading this post you will be able to read CSV datasets and perform some
tasks with the data at hand.

In the following posts, we will explore more advanced tricks to perform data
wrangling and exploratory data analysis. At each step we are going to build
knowledge to completely use Julia to perform data analysis for any problem that
you might face.