Using DataFrames and PyPlot in Julia

By: Jaafar Ballout

Re-posted from: https://www.supplychaindataanalytics.com/using-dataframes-and-pyplot-in-julia/

Julia offers many packages that are useful in operations research and optimization in general. This article is a brief introduction to some of the packages available in Julia’s ecosystem. I will focus on two that are stepping stones for future work: DataFrames.jl and PyPlot.jl. DataFrames.jl provides a set of tools for working with tabular data, similar to pandas in Python. PyPlot.jl provides a Julia interface to the Matplotlib plotting library from Python. PyPlot therefore needs a Python installation with Matplotlib, but Julia can install a private distribution for this purpose that can’t be accessed outside Julia’s environment.

Adding the necessary packages

Open a new terminal window and run Julia. Use the code below to add the packages required for this article.

julia> using Pkg
julia> Pkg.add("DataFrames")
julia> Pkg.add("CSV")
julia> Pkg.add("Arrow")

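Before installing PyPlot, set the PYTHON environment variable to an empty string in the Julia session. This tells PyCall, the package PyPlot uses to talk to Python, to install and use its own private Python distribution instead of one already on the system. If PyCall was installed earlier, it has to be rebuilt with Pkg.build("PyCall") for the setting to take effect.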

julia> ENV["PYTHON"] = ""
""

Install PyPlot:

julia> using Pkg
julia> Pkg.add("PyPlot")

After adding all the required packages in the Julia REPL, the following code imports them in a Jupyter notebook.

using DataFrames
using CSV
using Arrow
using PyPlot

Using DataFrames in Julia

DataFrames.jl itself works on tables that are already in memory. Reading and writing files is handled by companion packages such as CSV.jl and Arrow.jl, which is why these packages were added in the previous section.
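As a minimal sketch of what DataFrames.jl offers on its own, the block below builds a small table in memory and selects rows and columns from it. The column names `day` and `cases` and their values are made up for illustration and do not come from the dataset used later.

```julia
using DataFrames

# Hypothetical data, for illustration only
demo = DataFrame(day = 1:4, cases = [1, 2, 4, 8])

first_two = demo[1:2, :]             # rows 1-2, all columns (a new DataFrame)
cases_col = demo[:, :cases]          # a copy of the `cases` column as a Vector
cases_ref = demo.cases               # the same column without copying
big = filter(:cases => >(2), demo)   # rows where cases > 2

println(size(demo))    # (4, 2)
println(names(demo))   # ["day", "cases"]
```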

Using PyPlot in Julia

Because PyPlot is a Julia interface to Matplotlib, the documentation on Matplotlib’s main page applies to it as well, and most Python examples carry over with minor syntax changes.

Covid-19 showcase using Julia

The first covid-19 case was detected on the 17th of November, 2019. Although two years have passed since the pandemic started, the virus persists and cases are still increasing exponentially. Data analysis is therefore important for understanding the growth of cases around the world. I will be using a .csv file containing data about the cases in country X. The file is available in a public repository on my GitHub page, so I can copy it into an Excel sheet and save it in the directory where the Jupyter notebook is located. Alternatively, I can import the data from the web in Excel using the link to the raw file on GitHub. In both cases, the file must be saved as time.csv to match the code in later stages.

Here, I am showing the database, in csv format, that I opened via excel on my desktop.

I will read the csv file, in the Jupyter notebook, using the code block below:

df = CSV.File("time.csv") |> DataFrame; # read the csv file with the CSV package and convert it to a DataFrame using the pipe operator |>
df[1:5,:] # output the first five rows
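To try the same reading step without the file, the sketch below parses a tiny CSV held in a string. The column names (`date`, `test`, `negative`, `confirmed`) and the numbers are placeholders, not the contents of time.csv. `CSV.read(source, DataFrame)` is an equivalent way of writing `CSV.File(source) |> DataFrame`, and `first(demo, n)` is an alternative to `demo[1:n, :]`.

```julia
using CSV, DataFrames

# Placeholder CSV text, for illustration only
csv_text = """
date,test,negative,confirmed
2020-01-20,1,0,1
2020-01-21,1,0,1
2020-01-22,4,3,1
"""

# Parse the text and materialize it directly as a DataFrame
demo = CSV.read(IOBuffer(csv_text), DataFrame)

first(demo, 2)   # first two rows
names(demo)      # column names taken from the header line
```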

Importing data files is straightforward with CSV and DataFrames. Some of the operations in the code blocks below are explained in a previous post introducing Julia. Plotting is another important tool for understanding and visualizing data. Matplotlib was introduced earlier on this blog, but in Python.

The code block below creates a bar chart showing the number of cumulative tests and cumulative negative cases over a period of six days from the DataFrame, df, imported above.

y1 = df[20:25,2]; 
y2 = df[20:25,3];
x = df[20:25,1];

fig = plt.figure() 
ax = plt.subplot() 
ax.bar(x, y1, label="cumulative tests",color="black")
ax.bar(x, y2, label="cumulative negative cases",color="grey")
ax.set_title("Covid-19 data in Country X",fontsize=18,color="green")
ax.set_xlabel("date",fontsize=14,color="red")
ax.set_ylabel("number of cases",fontsize=14,color="red")
ax.legend(fontsize=10)
ax.grid(true,color="blue",alpha=0.1) # positional form; newer Matplotlib versions no longer accept the b= keyword
plt.show()

The bar chart above could be improved by allocating a bar for each category. This is done using the code block below.

barWidth = 0.25
br1 = 1:1:length(x)
br2 = [x + barWidth for x in br1]

fig = plt.figure()
ax = plt.subplot() 
ax.bar(br1, y1,color="r", width=barWidth, edgecolor ="grey",label="cumulative tests")
ax.bar(br2, y2,color ="g",width=barWidth, edgecolor ="grey", label="cumulative negative cases")
ax.set_title("Covid-19 data in Country X",fontsize=18,color="green")
ax.set_xlabel("date", fontweight ="bold",fontsize=14)
ax.set_ylabel("number of cases", fontweight ="bold",fontsize=14)
ax.legend(fontsize=10)
ax.grid(true,color="blue",alpha=0.1)
plt.xticks([r + barWidth for r in br1],  ["2/8/2020", "2/9/2020", "2/10/2020", "2/11/2020", "2/12/2020","2/13/2020"])
plt.legend()
plt.show()
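The positioning arithmetic behind the grouped bars can be checked without plotting. Matplotlib's `bar` centers each bar on its x position by default, so with two bars per group the midpoint of a group sits half a bar width to the right of the first bar. The sketch below uses only base Julia; `n_groups` and `centers` are names introduced for illustration.

```julia
barWidth = 0.25
n_groups = 6                     # six dates, as in the chart above

br1 = collect(1:n_groups)        # centers of the first bar in each group
br2 = br1 .+ barWidth            # centers of the second bar, shifted right
centers = br1 .+ barWidth / 2    # midpoint between the two bars of a group

println(br2)      # [1.25, 2.25, 3.25, 4.25, 5.25, 6.25]
println(centers)  # [1.125, 2.125, 3.125, 4.125, 5.125, 6.125]
```

Passing `centers` as the tick positions to `plt.xticks` would place each date label under the middle of its group rather than under the second bar.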

Understanding the exponential growth associated with covid-19 requires plotting the whole dataset, which spans 44 days. The next code block therefore shows the confirmed cases for the complete dataset.

days = 1:1:44
days_array = collect(days)
confirmed_cases = df[:,4]

fig = plt.figure(figsize=(10,10))
ax = plt.subplot()
ax.bar(days_array,confirmed_cases , color ="blue",width = 0.4)
ax.set_title("A bar chart showing cumulative confirmed covid-19 cases in country X",fontsize=22,color="darkgreen")
ax.set_xlabel("day number",fontsize=16,color="darkgreen")
ax.set_ylabel("number of cases",fontsize=16,color="darkgreen")
ax.xaxis.set_ticks_position("none")
ax.yaxis.set_ticks_position("none")
ax.xaxis.set_tick_params(pad = 5)
ax.yaxis.set_tick_params(pad = 10)
ax.grid(true, color ="grey",linestyle ="-.", linewidth = 0.5,alpha = 0.2)
fig.text(0.3, 0.8, "SCDA-JaafarBallout", fontsize = 10,color ="grey", ha ="right", va ="bottom",alpha = 0.6)
plt.show()

Bar charts work well in our case, but it is still nice to see the growth as a curve. The code block below plots the growth curve of confirmed cases.

fig = plt.figure(figsize=(10,10))
ax = plt.subplot()
ax.plot(days_array,confirmed_cases , color ="blue",marker="o",markersize=6, linewidth=2, linestyle ="--")
ax.set_title("A line plot showing cumulative confirmed covid-19 cases in country X",fontsize=22,color="darkgreen")
ax.set_xlabel("day number",fontsize=16,color="darkgreen")
ax.set_ylabel("number of cases",fontsize=16,color="darkgreen")
ax.grid(true, color ="grey",linestyle ="-.", linewidth = 0.5,alpha = 0.2)
fig.text(0.3, 0.8, "SCDA-JaafarBallout", fontsize = 10,color ="grey", ha ="right", va ="bottom",alpha = 0.6)
plt.xticks(size=16, color ="black")
plt.yticks(size=16, color ="black")
plt.show()
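One way to check whether a cumulative series really grows exponentially is to look at its logarithm, which is a straight line for exact exponential growth. The sketch below fits that line by least squares on synthetic data (not the country X dataset); `rate` and `doubling_time` are illustrative names.

```julia
using LinearAlgebra

# Synthetic cumulative counts growing by a fixed rate per day (illustration only)
days = collect(1:10)
cases = 5 .* exp.(0.2 .* days)

# Least-squares fit of log(cases) = intercept + rate * day
A = [ones(length(days)) days]
intercept, rate = A \ log.(cases)

doubling_time = log(2) / rate   # days needed for the count to double

println(round(rate, digits = 3))           # 0.2
println(round(doubling_time, digits = 2))  # 3.47
```

On real data the points will not lie exactly on a line, so the fitted rate is only a rough summary of the growth over the chosen period.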

Realizing a meme! (with Julia)

Recently, this meme went viral, so it is fun to figure out the missing functions by plotting them.

fig, axs = plt.subplots(2, 2);
fig.tight_layout(pad=4);

x1 = range(-10, 10, length=1000);
y1 = ones(length(x1)); # constant function y = 1

axs[1].plot(x1, y1, color="blue", linewidth=2.0, linestyle="-");
axs[1].set_xlabel("x1");
axs[1].set_ylabel("y1");
axs[1].set_title("Constant");

x2 = range(-10, 10, length=1000);
y2 = x2.^3;

axs[2].plot(x2, y2, color="blue", linewidth=2.0, linestyle="-");
axs[2].set_xlabel("x2");
axs[2].set_ylabel("y2");
axs[2].set_title("Y = X^(3)");

x3 = range(-10, 10, length=1000);
y3 = x3.^2;

axs[3].plot(x3, y3, color="blue", linewidth=2.0, linestyle="-");
axs[3].set_xlabel("x3");
axs[3].set_ylabel("y3");
axs[3].set_title("Y = X^(2)");

x4 = range(-5, 5, length=1000);
y4 = cos.(x4);

axs[4].plot(x4, y4, color="blue", linewidth=2.0, linestyle="-");
axs[4].set_xlabel("x4");
axs[4].set_ylabel("y4");
axs[4].set_title("Y = cos(X)");
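A note on `axs[1]` to `axs[4]`: I am assuming here that `plt.subplots(2, 2)` hands back the axes as a 2×2 Julia array, and a single index walks a Julia matrix column by column. Under that assumption, the base-Julia sketch below uses a matrix of labels as a stand-in for `axs` to show which panel each index addresses.

```julia
# Stand-in for the 2×2 array of axes; labels mark the position in the figure
panels = ["top-left" "top-right";
          "bottom-left" "bottom-right"]

for i in 1:4
    println("axs[$i] -> ", panels[i])
end
# axs[1] -> top-left
# axs[2] -> bottom-left
# axs[3] -> top-right
# axs[4] -> bottom-right
```

So the cubic (`axs[2]`) lands below the constant, not beside it. Row and column indexing such as `axs[1, 2]` makes the placement explicit.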

Graphing networks by hand or with traditional software is tedious. In future posts, I will demonstrate how to draw complicated networks using Julia and its packages. Then, I will solve the network problem using the JuMP package.

The post Using DataFrames and PyPlot in Julia appeared first on Supply Chain Data Analytics.

New features in DataFrames.jl 1.3: conclusion

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/01/07/release13.html

Introduction

This is the last post from the series introducing features added in DataFrames.jl 1.3. There are many changes I have not covered yet. I have selected those
that I think are most relevant in typical data wrangling workflows.

The topics I plan to discuss are:

  • ordering of groups in groupby;
  • unstack now supports fill keyword argument;
  • deprecations in deleting rows and sorting API.

The post was written under Julia 1.7.0, DataFrames.jl 1.3.1,
Chain.jl 0.4.10, and FreqTables.jl 0.4.5.

Ordering of groups in groupby

Let me start with highlighting that GroupedDataFrame objects produced by the
groupby function are indexable. This means that you can flexibly subset groups
or re-order them. Here is an example:

julia> using DataFrames

julia> df = DataFrame(a=[1,1,2,2,2,3])
6×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     1
   3 │     2
   4 │     2
   5 │     2
   6 │     3

julia> gdf = groupby(df, :a, sort=true)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = 1
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     1
⋮
Last Group (1 row): a = 3
 Row │ a
     │ Int64
─────┼───────
   1 │     3

julia> gdf[[3, 1]]
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = 3
 Row │ a
     │ Int64
─────┼───────
   1 │     3
⋮
Last Group (2 rows): a = 1
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     1

Here the gdf[[3, 1]] operation picked two groups from gdf, putting the group
with original index 3 first and the group with original index 1 next.

This feature is often useful and gives a lot of flexibility to the users. Here
is an example showing how you can sort groups based on non-key column values:

julia> df = DataFrame(a=[1,1,2,2,2,3], x=6:-1:1)
6×2 DataFrame
 Row │ a      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      6
   2 │     1      5
   3 │     2      4
   4 │     2      3
   5 │     2      2
   6 │     3      1

julia> gdf = groupby(df, :a, sort=true)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = 1
 Row │ a      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      6
   2 │     1      5
⋮
Last Group (1 row): a = 3
 Row │ a      x
     │ Int64  Int64
─────┼──────────────
   1 │     3      1

julia> gdf[sortperm([sum(sdf.x) for sdf in gdf])]
GroupedDataFrame with 3 groups based on key: a
First Group (1 row): a = 3
 Row │ a      x
     │ Int64  Int64
─────┼──────────────
   1 │     3      1
⋮
Last Group (2 rows): a = 1
 Row │ a      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      6
   2 │     1      5

However, this means that one should be careful when considering the ordering
of groups in a GroupedDataFrame. For this reason, apart from integer indexing,
GroupedDataFrame also supports indexing using values of grouping columns
(in the example I show Tuple indexing, but NamedTuple and dictionary
indexing are also supported):

julia> df = DataFrame(name=["Alice", "Bob"])
2×1 DataFrame
 Row │ name
     │ String
─────┼────────
   1 │ Alice
   2 │ Bob

julia> gdf = groupby(df, :name, sort=true)
GroupedDataFrame with 2 groups based on key: name
First Group (1 row): name = "Alice"
 Row │ name
     │ String
─────┼────────
   1 │ Alice
⋮
Last Group (1 row): name = "Bob"
 Row │ name
     │ String
─────┼────────
   1 │ Bob

julia> gdf[("Bob",)]
1×1 SubDataFrame
 Row │ name
     │ String
─────┼────────
   1 │ Bob

or you can use a special GroupKey object that is produced by the keys
function (this option is fastest):

julia> keys(gdf)
2-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (name = "Alice",)
 GroupKey: (name = "Bob",)
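For completeness, the other lookup styles mentioned above can be sketched as follows. This reuses the Alice/Bob data; the variable names `by_namedtuple`, `by_dict`, and `by_key` are only for illustration.

```julia
using DataFrames

df = DataFrame(name=["Alice", "Bob"])
gdf = groupby(df, :name, sort=true)

by_namedtuple = gdf[(name="Bob",)]      # NamedTuple of grouping values
by_dict = gdf[Dict("name" => "Bob")]    # dictionary keyed by column name
by_key = gdf[keys(gdf)[2]]              # GroupKey taken from keys(gdf)

# All three select the same single-row group
by_namedtuple == by_dict == by_key
```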

So what is new in DataFrames.jl 1.3? Previously, the user was not able to
fully control the initial ordering of groups produced by groupby in
all cases. Now this can be controlled by the sort keyword argument, and the
API has been established with the following rules:

  • if you pass sort=true the groups will be sorted by values of grouping columns;
  • if you pass sort=false the groups will be produced in order of their first
    appearance in the source data frame;
  • if you omit passing the sort keyword argument the ordering of groups is
    undefined and will depend on the grouping algorithm used (DataFrames.jl has
    several grouping algorithms and tries to choose the fastest available).

To see that these options matter let me show two examples of grouping on an
integer column:

julia> df = DataFrame(id=[2, 3, 1])
3×1 DataFrame
 Row │ id
     │ Int64
─────┼───────
   1 │     2
   2 │     3
   3 │     1

julia> keys(groupby(df, :id))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 1,)
 GroupKey: (id = 2,)
 GroupKey: (id = 3,)

julia> keys(groupby(df, :id, sort=true))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 1,)
 GroupKey: (id = 2,)
 GroupKey: (id = 3,)

julia> keys(groupby(df, :id, sort=false))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 2,)
 GroupKey: (id = 3,)
 GroupKey: (id = 1,)

julia> df = DataFrame(id=[2, 30, 1])
3×1 DataFrame
 Row │ id
     │ Int64
─────┼───────
   1 │     2
   2 │    30
   3 │     1

julia> keys(groupby(df, :id))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 2,)
 GroupKey: (id = 30,)
 GroupKey: (id = 1,)

julia> keys(groupby(df, :id, sort=true))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 1,)
 GroupKey: (id = 2,)
 GroupKey: (id = 30,)

julia> keys(groupby(df, :id, sort=false))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 2,)
 GroupKey: (id = 30,)
 GroupKey: (id = 1,)

As you can see, passing the sort keyword argument produces a consistent
ordering. However, when it is not passed, the two examples give different
orderings: sorted by key in the first one and order of first appearance in the
second.

unstack now supports fill keyword argument

The change in unstack is pretty simple, but I think it will be useful in many
common scenarios. Now you can specify what value should be used to fill missing
combinations of data.

Let me give a practical example. Assume you have a data frame where you have
several observations of people’s hair color and eye color:

julia> df = DataFrame(hair=["brown", "yellow", "brown", "brown"],
                      eyes=["blue", "blue", "green", "blue"])
4×2 DataFrame
 Row │ hair    eyes
     │ String  String
─────┼────────────────
   1 │ brown   blue
   2 │ yellow  blue
   3 │ brown   green
   4 │ brown   blue

You can create a frequency table of this data with the FreqTables.jl package:

julia> using FreqTables

julia> freqtable(df, :hair, :eyes)
2×2 Named Matrix{Int64}
hair ╲ eyes │  blue  green
────────────┼─────────────
brown       │     2      1
yellow      │     1      0

You got a matrix with the desired result. However, what if you wanted to get
a DataFrame instead? In the past you would do:

julia> using Chain

julia> @chain df begin
           groupby([:hair, :eyes], sort=true)
           combine(nrow)
           unstack(:hair, :eyes, :nrow)
       end
2×3 DataFrame
 Row │ hair    blue    green
     │ String  Int64?  Int64?
─────┼─────────────────────────
   1 │ brown        2        1
   2 │ yellow       1  missing

The only problem is that you get missing instead of 0 in the cell where
there were no observations. To get 0 you would write:

julia> @chain df begin
           groupby([:hair, :eyes], sort=true)
           combine(nrow)
           unstack(:hair, :eyes, :nrow)
           coalesce.(0)
       end
2×3 DataFrame
 Row │ hair    blue   green
     │ String  Int64  Int64
─────┼──────────────────────
   1 │ brown       2      1
   2 │ yellow      1      0

Since DataFrames.jl 1.3 the pipeline is easier, as you can pass the fill=0 keyword
argument to unstack:

julia> @chain df begin
           groupby([:hair, :eyes], sort=true)
           combine(nrow)
           unstack(:hair, :eyes, :nrow, fill=0)
       end
2×3 DataFrame
 Row │ hair    blue   green
     │ String  Int64  Int64
─────┼──────────────────────
   1 │ brown       2      1
   2 │ yellow      1      0

Deprecations in deleting rows and sorting

The deprecation in row deletion is simple. The delete! function is deprecated
in favor of the deleteat! function. This change was made to make the DataFrames.jl
API consistent with the Julia Base API (where delete! is defined to remove a
mapping for the given key in a collection, while deleteat! removes items
from given indices).
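A minimal sketch of the replacement call, assuming DataFrames.jl 1.3 or later (the data frame here is made up for illustration):

```julia
using DataFrames

df = DataFrame(id=1:5, x=["a", "b", "c", "d", "e"])

deleteat!(df, 2)        # remove the second row in place
deleteat!(df, [1, 3])   # remove rows 1 and 3 of what remains

df.id                   # remaining ids: [3, 5]
```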

The deprecation in sorting API is more subtle. Consider the following data
frame:

julia> df = DataFrame(x=[1, 2, 2, 1], y =[2, 2, 1, 1], z=1:4)
4×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      1
   2 │     2      2      2
   3 │     2      1      3
   4 │     1      1      4

If you sort it without passing the list of columns on which it should be sorted,
by default a lexicographic sort on all columns is performed:

julia> sort(df)
4×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      4
   2 │     1      2      1
   3 │     2      1      3
   4 │     2      2      2

which is the same as:

julia> sort(df, All())
4×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      4
   2 │     1      2      1
   3 │     2      1      3
   4 │     2      2      2

However, surprisingly, when you currently ask for sorting on no columns
you also get a data frame sorted on all columns:

julia> sort(df, Cols())
┌ Warning: When empty column selector is passed ordering is done on all colums. This behavior is deprecated and will change in the future.
│   caller = sortperm(df::DataFrame, cols::Cols{Tuple{}}; alg::Nothing, lt::typeof(isless), by::typeof(identity), rev::Bool, order::Base.Order.ForwardOrdering) at sort.jl:579
└ @ DataFrames ~/.julia/packages/DataFrames/BM4OQ/src/abstractdataframe/sort.jl:579
4×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      4
   2 │     1      2      1
   3 │     2      1      3
   4 │     2      2      2

We think that this behavior is incorrect, and in the future sorting on no
columns will produce a result identical to the input data frame (no sorting
will be performed).
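Until that change lands, one way to avoid depending on the deprecated behavior is to always name the sort columns explicitly. A small sketch with the same data frame (the names `by_xy` and `by_z_desc` are only for illustration):

```julia
using DataFrames

df = DataFrame(x=[1, 2, 2, 1], y=[2, 2, 1, 1], z=1:4)

by_xy = sort(df, [:x, :y])          # lexicographic on x, then y
by_z_desc = sort(df, :z, rev=true)  # only z, in descending order

by_xy.z       # [4, 1, 3, 2]
by_z_desc.z   # [4, 3, 2, 1]
```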

Conclusions

This post concludes a series of reviews of new features in DataFrames.jl release
1.3. I have not covered everything that was introduced; a complete list of
changes can be found in the NEWS.md file.

I hope you will enjoy using the package! Happy data wrangling in year 2022!