Julia first impressions

By: pickle.dump(me, blog)

I have been a happy Python user for more than 20 years. I use it daily
as a data scientist with the classical scientific stack and a little
of PyTorch when I get fancy.

Recently I decided to give Julia a try. I must admit I was skeptical.
Is it worth it? I think yes.

Let me say that I was surprised by Julia’s design but that is the
subject for another post. This post is about what everyone thinks is
the differentiating Julia advantage:
The main conclusion is that it’s possible in a practical way to
obtain near C speed. Julia will not make your code run fast
automagically, but it’s a power tool that in trained hands can achieve
good results.

I’m still a Julia newbie so I wouldn’t be surprised if something I did
can be better done or something I wrote is

Analyzing Graphs with Julia


A brief tutorial on how to use Julia to analyze graphs using the JuliaGraphs packages

Example of a graph created using LightGraphs.jl and VegaLite.jl

First of all, let’s be clear. The goal of this article is to briefly introduce how to use Julia for analyzing graphs. Besides the many types of graphs (undirected, directed, bipartite, weighted…), there are also many methods for analyzing them (degree distribution, centrality measures, clustering measures, visual layouts …). Hence, a comprehensive introduction to Graph Analysis with Julia would be too large of a task.

Therefore, this tutorial focuses on undirected weighted graphs, since they encompass unweighted graphs and are usually more common than directed graphs¹.

JuliaGraphs Project

Almost every package you will need can be found in the JuliaGraphs Project. The project contains specific packages for plotting, network layouts, weighted graphs, and more. In our example, we’ll be using GraphPlot.jl and SimpleWeightedGraphs.jl. The good thing about the project is that these packages work together and are very similar in design. Hence, the functions you use for creating a weighted graph are very similar to the ones you use for creating a simple graph with LightGraphs.jl.

Creating your first Graph

Let’s create a DataFrame using the DataFrames.jl package. Each column will represent a person, and each row will represent an attribute. Therefore, our graph will be composed of nodes (people) and edges (two people sharing the same attribute).

DataFrame used for creating the Graph

After creating the DataFrame, we create the graph. Note here that they are actually separate objects. To create the graph, you only need to specify the number of nodes, which is the number of columns. Below I present the code for the creation of the DataFrame and the Graph.
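As a minimal sketch of this step, the snippet below builds a small DataFrame and an empty weighted graph with one node per column. The names and the 0/1 values are hypothetical (chosen only so that the first two columns share attributes 1 and 7, matching the example discussed next), and the snippet assumes DataFrames.jl, LightGraphs.jl and SimpleWeightedGraphs.jl are installed. On newer installations LightGraphs.jl has been succeeded by Graphs.jl, which exports the same functions used here.

```julia
using DataFrames, LightGraphs, SimpleWeightedGraphs

# Hypothetical data: columns are people, rows are 0/1 attributes.
df = DataFrame(
    Alice = [1, 0, 1, 0, 0, 1, 1],
    Bob   = [1, 1, 0, 0, 0, 0, 1],
    Carol = [0, 1, 1, 0, 1, 0, 0],
)

# The graph is a separate object: one node per column, no edges yet.
g = SimpleWeightedGraph(ncol(df))
```

At this point `nv(g)` equals the number of columns and `ne(g)` is zero.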

The nodes have been inserted; now we have to create the appropriate edges. This is done with the function add_edge!(), which takes the graph, the two nodes to be connected, and the edge weight.

In our example, the weight is equal to the number of shared attributes between two columns. For example, the first and second columns both have 1’s in rows 1 and 7. Therefore, they share an edge with weight 2:

add_edge!(g,1,2,2) # add_edge!(graph, node_1, node_2, weight)

The following code loops through the data, and adds the edges to the graph.
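The weight computation in that loop can be sketched with Base Julia alone. The attribute matrix below is hypothetical (rows are attributes, columns are people), and instead of mutating a graph the loop collects the `(node_1, node_2, weight)` triples that would be passed to `add_edge!`.

```julia
# Hypothetical 0/1 attribute matrix: rows are attributes, columns are people.
A = [1 1 0;
     0 1 1;
     1 0 1;
     0 0 0;
     0 0 1;
     1 0 0;
     1 1 0]

# Collect (node_1, node_2, weight) for every pair sharing at least one attribute.
edge_list = Tuple{Int,Int,Int}[]
n = size(A, 2)
for i in 1:n-1, j in i+1:n
    w = sum(A[:, i] .* A[:, j])   # number of rows where both columns are 1
    w > 0 && push!(edge_list, (i, j, w))
end

# Each triple corresponds to one call add_edge!(g, i, j, w).
println(edge_list)
```

Pairs with no shared attribute are skipped, so they get no edge at all rather than an edge of weight zero.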

Visualizing the Graph

With this, our graph is ready to be visualized. This can be easily done with the following command:

Output from gplot()

There are several different layouts to choose from; just take a look at the GraphPlot.jl page. Here is another example:

Output of gplot() using another layout

Centrality and Minimum Spanning Tree

Let’s do some analysis on this graph. As I said in the beginning, there are many ways to analyze a graph. Two very common methods are studying the centrality of each node and creating a Minimum Spanning Tree (MST). The goal of this article is not to explain theoretical aspects of graphs, so I’ll assume that the reader knows what I’m talking about.

Here is an example of how to calculate the betweenness centrality and visualize the results. Note that I added 1 in the first line of code just to enable a better visualization.
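A minimal sketch of the centrality computation, assuming LightGraphs.jl and using a tiny unweighted star graph instead of the article's data. It also assumes that the "added 1" refers to shifting the centrality values so that nodes with zero centrality still get a visible value when the scores drive colors or sizes.

```julia
using LightGraphs

g = star_graph(3)                        # node 1 is connected to nodes 2 and 3
centrality = betweenness_centrality(g)   # normalized by default
node_score = centrality .+ 1             # shift so zero-centrality nodes remain visible

println(centrality)
println(node_score)
```

Only the hub lies on a shortest path between two other nodes, so it is the only node with nonzero betweenness.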

Output of code above. The nodes colors represent the centrality.

Finally, one can also easily create an MST. To do so, we must create a new graph, since we’ll be removing edges and adjusting the nodes’ locations. The code below shows how to do this. The only new function here is kruskal_mst(), which will give us the MST. You can choose whether the tree minimizes or maximizes the total edge weight. In our case, we want to create a tree that contains only the “strongest” connections, hence we are maximizing.
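As a rough sketch of this step, assuming LightGraphs.jl and SimpleWeightedGraphs.jl, the snippet below runs `kruskal_mst` with `minimize=false` on a hypothetical weighted triangle and copies the selected edges into a new graph. The node count and weights are made up for illustration.

```julia
using LightGraphs, SimpleWeightedGraphs

# Hypothetical weighted triangle.
g = SimpleWeightedGraph(3)
add_edge!(g, 1, 2, 3.0)
add_edge!(g, 2, 3, 2.0)
add_edge!(g, 1, 3, 1.0)

# minimize=false asks for the spanning tree with the largest total weight.
mst_edges = kruskal_mst(g; minimize=false)

# Build a new graph that contains only the tree edges.
t = SimpleWeightedGraph(nv(g))
for e in mst_edges
    add_edge!(t, src(e), dst(e), weights(g)[src(e), dst(e)])
end
```

With these weights the weakest edge (between nodes 1 and 3) is the one left out of the tree.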

MST from the code above.


This is the end of this brief tutorial. There is much more functionality in the JuliaGraphs project, enabling a much more thorough analysis than the one presented here. Also, one can create more interactive visualizations using VegaLite.jl, but I’ll leave that for another article.

¹ Author’s opinion.

² A Jupyter Notebook can also be found on Github.

Swift as an Arrow.jl

By: Bogumił Kamiński

The Julia language is renowned for being swift (yet it is clearly not Swift).

Recently its data reading and writing capabilities have become ultra swift, mainly
thanks to Jacob Quinn building on earlier efforts of ExpandingMan.

The game changer is the Arrow.jl package, which is not only fast but is also
an implementation of the Apache Arrow format. This means that we have
a great format for in-memory and on-disk data exchange with C, C++, C#, Go, Java,
JavaScript, MATLAB, Python, R, Ruby, and Rust.

Recently a very nice blog post that presents how Arrow.jl is used
in practice was written by Jacob Zelko.

In this post I want to do some performance benchmarking and give recommendations
for people working with DataFrames.jl.

The post is written under Linux, Julia 1.5.2 and the following package setup:

(@v1.5) pkg> status Arrow CSV DataFrames
Status `~/.julia/environments/v1.5/Project.toml`
  [69666777] Arrow v0.4.1
  [336ed68f] CSV v0.7.7
  [a93c6f00] DataFrames v0.21.8

Arrow.jl test drive

First we create a data frame to work with:

julia> using Arrow, DataFrames

julia> df = DataFrame(["x$i" => i:10^7+i for i in 1:10])
10000001×10 DataFrame
│ Row      │ x1       │ x2       │ x3       │ x4       │ x5       │ x6       │ x7       │ x8       │ x9       │ x10      │
│          │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │
│ 1        │ 1        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │
│ 2        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │
│ 3        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │
│ 4        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │
│ 5        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │
│ 6        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │
│ 7        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │ 16       │
│ 9999994  │ 9999994  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │
│ 9999995  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │
│ 9999996  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │
│ 9999997  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │
│ 9999998  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │
│ 9999999  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │
│ 10000000 │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │
│ 10000001 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │ 10000010 │

And benchmark the writing-to-disk speed (I run all timings twice to capture both
the time of the first run and the time after everything is compiled, as both are
relevant in practice):

julia> using Arrow, DataFrames

julia> @time Arrow.write("test.arrow", df)
  2.792842 seconds (7.61 M allocations: 386.438 MiB)

julia> @time Arrow.write("test.arrow", df)
  2.222653 seconds (436 allocations: 35.953 KiB)

julia> stat("test.arrow").size / 10^6

The performance looks really good given that the final file is about 800 MB.
We also see that compilation latency is low.

Now we read the file back:

julia> @time df2 = Arrow.Table("test.arrow") |> DataFrame;
  0.831660 seconds (2.44 M allocations: 123.515 MiB)

julia> @time df2 = Arrow.Table("test.arrow") |> DataFrame
  0.000320 seconds (649 allocations: 40.047 KiB)
10000001×10 DataFrame
│ Row      │ x1       │ x2       │ x3       │ x4       │ x5       │ x6       │ x7       │ x8       │ x9       │ x10      │
│          │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │
│ 1        │ 1        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │
│ 2        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │
│ 3        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │
│ 4        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │
│ 5        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │
│ 6        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │
│ 7        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │ 16       │
│ 9999994  │ 9999994  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │
│ 9999995  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │
│ 9999996  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │
│ 9999997  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │
│ 9999998  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │
│ 9999999  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │
│ 10000000 │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │
│ 10000001 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │ 10000010 │

And here the magic happens. When Arrow.Table reads the file from disk
it uses memory mapping, so reading it is almost instant (and, as when writing
the file, the compilation latency is low, which is nice).

This speed has one cost: Arrow.jl uses its own custom vector type, which
is additionally read-only when memory mapped, as we can see here:

julia> typeof(df2.x1)

julia> df2.x1[1] = 100
ERROR: ReadOnlyMemoryError()

Fortunately, this is easily fixed by making a copy of the DataFrame:

julia> @time df3 = copy(df2);
  0.522758 seconds (287.94 k allocations: 777.914 MiB, 9.34% gc time)

julia> @time df3 = copy(df2);
  0.559576 seconds (44 allocations: 762.943 MiB, 36.20% gc time)

julia> typeof(df3.x1)

The copy costs around 0.5 seconds in our case, which is not that bad I think.
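To illustrate the general idea behind this trade-off, here is a minimal sketch that uses Julia's standard-library Mmap module rather than Arrow.jl itself. The file name and data are made up, and this is not a description of Arrow.jl internals: it only shows that a read-only memory-mapped vector is backed by the file, while `copy` produces an ordinary in-memory `Vector` that can be modified and resized.

```julia
using Mmap

# Write five Int64 values to a hypothetical scratch file.
path = joinpath(mktempdir(), "x.bin")
write(path, collect(Int64, 1:5))

# Map the file read-only: no data is loaded up front, pages are read on access.
v = open(path, "r") do io
    Mmap.mmap(io, Vector{Int64}, 5)
end

# Writing to v is expected to fail, because the file was opened read-only.
# Copying materializes the data as a regular Vector.
w = copy(v)
w[1] = 100      # mutation works on the copy
push!(w, 6)     # and so does resizing
```

The mapped vector `v` is left untouched by the changes made to `w`.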

If we want to avoid memory mapping, we can read a file from an IO like this:

julia> @time df2 = open("test.arrow") do io
           return Arrow.Table(io) |> DataFrame
       end;
  0.338193 seconds (42.84 k allocations: 765.156 MiB)

julia> @time df2 = open("test.arrow") do io
           return Arrow.Table(io) |> DataFrame
       end
  0.323562 seconds (8.50 k allocations: 763.373 MiB)
10000001×10 DataFrame
│ Row      │ x1       │ x2       │ x3       │ x4       │ x5       │ x6       │ x7       │ x8       │ x9       │ x10      │
│          │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │ Int64    │
│ 1        │ 1        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │
│ 2        │ 2        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │
│ 3        │ 3        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │
│ 4        │ 4        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │
│ 5        │ 5        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │
│ 6        │ 6        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │
│ 7        │ 7        │ 8        │ 9        │ 10       │ 11       │ 12       │ 13       │ 14       │ 15       │ 16       │
│ 9999994  │ 9999994  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │
│ 9999995  │ 9999995  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │
│ 9999996  │ 9999996  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │
│ 9999997  │ 9999997  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │
│ 9999998  │ 9999998  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │
│ 9999999  │ 9999999  │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │
│ 10000000 │ 10000000 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │
│ 10000001 │ 10000001 │ 10000002 │ 10000003 │ 10000004 │ 10000005 │ 10000006 │ 10000007 │ 10000008 │ 10000009 │ 10000010 │

julia> typeof(df2.x1)

julia> df2.x1[1] = 100

julia> push!(df2, 1:10)
ERROR: MethodError: no method matching resize!(::Arrow.Primitive{Int64,Array{Int64,1}}, ::Int64)

As you can see, this time the vectors are mutable, but not resizable. Having
mutability costs around 0.3 seconds of extra read time over memory mapping.

To put these timings in perspective, let us try CSV.jl
(we use a single thread):

julia> using CSV

julia> @time CSV.write("test.csv", df)
 14.644919 seconds (501.00 M allocations: 11.977 GiB, 8.92% gc time)

julia> @time CSV.write("test.csv", df)
 15.076190 seconds (499.98 M allocations: 11.925 GiB, 8.93% gc time)

julia> stat("test.csv").size / 10^6

julia> @time df2 = CSV.read("test.csv");
  7.391509 seconds (6.01 M allocations: 2.909 GiB, 3.92% gc time)

julia> @time df2 = CSV.read("test.csv");
  3.292178 seconds (2.94 k allocations: 2.612 GiB, 6.21% gc time)

So we see that both reading and writing are indeed much faster with Arrow.jl
(although the size of the file on disk is comparable in both approaches).

Concluding remarks

I did not want to go into the details of Arrow.jl usage in this post,
but rather to give a high-level view of how it works and show how to use it with
DataFrames.jl (depending on what level of mutability one wants).

As a final comment, let me just highlight the three options you have to
create the Arrow.Table object. You can use data from:

  • a file on disk, in which case you have the advantage of being able to use
    memory mapping;
  • an IO (so this means you can ingest data from any external source);
  • a Vector{UInt8} (so you can easily process data passed as a pointer,
    or e.g. bytes read from an HTTP request).

I think it is really great as it covers virtually any use case one might
encounter in practice.
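The three options can be sketched as follows. This assumes a recent Arrow.jl release (newer than the 0.4.1 used for the timings above) in which columns can be accessed as properties of the table; the tiny table and the file name are hypothetical.

```julia
using Arrow

# Hypothetical tiny table written to a scratch file.
tbl = (x = [1, 2, 3], y = [4.0, 5.0, 6.0])
path = joinpath(mktempdir(), "demo.arrow")
Arrow.write(path, tbl)

# 1. From a file on disk (this path can use memory mapping).
t_file = Arrow.Table(path)

# 2. From an IO.
t_io = open(io -> Arrow.Table(io), path)

# 3. From a Vector{UInt8}, here simply the bytes of the same file.
t_bytes = Arrow.Table(read(path))
```

All three tables expose the same columns, so any of them can be passed to `DataFrame` in the same way as in the examples above.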