I have been a happy Python user for more than 20 years. I use it daily
as a data scientist with the classical scientific stack and a little
of PyTorch when I get fancy.
Recently I decided to give Julia a
try. I must admit I was skeptical. Is it worth it? I think so.
Let me say that I was surprised by Julia’s design but that is the
subject for another post. This post is about what everyone thinks is
the differentiating Julia advantage: speed. The main conclusion is that it is
practical to obtain near-C speed. Julia will not make your code run fast
automagically, but it is a power tool that, in trained hands, can achieve
good results.
I’m still a Julia newbie, so I wouldn’t be surprised if something I did
can be done better or something I wrote is wrong.
A brief tutorial on how to use Julia to analyze graphs using the JuliaGraphs packages
First of all, let’s be clear. The goal of this article is to briefly introduce how to use Julia for analyzing graphs. Besides the many types of graphs (undirected, directed, bipartite, weighted…), there are also many methods for analyzing them (degree distribution, centrality measures, clustering measures, visual layouts…). Hence, a comprehensive introduction to graph analysis with Julia would be too large a task.
Therefore, this tutorial focuses on undirected weighted graphs, since they encompass unweighted graphs and are usually more common than directed graphs¹.
JuliaGraphs Project
Almost every package you will need can be found in the JuliaGraphs Project. The project contains specific packages for plotting, network layouts, weighted graphs, and more. In our example, we’ll be using GraphPlot.jl and SimpleWeightedGraphs.jl. The good thing about the project is that these packages work together and are very similar in design. Hence, the functions you use for creating a weighted graph are very similar to the ones you use for creating a simple graph with LightGraphs.jl.
Creating your first Graph
Let’s create a DataFrame using the DataFrames.jl package. Each column will represent a person, and each row will represent an attribute. Therefore, our graph will be composed of nodes (people) and edges (connecting people who share an attribute).
After creating the DataFrame, we create the graph. Note here that they are actually separate objects. To create the graph, you only need to specify the number of nodes, which is the number of columns. Below I present the code for the creation of the DataFrame and the Graph.
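A minimal sketch of that setup (the people’s names and the 0/1 attribute matrix here are made up for illustration, not taken from the article):

```julia
using DataFrames, SimpleWeightedGraphs

# Columns are people, rows are attributes (1 = has the attribute).
df = DataFrame(alice = [1, 0, 1, 0, 0, 0, 1],
               bob   = [1, 0, 0, 1, 0, 0, 1],
               carol = [0, 1, 0, 0, 1, 0, 1])

# The graph is a separate object; it only needs the number of nodes,
# i.e. the number of columns.
g = SimpleWeightedGraph(ncol(df))
```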
The nodes are in place; now we have to create the appropriate edges. This is done with the function add_edge!(), which takes the graph, the two nodes to connect, and the edge weight.
In our example, the weight is equal to the number of shared attributes between two columns. For example, the first and second column both have 1’s in rows 1 and 7. Therefore, they share an edge with weight 2:
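Continuing the hypothetical DataFrame and graph from the sketch above, the pairwise weights can be filled in like this:

```julia
# Connect every pair of people; the weight is the number of rows
# in which both columns have a 1 (i.e. shared attributes).
for i in 1:ncol(df), j in (i + 1):ncol(df)
    w = sum(df[!, i] .* df[!, j])
    w > 0 && add_edge!(g, i, j, w)
end
```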
Let’s do some analysis on this graph. As I said in the beginning, there are many ways to analyze a graph. Two very common methods are studying the centrality of each node and creating a Minimum Spanning Tree (MST). The goal of this article is not to explain the theoretical aspects of graphs, so I’ll assume that the reader knows what I’m talking about.
Here is an example of how to calculate the betweenness centrality and visualize the results. Note that I added 1 in the first line of code just to enable a better visualization.
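A sketch of that calculation, reusing the hypothetical graph g from above (betweenness_centrality comes from LightGraphs.jl and the plotting keywords from GraphPlot.jl):

```julia
using LightGraphs, GraphPlot

# The +1 keeps nodes with zero betweenness visible in the plot.
bc = betweenness_centrality(g) .+ 1
gplot(g, nodesize = bc, nodelabel = 1:nv(g))
```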
Finally, one can also easily create an MST. To do so, we must create a new graph, since we’ll be removing edges and adjusting the nodes’ locations. The code below shows how to do this. The only new function here is kruskal_mst(), which will give us the MST. You can choose whether the MST minimizes or maximizes the path weight. In our case, we want to create a tree that contains only the “strongest” connections; hence, we are maximizing.
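Under the same assumptions as the earlier sketches, the MST step might look like this:

```julia
# minimize = false asks Kruskal's algorithm to maximize total edge weight,
# keeping only the "strongest" connections.
mst_edges = kruskal_mst(g, minimize = false)

# Build a fresh graph holding only the MST edges.
g_mst = SimpleWeightedGraph(nv(g))
for e in mst_edges
    add_edge!(g_mst, src(e), dst(e), weight(e))
end
gplot(g_mst, nodelabel = 1:nv(g_mst))
```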
Conclusion
This is the end of this brief tutorial. There is much more functionality in the JuliaGraphs project, enabling a much more thorough analysis than the one presented here. Also, one can create more interactive visualizations using VegaLite.jl, but I’ll leave that for another article.
The Julia language is renowned for being swift (yet it is clearly not Swift).
Recently its data reading and writing capabilities have become ultra swift, mainly
thanks to Jacob Quinn building on the earlier efforts of ExpandingMan.
The game changer is the Arrow.jl package, which is not only fast but also
an implementation of the Apache Arrow format. This means that we have
a great format for in-memory and on-disk data exchange with C, C++, C#, Go, Java,
JavaScript, MATLAB, Python, R, Ruby, and Rust.
And let us benchmark the writing-to-disk speed (I ran all timings twice to capture both
the time of the first run and the time after things are compiled, as both are relevant
in practice):
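The write benchmark might look roughly like this (a sketch: the df name, file name, and table size are hypothetical, not the post’s actual data):

```julia
using Arrow, DataFrames

# A hypothetical table to benchmark with.
df = DataFrame(x = rand(10^6), y = rand(1:100, 10^6))

@time Arrow.write("data.arrow", df)  # first run: includes compilation
@time Arrow.write("data.arrow", df)  # second run: steady-state speed
```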
And here the magic happens. When Arrow.Table reads the file from disk,
it uses memory mapping, so reading it is almost instant (and again, as when writing
the file, the compilation latency is low, which is nice).
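A sketch of the read side, reusing the hypothetical data.arrow file from the write example:

```julia
# Arrow.Table memory-maps the file, so this is almost instant.
@time tbl = Arrow.Table("data.arrow")
df2 = DataFrame(tbl)  # wraps the Arrow columns without copying
```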
There is one cost to this speed: Arrow.jl uses its own custom vector type, and it
is additionally read-only when memory-mapped, as we can see here:
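A sketch of how the restriction shows up (the mutating calls are left commented out, and the exact error behavior depends on the Arrow.jl version; the file name is the hypothetical one from earlier):

```julia
tbl = Arrow.Table("data.arrow")   # memory-mapped from disk
df2 = DataFrame(tbl)              # wraps the Arrow vectors without copying
# df2.x[1] = 0.0                  # would fail: the mmapped buffer is read-only

# Reading the raw bytes instead of memory mapping gives in-memory buffers:
tbl2 = Arrow.Table(read("data.arrow"))
df3 = DataFrame(tbl2)
# df3.x[1] = 0.0                  # mutating the in-memory copy
# push!(df3.x, 1.0)               # still disallowed: Arrow vectors are not resizable
```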
As you can see, this time the vectors are mutable, but not resizable. Having
mutability costs around 0.3 seconds of extra read time over memory mapping.
To get a relative feeling for these timings, let us try CSV.jl
(we use a single thread):
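The comparison could be sketched as follows (reusing the hypothetical df from above; the ntasks keyword for forcing a single thread assumes a recent CSV.jl version):

```julia
using CSV, DataFrames

# Each call is timed twice, so the first number includes compilation latency.
@time CSV.write("data.csv", df)
@time CSV.write("data.csv", df)
@time CSV.read("data.csv", DataFrame; ntasks = 1)  # force a single thread
@time CSV.read("data.csv", DataFrame; ntasks = 1)
```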
So we see that with Arrow.jl both reading and writing are indeed much faster (although the size
of the file on disk is comparable in both approaches).
Concluding remarks
I did not want to go into the details of Arrow.jl usage in this post,
but rather to show a high-level view of how it works and to present how to use it with
DataFrames.jl (depending on what level of mutability one wants).
Maybe as a final comment, let me just highlight the three options you have to
create the Arrow.Table object. You can use data:
from a file on disk, in which case you have the advantage of being able to use
memory mapping;
from an IO (so you can ingest data from any external source);
from a Vector{UInt8} (so you can easily process data passed as a pointer,
or e.g. bytes read from an HTTP request).
I think it is really great as it covers virtually any use case one might
encounter in practice.
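The three options above can be sketched as follows, reusing the hypothetical data.arrow file from earlier:

```julia
using Arrow

t1 = Arrow.Table("data.arrow")                  # file path: enables memory mapping
t2 = Arrow.Table(IOBuffer(read("data.arrow")))  # any IO source
t3 = Arrow.Table(read("data.arrow"))            # a raw Vector{UInt8}
```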