When you work with strings in Julia you have several options for how to store them.
In this post I discuss the most common usage scenarios and the recommended
choices.
The string storage decision tree
The choices I discuss below are related to performance and memory consumption.
Therefore in what follows I assume you work with large data sets relative to
available RAM on your machine.
Let me start with the decision flowchart and then explain it:
The first decision you need to make is whether you only want performance optimization
or you want your strings to be treated as ordered or unordered categorical data
in the statistical sense. If you need your data to be categorical then the choice
is simple. The only option is to use the CategoricalArrays.jl package,
where the underlying values can be stored as String.
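As a minimal sketch of the categorical case, assuming CategoricalArrays.jl is installed (the level names and values below are made up for illustration):

```julia
using CategoricalArrays

# Hypothetical survey answers; the level names are illustrative only.
answers = ["low", "high", "medium", "low"]

# Declare the allowed levels and their order explicitly.
ratings = categorical(answers; levels=["low", "medium", "high"], ordered=true)

levels(ratings)          # the levels in their declared order
isordered(ratings)       # true, so values can be compared with <
ratings[1] < ratings[2]  # compares by level order, not alphabetically
```

The underlying values are still ordinary String objects; the categorical wrapper only adds the level information on top.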
On the other hand if your goal is only performance and saving memory then the
first question is if the number of unique values of strings in your data is
low. If this is the case then the recommended package is PooledArrays.jl, where again you should be fine with storing String values.
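A short sketch of the pooled case, assuming PooledArrays.jl is installed (the city names are arbitrary sample data):

```julia
using PooledArrays

# Hypothetical column with many rows but only three distinct values.
cities = rand(["Warsaw", "Berlin", "Paris"], 1_000)

# Each distinct string is kept once; the rows hold small integer references.
pooled = PooledArray(cities)

length(pooled)           # still 1_000 elements
length(unique(pooled))   # at most 3 distinct strings
eltype(pooled)           # String: indexing returns ordinary strings
pooled == cities         # the values are unchanged by pooling
```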
We are down to the scenario when you have a lot of strings that have very many
unique values. In such a case the question is if all of these strings are
relatively short and have a similar size. If this is the case then you can use
the InlineStrings.jl package. It provides several types called String1, String3, String7, String15 etc. where the number indicates maximum string
size in bytes that a given type can store. The benefit of these values is that
they are not heap allocated. It means that they are fast to work with and they
do not burden the Julia Garbage Collector.
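To make the "not heap allocated" point concrete, here is a small sketch assuming InlineStrings.jl is installed (the country codes are made-up sample values):

```julia
using InlineStrings

# String7 holds at most 7 bytes; the codes below are made-up examples.
codes = String7.(["PL", "DE", "FR", "USA"])

eltype(codes)         # String7
isbitstype(String7)   # true: values are stored inline in the vector
isbitstype(String)    # false: each String is a separate heap object
codes[4] == "USA"     # inline strings compare equal to ordinary strings
```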
Finally we are left with many strings, that have many unique values and that
have varying and possibly large size. In this case what Julia Base offers is a
sensible choice. Normally you should just use the String type stored in
standard collections like Vector. However, there is one special case when
you could consider using the Symbol type instead of a String type.
You can use Symbol instead of String if all the following conditions are
met:
your strings are just labels that you only need to compare against each other;
in particular, this assumes that you do not need to perform any transformations
on them, because Symbol is not an AbstractString so it cannot
be passed to functions that only accept strings (the benefit is that comparing Symbols is faster than comparing Strings);
you are OK with the fact that once a Symbol is created the memory it uses
will never be reclaimed by the Julia Garbage Collector until the end of the
session (however, the benefit is that if you have several identical Symbols
they share the same memory).
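Both conditions can be checked with a few lines of Base Julia (the label text is arbitrary):

```julia
a = Symbol("label")
b = Symbol("label")

a === b                # true: identical symbols are the same interned object
a isa AbstractString   # false: functions accepting only strings reject a Symbol
uppercase(String(a))   # convert to String first when a transformation is needed
```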
Conclusions
Choosing an appropriate type to store your strings is often quite a hard
decision. I hope that after reading this post you have a better overview of
the available options and when each of them is appropriate.
It is also recommended to immediately convert the data to an appropriate format
when you read it in. For example, I recommend checking out the
documentation of the CSV.jl package to learn how to specify what you
want to get when reading CSV files (the most important keyword arguments
for handling these choices are pool and stringtype).
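As a rough sketch of how these two keyword arguments can be passed, assuming CSV.jl is installed (the CSV contents are made up, and the CSV.jl documentation remains the authority on the accepted values):

```julia
using CSV

# Made-up CSV contents, read from memory instead of a file.
data = """
id,label
1,a
2,b
3,a
"""

# pool=true asks for string columns to be pooled; stringtype=String asks
# for plain String values instead of inline strings.
file = CSV.File(IOBuffer(data); pool=true, stringtype=String)

eltype(file.label)    # element type of the label column
collect(file.label)   # the values as a plain vector
```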
Occasionally I write posts about Julia tools that are not commonly
known, but are useful in practice. Today I want to talk about
the ClipData.jl package.
The post was written under Julia 1.6.3, DataFrames.jl 1.2.2, and
ClipData.jl 0.2.1.
What is ClipData.jl?
The package does one thing and does it well: it allows you to move
tabular data between your Julia session and the system clipboard both ways.
The two major use cases are:
You have a table in e.g. Google Sheet, you copy it to the system clipboard,
and want to interactively ingest it in the Julia session as a table
(in my examples I will use DataFrame).
You have a DataFrame in your Julia session and you want to copy it to the
system clipboard so that you can later paste it in e.g. Google Sheet.
Many data scientists need to do both operations virtually every day, and ClipData.jl comes to the rescue. This package is not only nice, but it
also has excellent visuals explaining how things work. Therefore, since they are
MIT licensed, I just link to the videos prepared by Peter Deffebach here.
Let us get to action.
First you need to know if your data has a header or not. If it has a
header we will work with a DataFrame; if it does not, we will work with a Matrix.
Clipping tables
To work with tabular data (having a header) use the cliptable function. To
copy data from the system clipboard and store it in a DataFrame called df
just write:
df = cliptable() |> DataFrame
On the other hand if you want to copy your df data frame to the system
clipboard use:
cliptable(df)
All this is very nicely presented in the following video (in particular notice
that column element types are automatically detected):
Clipping matrices
To work with arrays use the cliparray function. To copy data from
the system clipboard and store it in a Matrix called mat just write:
mat = cliparray()
On the other hand if you want to copy your mat matrix to the system clipboard
use:
cliparray(mat)
Here is a video showing the process:
Conclusions
There are several additional features that ClipData.jl provides (like handling
how table cells should be parsed). If you want to know more details please
refer to the ClipData.jl homepage.
I am sure you will find this little package quite useful in your data science
projects!
When starting to learn Julia, one might get lost in the many different packages available to do data visualization. Right off the cuff, there is Plots, Gadfly, VegaLite … and there is Makie.
Makie is fairly new (~2018), yet it’s very versatile, actively developed and quickly growing in number of users. This article is a quick introduction to Makie, but by the end of it, you will be able to do a plethora of different plots.
The Future of Plotting in Julia
When I started coding in Julia, Makie was not one of the contenders for “best” plotting libraries. As time passed, I started to hear more and more about it around the community. For some reason, people were saying that:
“Makie is the future” — People in the Julia Community
I never fully understood why that was the case, and every time I tried to learn it, I’d be turned off by the verbose syntax, and, frankly, ugly examples. It was only when I bumped into Beautiful Makie that I decided to put aside my prejudices and get on with the times.
Hence, if you are starting to code in Julia, and are wondering which plotting package you should invest your time to learn, I say to you that Makie is the way to go, since I guess “Makie is the future”.
Number of GitHub stars per repository. I guess Makie is indeed the future, if this trend keeps going.
Starting with Makie… Pick your backend
The versatility in Makie can make it a bit unwelcoming for those that “just want to do a damn scatter plot”. First of all, there are Makie.jl, CairoMakie.jl, GLMakie.jl and WGLMakie.jl. Which one should you use?
Well, here is the deal. Makie.jl is the main plotting package, but you have to choose a backend that will display your plots. The choice depends on your objectives. So yes, besides Makie.jl, you will need to install one of the backends. Here is a small description to help you choose:
CairoMakie.jl: It’s the easiest to use of all three, and it’s the ideal choice if you just want to produce static plots (not interactive);
GLMakie.jl: Uses OpenGL to display the plots, hence, you need to have OpenGL installed. Once you do a plot and run display(myplot), it’ll open an interactive window with your plot. If you want to do interactive 3D plots, then this is the backend for you;
WGLMakie.jl: It’s the hardest one to work with. Still, if you want to create interactive visualizations on the web, this is your choice.
In this tutorial, we’ll use CairoMakie.jl.
Your first plot
After picking our backend, we can now start plotting! I’ll go out on a limb and say that Makie is very similar to Matplotlib. It does not work with any fancy “Grammar of Graphics” (but if you like this sort of stuff, take a look at AlgebraOfGraphics.jl, which implements an “Algebra of Graphics” on Makie).
Thus, there are a bunch of ready-to-use functions for some of the most common plots.
using CairoMakie # Yeah, no need to import Makie
scatter(rand(10,2))
Easy breezy… Yet, if you are plotting this in a Jupyter Notebook, you might be slightly ticked off by two things. First, the image is just too large. And second, it’s kind of low quality. What is going on?
By default, CairoMakie uses raster format for images, and the default size is a bit large. If you are like me and prefer your plots to be in svg and a bit smaller, then no worries! Just do the following:
using CairoMakie
CairoMakie.activate!(type = "svg")
scatter(rand(10,2), figure=(; resolution=(300,300)))
In the code above, CairoMakie.activate!() is a command that tells Makie which backend you are using. You can import more than one backend at a time, and switch between them using these activation commands. Also, the CairoMakie backend has the option to do svg plots (to my knowledge, this is not possible for the other backends). Hence, with this small line of code, all our plots will now be displayed in high quality.
Next, we defined a “resolution” for our figure. In my opinion, this is a bit of an unfortunate name, because the resolution is actually the size of our image. Yet, as we’ll see further on, the attribute resolution actually belongs to our figure, and not to the actual scatter plot. For this reason we pass the whole figure = (; resolution=(300,300)) (if you are new to Julia, the ; is just a way of separating attributes that have names from unnamed ones, i.e. args and kwargs).
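If the ; syntax is new to you, this small Base Julia sketch shows what it does, both when building the value passed as figure and when calling a function (the area function is a hypothetical example, not part of Makie):

```julia
# With a leading semicolon, everything in the parentheses is a named field.
opts = (; resolution = (300, 300))

opts isa NamedTuple                  # true
opts.resolution                      # (300, 300)
opts == (resolution = (300, 300),)   # same value, written with a trailing comma

# In a function call, the semicolon separates positional arguments from
# keyword arguments. `area` is a hypothetical function for illustration.
area(width, height; scale = 1) = width * height * scale

area(2, 3; scale = 10)   # 60
```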
Congrats! You now know the bare minimum of Makie to do a whole bunch of different plots! Just go to Makie’s website and see how to use all the different ready-to-use plotting functions! In order to be self-contained, here is a small cheat sheet from the great book Julia Data Science.
Of course, we still haven’t talked about a bunch of important things, like titles, subplots, legends, axes limits, etc. Just keep on reading…
Commands like scatter produce a “FigureAxisPlot” object, which contains a figure, a set of axes and the actual plot. Each of these objects has different attributes, and they are fundamental for customizing your visualization. By doing:
fig, ax, plt = scatter(rand(10,2))
We save each of these objects in a different variable, and can more easily modify them. In this example, the function scatter is actually creating all three objects, and not only the plot. We could instead create each of these objects individually. Here is how we do it:
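A minimal sketch of creating the three objects by hand, assuming CairoMakie is installed; it is consistent with the explanation that follows, but the label and title strings are placeholders rather than the author's exact values:

```julia
using CairoMakie

# Empty figure, then an axis placed in row 1, column 1 of its layout.
fig = Figure()
ax = Axis(fig[1, 1],
    xlabel = "x label",   # placeholder label text
    ylabel = "y label",   # placeholder label text
    title = "Title")      # placeholder title text

# Append a line plot of sin to the axis.
lines!(ax, 1:0.1:10, x -> sin(x))

fig
```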
Let’s explain the code above. First, we created the empty figure and stored it in fig. Next, we created an “Axis”. But we need to say which figure this object belongs to, and this is where the fig[1,1] comes in. But, what is this “[1,1]”?
Every figure in Makie comes with a grid layout underneath, which enables us to easily create subplots in the same figure. Hence, the fig[1,1] means “Axis belongs to fig row 1 and column 1”. Since our figure only has one element, our axis will occupy the whole thing. Still confused? Don’t worry, once we do subplots you’ll understand why this is so useful.
The rest of the arguments in “Axis” are easy to understand. We are just defining the names in each axis and then the title.
Finally, we add the plot using lines!. The exclamation mark is a Julia convention meaning that a function modifies one of its arguments. In our case, lines!(ax, 1:0.1:10, x->sin(x)) is appending a line plot to the ax axis.
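The same naming convention appears throughout Base Julia, so a quick sketch with plain vectors may help (nothing here is Makie-specific):

```julia
v = [3, 1, 2]

sorted_copy = sort(v)   # returns a new sorted vector; v is unchanged
v                       # still [3, 1, 2]

sort!(v)                # sorts v in place
v                       # now [1, 2, 3]

push!(v, 4)             # appends to v, much like lines! appends a plot to ax
```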
It’s clear now how we can, for example, add more line plots. Running lines! again will append more plots to our ax axis. In this case, let’s also add a legend to our plot.
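Here is a sketch of two line plots sharing one axis with a legend, assuming CairoMakie is installed; the second curve and the label texts are my own choices for illustration:

```julia
using CairoMakie

fig = Figure()
ax = Axis(fig[1, 1], xlabel = "x label", ylabel = "y label", title = "Title")

# Each call appends another plot to the same axis; label feeds the legend.
lines!(ax, 1:0.1:10, x -> sin(x), label = "sin")
lines!(ax, 1:0.1:10, x -> cos(x), label = "cos")

# Build a legend inside the axis from the labelled plots.
axislegend(ax)

fig
```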
Tip: if you are using Jupyter and want to display your visualization, you can call display(fig) or just write fig at the end of the cell.
Ok, our plots are starting to look good. Let me end this section talking about subplots. As I said, this is where the whole “fig[1,1]” comes into play. If instead of doing two plots in the same axis we wanted to create two parallel plots in the same figure, here is how we would do this.
fig = Figure(resolution=(600, 300))
ax1 = Axis(fig[1, 1], xlabel = "x label", ylabel = "y label", title = "Title1")
ax2 = Axis(fig[1, 2], xlabel = "x label", ylabel = "y label", title = "Title2")
This time, in the same figure, we created two axes, but the first one is in the first row and first column, while the second one is in the second column. We then just append the plot to the respective axis. Lastly, we save the figure in “png” format.
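The plotting and saving steps described above could look roughly like this, assuming CairoMakie is installed; the plotted data and the file name are assumptions of mine, not taken from the original listing:

```julia
using CairoMakie

fig = Figure()
ax1 = Axis(fig[1, 1], title = "Title1")
ax2 = Axis(fig[1, 2], title = "Title2")

# Append one plot to each axis (the data is assumed for illustration).
lines!(ax1, 1:0.1:10, x -> sin(x))
scatter!(ax2, rand(10, 2))

# Hypothetical file name; the output format is taken from the extension.
save("figure.png", fig)
```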
Final Words
That’s it for this tutorial. Of course, there is much more to talk about, as we have only scratched the surface. Makie has some awesome capabilities in terms of animations, and many more attributes/objects to play with in order to create truly astonishing visualizations. If you want to learn more, take a look at Makie’s documentation, which is very nice. Also, the Julia Data Science book has a chapter devoted to Makie.
References
This article draws heavily on the Julia Data Science book and Makie’s own documentation.
Danisch & Krumbiegel, (2021). Makie.jl: Flexible high-performance data visualization for Julia. Journal of Open Source Software, 6(65), 3349, https://doi.org/10.21105/joss.03349