Google Summer of Code is starting up, so I thought it would be a good time to share my workflow for developing my own Julia packages, as well as my workflow for contributing to other Julia packages. This does not assume familiarity with command-line Git, and instead shows you how to use a GUI (GitKraken) to make branches and PRs, and to review and merge code. You can think of it as an update to my old blog post on package development in Julia. However, this is not only updated but also improved, since I am now able to walk through the “non-code” parts of package development (such as setting up AppVeyor and code coverage).
Enjoy! (I quite like this video blog format: it was a lot less work)
Recently I was using Julia to run ffprobe to get the length of a video file. The trouble was that ffprobe was dumping its output to stderr, and I wanted to take that output and run it through grep. From a bash shell one would typically run:
If you try the same command inside Julia’s backticks, you will get errors. Julia does not like pipes | inside a backtick command (for very sensible reasons). Instead you should use Julia’s pipeline function. The redirection 2>&1 will not work either. So the best thing to use is an instance of Pipe. This was not in the manual; I stumbled upon it in an issue discussion on GitHub. So a good way to do what I am after is to run:
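Here is a minimal sketch of that idea, not the exact command from my session. To keep it runnable without a video file, it assumes a POSIX sh is available and uses a stand-in command that prints a made-up "Duration" line to stderr; in practice you would put your own ffprobe call in its place. The grep step is done in Julia with filter rather than in a second process.

```julia
# Stand-in for ffprobe: a shell command that writes only to stderr.
# (Assumption: a POSIX `sh` exists; the Duration text is invented sample data.)
cmd = `sh -c "echo 'Duration: 00:01:23.45, start: 0.000000' >&2"`

err = Pipe()

# pipeline() takes the place of shell redirection. ignorestatus() keeps a
# nonzero exit code from throwing an error.
run(pipeline(ignorestatus(cmd), stdout=devnull, stderr=err))

close(err.in)                  # close the write end so read() can reach EOF
output = read(err, String)

# The "grep" step: keep only the lines that mention Duration.
matches = filter(line -> occursin("Duration", line), split(output, '\n'))
println(matches)
```

This pattern is fine for short output like ffprobe’s. For a process that writes a lot, the pipe buffer can fill up before run returns, so reading asynchronously would be the safer design.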
So, last summer, my program was producing three-dimensional data, and I needed a way to export and save that data from my C++ program. Simple ASCII files, my default method, no longer covered my needs. Of course, I wasn’t the first person to encounter this problem, so I discovered the HDF5 standard.
Instead of storing data in a human readable format like ASCII, the Hierarchical Data Format, HDF, stores data in binary format. This preserves the shape of the data in the computer and keeps it at its minimum size. WOHOO!!
Sadly, the syntax for HDF5 in C++ and Fortran is just as bad as FFTW or OpenBLAS. But happily, just like FFTW and OpenBLAS, HDF5 has wonderful syntax in Julia and Python, among others.
So how does it work?
We don’t just print a single variable. Each HDF5 file is like its own file system. In my home directory, I have my documents folder, my programming folder, my pictures, configuration files,… and inside each folder I can have subfolders or files.
The same is true for an HDF5 file. We have the root, and then we have groups and subgroups. A group is like a folder. Then we can have datasets. Datasets are objects that hold data (files).
Installing the Package
While running Pkg.add("HDF5"); should hopefully add the HDF5 library, additional steps may be required. I remember having a horrible time with the HDF installation when using C++ a year ago. If at all possible, just use a package manager, and do not try and install it from source! See the HDF5.jl or HDFGroup pages for details.
Firstly, let’s open a file and then write some data to it.
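A minimal sketch of that first step, assuming a recent HDF5.jl. The scratch path is just an illustration; the dataset name "string" and its contents match the output shown further down.

```julia
using HDF5

path = tempname() * ".h5"              # hypothetical scratch file

fid = h5open(path, "w")                # open for writing, replacing any old file
write(fid, "string", "Hello World")    # a dataset called "string" at the root
close(fid)                             # explicit close, since there is no do-block
```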
We can open a file in three ways:
"w": Write. Will overwrite anything already there.
"r": Read-only.
"r+": Read-write. Preserving existing contents.
If we open with this syntax, we always have to remember to close it with close().
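If remembering close() sounds error-prone, h5open also accepts a do-block, which closes the file for you when the block ends. The sketch below walks through the three modes that way. The dataset names "A" and "B" are made up for the example.

```julia
using HDF5

path = tempname() * ".h5"      # hypothetical scratch file

# "w" creates (or truncates) the file. The do-block closes it for us,
# even if the body throws.
h5open(path, "w") do fid
    write(fid, "A", [1, 2, 3])
end

# "r+" reopens the file read-write and keeps what is already there.
h5open(path, "r+") do fid
    write(fid, "B", [4.0, 5.0])
end

# "r" is read-only.
names_in_file = h5open(path, "r") do fid
    keys(fid)
end
println(names_in_file)
```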
Navigating a File
Now let’s see if we were successful by reading. Instead of reading the dataset, we are going to check out the structure of the file first.
names(fid) tells us what is inside the location fid.
dump(fid) is much more in depth, exploring everything below fid. If we had a bunch of subdirectories, it would go down each one to see what was there.
Both these functions help you find your way around a file.
HDF5.HDF5File len 1
string: HDF5Dataset () : Hello World
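One caveat if you are following along with a newer HDF5.jl: as far as I know, names has been renamed to keys there, and the type names no longer look like the output above. The sketch below uses keys plus a small helper of my own (walk is not part of the package) to descend through a file in the spirit of dump.

```julia
using HDF5

path = tempname() * ".h5"      # hypothetical scratch file
h5open(path, "w") do fid
    write(fid, "string", "Hello World")
end

# Hypothetical helper: print every name below `obj`, descending into groups.
function walk(obj, indent=0)
    for k in keys(obj)
        child = obj[k]
        println(" "^indent, k, ": ", typeof(child))
        child isa HDF5.Group && walk(child, indent + 2)
    end
end

top = h5open(path, "r") do fid
    walk(fid)      # everything below the root
    keys(fid)      # just the names directly inside the root
end
```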
Now when we are reading data, we need to know the difference between a dataset and the data the dataset contains.
Look at the example below:
the dataset: HDF5.HDF5Dataset
the string: ASCIIString Hello World
read another way: ASCIIString Hello World
A dataset is like the filename “fairytale.txt”, so we then need to read the file to get “Once upon a time …”.
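In code, the filename-versus-contents distinction looks roughly like this. It is a sketch assuming a recent HDF5.jl, where the dataset type prints as HDF5.Dataset rather than HDF5.HDF5Dataset.

```julia
using HDF5

path = tempname() * ".h5"      # hypothetical scratch file
h5open(path, "w") do fid
    write(fid, "string", "Hello World")
end

dset_type, s1, s2 = h5open(path, "r") do fid
    dset = fid["string"]       # a handle to the dataset, not its contents
    (typeof(dset), read(dset), read(fid, "string"))
end

println("the dataset: ", dset_type)
println("the string: ", s1)
println("read another way: ", s2)
```

Indexing with fid["string"] only hands back the handle; read is what actually pulls the data into Julia.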
I’ve talked about groups, but we haven’t done anything with them yet. Let’s make some!
Here we use g_create to create two groups, one inside the other. For the subgroup, its parent is g, so we have to create it at location g. Just like in a filesystem, its name/path is nested within its parent’s path.
HDF5.HDF5File len 1
mygroup: HDF5.HDF5Group len 1
mysubgroup: HDF5.HDF5Group len 0
path of h: /mygroup/mysubgroup
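For readers on a newer HDF5.jl, where I believe g_create has become create_group and name is spelled HDF5.name, the same nesting can be sketched like this:

```julia
using HDF5

path = tempname() * ".h5"      # hypothetical scratch file

group_path, top, inner = h5open(path, "w") do fid
    g = create_group(fid, "mygroup")       # lives at the root
    h = create_group(g, "mysubgroup")      # created at location g
    (HDF5.name(h), keys(fid), keys(g))
end

println("path of h: ", group_path)
```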
Say in a file I want to include the information that I ran the simulation with 100 sites, at 1 Kelvin, for 100,000 timesteps. Instead of creating new datasets for each of these individual numbers, I can create attributes and tie them to either a group or a dataset.
typeof attrs: HDF5.HDF5Attributes
N Sites: 100
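A sketch of attaching those simulation parameters to a group, assuming a recent HDF5.jl where the accessor is attributes (older versions used attrs). The group name "run1" and the attribute names other than "N Sites" are made up for illustration.

```julia
using HDF5

path = tempname() * ".h5"      # hypothetical scratch file

h5open(path, "w") do fid
    g = create_group(fid, "run1")              # hypothetical group name
    attributes(g)["N Sites"] = 100
    attributes(g)["Temperature"] = 1.0         # Kelvin; name is an assumption
    attributes(g)["Timesteps"] = 100_000       # name is an assumption
end

n_sites = h5open(path, "r") do fid
    read(attributes(fid["run1"])["N Sites"])
end
println("N Sites: ", n_sites)
```

Keeping small metadata like this in attributes means the datasets themselves hold only the simulation output.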
Before diving in to learn how to use this, think about whether you need it or not. How large and complex is your data? Is it worth the time to learn? While the syntax might be relatively simple in Julia, ASCII files are still much easier to deal with.
If you are going to play around or use this format, I recommend getting an HDF viewer, like HDFView. While you can have much more control via code, sometimes it is just that much simpler to check everything is working with a GUI.
For more information, check out the package page at HDF5.jl or the HDF Group page.
I’ve shown some of the basic functionality in simple test cases. If you want more control, you might just have to work a bit for it.