Yesterday we looked at how time series data can be sucked into Julia from Quandl. What happens if your data are sitting in a database? No problem, Julia can handle that too. There are a number of database packages available. I’ll be focusing on SQLite and ODBC, but it might be worthwhile checking out JDBC, LevelDB and LMDB too.
SQLite
SQLite is a lightweight transactional SQL database engine that does not require a server or any major configuration. Installation is straightforward on most platforms.
The first step towards using SQLite from Julia is to load the package.
julia> using SQLite
Next, for illustrative purposes, we’ll create a database (which exists as a single file in the working directory) and add a table which we’ll populate directly from a delimited file.
Most of the expected SQL operations are supported by SQLite (check the documentation) and hence also by the Julia interface. When we’re done we close the database connection.
julia> close(db)
Of course, the database we created in Julia is now available through the shell too.
colliera@propane:~/proj/Z-212-language-julia/src$ ls -l passwd.sqlite
-rw-r--r-- 1 colliera colliera 6144 Sep 18 07:21 passwd.sqlite
colliera@propane:~/proj/Z-212-language-julia/src$ sqlite3 passwd.sqlite
SQLite version 3.8.7.4 2014-12-09 01:34:36
Enter ".help" for usage hints.
sqlite> pragma table_info(passwd);
0|username|TEXT|0||0
1|password|TEXT|0||0
2|UID|REAL|0||0
3|GID|REAL|0||0
4|comment|TEXT|0||0
5|homedir|TEXT|0||0
6|shell|TEXT|0||0
sqlite>
ODBC
If you need to access an enterprise DB (for example, Oracle, PostgreSQL, MySQL, Microsoft SQL Server or DB2) then the ODBC interface will be the way to go. To avoid the overhead of using one of these fancy DBs, I will demonstrate Julia’s ODBC functionality using the SQLite database we created above. Before we do that though, you’ll need to setup ODBC for SQLite. It’s not an onerous procedure at all. Then we fire up the ODBC package and we’re ready to roll.
julia> using ODBC
First we’ll check which drivers are available for ODBC (just SQLite in my case) and what data source names (DSNs) are registered.
We see that there is a DSN available for the passwd database. So we create a connection:
julia> db = ODBC.connect("passwd")
ODBC Connection Object
----------------------
Connection Data Source: passwd
passwd Connection Number: 1
Contains resultset(s)? No
At this point I’d like to execute a query. However, somewhat disappointingly, this doesn’t work. No error message but also no results. I’ve logged an issue with the package maintainer, so hopefully this will be resolved soon.
julia> query("SELECT * FROM passwd LIMIT 5;", db)
0x0 DataFrame
What’s promising though is that I can still retrieve the metadata for that query.
Again, when we’re done, we close the database connection.
julia> disconnect(db)
We’re now covered a number of means for getting data into Julia. Over the next few days we’ll be looking at Julia’s capabilities for analysing data. Stay tuned. In the meantime you can check out the code for today (and previous days) on github. Also take a look at the talk below.
The DataFrame type in Julia is not dissimilar to the analogous types in R and Python/pandas. It provides a way of grouping data which is convenient for analysis and reminiscent of a database table.
I’m assuming that you’ve already installed the DataFrames package. If not, take a look at yesterday’s post. The first step is then to load it up:
julia> using DataFrames
Next we can start assembling our data. A DataFrame can be built up one field at a time (as is done in the example below) or by passing all of the data at once to the constructor.
You can dig deeper with describe(), which gives a simple statistical summary of each column. It does essentially the same thing as summary() in R.
Indexing operations allow you to access the data in various ways. There’s also head() and tail(), which return the first and last few records in the data.
You can apply a range of operations to columns. Note, however, that there is a subtle difference in syntax: while == is the normal equality operator, .== is the element-wise equality operator which must be applied to columns in order to make element-by-element comparisons. A similar syntax pertains to other operators like .<= and .>.
Of course you're not likely to construct any serious collection of data manually. It's more likely to come from a database or file. There are various ways to accomplish this. The simplest is reading from a delimited file.
Note how names!() was used to alter the column names. There are other ways of loading data from a delimited text file that will handle column names more elegantly. We'll get to those in a few days time.
Watch the video below and then read further to find out about the DataArrays package.
DataArrays
Data are seldom perfect and missing values are not uncommon. Now, you might use some a particular numerical value (like -9999, for example) to indicate a missing datum. However, this is a bit of a kludge, difficult to maintain and open to ambiguity. The DataArrays package introduces the singleton NA type which can be used to unambiguously indicate missing data.
A vector with missing data is created using the @data macro.
julia> using DataArrays
julia> x = @data([1, 2, 3, 4, NA, 6])
6-element DataArray{Int64,1}:
1
2
3
4
NA
6
Functions anyna() and allna() can be used to test whether any or all of the elements of a vector are missing.
Two ways of dealing with NAs are to either drop them or replace them with another value.