Tag Archives: Data Visualization

Visualizing Analytics Languages With VennEuler.jl

By: randyzwitch - Articles

Re-posted from: http://randyzwitch.com/visualizing-analytics-languages-venneuler-jl/

It often doesn’t take much to get me off track, and on a holiday weekend…well, I was just begging for a fun way to shirk. Enter Harlan Harris:

Hey, I’m someone looking for something to do! And I like writing Julia code! So let’s have a look at recreating this diagram in Julia using VennEuler.jl (IJulia Notebook link):

Source: Revolution R/KDNuggets

http://blog.revolutionanalytics.com/2014/08/r-tops-kdnuggets-data-analysis-software-poll-for-4th-consecutive-year.html

Installing VennEuler.jl

Because VennEuler.jl is not in METADATA as of the time of writing, instead of using Pkg.add() you’ll need to run:

 Pkg.clone("https://github.com/HarlanH/VennEuler.jl.git")

Note that VennEuler uses some of the more exotic packages (at least to me) like NLopt and Cairo, so you might need to have a few additional dependencies installed with the package.

Data

The data was a bit confusing to me at first, since the percentages add up to more than 100% (people could vote multiple times). In order to create a dataset to use, I took the percentages, multiplied by 1000, then re-created the voting pattern. The data for the graph can be downloaded from this link.
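
As a rough sketch of that re-creation step: if each region of the diagram is expressed as a percentage of respondents, it can be expanded into individual rows of a 1000-respondent dataset. The shares below are made-up placeholders (not the poll's figures), and `expand_votes` is a hypothetical helper name, not part of VennEuler.jl.

```julia
# Sketch: expand per-combination percentages into individual "ballots".
# The shares below are made-up placeholders, NOT the KDnuggets poll results.
labels = ["R", "SAS", "Python"]

# Each key marks which tools one respondent ticked, in the order of `labels`.
shares = Dict(
    (true,  false, false) => 40.0,  # R only
    (false, true,  false) => 20.0,  # SAS only
    (false, false, true)  => 25.0,  # Python only
    (true,  false, true)  => 15.0,  # R and Python
)

# Build a respondents-by-tools Bool matrix with roughly n rows in total.
function expand_votes(shares; n = 1000)
    rows = NTuple{3,Bool}[]
    for (combo, pct) in shares
        count = round(Int, pct / 100 * n)   # respondents with this exact combination
        append!(rows, fill(combo, count))
    end
    # One row per respondent, one column per tool
    return [row[j] for row in rows, j in 1:3]
end

votes = expand_votes(shares)

# Percentage of respondents using each tool; these can sum to more than 100
usage = vec(sum(votes, dims = 1)) .* 100 ./ size(votes, 1)
```

Because one respondent can appear in several columns, the per-tool percentages in `usage` add up to more than 100, which is the effect described above.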

Code – Circles

With a few modifications, I basically re-purposed Harlan’s code from the package test files. The circle result is as follows:

[Figure: VennEuler.jl diagram drawn with circles]

Since the percentages of R, SAS, and Python users aren’t dramatically different (49.81%, 33.42%, and 40.97%, respectively) and the shapes are circles, it’s a bit hard to tell that R is about 16 percentage points higher than SAS and 9 percentage points higher than Python.
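
A quick calculation suggests why the circles look so similar. Assuming each circle's area is proportional to its share of respondents, the radius grows only with the square root of the share. The function name `radius_ratio` is just for illustration.

```julia
# Share of respondents using each tool, as quoted above
r, sas, python = 49.81, 33.42, 40.97

# Differences in percentage points
gap_sas    = r - sas       # about 16.4
gap_python = r - python    # about 8.8

# For area-proportional circles, area = pi * radius^2,
# so the radius scales with the square root of the area.
radius_ratio(a, b) = sqrt(a / b)

ratio_sas    = radius_ratio(r, sas)      # about 1.22
ratio_python = radius_ratio(r, python)   # about 1.10
```

So a gap of roughly 16 percentage points shows up as a circle whose radius is only about 22% larger, which is easy to miss by eye.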

Code – Rectangles

Alternatively, we can use rectangles to represent the areas:

[Figure: VennEuler.jl diagram drawn with rectangles]

Here, it’s a slight bit easier to see that SAS and Python are about the same area-wise and that R is larger, although the different dimensions do obscure this fact a bit.

Summary

If I spent more time with this package, I’m sure I could make something even more aesthetically pleasing. And for that matter, it’s still a pre-production package that will no doubt get better in the future. But at the very least, there is a way to create an area-accurate representation of relationships using VennEuler.jl in Julia.

Using Julia As A ‘Glue’ Language

By: randyzwitch - Articles

Re-posted from: http://randyzwitch.com/julia-odbc-jl/

While much of the focus in the Julia community has been on the performance aspects of Julia relative to other scientific computing languages, Julia is also perfectly suited to ‘glue’ together multiple data sources/languages. In this blog post, I will cover how to create an interactive plot using Gadfly.jl, by first preparing the data using Hadoop and Teradata Aster via ODBC.jl.

The example problem I am going to solve is calculating and visualizing the number of airplanes in the air, by hour, in the U.S. for the year 1987. Because of the structure and storage of the underlying data, I will need to write some custom Hive code, upload the data to Teradata Aster via a command-line utility, re-calculate the number of flights per hour using a built-in Aster function, then use Julia to visualize the data.

Step 1: Getting Data From Hadoop

In a prior set of blog posts, I talked about loading the airline dataset into Hadoop, then analyzing the dataset using Hive or Pig. Using ODBC.jl, we can submit our Hive queries from Julia. The hardest part of setting up this process is making sure that you have the appropriate Hive drivers for your Hadoop cluster and credentials (which isn’t covered here). Once you have your DSN set up, running Hive queries takes only a few lines.

I write my query as a Julia string, to keep my code easily modifiable. Then, I pass the Julia string object to the query() function, along with my ODBC connection object. This query runs on Hadoop through Hive, then streams the result directly to my local hard drive, making this a very RAM efficient (though I/O inefficient!) operation.
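
To illustrate the idea of keeping the query in an ordinary Julia string, here is a minimal sketch. The table name, column names, and the `year` variable are hypothetical stand-ins, since the real schema is not shown here.

```julia
# Hypothetical table and column names; the post's real schema is not shown here.
year = 1987

# A triple-quoted string keeps multi-line SQL readable, and $year is
# interpolated, so the query is easy to modify.
hive_query = """
    select origin, takeoff_time, landing_time
    from airline_flights
    where year = $year
    """

# Per the text above, this string is then handed to ODBC.jl's query()
# function together with the ODBC connection object.
```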

Step 2: Shelling Out To Load Data To Aster

Now that I’ve created the file with my Hadoop results in it, I have a decision point: I can either A) do the rest of the analysis in Julia or B) use a different tool for my calculations. Because this is a toy example, I’m going to use Teradata Aster to do my calculations, which provides a convenient function called ‘burst()’ to regularize timestamps into fixed intervals. But before I can use Aster to ‘burst’ my data, I first need to upload it to the database.

While I could loop over the data within Julia and insert each record one at a time, Teradata provides a command-line utility to upload data in parallel. Running command-line scripts from within Julia is as easy as calling run(), with each command surrounded by backticks.

While I could’ve run this at the command line, having all of this within an IJulia Notebook keeps all my work together, should I need to re-run it in the future.
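
A minimal sketch of the backtick syntax, using `echo` as a harmless stand-in for the Aster loader (whose name and flags are not shown here). The file name is hypothetical, and the example assumes a Unix-like system where `echo` is available as a program.

```julia
# Hypothetical file name, standing in for the Hive results written to disk.
datafile = "airline_times.csv"

# Backticks build a Cmd object; nothing is executed yet.
# Interpolated values are passed as single arguments, so no shell quoting is needed.
cmd = `echo uploading $datafile`

run(cmd)                      # runs the command, printing its output
output = read(cmd, String)    # or capture the output as a String instead
```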

Step 3: Using Aster For Calculations

With my data now loaded in Aster, I can normalize the timestamps to UTC, then ‘burst’ the data into regular time intervals. Again, all of this can be done via ODBC from within Julia.

Since it might not be clear what I’m doing here, the ‘burst()’ function in Aster takes a row of data with a start and end timestamp, and potentially returns multiple rows which normalize the time between the timestamps. If you’re familiar with pandas in Python, it’s similar to ‘resample’ on a series of timestamps.
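
For readers who want to see the idea in code, here is a rough Julia analogue of bursting a start/end pair into hourly rows and then counting planes per hour. It is only a sketch of the concept, not Aster's implementation; `burst_hours` and `planes_per_hour` are hypothetical names, and the two flights are made up.

```julia
using Dates

# Return every hour bucket that a flight touches between takeoff and landing.
function burst_hours(takeoff::DateTime, landing::DateTime)
    return collect(floor(takeoff, Hour(1)):Hour(1):floor(landing, Hour(1)))
end

# Count how many flights are airborne in each hour bucket.
function planes_per_hour(flights)
    counts = Dict{DateTime,Int}()
    for (takeoff, landing) in flights
        for hour in burst_hours(takeoff, landing)
            counts[hour] = get(counts, hour, 0) + 1
        end
    end
    return counts
end

# Two made-up flights, for illustration only
flights = [
    (DateTime(1987, 1, 1, 9, 30), DateTime(1987, 1, 1, 11, 15)),
    (DateTime(1987, 1, 1, 10, 5), DateTime(1987, 1, 1, 10, 50)),
]
counts = planes_per_hour(flights)
```

Here the first flight contributes to the 9:00, 10:00, and 11:00 buckets, while the second contributes only to 10:00.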

Step 4: Download Smaller Data Into Julia, Visualize

Now that the data has been processed from Hadoop to Aster through a series of queries, we have a much smaller dataset that can be loaded into RAM and processed by Julia. Plotting it with Gadfly (using a d3.js backend for interactivity) produces a chart of the number of planes in the air by hour.

Since this chart is in UTC, it might not be obvious what the interpretation is of the trend. Because the airline dataset represents flights either leaving or returning to the United States, there are many fewer planes in the air overnight and the early morning hours (UTC 7-10, 2-5am Eastern). During the hours when the airports are open, there appears to be a limit of roughly 2500 planes per hour in the sky.
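
The hour conversion used above can be checked with a one-liner. This sketch assumes a fixed UTC-5 offset (Eastern Standard Time) and ignores daylight saving time; `to_eastern` is a hypothetical helper name.

```julia
# Fixed UTC-5 offset (Eastern Standard Time); daylight saving time is ignored here.
to_eastern(utc_hour) = mod(utc_hour - 5, 24)

# The quiet overnight window, UTC hours 7 through 10, expressed in Eastern time
quiet_hours_eastern = to_eastern.(7:10)
```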

Why Not Do All Of This In Julia?

At this point, you might wonder why I went through all of this effort. Couldn’t this all be done in Julia?

Yes, you probably could do all of this work in Julia with a sufficiently large amount of RAM. As a proof-of-concept, I hope I’ve shown that there is much more to Julia than micro-benchmarking Julia’s speed relative to other scientific programming languages. You’ll notice that in none of my code have I used any type annotations, as none would really make sense (nor would they improve performance). And although this is a toy example purposely using multiple systems, I much more frequently use Julia in this manner at work than doing linear algebra or machine learning.

So next time you’re tempted to use Python or R or shell scripting or whatever, consider Julia as well. Julia is just as at-home as a scripting language as a scientific computing language.

 

Creating Network Diagrams in Plotly from Julia

By: benjamin

Re-posted from: http://badhessian.org/2014/05/creating-network-diagrams-in-plotly-from-julia/

I’ve been using R for years and absolutely love it, warts and all, but it’s been hard to ignore some of the publicity the Julia language has been receiving. To put it succinctly, Julia promises both speed and intuitive use to meet contemporary data challenges. As soon as I started dabbling in it about six months ago I was sold. It’s a very nice language. After I had understood most of the language’s syntax, I found myself thinking “But can it do networks?”

As it stands, there’s currently one library, Graphs, that addresses networks. Relative to R’s network packages, Graphs.jl has a very slick way of storing network data. Wasserman and Faust (1994) define a network as a set of actors and the set of the relationships between them, and Graphs.jl reminds users of this conceptualization. Unlike R, Julia requires that users actively consider data types, so naturally, each actor in a network is represented by an object of type “KeyVertex” or “ExVertex.” The difference between these two vertex types is whether or not actors need attributes and labels. Should an actor have a label and attributes, this information is stored within each ExVertex object within a network. A relationship between two actors is likewise represented by an object of type “Edge” or “ExEdge” (again, differing based upon whether or not the edge can store an attribute). Creating an edge object requires a user to specify the sending and receiving vertices, expressed as vertex objects.

In short, each network is represented by an array of vertices of a particular type as well as an array of edges of a particular type. Once the user has decided upon the types of vertices in her network, she must then create the vertex objects, include them in her network or “graph,” create edge objects that include pairs of vertex objects, then include those edges in her graph as well. After the user has created a graph object, she can then easily retrieve the vertex and edge objects along with any attributes they might have.
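
To make that workflow concrete, here is a small stand-alone sketch of the same data model in plain Julia. The types `Actor`, `Tie`, and `Network` are illustrative stand-ins and are not the Graphs.jl definitions of ExVertex or Edge; the actors and coordinates are made up.

```julia
# Illustrative types only; these are NOT the Graphs.jl definitions.
struct Actor
    index::Int
    label::String
    attributes::Dict{String,Any}   # e.g. layout coordinates
end

struct Tie
    source::Actor   # sending vertex
    target::Actor   # receiving vertex
end

struct Network
    actors::Vector{Actor}
    ties::Vector{Tie}
end

# 1. Create the vertex objects
a = Actor(1, "A", Dict{String,Any}("x" => 0.0, "y" => 0.0))
b = Actor(2, "B", Dict{String,Any}("x" => 1.0, "y" => 0.5))
c = Actor(3, "C", Dict{String,Any}("x" => 0.5, "y" => 1.0))

# 2. Include them in a network, then 3. add edges built from pairs of vertex objects
net = Network([a, b, c], Tie[])
push!(net.ties, Tie(a, b))
push!(net.ties, Tie(b, c))

# 4. Retrieve vertices, edges, and their attributes
labels = [actor.label for actor in net.actors]
senders_to_b = [t.source.label for t in net.ties if t.target.index == b.index]
```
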
After I create a network data object, I typically want to visualize it immediately. Unfortunately, for now the functionality in Graphs.jl is quite limited compared to igraph and the statnet suite in R. As it stands, there are no network visualization functions and just one layout function (random). Because Graphs.jl is relatively young and the developers are busy contributing to many of Julia’s other libraries, this limitation is to be expected. Nevertheless, I’ve been itching to make a network plot in Julia so I decided to write up a function.

Network plotting functions need to accept a few traditional arguments. First, they obviously need to accept a network/graph object. Second, they need the ability to express vertex and edge attributes through varying colors, shapes, and sizes. Third, edge directionality needs to be indicated, should it exist. As a matter of personal preference, I find that opacity can be a useful visualization tool and that intensity-tapered-curved edges are a good way to portray directionality. The defaults should also not be fugly.

Julia has a few visualization tools and for this project I decided to render the networks in plot.ly. Plot.ly has a few benefits over the other packages in Julia in that it stores the visualizations and data online, it allows collaboration, and the platform is also accessible through R, MATLAB, and Python. If network diagrams are basically many line and scatter plots, then, in theory, plot.ly should have no trouble rendering network visualizations.

To demonstrate, let’s do a “Hello, world” with Padgett’s Florentine family marriage network. Here, marriage is an undirected relationship, the labels are meaningful, and we’re omitting the other attributes. For the data examples here, I’ve cheated a bit and calculated the vertex layout in igraph for R, but the rest was done in Julia.
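
The claim that a network diagram is just line and scatter plots can be sketched in a few lines. The coordinates and edges below are made up, `edge_lines` is a hypothetical helper, and the NaN-as-break convention is one that some plotting tools follow; whether the plotting script here works this way is an assumption.

```julia
# Vertex coordinates and edges as index pairs; made-up values for illustration.
coords = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0)]
edges  = [(1, 2), (2, 3)]

# Scatter data: one marker per vertex.
vertex_x = [p[1] for p in coords]
vertex_y = [p[2] for p in coords]

# Line data: each edge contributes its two endpoints plus a NaN,
# which some plotting tools treat as a break between segments.
function edge_lines(coords, edges)
    xs, ys = Float64[], Float64[]
    for (i, j) in edges
        append!(xs, (coords[i][1], coords[j][1], NaN))
        append!(ys, (coords[i][2], coords[j][2], NaN))
    end
    return xs, ys
end

edge_x, edge_y = edge_lines(coords, edges)
```
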
After starting Julia, you’ll need to load the data and plotting function, then you’ll need to enter your own plot.ly user name and API key, and lastly run the function.

  # Fetch and load the example network data sets
  download("http://pastebin.com/raw.php?i=Z2t7XRd3", "GraphsPopularDataSets.jl")
  include("GraphsPopularDataSets.jl")
  rm("GraphsPopularDataSets.jl")
  # Fetch and load the plotting function
  download("http://pastebin.com/raw.php?i=ymrmxPtU", "Sociogram--graph_plotly.jl")
  include("Sociogram--graph_plotly.jl")
  rm("Sociogram--graph_plotly.jl")
  # See https://plot.ly/julia/getting-started
  yourplotlyname = "Not this string"
  yourplotlyAPIkey = "Not this string, either"
  # Plot the Florentine marriage network with vertex labels
  floplot = graph_plotly(flograph, yourplotlyname, yourplotlyAPIkey, vertexsize = 0.75, label = true)
  println(floplot["url"]) # Find the plot at this URL

Going to that URL, you should see a plot that looks like this one:

[Figure: Florentine Marriage Network]

Directed networks need a bit more patience. Directed relationships are conventionally represented in social network diagrams using an arrow. On the downside, at the moment I haven’t found a way to use arrows in plotly. On the upside, the plotting script will indicate directionality with intensity-tapered-curved edges, a representation that is easier to interpret than arrows. On the downside (again), this method is much more computationally expensive.

The parameter of interest here is the “gradient.” Each edge contains a number of small line plots equal to gradient. By default, these line plots go from wider, darker, and more opaque to thinner, lighter, and more transparent as they leave the source vertex and approach the target vertex. Setting the gradient parameter to a higher value will improve the quality of the visual results, though it will take longer to load in your browser; setting the gradient parameter to a lower value will do just the opposite. For the directed network, let’s look at Coleman’s (1964) high school interaction (“friendship”) network.

  colemanplot = graph_plotly(colemangraph, yourplotlyname, yourplotlyAPIkey, vertexsize = 0.33, vertexopacity = .95, vertexborderopacity = 1, gradient = 150)
  println(colemanplot["url"]) # Find the plot at this URL
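
To picture what the gradient parameter does, here is a simplified sketch that splits one edge into `gradient` pieces whose width and opacity fade linearly from source to target. It uses straight rather than curved edges, and the function name, keyword names other than `gradient`, and default values are all hypothetical; this is not the code of the plotting script itself.

```julia
# Split the straight line from `source` to `target` into `gradient` pieces whose
# width and opacity shrink linearly toward the target.
function tapered_edge(source, target; gradient = 10,
                      maxwidth = 4.0, minwidth = 0.5,
                      maxopacity = 1.0, minopacity = 0.1)
    pieces = []
    for k in 1:gradient
        t0 = (k - 1) / gradient          # start of this piece, as a fraction of the edge
        t1 = k / gradient                # end of this piece
        # 0 at the source end, 1 at the target end
        mix = gradient == 1 ? 0.0 : (k - 1) / (gradient - 1)
        push!(pieces, (
            from    = source .+ t0 .* (target .- source),
            to      = source .+ t1 .* (target .- source),
            width   = maxwidth   + mix * (minwidth   - maxwidth),
            opacity = maxopacity + mix * (minopacity - maxopacity),
        ))
    end
    return pieces
end

pieces = tapered_edge((0.0, 0.0), (1.0, 1.0); gradient = 5)
```

Each piece would become its own small line plot, which is why a higher gradient gives a smoother taper but more traces for the browser to draw.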

[Figure: Coleman high school network rendered in plot.ly]

Right now, the script has a few limitations. First, the graph coordinates must be saved as vertex attributes “x” and “y” and I’ve had to calculate these outside of Julia. Likewise, the vertices must be of type ExVertex, as the type KeyVertex cannot store attributes. Second, curved edges for undirected graphs have not yet been implemented. Lastly, I’m exploring different ideas to get the speed up for directed graphs. If anyone has any suggestions or feedback, I’d greatly appreciate it!

Helper functions / Data / Plotting function