Author Archives: randyzwitch - Articles

Code Refactoring Using Metaprogramming

By: randyzwitch - Articles

Re-posted from: http://randyzwitch.com/julia-metaprogramming-refactoring/

It’s been nearly a year since I wrote Twitter.jl, back when I seemingly had MUCH more free time. In these past 10 months, I’ve used Julia quite a bit to develop other packages, and I try to use it at work when I know I’m not going to be collaborating with others (since my colleagues don’t know Julia, not because it’s bad for collaboration!).

One of the things that’s obvious from my earlier Julia code is that I didn’t understand how powerful metaprogramming can be, so here’s a simple example where I can replace 50 lines of Julia code with 10.

CTRL-A, CTRL-C, CTRL-P. Repeat.

Admittedly, when I started on the Twitter package, I fully meant to go back and clean up the codebase, but moved onto something more fun instead. The Twitter package started out as a means of learning how to use the Requests.jl library to make API calls, figured out the OAuth syntax I needed (which itself should be factored out of Twitter.jl), then copied-and-pasted the same basic function structure over and over. While fast, what I was left with was this (currently, the help.jl file in the Twitter package):It’s pretty clear that this is the same exact code pattern, right down to the spacing! The way to interpret this code is that for these five Twitter API methods, there are no required inputs. Optionally, there is the ‘options’ keyword that allows for specifying a Dict() of options. For these five functions, there are no options you can pass to the Twitter API, so even this keyword is redundant. These are simple functions so I don’t gain a lot by way of maintainability by using metaprogramming, but at the same time, one of the core tenets of programming is ‘Dont Repeat Yourself’, so let’s clean this up.

For :symbol in symbolslist…

In order to clean this up, we need to take out the unique parts of the function, then pass them as arguments to the @eval macro as follows:
What’s happening in this code is that I define two tuples: one of function names (as symbols, denoted by ‘:’ ) and one of the API endpoints. We can then iterate over the two tuples, substituting the function names and endpoints into the code. When the package is loaded, this code evaluates, defining the five functions for use in the Twitter package.

Wha?

Yeah, so metaprogramming can be simple, can it can also be mind-bending. It’s one thing to not repeat yourself, it’s another to write something so complex that even YOU can’t remember how the code works. But somewhere in between lies a sweet spot where you can re-factor whole swaths of code and streamline your codebase.

Metaprogramming is used throughout the Julia codebase, so if you’re interested in seeing more examples of metaprogramming, check out the Julia source code, the Requests.jl package (where I first saw this) or really anyone who actually knows what they are doing. I’m just a metaprogramming pretender at this point :)

 

To read additional discussion around this specific example, see the Julia-Users discussion at:

https://groups.google.com/forum/#!topic/julia-users/zvJmqB2N0GQ

Visualizing Analytics Languages With VennEuler.jl

By: randyzwitch - Articles

Re-posted from: http://randyzwitch.com/visualizing-analytics-languages-venneuler-jl/

It often doesn’t take much to get me off track, and on a holiday weekend…well, I was just begging for a fun way to shirk. Enter Harlan Harris:

Hey, I’m someone looking for something to do! And I like writing Julia code! So let’s have a look at recreating this diagram in Julia using VennEuler.jl (IJulia Notebook link):

Source: Revolution R/KDNuggets

http://blog.revolutionanalytics.com/2014/08/r-tops-kdnuggets-data-analysis-software-poll-for-4th-consecutive-year.html

Installing VennEuler.jl

Because VennEuler.jl is not in METADATA as of the time of writing, instead of using Pkg.add() you’ll need to run:

 Pkg.clone("https://github.com/HarlanH/VennEuler.jl.git")

Note that VennEuler uses some of the more exotic packages (at least to me) like NLopt and Cairo, so you might need to have a few additional dependencies installed with the package.

Data

The data was a bit confusing to me at first, since the percentages add up to more than 100% (people could vote multiple times). In order to create a dataset to use, I took the percentages, multiplied by 1000, then re-created the voting pattern. The data for the graph can be downloaded from this link.

Code – Circles

With a few modifications, I basically re-purposed Harlan’s code from the package test files. The circle result is as follows:

venneulercircles

Since the percentage of R, SAS, and Python users isn’t too dramatically different (49.81%, 33.42%, 40.97% respectively) and the visualizations are circles, it’s a bit hard to tell that R is about 16% points higher than SAS and 9% points higher than Python.

Code – Rectangles

Alternatively, we can use rectangles to represent the areas:

venneulerrectangles

Here, it’s a slight bit easier to see that SAS and Python are about the same area-wise and that R is larger, although the different dimensions do obscure this fact a bit.

Summary

If I spent more time with this package, I’m sure I could make something even more aesthetically pleasing. And for that matter, it’s still a pre-production package that will no doubt get better in the future. But at the very least, there is a way to create an area-accurate representation of relationships using VennEuler.jl in Julia.

Six of One (Plot), Half-Dozen of the Other

By: randyzwitch - Articles

Re-posted from: http://badhessian.org/2014/07/six-of-one-plot-half-dozen-of-the-other/

This is a guest post by Randy Zwitch (@randyzwitch), a digital analytics and predictive modeling consultant in the Greater Philadelphia area. Randy blogs regularly about Data Science and related technologies at http://randyzwitch.com. He’s blogged at Bad Hessian before here.

WordPress Stats - Visitors vs. Views
WordPress Stats – Visitors vs. Views

For those of you with WordPress blogs and have the Jetpack Stats module installed, you’re intimately familiar with this chart. There’s nothing particularly special about this chart, other than you usually don’t see bar charts with the bars shown superimposed.

I wanted to see what it would take to replicate this chart in R, Python and Julia. Here’s what I found. (download the data).

R: ggplot2

Although I prefer to use other languages these days for my analytics work, there’s a certain inertia/nostalgia that when I think of making charts, I think of using ggplot2 and R. Creating the above chart is pretty straightforward, though I didn’t quite replicate the chart, as I couldn’t figure out how to make my custom legend not do the diagonal bar thing.

The R Cookbook talks about a hack to remove the diagonal lines from legends, so I don’t feel too bad about not getting it. I also couldn’t figure out how to force ggplot2 to give me the horizontal line at 10000. If anyone in the R community knows how to fix these, let me know!

(Pythonistas: I’m aware of the ggplot port by Yhat; functionality I used in my R code is still in TODO, so I didn’t pursue plotting with ggplot in Python)

R: Base Graphics

Of course, not everyone finds ggplot2 to be easy to understand, as it requires a different way of thinking about coding than most ‘base’ R functions. To that end, there are the base graphics built into R, which produced this plot: wordpress-base-rWhile I was able to nearly replicate the WordPress chart (except for the feature of having the dark bars slightly smaller width than the lighter), the base R syntax is horrid. The abbreviations for plotting arguments are indefensible, the center and width keywords seem to shift the range of the x-axis instead of changing the actual bar width, and in general, the experience plotting using base R was the worst of the six libraries I evaluated.

Python: matplotlib

In the past year or so, there’s been quite a lot of activity towards improving the graphics capabilities in Python. Historically, there’s been a lot of teeth-gnashing about matplotlib being too low-level and hard to work with, but with enough effort, the results are quite pleasant. Unlike with ggplot2 and base R, I was able to replicate all the features of the WordPress plot:wordpress-matplotlib

Python: Seaborn

One of the aforementioned improvements to matplotlib is Seaborn, which promises to be a higher-level means of plotting data than matplotlib, as well as adding new plotting functionality common in statistics and research. Re-creating this plot using Seaborn is a waste of the additional functionality of Seaborn, and as such, I found it more difficult to make this plot using Seaborn than I did with matplotlib.

To replicate the plot, I ended up hacking a solution together using both Seaborn functionality and matplotlib in order to be able to set bar width and to create the legend, which defeats the purpose of using Seaborn in the first place.

Julia: Gadfly

In the Julia community, Gadfly is clearly the standard for plotting graphics. Supporting d3.js, PNG, PS, and PDF, Gadfly is built to work with many popular back-end environments. I was able to replicate everything about the WordPress graph except for the legend:wordpress-julia-gadflyWhile Gadfly took a line or two more than base R in terms of fewest lines of code, I find the Gadfly syntax significantly more pleasant to work with.

Julia: Plot.ly

Plot.ly is an interesting ‘competitor’ in this challenge, as it’s not a language-specific package per-se. Rather, Plot.ly is a means of specifying plots using JSON, with lightweight Julia/Python/MATLAB/R wrappers. I was able to replicate nearly everything about the WordPress plot, with the exception of not having a line at 10000, having the legend vertical instead of horizontal and I couldn’t figure out how to set the bar widths separately. wordpress-julia-plotly

And The Winner Is…matplotlib?!

If you told me at the beginning of this exercise that matplotlib (and by extension, Seaborn) would be the only library that I would be able to replicate all the features of the WordPress graph, I wouldn’t have believed it. And yet, here we are. ggplot2 was certainly very close, and I’m certain that someone knows how to fix the diagonal line issue. I suspect I could submit an issue ticket to Gadfly.jl to get the feature added to create custom legends (and for that matter, make the request of Plot.ly for horizontal legends), so in the future there could be feature parity using these two libraries as well.

I hope we all agree there’s no hope for Base Graphics in R besides quick throwaway plots.

In the end, the best thing I can say from this exercise is that the analytics community is fortunate to have so many talented people working to provide these amazing visualization libraries. This graph was rather pedestrian in nature, so I didn’t even scratch the surface of what these various libraries can do. Even beyond the six libraries I chose, there are others I didn’t choose, including: prettyplotlib (Python), Bokeh (Python), Vincent (Python), rCharts (R), ggvis (R), Winston (Julia), ASCII Plots (Julia) and probably even more that I’m not even aware of! All free and open-source and miles apart from terrible looking Microsoft graphics in Excel and Powerpoint.