Introducing Twitter.jl

By: Randy Zwitch

Re-posted from: http://randyzwitch.com/twitter-api-julia/

This is possibly the latest “announcement” of a package ever, given that Twitter.jl (https://github.com/randyzwitch/Twitter.jl) has actually existed on METADATA (https://github.com/JuliaLang/METADATA.jl) for nearly a year now, but that’s how things go sometimes. Here’s how to get started with Twitter.jl and some of the highlights.

Hello, World!

If ‘Hello, World!’ is the canonical example of getting started with a programming language, the Twitter API is becoming the first place to start for people wanting to learn about APIs. Authenticating with the Twitter API using Julia is similar to using the R or Python packages, except that rather than doing the OAuth “dance”, Twitter.jl takes all four authentication values in one function:

All four of these values can be found after registering at the Twitter Developer page (https://dev.twitter.com/) and creating an application. Having all four values in your script is less secure than just providing the API key and API secret, but in the future, I’ll likely implement the full OAuth “handshake”. One thing to keep in mind with this function as it currently works is that no validation of your credentials is performed; the only thing this function does is define a global variable twittercred for later use by the various functions that create the OAuth headers.

To shout “Hello, World!” to all of your Twitter followers, you can use the following code:

General Package/Function Structure

From the example above, you can see that the function naming follows the Twitter REST API naming convention (https://dev.twitter.com/rest/public), with the HTTP verb first and the endpoint as the remainder of the function name. As such, it’s a good idea at this early package state to have the Twitter documentation open while using this package, so that you can quickly find the methods you are looking for.
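As a rough illustration of the verb-plus-endpoint convention, the sketch below derives a name from an HTTP verb and an endpoint path. The helper `endpoint_function_name` and the two endpoint strings are assumptions made for this example, written in current Julia syntax; the package's function names are written by hand and may not follow this rule exactly.

```julia
# Hypothetical helper (not part of Twitter.jl): build a function name from an
# HTTP verb and an endpoint path, verb first, endpoint as the remainder.
function endpoint_function_name(verb::AbstractString, endpoint::AbstractString)
    path = replace(endpoint, ".json" => "")        # drop a file extension, if any
    return lowercase(verb) * "_" * replace(path, "/" => "_")
end

search_name = endpoint_function_name("GET", "search/tweets")
config_name = endpoint_function_name("GET", "help/configuration.json")
```

Reading a name backwards in the same way (split off the verb, turn underscores into slashes) is usually enough to find the matching page in the Twitter documentation.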

For each function/API endpoint, I’ve gone through and determined which parameters are required; these are required arguments in the Julia functions. For all other options, each function takes a second optional Dict{String, String} for any option shown in the Twitter documentation. While this Dict structure allows for ultimate flexibility (and quick definition of functions!), I do realize that it’s less than optimal that you don’t know what optional arguments each Twitter endpoint allows.
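A minimal sketch of this required-argument-plus-options pattern, assuming a hypothetical function `search_params` that only assembles the query parameters and makes no network call. The parameter name `q` is an assumption for the example, and the options are passed as a keyword here; the code uses current Julia syntax.

```julia
# Hypothetical stand-in for an endpoint wrapper: one required argument (the
# search query) plus an optional Dict of extra API parameters.
function search_params(q::String; options::Dict{String,String} = Dict{String,String}())
    # Required and optional parameters end up in a single query Dict.
    return merge(Dict("q" => q), options)
end

minimal = search_params("#julialang")
with_count = search_params("#julialang"; options = Dict("count" => "100"))
```

The trade-off is the one described above: the Dict accepts anything, so a misspelled option name is not caught on the Julia side.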

As an example, suppose you wanted to search for tweets containing the hashtag #julialang. The minimum function call is as follows:

By default, the API will return the 15 most recent tweets containing the #julialang hashtag. To return the most recent 100 tweets (the maximum per API ‘page’), you can pass the “count” parameter via the Options Dict:

Composite Types and DataFrames definitions

The Twitter API is structured into four return data types: Places (https://dev.twitter.com/overview/api/places), Users (https://dev.twitter.com/overview/api/users), Tweets (https://dev.twitter.com/overview/api/tweets), and Entities (https://dev.twitter.com/overview/api/entities). I’ve mimicked these types using Julia Composite Types (http://julia.readthedocs.org/en/latest/manual/types/#composite-types). As such, most functions in Twitter.jl return an array of a specific type, such as Array{TWEETS,1} from the prior #julialang search example. The benefit of defining custom types for the returned Twitter data is that rudimentary DataFrame methods have also been defined:

I describe these DataFrames as ‘rudimentary’ as they parse the top level of JSON into columns, which results in some DataFrame columns having complex data types such as Dict() (and within the Dict(), nested Dicts!). As a running theme in this post, this is something I hope to get around to improving in the future.
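To make the ‘top level only’ point concrete, here is a small sketch that uses plain Dicts and vectors rather than the DataFrames package. The records and field names are made up for illustration, and this is not how Twitter.jl builds its DataFrames; it only shows why a nested JSON field ends up as a column of Dicts, and what flattening one level would look like.

```julia
# Made-up records shaped like parsed JSON; the field names are illustrative.
tweets = [
    Dict{String,Any}("id" => 1, "text" => "Hello",
                     "user" => Dict{String,Any}("screen_name" => "alice")),
    Dict{String,Any}("id" => 2, "text" => "World",
                     "user" => Dict{String,Any}("screen_name" => "bob")),
]

# Top-level parse: one column per top-level key, values copied as they are.
columns = Dict{String,Any}(
    key => [tweet[key] for tweet in tweets] for key in keys(first(tweets))
)

# The "user" column still holds whole Dicts.
user_column_type = eltype(columns["user"])

# Flattening one level turns the nested field into an ordinary column.
columns["user.screen_name"] = [tweet["user"]["screen_name"] for tweet in tweets]
```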

Want to Get Started Developing Julia? Start Here!

One of the common questions I get asked is how to get started with Julia, both from a learning perspective and from a package development perspective. Hacking away on the core Julia codebase is great if you have the ability, but the code can certainly be intimidating (the people are quite friendly though). Creating a package isn’t necessarily hard, but you have to think about an idea you want to implement. The third alternative is…

…improve the Twitter package! If you go to the GitHub page for Twitter.jl (https://github.com/randyzwitch/Twitter.jl), you’ll see a long list of TODO items that need to be worked on. The hardest part (building the OAuth headers) has already been taken care of. What’s left is re-factoring the code for simplification (http://randyzwitch.com/julia-metaprogramming-refactoring/), factoring out the OAuth code in general into a new Julia library (https://github.com/randyzwitch/OAuth.jl, also partially started), then building the Streaming API functions, cleaning up the DataFrame methods to remove the Dict column types, paging through API results…and so on.

So if any of you are on the sidelines wanting to get some practice on developing packages, without needing to worry about learning Astrophysics first, I’d love to collaborate. And if any Julia programming masters want to collaborate, well that’s great too. All help and pull requests are welcomed.

In the meantime, hopefully some of you will find this package useful for natural language processing, social networking analysis or even creating bots ;)

 

What’s Wrong with Statistics in Julia?

By: John Myles White

Re-posted from: http://www.johnmyleswhite.com/notebook/2014/11/29/whats-wrong-with-statistics-in-julia/

Introduction

Several months ago, I promised to write an updated version of my old post, “The State of Statistics in Julia”, that would describe how Julia’s support for statistical computing has evolved since December 2012.

I’ve kept putting off writing that post for several reasons, but the most important reason is that all of my attention for the last few months has been focused on what’s wrong with how Julia handles statistical computing. As such, the post I’ve decided to write isn’t a review of what’s already been done in Julia, but a summary of what’s being done right now to improve Julia’s support for statistical computing.

In particular, this post focuses on several big changes to the core data structures that are used in Julia to represent statistical data. These changes should all ship when Julia 0.4 is released.

What’s Wrong with Statistics in Julia Today?

The primary problem with statistical computing in Julia is that the current tools were all designed to emulate R. Unfortunately, R’s approach to statistical computing isn’t amenable to the kinds of static analysis techniques that Julia uses to produce efficient machine code.

In particular, the following differences between R and Julia have repeatedly created problems for developers:

  • In Julia, computations involving scalars are at least as important as computations involving vectors. In particular, iterative computations are first-class citizens in Julia. This implies that statistical libraries must allow developers to write efficient code that iterates over the elements of a vector in pure Julia. Because Julia’s compiler can only produce efficient machine code for computations that are type-stable, the representations of missing values, categorical values and ordinal values in Julia programs must all be type-stable. Whether a value is missing or not, its type must remain the same.
  • In Julia, almost all end-users will end up creating their own types. As such, any tools for statistical computing must be generic enough that they can be extended to arbitrary types with little to no effort. In contrast to R, which can heavily optimize its algorithms for a very small number of primitive types, Julia developers must ensure that their libraries are both highly performant and highly abstract.
  • Julia, like most mainstream languages, eagerly evaluates the arguments passed to functions. This implies that idioms from R which depend upon non-standard evaluation are not appropriate for Julia, although it is possible to emulate some forms of non-standard evaluation using macros. In addition, Julia doesn’t allow programmers to reify scope. This implies that idioms from R that require access to the caller’s scope are not appropriate for Julia.

The most important way in which these issues came up in the first generation of statistical libraries was in the representation of a single scalar missing value. In Julia 0.3, this concept is represented by the value NA, but that representation will be replaced when 0.4 is released. Most of this post will focus on the problems created by NA.

In addition to problems involving NA, there were also problems with how expressions were being passed to some functions. These problems have been resolved by removing the signatures of statistical functions that accepted expressions as arguments. A prototype package called DataFramesMeta, which uses macros to emulate some kinds of non-standard evaluation, is being developed by Tom Short.

Representing Missing Values

In Julia 0.3, missing values are represented by a singleton object, NA, of type NAtype. Thus, a variable x, which might be either a Float64 value or a missing value encoded as NA, will end up with type Union(Float64, NAtype). This Union type is a source of performance problems because it defeats Julia’s compiler’s attempts to assign a unique concrete type to every variable.
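The instability can be seen by asking the compiler what it infers. The sketch below is written in current Julia syntax rather than 0.3, and uses `nothing` as a stand-in for the NA singleton; the function names are hypothetical.

```julia
# `nothing` plays the role of the 0.3-era NA singleton in this sketch.
maybe_sqrt(x::Float64) = x >= 0 ? sqrt(x) : nothing   # Float64 or Nothing
safe_sqrt(x::Float64)  = x >= 0 ? sqrt(x) : NaN       # always Float64

# Ask the compiler which return type it infers for each definition.
unstable_type = Base.return_types(maybe_sqrt, (Float64,))[1]
stable_type   = Base.return_types(safe_sqrt, (Float64,))[1]

unstable_is_concrete = isconcretetype(unstable_type)   # a Union is not concrete
stable_is_concrete   = isconcretetype(stable_type)     # a single concrete type
```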

We could remove this type-instability by ensuring that every type has a specific value, such as NaN, that signals missingness. This is the approach that both R and pandas take. It offers acceptable performance, but does so at the expense of generic handling of non-primitive types. Given Julia’s rampant usage of custom types, the sentinel values approach is not viable.
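A short sketch of why sentinels do not generalize: NaN is available only for floating-point data, while an integer column has no spare value to give up. The data and the choice of `typemin(Int)` as an integer sentinel are arbitrary assumptions for the example.

```julia
# NaN works as a missing-value marker for floating-point data...
readings = [1.5, NaN, 2.5]
total = sum(x for x in readings if !isnan(x))

# ...but integers have no spare value: every bit pattern is a valid Int, so
# any sentinel (here typemin(Int), chosen arbitrarily) shadows a real number.
int_sentinel = typemin(Int)
counts = [3, int_sentinel, 4]
count_total = sum(c for c in counts if c != int_sentinel)
```

Each new type would need its own hand-picked sentinel and its own checks, which is the cost the paragraph above refers to.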

As such, we’re going to represent missing values in Julia 0.4 by borrowing some ideas from functional languages. In particular, we’ll be replacing the singleton object NA with a new parametric type Nullable{T}. Unlike NA, a Nullable object isn’t a direct scalar value. Rather, a Nullable object is a specialized container type that either contains one value or zero values. An empty Nullable container is taken to represent a missing value.

The Nullable approach to representing a missing scalar value offers two distinct improvements:

  • Nullable{T} provides radically better performance than Union(T, NAtype). In some benchmarks, I find that iterative constructs can be as much as 100x faster when using Nullable{Float64} instead of Union(Float64, NAtype). On the other hand, I’ve found that Nullable{Float64} is about 60% slower than using NaN to represent missing values, but it offers a generic approach that trivially extends to arbitrary new types, including integers, dates, complex numbers, quaternions, etc.
  • Nullable{T} provides more type safety by requiring that all attempts to interact with potentially missing values explicitly indicate how missing values should be treated.

In a future blog post, I’ll describe how Nullable works in greater detail.
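Until then, the container idea can be sketched with a hypothetical `Maybe{T}` type. This is an illustration in current Julia syntax under my own naming (`Maybe`, `getvalue`, `sum_present`), not the Nullable implementation itself.

```julia
# Hypothetical container holding either zero values or one value of type T.
struct Maybe{T}
    hasvalue::Bool
    value::T
    Maybe{T}() where {T} = new{T}(false)              # empty: `value` is left unset
    Maybe{T}(value) where {T} = new{T}(true, value)   # holds exactly one value
end

# Callers must say what should happen when the container is empty.
getvalue(m::Maybe{T}, default::T) where {T} = m.hasvalue ? m.value : default

function sum_present(xs::Vector{Maybe{Float64}})
    total = 0.0
    for x in xs
        total += getvalue(x, 0.0)   # every element has the same concrete type
    end
    return total
end

data = [Maybe{Float64}(1.0), Maybe{Float64}(), Maybe{Float64}(2.5)]
present_total = sum_present(data)
```

Because a missing entry and a present entry are both `Maybe{Float64}`, the loop stays type-stable, and the explicit default in `getvalue` is where the type-safety point from the second bullet shows up.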

Categorical Values

In addition to revising the representation of missing values, I’ve also been working on revising our representation of categorical values. Working with categorical data in Julia has always been a little strange, because the main tool for representing categorical data, the PooledDataArray, has always occupied an awkward intermediate position between two incompatible objectives:

  • A container that keeps track of the unique values present in the container and uses this information to efficiently represent values as pointers to a pool of unique values.
  • A container that contains values of a categorical variable drawn from a well-defined universe of possible values. The universe can include values that are not currently present in the container.

These two goals come into severe tension when considering subsets of a PooledDataArray. The uniqueness constraint suggests that the pool should shrink, whereas the categorical variable definition suggests that the pool should be maintained without change. In Julia 0.4, we’re going to commit completely to the latter behavior and leave the problem of efficiently representing highly compressible data for another data structure.
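The second behavior can be sketched with a hypothetical `CategoricalColumn` type in which subsetting never shrinks the set of levels. The type and helper names are assumptions for this example, not the types planned for 0.4.

```julia
# Hypothetical categorical column: integer codes that index into a fixed list
# of levels. The levels describe the universe of values, not just those present.
struct CategoricalColumn
    levels::Vector{String}
    codes::Vector{Int}
end

# Decode the stored codes back into their labels.
labels(col::CategoricalColumn) = col.levels[col.codes]

# Subsetting keeps the full set of levels, even ones no longer present.
subset(col::CategoricalColumn, rows) = CategoricalColumn(col.levels, col.codes[rows])

sizes = CategoricalColumn(["small", "medium", "large"], [1, 3, 3, 2])
first_three = subset(sizes, 1:3)
```

Here `first_three` no longer contains a "medium" value, yet "medium" remains a known level, which is what a model-fitting routine would need in order to treat subsets consistently.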

We’ll also begin representing scalar values of categorical variables using custom types. The new CategoricalVariable and OrdinalVariable types that will ship with Julia 0.4 will further the efforts to put scalar computations on an equal footing with vector computations. This will be particularly notable for dealing with ordinal variables, which are not supported at all in Julia 0.3.

Metaprogramming

Many R functions employ non-standard evaluation as a mechanism for augmenting the current scope with the column names of a data.frame. In Julia, it’s often possible to emulate this behavior using macros. The in-progress DataFramesMeta package explores this alternative to non-standard evaluation. We will also be exploring other alternatives to non-standard evaluation in the future.
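As a rough idea of how a macro can stand in for non-standard evaluation, the sketch below rewrites quoted symbols such as `:x` into lookups on a table before the expression runs. The macro name `@with_columns` and the Dict-based table are assumptions for this example; this is not DataFramesMeta's implementation.

```julia
# Recursively replace quoted symbols such as :x with a lookup into `data`.
replace_columns(ex, data) = ex                                   # leave other nodes alone
replace_columns(ex::QuoteNode, data) = ex.value isa Symbol ? :($data[$ex]) : ex
function replace_columns(ex::Expr, data)
    return Expr(ex.head, map(arg -> replace_columns(arg, data), ex.args)...)
end

# Hypothetical macro: the rewrite happens at expansion time, so the body is
# ordinary Julia code by the time it is evaluated.
macro with_columns(data, body)
    return esc(replace_columns(body, data))
end

table = Dict(:x => [1, 2, 3], :y => [10, 20, 30])
total = @with_columns table sum(:x .+ :y)   # expands to sum(table[:x] .+ table[:y])
```

Unlike R's approach, nothing here inspects the caller's scope at run time; the column references are resolved syntactically, which keeps the generated code amenable to the compiler.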

What’s Next

In the long-term future, I’m hoping to improve several other parts of Julia’s core statistical infrastructure. In particular, I’d like to replace DataFrames with a new type that no longer occupies a strange intermediate position between matrices and relational tables. I’ll write another post about those issues later.

Code Refactoring Using Metaprogramming

By: Randy Zwitch

Re-posted from: http://randyzwitch.com/julia-metaprogramming-refactoring/

It’s been nearly a year since I wrote Twitter.jl, back when I seemingly had MUCH more free time. In these past 10 months, I’ve used Julia quite a bit to develop other packages, and I try to use it at work when I know I’m not going to be collaborating with others (since my colleagues don’t know Julia, not because it’s bad for collaboration!).

One of the things that’s obvious from my earlier Julia code is that I didn’t understand how powerful metaprogramming can be, so here’s a simple example where I can replace 50 lines of Julia code with 10.

CTRL-A, CTRL-C, CTRL-P. Repeat.

Admittedly, when I started on the Twitter package, I fully meant to go back and clean up the codebase, but moved on to something more fun instead. I started the Twitter package as a means of learning how to use the Requests.jl library to make API calls, figured out the OAuth syntax I needed (which itself should be factored out of Twitter.jl), then copied and pasted the same basic function structure over and over. While fast, what I was left with was this (currently, the help.jl file in the Twitter package):

It’s pretty clear that this is the same exact code pattern, right down to the spacing! The way to interpret this code is that for these five Twitter API methods, there are no required inputs. Optionally, there is the ‘options’ keyword that allows for specifying a Dict() of options. For these five functions, there are no options you can pass to the Twitter API, so even this keyword is redundant. These are simple functions, so I don’t gain a lot by way of maintainability by using metaprogramming, but at the same time, one of the core tenets of programming is ‘Don’t Repeat Yourself’, so let’s clean this up.

For :symbol in symbolslist…

In order to clean this up, we need to take out the unique parts of the function, then pass them as arguments to the @eval macro as follows:

What’s happening in this code is that I define two tuples: one of function names (as symbols, denoted by ‘:’ ) and one of the API endpoints. We can then iterate over the two tuples, substituting the function names and endpoints into the code. When the package is loaded, this code evaluates, defining the five functions for use in the Twitter package.
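A self-contained sketch of the pattern just described, in current Julia syntax. The stub `fake_get`, the three function names, the endpoint strings and the base URL are assumptions chosen for illustration so that the example runs without network access or credentials; the package's own code calls its OAuth-signed request function instead.

```julia
# Stand-in for the package's authenticated HTTP call (hypothetical): it only
# echoes what would be requested.
fake_get(url::String, options::Dict{String,String}) = (url = url, options = options)

# One tuple of function names (as symbols) and one of matching endpoints.
funcnames = (:get_help_configuration, :get_help_languages, :get_help_privacy)
endpoints = ("help/configuration.json", "help/languages.json", "help/privacy.json")

# Define one function per (name, endpoint) pair; $func and $endp are spliced
# into the definition before it is evaluated.
for (func, endp) in zip(funcnames, endpoints)
    @eval function $func(; options::Dict{String,String} = Dict{String,String}())
        return fake_get("https://api.twitter.com/1.1/" * $endp, options)
    end
end

languages_request = get_help_languages()
```

Adding another endpoint with the same shape then means adding one entry to each tuple rather than copying a whole function body.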

Wha?

Yeah, so metaprogramming can be simple, but it can also be mind-bending. It’s one thing to not repeat yourself; it’s another to write something so complex that even YOU can’t remember how the code works. But somewhere in between lies a sweet spot where you can re-factor whole swaths of code and streamline your codebase.

Metaprogramming is used throughout the Julia codebase, so if you’re interested in seeing more examples of metaprogramming, check out the Julia source code, the Requests.jl package (where I first saw this) or really anyone who actually knows what they are doing. I’m just a metaprogramming pretender at this point :)

 

To read additional discussion around this specific example, see the Julia-Users discussion at:

https://groups.google.com/forum/#!topic/julia-users/zvJmqB2N0GQ