By: Andrew Collier

Re-posted from: http://www.exegetic.biz/blog/2015/09/monthofjulia-day-17-datasets-from-r/

R has an extensive range of builtin datasets, which are useful for experimenting with the language. The `RDatasets`

package makes many of these available within Julia. We’ll see another way of accessing R’s datasets in a couple of days’ time too. In the meantime though, check out the documentation for `RDatasets`

and then read on below.

As always, the first thing that we need to do is load the package.

julia> using RDatasets

We can get a list of the R packages which are supported by `RDatasets`

.

julia> RDatasets.packages() 33x2 DataFrame | Row | Package | Title | |-----|----------------|---------------------------------------------------------------------------| | 1 | "COUNT" | "Functions, data and code for count data." | | 2 | "Ecdat" | "Data sets for econometrics" | | 3 | "HSAUR" | "A Handbook of Statistical Analyses Using R (1st Edition)" | | 4 | "HistData" | "Data sets from the history of statistics and data visualization" | | 5 | "ISLR" | "Data for An Introduction to Statistical Learning with Applications in R" | | 6 | "KMsurv" | "Data sets from Klein and Moeschberger (1997), Survival Analysis" | | 7 | "MASS" | "Support Functions and Datasets for Venables and Ripley's MASS" | | 8 | "SASmixed" | "Data sets from \"SAS System for Mixed Models\"" | | 9 | "Zelig" | "Everyone's Statistical Software" | | 10 | "adehabitatLT" | "Analysis of Animal Movements" | | 11 | "boot" | "Bootstrap Functions (Originally by Angelo Canty for S)" | | 12 | "car" | "Companion to Applied Regression" | | 13 | "cluster" | "Cluster Analysis Extended Rousseeuw et al." | | 14 | "datasets" | "The R Datasets Package" | | 15 | "gap" | "Genetic analysis package" | | 16 | "ggplot2" | "An Implementation of the Grammar of Graphics" | | 17 | "lattice" | "Lattice Graphics" | | 18 | "lme4" | "Linear mixed-effects models using Eigen and S4" | | 19 | "mgcv" | "Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness estimation" | | 20 | "mlmRev" | "Examples from Multilevel Modelling Software Review" | | 21 | "nlreg" | "Higher Order Inference for Nonlinear Heteroscedastic Models" | | 22 | "plm" | "Linear Models for Panel Data" | | 23 | "plyr" | "Tools for splitting, applying and combining data" | | 24 | "pscl" | "Political Science Computational Laboratory, Stanford University" | | 25 | "psych" | "Procedures for Psychological, Psychometric, and Personality Research" | | 26 | "quantreg" | "Quantile Regression" | | 27 | "reshape2" | "Flexibly Reshape Data: A Reboot of the Reshape Package." | | 28 | "robustbase" | "Basic Robust Statistics" | | 29 | "rpart" | "Recursive Partitioning and Regression Trees" | | 30 | "sandwich" | "Robust Covariance Matrix Estimators" | | 31 | "sem" | "Structural Equation Models" | | 32 | "survival" | "Survival Analysis" | | 33 | "vcd" | "Visualizing Categorical Data" |

Next we’ll get a list of all datasets supported across all of those R packages. There are a lot of them! Also we see some specific statistics about the number of records and fields in each of them.

julia> sets = RDatasets.datasets(); julia> size(sets) (733,5) julia> head(sets) 6x5 DataFrame | Row | Package | Dataset | Title | Rows | Columns | |-----|---------|-------------|-------------|------|---------| | 1 | "COUNT" | "affairs" | "affairs" | 601 | 18 | | 2 | "COUNT" | "azdrg112" | "azdrg112" | 1798 | 4 | | 3 | "COUNT" | "azpro" | "azpro" | 3589 | 6 | | 4 | "COUNT" | "badhealth" | "badhealth" | 1127 | 3 | | 5 | "COUNT" | "fasttrakg" | "fasttrakg" | 15 | 9 | | 6 | "COUNT" | "lbw" | "lbw" | 189 | 10 |

Or we can find out what datasets are available from a particular R package.

julia> RDatasets.datasets("vcd") 31x5 DataFrame | Row | Package | Dataset | Title | Rows | Columns | |-----|---------|-------------------|--------------------------------------------|-------|---------| | 1 | "vcd" | "Arthritis" | "Arthritis Treatment Data" | 84 | 5 | | 2 | "vcd" | "Baseball" | "Baseball Data" | 322 | 25 | | 3 | "vcd" | "BrokenMarriage" | "Broken Marriage Data" | 20 | 4 | | 4 | "vcd" | "Bundesliga" | "Ergebnisse der Fussball-Bundesliga" | 14018 | 7 | | 5 | "vcd" | "Bundestag2005" | "Votes in German Bundestag Election 2005" | 16 | 6 | | 6 | "vcd" | "Butterfly" | "Butterfly Species in Malaya" | 24 | 2 | | 7 | "vcd" | "CoalMiners" | "Breathlessness and Wheeze in Coal Miners" | 32 | 4 | | 8 | "vcd" | "DanishWelfare" | "Danish Welfare Study Data" | 180 | 5 | | 9 | "vcd" | "Employment" | "Employment Status" | 24 | 4 | | 10 | "vcd" | "Federalist" | "'May' in Federalist Papers" | 7 | 2 | | 11 | "vcd" | "Hitters" | "Hitters Data" | 154 | 4 | | 12 | "vcd" | "HorseKicks" | "Death by Horse Kicks" | 5 | 2 | | 13 | "vcd" | "Hospital" | "Hospital data" | 3 | 4 | | 14 | "vcd" | "JobSatisfaction" | "Job Satisfaction Data" | 8 | 4 | | 15 | "vcd" | "JointSports" | "Opinions About Joint Sports" | 40 | 5 | | 16 | "vcd" | "Lifeboats" | "Lifeboats on the Titanic" | 18 | 8 | | 17 | "vcd" | "NonResponse" | "Non-Response Survey Data" | 12 | 4 | | 18 | "vcd" | "OvaryCancer" | "Ovary Cancer Data" | 16 | 5 | | 19 | "vcd" | "PreSex" | "Pre-marital Sex and Divorce" | 16 | 5 | | 20 | "vcd" | "Punishment" | "Corporal Punishment Data" | 36 | 5 | | 21 | "vcd" | "RepVict" | "Repeat Victimization Data" | 8 | 9 | | 22 | "vcd" | "Saxony" | "Families in Saxony" | 13 | 2 | | 23 | "vcd" | "SexualFun" | "Sex is Fun" | 4 | 5 | | 24 | "vcd" | "SpaceShuttle" | "Space Shuttle O-ring Failures" | 24 | 6 | | 25 | "vcd" | "Suicide" | "Suicide Rates in Germany" | 306 | 6 | | 26 | "vcd" | "Trucks" | "Truck Accidents Data" | 24 | 5 | | 27 | "vcd" | "UKSoccer" | "UK Soccer Scores" | 5 | 6 | | 28 | "vcd" | "VisualAcuity" | "Visual Acuity in Left and Right Eyes" | 32 | 4 | | 29 | "vcd" | "VonBort" | "Von Bortkiewicz Horse Kicks Data" | 280 | 4 | | 30 | "vcd" | "WeldonDice" | "Weldon's Dice Data" | 11 | 2 | | 31 | "vcd" | "WomenQueue" | "Women in Queues" | 11 | 2 |

Finally, the most interesting bit: accessing data from a particular dataset. Below we load up the `women`

dataset from the `vcd`

package.

julia> women = dataset("datasets", "women") 15x2 DataFrame | Row | Height | Weight | |-----|--------|--------| | 1 | 58 | 115 | | 2 | 59 | 117 | | 3 | 60 | 120 | | 4 | 61 | 123 | | 5 | 62 | 126 | | 6 | 63 | 129 | | 7 | 64 | 132 | | 8 | 65 | 135 | | 9 | 66 | 139 | | 10 | 67 | 142 | | 11 | 68 | 146 | | 12 | 69 | 150 | | 13 | 70 | 154 | | 14 | 71 | 159 | | 15 | 72 | 164 |

From these data we learn that the average mass of American women of height 66 inches is around 139 pounds. If you are from a country which uses the Metric system (like me!) then these numbers might seem a little mysterious. Come back in a couple of days and we’ll see how Julia can convert pounds and inches in metres and kilograms.

That’s all for now. Code for today is available on github.

The post #MonthOfJulia Day 17: Datasets from R appeared first on Exegetic Analytics.