Author Archives: Josh Day

The Best Data Science Talks of JuliaCon 2021

By: Josh Day

Re-posted from: https://www.juliafordatascience.com/best-data-science-talks-of-juliacon-2021/

Enjoying Julia For Data Science?  Please share us with a friend and follow us on Twitter at @JuliaForDataSci.

Introduction


JuliaCon 2021 has come and gone.  I've finally finished getting through my backlog of talks to watch and wanted to share my favorites (that fall under the category of data science).  Note that this is not an exhaustive list of "good talks".  I'm merely highlighting the ones most applicable to the average data scientist and I hope you check out many more than just these listed here!

Each talk below is listed (in alphabetical order) along with its abstract as well as some of my own notes.  I've also assigned a rating in terms of beginner-friendliness using the following scheme:

  • 🟢 – Beginner friendly!  Aimed at Julia beginners.  
  • 🟦 – Intermediate.  You're expected to know a bit of Julia beforehand.
  • ♦️ – Advanced.  Either advanced Julia or domain knowledge will help you understand the talk.

The Talks


♦️ Applied Measure Theory for Probabilistic Modeling

  • Speaker: Chad Scherrer.
  • Abstract: We'll give an overview of MeasureTheory.jl, describing some of the advantages relative to Distributions.jl and some applications in probabilistic modeling.
  • Josh's notes:  This talk only loosely falls under the category of data science, but I love using the mathematical statistics/measure theory part of my brain, so it gets included.  I really enjoyed this one.

🟢🟦 Bias Audit and Mitigation in Julia

  • Speaker: Ashrya Agrawal.
  • Abstract: This talk introduces Fairness.jl, a toolkit to audit and mitigate bias in ML decision support tools. We shall introduce the problem of fairness in ML systems, its sources, significance and challenges. Then we will demonstrate Fairness.jl structure and workflow.
  • Josh's notes:  This talk gives a great introduction to the issues of fairness/bias in machine learning as well as offers easy-to-follow examples using Fairness.jl.

🟢🟦 Clearing the Pipeline Jungle with FeatureTransforms.jl

  • Speaker: Glenn Moynihan.
  • Abstract: The prevalence of glue code in feature engineering pipelines poses many problems in conducting high-quality, scalable research. In worst-case scenarios, the technical debt racked up by overgrown “pipeline jungles” can preclude further development and grind promising projects to a halt. This talk will show how the FeatureTransforms.jl package can help make feature engineering a more sustainable practice for users without sacrificing the flexibility they desire.
  • Josh's notes:  This talk has a solid introduction to the real-world difficulties with feature engineering and moves on to clear examples using FeatureTransforms.jl.

🟢🟦 DataFrames.jl 1.0 Tutorial (workshop)

  • Speaker: Bogumił Kamiński.
  • Abstract: In this workshop an introduction to DataFrames.jl 1.2 will be presented. You will learn how to load, transform and visualize your data using the DataFrames.jl package. The tutorial assumes that you have some experience in working with data frames in e.g. R or Python.  All the materials used are available for download at https://github.com/bkamins/JuliaCon2021-DataFrames-Tutorial.
  • Josh's notes: A great way to learn DataFrames from its most active maintainer!

♦️ Easy, Featureful Parallelism with Dagger.jl

  • Speaker: Julian P Samaroo.
  • Abstract: Parallelizing codes with Distributed.jl is simple and can provide an appreciable speed-up; but for complicated problems or when scaling to large problem sizes, the APIs are somewhat lacking. Dagger.jl takes parallelism to the next level, with support for GPU execution, fault tolerance, and more. Dagger's scheduler exploits every bit of parallelism it can find, and uses all the resources you can give it. In this talk, I'll build an application with Dagger to highlight what Dagger can do for you!
  • Josh's notes: Working on a non-trivial distributed computing problem?  This is a low level talk, but if you've struggled with Julia's distributed computing primitives, you may want to give Dagger a try.

🟦♦️ Introduction to Bayesian Data Analysis (workshop)

  • Speakers: Kusti Skytén, Chad Scherrer, & Tor Fjelde.
  • Abstract: This workshop will introduce the recommended workflow for applied Bayesian data analysis by working through an example analysis together. We will start with the simplest non-trivial model and use increasingly sophisticated models to explain the properties of our data set based on model diagnostics. We will also give an overview of the different probabilistic programming packages in Julia and show where we have advantages over other languages such as Stan and Python.
  • Josh's notes: This is an in-depth tutorial to the ecosystem of Bayesian data analysis in Julia.  You may want to start with the "Statistics with Julia from the Ground Up" talk if you are newer to Julia.

🟢🟦 Pluto – One Year Later

  • Speaker: Fons van der Plas.
  • Abstract: Pluto.jl is a notebook IDE for Julia, with a focus on interactivity and education. In this talk, you'll learn about our work during the past year, and our future plans.
  • Josh's notes: Pluto got a lot of attention the past year and rightfully so.  If you find yourself in a notebook environment (like Jupyter) often, you should give Pluto a try.  Watch this one!  If you want more of Pluto, also check out the Open and Interactive Computational Thinking with Julia and Pluto talk.

🟦 Rewriting Pieces of a Python Codebase in Julia

  • Speaker: Satvik Souza Beri.
  • Abstract: Many people looking at Julia are coming from Python, and already have a sizable codebase. Our fund started rewriting performance-critical parts of our Python codebase in Julia, getting 10x-30x speedups. I'll go over how to start migrating Python code to Julia using PyCall and PyJulia, some gotchas to avoid, and where you're likely to see the biggest benefits.
  • Josh's notes: This is not specific to data science, but I've included this talk because many people are in the same boat: they come to Julia with an existing Python codebase.

🟢 The State of DataFrames.jl

  • Speaker: Bogumił Kamiński.
  • Abstract: In this talk I discuss what has recently changed in DataFrames.jl, what is the current state of the package, and what are our plans for the future.
  • Josh's notes:  You won't see a lot of code in this talk (see the tutorial), but you'll learn about DataFrames' design philosophy as well as what the open source contributors to DataFrames.jl have achieved.

🟦 State of Julia

  • Speakers: Jeff Bezanson, Stefan Karpinski, Keno Fischer, & Viral Shah
  • Abstract: Annual talk on the state of things by Julia's creators.
  • Josh's notes:  This is a staple of every JuliaCon and always worth a watch to see what people are working on under the hood.

🟢 Statistics with Julia from the Ground Up (workshop)

  • Speaker: Yoni Nazarathy.
  • Abstract: This workshop provides an introduction to the Julia language for data-scientists and statisticians. No prior experience with Julia is assumed. The workshop starts with a few Julia basics and then progresses through basic probability and statistics examples, usage of dataframes, elementary statistical inference, regression, and more advanced methods. At the end of this workshop, attendees will have a solid entry point for using Julia as their preferred data analysis tool.
  • Josh's notes: This is the best workshop I watched.  It begins by introducing Julia at a comfortable pace and then goes into a tour of the entire statistics ecosystem, including doing basic statistics, plotting, working with data, etc.

🚀 That's It!

Did we miss any awesome talks?


Calling R From Julia

By: Josh Day

Re-posted from: https://www.juliafordatascience.com/calling-r-from-julia/


Since Julia has a younger ecosystem, you won't always find the functionality you need for every task.  In those cases, you're left with the choice to either:

  • Implement it yourself.
    • This is great for the Julia community in the long run (especially if you release your work as a package! Please do this!), but you don't always have the time/energy to do so.
  • Use a package in another language.
    • Language wars are boring. You should use all the tools at your disposal. You don't need to stick with Julia for everything (especially since interop is easy!).

Particularly for niche models in the field of statistics, you'll often find R packages that do not have Julia counterparts.  Thankfully, Julia and R work together rather seamlessly through the help of one package.

Introducing RCall.jl

To get started, let's install RCall:

] add RCall

Running R Code from Julia

  • After installing RCall and running using RCall, you'll have access to the R REPL Mode by typing $.  Your prompt will change from julia> to R>, and all of your commands will then run in R instead of Julia.  You can use it just like a normal R session.
julia> using RCall

# type `$`

R> install.packages("ggplot2")

R> library(ggplot2)

R> data(diamonds)

R> ggplot(diamonds, aes(x=carat, y=price)) + geom_point()
  • If you want to call R from Julia in a non-interactive manner (not from the REPL), you can use the @R_str macro:
julia> R"y = 2"
RObject{RealSxp}
[1] 2

Sending Julia Variables to R

  • The @rput macro sends a variable to R and uses the same name.
julia> x = 1
1

julia> @rput x
1

R> x
[1] 1
  • You can interpolate Julia values in @R_str commands as well as the R REPL Mode:
julia> x = 1
1

julia> R"y = $x"
RObject{IntSxp}
[1] 1

R> 1 + $x
[1] 2
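
Interpolation is not limited to scalars.  As a minimal sketch (assuming R is installed and RCall is set up; the heights data is made up for illustration), a Julia vector can be interpolated into an R call in the same way:

```julia
using RCall  # assumes a working R installation that RCall can find

# Made-up data, purely for illustration
heights = [1.62, 1.75, 1.80, 1.68]

# `$heights` interpolates the Julia vector into the R expression
result = R"mean($heights)"

# `result` is an RObject that lives on the R side
println(result)
```

The value comes back wrapped in an RObject, just like the other @R_str results shown here.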

Retrieving R Variables in Julia

  • The @rget macro sends a variable from R to Julia and uses the same name.
julia> R"z = 5"
RObject{RealSxp}
[1] 5


julia> @rget z
5.0

julia> z
5.0
  • You can also convert an RObject into the appropriate Julia counterpart with rcopy.
julia> robj = R"z"
RObject{RealSxp}
[1] 5


julia> rcopy(robj)
5.0
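
Putting interpolation and rcopy together, one way to hide the round trip is to wrap it in an ordinary Julia function.  This is only a sketch: r_median is a hypothetical helper name, and it assumes R is installed and RCall is configured.

```julia
using RCall  # assumes a working R installation that RCall can find

# Hypothetical helper: compute a median in R and hand back a Julia value
function r_median(x::AbstractVector{Float64})
    robj = R"median($x)"   # interpolate the Julia vector into the R call
    return rcopy(robj)     # convert the resulting RObject back to Julia
end

m = r_median([3.0, 1.0, 2.0])
println(m)
```

Callers of r_median never see R at all, which is handy when only one or two functions from an R package are needed.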

🚀 That's It!

You now know the basics of working in R from Julia.  There are deeper depths to dive into on the topic (such as type conversions between R and Julia), but this should give you a start on using your favorite R package together with Julia.



Big CSVs

By: Josh Day

Re-posted from: https://www.juliafordatascience.com/big-csvs/


Big data is an overloaded term.  Here at Julia For Data Science, we'll loosely define it as:

Big data is any dataset (or collection of datasets) that requires you to change how you analyze it because of its size.

For example, I am writing this post on a laptop with 8GB of memory.  If I'm trying to analyze a 10MB CSV, I can easily load that into memory and run my analysis.  If I have a 10GB CSV, I need to change my approach because I can't load all the data at once.
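
As a rough way to make that decision in code, the sketch below compares a file's size on disk with the machine's total memory.  The fits_in_memory name and the 0.5 safety factor are assumptions made for illustration; parsed data can take more or less space than the raw file, so treat the result as a hint rather than a guarantee.

```julia
# Hypothetical helper: guess whether a file can be loaded in one go.
# `fraction` is an arbitrary safety margin (an assumption, not a rule).
function fits_in_memory(path::AbstractString; fraction::Real = 0.5)
    return filesize(path) <= fraction * Sys.total_memory()
end

# Try it on a tiny temporary file
path, io = mktemp()
write(io, "a,b\n1,2\n")
close(io)

println(fits_in_memory(path))
```

If the check fails, that is the signal to switch to a row-at-a-time approach like the one described later in this post.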

For Small CSVs: CSV.File 📄

The CSV package is a high-performance Julia package for working with CSV files.  For smaller (comfortably fitting in memory) datasets, you can get started with CSV.File("path/to/file.csv").

  • This creates a CSV.File object that can be loaded into many kinds of "sinks", such as a DataFrame.
  • A CSV.File will automatically determine the types for each column.
  • There are many options available to help you read CSVs with different formats.
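
To give a feel for those options, here is a small sketch that reads made-up, semicolon-delimited data from an in-memory buffer.  The data and column names are invented for illustration; delim, missingstring, and types are keyword arguments of CSV.File, though the full list is best checked against the CSV.jl documentation for your installed version.

```julia
using CSV, DataFrames

# Made-up data: semicolon-delimited, with "NA" marking a missing value
data = """
id;name;score
1;Ada;9.5
2;Grace;NA
3;Alan;7.0
"""

file = CSV.File(IOBuffer(data);
    delim = ';',                  # column separator
    missingstring = "NA",         # text to read as `missing`
    types = Dict(:id => Int32),   # override the inferred type of one column
)

df = DataFrame(file)
println(df)
```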

Here's a quick example of loading a CSV from the web (New York Times' COVID data), loading it into a DataFrame, and making a plot:

using CSV, DataFrames, Plots

url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv"

file = CSV.File(download(url))

df = DataFrame(file)

plot(df.date, df.cases, title="Cumulative COVID-19 Cases in the US",
	lab="", ylab="N", xlab="Date")

Figure: Using CSV to load the New York Times' COVID dataset

For Big CSVs: CSV.Rows 🚣

You can also create a CSV iterator that only loads one row into memory at a time, allowing you to work with huge CSVs.  CSV.Rows is similar to CSV.File, but it does not infer types.  When you iterate over CSV.Rows, all data is represented as a String.  Thus, you must manually parse the data into its appropriate type.
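
Here is what the manual parsing looks like on a tiny in-memory example.  The values are made up; the column names simply mirror the taxi data used in the rest of this post.

```julia
using CSV

# Made-up rows, for illustration only
data = """
fare_amount,passenger_count
4.5,1
16.9,2
5.7,1
"""

# Fields come back as strings, not numbers
first_row = first(CSV.Rows(IOBuffer(data)))
println(first_row.fare_amount isa AbstractString)

# Parse each field explicitly before doing arithmetic
total_fare = sum(parse(Float64, row.fare_amount) for row in CSV.Rows(IOBuffer(data)))
println(total_fare)
```

Because the generator pulls one row at a time, the sum never needs the whole file in memory.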

As an example, let's take a look at Kaggle's New York City Taxi training data, a 5.31GB CSV file containing data on taxi rides (fare amount, number of passengers, pickup time, and pickup and dropoff locations).  Note this isn't actually larger than my laptop's memory, but we are trying to keep our examples easily reproducible.

We can't plot all the data directly since we aren't loading it all at once, so what can we do?

CSV 🤝 OnlineStats

OnlineStats is a Julia package that provides fast single-pass algorithms for calculating statistics and data visualizations on big data.  Every statistics algorithm in OnlineStats uses constant memory: it keeps a small, fixed-size summary and updates it one observation at a time, so it can be used on infinitely-sized data!
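
To see what "single-pass" and "constant memory" mean in practice, here is a plain-Julia sketch of a running mean and variance using Welford's algorithm.  It is an illustration of the general idea only, not how OnlineStats is implemented; RunningStats, update!, and variance are names invented for this example.

```julia
# Illustration only: a fixed-size summary updated one value at a time
mutable struct RunningStats
    n::Int          # number of observations seen so far
    mean::Float64   # running mean
    m2::Float64     # running sum of squared deviations from the mean
end
RunningStats() = RunningStats(0, 0.0, 0.0)

# Welford's update: fold one new observation into the summary
function update!(s::RunningStats, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

# Sample variance from the summary (NaN until there are two observations)
variance(s::RunningStats) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

s = RunningStats()
for x in 1:1_000_000   # any iterator works; no observation is stored
    update!(s, x)
end
println((s.mean, variance(s)))
```

The struct holds three numbers no matter how many values pass through it, which is why this style of algorithm scales to data that never fits in memory.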

We'll use OnlineStats together with CSV.Rows to run our analysis.  Note that the CSV and OnlineStats packages don't depend on each other, but they work well together because of Julia's composability.  Let's take a look at the distributions of fare amount, grouped by the number of passengers.

using CSV, OnlineStats, Plots

# `reusebuffer=true` lets us reuse the same computer memory for each row
rows = CSV.Rows("/Users/joshday/datasets/nyc_taxi_kaggle/train.csv", 
	reusebuffer=true)

# Create an iterator that parses the types
itr = (parse(Int, r.passenger_count) => parse(Float64, r.fare_amount)
    for r in rows)

# OnlineStats works with any iterator
o = GroupBy(Int, Hist(0:.5:100))

fit!(o, itr)

# Plot results for 1-4 passengers
plot(plot(o[1]), plot(o[2]), plot(o[3]), plot(o[4]),
    layout=(4,1), link=:all, lab=[1 2 3 4]
)

Figure: OnlineStats + CSV.Rows
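
If you want to try the group => value pattern from the taxi example without downloading anything, the sketch below feeds a few made-up pairs into a GroupBy of Mean.  The numbers are invented for illustration, and indexing the GroupBy by group key follows the same o[1] usage as the plotting code above.

```julia
using OnlineStats

# Made-up (passenger_count => fare_amount) pairs
rides = [1 => 4.5, 2 => 16.9, 1 => 5.7, 2 => 10.1]

# One running Mean per distinct passenger count
o = GroupBy(Int, Mean())
fit!(o, rides)

println(value(o[1]))   # mean fare for 1-passenger rides
println(value(o[2]))   # mean fare for 2-passenger rides
```

Swapping Mean() for Hist(0:.5:100) gives the same structure as the full example, just on data small enough to check by hand.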

🚀 That's It!

You can now calculate statistics and run an analysis on a big CSV file using the CSV and OnlineStats packages.  

What do you want to learn about next?  Ping us on Twitter at @JuliaForDataSci!

