Category Archives: Julia

Fitting Mixed Effects Models – Python, Julia or R?

Re-posted from: https://dm13450.github.io/2022/01/06/Mixed-Models-Benchmarking.html

I’m benchmarking how long it takes to fit a mixed
effects model using lme4 in R, statsmodels in Python, plus
showing how MixedModels.jl in Julia is also a viable option.

Enjoy these types of posts? Then you should sign up for my newsletter. It’s a short monthly recap of anything and everything I’ve found interesting recently plus
any posts I’ve written. So sign up and stay informed!

Data science is always up for debating whether R or Python is the better
language when it comes to analysing some data. Julia has been
making its case as a viable alternative as well. In most cases you can
perform a task in all three with a little bit of syntax
adjustment, so no need for any real commitment. Yet, I’ve
recently been having performance issues in R with a mixed model.

I have a dataset with 1604 groups for a random effect that has been
grinding to a halt when fitting in R using lme4. The team at lme4
have a vignette titled
performance tips
which at the bottom suggests using Julia to speed things up. So I’ve
taken it upon myself to benchmark the basic model-fitting performances
to see if there is a measurable difference. You can use this post as
an example of fitting a mixed effects model in Python, R and Julia.

The Setup

In our first experiment, I am using the palmerspenguins dataset to
fit a basic linear model. I’ve followed the most basic method in all
three languages, using what the first thing in Google
displays.

The dataset has 333 observations with 3 groups for the random effects
parameter.

I’ll make sure all the parameters are close across the three
languages before benchmarking the performance, again, using what
Google says is the best approach to time some code.

I’m running everything on a Late 2013 Macbook. 2.6GHz i5 with 16GB of
RAM. I’m more than happy to repeat this on an M1 Macbook if someone is
willing to sponsor the purchase!

Now onto the code.

Mixed Effects Models in R with lme4

We will start with R as that is where the dataset comes from. Loading
up the palmerspenguins package and filtering out the NaN in the
relevant columns provide a consistent dataset for the other
languages.

R – 4.1.0
lme4 – 1.1-27.1

require(palmerpenguins)
require(lme4)
require(microbenchmark)

testData <- palmerpenguins::penguins %>%
                    drop_na(sex, species, island, bill_length_mm)
lmer(bill_length_mm ~ (1 | species) + sex + 1,  testData)

Intercept – 43.211
sex:male – 3.694
species variance – 29.55

This testData gets saved as a csv for the rest of the languages
to read and use in the benchmarking.

To benchmark the function I use the
microbenchmark
package. It is an excellent and lightweight way of quickly working out
how long a function takes to run.

microbenchmark(lmer(bill_length_mm ~ (1 | species) + sex + 1,  testData), times = 1000)

This outputs the relevant quantities of the benchmarking times.

Mixed Effects Models Python with statsmodels

Python 3.9.4
statmodels 0.13.1

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import timeit as tt

modelData = pd.read_csv("~/Downloads/penguins.csv")

md = smf.mixedlm("bill_length_mm  ~ sex + 1", modelData, groups=modelData["species"])
mdf = md.fit(method=["lbfgs"])

Which gives us parameter values:

Intercept – 43.211
sex:male – 3.694
Species variance: 29.496

When we benchmark the code, we define a specific function and
repeatedly run it 10000 times. This is all contained in the timeit
module, part of the Python standard library.

def run():
    md = smf.mixedlm("bill_length_mm  ~ sex + 1", modelData, groups=modelData["species"])
    mdf = md.fit(method=["lbfgs"])

times = tt.repeat('run()', repeat = 10000, setup = "from __main__ import run", number = 1)

We’ll be taking the mean, median and, range of the times array.

Mixed Effects Models in Julia with MixedModels.jl

Julia 1.6.0
MixedModels 4.5.0

Julia follows the R syntax very closely, so this needs little
explanation.

using DataFrames, DataFramesMeta, CSV, MixedModels
using BenchmarkTools

modelData = CSV.read("/Users/deanmarkwick/Downloads/penguins.csv",
                                     DataFrame)

m1 = fit(MixedModel, @formula(bill_length_mm ~ 1 + (1|species) + sex), modelData)

Intercept – 43.2
sex:male coefficient – 3.694
group variance of – 19.68751

We use the
BenchmarkTools.jl
package to run the function 10,000 times.

@benchmark fit(MixedModel, @formula(bill_length_mm ~ 1 + (1|species) + sex), $modelData)

As a side note, if you run this benchmarking code in a Jupyter
notebook, you get this beautiful output from the BenchmarkTools
package. Gives you a lovely overview of all the different metrics and
the distribution on the running times.

Julia benchmarking screenshot

Timing Results

All the parameters are close enough, how about the running times?

In milliseconds:

Language	Mean	Median	Min	Max
Julia	0.482	0.374	0.320	34
Python	340	260	19	1400
R	29.5	24.5	20.45	467

Julia blows both Python and R out of the water. About 60 times
faster.

I don’t think Python is that slow in practice, I think it is more of
an artefact of the benchmarking code that doesn’t behave in the
same way as Julia and R.

What About Bigger Data and More Groups?

What if we increased the scale of the problem and also the number of
groups in the random effects parameters?

I’ll now fit a Poisson mixed model to some football data. I’ll
be modeling the goals scored by each team as a Poisson variable, with
a fixed effect of whether the team played at home or not and random
effects for the team and another random effect of the opponent.

This new data set is from
football-data.co.uk and has 98,242
rows with 151 groups in the random effects. Much bigger than the
Palmer Penguins dataset.

Now, poking around the statsmodels documentation, there doesn’t
appear to be a way to fit this model in a frequentist
way. The closest is the PoissonBayesMixedGLM, which isn’t comparable
to the R/Julia methods. So in this case we will be dropping Python
from the analysis. If I’m wrong, please let me know in the comments
below and I’ll add it to the benchmarking.

With generalised linear models in both R and Julia, there are
additional parameters to help speed up the fitting but at the expense of
the parameter accuracy. I’ll be testing these parameters to judge how
much of a tradeoff there is between speed and accuracy.

R

The basic mixed-effects generalised linear model doesn’t change much from the
above in R.

glmer(Goals ~ Home + (1 | Team) + (1 | Opponent), data=modelData, family="poisson")

The documentation states that you can pass nAGQ=0 to speed up the
fitting process but might lose some accuracy. So our fast version of
this model is simply:

glmer(Goals ~ Home + (1 | Team) + (1 | Opponent), data=modelData, family="poisson", nAGQ = 0)

Julia

Likewise for Julia hardly any difference in fitting this type of Poisson model.

fit(MixedModel, @formula(Goals ~ Home + (1 | Team) + (1 | Opponent)), footballData, Poisson())

And even mode simply, there is a fast parameter to use which speeds
up the fitting.

fit(MixedModel, @formula(Goals ~ Home + (1 | Team) + (1 | Opponent)), footballData, Poisson(), fast = true)

Big Data Results

Let’s check the fitted coefficients.

Method	Intercept	Home	\(\sigma _\text{Team}\)	\(\sigma_\text{Opponent}\)
R slow	0.1345	0.2426	0.2110	0.2304
R fast	0.1369	0.2426	0.2110	0.2304
Julia slow	0.13455	0.242624	0.211030	0.230415
Julia fast	0.136924	0.242625	0.211030	0.230422

The parameters are all very similar, showing that for this parameter
specification the different speed flags do not change the coefficient
results, which is good. But for any specific model, you should verify
on a subsample at least to make sure the flags don’t change anything.

Now, what about speed.

Language	Additional Parameter	Mean	Median	Min	Max
Julia	–	11.151	10.966	9.963	16.150
Julia	`fast=true`	5.94	5.924	4.98	8.15
R	–	35.4	33.12	24.33	66.48
R	`nAGQ=0`	8.06	7.99	7.37	9.56

So setting fast=true gives a 2x speed boost in Julia which is
nice. Likewise, setting nAGQ=0 in R improves the speed by almost 3x over
the default. Julia set to fast = true is the quickest, but I’m
surprised that R can get close with its speed-up parameter.

Conclusion

If you are fitting a large mixed-effects model with lots of groups
hopefully, this convinces you that Julia is the way forward. The syntax
for fitting the model is equivalent, so you can do all your
preparation in R before importing the data into Julia to do the model
fitting.

Advent of Code 2021 – Day 25

By: Julia on Eric Burden

Re-posted from: https://www.ericburden.work/blog/2022/01/06/advent-of-code-2021-day-25/

It’s that time of year again! Just like last year, I’ll be posting my solutions to the Advent of Code puzzles. This year, I’ll be solving the puzzles in Julia. I’ll post my solutions and code to GitHub as well. I had a blast with AoC last year, first in R, then as a set of exercises for learning Rust, so I’m really excited to see what this year’s puzzles will bring. If you haven’t given AoC a try, I encourage you to do so along with me!

Day 25 – Sea Cucumber

Find the problem description HERE.

The Input – Echinodermic Perambulation

Here we are! The final advent puzzle, and after a brief jaunt into strange and unusual input formats (that we didn’t actually write parsing code for), here we are again. Parsing input from a text file, how I have missed you!

# No fancy data structures this time, just two `BitMatrix`, one representing the
# locations of east-bound cucumbers, and one representing the locations of
# sounth-bound cucumbers, on a 2D grid. Can you tell I'm ready to be done?
function ingest(path)
    eastlines  = []
    southlines = []
    open(path) do f
        for line in readlines(f)
            chars = collect(line)
            push!(eastlines,  chars .== '>')
            push!(southlines, chars .== 'v')
        end
    end
    eastmatrix  = reduce(hcat, eastlines)  |> transpose |> BitMatrix
    southmatrix = reduce(hcat, southlines) |> transpose |> BitMatrix
    return (eastmatrix, southmatrix)
end

Ah, the tried and true ‘parse the grid to a BitMatrix’ approach! With two matrices!

Part One – The Cucumbers Go Marching

It looks like, at some point, the sea cucumbers will deadlock and stop moving about so much, giving us space to land the submarine. Guess you have to land submarines, then? Never considered that, to be honest, it just hasn’t come up before. Ah well, the good news is that, even here on the ocean floor, traffic is bound to come to a standstill sooner or later. Since they’re apparently quite polite, a sea cucumber will only move if there’s an empty space in front of it at the start of a round, and all the east-bound cucumbers try to move before the south-bound cucumbers. With all that in mind, we’re ready to proceed.


# Helper Functions -------------------------------------------------------------

# For checking on the total state of cucumber migration.
function pprint(east::BitMatrix, south::BitMatrix)
    template = fill('.', size(east))
    template[east]  .= '>'
    template[south] .= 'v'
    for row in eachrow(template)
        for char in row
            print(char)
        end
        println()
    end
end

# Move all the east-bound cucumbers that can be moved
function moveeast(east::BitMatrix, south::BitMatrix)
    occupied   = east .| south      # Spaces occupied by any type of cucumber
    nextstate  = falses(size(east)) # An empty BitMatrix, same size as `east`
    ncols      = size(east, 2)      # Number of columns in the grid
    colindices = collect(1:ncols)   # Vector of column indices

    # For each pair of column index and column index + 1 (wrapped around due
    # to space-defying ocean currents)...
    for (currcol, nextcol) in zip(colindices, circshift(colindices, -1))
        # Identify the cucumbers who moved in `currcol` by the empty spaces
        # in `nextcol`
        moved = east[:,currcol]  .& .!occupied[:,nextcol]

        # Identify the cucumbers who waited in `currcol` by the occupied
        # spaces in `nextcol`
        stayed = east[:,currcol] .& occupied[:,nextcol]

        # Place the updated cucumber positions into `nextstate`
        nextstate[:,nextcol] .|= moved
        nextstate[:,currcol] .|= stayed
    end

    return nextstate
end

# Move all the south-bound cucumbers than can be moved
function movesouth(east::BitMatrix, south::BitMatrix)
    occupied   = east .| south       # Spaces occupied by any type of cucumber
    nextstate  = falses(size(south)) # An empty BitMatrix, same size as `south`
    nrows      = size(south, 1)      # Number of rows in the grid
    rowindices = collect(1:nrows)    # Vector of row indices

    # For each pair of row index and row index + 1 (wrapped around due to
    # intradimensional currents)...
    for (currrow, nextrow) in zip(rowindices, circshift(rowindices, -1))
        # Identify the cucumbers who moved in `currrow` by the empty spaces
        # in `nextrow`
        moved = south[currrow,:] .& .!occupied[nextrow,:]

        # Identify the cucumbers who waited in `currrow` by the occupied
        # spaces in `nextrow`
        stayed = south[currrow,:] .& occupied[nextrow,:]

        # Place the updated cucumber positions into `nextstate`
        nextstate[nextrow,:] .|= moved
        nextstate[currrow,:] .|= stayed
    end

    return nextstate
end

# Move the herd(s)! Get the next state for east-bound and south-bound cucumbers,
# check to see if any of their positions changed, and return the results
function move(east::BitMatrix, south::BitMatrix)
    nexteast  = moveeast(east, south)
    nextsouth = movesouth(nexteast, south)

    eastchanged  = east  .⊻ nexteast
    southchanged = south .⊻ nextsouth
    anychanged = any(eastchanged .| southchanged)

    return (anychanged, nexteast, nextsouth)
end


# Solve Part One ---------------------------------------------------------------

# Solve it! Just keep updating the cucumber positions in a loop until they don't
# change, then report the number of rounds that took.
function part1(input)
    (east, south) = input
    rounds = 0
    anychanged = true
    while anychanged
        (anychanged, east, south) = move(east, south)
        rounds += 1
    end
    return rounds
end

I bet that trip from one edge of the transit zone (?) to the other is wild! Good thing these space-warping ocean currents don’t seem to affect submarines, or we’d be in trouble…

Part Two – Reflection

If you haven’t heard, Day 25 – Part Two is always the “go back and finish the rest of the puzzles” challenge, so no extra coding needed here. Instead, I’d like to use this space to reflect on the Advent of Code puzzles for this year and my experience with them.

This is my second year with Advent of Code, as I was introduced to it last year by a friend on a community Slack channel. Just like last year, this experience created a lot of joy, frustration, annoyance, elation, and satisfaction for me, which is typical of my experience with good puzzles. Since last year, I’ve spent a lot more time honing my knowledge of algorithms and data structures, partly as a way to scratch the itch that AoC helped to create, and partly as a way of sharing this love of puzzles with others. You see, in that community Slack channel I mentioned, some friends and I have starting hosting regular meetups aimed at both spending some time solving puzzles and helping some of the folks who are newer to our community practice live-coding for technical interviews. It’s been a blast, and I look forward to continuing into the new year. (Here’s the GitHub repo if you’re curious about our meetups, anyone is welcome to join).

This year came with it’s own set of challenges, from figuring out how to see family during the holidays (thanks Omicron!) to navigating my own bout of non-COVID related illness (I’m fine, but those were an unpleasant few days). There were a few days of puzzles that set me back on my “solve AoC > write blog post” daily cycle, and I had to consider my holiday priorities as a result. In the end, since this blog post isn’t going up until a few days into January, you can probably guess how that worked out. I really enjoy solving these puzzles, and I can get a bit obsessive when I’ve got an unsolved problem lying around in my head, but time with my family is just too precious to short-change. That’s an area I can get better in, to be honest. Weird how coding puzzles can teach life lessons, too.

I’ve been working through the AoC puzzles in Julia this year, and I’m not sure if it’s because of how similar some of the concepts are to R (my most proficient language), the amount of practice I’ve put in over the year, or just general growth as a programmer, but I found thinking through these puzzles in Julia to be a pretty seamless experience. When I did this with Rust, there was definitely a learning curve related to getting the code to do the things I had in my head, but it just seemed to flow a lot more smoothly in Julia. There’s a lot of really ergonomic syntax in Julia, enabled in part by it being a dynamic language, but I don’t think that’s the whole story. My recent time in Rust encouraged me to add Types in lots of places that, according to the Julia documentation and other talks/articles I’ve seen and read, I really didn’t have to and probably shouldn’t have. Frankly, I like Types, and I find being explicit about data types in my code tends to make it easier for me to reason about and clarifies what I did when I come back to it later. In particular, I found the syntax for single-line function assignment, multiple dispatch, native multidimensional array support, and the breadth and diversity of the Base package to be some of the really compelling features of Julia. The speed gains over R and Python (those operations not directly backed by Rust or C or FORTRAN) are significant, but not always as significant as I would have hoped. There’s probably a lot of room for me to improve in optimizing Julia code, though, so I suspect I’ve left a lot of performance on the table. Guess this means I’ll have to keep writing Julia code, then. As a next step, I’d like to get familiar with the Dataframes.jl and Makie.jl (data frames and plotting) packages, since that’s the kind of work I most often tend to do for my day job. Who knows, maybe Julia will inspire me to finally dabble in a bit of machine learning? I hear that’s a pretty hot topic…

I’ve really enjoyed writing this blog series. As usual, explaining how I did things (even to what I imagine to be, like, five random people on the internet who stumble across this site or these articles wherever they get posted) really helped me to clarify, correct, refactor, and update my own understanding. I recommend coding puzzles regularly to new and aspiring programmers, but I’d also recommend finding someone who could benefit from your knowledge (whatever level you feel like you’re at) and giving teaching/explaining a go. Communication skills are important for growing your career (and being a good human), and I feel like teaching, even a little, is a great way to grow those skills and your understanding of the topic.

Just like last year, I’d like to send out a huge thank you to Eric Wastl and his crew of helpers for putting this event/phenomenon on and keeping it running. If you can, please consider donating to this effort to help keep the magic going, I guarantee there will be a lot of coders (215,955 completed at least one puzzle this year, as of the time of writing) who will be glad you did.

If you’re interested, you can find all the code from this blog series on GitHub, please feel free to submit a pull request if you’re really bored. Happy Holidays!

Advent of Code 2021 – Day 25

By: Julia on Eric Burden

Re-posted from: https://ericburden.work/blog/2022/01/06/advent-of-code-2021-day-25/

Day 25 – Sea Cucumber

Find the problem description HERE.

The Input – Echinodermic Perambulation

# No fancy data structures this time, just two `BitMatrix`, one representing the
# locations of east-bound cucumbers, and one representing the locations of
# sounth-bound cucumbers, on a 2D grid. Can you tell I'm ready to be done?
function ingest(path)
    eastlines  = []
    southlines = []
    open(path) do f
        for line in readlines(f)
            chars = collect(line)
            push!(eastlines,  chars .== '>')
            push!(southlines, chars .== 'v')
        end
    end
    eastmatrix  = reduce(hcat, eastlines)  |> transpose |> BitMatrix
    southmatrix = reduce(hcat, southlines) |> transpose |> BitMatrix
    return (eastmatrix, southmatrix)
end

Ah, the tried and true ‘parse the grid to a BitMatrix’ approach! With two matrices!

Part One – The Cucumbers Go Marching


# Helper Functions -------------------------------------------------------------

# For checking on the total state of cucumber migration.
function pprint(east::BitMatrix, south::BitMatrix)
    template = fill('.', size(east))
    template[east]  .= '>'
    template[south] .= 'v'
    for row in eachrow(template)
        for char in row
            print(char)
        end
        println()
    end
end

# Move all the east-bound cucumbers that can be moved
function moveeast(east::BitMatrix, south::BitMatrix)
    occupied   = east .| south      # Spaces occupied by any type of cucumber
    nextstate  = falses(size(east)) # An empty BitMatrix, same size as `east`
    ncols      = size(east, 2)      # Number of columns in the grid
    colindices = collect(1:ncols)   # Vector of column indices

    # For each pair of column index and column index + 1 (wrapped around due
    # to space-defying ocean currents)...
    for (currcol, nextcol) in zip(colindices, circshift(colindices, -1))
        # Identify the cucumbers who moved in `currcol` by the empty spaces
        # in `nextcol`
        moved = east[:,currcol]  .& .!occupied[:,nextcol]

        # Identify the cucumbers who waited in `currcol` by the occupied
        # spaces in `nextcol`
        stayed = east[:,currcol] .& occupied[:,nextcol]

        # Place the updated cucumber positions into `nextstate`
        nextstate[:,nextcol] .|= moved
        nextstate[:,currcol] .|= stayed
    end

    return nextstate
end

# Move all the south-bound cucumbers than can be moved
function movesouth(east::BitMatrix, south::BitMatrix)
    occupied   = east .| south       # Spaces occupied by any type of cucumber
    nextstate  = falses(size(south)) # An empty BitMatrix, same size as `south`
    nrows      = size(south, 1)      # Number of rows in the grid
    rowindices = collect(1:nrows)    # Vector of row indices

    # For each pair of row index and row index + 1 (wrapped around due to
    # intradimensional currents)...
    for (currrow, nextrow) in zip(rowindices, circshift(rowindices, -1))
        # Identify the cucumbers who moved in `currrow` by the empty spaces
        # in `nextrow`
        moved = south[currrow,:] .& .!occupied[nextrow,:]

        # Identify the cucumbers who waited in `currrow` by the occupied
        # spaces in `nextrow`
        stayed = south[currrow,:] .& occupied[nextrow,:]

        # Place the updated cucumber positions into `nextstate`
        nextstate[nextrow,:] .|= moved
        nextstate[currrow,:] .|= stayed
    end

    return nextstate
end

# Move the herd(s)! Get the next state for east-bound and south-bound cucumbers,
# check to see if any of their positions changed, and return the results
function move(east::BitMatrix, south::BitMatrix)
    nexteast  = moveeast(east, south)
    nextsouth = movesouth(nexteast, south)

    eastchanged  = east  .⊻ nexteast
    southchanged = south .⊻ nextsouth
    anychanged = any(eastchanged .| southchanged)

    return (anychanged, nexteast, nextsouth)
end


# Solve Part One ---------------------------------------------------------------

# Solve it! Just keep updating the cucumber positions in a loop until they don't
# change, then report the number of rounds that took.
function part1(input)
    (east, south) = input
    rounds = 0
    anychanged = true
    while anychanged
        (anychanged, east, south) = move(east, south)
        rounds += 1
    end
    return rounds
end

I bet that trip from one edge of the transit zone (?) to the other is wild! Good thing these space-warping ocean currents don’t seem to affect submarines, or we’d be in trouble…

Part Two – Reflection

If you’re interested, you can find all the code from this blog series on GitHub, please feel free to submit a pull request if you’re really bored. Happy Holidays!

juliabloggers.com

A Julia Language Blog Aggregator

Category Archives: Julia

Fitting Mixed Effects Models – Python, Julia or R?

The Setup

Mixed Effects Models in R with lme4

Mixed Effects Models Python with statsmodels

Mixed Effects Models in Julia with MixedModels.jl

Timing Results

What About Bigger Data and More Groups?

R

Julia

Big Data Results

Conclusion

Advent of Code 2021 – Day 25

Day 25 – Sea Cucumber

The Input – Echinodermic Perambulation

Part One – The Cucumbers Go Marching

Part Two – Reflection

Advent of Code 2021 – Day 25

Day 25 – Sea Cucumber

The Input – Echinodermic Perambulation

Part One – The Cucumbers Go Marching

Part Two – Reflection