Category Archives: Julia

Advent of Code 2021 – Day 22

Re-posted from: https://www.ericburden.work/blog/2022/01/01/advent-of-code-2021-day-22/

It’s that time of year again! Just like last year, I’ll be posting my solutions to the Advent of Code puzzles. This year, I’ll be solving the puzzles in Julia. I’ll post my solutions and code to GitHub as well. I had a blast with AoC last year, first in R, then as a set of exercises for learning Rust, so I’m really excited to see what this year’s puzzles will bring. If you haven’t given AoC a try, I encourage you to do so along with me!

Day 22 – Reactor Reboot

Find the problem description HERE.

The Input – Cubic Impressions

The input for today’s puzzle isn’t super complicated, consisting of lines that start with the word ‘on’ or ‘off’, followed by a string in the format of “x=..,y=..,z=..”, where the values are positive or negative numbers that represent the bounds of a cube. So, with a function that can parse one line, we can parse all the lines.


# Data Structures --------------------------------------------------------------

# We'll represent each of the cubes of lights as a set of x, y, and z-ranges
# that encompass all the points of the cube
struct Cube
    x::UnitRange{Int64}
    y::UnitRange{Int64}
    z::UnitRange{Int64}
end

# Constructor that ensures that the first number in the x, y, or z-range is 
# the smaller number.
function Cube(x1, x2, y1, y2, z1, z2) 
    if x1 > x2; (x1, x2) = (x2, x1); end
    if y1 > y2; (y1, y2) = (y2, y1); end
    if z1 > z2; (z1, z2) = (z2, z1); end
    Cube(x1:x2, y1:y2, z1:z2)
end

# Constructor that takes a line from the input and produces a `Cube`
function Cube(s::AbstractString)
    re = r".*x=(-?\d+)\.\.(-?\d+),y=(-?\d+)\.\.(-?\d+),z=(-?\d+)\.\.(-?\d+)"
    bounds = parse.(Int, match(re, s).captures)
    return Cube(bounds...)
end

# An `Instruction` encompasses a `Cube` and whether the instruction is to turn
# 'on' all the lights in that cube (state == True) or to turn the lights 'off'
struct Instruction
    state::Bool
    cube::Cube
end

# Constructor that takes a line from the input and returns an `Instruction`
function Instruction(s::AbstractString)
    re = r"^(on|off) (.*)$"
    (statestr, cubestr) = match(re, s).captures
    state = statestr == "on"
    cube = Cube(cubestr)
    return Instruction(state, cube)
end


# Ingest the Data -------------------------------------------------------------

# Parse the input file into a list of `Instruction`s
function ingest(path)
    return open(path) do f
        [Instruction(l) for l in readlines(f)]
    end
end

We’ll keep the parsing simple by representing each cube as a set of ranges bounded by the values in the input. Based on how large some of the values in the input are, we won’t be able to keep track of every point.

Part One – Minicube

Time to reboot the submarine! As we all know, most every technological problem is solved with a reboot. In fact, if we can figure this one out, we should be good for Days 23-25 as well! And, as it turns out, we were overly pessimistic when parsing the input. We actually only care about the portions of the instruction cubes in the range from -50 -> 50, so we can just keep track of which lights are on in a 3D Array. This will be easier than we thought!

# Helper Functions -------------------------------------------------------------

# It is often convenient to get the upper and lower bounds of the 
# x, y, and z-ranges that comprise a `Cube`
function bounds(cube::Cube)
    (x1, x2) = (cube.x.start, cube.x.stop)
    (y1, y2) = (cube.y.start, cube.y.stop)
    (z1, z2) = (cube.z.start, cube.z.stop)
    return (x1, x2, y1, y2, z1, z2)
end

# Check whether a `Cube` overlaps the region being considered for the 
# initialization sequence.
function overlap(lights::BitArray, cube::Cube)
    (x1, x2, y1, y2, z1, z2) = bounds(cube)
    (lightx, lighty, lightz) = size(lights)
    return (x1 <= lightx && x2 >= 1
            && y1 <= lighty && y2 >= 1
            && z1 <= lightz && z2 >= 1)
end

# Shift a cube so that all its dimensions are positive and have at least a
# a value of 1, so that the coordinates map cleanly to an `Array`.
function displace(cube::Cube, offset::Int64)
    cubebounds = map(c -> c + offset, bounds(cube))
    return Cube(cubebounds...)
end

# Apply an instruction (turning lights on or off) to the lighted region. Just
# sets positions in `lignts` bounded by the coordinates of the instruction's
# `Cube` to either 1 or 0, depending on the type of instruction. Displaces
# the `Cube` before making this change.
function apply!(lights::BitArray, instruction::Instruction, offset = 51)
    (lightx, lighty, lightz) = size(lights)
    cube = displace(instruction.cube, offset)
    overlap(lights, cube) || return
    xrange = intersect(1:lightx, cube.x)
    yrange = intersect(1:lighty, cube.y)
    zrange = intersect(1:lightz, cube.z)
    lights[xrange, yrange, zrange] .= instruction.state
end


# Solve Part One ---------------------------------------------------------------

# Simple, straightforward; represent the target region as a 3D `BitArray` and 
# turn on/off the lights according to the instructions, in order. We use 
# `displace()` to adjust the coordinates of each `Cube` so that its bounds fall
# within the bounds of `lignts`.
function part1(input)
    lights = falses(101, 101, 101)
    foreach(cube -> apply!(lights, cube), input)
    return count(lights)
end

On second thought, yesterday’s puzzle has me a bit paranoid. Is it possible that this part was too easy…

Part Two – Rubik’s Wonderland

sigh Of course it was too good to be true. I did try to create a 3D Array large enough to accommodate the entire range, but it wouldn’t fit into my laptop’s memory. I also tried processing the instructions in chunks, which would probably work, but it was going to take forever (approximately). So, it was back to the drawing board for some shudder geometry. I ended up drawing a diagram for myself to help work out the ‘shattering’ rules (see the code below).

Diagram of a cube and the cubic zones surrounding it

# Helper Functions -------------------------------------------------------------

# Check whether two cubes overlap
function overlap(a::Cube, b::Cube)
    (ax1, ax2, ay1, ay2, az1, az2) = bounds(a)
    (bx1, bx2, by1, by2, bz1, bz2) = bounds(b)
    return (   ax1 <= bx2 && ax2 >= bx1
            && ay1 <= by2 && ay2 >= by1
            && az1 <= bz2 && az2 >= bz1)
end

# Check whether the `Cube`s from two `Instruction`s overlap
overlap(a::Instruction, b::Instruction) = overlap(a.cube, b.cube)

# Mostly just used for sanity checking while developing the `shatter()` function
# below, but it's interesting enought to keep around. Returns a `Cube` 
# representing the portions of `a` and `b` that overlap.
function Base.intersect(a::Cube, b::Cube)
    overlap(a, b) || return Cube(0:0, 0:0, 0:0)
    (ax1, ax2, ay1, ay2, az1, az2) = bounds(a)
    (bx1, bx2, by1, by2, bz1, bz2) = bounds(b)
    x = max(ax1, bx1):min(ax2, bx2)
    y = max(ay1, by1):min(ay2, by2)
    z = max(az1, bz1):min(az2, bz2)
    return Cube(x, y, z)
end

# Volume of a cube. Tricky, right? Yeah, well, *I* mistakenly corrected for 
# Julia's 1-indexing by adding one to each of these lengths and spent _way_
# too long debugging the wrong thing before I realized what I'd done...
function volume(cube::Cube)
    xlen = length(cube.x)
    ylen = length(cube.y)
    zlen = length(cube.z)
    return xlen * ylen * zlen
end


# The next set of functions are related to functionality to break apart one
# `Cube` around another `Cube`. This depends on creating new `Cubes` for each
# "zone" around the smaller `Cube`, 8 zones that align with the corners of the
# smaller `Cube`, 6 zones that align with the faces of the smaller `Cube`, and
# 12 zones that align with the edges of the smaller `Cube`. This way, if an 
# 'off' `Instruction` has a cube that overlaps a `Cube` of lignts already on,
# we can 'cut out' the `Cube` that gets turned off from the `Cube` that's 
# already on, replacing the remaining portions of the 'on' `Cube` with smaller
# `Cube`s and removing the overlapping section.

# For "face" and "edge" zones, for each dimension, there are four possible
# ranges that can be taken based on the relative coordinates of the `target`
# and `template`. Arguments are the min/max coordinates of `target` and 
# `template` (from `shatter()`) in a single dimension. I find it easier to
# imagine these values as line endpoints on a number line, if one line
# stretches from D1>D2 and the other from d1>d2.
function middlerange(D1, D2, d1, d2)
    D1 <= d1 && D2 >= d2 && return d1:d2
    D1 > d1  && D2 >= d2 && return D1:d2
    D1 <= d1 && D2 < d2  && return d1:D2
    D1 > d1  && D2 < d2  && return D1:D2
    return nothing
end

# Break a `target` cube into up to 27 segments, based on the location of 
# template. I find it easiest to image that every corner of `template` that
# lies inside `target` emits a line in each direction, carving up `target`. 
# Returns all the pieces _except_ for the piece that intersects with the
# `template`.
function shatter(target::Cube, template::Cube)
    # I'm using capital letters for the `target`, and lowercase letters for
    # the `template` `Cube`.
    (X1, X2, Y1, Y2, Z1, Z2) = bounds(target)
    (x1, x2, y1, y2, z1, z2) = bounds(template)

    pieces = Vector{Cube}()

    # These checks are used to determine if any of the points of the `target`
    # cube stray into one of the 26 cubic 'zones' adjacent to the `template` 
    # cube. This includes 12 "edge" zones, 8 "corner" zones, and six "face"
    # zones.
    xoffsets = (X1 < x1, X1 <= x2 || x1 <= X2, x2 < X2)
    yoffsets = (Y1 < y1, Y1 <= y2 || y1 <= Y2, y2 < Y2)
    zoffsets = (Z1 < z1, Z1 <= z2 || z1 <= Z2, z2 < Z2)

    # The ranges to use to construct the pieces. For each axis: the range when 
    # the `target` extends past the `template` in the negative (xyz)-direction,
    # when the `target` occupies the same (xyz)-range as the `template`, and 
    # when the `target` extends past the `template` in the positive direction
    xranges = (X1:x1-1, middlerange(X1, X2, x1, x2), x2+1:X2)
    yranges = (Y1:y1-1, middlerange(Y1, Y2, y1, y2), y2+1:Y2)
    zranges = (Z1:z1-1, middlerange(Z1, Z2, z1, z2), z2+1:Z2)
    
    # Generate the pieces
    for (a, b, c) in Iterators.product(1:3, 1:3, 1:3)
        (a, b, c) == (2, 2, 2) && continue # Skip the `template` range
        if xoffsets[a] && yoffsets[b] && zoffsets[c]
            push!(pieces, Cube(xranges[a], yranges[b], zranges[c]))
        end
    end
    
    # Sanity check, ensures that the volume of the pieces returned + the volume
    # removed is equal to the volume of the `target` `Cube`. Not needed to run
    # the solution, but it was handy for debugging.
    # @assert sum(volume.(pieces)) + volume(intersect(template, target)) == volume(target)

    return pieces
end

# Convenience function to `shatter()` the cubes held by two `Instruction`s. 
# Returns a list of `Instruction`s containing each of the pieces.
function shatter(target::Instruction, template::Instruction)
    newcubes = shatter(target.cube, template.cube)
    return Instruction.(target.state, newcubes)
end

# Given an iterable containing `Instruction`s, add up the volume of all the
# `Instruction`s that turn lignts on.
function countlights(instructionset)
    lightson = 0
    for instruction in instructionset
        !instruction.state && continue
        lightson += volume(instruction.cube)
    end
    return lightson
end


# Solve Part Two ---------------------------------------------------------------

# Add each instruction to a set of final instructions. As each instruction is
# added, have it `shatter()` any `Instruction` it overlaps with before adding
# it to the set of final instructions, thereby replacing the overlapping
# region with itself. This means that instructions to turn lights off will 
# "overwrite" regions that turn lights on, and instructions that turn lights on
# dont' get double-counted.
function part2(input)
    finalinstructions = Set{Instruction}()
    for instrin in input
        for instrout in finalinstructions
            overlap(instrin, instrout) || continue # No overlaps? No problem!

            newinstrs = shatter(instrout, instrin)
            delete!(finalinstructions, instrout)

            # `shatter()` will return an empty list if `target` fits completely
            # inside `template`, so there's no parts of `target` to add back
            # after shattering in that case.
            isempty(newinstrs) && continue
            push!(finalinstructions, newinstrs...)
        end

        # Only need to store the 'on' `Instruction`s. 
        instrin.state && push!(finalinstructions, instrin)
    end

    return countlights(finalinstructions)
end

Somehow, it’s always math, isn’t it? I’m pretty satisfied with this approach, though. There’s possibly a way to do this without needing to explicitly turn off cubes, if we worked from the end of the list backwards and only turned on the ones that were going to stay lit, but that seems complicated and I’m happy enough with this solution.

Wrap Up

Oh, AoC, when will I learn that I should never relax when part one seems pretty easy compared to previous days? This was a somewhat time-consuming puzzle, but I’m really happy with the way shatter() worked out. That moment when I realized I should be able to sum the volumes of all the pieces to get the original volume? Should have been obvious? Yes. Was it still a great feeling? Absolutely! Today leaned heavily on Julia’s destructuring functionality, and provided some great practice there, so I’m going to call today a win-win!

If you found a different solution (or spotted a mistake in one of mine), please drop me a line!

Advent of Code 2021 – Day 22

By: Julia on Eric Burden

Re-posted from: https://ericburden.work/blog/2022/01/01/advent-of-code-2021-day-22/

Day 22 – Reactor Reboot

Find the problem description HERE.

The Input – Cubic Impressions


# Data Structures --------------------------------------------------------------

# We'll represent each of the cubes of lights as a set of x, y, and z-ranges
# that encompass all the points of the cube
struct Cube
    x::UnitRange{Int64}
    y::UnitRange{Int64}
    z::UnitRange{Int64}
end

# Constructor that ensures that the first number in the x, y, or z-range is 
# the smaller number.
function Cube(x1, x2, y1, y2, z1, z2) 
    if x1 > x2; (x1, x2) = (x2, x1); end
    if y1 > y2; (y1, y2) = (y2, y1); end
    if z1 > z2; (z1, z2) = (z2, z1); end
    Cube(x1:x2, y1:y2, z1:z2)
end

# Constructor that takes a line from the input and produces a `Cube`
function Cube(s::AbstractString)
    re = r".*x=(-?\d+)\.\.(-?\d+),y=(-?\d+)\.\.(-?\d+),z=(-?\d+)\.\.(-?\d+)"
    bounds = parse.(Int, match(re, s).captures)
    return Cube(bounds...)
end

# An `Instruction` encompasses a `Cube` and whether the instruction is to turn
# 'on' all the lights in that cube (state == True) or to turn the lights 'off'
struct Instruction
    state::Bool
    cube::Cube
end

# Constructor that takes a line from the input and returns an `Instruction`
function Instruction(s::AbstractString)
    re = r"^(on|off) (.*)$"
    (statestr, cubestr) = match(re, s).captures
    state = statestr == "on"
    cube = Cube(cubestr)
    return Instruction(state, cube)
end


# Ingest the Data -------------------------------------------------------------

# Parse the input file into a list of `Instruction`s
function ingest(path)
    return open(path) do f
        [Instruction(l) for l in readlines(f)]
    end
end

Part One – Minicube

# Helper Functions -------------------------------------------------------------

# It is often convenient to get the upper and lower bounds of the 
# x, y, and z-ranges that comprise a `Cube`
function bounds(cube::Cube)
    (x1, x2) = (cube.x.start, cube.x.stop)
    (y1, y2) = (cube.y.start, cube.y.stop)
    (z1, z2) = (cube.z.start, cube.z.stop)
    return (x1, x2, y1, y2, z1, z2)
end

# Check whether a `Cube` overlaps the region being considered for the 
# initialization sequence.
function overlap(lights::BitArray, cube::Cube)
    (x1, x2, y1, y2, z1, z2) = bounds(cube)
    (lightx, lighty, lightz) = size(lights)
    return (x1 <= lightx && x2 >= 1
            && y1 <= lighty && y2 >= 1
            && z1 <= lightz && z2 >= 1)
end

# Shift a cube so that all its dimensions are positive and have at least a
# a value of 1, so that the coordinates map cleanly to an `Array`.
function displace(cube::Cube, offset::Int64)
    cubebounds = map(c -> c + offset, bounds(cube))
    return Cube(cubebounds...)
end

# Apply an instruction (turning lights on or off) to the lighted region. Just
# sets positions in `lignts` bounded by the coordinates of the instruction's
# `Cube` to either 1 or 0, depending on the type of instruction. Displaces
# the `Cube` before making this change.
function apply!(lights::BitArray, instruction::Instruction, offset = 51)
    (lightx, lighty, lightz) = size(lights)
    cube = displace(instruction.cube, offset)
    overlap(lights, cube) || return
    xrange = intersect(1:lightx, cube.x)
    yrange = intersect(1:lighty, cube.y)
    zrange = intersect(1:lightz, cube.z)
    lights[xrange, yrange, zrange] .= instruction.state
end


# Solve Part One ---------------------------------------------------------------

# Simple, straightforward; represent the target region as a 3D `BitArray` and 
# turn on/off the lights according to the instructions, in order. We use 
# `displace()` to adjust the coordinates of each `Cube` so that its bounds fall
# within the bounds of `lignts`.
function part1(input)
    lights = falses(101, 101, 101)
    foreach(cube -> apply!(lights, cube), input)
    return count(lights)
end

On second thought, yesterday’s puzzle has me a bit paranoid. Is it possible that this part was too easy…

Part Two – Rubik’s Wonderland

Diagram of a cube and the cubic zones surrounding it

# Helper Functions -------------------------------------------------------------

# Check whether two cubes overlap
function overlap(a::Cube, b::Cube)
    (ax1, ax2, ay1, ay2, az1, az2) = bounds(a)
    (bx1, bx2, by1, by2, bz1, bz2) = bounds(b)
    return (   ax1 <= bx2 && ax2 >= bx1
            && ay1 <= by2 && ay2 >= by1
            && az1 <= bz2 && az2 >= bz1)
end

# Check whether the `Cube`s from two `Instruction`s overlap
overlap(a::Instruction, b::Instruction) = overlap(a.cube, b.cube)

# Mostly just used for sanity checking while developing the `shatter()` function
# below, but it's interesting enought to keep around. Returns a `Cube` 
# representing the portions of `a` and `b` that overlap.
function Base.intersect(a::Cube, b::Cube)
    overlap(a, b) || return Cube(0:0, 0:0, 0:0)
    (ax1, ax2, ay1, ay2, az1, az2) = bounds(a)
    (bx1, bx2, by1, by2, bz1, bz2) = bounds(b)
    x = max(ax1, bx1):min(ax2, bx2)
    y = max(ay1, by1):min(ay2, by2)
    z = max(az1, bz1):min(az2, bz2)
    return Cube(x, y, z)
end

# Volume of a cube. Tricky, right? Yeah, well, *I* mistakenly corrected for 
# Julia's 1-indexing by adding one to each of these lengths and spent _way_
# too long debugging the wrong thing before I realized what I'd done...
function volume(cube::Cube)
    xlen = length(cube.x)
    ylen = length(cube.y)
    zlen = length(cube.z)
    return xlen * ylen * zlen
end


# The next set of functions are related to functionality to break apart one
# `Cube` around another `Cube`. This depends on creating new `Cubes` for each
# "zone" around the smaller `Cube`, 8 zones that align with the corners of the
# smaller `Cube`, 6 zones that align with the faces of the smaller `Cube`, and
# 12 zones that align with the edges of the smaller `Cube`. This way, if an 
# 'off' `Instruction` has a cube that overlaps a `Cube` of lignts already on,
# we can 'cut out' the `Cube` that gets turned off from the `Cube` that's 
# already on, replacing the remaining portions of the 'on' `Cube` with smaller
# `Cube`s and removing the overlapping section.

# For "face" and "edge" zones, for each dimension, there are four possible
# ranges that can be taken based on the relative coordinates of the `target`
# and `template`. Arguments are the min/max coordinates of `target` and 
# `template` (from `shatter()`) in a single dimension. I find it easier to
# imagine these values as line endpoints on a number line, if one line
# stretches from D1>D2 and the other from d1>d2.
function middlerange(D1, D2, d1, d2)
    D1 <= d1 && D2 >= d2 && return d1:d2
    D1 > d1  && D2 >= d2 && return D1:d2
    D1 <= d1 && D2 < d2  && return d1:D2
    D1 > d1  && D2 < d2  && return D1:D2
    return nothing
end

# Break a `target` cube into up to 27 segments, based on the location of 
# template. I find it easiest to image that every corner of `template` that
# lies inside `target` emits a line in each direction, carving up `target`. 
# Returns all the pieces _except_ for the piece that intersects with the
# `template`.
function shatter(target::Cube, template::Cube)
    # I'm using capital letters for the `target`, and lowercase letters for
    # the `template` `Cube`.
    (X1, X2, Y1, Y2, Z1, Z2) = bounds(target)
    (x1, x2, y1, y2, z1, z2) = bounds(template)

    pieces = Vector{Cube}()

    # These checks are used to determine if any of the points of the `target`
    # cube stray into one of the 26 cubic 'zones' adjacent to the `template` 
    # cube. This includes 12 "edge" zones, 8 "corner" zones, and six "face"
    # zones.
    xoffsets = (X1 < x1, X1 <= x2 || x1 <= X2, x2 < X2)
    yoffsets = (Y1 < y1, Y1 <= y2 || y1 <= Y2, y2 < Y2)
    zoffsets = (Z1 < z1, Z1 <= z2 || z1 <= Z2, z2 < Z2)

    # The ranges to use to construct the pieces. For each axis: the range when 
    # the `target` extends past the `template` in the negative (xyz)-direction,
    # when the `target` occupies the same (xyz)-range as the `template`, and 
    # when the `target` extends past the `template` in the positive direction
    xranges = (X1:x1-1, middlerange(X1, X2, x1, x2), x2+1:X2)
    yranges = (Y1:y1-1, middlerange(Y1, Y2, y1, y2), y2+1:Y2)
    zranges = (Z1:z1-1, middlerange(Z1, Z2, z1, z2), z2+1:Z2)
    
    # Generate the pieces
    for (a, b, c) in Iterators.product(1:3, 1:3, 1:3)
        (a, b, c) == (2, 2, 2) && continue # Skip the `template` range
        if xoffsets[a] && yoffsets[b] && zoffsets[c]
            push!(pieces, Cube(xranges[a], yranges[b], zranges[c]))
        end
    end
    
    # Sanity check, ensures that the volume of the pieces returned + the volume
    # removed is equal to the volume of the `target` `Cube`. Not needed to run
    # the solution, but it was handy for debugging.
    # @assert sum(volume.(pieces)) + volume(intersect(template, target)) == volume(target)

    return pieces
end

# Convenience function to `shatter()` the cubes held by two `Instruction`s. 
# Returns a list of `Instruction`s containing each of the pieces.
function shatter(target::Instruction, template::Instruction)
    newcubes = shatter(target.cube, template.cube)
    return Instruction.(target.state, newcubes)
end

# Given an iterable containing `Instruction`s, add up the volume of all the
# `Instruction`s that turn lignts on.
function countlights(instructionset)
    lightson = 0
    for instruction in instructionset
        !instruction.state && continue
        lightson += volume(instruction.cube)
    end
    return lightson
end


# Solve Part Two ---------------------------------------------------------------

# Add each instruction to a set of final instructions. As each instruction is
# added, have it `shatter()` any `Instruction` it overlaps with before adding
# it to the set of final instructions, thereby replacing the overlapping
# region with itself. This means that instructions to turn lights off will 
# "overwrite" regions that turn lights on, and instructions that turn lights on
# dont' get double-counted.
function part2(input)
    finalinstructions = Set{Instruction}()
    for instrin in input
        for instrout in finalinstructions
            overlap(instrin, instrout) || continue # No overlaps? No problem!

            newinstrs = shatter(instrout, instrin)
            delete!(finalinstructions, instrout)

            # `shatter()` will return an empty list if `target` fits completely
            # inside `template`, so there's no parts of `target` to add back
            # after shattering in that case.
            isempty(newinstrs) && continue
            push!(finalinstructions, newinstrs...)
        end

        # Only need to store the 'on' `Instruction`s. 
        instrin.state && push!(finalinstructions, instrin)
    end

    return countlights(finalinstructions)
end

Wrap Up

If you found a different solution (or spotted a mistake in one of mine), please drop me a line!

News features in DataFrames.jl 1.3: part 4

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/12/31/leftjoin.html

Introduction

This post continues the presentation of new features added in DataFrames.jl 1.3. This time I will discuss the leftjoin! function.

The post was written under Julia 1.7.0 and DataFrames.jl 1.3.1.
For performance comparison I have used R 4.1.2 and data.table 0.14.2.
Both in R and Julia I run the computations on 4 threads.

An in place left join

Before release 1.3, DataFrames.jl already offered a rich set of efficient
join functions:
innerjoin, leftjoin, rightjoin, outerjoin, semijoin, antijoin,
and crossjoin.

However, they all have a common limitation: their result is a freshly allocated
data frame.

A common usage scenario in practice is that we would like to add some
new columns to an existing table in-place. This is more efficient and uses less
memory (which is relevant if we work with very large data frames).

Since DataFrames.jl 1.3 this option is available with the addition of the
leftjoin! function. if you run leftjoin!(df1, df2; on=...) then the contract
is that the df1 data frame is updated in-place with columns coming from df2
based on matching rows of both data frames using the columns passed in the on
keyword argument.

It is important to remember that the design of leftjoin! assumes that the
columns of df1 are left unchanged (this is crucial for performance of the
operation). However, this implies that each row in df1 must have at most one
match in df2. Otherwise, leftjoin! would not be able to execute the
operation in-place since new rows would need to be added to df1. If you have
matching duplicate rows in df2 then just use leftjoin.

Here are two minimal examples of leftjoin!.

julia> using DataFrames

julia> using Random

julia> df1 = DataFrame(a=1:6, b=1:6)
6×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      4
   5 │     5      5
   6 │     6      6

julia> df2 = DataFrame(a=[2, 4, 6], c=1:3)
3×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     4      2
   3 │     6      3

julia> leftjoin!(df1, df2, on=:a)
6×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64?
─────┼───────────────────────
   1 │     1      1  missing
   2 │     2      2        1
   3 │     3      3  missing
   4 │     4      4        2
   5 │     5      5  missing
   6 │     6      6        3

julia> Random.seed!(1234);

julia> df1 = DataFrame(a=randperm(6), b=1:6)
6×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     3      1
   2 │     2      2
   3 │     6      3
   4 │     5      4
   5 │     1      5
   6 │     4      6

julia> df2 = DataFrame(a=shuffle!([2, 4, 6]), c=1:3)
3×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     4      1
   2 │     2      2
   3 │     6      3

julia> leftjoin!(df1, df2, on=:a)
6×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64?
─────┼───────────────────────
   1 │     3      1  missing
   2 │     2      2        2
   3 │     6      3        3
   4 │     5      4  missing
   5 │     1      5  missing
   6 │     4      6        1

Performance benchmarks

Now let me run two performance benchmarks of DataFrames.jl against data.table.
In the benchmarks I use 32-bit integers to ensure comparability of memory
footprint of objects between R and Julia.

The first test is on sorted key column. We start with Julia:

julia> df1 = DataFrame(a=Int32.(1:10^8));

julia> df2 = DataFrame(a=Int32.(1:10^8), x = true);

julia> @time leftjoin!(df1, df2, on=:a);
  2.867632 seconds (221.45 k allocations: 2.433 GiB, 6.47% gc time, 3.98% compilation time)

julia> df1 = DataFrame(a=Int32.(1:10^8));

julia> @time leftjoin!(df1, df2, on=:a);
  2.934633 seconds (150 allocations: 2.421 GiB, 8.19% gc time)

And now the data.table:

> library(data.table)
> df1 = data.table(a=1:10^8)
> df2 = data.table(a=1:10^8, x=TRUE)
> system.time(df1[df2, on = 'a', x := i.x])
   user  system elapsed
  8.067   1.106   5.679
> df1 = data.table(a=1:10^8)
> system.time(df1[df2, on = 'a', x := i.x])
   user  system elapsed
  9.305   1.184   6.652

As you can see for sorted data DataFrames.jl timings are competitive. Let us
now check shuffled data.

We start with DataFrames.jl:

julia> df1 = DataFrame(a=shuffle!(Int32.(1:10^8)));

julia> df2 = DataFrame(a=shuffle!(Int32.(1:10^8)), x = true);

julia> @time leftjoin!(df1, df2, on=:a);
 23.881552 seconds (175 allocations: 3.167 GiB, 1.43% gc time)

julia> df1 = DataFrame(a=Int32.(1:10^8));

julia> @time leftjoin!(df1, df2, on=:a);
 18.909113 seconds (175 allocations: 3.167 GiB, 1.40% gc time)

and now the timing for data.table:

> df1 = data.table(a=sample(1:10^8))
> df2 = data.table(a=sample(1:10^8), x=TRUE)
> system.time(df1[df2, on = 'a', x := i.x])
   user  system elapsed
 30.778   1.791  23.153
> df1 = data.table(a=sample(1:10^8))
> system.time(df1[df2, on = 'a', x := i.x])
   user  system elapsed
 30.586   1.695  22.893

Again the timing of DataFrames.jl is competitive.

(let me stress here that this is just one set of examples and relative
performance of different packages can vary depending on the data and the
operating environment; the point of these tests is to show that currently
DataFrmes.jl is not much worse than data.table in execution of joins, as this
was a performance bottleneck of DataFrames.jl in the past)

Conclusions

This time in the post I have focused on a single function: the leftjoin!.
The reason is that I believe that addition of an in-place left join to
DataFrames.jl is quite significant since it is needed in many data processing
scenarios, especially when working with large tables.

juliabloggers.com

A Julia Language Blog Aggregator

Category Archives: Julia

Advent of Code 2021 – Day 22

Day 22 – Reactor Reboot

The Input – Cubic Impressions

Part One – Minicube

Part Two – Rubik’s Wonderland

Wrap Up

Advent of Code 2021 – Day 22

Day 22 – Reactor Reboot

The Input – Cubic Impressions

Part One – Minicube

Part Two – Rubik’s Wonderland

Wrap Up

News features in DataFrames.jl 1.3: part 4

Introduction

An in place left join

Performance benchmarks

Conclusions