Author Archives: Blog by Bogumił Kamiński

Separating a column into multiple columns in DataFrames.jl

Re-posted from: https://bkamins.github.io/julialang/2020/09/04/separate.html

Introduction

Recently several users have asked how one can split one column into several
columns in DataFrames.jl.

A functionality of this kind is provided e.g. in dplyr by the spread
function, but we are currently missing it in DataFrames.jl.

In this post I comment how to obtain the expected behavior in the current version
of DataFrames.jl (i.e. v21) with a little help of the SplitApplyCombine.jl
pakage.

The codes were tested under Julia 1.5, DataFrames.jl 0.21,
and SplitApplyCombine.jl 1.1.

Spreading a column

For our example I use the data I have found in this post about most
popular first and last name combinations in the United States:

julia> using DataFrames

julia> df = DataFrame(name=["James Smith", "Michael Smith", "Robert Smith",
                            "Maria Garcia", "David Smith", "Maria Rodriguez",
                            "Mary Smith", "Maria Hernandez", "Maria Martinez",
                            "James Johnson"],
                      freq=[38_313, 34_810, 34_269, 32_092, 31_294,
                            30_507, 28_692, 27_836, 26_956, 26_850])
10×2 DataFrame
│ Row │ name            │ freq  │
│     │ String          │ Int64 │
├─────┼─────────────────┼───────┤
│ 1   │ James Smith     │ 38313 │
│ 2   │ Michael Smith   │ 34810 │
│ 3   │ Robert Smith    │ 34269 │
│ 4   │ Maria Garcia    │ 32092 │
│ 5   │ David Smith     │ 31294 │
│ 6   │ Maria Rodriguez │ 30507 │
│ 7   │ Mary Smith      │ 28692 │
│ 8   │ Maria Hernandez │ 27836 │
│ 9   │ Maria Martinez  │ 26956 │
│ 10  │ James Johnson   │ 26850 │

We want to spread :name column into :first and :last columns holding first
and last name respectively.

Here is how one can do it using just Julia Base:

julia> tmp = split.(df.name)
10-element Array{Array{SubString{String},1},1}:
 ["James", "Smith"]
 ["Michael", "Smith"]
 ["Robert", "Smith"]
 ["Maria", "Garcia"]
 ["David", "Smith"]
 ["Maria", "Rodriguez"]
 ["Mary", "Smith"]
 ["Maria", "Hernandez"]
 ["Maria", "Martinez"]
 ["James", "Johnson"]

julia> insertcols!(df, [n => getindex.(tmp, i) for (i, n) in
                        enumerate([:first, :last])]...)
10×4 DataFrame
│ Row │ name            │ freq  │ first    │ last      │
│     │ String          │ Int64 │ SubStri… │ SubStrin… │
├─────┼─────────────────┼───────┼──────────┼───────────┤
│ 1   │ James Smith     │ 38313 │ James    │ Smith     │
│ 2   │ Michael Smith   │ 34810 │ Michael  │ Smith     │
│ 3   │ Robert Smith    │ 34269 │ Robert   │ Smith     │
│ 4   │ Maria Garcia    │ 32092 │ Maria    │ Garcia    │
│ 5   │ David Smith     │ 31294 │ David    │ Smith     │
│ 6   │ Maria Rodriguez │ 30507 │ Maria    │ Rodriguez │
│ 7   │ Mary Smith      │ 28692 │ Mary     │ Smith     │
│ 8   │ Maria Hernandez │ 27836 │ Maria    │ Hernandez │
│ 9   │ Maria Martinez  │ 26956 │ Maria    │ Martinez  │
│ 10  │ James Johnson   │ 26850 │ James    │ Johnson   │

The code is a bit verbose and uses a temporaty variable. We could have written the
second step also e.g. like this:

for (i, n) in enumerate([:first, :last])
    df[!, n] = getindex.(tmp, i)
end

but it is still quite verbose.

We can have a shorter code and avoid a temporary variable using the invert
function from SplitApplyCombine.jl:

julia> using SplitApplyCombine

julia> insertcols!(df, ([:first, :last] .=> invert(split.(df.name)))...,
                   makeunique=true)
10×6 DataFrame
│ Row │ name            │ freq  │ first    │ last      │ first_1  │ last_1    │
│     │ String          │ Int64 │ SubStri… │ SubStrin… │ SubStri… │ SubStrin… │
├─────┼─────────────────┼───────┼──────────┼───────────┼──────────┼───────────┤
│ 1   │ James Smith     │ 38313 │ James    │ Smith     │ James    │ Smith     │
│ 2   │ Michael Smith   │ 34810 │ Michael  │ Smith     │ Michael  │ Smith     │
│ 3   │ Robert Smith    │ 34269 │ Robert   │ Smith     │ Robert   │ Smith     │
│ 4   │ Maria Garcia    │ 32092 │ Maria    │ Garcia    │ Maria    │ Garcia    │
│ 5   │ David Smith     │ 31294 │ David    │ Smith     │ David    │ Smith     │
│ 6   │ Maria Rodriguez │ 30507 │ Maria    │ Rodriguez │ Maria    │ Rodriguez │
│ 7   │ Mary Smith      │ 28692 │ Mary     │ Smith     │ Mary     │ Smith     │
│ 8   │ Maria Hernandez │ 27836 │ Maria    │ Hernandez │ Maria    │ Hernandez │
│ 9   │ Maria Martinez  │ 26956 │ Maria    │ Martinez  │ Maria    │ Martinez  │
│ 10  │ James Johnson   │ 26850 │ James    │ Johnson   │ James    │ Johnson   │

In this call I have used makeunique=true as we update the df data frame in
place and it already contains :first and :last columns.

So the code is not that long, but admittedly spread in dplyr is shorter.

Before we finish let us see what the invert function produces when applied
to the tmp variable we have created above:

julia> invert(tmp)
2-element Array{Array{SubString{String},1},1}:
 ["James", "Michael", "Robert", "Maria", "David", "Maria", "Mary", "Maria",
 "Maria", "James"]
 ["Smith", "Smith", "Smith", "Garcia", "Smith", "Rodriguez", "Smith",
 "Hernandez", "Martinez", "Johnson"]

As you can see it takes a container of containers and reverses the order
of nesting.

Conclusion

If you feel that it would be good to have an in-built function in DataFrames.jl
that would do splitting of columns in a data frame please leave a comment in
this issue.

Subsetting strings in Julia using character indexing

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/08/29/charindex.html

Introduction

Two weeks ago I have written a blog post about comparison of byte and
character indexing of strings in Julia Base.

In the mean time I have answered several questions when users had to subset
a String in Julia using character indices. In this post I show a macro that
allows to do this.

This code is tested to work under Julia 1.5.

The implementation of `@char` macro

The @char macro is shown below (I hope I got all hygene right — if not
please let me know)

macro char(ex)
    if Meta.isexpr(ex, :ref) &&
        isa(ex.args[1], Union{String, Symbol}) &&
        length(ex.args) == 2

        S, i = ex.args
        i, _ = Base.replace_ref_begin_end_!(i,
            (:($Base.firstindex($S)),:($Base.length($S))))
        ex.args[2] = Expr(:(.), Base.nextind, Expr(:tuple, S, 0, i))
        return esc(ex)
    else
        throw(ArgumentError("@char macro argument must be an expression S[i]."))
    end
end

What this macro does is turning str[idx] expression from using byte indexing
to use character indexing by writing @char str[idx].
I think it is simplest to explain it using an example:

julia> "∀∃12??"[2:5]
ERROR: StringIndexError("∀∃12??", 2)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
 [2] getindex(::String, ::UnitRange{Int64}) at ./strings/string.jl:249
 [3] top-level scope at REPL[5]:1

julia> @char "∀∃12??"[2:5]
"∃12?"

julia> str = "∀∃12??"
"∀∃12??"

julia> str[2:5]
ERROR: StringIndexError("∀∃12??", 2)
Stacktrace:
 [1] string_index_err(::String, ::Int64) at ./strings/string.jl:12
 [2] getindex(::String, ::UnitRange{Int64}) at ./strings/string.jl:249
 [3] top-level scope at REPL[17]:1

julia> @char str[2:5]
"∃12?"

Let us also check that we correctly handle begin and end when indexing:

julia> idx = [2, 4]
2-element Array{Int64,1}:
 2
 4

julia> @char str[[begin, idx[begin], idx[end], end]]
"∀∃2?"

julia> @macroexpand @char str[[begin, idx[begin], idx[end], end]]
:(str[(nextind).(str, 0, [(Base).firstindex(str), idx[(firstindex)(idx)],
  idx[(lastindex)(idx)], (Base).length(str)])])

and all seems to work correctly.

Concluding comments

I have limited this macro to expect that in str[idx] expression str is a
variable name or a string literal to simplify the logic of the code (allowing
str to be a general expression would lead to a much more complex code).
I assume that in practice this should not be a severe limitation.

In terms of performance this macro does not do any optimizations of the lookup
of character index as nextind is called for each byte index separately,
so in some special cases this could be optimized.

Finally it is worth to remember that for most common cases of string subsetting
the first, last and chop functions defined in Julia Base are available
and they use character indexing.

A basic SIR model using agent-based approach in Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/08/22/sir.html

Introduction

Recently I have written a blog post about a simple agent-based model in
finance domain. This is a follow up post in preparation for
Social Simulation Week 2020 workshop on agent-based modeling using
the Julia language that will be given by Przemysław Szufel and me.

This time I have decided to present a simple SIR model implementation
using an agent-based design.

The SIR epidemic model specification

We assume that we have a finite population of n agents that live on a discrete
torus that has (xdim, ydim) size.

If you have not heard of such a model of the space then imagine it is a (xdim, ydim) chessboard whose top and bottom, as well as left and right borders are
glued together. Thus you never fall-off the board — you can travel e.g. to the
right infinitely as the space is wrapped. Here is a plot illustrating the
gluing process:

Torus from a square

and here is the end result ?:

The torus model of the space is quite popular in simple agent-based models as it
is similar to a square grid, but does not have borders nor corners.

We assume that agents independently move around our torus in a discrete time.
Some agent in time moment tick randomly chooses to move up/stay/down and
left/stay/right (so in total there are nine possibilities, one of which is to
stay in place).

In particular the above rules mean that at some moment in time several
agents can occupy the same cell. This is where the SIR epidemiological model
comes into play. We assume that there is some disease and an agent can be in one
of four states:

susceptible: S
infected: I
recovered: R
dead

There are the following rules for moving between states in our model:

if in some tick a susceptible agent is on the same cell in the grid as
an infected agent then the susceptible agent becomes infected;
infected agent stays infected for duration periods after which time
the agents becomes dead with probability pdeath and otherwise becomes
recovered;
recovered and dead agents do not change their state except that recovered
agents are still allowed to move on the grid.

Initially we assume that infected number of agents is in an infected state.
We run the simulation as long as there is at least one infected agent.

In this post I present an implementation of an agent-based simulation that
captures the dynamics of this system and allows us to track changes of its state
in time.

Implementation of the model

This post is written under Julia v1.5.0, PyPlot v2.9.0, and StatsBase v0.32.2
(normally I recommend to make sure that you have exactly the given versions of
the packages installed, but in this case I only use their basic functionality
so you can safely assume that any version should work). I assume that you use
just Julia REPL (in a terminal or in an IDE like VS Code), but all examples
should also work in Jupyter notebook.

To import the required packages write:

using StatsBase
using PyPlot

If you get an error loading the packages most likely they are not installed. In
this case add them by running:

using Pkg
Pkg.add("StatsBase")
Pkg.add("PyPlot")

Let us now move to definitions of data types that we will use in our model:

@enum AgentType agentS agentI agentR agentD

struct Agent
    x::Int # location of an agent in x-dimension
    y::Int # location of an agent in y-dimension
    type::AgentType # type of an agent
    tick::Int # moment in time when agent entered type `type`
end

mutable struct Environment
    grid::Matrix{Vector{Int}} # for each cell of a grid a vector of numbers
                              # of agents currently occupying a given cell
    agents::Vector{Agent}     # a vector of all agents
    duration::Int             # metadata: how long agent stays in infected state
    pdeath::Float64           # metadata: probability of death of an agent after
                              # being infected
    stats::Dict{AgentType, Vector{Int}} # a dictionary storing number of agents
                                        # of each type in consecutive ticks
                                        # of the simulation
    tick::Int                 # counter of the current tick of the simulation
end

We have created three types:

AgentType which is just an enumeration allowing us to refer to the type of
the agent (agentS for susceptible, agentI for infected, agentR for
recovered, and agentD for dead); thus later in the code we can write
agentS, agentI etc. specify the type of an agent;
Agent is a structure holding information about a single agent; note that
we create this type using struct keyword, which means that it is immutable
(it will have an impact how we should work with such a structure as will be
seen later in the code);
Environment is a mutable structure holding global information about the
simulation state; this time we make this structure mutable, which means that
we can change the values of variables stored in it.

Let us now move to a function that initializes the state of the simulation:

function init(n::Int, infected::Int,
              duration::Int, pdeath::Float64, xdim::Int, ydim::Int)
    grid = [Int[] for _ in 1:xdim, _ in 1:ydim]
    agents = [Agent(rand(1:xdim), rand(1:ydim),
                    i <= infected ? agentI : agentS, 0) for i in 1:n]
    for (i, a) in enumerate(agents)
        push!(grid[a.x, a.y], i)
    end
    stats = Dict(agentS => [n - infected],
                 agentI => [infected],
                 agentR => [0],
                 agentD => [0])
    return Environment(grid, agents, duration, pdeath, stats, 0)
end

We see that essentially it populates the Environment container. Note
that initially there are infected number of agents in the agentI state,
and the rest of them is in agentS state. Each agent initially gets a random
location on the grid. We also make sure to correctly initialize the grid and
stats variables.

As we have noted above the Agent container is immutable. Therefore we have to
define functions that take an Agent object and return a new Agent object
if the Agent is to change its state.

die(a::Agent, tick::Int) = Agent(a.x, a.y, agentD, tick)

recover(a::Agent, tick::Int) = Agent(a.x, a.y, agentR, tick)

infect(a::Agent, tick::Int) = Agent(a.x, a.y, agentI, tick)

move(a::Agent, dims::Tuple{Int, Int}) =
    if a.type == agentD
        a
    else
        Agent(mod1(a.x + rand(-1:1), dims[1]),
              mod1(a.y + rand(-1:1), dims[2]),
              a.type, a.tick)
    end

As we have described above an agent can change its type, which is handled by
the die, recover, and infect functions, and it can move, which is handled
by the move function. Let us note a few things about the implementation of
these functions:

we use a short-form definitions of these functions of the form f(x) = expr;
this is useful when a function body is just a single expression (in particular
note that the if–else clause is a single expression — as shown in the
move function);
in the move function the dims argument are the dimensions of the grid and
the expressions mod1(a.x + rand(-1:1), dims[1]) and
mod1(a.y + rand(-1:1), dims[2]) handle the movement of the agent to one of
the nine new positions above and the mod1 functions makes sure that if we
move out of the boundary of the grid the movement is properly wrapped (e.g.
mod1(0, 10) is 10 and mod1(11, 10) is 1);

In each tick of the simulation we first update the type of the agents, which
is handled by the following function:

function update_type!(env::Environment)
    tick = env.tick
    for (i, a) in enumerate(env.agents)
        if a.type == agentI
            if tick - a.tick > env.duration
                env.agents[i] = if rand() < env.pdeath
                    die(a, tick)
                else
                    recover(a, tick)
                end
            else
                a.tick == tick && continue
                for j in env.grid[a.x, a.y]
                    a2 = env.agents[j]
                    if a2.type == agentS
                        env.agents[j] = infect(a2, tick)
                    end
                end
            end
        end
    end
end

Note that this function’s name ends with ! which is a convention that the
function changes its env argument. In this case it changes the agent types
stored in agents field of env mutable struct.

The whole logic of the update_type! function revolves around agents that have
agentI type as they either can infect agents that have agentS type or can
either recover and become agentR or die and become agentD type.

Note that each time an agent type is changed we have to update the collection
env.agents with a new value as Agent type is immutable.

Also the condition a.tick == tick && continue might raise some questions.
We use it to make sure that agents that became sick in a given tick to not
recursively infect other agents in the same tick (in our model it has only
performance implications as we assume that agents can get infected only
if an already infected agent stays in the same cell; but this would start
to matter if we allowed e.g. a wider infection radius — which is an easy
extension to version of the model we present — as then a chain reaction
could be triggered in a single tick, and we want to avoid this).

The other collective action of agents is their movement that is defined here:

function move_all!(grid::Matrix{Vector{Int}}, agents::Vector{Agent})
    foreach(empty!, grid)
    for (i, agent) in enumerate(agents)
        a = move(agent, size(grid))
        agents[i] = a
        push!(grid[a.x, a.y], i)
    end
end

Here a notable thing is that we clear entries of a grid cache with the empty!
function for each cell of grid matrix and then populate it anew with updated
locations of the agents.

The last thing we need to do in each step of the simulation is collection of the
statistics. This is a relatively simple operation and is implemented using the
following code:

function get_statistics!(env::Environment)
    status = countmap([a.type for a in env.agents])
    for (k, v) in env.stats
        push!(v, get(status, k, 0))
    end
end

In this function we use the countmap function form the StatsBase.jl package to
get numbers of agents of each type in a given tick, next we use a get function
to populate env.stats dictionary. The use of get is needed as in general the
status dictionary returned by countmap does not have to contain all types of
agents (e.g. in tick one of the model agents are of type agentR or
agentD are not present).

We are now ready to define the main loop of the simulation:

function run!(env::Environment)
    while env.stats[agentI][end] > 0
        env.tick += 1
        update_type!(env)
        move_all!(env.grid, env.agents)
        get_statistics!(env)
    end
end

Note that in this function we take advantage of the fact that env is mutable
and increment tick by 1 in each step. Next we repeatedly update agents’ types,
move them and collect statistics as long as there is at least one infected agent.

Running the model

Let us try running the model for some sample parameterization. We assume to have
2000 agents living on a 100x100 grid. The duration of infection is 21 ticks,
we initially place 10 infected agents (so it is 0.2% of the total population)
and assume that 5% of agents that get infected eventually die:

e = init(2000, 10, 21, 0.05, 100, 100)
run!(e)
foreach(plot, values(e.stats))
legend(string.(keys(e.stats)))

You should get a plot similar to this one:

SIR plot

We see that the epidemics died out at around 300 ticks and around 80% of the
population went through the disease (agents that have agentS type at the end
of the simulation were unaffected).

As a second experiment let us check how does the fraction of infected agents
depend on the length of the infection. Here is the code (it should run under
30 seconds):

function fraction_infected(l)
    e = init(2000, 10, l, 0.05, 100, 100)
    run!(e)
    return 1 - e.stats[agentS][end] / 2000
end

len = 5:30
runs = 16
inf = [sum(fraction_infected(l) for r in 1:runs) / runs for l in len]
plot(len, inf)

$Infected fraction plot$

We have run the simulation for the disease length ranging from 5 to 30 with step
equal to 1 and averaged out 16 runs of the simulation for each parameter (416
runs of the simulation in total). We observe that the relationship is S-shaped
and the highest impact of the disease length change on number of infected agents
is around fifteen ticks.

Of course these are just two samples of the analysis that can be made with this
model. Feel free to change the assumptions of the model or the experiment setup
if you would like to investigate it more.

Concluding remarks

In this post I have shown how you can create a simple agent-based model without
having to use specialized packages. Still it is worth to know that there is e.g.
an excellent Agents.jl package that provides a lot of utility data
structures and functions for building agent-based simulations.

To recap the things related to the Julia language usage for building agent-based
simulations I wanted to show in this post:

you can use either struct or mutable struct types depending on what style
of data management pattern you prefer (I have shown both in this post); in
general the choice will have performance implications — using immutable
struct will tend to be faster in most cases — but as the Julia language
is fast in general I recommend not to go for premature optimizations here,
but rather use the patterns that are convenient for you as a developer;
usually it is most convenient if you create just one type of agent (here
Agent) and if you have subtypes of agents having different behavior to store
this trait as a field in the this type (in our case AgentType enum served
as this trait); in a blog post I have written some time ago I
discuss different possible patterns that could be used instead (with their
performance implications), but what I show in this post should serve you well
in 99% of cases from my experience;
finally let me comment that I could have easily made this simulation faster if
I wanted to sacrifice code clarity (a pet pastime of many hard-core Julia uses
?). This is a beauty of Julia that you can go really low-level in the
performance critical code to optimize things (see e.g. my recent answer to
this question on Stack Overflow ); however, as noted
above — I usually recommend to go for such optimizations only if needed, as
Julia is typically fast enough if you write the code just that follows how you
think about the problem you solve, barring you do not respect general
performance recommendations described here in the Julia manual.

juliabloggers.com

A Julia Language Blog Aggregator

Author Archives: Blog by Bogumił Kamiński

Separating a column into multiple columns in DataFrames.jl

Introduction

Spreading a column

Conclusion

Subsetting strings in Julia using character indexing

Introduction

The implementation of `@char` macro

Concluding comments

A basic SIR model using agent-based approach in Julia

Introduction

The SIR epidemic model specification

Implementation of the model

Running the model

Concluding remarks

Introduction

Spreading a column

Conclusion

Introduction

The implementation of @char macro

Concluding comments

Introduction

The SIR epidemic model specification

Implementation of the model

Running the model

Concluding remarks

The implementation of `@char` macro