Author Archives: Blog by Bogumił Kamiński

On Alan Edelman’s knife-edge condition in computational social science

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/10/01/socialscience.html

Introduction

Last week Alan Edelman and Viral Shah gave an excellent keynote talk at
the Social Simulation Conference 2021. If you would like to learn
about one of the models that was discussed, please read on.

My objective with this post is to make a permanent record of the model as,
if I understood the comment of Alan Edelman correctly, it has not been published.

Of course all errors, extensions or omissions in the outline are mine (and I
will gladly correct the post if there are any).

In the computational part of the post I use Julia 1.6.3, Roots.jl 1.3.5,
and Plots.jl 1.22.3.

The problem statement

The question we want to answer is an attempt to address the following
empirical phenomenon:

Imagine there is a road accident. Assume it happened in a remote place where
only one person is able to observe it. In such a situation it is highly likely
that this person will try to help.

Now assume that the same accident happens in a densely populated area.
Empirical data shows that each individual observer is much less likely
to help, and it is even possible that the probability that anyone helps drops
in comparison to the first scenario, when there is only one observer.

The model

The model of this situation that Alan Edelman presented is the following. Assume
that we have \(n\) spectators of the event. Next we assume that the baseline
(i.e. when there is only one spectator) probability of helping is \(r\).
Finally we define the probability multiplier \(f(n)>0\), where we assume
that if there are \(n\) spectators each of them independently decides to help
with probability \(f(n)r\). The assumptions about \(f\) are that \(f(1)=1\) and
that \(f\) is decreasing in \(n\).

Under these assumptions we can observe that the probability that any help is
given is \(1-(1-f(n)r)^n\). We want to analyze the properties of this number.
Now we fully switch from the social scientist’s hat to the mathematician’s hat,
by assuming \(n\to+\infty\).

Knife-edge condition

When we analyze the asymptotic properties of this model it is useful to define
\(g(n)=f(n)n\). Then the formula for our probability becomes \(1-(1-g(n)r/n)^n\).

We can now notice that if \(g(n)\to0\) then the probability that help is given
tends to \(0\). On the other hand if \(g(n)\to+\infty\) then the probability
tends to \(1\). (For large \(n\) the term \((1-g(n)r/n)^n\) is close to
\(\exp(-g(n)r)\) whenever \(f(n)\to0\).)

So, as we can see, we get a sharp asymptotic knife-edge condition for the model
to be interesting. The asymptotic probability is in
\(]0,1[\) (assuming \(g(n)\) has a limit) only if \(g(n)\) has a positive and
finite limit. In other words, asymptotically the probability that a single
spectator reacts has to be proportional to \(1/n\). Assume that the limit is \(g\),
i.e. \(g(n)\to g\). Then the asymptotic probability is \(1-\exp(-gr)\). So we
might ask for which \(g\) the asymptotic probability that help is given equals
the baseline probability \(r\). This is a solution to the equation
\(r = 1-\exp(-gr)\). We can solve it to get \(g=-\log(1-r)/r\).
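To make the three regimes concrete, here is a minimal sketch in plain Julia. The multiplier family \(f(n)=n^{-a}\) is my assumption for illustration only (it was not part of the talk), as are the names help_probability and multiplier and the values \(r=0.5\), \(n=10^6\). With this family \(g(n)=n^{1-a}\), so \(a>1\), \(a=1\) and \(a<1\) give \(g(n)\to0\), \(g(n)=1\) and \(g(n)\to+\infty\) respectively.

```julia
# Probability that at least one of n spectators helps,
# given baseline probability r and multiplier function f.
help_probability(f, r, n) = 1 - (1 - f(n) * r)^n

# Hypothetical multiplier family (an assumption made only for this sketch):
# f(n) = n^(-a), so f(1) = 1, f is decreasing for a > 0, and g(n) = n^(1 - a).
multiplier(a) = n -> float(n)^(-a)

r = 0.5
n = 10^6

p_vanishing  = help_probability(multiplier(1.5), r, n)  # g(n) -> 0
p_knife_edge = help_probability(multiplier(1.0), r, n)  # g(n) = 1 for all n
p_certain    = help_probability(multiplier(0.5), r, n)  # g(n) -> +Inf

println((p_vanishing, p_knife_edge, p_certain))
println(1 - exp(-r))  # asymptotic value 1 - exp(-g * r) for g = 1
```

Only the middle case gives a probability strictly between 0 and 1, and it should be close to the last printed value.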

Now we are ready to jump into Julia to do some plotting.

Computational support

Let us now try plotting the solution of the equation \(r = 1-\exp(-gr)\) for
\(g\) as a function of \(r\).

Here is the code that produces the requested plot:

julia> using Roots

julia> using Plots

julia> plot(r -> find_zero(g -> 1 - exp(-r * g) - r, 1.0),
            xlim=[0.01,0.99], xlabel="r", ylabel="g", legend=nothing)

and we get the following relationship:

[Figure: plot of g as a function of r]

In the code above, instead of using the analytical solution, I have shown
how one can numerically find the root of our equation as a function of a parameter.

As you can see, the higher the initial \(r\) the higher \(g\) needs to be, and
\(g\) is greater than \(1\) for every \(r\) in \(]0,1[\).
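As a cross-check that does not need Roots.jl, the closed-form solution \(g=-\log(1-r)/r\) can be evaluated directly. This is a small sketch; the name g_analytic and the grid of \(r\) values are my choices for illustration.

```julia
# Closed-form solution of r = 1 - exp(-g * r) with respect to g.
# log1p(-r) computes log(1 - r) accurately, also for small r.
g_analytic(r) = -log1p(-r) / r

rs = 0.01:0.01:0.99
gs = g_analytic.(rs)

# Residual of the original equation evaluated at the closed-form solution.
max_residual = maximum(abs.(1 .- exp.(-gs .* rs) .- rs))

println(max_residual)   # should be at the level of floating point noise
println(all(gs .> 1))   # g stays above 1 on the whole grid
println(issorted(gs))   # g increases with r
```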

Conclusions

I hope you enjoyed the post even though it was not that much about Julia. We
have learned that the higher the baseline individual probability of reaction to
some bad event the harder it becomes to keep the reaction probability at the
given level as the number of spectators increases.

Hands-on Data Science with Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/09/24/manning.html

Introduction

Warning! The post includes (self)promotion.

Recently, together with Łukasz Kraiński, I published
the Hands-on Data Science with Julia liveProject with Manning.

In this post I want to discuss the idea behind this format and what you can
expect inside.

Why liveProject format?

If you are reading this post most likely you know that I write a lot on Julia
Slack, Julia Discourse, StackOverflow [julia] tag, give various tutorials,
write package documentation, and finally I do weekly updates to this blog.

So what makes the liveProject format different enough that I decided to give it
a try instead of, e.g., doing a new JuliaAcademy course?

First, the content is project oriented. This means that you get a business
description of the challenge and should try to finish the task yourself. If
something is challenging to finish then you have three levels of support:
hints, a partial solution, and finally a full solution for every task. It might
not seem very significant, but I believe that trying to do something on one's
own (and getting only as much help as really required) is superior to just
reading through some example code.

The second valuable feature is that on the platform you can compare your
implementation with other solutions to the same problem and also discuss it
with other coders or mentors.

All this means that although the projects are marked as intermediate level
content (since such experience is required to do the tasks on your own), the
material is still useful even for a beginner, with the twist that in this case
you will probably have to rely more on the provided help to be able to finish
the tasks.

I will see how it works out and maybe write a post with conclusions after some
time of giving the liveProject format a try.

What is inside?

The Hands-on Data Science with Julia liveProject is divided into five
parts. Here I will focus on the Julia packages that are featured in
them (all packages are used in their latest versions as of the time of writing
this post; I list them incrementally as they are introduced in consecutive
projects):

  1. data preprocessing: Arrow.jl, Chain.jl, CSV.jl, DataFrames.jl, FreqTables.jl,
    Plots.jl, StatsBase.jl;
  2. clustering (k-means, DBSCAN): Clustering.jl, Distances.jl;
  3. dimensionality reduction (PCA, t-SNE, UMAP): Conda.jl, MultivariateStats.jl,
    PyCall.jl;
  4. predictive modelling – regression problems (random forest, GLM):
    DecisionTree.jl, GLM.jl, HypothesisTests.jl;
  5. predictive modelling – classification problems (XGBoost): ROCAnalysis.jl,
    XGBoost.jl.

Conclusions

As opposed to standard teaching materials, liveProjects are paid content and
that is why I have issued a (self)promotion warning at the beginning of this
post. However, a good thing is that the first project on
data preprocessing is available for free, so it is easy to check if you
like the content that we have prepared.

Updating views in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/09/17/views.html

Introduction

Today I want to preview a feature that will be introduced in the 1.3 release
of DataFrames.jl. We will talk about new ways of updating the columns
of a data frame when one is working with views. My objective is to explain
the rationale behind the new functionality and the way it works.

This post was tested under Julia 1.6.1 and DataFrames.jl checked out at the
main branch on Sep 17, 2021 (SHA-1 facb6721e7450c63f2d5684b78e3c3489ed999b0).

What is a SubDataFrame and when is it useful?

In DataFrames.jl you can construct views of a data frame object using the
view function or the @view macro, exactly like you can create views of arrays
in Julia Base. Here is a simple example:

julia> using DataFrames

(@v1.6) pkg> st DataFrames
      Status `~/.julia/environments/v1.6/Project.toml`
  [a93c6f00] DataFrames v1.2.2 `https://github.com/JuliaData/DataFrames.jl.git#main`

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> dfv = @view df[2:3, :]
2×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     2      5
   2 │     3      6

Now the dfv object is a view of the df data frame. It means that it references
the same data in memory as the parent data frame df, but allows
access to only a slice of it: in our case we have picked rows 2 and 3 and
all columns.

The key features of a view are:

  • mutating its contents also mutates the contents of the parent data frame;
  • it is cheap to create as it is enough to store only the reference to the parent
    data frame and which rows and columns got selected;
  • it is memory efficient (no copying of data happens);
  • using it has a small computational overhead, as when we index a view we need
    to translate the indices into the parent data frame indices.
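Since views of data frames follow the same idea as views of arrays in Julia Base, these features can be sketched without DataFrames.jl at all. The matrix m below is only a stand-in for the parent data frame; parent and parentindices are Base functions for array views.

```julia
m = [1 4; 2 5; 3 6]        # plays the role of the parent data frame
v = @view m[2:3, :]        # rows 2 and 3, all columns; no data is copied

v[1, 1] = 100              # writing through the view ...
println(m[2, 1])           # ... changes the parent

println(parent(v) === m)   # the view only stores a reference to its parent
println(parentindices(v))  # and which rows and columns got selected
```

Every v[i, j] access is translated to an index into m using the stored parent indices, which is the source of the small overhead mentioned in the last bullet.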

Let us show the first feature as it is the most important from the functionality
perspective:

julia> dfv[1, 1] = 100
100

julia> dfv
2×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │   100      5
   2 │     3      6

julia> df
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │   100      5
   3 │     3      6

julia> df[3, 1] = 200
200

julia> dfv
2×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │   100      5
   2 │   200      6

julia> df
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │   100      5
   3 │   200      6

As you can see changing dfv also changes df, and vice versa – changing df
also changes dfv (if the changed cells are selected in the view).

To understand the performance implications, consider two simple implementations
of a procedure computing a 90% confidence interval for the correlation between
two variables using bootstrapping:

julia> using Statistics

julia> function bootcor1(df, c1, c2, n)
           cors = Float64[]
           for _ in 1:n
               tmp = df[rand(1:nrow(df), nrow(df)), :]
               push!(cors, cor(tmp[!, c1], tmp[!, c2]))
           end
           return quantile(cors, [0.05, 0.95])
       end
bootcor1 (generic function with 1 method)

julia> function bootcor2(df, c1, c2, n)
           cors = Float64[]
           for _ in 1:n
               tmp = @view df[rand(1:nrow(df), nrow(df)), :]
               push!(cors, cor(tmp[!, c1], tmp[!, c2]))
           end
           return quantile(cors, [0.05, 0.95])
       end
bootcor2 (generic function with 1 method)

(the functions could be further optimized for performance but I did not want to
overly complicate the code)

The difference between bootcor1 and bootcor2 is that the former copies a
data frame, while the latter uses a view. Both take four parameters:

  • df: a data frame to analyze;
  • c1, c2: identifiers of the columns whose correlation we want to compute;
  • n: number of bootstrap samples.

Now create a simple data frame and compare the performance of both functions
(I present timings after compilation):

julia> df = DataFrame(rand(10^5, 10), :auto);

julia> @time bootcor1(df, :x1, :x2, 10_000)
 47.059650 seconds (430.02 k allocations: 81.976 GiB, 1.88% gc time)
2-element Vector{Float64}:
 -0.007373812772086598
  0.0029150608879804406

julia> @time bootcor2(df, :x1, :x2, 10_000)
 11.239822 seconds (80.02 k allocations: 7.453 GiB, 0.92% gc time)
2-element Vector{Float64}:
 -0.007643923412421664
  0.002966538851599437

As you can see, because the data frame was wide (10 columns), we saved a lot of
time by avoiding copying of the data.

Of course if the data frame were narrower we would not see such a difference:

julia> df = DataFrame(rand(10^5, 2), :auto);

julia> @time bootcor1(df, :x1, :x2, 10_000)
 10.829548 seconds (190.02 k allocations: 22.363 GiB, 1.60% gc time)
2-element Vector{Float64}:
 -0.006650139955186956
  0.0038227359319118795

julia> @time bootcor2(df, :x1, :x2, 10_000)
 10.963020 seconds (80.02 k allocations: 7.453 GiB, 0.53% gc time)
2-element Vector{Float64}:
 -0.006575024146232311
  0.0038253588364537162

The reason is that, although using a view still allocates less, this gain is now
offset by the computational overhead of working with views that was
explained above.
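The trade-off between copying and viewing can also be seen on plain vectors. The sketch below is a simplified stand-in for a single bootstrap step, not the code used for the timings above; the vectors x and y, the sample size and the seed are arbitrary choices.

```julia
using Random, Statistics

Random.seed!(1234)
x = rand(1000)
y = rand(1000)
idx = rand(1:1000, 1000)   # one bootstrap sample of row indices

# Copying: x[idx] and y[idx] allocate two new vectors.
cor_copy = cor(x[idx], y[idx])

# Viewing: no new data vectors, but each element access goes through idx.
cor_view = cor(view(x, idx), view(y, idx))

println(cor_copy ≈ cor_view)  # both approaches give the same statistic
```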

What is new for SubDataFrame in DataFrames.jl 1.3?

Let us start with our original small data frame:

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

Assume you want to assign the value 1.5 in the first row of column :a.
Before the upcoming DataFrames.jl 1.3 release this is quite cumbersome. If you
try doing it you get:

julia> df[1, :a] = 1.5
ERROR: InexactError: Int64(1.5)

You need to perform two steps:

  1. promote the element type of column :a to allow Float64 values;
  2. perform the assignment.

Here is a way to do it:

julia> df.a = Vector{Float64}(df.a)
3-element Vector{Float64}:
 1.0
 2.0
 3.0

julia> df[1, :a] = 1.5
1.5

julia> df
3×2 DataFrame
 Row │ a        b
     │ Float64  Int64
─────┼────────────────
   1 │     1.5      4
   2 │     2.0      5
   3 │     3.0      6

Here is one more, well known, example of a similar situation, that sometimes
surprises users:

julia> df[1, :b] = 'a'
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> df
3×2 DataFrame
 Row │ a        b
     │ Float64  Int64
─────┼────────────────
   1 │     1.5     97
   2 │     2.0      5
   3 │     3.0      6

In this case Julia silently converted Char value 'a' to its Int
representation which is 97.
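Both behaviours come from how assignment into a typed vector works in Julia: the value is passed through convert to the element type of the vector. A minimal sketch with a plain vector standing in for column :b:

```julia
v = [4, 5, 6]      # Vector{Int64}, like column :b

v[1] = 'a'         # the Char is converted to Int64 on assignment
println(v)

# A value that cannot be represented exactly is rejected instead.
err = try
    v[2] = 1.5
    nothing
catch e
    e
end
println(err isa InexactError)
println(v[2])      # the failed assignment left the element unchanged
```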

The key change in the 1.3 release of DataFrames.jl is that views will allow
using ! as a row index (currently it is disallowed). The mechanics of this
functionality are the same as when ! is used for DataFrame objects: a
column will get replaced in the data frame.

A natural question is: what will it get replaced with? The question is
valid, as we are replacing only a portion of the column. The design decision we
took is that promote_type will be used to decide the element type of the new
column, combining the element type of the already present column and the element
type of the newly assigned values.
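The rule can be sketched on plain vectors. The helper replace_rows below is hypothetical (it is not a DataFrames.jl function and not how the package is implemented); it only mimics the described behaviour: widen the whole column using promote_type, then overwrite the selected rows.

```julia
println(promote_type(Int64, Float64))  # Int64 and Float64 values
println(promote_type(Int64, Char))     # Int64 and Char values

# Hypothetical helper (not part of DataFrames.jl) mimicking the rule.
function replace_rows(col::AbstractVector, rows, vals::AbstractVector)
    T = promote_type(eltype(col), eltype(vals))
    newcol = Vector{T}(col)   # fresh column with the widened element type
    newcol[rows] = vals       # overwrite only the selected rows
    return newcol
end

a = replace_rows([1, 2, 3], 1:1, [1.5])
b = replace_rows([4, 5, 6], 1:1, ['a'])
println(a)
println(b)
```

Note that the original vectors are not modified; a new vector is allocated, which corresponds to the column being replaced rather than updated in place.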

Therefore in our examples above, when using a view you get the following:

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> dfv = @view df[1:1, :]
1×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4

julia> dfv[!, :a] = [1.5]
1-element Vector{Float64}:
 1.5

julia> dfv[!, :b] .= 'a'
1-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

julia> df
3×2 DataFrame
 Row │ a        b
     │ Float64  Any
─────┼──────────────
   1 │     1.5  a
   2 │     2.0  5
   3 │     3.0  6

julia> dfv
1×2 SubDataFrame
 Row │ a        b
     │ Float64  Any
─────┼──────────────
   1 │     1.5  a

As you can see it works both with standard assignment as well as with
broadcasted assignment.

Admittedly you still have to take two steps in the process:

  1. create a view;
  2. perform an assignment to it.

This is a bit cumbersome. Fortunately we can expect that in the future
DataFramesMeta.jl will provide a convenience syntax to perform
conditional assignment using this feature, e.g. like in data.table, where you
can write something like df[x == 1, y := 2] to set column y to 2 if
column x is equal to 1.

One special case that is often required is adding columns. It is supported
with both : and ! row selectors (like for DataFrame objects). In this case
we do not have a reference column in a parent data frame, so rows that are not
included in the view are filled with missing.

Here are two examples:

julia> dfv[!, :c] = ["x"]
1-element Vector{String}:
 "x"

julia> dfv[:, :d] .= true
1-element Vector{Bool}:
 1

julia> df
3×4 DataFrame
 Row │ a        b    c        d
     │ Float64  Any  String?  Bool?
─────┼────────────────────────────────
   1 │     1.5  a    x           true
   2 │     2.0  5    missing  missing
   3 │     3.0  6    missing  missing

julia> dfv
1×4 SubDataFrame
 Row │ a        b    c        d
     │ Float64  Any  String?  Bool?
─────┼──────────────────────────────
   1 │     1.5  a    x         true

The only limitation is that adding columns is only allowed if the SubDataFrame
was created with : as the column selector. The reason for this limitation is that
when one uses the : selector we are guaranteed that the SubDataFrame has the same
columns, in the same order, as its parent, so the requested operation is
guaranteed to be unambiguous (otherwise we would have to
handle e.g. the case when we want to add a column whose name is not present in
the SubDataFrame but is present in its parent, which could confuse users).
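The way a column added through a view is filled with missing can be mimicked with plain vectors. The helper new_column below is hypothetical (it is not a DataFrames.jl function); it only illustrates why the element types of the new columns become String? and Bool?, i.e. Union{Missing, String} and Union{Missing, Bool}.

```julia
# Hypothetical helper (not part of DataFrames.jl): build a full-length
# column that holds vals in the selected rows and missing elsewhere.
function new_column(nrows::Integer, rows, vals::AbstractVector)
    col = Vector{Union{Missing, eltype(vals)}}(missing, nrows)
    col[rows] = vals
    return col
end

c = new_column(3, 1:1, ["x"])
d = new_column(3, 1:1, [true])
println(c)
println(d)
```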

Conclusions

In summary, the new functionality allows replacing columns in a data frame
through its view. The two main intended use cases of this feature are:

  • adding new columns for which we have data only for some rows
    (selected in the view); it is only allowed when SubDataFrame
    was created with : as column selector;
  • updating data in existing columns even if the new elements cannot be converted
    to the element type of existing column; in this case promote_type is used to
    determine the target column element type.