Tag Archives: julialang

Refreshing Mining Complex Networks book

Re-posted from: https://bkamins.github.io/julialang/2023/06/09/graphs.html

Introduction

Two years ago together with Paweł Prałat and François Théberge
we have written Mining Complex Networks book.
Since it has received a positive response from the community
we decided to start planning for its second edition, where
we want to add material that covers recent advances in the field.

This decision prompted me to write something about graph analysis
that is at the same time a nice application of the Julia language.

The post was written under Julia 1.9.0 and Graphs.jl 1.8.0.

The problem

Let me start with a business scenario.
Assume you have a set of products in a store and know which of them
are bought together. You represent this data as a graph where
nodes are products and edges are representing co-purchase of products.

You want to find products that are not bought together, but that
are fall into similar baskets. For example, assume you have two types
of milk. Most likely they are not bought together, but they are both
bought with e.g. bread. Such products could be called substitutes.

In the book we discuss some advanced methods that can be used for this task.
In this post let me concentrate on a simple way to detect such products.

Assume that I have four products i, j, k, and l. If in the
graph we have edges i-j, j-k, k-l, and l-i, but do not have
edges i-k and j-l then we can say that i and k are substitutes
and j and l are substitutes (so this four nodes could be called
double-substitutes).

From a graph theory perspective the set of nodes {i, j, k, l} form a
cycle of length 4 and they do not form any sorter cycle
(so this cycle has a hole inside it). Let us call such cycles minimal.
We can visualize this situation for example as follows:

  i
 / \
l   j
 \ /
  k

Our task for today is the following. Assume we have a random graph
with n nodes. In this graph each edge is included with probability
p, independently from every other edge. We want to check how
many minimal 4-cycles it contains.

Analytical solution

For a random graph it is possible to compute the expected number of
minimal 4-cycles relatively easily.

Using the additivity of expectation we can concentrate on four-element
subsets of sets of nodes. For each such subset {i, j, k, l}
we have three possibilities that they can form a minimal 4-cycle:

  i        i        i
 / \      / \      / \
l   j    l   k    k   j
 \ /      \ /      \ /
  k        j        l

Note that the probability of observing any of them is p^4*(1-p)^2
(we see 4 edges and do not see 2 edges).

Therefore we can easily write a function that computes the expected
number of such motifs as:

expected_count_empty_4cycle(n, p) = p^4*(1-p)^2 * 3 * binomial(n, 4)

Let us calculate the expected number of such motifs for n equal to 10
and 100 and p equal to 0.1 and 0.2:

julia> expected_count_empty_4cycle.([10, 100], [0.1 0.2 0.3])
2×3 Matrix{Float64}:
   0.05103      0.64512      2.50047
 952.858    12046.0      46690.0

The function works fast and nice, but what if we made a mistake when defining it?
With Julia it is easy to run a simulation checking the result.

Simulation solution, part 1

Let us start with a solution that reproduces the thinking we used to derive
our analytical result:

function naive_count_empty_4cycle(g)
    empty4cycle = 0
    for i in 1:nv(g), j in i+1:nv(g), k in j+1:nv(g), l in k+1:nv(g)
        empty4cycle += has_edge(g, i, j) && has_edge(g, j, k) &&
                       has_edge(g, k, l) && has_edge(g, l, i) &&
                       !has_edge(g, i, k) && !has_edge(g, j, l)
        empty4cycle += has_edge(g, i, k) && has_edge(g, k, j) &&
                       has_edge(g, j, l) && has_edge(g, l, i) &&
                       !has_edge(g, i, j) && !has_edge(g, k, l)
        empty4cycle += has_edge(g, i, j) && has_edge(g, j, l) &&
                       has_edge(g, l, k) && has_edge(g, k, i) &&
                       !has_edge(g, i, l) && !has_edge(g, j, k)
    end
    return empty4cycle
end

The function expects to get a Graphs.jl graph g and traverses all
subsets of its nodes.

Let us check it:

julia> using Graphs

julia> using Random

julia> using Statistics

julia> Random.seed!(1234);

julia> [mean(naive_count_empty_4cycle(erdos_renyi(10, p)) for i in 1:1000)
        for p in [0.1, 0.2, 0.3]]
3-element Vector{Float64}:
 0.051
 0.598
 2.455

So the results look good for n=10. But what about n=100?

Let us check timing of a single run:

julia> @time naive_count_empty_4cycle(erdos_renyi(100, 0.1));
  0.142410 seconds (274 allocations: 38.594 KiB)

We can see that the function is slow. If we wanted to run it 1000 times it would take us
over 2 minutes. Let us think of a faster approach.

Simulation solution, part 2

The idea we can use is the following. Consider two nodes i and j and assume they
are not connected. Then it is enough to find common neighbors of i and j
and check how many pairs of them are not connected.

Here is an example implementation of this idea:

function find_common_sorted!(common, ni, nj)
    empty!(common)
    iidx = 1
    jidx = 1
    while iidx <= length(ni) && jidx <= length(nj)
        if ni[iidx] < nj[jidx]
            iidx += 1
        elseif ni[iidx] > nj[jidx]
            jidx += 1
        else
            push!(common, ni[iidx])
            iidx += 1
            jidx += 1
        end
    end
    nothing
end

function fast_count_empty_4cycle(g)
    common = Int[]
    empty4cycle = 0

    for i in 1:nv(g)
        ni = neighbors(g, i)
        for j in i+1:nv(g)
            has_edge(g, i, j) && continue
            nj = neighbors(g, j)
            find_common_sorted!(common, ni, nj)
            for a in 1:length(common), b in a+1:length(common)
                empty4cycle += !has_edge(g, common[a], common[b])
            end
        end
    end
    @assert iseven(empty4cycle)
    return empty4cycle ÷ 2
end

Note that in the code we divide the number of empty cycles found by two as
assuming that nodes {i, j, k, l} form an empty cycle we count it twice
(once starting with {i, k} pair and once starting with {j, l} pair).

Let us check our function:

julia> @time fast_count_empty_4cycle(erdos_renyi(100, 0.1));
  0.001491 seconds (280 allocations: 40.047 KiB)

Indeed it is significantly faster. Therefore we can use it to also check the n=100 case:

julia> [mean(fast_count_empty_4cycle(erdos_renyi(10, p)) for i in 1:1000)
        for p in [0.1, 0.2, 0.3]]
3-element Vector{Float64}:
 0.051
 0.598
 2.455

julia> [mean(fast_count_empty_4cycle(erdos_renyi(100, p)) for i in 1:1000)
        for p in [0.1, 0.2, 0.3]]
3-element Vector{Float64}:
   954.364
 12017.067
 46574.827

The obtained results confirm our analytical solution, so we have some more confidence
that indeed we have derived it correctly.

Conclusions

I hope you found the problem of finding the number of minimal 4-cycles interesting.
Personally, really enjoyed coding it in Julia. In particular note that in Julia
with our faster version of code we almost do not do any allocations:

julia> gr = erdos_renyi(1000, 0.1); @time fast_count_empty_4cycle(gr)
  2.534875 seconds (4 allocations: 496 bytes)
10169170

julia> expected_count_empty_4cycle.(1000, 0.1)
1.0064361314250002e7

and the code runs quite fast.

Filtering grouped data frames in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/06/02/filter.html

Introduction

In DataFrames.jl a data frame can be grouped by columns.
In this way its view is created. The type of this view
is GroupedDataFrame and it allows for performing groupwise
operations on the data stored in the original data frame.

If you use SQL then you can think of GroupedDataFrame
as an object with is obtained by using GROUP BY command
but without performing an aggregation step. This is often
useful as after grouping data you might want to perform
several different aggregation operations on it without
having to group it each time.

In this post I want to discuss how filtering of
GroupedDataFrame works in DataFrames.jl.
In general there are two ways how you might want to filter it:

by dropping whole groups;
by dropping rows within group.

I will discuss both today.

The post was written under Julia 1.9.0 and DataFrames.jl 1.5.0.

Preparing the data

Start by creating a sample GroupedDataFrame:

julia> using DataFrames

julia> df = DataFrame(x = repeat(1:4, 2), id=1:8)
8×2 DataFrame
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      4
   5 │     1      5
   6 │     2      6
   7 │     3      7
   8 │     4      8

julia> gdf = groupby(df, :x, sort=true);

julia> show(gdf, allgroups=true)
GroupedDataFrame with 4 groups based on key: x
Group 1 (2 rows): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
Group 2 (2 rows): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      6
Group 3 (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7
Group 4 (2 rows): x = 4
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     4      4
   2 │     4      8

Dropping whole groups

The first type of operation is when you want to drop whole groups
from your grouped data frame.

First, assume that you want to drop
the first and last group from gdf. You can achieve this using indexing:

julia> gdf[[2, 3]]
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      6
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

julia> gdf[Not(begin, end)]
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
   2 │     2      6
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

Now assume that you want to drop groups in which the value of the grouping
variable :x is even. You could still use indexing like this:

julia> gdf[isodd.(getproperty.(keys(gdf), :x))]
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

However, a more general approach is to use the filter function:

julia> filter(sdf -> isodd(first(sdf.x)), gdf)
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
⋮
Last Group (2 rows): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     3      7

If you wanted to drop grouping of the result you can pass ungroup=true in filter:

julia> filter(sdf -> isodd(first(sdf.x)), gdf, ungroup=true)
4×2 DataFrame
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      5
   3 │     3      3
   4 │     3      7

Dropping rows within group

Now that you know how to drop whole groups you might wonder how to drop individual rows
within group. Assume that we want to drop from each group all rows except
the row with minimal value of :id within group. You can achieve this using the
subset function:

julia> subset(gdf, :id => x -> x .== minimum(x))
4×2 DataFrame
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      4

Note that by default subset ungroups data. If you want to keep it grouped
pass ungroup=false:

julia> show(subset(gdf, :id => x -> x .== minimum(x), ungroup=false),
            allgroups=true)
GroupedDataFrame with 4 groups based on key: x
Group 1 (1 row): x = 1
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
Group 2 (1 row): x = 2
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     2      2
Group 3 (1 row): x = 3
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
Group 4 (1 row): x = 4
 Row │ x      id
     │ Int64  Int64
─────┼──────────────
   1 │     4      4

Conclusions

I hope you found the examples using indexing, filter, and subset
useful and they improved your understanding of row filtering options
available in DataFrames.jl.

In particular remember that filter and subset have a different
default value of the ungoup keyword argument. This difference was
made to reflect the fact that when doing filter typically we do not
want to ungroup data, while when doing subset typically we want
to drop grouping.

Lessons learned from WAW2023 conference

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/05/26/waw2023.html

Introduction

Today is the last day of WAW2023 conference.
It was a great experience to attend it. I had an opportunity
to meet some of the top experts in analyzing and modeling graphs
and hypergraphs. In particular, it was really nice to listen
to several talks on tools available in the Julia language
for processing such data.

Out of many problems discussed during the conference I picked
one that I particularly like and would like to cover it today.

The problem

The task was proposed by Lasse Leskelä from Aalto University.

The problem is formulated as follows. Assume we have two types
of hockey attackers: wingers and centers. There are 300 players
in total out of which 100 are wingers and 200 are centers.

The players go to ice in triplets. The probability that a given
triplet scores a goal in a single action depends on its composition
and is as follows:

winger-winger-winger: 0.24
winger-winger-center: 0.40
winger-center-center: 0.32
center-center-center: 0.36

As you can see the best composition is to have one center and two wingers,
and the worst composition of the players is three wingers.

Now assume that:

you do not know who is a center and who is a winger;
you do not know what is the fraction of centers and wingers in the data;
you do not know the scoring probabilities.

The question is if you are given the information about scoring performance
of all triplets that can be formed from these 300 players (each triplet
is allowed to play exactly one action) can you identify from this data that
there are indeed two groups of players and correctly identify them.

Why is this problem interesting?

A natural question you might ask what is the expected proportion of
goals scored by a center and by a winger in our data. Let us calculate it.

If I am a center then my probability of scoring is:

1/9 * 0.40 + 4/9 * 0.32 + 4/9 * 0.36 = 26/75 = 34.(6)%

The probabilities 1/9, 4/9 and 4/9 correspond to probability
that a player is matched with winger-winger, winger-center, and center-center
combination respectively (if you are not sure about them we will soon
check them with data).

The same calculation for winger yields:

1/9 * 0.24 + 4/9 * 0.40 + 4/9 * 0.32 = 26/75 = 34.(6)%

As you can see wingers and centers score with the same
probability on the average in our data.

Therefore we cannot expect to be able to distinguish centers
and wingers by just looking at their average scoring.

Maybe then looking at pairs of players could help us?
Let us check what is the scoring probability of winger-winger,
winger-center, and center-center pairs. They are as follows:

For winger-winger we get:

2/3 * 0.4 + 1/3 * 0.24 = 26/74 = 34.(6)%

For winger-center:

2/3 * 0.32 + 1/3 * 0.40 = 26/74 = 34.(6)%

And for center-center:

2/3 * 0.36 + 1/3 * 0.32 = 26/74 = 34.(6)%

Unfortunately looking at pairs of players still would not give
us enough insight to solve the problem. It turns our that
we will need to take into account three-way interactions between
players to be able to hope to solve our problem.

In graph perspective we can look at this problem as follows.
Players are nodes and their triplets that scored form a three
element hyperedge. What we have learned above is that to solve
the problem we need to use some method that is hyperege-aware.

Preparing the data

To set up the stage start with generating sample data for our problem.
I the code I encode wingers with 0 and centers with 1.
Therefore if we take three players the sum of their codes ranges
from 0 to 3 and encodes the composition of the attacking team.

First load the packages:

julia> using Random

julia> using Graphs

julia> using DataFrames

julia> using Plots

julia> using Chain

julia> using Statistics

julia> import MultivariateStats

Next define a function result that maps player type triplet to a boolean
value that tells us if a given set of players scored (with probabilities
I described above):

julia> prob(t1, t2, t3) = (0.24, 0.40, 0.32, 0.36)[t1 + t2 + t3 + 1]
prob (generic function with 1 method)

julia> result(t1, t2, t3) = rand() < prob(t1, t2, t3)
result (generic function with 1 method)

Now we are ready to generate players. Start with assigning
each player a random value of winger and center:

julia> ids = shuffle!([fill(0, n_winger); fill(1, n_center)])
300-element Vector{Int64}:
 1
 1
 0
 ⋮

 1
 0
 1

We are now ready to generate the table holding all player triplets.
Note that I want to ensure that each triplet is stored in it only
once (with respect to permutation) so I filter it ensuring that
player numbers are increasing:

julia> df = allcombinations(DataFrame, x1=1:300, x2=1:300, x3=1:300)
27000000×3 DataFrame
      Row │ x1     x2     x3
          │ Int64  Int64  Int64
──────────┼─────────────────────
        1 │     1      1      1
        2 │     2      1      1
        3 │     3      1      1
        4 │     4      1      1
    ⋮     │   ⋮      ⋮      ⋮
 26999997 │   297    300    300
 26999998 │   298    300    300
 26999999 │   299    300    300
 27000000 │   300    300    300
           26999992 rows omitted

julia> subset!(df, [[:x1, :x2], [:x2, :x3]] .=> ByRow(<))
4455100×3 DataFrame
     Row │ x1     x2     x3
         │ Int64  Int64  Int64
─────────┼─────────────────────
       1 │     1      2      3
       2 │     1      2      4
       3 │     1      3      4
       4 │     2      3      4
    ⋮    │   ⋮      ⋮      ⋮
 4455097 │   295    299    300
 4455098 │   296    299    300
 4455099 │   297    299    300
 4455100 │   298    299    300
           4455092 rows omitted

Now for each triplet we add the player type information
and next call the result function to randomly decide
if a given player pair scored or not:

julia> transform!(df, All() .=> (x -> ids[x]) .=> [:t1, :t2, :t3])
4455100×6 DataFrame
     Row │ x1     x2     x3     t1     t2     t3
         │ Int64  Int64  Int64  Int64  Int64  Int64
─────────┼──────────────────────────────────────────
       1 │     1      2      3      1      1      0
       2 │     1      2      4      1      1      1
       3 │     1      3      4      1      0      1
       4 │     2      3      4      1      0      1
    ⋮    │   ⋮      ⋮      ⋮      ⋮      ⋮      ⋮
 4455097 │   295    299    300      0      0      1
 4455098 │   296    299    300      0      0      1
 4455099 │   297    299    300      0      0      1
 4455100 │   298    299    300      1      0      1
                                4455092 rows omitted

julia> transform!(df, r"t" => ByRow(result) => :result)
4455100×7 DataFrame
     Row │ x1     x2     x3     t1     t2     t3     result
         │ Int64  Int64  Int64  Int64  Int64  Int64  Bool
─────────┼──────────────────────────────────────────────────
       1 │     1      2      3      1      1      0    true
       2 │     1      2      4      1      1      1   false
       3 │     1      3      4      1      0      1    true
       4 │     2      3      4      1      0      1    true
    ⋮    │   ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮
 4455097 │   295    299    300      0      0      1    true
 4455098 │   296    299    300      0      0      1    true
 4455099 │   297    299    300      0      0      1    true
 4455100 │   298    299    300      1      0      1   false
                                        4455092 rows omitted

We have our data ready. Let us first check if indeed the
theoretical computations we did above hold.

Initial screening of data

Start with computing mean scoring probability by player
and number of times each player was present in
the data (and sort the data frame by mean result
so see if there are any large differences in it):

julia> @chain df begin
           [select(_, "x$i" => :x, :result) for i in 1:3]
               vcat(_...)
           groupby(:x)
           combine(:result => mean, nrow)
           sort!(:result_mean)
       end
300×3 DataFrame
 Row │ x      result_mean  nrow
     │ Int64  Float64      Int64
─────┼───────────────────────────
   1 │   106     0.341294  44551
   2 │   121     0.341429  44551
   3 │    10     0.341878  44551
   4 │    46     0.342529  44551
   5 │    71     0.342596  44551
   6 │     2     0.342663  44551
   7 │   281     0.342776  44551
   8 │    60     0.342865  44551
   9 │   233     0.342888  44551
  10 │    73     0.343045  44551
  11 │   266     0.34318   44551
  12 │   131     0.343202  44551
  13 │    40     0.343225  44551
  14 │   138     0.343292  44551
  ⋮  │   ⋮         ⋮         ⋮
 288 │   176     0.350205  44551
 289 │   293     0.350205  44551
 290 │    19     0.35034   44551
 291 │    48     0.350407  44551
 292 │   155     0.350632  44551
 293 │   195     0.350744  44551
 294 │   132     0.350879  44551
 295 │    96     0.351103  44551
 296 │   110     0.35144   44551
 297 │   172     0.351687  44551
 298 │   254     0.351687  44551
 299 │   231     0.352383  44551
 300 │   163     0.354964  44551
                 273 rows omitted

Indeed we see that all values are close to 34.(6)% and
that for each player we have 299*298/2 entries in our table
as expected.

The second check is mean scoring probability by player type
and fraction of player types:

julia> @chain df begin
           [select(_, "t$i" => :t, :result) for i in 1:3]
               vcat(_...)
           groupby(:t)
               combine(:result => mean, proprow)
       end
2×3 DataFrame
 Row │ t      result_mean  proprow
     │ Int64  Float64      Float64
─────┼──────────────────────────────
   1 │     0     0.346636  0.333333
   2 │     1     0.346613  0.666667

Again the results are consistent with what we should get.

Let us now turn to checking the scoring probability and distribution by
player type:

julia> @chain df begin
           [select(_, ["t$i", "t$(mod1(i+1, 3))"] =>
                      ByRow(minmax) =>
                      [:ta, :tb], :result) for i in 1:3]
           vcat(_...)
               groupby([:ta, :tb])
           combine(:result => mean, proprow)
       end
3×4 DataFrame
 Row │ ta     tb     result_mean  proprow
     │ Int64  Int64  Float64      Float64
─────┼─────────────────────────────────────
   1 │     0      0     0.346844  0.110368
   2 │     0      1     0.346532  0.445931
   3 │     1      1     0.346653  0.443701

All looks good again. The mean is constant and we get the 1:4:4 proportion
of different player types.

Finally we check the three way combinations:

julia> combine(groupby(df, r"t\d"), :result => mean, proprow)
8×5 DataFrame
 Row │ t1     t2     t3     result_mean  proprow
     │ Int64  Int64  Int64  Float64      Float64
─────┼─────────────────────────────────────────────
   1 │     0      0      0     0.238695  0.0362955
   2 │     0      0      1     0.398932  0.0821452
   3 │     0      1      0     0.400749  0.076282
   4 │     0      1      1     0.319947  0.169792
   5 │     1      0      0     0.399914  0.06379
   6 │     1      0      1     0.320193  0.14399
   7 │     1      1      0     0.319905  0.132896
   8 │     1      1      1     0.360107  0.294808

Now we indeed see the expected differences in means.
Also the computed probabilities of seeing a given combination
of values are close to expectation, e.g. for 0-0-0 combination
we expected (1/3)^3 ≈ 3.7% and for 1-1-1 we expected (1/3)^3 ≈ 29.6%
of observations.

Solving the puzzle

From the analysis above we know that we need to take three-way
relationships into account to distingiush centers from wingers.

Let me recall you that we assumed that:

you do not know who is a center and who is a winger;
you do not know what is the fraction of centers and wingers in the data;
you do not know the scoring probabilities.

Therefore we are in an unsupervised learning setting. What we are looking
for is some representation of this data under which wingers and centers
would form separate clusters. We cannot hope to say which players are centers
and which are wingers, but maybe we can visually separate them.

The idea is to represent our data as a matrix (designed in a way that it
captures three-way relationships between players) and then project
it to two dimensions (and hope to see some nice pattern).

How to define the matrix. Let us try the following. I call the matrix
inc2. It will have 300 rows and 300*299 columns.
Rows are representing node numbers. Columns are representing node pairs
(and we potentially have 300*299 such pairs).
The player pair (a, b) is assigned to column number j=300*(a-1) + b.
I put 1 in the inc2[i, j] cell if there is an unordered triplet {i, a, b}
scored a goal. Otherwise I put 0 in the matrix.
Note that this matrix is sparse (some of its columns are all zeros), but my
data is small enough that I do not need to optimize the code to take this into account.

Here is the code generating the matrix:

julia> inc2 = zeros(Bool, 300, 300*299)
300×89700 Matrix{Bool}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 ⋮              ⋮              ⋮              ⋮           ⋱              ⋮              ⋮              ⋮
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

julia> for (i, row) in enumerate(eachrow(subset(df, :result)))
           inc2[row.x1, 300*(row.x2-1) + row.x3] = true
           inc2[row.x2, 300*(row.x1-1) + row.x3] = true
           inc2[row.x3, 300*(row.x1-1) + row.x2] = true
       end

julia> inc2
300×89700 Matrix{Bool}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  1  0  0  0  0  0  1  0  0  0  0  1  0  1  1  0  1     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  1  0  1  0  0  1  0  1  0  0  1  0  1  1  1  0  0  1     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
 0  0  1  0  0  1  0  0  0  0  1  0  1  0  1  0  1  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  1  0  0  1  0  0  0  1  1  0  0  0  1     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
 0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0  1  …  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 ⋮              ⋮              ⋮              ⋮           ⋱              ⋮              ⋮              ⋮
 0  0  1  0  0  1  1  0  0  1  0  0  0  1  1  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
 0  0  0  0  0  0  1  1  0  0  0  0  0  1  0  0  0  0  1  …  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
 0  0  0  0  0  0  0  0  1  0  1  0  1  0  1  0  0  0  1     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
 0  1  0  1  0  0  0  1  0  0  1  1  0  1  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  1  0  0  0  0  0  0  1  0  1  1  0  1     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  1  1  1  0  1  0  1  0  1  0  0  0  0  0  1  1  0  0     0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

I called the matrix inc2 as it represents incidence between nodes and pairs of nodes.
In the for loop above I took advantage of the fact that I generated my data so that x1 < x2 < x3.

Let us quickly check if indeed each node has approximately the same number of 1 entries:

julia> sum(inc2, dims=2)
300×1 Matrix{Int64}:
 15340
 15266
 15362
 15402
 15523
 15324
     ⋮
 15324
 15464
 15367
 15543
 15470
 15495

Indeed we see that we roughly have 44551 * 34.(6)% for all rows.

Now we have a 89700-dimensional representation of each node. We want to compress it to 1-dimension.
Let us pick the classic PCA. In MultivariateStats.jl observations are stored in columns,
so I work with a transposition of inc2 matrix:

julia> p = MultivariateStats.fit(MultivariateStats.PCA, inc2', maxoutdim=1);

julia> t = MultivariateStats.transform(p, inc2')
1×300 Matrix{Float64}:
 -7.38127  -7.47636  14.2659  -5.88158  13.5279  -6.48501  13.0815  …  13.9629  14.2848  13.4581  -7.5586  14.5669  -7.71527

Now for each observation (column in the t matrix) we have its 1-dimensional representation.

Let us visually check if these components have some structure. Start with a histogram:

julia> histogram(vec(t); nbins=50, legend=false)

You can see the produced plot below:

Histogram

It looks promising. We got two separated clusters. But are they connected with our ids?

julia> histogram(t[ids .== 1]; nbins=20, label="1")

julia> histogram!(t[ids .== 0]; nbins=10, label="0")

You can see the produced plot below:

Histogram splitted

Indeed we got a perfect separation it seems. Let us double check it:

julia> describe(t[ids .== 1])
Summary Stats:
Length:         200
Missing Count:  0
Mean:           -6.914677
Minimum:        -9.086461
1st Quartile:   -7.496597
Median:         -6.926006
3rd Quartile:   -6.359373
Maximum:        -4.606485
Type:           Float64

julia> describe(t[ids .== 0])
Summary Stats:
Length:         100
Missing Count:  0
Mean:           13.829354
Minimum:        12.317402
1st Quartile:   13.322935
Median:         13.911846
3rd Quartile:   14.269240
Maximum:        15.791385
Type:           Float64

The distributions confirm that we recovered the communities correctly.

Conclusions

The presented approach is one of the many that could be used to solve the puzzle.
I encourage you to think of alternative approaches that could be used.

However, I think that the puzzle nicely shows that sometimes slightly
reformulating the problem allows you to use standard data science tools to solve it.

I hope you enjoyed it!

juliabloggers.com

A Julia Language Blog Aggregator

Tag Archives: julialang

Refreshing Mining Complex Networks book

Introduction

The problem

Analytical solution

Simulation solution, part 1

Simulation solution, part 2

Conclusions

Filtering grouped data frames in DataFrames.jl

Introduction

Preparing the data

Dropping whole groups

Dropping rows within group

Conclusions

Lessons learned from WAW2023 conference

Introduction

The problem

Why is this problem interesting?

Preparing the data

Initial screening of data

Solving the puzzle

Conclusions