Refreshing Mining Complex Networks book

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/06/09/graphs.html

Introduction

Two years ago together with Paweł Prałat and François Théberge
we have written Mining Complex Networks book.
Since it has received a positive response from the community
we decided to start planning for its second edition, where
we want to add material that covers recent advances in the field.

This decision prompted me to write something about graph analysis
that is at the same time a nice application of the Julia language.

The post was written under Julia 1.9.0 and Graphs.jl 1.8.0.

The problem

Let me start with a business scenario.
Assume you have a set of products in a store and know which of them
are bought together. You represent this data as a graph where
nodes are products and edges are representing co-purchase of products.

You want to find products that are not bought together, but that
are fall into similar baskets. For example, assume you have two types
of milk. Most likely they are not bought together, but they are both
bought with e.g. bread. Such products could be called substitutes.

In the book we discuss some advanced methods that can be used for this task.
In this post let me concentrate on a simple way to detect such products.

Assume that I have four products i, j, k, and l. If in the
graph we have edges i-j, j-k, k-l, and l-i, but do not have
edges i-k and j-l then we can say that i and k are substitutes
and j and l are substitutes (so this four nodes could be called
double-substitutes).

From a graph theory perspective the set of nodes {i, j, k, l} form a
cycle of length 4 and they do not form any sorter cycle
(so this cycle has a hole inside it). Let us call such cycles minimal.
We can visualize this situation for example as follows:

  i
 / \
l   j
 \ /
  k

Our task for today is the following. Assume we have a random graph
with n nodes. In this graph each edge is included with probability
p, independently from every other edge. We want to check how
many minimal 4-cycles it contains.

Analytical solution

For a random graph it is possible to compute the expected number of
minimal 4-cycles relatively easily.

Using the additivity of expectation we can concentrate on four-element
subsets of sets of nodes. For each such subset {i, j, k, l}
we have three possibilities that they can form a minimal 4-cycle:

  i        i        i
 / \      / \      / \
l   j    l   k    k   j
 \ /      \ /      \ /
  k        j        l

Note that the probability of observing any of them is p^4*(1-p)^2
(we see 4 edges and do not see 2 edges).

Therefore we can easily write a function that computes the expected
number of such motifs as:

expected_count_empty_4cycle(n, p) = p^4*(1-p)^2 * 3 * binomial(n, 4)

Let us calculate the expected number of such motifs for n equal to 10
and 100 and p equal to 0.1 and 0.2:

julia> expected_count_empty_4cycle.([10, 100], [0.1 0.2 0.3])
2×3 Matrix{Float64}:
   0.05103      0.64512      2.50047
 952.858    12046.0      46690.0

The function works fast and nice, but what if we made a mistake when defining it?
With Julia it is easy to run a simulation checking the result.

Simulation solution, part 1

Let us start with a solution that reproduces the thinking we used to derive
our analytical result:

function naive_count_empty_4cycle(g)
    empty4cycle = 0
    for i in 1:nv(g), j in i+1:nv(g), k in j+1:nv(g), l in k+1:nv(g)
        empty4cycle += has_edge(g, i, j) && has_edge(g, j, k) &&
                       has_edge(g, k, l) && has_edge(g, l, i) &&
                       !has_edge(g, i, k) && !has_edge(g, j, l)
        empty4cycle += has_edge(g, i, k) && has_edge(g, k, j) &&
                       has_edge(g, j, l) && has_edge(g, l, i) &&
                       !has_edge(g, i, j) && !has_edge(g, k, l)
        empty4cycle += has_edge(g, i, j) && has_edge(g, j, l) &&
                       has_edge(g, l, k) && has_edge(g, k, i) &&
                       !has_edge(g, i, l) && !has_edge(g, j, k)
    end
    return empty4cycle
end

The function expects to get a Graphs.jl graph g and traverses all
subsets of its nodes.

Let us check it:

julia> using Graphs

julia> using Random

julia> using Statistics

julia> Random.seed!(1234);

julia> [mean(naive_count_empty_4cycle(erdos_renyi(10, p)) for i in 1:1000)
        for p in [0.1, 0.2, 0.3]]
3-element Vector{Float64}:
 0.051
 0.598
 2.455

So the results look good for n=10. But what about n=100?

Let us check timing of a single run:

julia> @time naive_count_empty_4cycle(erdos_renyi(100, 0.1));
  0.142410 seconds (274 allocations: 38.594 KiB)

We can see that the function is slow. If we wanted to run it 1000 times it would take us
over 2 minutes. Let us think of a faster approach.

Simulation solution, part 2

The idea we can use is the following. Consider two nodes i and j and assume they
are not connected. Then it is enough to find common neighbors of i and j
and check how many pairs of them are not connected.

Here is an example implementation of this idea:

function find_common_sorted!(common, ni, nj)
    empty!(common)
    iidx = 1
    jidx = 1
    while iidx <= length(ni) && jidx <= length(nj)
        if ni[iidx] < nj[jidx]
            iidx += 1
        elseif ni[iidx] > nj[jidx]
            jidx += 1
        else
            push!(common, ni[iidx])
            iidx += 1
            jidx += 1
        end
    end
    nothing
end

function fast_count_empty_4cycle(g)
    common = Int[]
    empty4cycle = 0

    for i in 1:nv(g)
        ni = neighbors(g, i)
        for j in i+1:nv(g)
            has_edge(g, i, j) && continue
            nj = neighbors(g, j)
            find_common_sorted!(common, ni, nj)
            for a in 1:length(common), b in a+1:length(common)
                empty4cycle += !has_edge(g, common[a], common[b])
            end
        end
    end
    @assert iseven(empty4cycle)
    return empty4cycle ÷ 2
end

Note that in the code we divide the number of empty cycles found by two as
assuming that nodes {i, j, k, l} form an empty cycle we count it twice
(once starting with {i, k} pair and once starting with {j, l} pair).

Let us check our function:

julia> @time fast_count_empty_4cycle(erdos_renyi(100, 0.1));
  0.001491 seconds (280 allocations: 40.047 KiB)

Indeed it is significantly faster. Therefore we can use it to also check the n=100 case:

julia> [mean(fast_count_empty_4cycle(erdos_renyi(10, p)) for i in 1:1000)
        for p in [0.1, 0.2, 0.3]]
3-element Vector{Float64}:
 0.051
 0.598
 2.455

julia> [mean(fast_count_empty_4cycle(erdos_renyi(100, p)) for i in 1:1000)
        for p in [0.1, 0.2, 0.3]]
3-element Vector{Float64}:
   954.364
 12017.067
 46574.827

The obtained results confirm our analytical solution, so we have some more confidence
that indeed we have derived it correctly.

Conclusions

I hope you found the problem of finding the number of minimal 4-cycles interesting.
Personally, really enjoyed coding it in Julia. In particular note that in Julia
with our faster version of code we almost do not do any allocations:

julia> gr = erdos_renyi(1000, 0.1); @time fast_count_empty_4cycle(gr)
  2.534875 seconds (4 allocations: 496 bytes)
10169170

julia> expected_count_empty_4cycle.(1000, 0.1)
1.0064361314250002e7

and the code runs quite fast.