Tag Archives: julialang

Infiltrator.jl: a no-nonsense debugging

Re-posted from: https://bkamins.github.io/julialang/2023/08/25/infiltrate.html

Introduction

During JuliaCon 2023 I had several
discussions about the setup of my working environment when
I develop Julia code. One of the common questions was which
tools I use to debug my code.

With Julia, an important requirement for a debugging tool
is that it does not have a significant
impact on the performance of code execution. Keeping this
condition in mind, I find Infiltrator.jl
quite convenient. In this post I want to show an example
of how this package can be used.

The post was written using Julia 1.9.2 and Infiltrator.jl 1.6.4.
Note that you should run the examples that I present in a terminal
using the standard Julia REPL.

An example project

Let us write a simple function that merges two sorted vectors into
a new sorted vector:

function mergesorted(x::T, y::T) where {T<:Vector}
    @assert issorted(x) && issorted(y)
    z = T(undef, length(x) + length(y))
    ix, iy, iz = 1, 1, 1
    for iz in eachindex(z)
        if x[ix] < y[iy]
            z[iz] = x[ix]
            ix += 1
        else
            z[iz] = y[iy]
            iy += 1
        end
    end
    return z
end

Now test it on some example data:

julia> mergesorted([1, 3], [2, 4])
ERROR: BoundsError: attempt to access 2-element Vector{Int64} at index [3]

We see that we get an error when we try to get an element of a
vector using an index that is too large. Let us try to infiltrate this issue
by providing an appropriate condition under which we want to investigate
the state of our function:

using Infiltrator
function mergesorted(x::T, y::T) where {T<:Vector}
    @assert issorted(x) && issorted(y)
    z = T(undef, length(x) + length(y))
    ix, iy, iz = 1, 1, 1
    for iz in eachindex(z)
        @infiltrate ix > length(x) || iy > length(y)
        if x[ix] < y[iy]
            z[iz] = x[ix]
            ix += 1
        else
            z[iz] = y[iy]
            iy += 1
        end
    end
    return z
end

The magic of @infiltrate ix > length(x) || iy > length(y)
is that the execution of the mergesorted function will
be interrupted at the moment when the passed condition is met.
Our condition is that either ix or iy index gets too large.

Let us run our test with an updated definition of the function:

julia> mergesorted([1, 3], [2, 4])
Infiltrating mergesorted(x::Vector{Int64}, y::Vector{Int64})
  at REPL[11]:6

infil> @locals
- x::Vector{Int64} = [1, 3]
- T::DataType = Vector{Int64}
- iz::Int64 = 4
- y::Vector{Int64} = [2, 4]
- z::Vector{Int64} = [1, 2, 3, 2041652468496]
- iy::Int64 = 2
- ix::Int64 = 3

infil> @continue

ERROR: BoundsError: attempt to access 2-element Vector{Int64} at index [3]

We see that the problem is that we go beyond the end of the
x vector while the last element of the y vector is still not
processed. It looks like we need to check if we are beyond
the end of the x vector, and if this is the case jump right
to the else part of the code:

function mergesorted(x::T, y::T) where {T<:Vector}
    @assert issorted(x) && issorted(y)
    z = T(undef, length(x) + length(y))
    ix, iy, iz = 1, 1, 1
    for iz in eachindex(z)
        @infiltrate ix > length(x) || iy > length(y)
        if ix <= length(x) && x[ix] < y[iy]
            z[iz] = x[ix]
            ix += 1
        else
            z[iz] = y[iy]
            iy += 1
        end
    end
    return z
end

Let us run the code:

julia> mergesorted([1, 3], [2, 4])
Infiltrating mergesorted(x::Vector{Int64}, y::Vector{Int64})
  at REPL[13]:6

infil> @locals
- x::Vector{Int64} = [1, 3]
- T::DataType = Vector{Int64}
- iz::Int64 = 4
- y::Vector{Int64} = [2, 4]
- z::Vector{Int64} = [1, 2, 3, 8589934594]
- iy::Int64 = 2
- ix::Int64 = 3

infil> @continue

4-element Vector{Int64}:
 1
 2
 3
 4

This time things seem to work as expected. Let us thus turn off
infiltration and run some randomized tests:

function mergesorted(x::T, y::T) where {T<:Vector}
    @assert issorted(x) && issorted(y)
    z = T(undef, length(x) + length(y))
    ix, iy, iz = 1, 1, 1
    for iz in eachindex(z)
        if ix <= length(x) && x[ix] < y[iy]
            z[iz] = x[ix]
            ix += 1
        else
            z[iz] = y[iy]
            iy += 1
        end
    end
    return z
end

And run the tests:

julia> using Random

julia> using Test

julia> Random.seed!(1234);

julia> for i in 1:10
           x = sort!(rand(rand(1:10)))
           y = sort!(rand(rand(1:10)))
           @assert mergesorted(x, y) == sort!([x; y])
       end
ERROR: BoundsError: attempt to access 3-element Vector{Float64} at index [4]

As a side note, the rand(rand(1:10)) is a convenient pattern for generating
random vectors of random length.

Going back to our main topic, we see that we still get an error.
How can we diagnose it? This time, as an example,
let me show how to turn infiltration on when an error happens:

function mergesorted(x::T, y::T) where {T<:Vector}
    @assert issorted(x) && issorted(y)
    z = T(undef, length(x) + length(y))
    ix, iy, iz = 1, 1, 1
    for iz in eachindex(z)
        try
            if ix <= length(x) && x[ix] < y[iy]
                z[iz] = x[ix]
                ix += 1
            else
                z[iz] = y[iy]
                iy += 1
            end
        catch e
            @infiltrate
            rethrow(e)
        end
    end
    return z
end

Now run the same test code:

julia> Random.seed!(1234);

julia> for i in 1:10
           x = sort!(rand(rand(1:10)))
           y = sort!(rand(rand(1:10)))
           @assert mergesorted(x, y) == sort!([x; y])
       end
Infiltrating mergesorted(x::Vector{Float64}, y::Vector{Float64})
  at REPL[31]:15

infil> @locals
- x::Vector{Float64} = [0.6932923170086805, 0.7600131804670265]
- T::DataType = Vector{Float64}
- iz::Int64 = 4
- y::Vector{Float64} = [0.11679226454435099, 0.20295936651757684, 0.43552097477865936]
- e::BoundsError = BoundsError([0.11679226454435099, 0.20295936651757684, 0.43552097477865936], (4,))
- z::Vector{Float64} = [0.11679226454435099, 0.20295936651757684, 0.43552097477865936, 5.0e-324, 5.0e-324]
- iy::Int64 = 4
- ix::Int64 = 1

infil> @continue

ERROR: BoundsError: attempt to access 3-element Vector{Float64} at index [4]

Ah, we can now see where the problem is. We have not covered the case when iy is greater than
the length of y. It seems we are close. Let us make a final attempt:

function mergesorted(x::T, y::T) where {T<:Vector}
    @assert issorted(x) && issorted(y)
    z = T(undef, length(x) + length(y))
    ix, iy, iz = 1, 1, 1
    for iz in eachindex(z)
        if ix <= length(x) && (iy > length(y) || x[ix] < y[iy])
            z[iz] = x[ix]
            ix += 1
        else
            z[iz] = y[iy]
            iy += 1
        end
    end
    return z
end

We are ready for an even more comprehensive test:

julia> Random.seed!(1234);

julia> for i in 1:10_000
           x = sort!(rand(rand(0:10)))
           y = sort!(rand(rand(0:10)))
           @assert mergesorted(x, y) == sort!([x; y])
       end

This time we got no errors, so we can be relatively confident that our code works well.
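
The same check can also be written as a Test.jl test set, which reports every failing case instead of stopping at the first failed @assert. The sketch below is not the code used in the post. It repeats the final mergesorted definition so that it is self-contained, adds a few hand-picked edge cases (empty inputs, duplicated values) that are my own assumptions about what is worth checking, and uses an explicit Xoshiro generator instead of seeding the global one.

```julia
using Random
using Test

# Final version of the function from the post, repeated to keep the snippet self-contained.
function mergesorted(x::T, y::T) where {T<:Vector}
    @assert issorted(x) && issorted(y)
    z = T(undef, length(x) + length(y))
    ix, iy, iz = 1, 1, 1
    for iz in eachindex(z)
        if ix <= length(x) && (iy > length(y) || x[ix] < y[iy])
            z[iz] = x[ix]
            ix += 1
        else
            z[iz] = y[iy]
            iy += 1
        end
    end
    return z
end

@testset "mergesorted" begin
    # Hand-picked edge cases (illustrative choices, not from the post).
    @test mergesorted(Int[], Int[]) == Int[]
    @test mergesorted([1, 2], Int[]) == [1, 2]
    @test mergesorted(Int[], [1, 2]) == [1, 2]
    @test mergesorted([1, 1, 3], [1, 2, 3]) == [1, 1, 1, 2, 3, 3]

    # Randomized comparison against sorting the concatenated input.
    rng = Xoshiro(1234)
    for _ in 1:1_000
        x = sort!(rand(rng, rand(rng, 0:10)))
        y = sort!(rand(rng, rand(rng, 0:10)))
        @test mergesorted(x, y) == sort!([x; y])
    end
end
```

Note that the random lengths are drawn from 0:10 here, as in the last run above, so that empty vectors are also exercised.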

Conclusions

I find Infiltrator.jl useful because it is a lightweight solution.
I did not discuss all of its features. However, even the minimal examples
I have shown are quite useful in practice:

You can use @infiltrate with a condition.
You can put @infiltrate in a try-catch-end block
to be able to infiltrate into the state of your computations
at the moment an exception is thrown (a thing that I quite often need in practice).
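
When no interactive session is available (for example in a batch job), the same catch-and-rethrow structure can be used to record the local state instead of infiltrating. The sketch below swaps @infiltrate for the @error logging macro from Base, so it is a different technique than the one shown in this post. The function name mergesorted_logged is hypothetical, and the function deliberately keeps the intermediate, still buggy condition from above so that there is something to catch.

```julia
# Hypothetical variant of the intermediate (still buggy) function from the post:
# on failure it logs the loop state and rethrows instead of opening an infil> prompt.
function mergesorted_logged(x::T, y::T) where {T<:Vector}
    @assert issorted(x) && issorted(y)
    z = T(undef, length(x) + length(y))
    ix, iy = 1, 1
    for iz in eachindex(z)
        try
            if ix <= length(x) && x[ix] < y[iy]   # does not guard against iy > length(y)
                z[iz] = x[ix]
                ix += 1
            else
                z[iz] = y[iy]
                iy += 1
            end
        catch e
            # Record the indices and input lengths, then propagate the original error.
            @error "mergesorted failed" ix iy iz nx=length(x) ny=length(y) exception=(e, catch_backtrace())
            rethrow()
        end
    end
    return z
end

# The call is wrapped in try/catch only so that the snippet runs to completion.
err = try
    mergesorted_logged([0.7], [0.1, 0.2])
catch e
    e
end
```

The logged values play the role of the @locals output, but they have to be chosen in advance, which is exactly the limitation that infiltration avoids.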

Happy debugging!

DataFrames.jl survey: selecting columns of a data frame based on their values

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/08/18/selectcolumn.html

Introduction

Today I want to conduct a user survey about a future direction
of development of DataFrames.jl.

Most commonly users want to select columns of a data frame
using their names, and this operation has
extensive support in DataFrames.jl.
However, in some cases one might need to perform such a
selection conditional on the values stored in a column.

I want to ask if we should add a special selector allowing for
picking columns of a data frame based on their values.
The question is given in the conclusions section.
However, before we get to it let us briefly review what is
currently supported.

The post was written using Julia 1.9.2 and DataFrames.jl 1.6.1.

Column selection using column names

First create a simple data frame on which we are going
to perform the column selection examples:

julia> using DataFrames

julia> df = DataFrame(a1 = [1, 2],
                      a2=[1, missing],
                      b1=Any[3, missing],
                      b2=Any[3, 4])
2×4 DataFrame
 Row │ a1     a2       b1       b2
     │ Int64  Int64?   Any      Any
─────┼──────────────────────────────
   1 │     1        1  3        3
   2 │     2  missing  missing  4

Now assume we want to pick columns whose name starts with "b":

julia> select(df, Cols(startswith("b")))
2×2 DataFrame
 Row │ b1       b2
     │ Any      Any
─────┼──────────────
   1 │ 3        3
   2 │ missing  4

As you can see, if you pass a function returning a Bool
(such a function is often called a predicate) to the Cols
selector you get columns whose names match the condition
defined by this function.

Today I want to focus on cases when you specify the condition
using a function. However, let me mention that there
are other ways to perform the selection we discussed above. For example
we could use a regular expression:

julia> select(df, Cols(r"^b"))
2×2 DataFrame
 Row │ b1       b2
     │ Any      Any
─────┼──────────────
   1 │ 3        3
   2 │ missing  4

In general, in DataFrames.jl you currently have the following ways to
select columns (Warning! The list is long.):

a symbol, string, or integer;
vector of symbols, strings, integers, or bools;
regular expression;
All, Between, Cols, Not, and : selectors.
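
To make the list concrete, here is a small sketch that applies one selector of each kind to the example data frame. The picked helper is defined only for this illustration (it is not a DataFrames.jl function) and simply returns the names of the columns that select keeps.

```julia
using DataFrames

df = DataFrame(a1=[1, 2], a2=[1, missing], b1=Any[3, missing], b2=Any[3, 4])

# Helper for this example only: names of the columns kept by a selector.
picked(selector) = names(select(df, selector))

picked(:a1)                          # a symbol
picked("a1")                         # a string
picked(1)                            # an integer position
picked([:a1, :b2])                   # a vector of symbols
picked([true, false, false, true])   # a vector of bools, one per column
picked(r"^b")                        # a regular expression matched against names
picked(Between(:a2, :b1))            # a contiguous range of columns
picked(Not(:a1))                     # everything except the given selector
picked(Cols(startswith("b")))        # a predicate applied to column names
picked(All())                        # all columns
picked(:)                            # all columns
```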

Column selection using column values

What if we wanted to select the columns using their values?
For example, assume that we want to pick columns that contain
a missing value. In this case the easiest way to do it is to
use the eachcol(df) iterator over columns of our data frame:

julia> select(df, any.(ismissing, eachcol(df)))
2×2 DataFrame
 Row │ a2       b1
     │ Int64?   Any
─────┼──────────────────
   1 │       1  3
   2 │ missing  missing

Notice that the any.(ismissing, eachcol(df)) condition
iterates over all columns of df and for each of them returns true
if it contains any missing value (and false otherwise):

julia> any.(ismissing, eachcol(df))
4-element BitVector:
 0
 1
 1
 0

An alternative, similar condition would be to select all columns
that allow for storing a missing value (without requiring that
they actually have one stored in them). For this we need to use
the eltype function on columns:

julia> select(df, Missing .<: eltype.(eachcol(df)))
2×3 DataFrame
 Row │ a2       b1       b2
     │ Int64?   Any      Any
─────┼───────────────────────
   1 │       1  3        3
   2 │ missing  missing  4

Note that the difference is column :b2, which does not
contain missing values, but could contain them since its
element type is Any.
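
The difference between the two conditions does not depend on DataFrames.jl. As a minimal sketch, the same four columns can be kept in a plain NamedTuple of vectors (the names cols, contains_missing, and allows_missing are chosen only for this illustration) and both checks applied to each of them:

```julia
# The four columns of the example data frame as plain vectors.
cols = (a1=[1, 2], a2=[1, missing], b1=Any[3, missing], b2=Any[3, 4])

# Does the column actually hold a missing value?
contains_missing = map(c -> any(ismissing, c), cols)

# Could the column hold a missing value, given its element type?
allows_missing = map(c -> Missing <: eltype(c), cols)
```

The first check inspects the stored values, while the second one only looks at the element type, which is why it also accepts the b2 column.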

Column selection using column names and values

Now, what if we wanted to perform column selection based on both
their names and values?

The general pattern uses pairs(eachcol(df)) which iterates
pairs of column names and values:

julia> pairs(eachcol(df))
Iterators.Pairs(::DataFrames.DataFrameColumns{DataFrame}, ::Vector{Symbol})(...):
  :a1 => [1, 2]
  :a2 => Union{Missing, Int64}[1, missing]
  :b1 => Any[3, missing]
  :b2 => Any[3, 4]

So for example if we wanted to pick columns that contain missing values
and start with "a" we can write:

julia> select(df, [startswith(string(n), "a") && any(ismissing, c)
                   for (n,c) in pairs(eachcol(df))])
2×1 DataFrame
 Row │ a2
     │ Int64?
─────┼─────────
   1 │       1
   2 │ missing

This pattern is fully general, but slightly verbose, especially since the column names
returned by pairs are Symbols. The same condition
can be written more naturally as follows:

julia> select(df, Cols(startswith("a"),
                       any.(ismissing, eachcol(df));
                       operator=intersect))
2×1 DataFrame
 Row │ a2
     │ Int64?
─────┼─────────
   1 │       1
   2 │ missing

Here we take advantage of the fact that our condition was a conjunction
and that the Cols selector accepts the operator keyword argument, which allows
us to get an intersection of two selectors.

If we wanted to select columns that meet
at least one of the conditions this would be even simpler:

julia> select(df, Cols(startswith("a"),
                       any.(ismissing, eachcol(df))))
2×3 DataFrame
 Row │ a1     a2       b1
     │ Int64  Int64?   Any
─────┼─────────────────────────
   1 │     1        1  3
   2 │     2  missing  missing

Note that by default Cols selects columns that are in the union of the selectors
passed to it.

Conclusions

I hope the examples I presented today will be useful when you work with
DataFrames.jl.

You might ask why, when performing column selection
based on their values, one needs to invoke eachcol(df). This is indeed
a bit verbose. However, we decided that Cols(predicate), which is
shorter to write, should apply the predicate function to column names
as this is a more common operation. And if a user wants to apply
the predicate function to column values (which is a less frequent case)
writing predicate.(eachcol(df)) is readable and easy enough.
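
If the pattern is needed often, it can already be wrapped in a one-line helper in user code. The sketch below is only an illustration: byvalue and hasmissing are hypothetical names, not part of DataFrames.jl, and the helper still takes the data frame as an argument.

```julia
using DataFrames

# Hypothetical helper, not part of DataFrames.jl: apply a predicate to every
# column and return a Bool vector, which select and Cols already accept.
byvalue(predicate, df::AbstractDataFrame) = [predicate(col) for col in eachcol(df)]

df = DataFrame(a1=[1, 2], a2=[1, missing], b1=Any[3, missing], b2=Any[3, 4])

hasmissing(col) = any(ismissing, col)

# Columns that contain a missing value.
with_missing = select(df, byvalue(hasmissing, df))

# Columns whose name starts with "a" and that contain a missing value.
a_with_missing = select(df, Cols(startswith("a"), byvalue(hasmissing, df); operator=intersect))
```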

If we added a built-in way for selection of columns by value
it would mean that instead of writing
predicate.(eachcol(df)) you would write something like
Vals(predicate) (the Vals is an example name – the choice of name can
be done later).

The benefits of having it are:

It is shorter.
We avoid the redundancy of having to pass the df data frame in the expression.

Here are the cons of adding it:

It makes the list of things to learn longer (and the list is already quite long).
It is ambiguous how the Vals(predicate) should be interpreted if it were
used in the context of GroupedDataFrame as the question would be how should we
treat the groups (so most likely it should be only allowed for AbstractDataFrame).
It would require a change of internal memory layout of DataFrame object
(which means that the next release of DataFrames.jl would be incompatible on
binary level with the current release so serialization/deserialization would
not work cross-versions).

Now comes the question:

Should we add a new special selector that would allow picking
columns based on their values?

If you have an opinion please vote or comment in this issue on GitHub.

Variable importance with EvoTrees.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/08/11/evotrees.html

Introduction

Variable importance in a predictive modeling context
can have several different meanings. Today I want to investigate two of them:

How important is a feature with respect to a given machine learning model?
How important is a feature with respect to a given learning algorithm and a given dataset?

In the post we will discuss why and how these two concepts can differ using an example
of boosted trees. We will use their implementation available in the EvoTrees.jl package.

The post was written under Julia 1.9.2, DataFrames.jl 1.9.2, Distributions.jl 0.25.98,
and EvoTrees.jl 0.15.2.

Generating the test data

Let us start with loading the packages we are going to use and generating the test data.

What I want to have is a data set with 10,000 observations.
We will have 9 continuous features denoted x1 to x9 and a continuous target variable y.
Our goal is to have triplets of features (x1 to x3, x4 to x6, x7 to x9)
highly correlated together (but independent between groups) and having a different
correlation level with the y target. I have made the within-group variables
highly correlated but non-identical (so they are distinguishable and have
a slightly different correlation with y in the sampled data set).

The code generating such data is as follows:

using DataFrames
using Distributions
using EvoTrees
using LinearAlgebra
using Random

δ = 1.0e-6
b = fill(1.0 - δ, 3, 3) + δ * I
z = zeros(3, 3)
y = fill(0.5, 3)
dist = MvNormal([b      z  z      0.8*y
                 z      b  z      y
                 z      z  b      1.2*y
                 0.8*y' y' 1.2*y' 1.0])
Random.seed!(1)
mat = rand(dist, 10_000);
df = DataFrame(transpose(mat), [string.("x", 1:9); "y"]);

Let us have a peek at the data:

julia> names(df)
10-element Vector{String}:
 "x1"
 "x2"
 "x3"
 "x4"
 "x5"
 "x6"
 "x7"
 "x8"
 "x9"
 "y"

julia> df
10000×10 DataFrame
   Row │ x1           x2          x3           x4           x5           x6          ⋯
       │ Float64      Float64     Float64      Float64      Float64      Float64     ⋯
───────┼──────────────────────────────────────────────────────────────────────────────
     1 │  0.0619327    0.0623264   0.0613998    0.0466594    0.0481949    0.0454962  ⋯
     2 │  0.217879     0.216951    0.217742     0.00738607   0.00888615   0.00626926
     3 │  1.54641      1.54598     1.54328     -1.00261     -1.0013      -0.999863
     4 │  0.208777     0.207593    0.209145    -1.21253     -1.21556     -1.21462
     5 │ -0.458805    -0.458081   -0.457956     0.103491     0.10537      0.103313   ⋯
     6 │  0.392072     0.390938    0.390447    -0.354123    -0.354995    -0.353026
     7 │ -0.313095    -0.310223   -0.311185     1.09256      1.09373      1.09443
   ⋮   │      ⋮           ⋮            ⋮            ⋮            ⋮            ⋮      ⋱
  9994 │ -1.24411     -1.24363    -1.24439     -0.789893    -0.793004    -0.792177
  9995 │  0.199036     0.199199    0.199344     0.945945     0.945308     0.943717   ⋯
  9996 │  1.81075      1.80926     1.81064     -2.53813     -2.53805     -2.53996
  9997 │ -0.00896532  -0.0079907  -0.00876527  -0.629303    -0.629402    -0.630129
  9998 │ -1.62881     -1.62626    -1.62703     -0.222873    -0.222469    -0.22166
  9999 │  1.45152      1.44833     1.45131     -2.543       -2.54377     -2.544      ⋯
 10000 │  0.436075     0.435492    0.436974    -0.28131     -0.281519    -0.283039
                                                       4 columns and 9986 rows omitted

Indeed we see that there are 9 features and one target variable. Also we visually
see that variables x1, x2, and x3 are almost the same but not identical.
Similarly x4, x5, and x6.
(I have cropped the rest of the printout as it was too wide for the post.)
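
The intended correlation structure can also be checked numerically. The sketch below is an independent check under my own assumptions rather than the code from this post: it rebuilds the same covariance matrix, but samples from it with a Cholesky factor and randn instead of MvNormal, so the draws differ from the data frame shown above. The names Σ, draws, and C are introduced only for this illustration.

```julia
using LinearAlgebra
using Random
using Statistics

δ = 1.0e-6
b = fill(1.0 - δ, 3, 3) + δ * I    # within-group block: 1 on the diagonal, 1 - δ elsewhere
z = zeros(3, 3)                    # no correlation between groups
y = fill(0.5, 3)
Σ = [b      z  z      0.8*y
     z      b  z      y
     z      z  b      1.2*y
     0.8*y' y' 1.2*y' 1.0]

# A valid covariance matrix has to be positive definite.
@assert isposdef(Σ)

# Sample 10_000 observations: if u has identity covariance, L * u has covariance L * L' = Σ.
rng = Xoshiro(1)
L = cholesky(Symmetric(Σ)).L
draws = L * randn(rng, 10, 10_000)

# Empirical correlation matrix; each column of draws is one observation.
C = cor(draws; dims=2)

C[1, 2]     # within-group correlation, designed to be 1 - δ
C[1, 4]     # between-group correlation, designed to be 0
C[1, 10]    # x1 vs y, designed to be 0.8 * 0.5 = 0.4
C[4, 10]    # x4 vs y, designed to be 0.5
C[7, 10]    # x7 vs y, designed to be 1.2 * 0.5 = 0.6
```

Since all variances are equal to 1, the entries of Σ are at the same time the designed correlations, and the empirical values should be close to them up to sampling noise.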

I chose such a data generation scheme since a priori, that is
with respect to a given machine learning algorithm and a given dataset,
the importance of the features is as follows:

x1, x2, and x3 should have a very similar and lowest importance
(their correlation with y is lowest by design);
x4, x5, and x6 should have a very similar and medium importance;
x7, x8, and x9 should have a very similar and highest importance
(their correlation with y is highest by design).

However, if we build a specific boosted tree model can we expect the same relationship?
Let us check.

Variable importance with respect to a specific model

We build a boosted tree model (using the default settings) and evaluate
variable importance of the features:

julia> model = fit_evotree(EvoTreeRegressor(),
                           df;
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.33002820995126636
 "x4" => 0.17950260124468856
 "x5" => 0.10630471720405912
 "x7" => 0.1002898622306779
 "x1" => 0.09023808819243322
 "x8" => 0.060680998291169054
 "x3" => 0.04789330560493748
 "x6" => 0.044689013127277216
 "x2" => 0.040373204153491105

We see that the x9 feature has the highest importance, but it is quite different
from x8 and x7. The x4 feature, although it has a lower correlation with
y than e.g. the x8 feature, has a higher variable importance (the same holds
for x1 vs x8).

What is the reason for such a situation? It seems that when the boosted tree
model was built, the x9 variable captured most of the explanatory value
for y that x7 and x8 could provide, as they are very similar to it.
Therefore, in this specific model, x7 and x8 are not that important.

Let us try estimating the model for the second time to see if we notice any difference:

julia> model = fit_evotree(EvoTreeRegressor(),
                           df;
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.33002820995126636
 "x4" => 0.17950260124468856
 "x5" => 0.10630471720405912
 "x7" => 0.1002898622306779
 "x1" => 0.09023808819243322
 "x8" => 0.060680998291169054
 "x3" => 0.04789330560493748
 "x6" => 0.044689013127277216
 "x2" => 0.040373204153491105

The results are identical. You might wonder what the reason for this is. The cause
is that the fit_evotree function uses a default seed when doing
computations, so we get the same tree twice. To be precise, when we call
EvoTreeRegressor() it sets the seed of the default random number generator in the
current task to 123.

So let us try shuffling the variables to see if we would get a different result:

julia> model = fit_evotree(EvoTreeRegressor(),
                           df[!, randperm(10)];
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.23187113718728977
 "x8" => 0.20285271199278873
 "x4" => 0.16779901582722756
 "x5" => 0.15415181545562057
 "x3" => 0.08494533205347177
 "x2" => 0.06781415123236784
 "x7" => 0.05788796227269619
 "x1" => 0.024214826049282448
 "x6" => 0.008463047929255035

Indeed the variable importance changed. Would it be still different if we did another
randomized run?

julia> model = fit_evotree(EvoTreeRegressor(),
                           df[!, randperm(10)];
                           target_name="y",
                           verbosity=0);

julia> EvoTrees.importance(model)
9-element Vector{Pair{String, Float64}}:
 "x9" => 0.23187113718728977
 "x8" => 0.20285271199278873
 "x4" => 0.16779901582722756
 "x5" => 0.15415181545562057
 "x3" => 0.08494533205347177
 "x2" => 0.06781415123236784
 "x7" => 0.05788796227269619
 "x1" => 0.024214826049282448
 "x6" => 0.008463047929255035

Maybe this was a surprise to you but the answer is: no. We get the same results.
What is going on?

This time the answer is that the fit_evotree and randperm functions share the same
random number generator (as we run them in the same task) and fit_evotree resets its state
to 123 when we call EvoTreeRegressor().
This means that when we invoked randperm the generator was in the same state both times so the
randperm(10) call produced the same sequence of numbers.
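
This effect can be reproduced without EvoTrees.jl. In the sketch below reseeding_constructor is a hypothetical stand-in for a constructor that reseeds the default random number generator (which is what the text above reports for EvoTreeRegressor()), and shuffled_order mirrors the order of evaluation in the fit_evotree call: the constructor runs first, and randperm(10) is evaluated after it.

```julia
using Random

# Hypothetical stand-in for a constructor that reseeds the task-local default RNG.
reseeding_constructor() = (Random.seed!(123); nothing)

# Positional arguments are evaluated left to right, so randperm always
# draws from a generator that has just been reset to the same state.
function shuffled_order()
    reseeding_constructor()
    return randperm(10)
end

p1 = shuffled_order()
p2 = shuffled_order()
p1 == p2    # true: both permutations come from the same generator state

# With an explicitly passed generator nothing resets the state between calls,
# so consecutive permutations are independent draws.
rng = Xoshiro(1)
q1 = randperm(rng, 10)
q2 = randperm(rng, 10)
```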

Variable importance with respect to an algorithm and a data set

We have learned an important general lesson: we need to properly initialize the functions
that use randomness in our code. Let us leverage this knowledge to
try assessing variable importance with respect to the boosted trees algorithm and the data set
we have generated.

In the code below I generate 1000 boosted tree models
and compute the mean variable importance (along with some additional statistics)
across all of them:

julia> rng = Xoshiro(1);

julia> reduce(vcat,
           map(1:1000) do _
               fit_evotree(EvoTreeRegressor(rng=rng),
                           df;
                           target_name="y",
                           verbosity=0) |>
               EvoTrees.importance |>
               sort |>
               DataFrame
           end) |> describe
9×7 DataFrame
 Row │ variable  mean       min         median     max        nmissing  eltype
     │ Symbol    Float64    Float64     Float64    Float64    Int64     DataType
─────┼───────────────────────────────────────────────────────────────────────────
   1 │ x1        0.094902   0.0456523   0.0946136  0.1437            0  Float64
   2 │ x2        0.0475577  0.0109109   0.0469086  0.0969166         0  Float64
   3 │ x3        0.0361729  0.00189622  0.0353323  0.0844639         0  Float64
   4 │ x4        0.128568   0.032614    0.127093   0.285664          0  Float64
   5 │ x5        0.111577   0.00310871  0.109548   0.238868          0  Float64
   6 │ x6        0.0891413  0.0         0.0879858  0.196108          0  Float64
   7 │ x7        0.189893   0.0400903   0.179193   0.417476          0  Float64
   8 │ x8        0.155352   0.013377    0.154884   0.371452          0  Float64
   9 │ x9        0.146837   0.00642872  0.141293   0.397151          0  Float64

This time we see a better separation between variables x1, x2, and x3, followed
by x4, x5, and x6, and finally the x7, x8, and x9 group.
However, we still see some non-negligible within-group differences (and x1 is even better than x6).
It seems that just ensuring that we properly pass the random number generator
to the fit_evotree function is not enough.
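
The aggregation pipeline itself (sort, DataFrame, reduce(vcat, ...), describe) may be easier to follow on a smaller scale. In the sketch below fake_importance is a hypothetical stand-in for EvoTrees.importance(model): it returns made-up importances for three features, ordered from the most to the least important, so no model is fitted and the numbers carry no meaning.

```julia
using DataFrames
using Random

rng = Xoshiro(1)

# Hypothetical stand-in for EvoTrees.importance(model): random importances
# that sum to 1, sorted in descending order like the output shown above.
function fake_importance(rng)
    w = rand(rng, 3)
    w ./= sum(w)
    return sort!(["x$i" => w[i] for i in 1:3]; by=last, rev=true)
end

# sort orders the pairs by feature name, so every run yields the same column order;
# DataFrame then turns the vector of name => value pairs into a one-row data frame.
one_run = DataFrame(sort(fake_importance(rng)))

# Stack 100 such one-row data frames and summarize each column.
runs = reduce(vcat, [DataFrame(sort(fake_importance(rng))) for _ in 1:100])
stats = describe(runs, :mean, :min, :max)
```

Sorting by name before calling DataFrame matters because the importance vector is ordered by value, so without it the column order would differ between runs.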

Let us then do a final attempt. This time we both properly pass the random number generator to
the fit_evotree model and randomize the order of variables in the source data frame
(making sure we also properly pass the random number generator to the randperm function):

julia> reduce(vcat,
           map(1:1000) do _
               fit_evotree(EvoTreeRegressor(rng=rng),
                           df[!, randperm(rng, 10)];
                           target_name="y",
                           verbosity=0) |>
               EvoTrees.importance |>
               sort |>
               DataFrame
           end) |> describe
9×7 DataFrame
 Row │ variable  mean       min         median     max       nmissing  eltype
     │ Symbol    Float64    Float64     Float64    Float64   Int64     DataType
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ x1        0.0595918  0.00260002  0.0524585  0.153613         0  Float64
   2 │ x2        0.0604078  0.00374867  0.0534607  0.148399         0  Float64
   3 │ x3        0.0586352  0.00219417  0.051398   0.140393         0  Float64
   4 │ x4        0.103219   0.00484802  0.101625   0.252127         0  Float64
   5 │ x5        0.119415   0.00146451  0.116338   0.268475         0  Float64
   6 │ x6        0.106432   0.00672514  0.102708   0.254382         0  Float64
   7 │ x7        0.153185   0.00346922  0.139289   0.388995         0  Float64
   8 │ x8        0.166709   0.00651302  0.161872   0.419284         0  Float64
   9 │ x9        0.172406   0.00979053  0.166798   0.444192         0  Float64

This time we have succeeded. While there is still some variability in within-group
variable importance it is small, and the groups are clearly separated. The worst
features have their importance around 6%, the medium value features around 11%,
and the best features around 16%.

Conclusions

Let us recap what we have seen today.

First, we see that variable importance with respect to a given concrete instance
of a machine learning model can be significantly different from variable importance
with respect to a given learning algorithm and a given dataset. Therefore, one should
think carefully about which of them is relevant for the problem that one
wants to solve.

The second lesson is related to implementation details of
machine learning algorithms:

Many of them use pseudorandom numbers when building a model and proper
handling of the pseudorandom number generator is crucial.
The result of model building can depend on the order of features in the
source data set. In our example we have seen that shuffling the columns
of an input data table produced significantly different variable
importance results.
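
One simple way in which feature order can matter is tie-breaking. The toy sketch below is not how EvoTrees.jl searches for splits; it is an illustration under strong assumptions: three exactly identical features, and a hypothetical best_feature function that picks the column with the highest absolute correlation with the target, where argmax returns the first of the tied columns.

```julia
using Random
using Statistics

rng = Xoshiro(1)
n = 1_000
f = randn(rng, n)
X = [f f f]                       # three identical copies of one feature
y = f .+ 0.5 .* randn(rng, n)

# Toy stand-in for a split search: ties go to the first column.
best_feature(X, y) = argmax([abs(cor(view(X, :, j), y)) for j in axes(X, 2)])

# Fixed column order: the first copy gets all the credit.
fixed = [best_feature(X, y) for _ in 1:300]

# Shuffled column order: the credit is spread across the copies.
shuffled = map(1:300) do _
    perm = randperm(rng, 3)
    perm[best_feature(X[:, perm], y)]   # map back to the original column index
end

counts_fixed = [count(==(j), fixed) for j in 1:3]
counts_shuffled = [count(==(j), shuffled) for j in 1:3]
```

With a fixed order one of the equivalent features is always chosen, while shuffling spreads the choices over all of them, which is in line with the averaging effect of column shuffling observed above.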

I hope you found these examples useful!

juliabloggers.com

A Julia Language Blog Aggregator
