Author Archives: Blog by Bogumił Kamiński

Welcome to DuckDB

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/12/23/duckdb.html

Introduction

I usually write in this blog about DataFrames.jl. This time I decided
to switch my mode to good old SQL queries.

The inspiration from the post is a follow up after the talk
“Data manipulation in Julia for pandas users” I gave last week during
Data Science Summit 2022. In that talk I explained how
DataFrames.jl compares to pandas.

After the talk, I thought to check how the comparison would look against
DuckDB, which is an excellent in-process SQL OLAP database management
system.

In this post I show how a sample analytical scenario can be executed in
DataFrames.jl and DuckDB, and how excellently they are integrated.

The code was run under Julia 1.8.2, CSV.jl 0.10.8, DataFrames.jl 1.4.4, and
DuckDB.jl 0.6.0 on a laptop with 32GB of RAM. I perform tests using 8 cores.

The task

When I was preparing the “Data manipulation in Julia for pandas users” talk
I wanted to choose some data set that is not too big (so that things are
quick) nor not too small (so that they are not too quick).

So I went to Kaggle and searched for datasets with: 5 to 10 GB of data, data in
CSV files, a few (but not many) files. Ordered such datasets by hotness and the
one on top of the list was: Instagram dataset, consisting of three tables:

  • instagram_posts.csv: 42M rows of individual post information;
  • instagram_locations.csv: 1.2M rows of location information;
  • instagram_profiles.csv: 4.5M rows of user profile information.

You can find the dataset here.

For each post we can match a profile of a person who made it and a location
where it was made. Therefore I decided to do the following operations as a test:

  • read the data from disk to RAM;
  • drop all rows with missing data in posts;
  • drop all rows with missing data in profiles, and in case of duplicate profiles
    keep only one of them;
  • drop all rows with missing data in locations;
  • left join posts and profiles;
  • left join posts and locations;
  • compute some aggregate statistics of posts data by country.

Below I show how these tasks can be performed in DataFrames.jl and
DuckDB.

(as a side note – doing this “as is” in pandas was not possible on my
laptop as I was running out of RAM and some of the steps in this process took
too long; therefore in my talk during the conference I presented a bit different
scenario)

Initial inspection of the data

We start with Julia to peek into data. First do it in Julia:

julia> using DataFrames

julia> using CSV

julia> using Statistics

julia> describe(CSV.read("instagram_posts.csv", DataFrame, limit=100), :eltype)
10×2 DataFrame
 Row │ variable         eltype
     │ Symbol           Type
─────┼─────────────────────────────────────────
   1 │ sid              Int64
   2 │ sid_profile      Int64
   3 │ post_id          String15
   4 │ profile_id       Int64
   5 │ location_id      Int64
   6 │ cts              String31
   7 │ post_type        Int64
   8 │ description      Union{Missing, String}
   9 │ numbr_likes      Int64
  10 │ number_comments  Int64

julia> describe(CSV.read("instagram_profiles.csv", DataFrame, limit=100), :eltype)
11×2 DataFrame
 Row │ variable             eltype
     │ Symbol               Type
─────┼─────────────────────────────────────────────
   1 │ sid                  Int64
   2 │ profile_id           Int64
   3 │ profile_name         String31
   4 │ firstname_lastname   String31
   5 │ description          Union{Missing, String}
   6 │ following            Union{Missing, Int64}
   7 │ followers            Union{Missing, Int64}
   8 │ n_posts              Union{Missing, Int64}
   9 │ url                  Union{Missing, String}
  10 │ cts                  String31
  11 │ is_business_account  Union{Missing, Bool}

julia> describe(CSV.read("instagram_locations.csv", DataFrame, limit=100), :eltype)
23×2 DataFrame
 Row │ variable                eltype
     │ Symbol                  Type
─────┼──────────────────────────────────────────────────
   1 │ sid                     Int64
   2 │ id                      Int64
   3 │ name                    String
   4 │ street                  Union{Missing, String}
   5 │ zip                     Union{Missing, String15}
   6 │ city                    Union{Missing, String}
   7 │ region                  Missing
   8 │ cd                      Union{Missing, String3}
   9 │ phone                   Union{Missing, String31}
  10 │ aj_exact_city_match     Union{Missing, Bool}
  11 │ aj_exact_country_match  Union{Missing, Bool}
  12 │ blurb                   String
  13 │ dir_city_id             Union{Missing, String15}
  14 │ dir_city_name           Union{Missing, String}
  15 │ dir_city_slug           Union{Missing, String}
  16 │ dir_country_id          Union{Missing, String3}
  17 │ dir_country_name        Union{Missing, String15}
  18 │ lat                     Union{Missing, Float64}
  19 │ lng                     Union{Missing, Float64}
  20 │ primary_alias_on_fb     String
  21 │ slug                    String
  22 │ website                 String
  23 │ cts                     String31

Since the data frames had many columns I summarized them keeping only element
type. I loaded first 100 rows of data for inspection.

Now a similar task in DuckDB:

julia> using DuckDB

julia> con = DBInterface.connect(DuckDB.DB, ":memory:")
DuckDB.DB(":memory:")

julia> DBInterface.execute(con,
           """
           DESCRIBE
           SELECT *
           FROM 'instagram_posts.csv'
           LIMIT 100
           """)
10×6 DataFrame
 Row │ column_name      column_type  null     key      default  extra
     │ String?          String?      String?  String?  String?  String?
─────┼──────────────────────────────────────────────────────────────────
   1 │ sid              INTEGER      YES      missing  missing  missing
   2 │ sid_profile      INTEGER      YES      missing  missing  missing
   3 │ post_id          VARCHAR      YES      missing  missing  missing
   4 │ profile_id       BIGINT       YES      missing  missing  missing
   5 │ location_id      BIGINT       YES      missing  missing  missing
   6 │ cts              TIMESTAMP    YES      missing  missing  missing
   7 │ post_type        INTEGER      YES      missing  missing  missing
   8 │ description      VARCHAR      YES      missing  missing  missing
   9 │ numbr_likes      INTEGER      YES      missing  missing  missing
  10 │ number_comments  INTEGER      YES      missing  missing  missing

julia> DBInterface.execute(con,
           """
           DESCRIBE
           SELECT *
           FROM 'instagram_profiles.csv'
               LIMIT 100
           """)
11×6 DataFrame
 Row │ column_name          column_type  null     key      default  extra
     │ String?              String?      String?  String?  String?  String?
─────┼──────────────────────────────────────────────────────────────────────
   1 │ sid                  INTEGER      YES      missing  missing  missing
   2 │ profile_id           BIGINT       YES      missing  missing  missing
   3 │ profile_name         VARCHAR      YES      missing  missing  missing
   4 │ firstname_lastname   VARCHAR      YES      missing  missing  missing
   5 │ description          VARCHAR      YES      missing  missing  missing
   6 │ following            INTEGER      YES      missing  missing  missing
   7 │ followers            INTEGER      YES      missing  missing  missing
   8 │ n_posts              INTEGER      YES      missing  missing  missing
   9 │ url                  VARCHAR      YES      missing  missing  missing
  10 │ cts                  VARCHAR      YES      missing  missing  missing
  11 │ is_business_account  BOOLEAN      YES      missing  missing  missing

julia> DBInterface.execute(con,
           """
           DESCRIBE
           SELECT *
           FROM 'instagram_locations.csv'
           LIMIT 100
           """)
23×6 DataFrame
 Row │ column_name             column_type  null     key      default  extra
     │ String?                 String?      String?  String?  String?  String?
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ sid                     INTEGER      YES      missing  missing  missing
   2 │ id                      BIGINT       YES      missing  missing  missing
   3 │ name                    VARCHAR      YES      missing  missing  missing
   4 │ street                  VARCHAR      YES      missing  missing  missing
   5 │ zip                     VARCHAR      YES      missing  missing  missing
   6 │ city                    VARCHAR      YES      missing  missing  missing
   7 │ region                  VARCHAR      YES      missing  missing  missing
   8 │ cd                      VARCHAR      YES      missing  missing  missing
   9 │ phone                   VARCHAR      YES      missing  missing  missing
  10 │ aj_exact_city_match     BOOLEAN      YES      missing  missing  missing
  11 │ aj_exact_country_match  BOOLEAN      YES      missing  missing  missing
  12 │ blurb                   VARCHAR      YES      missing  missing  missing
  13 │ dir_city_id             VARCHAR      YES      missing  missing  missing
  14 │ dir_city_name           VARCHAR      YES      missing  missing  missing
  15 │ dir_city_slug           VARCHAR      YES      missing  missing  missing
  16 │ dir_country_id          VARCHAR      YES      missing  missing  missing
  17 │ dir_country_name        VARCHAR      YES      missing  missing  missing
  18 │ lat                     DOUBLE       YES      missing  missing  missing
  19 │ lng                     DOUBLE       YES      missing  missing  missing
  20 │ primary_alias_on_fb     VARCHAR      YES      missing  missing  missing
  21 │ slug                    VARCHAR      YES      missing  missing  missing
  22 │ website                 VARCHAR      YES      missing  missing  missing
  23 │ cts                     TIMESTAMP    YES      missing  missing  missing

Note that we already see how nicely DuckDB is integrated with Julia. You get
a DataFrame as a result, so you can work with it, if you wanted.

Performing the analysis

Let us check the performance of the requested operation end-to-end in both
ecosystems.

Start with pure Julia solution:

julia> @time begin
           posts = CSV.read("instagram_posts.csv", DataFrame,
                            select=[:profile_id, :location_id, :number_comments])
           profiles = CSV.read("instagram_profiles.csv", DataFrame,
                               select=[:profile_id, :n_posts])
           locations = CSV.read("instagram_locations.csv", DataFrame,
                                select=[:id, :dir_country_name])
           posts2 = dropmissing(posts)
           profiles2 = combine(groupby(dropmissing(profiles), :profile_id),
                               :n_posts => maximum => :n_posts)
           locations2 = dropmissing(locations)
           leftjoin!(posts2, profiles2, on="profile_id")
           leftjoin!(posts2, locations2, on="location_id" => "id")
           gdf = groupby(posts2, "dir_country_name", sort=true, skipmissing=true)
           combine(gdf, [:n_posts, :number_comments] .=> mean∘skipmissing)
       end
 23.323866 seconds (514.97 k allocations: 5.622 GiB, 3.30% gc time)
235×3 DataFrame
 Row │ dir_country_name      n_posts_mean_skipmissing  number_comments_mean_skipmissing
     │ String?               Float64                   Float64
─────┼──────────────────────────────────────────────────────────────────────────────────
   1 │ Afghanistan                            445.58                            8.10767
   2 │ Albania                                524.789                           6.83736
   3 │ Algeria                                353.933                           9.9979
   4 │ American Samoa                         117.692                           7.87879
   5 │ Andorra                                486.206                           7.01296
   6 │ Angola                                 469.621                           7.48282
   7 │ Anguilla                              1206.22                            4.77193
   8 │ Antarctica                            1165.97                           20.3073
   9 │ Antigua                                638.995                           8.05999
  10 │ Argentina                              727.673                          11.0821
  11 │ Armenia                                663.196                           4.76728
  ⋮  │          ⋮                       ⋮                             ⋮
 226 │ Uruguay                                560.936                           8.00079
 227 │ Uzbekistan                             445.975                           8.84245
 228 │ Vanuatu                                539.349                           4.63688
 229 │ Vatican City                           583.498                           7.05353
 230 │ Venezuela                              551.79                            9.4976
 231 │ Vietnam                                582.829                           4.46671
 232 │ Western Sahara                         643.888                           4.6676
 233 │ Yemen                                  411.5                            27.9416
 234 │ Zambia                                 411.779                           7.09239
 235 │ Zimbabwe                               518.17                            8.3875
                                                                        214 rows omitted

Now compare it to DuckDB:

julia> @time DBInterface.execute(con,
           """
           SELECT dir_country_name,
               mean(n_posts) FILTER (WHERE n_posts IS NOT NULL) n_posts_mean,
               mean(number_comments) FILTER (WHERE number_comments IS NOT NULL) number_comments_mean
           FROM (SELECT po.number_comments, pr.n_posts, loc.dir_country_name
                 FROM (SELECT profile_id, location_id, number_comments FROM 'instagram_posts.csv'
                       WHERE profile_id IS NOT NULL
                             AND location_id IS NOT NULL
                             AND number_comments IS NOT NULL
                       ) po
                 LEFT JOIN (SELECT profile_id, max(n_posts) n_posts
                            FROM 'instagram_profiles.csv'
                            WHERE profile_id IS NOT NULL
                                  AND n_posts IS NOT NULL
                            GROUP BY profile_id) pr
                 ON po.profile_id = pr.profile_id
                 LEFT JOIN (SELECT id, dir_country_name FROM 'instagram_locations.csv'
                            WHERE id IS NOT NULL
                                  AND dir_country_name IS NOT NULL) loc
                 ON po.location_id = loc.id)
           WHERE dir_country_name IS NOT NULL
           GROUP BY dir_country_name
           ORDER BY dir_country_name
           """)
 93.220900 seconds (166.60 k allocations: 8.578 MiB, 0.11% compilation time)
235×3 DataFrame
 Row │ dir_country_name      n_posts_mean  number_comments_mean
     │ String?               Float64?      Float64?
─────┼──────────────────────────────────────────────────────────
   1 │ Afghanistan                445.58                8.10767
   2 │ Albania                    524.789               6.83736
   3 │ Algeria                    353.933               9.9979
   4 │ American Samoa             117.692               7.87879
   5 │ Andorra                    486.206               7.01296
   6 │ Angola                     469.621               7.48282
   7 │ Anguilla                  1206.22                4.77193
   8 │ Antarctica                1165.97               20.3073
   9 │ Antigua                    638.995               8.05999
  10 │ Argentina                  727.673              11.0821
  11 │ Armenia                    663.196               4.76728
  ⋮  │          ⋮                 ⋮                 ⋮
 226 │ Uruguay                    560.936               8.00079
 227 │ Uzbekistan                 445.975               8.84245
 228 │ Vanuatu                    539.349               4.63688
 229 │ Vatican City               583.498               7.05353
 230 │ Venezuela                  551.79                9.4976
 231 │ Vietnam                    582.829               4.46671
 232 │ Western Sahara             643.888               4.6676
 233 │ Yemen                      411.5                27.9416
 234 │ Zambia                     411.779               7.09239
 235 │ Zimbabwe                   518.17                8.3875
                                                214 rows omitted

We see that we get the same result. The performance is worse. However, let
us split out the data ingest into RAM part to a separate process.

Here we will use a very nice feature of DuckDB that you can register a data
frame to be a table.

We start with Julia again:

julia> @time begin
           posts2 = dropmissing(posts)
           profiles2 = combine(groupby(dropmissing(profiles), :profile_id),
                               :n_posts => maximum => :n_posts)
           locations2 = dropmissing(locations)
           leftjoin!(posts2, profiles2, on="profile_id")
           leftjoin!(posts2, locations2, on="location_id" => "id")
           gdf = groupby(posts2, "dir_country_name", sort=true, skipmissing=true)
           combine(gdf, [:n_posts, :number_comments] .=> mean∘skipmissing)
       end;
  5.632483 seconds (1.61 k allocations: 4.065 GiB, 2.94% gc time)

Now check DuckDB:

julia> DuckDB.register_data_frame(con, posts, "posts")

julia> DuckDB.register_data_frame(con, profiles, "profiles")

julia> DuckDB.register_data_frame(con, locations, "locations")

julia> @time DBInterface.execute(con,
           """
           SELECT dir_country_name,
               mean(n_posts) FILTER (WHERE n_posts IS NOT NULL) n_posts_mean,
               mean(number_comments) FILTER (WHERE number_comments IS NOT NULL) number_comments_mean
           FROM (SELECT po.number_comments, pr.n_posts, loc.dir_country_name
                 FROM (SELECT profile_id, location_id, number_comments 
                       FROM posts
                       WHERE profile_id IS NOT NULL
                             AND location_id IS NOT NULL
                             AND number_comments IS NOT NULL
                       ) po
                 LEFT JOIN (SELECT profile_id, max(n_posts) n_posts
                            FROM (SELECT profile_id, n_posts
                                  FROM profiles
                                  WHERE n_posts IS NOT NULL)
                            GROUP BY profile_id) pr
                 ON po.profile_id = pr.profile_id
                 LEFT JOIN (SELECT id, dir_country_name
                            FROM locations
                            WHERE id IS NOT NULL
                                  AND dir_country_name IS NOT NULL) loc
                 ON po.location_id = loc.id)
           GROUP BY dir_country_name
           ORDER BY dir_country_name
           """);
  1.818248 seconds (616.46 k allocations: 24.675 MiB, 14.41% compilation time)

As you can see this time DuckDB is faster than Julia. Also we see that the
majority of original time was spent in reading the data from disk and that
the analysis part in both ecosystems is reasonably fast.

I have investigated into this further and the major conclusions for the
performance difference are:

  • DuckDB performs the query in one shot, while in DataFrames.jl the API is
    eager, materializing all intermediate tables; when I split the big SQL query
    into steps materializing intermediate tables the performance difference was
    significantly smaller.
  • joins are faster in DuckDB (this is a known issue in DataFrames.jl that we
    still need to add multi-threading support to some of the algorithms we
    provide);
  • split-apply-combine is faster in DataFrames.jl.

Conclusions

For me the key takeaways from this test were:

  • DuckDB excellently integrates with Julia and DataFrames.jl. As you saw
    you can register data frames as tables in DuckDB and get query results as
    data frames.
  • DuckDB is fast. You can appreciate these differences especially if
    you perform large joins (this will improve in DataFrames.jl in the future)
    and when you execute complex queries (this will not likely improve in
    DataFrames.jl in the near future – the current design of the package is eager
    and does not support composed query optimization).
  • I found DataFrmes.jl code easier to read, but probably I am biased here.
    Data scientists proficient with SQL probably will find DuckDB code more
    natural.

Happy hacking with Julia and DuckDB!

Handling summary statistics of empty collections

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/12/17/empty.html

Introduction

When designing solutions for data analysis one is often faced with a tough
choice between doing the correct thing and the convenient thing.

One particular case of such a situation is computing summary statistics of
empty collections. The reason is that often such statistics are not properly
defined for empty data (so Julia throws an error), but data scientist instead
would want to get some flag value instead. Today I want to discuss typical
cases of such situations and possible solutions.

The post was written under Julia 1.8.2 and DataFrames.jl 1.4.4.

Sum and product

In this case the situation is least problematic. Typically you will get what
you expect (i.e. respectively zero or one of the domain of values you
aggregate):

julia> sum(Int[])
0

julia> prod(Float64[])
1.0

julia> sum(skipmissing(Union{Int, Missing}[missing, missing]))
0

julia> prod(skipmissing(Union{Float64, Missing}[missing, missing]))
1.0

The only case that is problematic is when you want to work with an empty
container of a too-wide element type:

julia> sum([])
ERROR: MethodError: no method matching zero(::Type{Any})

A standard solution, since both sum and prod are reductions is to provide an
initialization value manually in this case:

julia> sum([], init=0)
0

julia> prod([], init=1.0)
1.0

Minimum and maximum

When computing minimum and maximum by default you get an error with empty
collections:

julia> minimum(Int[])
ERROR: MethodError: reducing over an empty collection is not allowed;
consider supplying `init` to the reducer

As you can see, we get and error are prompted to pass the init value for the
reduction, so the situation is less convenient.

Indeed, if such init value can be reasonably passed this is a good solution:

julia> minimum(Float64[], init=Inf)
Inf

julia> maximum(Int[], init=typemin(Int))
-9223372036854775808

or e.g. if we know we are working with values that must be in some range, we can
provide this range. A common case is with probabilities:

julia> minimum([], init=1.0)
1.0

julia> maximum([], init=0.0)
0.0

Sometimes, however, we might want to have a special signal value. In this case
you have two options. One is to check if the collection is empty, the other is
to catch exception:

julia> x = Int[]
Int64[]

julia> isempty(x) ? missing : minimum(x)
missing

julia> try
           minimum(x)
       catch e
           isa(e, MethodError) ? missing : rethrow(e)
       end
missing

You could wrap both solutions with a function for convenience, if you use them
often in your code. Their downside is that they add a bit of computational
overhead. The isempty check in some cases is not a O(1) operation. The most
common case is skipmissing. The trycatch approach introduces the cost of
handling of the exception.

Extrema

In case of the extrema function the situation is analogous to minimum and
maximum. The only difference is that you pass two values to init if you
want to use this method. Here is an example assuming we are processing data
that are probabilities:

julia> extrema(Float64[], init=(1.0, 0.0))
(1.0, 0.0)

Note that in this case minimum is greater than the maximum, so we can
immediately see that the passed collection was empty.

Mean, variance, and standard deviation

When computing mean, var, or std, we get NaN when working with an empty
collection:

julia> using Statistics

julia> mean(Int[])
NaN

julia> var(Float64[])
NaN

julia> std(Float64[])
NaN

This is expected, as we are performing division by zero in their computation.
Also, similarly to sum and prod, when the collection has a too-wide element
type we get an error:

julia> mean([])
ERROR: MethodError: no method matching zero(::Type{Any})

If we do not like this default behavior and want to handle an empty collection
in a special case checking if the container is empty is a standard solution:

julia> x = Float64[]
Float64[]

julia> isempty(x) ? missing : var(x)
missing

Median and quantile

Computing quantiles, and median in particular, is the least convenient case
as for them we always get an error and cannot use init value (as they are not
reductions):

julia> median(Int[])
ERROR: ArgumentError: median of an empty array is undefined, Int64[]

julia> quantile(Float64[], 0.1)
ERROR: ArgumentError: empty data vector

Here, currently, the only solution is to either check if the collection is empty
or catch the exception:

julia> x = Float64[]
Float64[]

julia> isempty(x) ? missing : median(x)
missing

julia> try
           quantile(x, 0.1)
       catch e
           isa(e, ArgumentError) ? missing : rethrow(e)
       end
missing

Conclusions

My post today was meant to be a quick reference for Julia users who sometimes
hit these issues when working with their data. My experience is that the most
common scenario of this kind is connected with missing data. Here is a typical
problematic case:

julia> using DataFrames

julia> using Random

julia> Random.seed!(1234);

julia> df = DataFrame(id=rand(1:10^6, 10^6),
                      value=rand([1:10; missing], 10^6))
1000000×2 DataFrame
     Row │ id      value
         │ Int64   Int64?
─────────┼─────────────────
       1 │ 325977        4
       2 │ 549052        9
       3 │ 218587        9
       4 │ 894246        8
       5 │ 353112        1
       6 │ 394256       10
       7 │ 953125  missing
       8 │ 795547        5
       9 │ 494250        1
    ⋮    │   ⋮        ⋮
  999993 │ 967428        9
  999994 │ 557085        1
  999995 │ 353965        5
  999996 │ 590548       10
  999997 │ 657727        2
  999998 │ 928733        3
  999999 │ 884126  missing
 1000000 │ 587503        2
        999983 rows omitted

I on purpose generated the data in a way that has quite a few missing values:

julia> combine(groupby(df, :value), proprow)
11×2 DataFrame
 Row │ value    proprow
     │ Int64?   Float64
─────┼───────────────────
   1 │       1  0.091109
   2 │       2  0.091387
   3 │       3  0.091394
   4 │       4  0.090954
   5 │       5  0.090504
   6 │       6  0.091412
   7 │       7  0.090809
   8 │       8  0.090844
   9 │       9  0.090254
  10 │      10  0.090795
  11 │ missing  0.090538

Now notice that some groups will only have missing values (in the output below
group with :id equal to 12 in row 7 is such a case):

julia> combine(groupby(df, :id),
               :value => (x -> mean(ismissing, x)) => :propmissing)
632166×2 DataFrame
    Row │ id       propmissing
        │ Int64    Float64
────────┼──────────────────────
      1 │       2     0.0
      2 │       3     0.25
      3 │       4     0.0
      4 │       6     0.0
      5 │       8     0.0
      6 │       9     0.333333
      7 │      12     1.0
      8 │      15     0.5
      9 │      16     0.0
   ⋮    │    ⋮          ⋮
 632159 │  999990     0.0
 632160 │  999991     0.0
 632161 │  999992     0.0
 632162 │  999993     0.0
 632163 │  999994     0.0
 632164 │  999996     0.0
 632165 │  999997     0.0
 632166 │ 1000000     0.0
            632149 rows omitted

If we now try to compute e.g. median value per group while skipping missing
we fail:

julia> combine(groupby(df, :id), :value => median∘skipmissing)
ERROR: ArgumentError: median of an empty array is undefined, Int64[]

The solution, as we discussed in this post is to handle the case of an empty
collection in a special way. I typically prefer the isempty check.

So first define a helper function:

julia> withempty(f, default) = x -> isempty(x) ? default : f(x)
withempty (generic function with 1 method)

and now we can write:

julia> combine(groupby(df, :id),
               :value => withempty(median, missing)∘skipmissing)
632166×2 DataFrame
    Row │ id       value_function_skipmissing
        │ Int64    Union{Missing, Float64}
────────┼─────────────────────────────────────
      1 │       2                         3.5
      2 │       3                         5.0
      3 │       4                         8.0
      4 │       6                         4.0
      5 │       8                         1.0
      6 │       9                         9.0
      7 │      12                   missing
      8 │      15                         3.0
      9 │      16                         7.0
   ⋮    │    ⋮                 ⋮
 632159 │  999990                         4.0
 632160 │  999991                         3.0
 632161 │  999992                         8.0
 632162 │  999993                         2.0
 632163 │  999994                         7.0
 632164 │  999996                         4.0
 632165 │  999997                         7.0
 632166 │ 1000000                         7.0
                           632149 rows omitted

As you can see we get missing in row 7 for group with :id value equal to
12 as expected.

If you have some thoughts about pros and cons of the approaches I discussed
today please check out this issue and comment there. Thank you!

Lessons in the fundamentals of Julia from master Toshiro Kageyama

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/12/09/lessons.html

Introduction

Toshiro Kageyama was an amateur Go player who became a successful
professional. His story had a profound influence on my thinking about to data
science and programming.

Let me start with a quote about his understanding of how go players differ
(taken from the Lessons In The Fundamentals of Go book):

Once it was thought (…) that amateurs could not even approach the
professional level. Nowadays, however, the great surge in the size of the
Go-playing public has narrowed the distance has narrowed the distance between
the tracks.

Does this sound familiar? In the past people that created complex, large scale
machine learning solutions, were mostly PhDs in computer science or similar.
Now the situation has changed. Thousands of developers, with various
backgrounds, successfully create amazing projects.

Toshiro Kageyama notes that every Go-player wants to get stronger. Likewise,
every data scientist, amateur or professional, wants to be able to create
better models, faster, and more reliably.

So what is the secret for becoming stronger? The major message of Toshiro
Kageyama is that you need to learn and focus on the fundamentals. In the book
he gives numerous examples from Go, baseball, sumo wrestling, or even cooking,
highlighting how important it is to follow basic principles of your trade no
matter if you are amateur or professional. He stresses that in order to
progress you cannot take shortcuts hoping that you can bypass fundamentals. In
short term this might give some effects, but will not work in the long run.

What does all this have to do with Julia you probably wonder by now.

I have been writing the Julia For Data Analysis book for a year and
it has been published this week. In this post I want to discuss what important
fundamental principle I re-learned many times during this process.

Always crate project environment for your code

One of the basic things that any data scientist learns is that any project
should be accompanied by a complete specification of environment in which
it should be run.

In Julia this is achieved via sharing Project.toml and Manifest.toml files as
is explained in the Pkg.jl manual.

So far all seems easy. If you want reproducibility of your work create the
Project.toml and Manifest.toml files for your code and Julia will
automatically instantiate the required environment.

This is indeed a golden rule that works like magic. The fact that Julia has
such excellent package manager built into the language is one of the things
that make me use it.

However, when I was writing a book I faced several challenges that are similar
to problems that large projects developed over a long period of time face.

Changes in versions of Julia

So what was my major problem? I wrote the book under Julia 1.7.
After I finished it Julia 1.8 was released.

Why is it an issue? Well – version of Julia is part of the specification
of the project environment. Therefore, technically, I should expect my readers
to install Julia 1.7 when running examples from my book.

However, this is what I wanted to avoid. It is natural to expect that users
will want to use latest version of Julia when learning it. Therefore I had
to make sure that my codes work both under Julia 1.7 and Julia 1.8.

And here I hit the first problem – not all that works under Julia 1.7 works
under Julia 1.8. You might be surprised by this, as Julia 1.8 should not
introduce any breaking changes. However, the major problem were binary
dependencies, that is external code dependencies that are not written in Julia.
Such dependencies sometimes have versions that are Julia-version specific. This
is an observation that I recommend you keep in mind when working on some
project that is expected to be worked on for a longer period of time.

Binary dependencies, round two

Julia has an excellent integration with Python. You can easily and conveniently
execute Python code as a part of your Julia programs. I wanted to show this
functionality in the book. As an example I used t-SNE from scikit-learn
(as a side note: there are very nice implementations of this algorithm in pure
Julia in TSne.jl or Manifolds.jl).

Guess what happened next? I got many requests from the readers indicating they
had problems with proper setting up of Python on their machines. Although there
are packages in Julia that help to automatically perform Python setup
(PyCall.jl and Conda.jl), still the diverse range of end-user environment
configurations meant that ensuring that proper Python executable along with all
proper Python packages is available is not always trivial.

Handling of bug fixes

Another problem I had was handling of bug fixes in packages. The (real)
scenario is as follows: I have finished writing a book and released all the
codes and after some time a serious bug in some package is revealed (by serious
I mostly mean things like security threats or severe memory leaks; guess what –
they mostly happen when some package has binary dependencies).

Now the problem is as follows: I have some package in version e.g. 4.3.1
defined in my Project.toml, but the bug is found and fixed only after the
package has version e.g. 5.1.2.

You most likely already see where the problem is. Typically the bug fix is not
backported to 4.x branch of the package. I need to update the version of the
package to 5.x branch (and then a cascade of other packages gets updated). In
the end – I need to re-run all my source code for the book to check if it still
works both under Julia 1.7 and 1.8.

The reality is that it did not work. And here let me comment on two reasons
why:

  • I had to make a small adjustment because version 5.x of the package made some
    breaking change in the API. It is a problem but not super big – since I got
    an error indicating that API changed.
  • The second problem is more subtle. In the packages that got recursively
    updated some of them changed minor or patch versions. You would assume that
    nothing should break then. And here we hit a problem that is really serious
    (and is my call to package developers, if they still read this text):

Do not use internal APIs of packages on which your package depends!

(another fundamental rule to re-learn)

In my case this is what happened. Some package had a version change and
modified its internal API. This is a non-breaking change, so all should be
good. Unfortunately, it was not, as some other package relied on this
internal API and stopped working.

Why I still think Julia package manager is great

Someone might say that all these problems are easily solved by contenerization.
Indeed, I could have done this, but I did not want to. There are two reasons:

  • First, I want readers of my book to be exposed to real experience of using
    Julia – in all its aspects. And in practice you expect that you will use
    Julia installed on your machine and need to install packages.
  • Second, Julia package manager is excellent. Although I had problems I listed
    above, the fact is that they are minor and solvable.

Let me comment on several features of Julia package manager that are especially
useful:

  • When you add or update package you can choose resolution tier that
    precisely describes what changes of versions of dependencies of the package
    are allowed (ranging from: preserve all to update all to the latest and
    greatest). This is important as when I only want to make a bug-fix update
    I want to allow as few changes to the other packages as possible.
  • Package management is lightweight. Although each of your projects can have
    a bit different set of packages and their versions it depends on, Julia
    package manager keeps a federated repository of the packages. This means that
    a specific version of the package needs to be installed only once per machine;
    this saves a lot of time and disk space.
  • It has an excellent support for artifacts, that can be any
    binary dependencies of the package; in particular it allows for shipping
    pre-built platform-specific dependencies. So although the binaries can be
    Julia-version dependent you do not need any extra external tools to build
    such packages. And the beauty of this solution is that it works
    on Linux, Mac, and Windows (think if have you ever had a situation that some
    package does not want to build on Linux when you e.g. used R).

Final thoughts

Today we talk about fundamentals. I also tried to write my
Julia for Data Analysis book in a way that covers fundamentals of
Julia for data analysis.

For this reason I decided to skip discussing advanced machine learning
algorithms or integration options with various data sources. Instead, I cover
in the book essential topics that a data scientist will relatively soon need to
learn about the language (while skipping topics that are most likely less
important). The fundamentals are divided into two parts in the book:

  • Part 1: the Julia language (e.g. various data container types, multiple
    dispatch, various ways to work with strings).
  • Part 2: the data analysis ecosystem (e.g. working with data frames, plotting,
    representing relationships in data using graphs).

With this book I hope to convince the readers that Julia language was designed
so that you can easily pick it up but go very far to become a top data science
developer. So, in principle, it sticks to what Toshiro Kageyama said: it allows
a wide community of analysts to be able to do what in the past was only
possible for experts.

Let me give a single example of this kind. When starting to work with Julia
users are often confused by -> (syntax defining an anonymous function), =>
(a notation allowing you to create key-value pairs), ==(1) (a way to create a
function that compares its argument to 1 using ==, a.k.a partial function
application), or . (symbol used when broadcasting a function).

Often, when reading some advanced code such new users see expressions like
(tested under Julia 1.8.2, DataFrames.jl 1.4.4, and DataFramesMeta.jl 0.12.0):

julia> using DataFrames

julia> df = DataFrame(a = [1, 3, 2, 1, 4])
5×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     3
   3 │     2
   4 │     1
   5 │     4

julia> transform(df, :a => ByRow(==(1)) => :a_eq_1)
5×2 DataFrame
 Row │ a      a_eq_1
     │ Int64  Bool
─────┼───────────────
   1 │     1    true
   2 │     3   false
   3 │     2   false
   4 │     1    true
   5 │     4   false

They might feel intimidated by the complexity of the syntax.
While they could just write:

julia> using DataFramesMeta

julia> @rtransform(df, :a_eq_1 = :a == 1)
5×2 DataFrame
 Row │ a      a_eq_1
     │ Int64  Bool
─────┼───────────────
   1 │     1    true
   2 │     3   false
   3 │     2   false
   4 │     1    true
   5 │     4   false

The first style (using transform) is an underlying operation specification
syntax of DataFrames.jl: designed to be flexible and allowing for expression of
very complex operations. The second style (using @rtransform) is a domain
specific language that is designed to be easy to pick up (dplyr users can
mostly just start using it without learning much).

In the book I cover both. Most likely you want to start with an easy syntax
that allows you to just get you the result you desire. However, I also explain
all the fundamentals of both Julia and its data analysis ecosystem, so when you
need to perform more complex tasks you will know all the required features and
you are ready to use them.

For example in the last chapter of the book, I show how to create and use a
web service that performs valuation of Asian options using multi-threading
to ensure its fast response time. As you can imagine, it is a relatively
complex (but fortunately short) code, and you need to have a good grasp of
fundamentals to easily understand it. If you would like to learn why similar
solutions are written using Julia in the industry I recommend you check out
Timeline case study.