Tag Archives: julialang

Welcome to DuckDB

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/12/23/duckdb.html

Introduction

I usually write in this blog about DataFrames.jl. This time I decided
to switch my mode to good old SQL queries.

The inspiration from the post is a follow up after the talk
“Data manipulation in Julia for pandas users” I gave last week during
Data Science Summit 2022. In that talk I explained how
DataFrames.jl compares to pandas.

After the talk, I thought to check how the comparison would look against
DuckDB, which is an excellent in-process SQL OLAP database management
system.

In this post I show how a sample analytical scenario can be executed in
DataFrames.jl and DuckDB, and how excellently they are integrated.

The code was run under Julia 1.8.2, CSV.jl 0.10.8, DataFrames.jl 1.4.4, and
DuckDB.jl 0.6.0 on a laptop with 32GB of RAM. I perform tests using 8 cores.

The task

When I was preparing the “Data manipulation in Julia for pandas users” talk
I wanted to choose some data set that is not too big (so that things are
quick) nor not too small (so that they are not too quick).

So I went to Kaggle and searched for datasets with: 5 to 10 GB of data, data in
CSV files, a few (but not many) files. Ordered such datasets by hotness and the
one on top of the list was: Instagram dataset, consisting of three tables:

  • instagram_posts.csv: 42M rows of individual post information;
  • instagram_locations.csv: 1.2M rows of location information;
  • instagram_profiles.csv: 4.5M rows of user profile information.

You can find the dataset here.

For each post we can match a profile of a person who made it and a location
where it was made. Therefore I decided to do the following operations as a test:

  • read the data from disk to RAM;
  • drop all rows with missing data in posts;
  • drop all rows with missing data in profiles, and in case of duplicate profiles
    keep only one of them;
  • drop all rows with missing data in locations;
  • left join posts and profiles;
  • left join posts and locations;
  • compute some aggregate statistics of posts data by country.

Below I show how these tasks can be performed in DataFrames.jl and
DuckDB.

(as a side note – doing this “as is” in pandas was not possible on my
laptop as I was running out of RAM and some of the steps in this process took
too long; therefore in my talk during the conference I presented a bit different
scenario)

Initial inspection of the data

We start with Julia to peek into data. First do it in Julia:

julia> using DataFrames

julia> using CSV

julia> using Statistics

julia> describe(CSV.read("instagram_posts.csv", DataFrame, limit=100), :eltype)
10×2 DataFrame
 Row │ variable         eltype
     │ Symbol           Type
─────┼─────────────────────────────────────────
   1 │ sid              Int64
   2 │ sid_profile      Int64
   3 │ post_id          String15
   4 │ profile_id       Int64
   5 │ location_id      Int64
   6 │ cts              String31
   7 │ post_type        Int64
   8 │ description      Union{Missing, String}
   9 │ numbr_likes      Int64
  10 │ number_comments  Int64

julia> describe(CSV.read("instagram_profiles.csv", DataFrame, limit=100), :eltype)
11×2 DataFrame
 Row │ variable             eltype
     │ Symbol               Type
─────┼─────────────────────────────────────────────
   1 │ sid                  Int64
   2 │ profile_id           Int64
   3 │ profile_name         String31
   4 │ firstname_lastname   String31
   5 │ description          Union{Missing, String}
   6 │ following            Union{Missing, Int64}
   7 │ followers            Union{Missing, Int64}
   8 │ n_posts              Union{Missing, Int64}
   9 │ url                  Union{Missing, String}
  10 │ cts                  String31
  11 │ is_business_account  Union{Missing, Bool}

julia> describe(CSV.read("instagram_locations.csv", DataFrame, limit=100), :eltype)
23×2 DataFrame
 Row │ variable                eltype
     │ Symbol                  Type
─────┼──────────────────────────────────────────────────
   1 │ sid                     Int64
   2 │ id                      Int64
   3 │ name                    String
   4 │ street                  Union{Missing, String}
   5 │ zip                     Union{Missing, String15}
   6 │ city                    Union{Missing, String}
   7 │ region                  Missing
   8 │ cd                      Union{Missing, String3}
   9 │ phone                   Union{Missing, String31}
  10 │ aj_exact_city_match     Union{Missing, Bool}
  11 │ aj_exact_country_match  Union{Missing, Bool}
  12 │ blurb                   String
  13 │ dir_city_id             Union{Missing, String15}
  14 │ dir_city_name           Union{Missing, String}
  15 │ dir_city_slug           Union{Missing, String}
  16 │ dir_country_id          Union{Missing, String3}
  17 │ dir_country_name        Union{Missing, String15}
  18 │ lat                     Union{Missing, Float64}
  19 │ lng                     Union{Missing, Float64}
  20 │ primary_alias_on_fb     String
  21 │ slug                    String
  22 │ website                 String
  23 │ cts                     String31

Since the data frames had many columns I summarized them keeping only element
type. I loaded first 100 rows of data for inspection.

Now a similar task in DuckDB:

julia> using DuckDB

julia> con = DBInterface.connect(DuckDB.DB, ":memory:")
DuckDB.DB(":memory:")

julia> DBInterface.execute(con,
           """
           DESCRIBE
           SELECT *
           FROM 'instagram_posts.csv'
           LIMIT 100
           """)
10×6 DataFrame
 Row │ column_name      column_type  null     key      default  extra
     │ String?          String?      String?  String?  String?  String?
─────┼──────────────────────────────────────────────────────────────────
   1 │ sid              INTEGER      YES      missing  missing  missing
   2 │ sid_profile      INTEGER      YES      missing  missing  missing
   3 │ post_id          VARCHAR      YES      missing  missing  missing
   4 │ profile_id       BIGINT       YES      missing  missing  missing
   5 │ location_id      BIGINT       YES      missing  missing  missing
   6 │ cts              TIMESTAMP    YES      missing  missing  missing
   7 │ post_type        INTEGER      YES      missing  missing  missing
   8 │ description      VARCHAR      YES      missing  missing  missing
   9 │ numbr_likes      INTEGER      YES      missing  missing  missing
  10 │ number_comments  INTEGER      YES      missing  missing  missing

julia> DBInterface.execute(con,
           """
           DESCRIBE
           SELECT *
           FROM 'instagram_profiles.csv'
               LIMIT 100
           """)
11×6 DataFrame
 Row │ column_name          column_type  null     key      default  extra
     │ String?              String?      String?  String?  String?  String?
─────┼──────────────────────────────────────────────────────────────────────
   1 │ sid                  INTEGER      YES      missing  missing  missing
   2 │ profile_id           BIGINT       YES      missing  missing  missing
   3 │ profile_name         VARCHAR      YES      missing  missing  missing
   4 │ firstname_lastname   VARCHAR      YES      missing  missing  missing
   5 │ description          VARCHAR      YES      missing  missing  missing
   6 │ following            INTEGER      YES      missing  missing  missing
   7 │ followers            INTEGER      YES      missing  missing  missing
   8 │ n_posts              INTEGER      YES      missing  missing  missing
   9 │ url                  VARCHAR      YES      missing  missing  missing
  10 │ cts                  VARCHAR      YES      missing  missing  missing
  11 │ is_business_account  BOOLEAN      YES      missing  missing  missing

julia> DBInterface.execute(con,
           """
           DESCRIBE
           SELECT *
           FROM 'instagram_locations.csv'
           LIMIT 100
           """)
23×6 DataFrame
 Row │ column_name             column_type  null     key      default  extra
     │ String?                 String?      String?  String?  String?  String?
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ sid                     INTEGER      YES      missing  missing  missing
   2 │ id                      BIGINT       YES      missing  missing  missing
   3 │ name                    VARCHAR      YES      missing  missing  missing
   4 │ street                  VARCHAR      YES      missing  missing  missing
   5 │ zip                     VARCHAR      YES      missing  missing  missing
   6 │ city                    VARCHAR      YES      missing  missing  missing
   7 │ region                  VARCHAR      YES      missing  missing  missing
   8 │ cd                      VARCHAR      YES      missing  missing  missing
   9 │ phone                   VARCHAR      YES      missing  missing  missing
  10 │ aj_exact_city_match     BOOLEAN      YES      missing  missing  missing
  11 │ aj_exact_country_match  BOOLEAN      YES      missing  missing  missing
  12 │ blurb                   VARCHAR      YES      missing  missing  missing
  13 │ dir_city_id             VARCHAR      YES      missing  missing  missing
  14 │ dir_city_name           VARCHAR      YES      missing  missing  missing
  15 │ dir_city_slug           VARCHAR      YES      missing  missing  missing
  16 │ dir_country_id          VARCHAR      YES      missing  missing  missing
  17 │ dir_country_name        VARCHAR      YES      missing  missing  missing
  18 │ lat                     DOUBLE       YES      missing  missing  missing
  19 │ lng                     DOUBLE       YES      missing  missing  missing
  20 │ primary_alias_on_fb     VARCHAR      YES      missing  missing  missing
  21 │ slug                    VARCHAR      YES      missing  missing  missing
  22 │ website                 VARCHAR      YES      missing  missing  missing
  23 │ cts                     TIMESTAMP    YES      missing  missing  missing

Note that we already see how nicely DuckDB is integrated with Julia. You get
a DataFrame as a result, so you can work with it, if you wanted.

Performing the analysis

Let us check the performance of the requested operation end-to-end in both
ecosystems.

Start with pure Julia solution:

julia> @time begin
           posts = CSV.read("instagram_posts.csv", DataFrame,
                            select=[:profile_id, :location_id, :number_comments])
           profiles = CSV.read("instagram_profiles.csv", DataFrame,
                               select=[:profile_id, :n_posts])
           locations = CSV.read("instagram_locations.csv", DataFrame,
                                select=[:id, :dir_country_name])
           posts2 = dropmissing(posts)
           profiles2 = combine(groupby(dropmissing(profiles), :profile_id),
                               :n_posts => maximum => :n_posts)
           locations2 = dropmissing(locations)
           leftjoin!(posts2, profiles2, on="profile_id")
           leftjoin!(posts2, locations2, on="location_id" => "id")
           gdf = groupby(posts2, "dir_country_name", sort=true, skipmissing=true)
           combine(gdf, [:n_posts, :number_comments] .=> mean∘skipmissing)
       end
 23.323866 seconds (514.97 k allocations: 5.622 GiB, 3.30% gc time)
235×3 DataFrame
 Row │ dir_country_name      n_posts_mean_skipmissing  number_comments_mean_skipmissing
     │ String?               Float64                   Float64
─────┼──────────────────────────────────────────────────────────────────────────────────
   1 │ Afghanistan                            445.58                            8.10767
   2 │ Albania                                524.789                           6.83736
   3 │ Algeria                                353.933                           9.9979
   4 │ American Samoa                         117.692                           7.87879
   5 │ Andorra                                486.206                           7.01296
   6 │ Angola                                 469.621                           7.48282
   7 │ Anguilla                              1206.22                            4.77193
   8 │ Antarctica                            1165.97                           20.3073
   9 │ Antigua                                638.995                           8.05999
  10 │ Argentina                              727.673                          11.0821
  11 │ Armenia                                663.196                           4.76728
  ⋮  │          ⋮                       ⋮                             ⋮
 226 │ Uruguay                                560.936                           8.00079
 227 │ Uzbekistan                             445.975                           8.84245
 228 │ Vanuatu                                539.349                           4.63688
 229 │ Vatican City                           583.498                           7.05353
 230 │ Venezuela                              551.79                            9.4976
 231 │ Vietnam                                582.829                           4.46671
 232 │ Western Sahara                         643.888                           4.6676
 233 │ Yemen                                  411.5                            27.9416
 234 │ Zambia                                 411.779                           7.09239
 235 │ Zimbabwe                               518.17                            8.3875
                                                                        214 rows omitted

Now compare it to DuckDB:

julia> @time DBInterface.execute(con,
           """
           SELECT dir_country_name,
               mean(n_posts) FILTER (WHERE n_posts IS NOT NULL) n_posts_mean,
               mean(number_comments) FILTER (WHERE number_comments IS NOT NULL) number_comments_mean
           FROM (SELECT po.number_comments, pr.n_posts, loc.dir_country_name
                 FROM (SELECT profile_id, location_id, number_comments FROM 'instagram_posts.csv'
                       WHERE profile_id IS NOT NULL
                             AND location_id IS NOT NULL
                             AND number_comments IS NOT NULL
                       ) po
                 LEFT JOIN (SELECT profile_id, max(n_posts) n_posts
                            FROM 'instagram_profiles.csv'
                            WHERE profile_id IS NOT NULL
                                  AND n_posts IS NOT NULL
                            GROUP BY profile_id) pr
                 ON po.profile_id = pr.profile_id
                 LEFT JOIN (SELECT id, dir_country_name FROM 'instagram_locations.csv'
                            WHERE id IS NOT NULL
                                  AND dir_country_name IS NOT NULL) loc
                 ON po.location_id = loc.id)
           WHERE dir_country_name IS NOT NULL
           GROUP BY dir_country_name
           ORDER BY dir_country_name
           """)
 93.220900 seconds (166.60 k allocations: 8.578 MiB, 0.11% compilation time)
235×3 DataFrame
 Row │ dir_country_name      n_posts_mean  number_comments_mean
     │ String?               Float64?      Float64?
─────┼──────────────────────────────────────────────────────────
   1 │ Afghanistan                445.58                8.10767
   2 │ Albania                    524.789               6.83736
   3 │ Algeria                    353.933               9.9979
   4 │ American Samoa             117.692               7.87879
   5 │ Andorra                    486.206               7.01296
   6 │ Angola                     469.621               7.48282
   7 │ Anguilla                  1206.22                4.77193
   8 │ Antarctica                1165.97               20.3073
   9 │ Antigua                    638.995               8.05999
  10 │ Argentina                  727.673              11.0821
  11 │ Armenia                    663.196               4.76728
  ⋮  │          ⋮                 ⋮                 ⋮
 226 │ Uruguay                    560.936               8.00079
 227 │ Uzbekistan                 445.975               8.84245
 228 │ Vanuatu                    539.349               4.63688
 229 │ Vatican City               583.498               7.05353
 230 │ Venezuela                  551.79                9.4976
 231 │ Vietnam                    582.829               4.46671
 232 │ Western Sahara             643.888               4.6676
 233 │ Yemen                      411.5                27.9416
 234 │ Zambia                     411.779               7.09239
 235 │ Zimbabwe                   518.17                8.3875
                                                214 rows omitted

We see that we get the same result. The performance is worse. However, let
us split out the data ingest into RAM part to a separate process.

Here we will use a very nice feature of DuckDB that you can register a data
frame to be a table.

We start with Julia again:

julia> @time begin
           posts2 = dropmissing(posts)
           profiles2 = combine(groupby(dropmissing(profiles), :profile_id),
                               :n_posts => maximum => :n_posts)
           locations2 = dropmissing(locations)
           leftjoin!(posts2, profiles2, on="profile_id")
           leftjoin!(posts2, locations2, on="location_id" => "id")
           gdf = groupby(posts2, "dir_country_name", sort=true, skipmissing=true)
           combine(gdf, [:n_posts, :number_comments] .=> mean∘skipmissing)
       end;
  5.632483 seconds (1.61 k allocations: 4.065 GiB, 2.94% gc time)

Now check DuckDB:

julia> DuckDB.register_data_frame(con, posts, "posts")

julia> DuckDB.register_data_frame(con, profiles, "profiles")

julia> DuckDB.register_data_frame(con, locations, "locations")

julia> @time DBInterface.execute(con,
           """
           SELECT dir_country_name,
               mean(n_posts) FILTER (WHERE n_posts IS NOT NULL) n_posts_mean,
               mean(number_comments) FILTER (WHERE number_comments IS NOT NULL) number_comments_mean
           FROM (SELECT po.number_comments, pr.n_posts, loc.dir_country_name
                 FROM (SELECT profile_id, location_id, number_comments 
                       FROM posts
                       WHERE profile_id IS NOT NULL
                             AND location_id IS NOT NULL
                             AND number_comments IS NOT NULL
                       ) po
                 LEFT JOIN (SELECT profile_id, max(n_posts) n_posts
                            FROM (SELECT profile_id, n_posts
                                  FROM profiles
                                  WHERE n_posts IS NOT NULL)
                            GROUP BY profile_id) pr
                 ON po.profile_id = pr.profile_id
                 LEFT JOIN (SELECT id, dir_country_name
                            FROM locations
                            WHERE id IS NOT NULL
                                  AND dir_country_name IS NOT NULL) loc
                 ON po.location_id = loc.id)
           GROUP BY dir_country_name
           ORDER BY dir_country_name
           """);
  1.818248 seconds (616.46 k allocations: 24.675 MiB, 14.41% compilation time)

As you can see this time DuckDB is faster than Julia. Also we see that the
majority of original time was spent in reading the data from disk and that
the analysis part in both ecosystems is reasonably fast.

I have investigated into this further and the major conclusions for the
performance difference are:

  • DuckDB performs the query in one shot, while in DataFrames.jl the API is
    eager, materializing all intermediate tables; when I split the big SQL query
    into steps materializing intermediate tables the performance difference was
    significantly smaller.
  • joins are faster in DuckDB (this is a known issue in DataFrames.jl that we
    still need to add multi-threading support to some of the algorithms we
    provide);
  • split-apply-combine is faster in DataFrames.jl.

Conclusions

For me the key takeaways from this test were:

  • DuckDB excellently integrates with Julia and DataFrames.jl. As you saw
    you can register data frames as tables in DuckDB and get query results as
    data frames.
  • DuckDB is fast. You can appreciate these differences especially if
    you perform large joins (this will improve in DataFrames.jl in the future)
    and when you execute complex queries (this will not likely improve in
    DataFrames.jl in the near future – the current design of the package is eager
    and does not support composed query optimization).
  • I found DataFrmes.jl code easier to read, but probably I am biased here.
    Data scientists proficient with SQL probably will find DuckDB code more
    natural.

Happy hacking with Julia and DuckDB!

Handling summary statistics of empty collections

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/12/17/empty.html

Introduction

When designing solutions for data analysis one is often faced with a tough
choice between doing the correct thing and the convenient thing.

One particular case of such a situation is computing summary statistics of
empty collections. The reason is that often such statistics are not properly
defined for empty data (so Julia throws an error), but data scientist instead
would want to get some flag value instead. Today I want to discuss typical
cases of such situations and possible solutions.

The post was written under Julia 1.8.2 and DataFrames.jl 1.4.4.

Sum and product

In this case the situation is least problematic. Typically you will get what
you expect (i.e. respectively zero or one of the domain of values you
aggregate):

julia> sum(Int[])
0

julia> prod(Float64[])
1.0

julia> sum(skipmissing(Union{Int, Missing}[missing, missing]))
0

julia> prod(skipmissing(Union{Float64, Missing}[missing, missing]))
1.0

The only case that is problematic is when you want to work with an empty
container of a too-wide element type:

julia> sum([])
ERROR: MethodError: no method matching zero(::Type{Any})

A standard solution, since both sum and prod are reductions is to provide an
initialization value manually in this case:

julia> sum([], init=0)
0

julia> prod([], init=1.0)
1.0

Minimum and maximum

When computing minimum and maximum by default you get an error with empty
collections:

julia> minimum(Int[])
ERROR: MethodError: reducing over an empty collection is not allowed;
consider supplying `init` to the reducer

As you can see, we get and error are prompted to pass the init value for the
reduction, so the situation is less convenient.

Indeed, if such init value can be reasonably passed this is a good solution:

julia> minimum(Float64[], init=Inf)
Inf

julia> maximum(Int[], init=typemin(Int))
-9223372036854775808

or e.g. if we know we are working with values that must be in some range, we can
provide this range. A common case is with probabilities:

julia> minimum([], init=1.0)
1.0

julia> maximum([], init=0.0)
0.0

Sometimes, however, we might want to have a special signal value. In this case
you have two options. One is to check if the collection is empty, the other is
to catch exception:

julia> x = Int[]
Int64[]

julia> isempty(x) ? missing : minimum(x)
missing

julia> try
           minimum(x)
       catch e
           isa(e, MethodError) ? missing : rethrow(e)
       end
missing

You could wrap both solutions with a function for convenience, if you use them
often in your code. Their downside is that they add a bit of computational
overhead. The isempty check in some cases is not a O(1) operation. The most
common case is skipmissing. The trycatch approach introduces the cost of
handling of the exception.

Extrema

In case of the extrema function the situation is analogous to minimum and
maximum. The only difference is that you pass two values to init if you
want to use this method. Here is an example assuming we are processing data
that are probabilities:

julia> extrema(Float64[], init=(1.0, 0.0))
(1.0, 0.0)

Note that in this case minimum is greater than the maximum, so we can
immediately see that the passed collection was empty.

Mean, variance, and standard deviation

When computing mean, var, or std, we get NaN when working with an empty
collection:

julia> using Statistics

julia> mean(Int[])
NaN

julia> var(Float64[])
NaN

julia> std(Float64[])
NaN

This is expected, as we are performing division by zero in their computation.
Also, similarly to sum and prod, when the collection has a too-wide element
type we get an error:

julia> mean([])
ERROR: MethodError: no method matching zero(::Type{Any})

If we do not like this default behavior and want to handle an empty collection
in a special case checking if the container is empty is a standard solution:

julia> x = Float64[]
Float64[]

julia> isempty(x) ? missing : var(x)
missing

Median and quantile

Computing quantiles, and median in particular, is the least convenient case
as for them we always get an error and cannot use init value (as they are not
reductions):

julia> median(Int[])
ERROR: ArgumentError: median of an empty array is undefined, Int64[]

julia> quantile(Float64[], 0.1)
ERROR: ArgumentError: empty data vector

Here, currently, the only solution is to either check if the collection is empty
or catch the exception:

julia> x = Float64[]
Float64[]

julia> isempty(x) ? missing : median(x)
missing

julia> try
           quantile(x, 0.1)
       catch e
           isa(e, ArgumentError) ? missing : rethrow(e)
       end
missing

Conclusions

My post today was meant to be a quick reference for Julia users who sometimes
hit these issues when working with their data. My experience is that the most
common scenario of this kind is connected with missing data. Here is a typical
problematic case:

julia> using DataFrames

julia> using Random

julia> Random.seed!(1234);

julia> df = DataFrame(id=rand(1:10^6, 10^6),
                      value=rand([1:10; missing], 10^6))
1000000×2 DataFrame
     Row │ id      value
         │ Int64   Int64?
─────────┼─────────────────
       1 │ 325977        4
       2 │ 549052        9
       3 │ 218587        9
       4 │ 894246        8
       5 │ 353112        1
       6 │ 394256       10
       7 │ 953125  missing
       8 │ 795547        5
       9 │ 494250        1
    ⋮    │   ⋮        ⋮
  999993 │ 967428        9
  999994 │ 557085        1
  999995 │ 353965        5
  999996 │ 590548       10
  999997 │ 657727        2
  999998 │ 928733        3
  999999 │ 884126  missing
 1000000 │ 587503        2
        999983 rows omitted

I on purpose generated the data in a way that has quite a few missing values:

julia> combine(groupby(df, :value), proprow)
11×2 DataFrame
 Row │ value    proprow
     │ Int64?   Float64
─────┼───────────────────
   1 │       1  0.091109
   2 │       2  0.091387
   3 │       3  0.091394
   4 │       4  0.090954
   5 │       5  0.090504
   6 │       6  0.091412
   7 │       7  0.090809
   8 │       8  0.090844
   9 │       9  0.090254
  10 │      10  0.090795
  11 │ missing  0.090538

Now notice that some groups will only have missing values (in the output below
group with :id equal to 12 in row 7 is such a case):

julia> combine(groupby(df, :id),
               :value => (x -> mean(ismissing, x)) => :propmissing)
632166×2 DataFrame
    Row │ id       propmissing
        │ Int64    Float64
────────┼──────────────────────
      1 │       2     0.0
      2 │       3     0.25
      3 │       4     0.0
      4 │       6     0.0
      5 │       8     0.0
      6 │       9     0.333333
      7 │      12     1.0
      8 │      15     0.5
      9 │      16     0.0
   ⋮    │    ⋮          ⋮
 632159 │  999990     0.0
 632160 │  999991     0.0
 632161 │  999992     0.0
 632162 │  999993     0.0
 632163 │  999994     0.0
 632164 │  999996     0.0
 632165 │  999997     0.0
 632166 │ 1000000     0.0
            632149 rows omitted

If we now try to compute e.g. median value per group while skipping missing
we fail:

julia> combine(groupby(df, :id), :value => median∘skipmissing)
ERROR: ArgumentError: median of an empty array is undefined, Int64[]

The solution, as we discussed in this post is to handle the case of an empty
collection in a special way. I typically prefer the isempty check.

So first define a helper function:

julia> withempty(f, default) = x -> isempty(x) ? default : f(x)
withempty (generic function with 1 method)

and now we can write:

julia> combine(groupby(df, :id),
               :value => withempty(median, missing)∘skipmissing)
632166×2 DataFrame
    Row │ id       value_function_skipmissing
        │ Int64    Union{Missing, Float64}
────────┼─────────────────────────────────────
      1 │       2                         3.5
      2 │       3                         5.0
      3 │       4                         8.0
      4 │       6                         4.0
      5 │       8                         1.0
      6 │       9                         9.0
      7 │      12                   missing
      8 │      15                         3.0
      9 │      16                         7.0
   ⋮    │    ⋮                 ⋮
 632159 │  999990                         4.0
 632160 │  999991                         3.0
 632161 │  999992                         8.0
 632162 │  999993                         2.0
 632163 │  999994                         7.0
 632164 │  999996                         4.0
 632165 │  999997                         7.0
 632166 │ 1000000                         7.0
                           632149 rows omitted

As you can see we get missing in row 7 for group with :id value equal to
12 as expected.

If you have some thoughts about pros and cons of the approaches I discussed
today please check out this issue and comment there. Thank you!