By: Logan Kilpatrick
Re-posted from: https://juliazoid.com/heres-how-nasa-is-using-julia-to-better-understand-the-ocean-fd172322a918?source=rss-2c8aac9051d3------2
How ocean and climate research are being powered by the Julia programming language
By: Logan Kilpatrick
Re-posted from: https://juliazoid.com/heres-how-nasa-is-using-julia-to-better-understand-the-ocean-fd172322a918?source=rss-2c8aac9051d3------2
How ocean and climate research are being powered by the Julia programming language
Re-posted from: https://bkamins.github.io/julialang/2022/12/23/duckdb.html
I usually write in this blog about DataFrames.jl. This time I decided
to switch my mode to good old SQL queries.
The inspiration from the post is a follow up after the talk
“Data manipulation in Julia for pandas users” I gave last week during
Data Science Summit 2022. In that talk I explained how
DataFrames.jl compares to pandas.
After the talk, I thought to check how the comparison would look against
DuckDB, which is an excellent in-process SQL OLAP database management
system.
In this post I show how a sample analytical scenario can be executed in
DataFrames.jl and DuckDB, and how excellently they are integrated.
The code was run under Julia 1.8.2, CSV.jl 0.10.8, DataFrames.jl 1.4.4, and
DuckDB.jl 0.6.0 on a laptop with 32GB of RAM. I perform tests using 8 cores.
When I was preparing the “Data manipulation in Julia for pandas users” talk
I wanted to choose some data set that is not too big (so that things are
quick) nor not too small (so that they are not too quick).
So I went to Kaggle and searched for datasets with: 5 to 10 GB of data, data in
CSV files, a few (but not many) files. Ordered such datasets by hotness and the
one on top of the list was: Instagram dataset, consisting of three tables:
instagram_posts.csv: 42M rows of individual post information;instagram_locations.csv: 1.2M rows of location information;instagram_profiles.csv: 4.5M rows of user profile information.You can find the dataset here.
For each post we can match a profile of a person who made it and a location
where it was made. Therefore I decided to do the following operations as a test:
Below I show how these tasks can be performed in DataFrames.jl and
DuckDB.
(as a side note – doing this “as is” in pandas was not possible on my
laptop as I was running out of RAM and some of the steps in this process took
too long; therefore in my talk during the conference I presented a bit different
scenario)
We start with Julia to peek into data. First do it in Julia:
julia> using DataFrames
julia> using CSV
julia> using Statistics
julia> describe(CSV.read("instagram_posts.csv", DataFrame, limit=100), :eltype)
10×2 DataFrame
Row │ variable eltype
│ Symbol Type
─────┼─────────────────────────────────────────
1 │ sid Int64
2 │ sid_profile Int64
3 │ post_id String15
4 │ profile_id Int64
5 │ location_id Int64
6 │ cts String31
7 │ post_type Int64
8 │ description Union{Missing, String}
9 │ numbr_likes Int64
10 │ number_comments Int64
julia> describe(CSV.read("instagram_profiles.csv", DataFrame, limit=100), :eltype)
11×2 DataFrame
Row │ variable eltype
│ Symbol Type
─────┼─────────────────────────────────────────────
1 │ sid Int64
2 │ profile_id Int64
3 │ profile_name String31
4 │ firstname_lastname String31
5 │ description Union{Missing, String}
6 │ following Union{Missing, Int64}
7 │ followers Union{Missing, Int64}
8 │ n_posts Union{Missing, Int64}
9 │ url Union{Missing, String}
10 │ cts String31
11 │ is_business_account Union{Missing, Bool}
julia> describe(CSV.read("instagram_locations.csv", DataFrame, limit=100), :eltype)
23×2 DataFrame
Row │ variable eltype
│ Symbol Type
─────┼──────────────────────────────────────────────────
1 │ sid Int64
2 │ id Int64
3 │ name String
4 │ street Union{Missing, String}
5 │ zip Union{Missing, String15}
6 │ city Union{Missing, String}
7 │ region Missing
8 │ cd Union{Missing, String3}
9 │ phone Union{Missing, String31}
10 │ aj_exact_city_match Union{Missing, Bool}
11 │ aj_exact_country_match Union{Missing, Bool}
12 │ blurb String
13 │ dir_city_id Union{Missing, String15}
14 │ dir_city_name Union{Missing, String}
15 │ dir_city_slug Union{Missing, String}
16 │ dir_country_id Union{Missing, String3}
17 │ dir_country_name Union{Missing, String15}
18 │ lat Union{Missing, Float64}
19 │ lng Union{Missing, Float64}
20 │ primary_alias_on_fb String
21 │ slug String
22 │ website String
23 │ cts String31
Since the data frames had many columns I summarized them keeping only element
type. I loaded first 100 rows of data for inspection.
Now a similar task in DuckDB:
julia> using DuckDB
julia> con = DBInterface.connect(DuckDB.DB, ":memory:")
DuckDB.DB(":memory:")
julia> DBInterface.execute(con,
"""
DESCRIBE
SELECT *
FROM 'instagram_posts.csv'
LIMIT 100
""")
10×6 DataFrame
Row │ column_name column_type null key default extra
│ String? String? String? String? String? String?
─────┼──────────────────────────────────────────────────────────────────
1 │ sid INTEGER YES missing missing missing
2 │ sid_profile INTEGER YES missing missing missing
3 │ post_id VARCHAR YES missing missing missing
4 │ profile_id BIGINT YES missing missing missing
5 │ location_id BIGINT YES missing missing missing
6 │ cts TIMESTAMP YES missing missing missing
7 │ post_type INTEGER YES missing missing missing
8 │ description VARCHAR YES missing missing missing
9 │ numbr_likes INTEGER YES missing missing missing
10 │ number_comments INTEGER YES missing missing missing
julia> DBInterface.execute(con,
"""
DESCRIBE
SELECT *
FROM 'instagram_profiles.csv'
LIMIT 100
""")
11×6 DataFrame
Row │ column_name column_type null key default extra
│ String? String? String? String? String? String?
─────┼──────────────────────────────────────────────────────────────────────
1 │ sid INTEGER YES missing missing missing
2 │ profile_id BIGINT YES missing missing missing
3 │ profile_name VARCHAR YES missing missing missing
4 │ firstname_lastname VARCHAR YES missing missing missing
5 │ description VARCHAR YES missing missing missing
6 │ following INTEGER YES missing missing missing
7 │ followers INTEGER YES missing missing missing
8 │ n_posts INTEGER YES missing missing missing
9 │ url VARCHAR YES missing missing missing
10 │ cts VARCHAR YES missing missing missing
11 │ is_business_account BOOLEAN YES missing missing missing
julia> DBInterface.execute(con,
"""
DESCRIBE
SELECT *
FROM 'instagram_locations.csv'
LIMIT 100
""")
23×6 DataFrame
Row │ column_name column_type null key default extra
│ String? String? String? String? String? String?
─────┼─────────────────────────────────────────────────────────────────────────
1 │ sid INTEGER YES missing missing missing
2 │ id BIGINT YES missing missing missing
3 │ name VARCHAR YES missing missing missing
4 │ street VARCHAR YES missing missing missing
5 │ zip VARCHAR YES missing missing missing
6 │ city VARCHAR YES missing missing missing
7 │ region VARCHAR YES missing missing missing
8 │ cd VARCHAR YES missing missing missing
9 │ phone VARCHAR YES missing missing missing
10 │ aj_exact_city_match BOOLEAN YES missing missing missing
11 │ aj_exact_country_match BOOLEAN YES missing missing missing
12 │ blurb VARCHAR YES missing missing missing
13 │ dir_city_id VARCHAR YES missing missing missing
14 │ dir_city_name VARCHAR YES missing missing missing
15 │ dir_city_slug VARCHAR YES missing missing missing
16 │ dir_country_id VARCHAR YES missing missing missing
17 │ dir_country_name VARCHAR YES missing missing missing
18 │ lat DOUBLE YES missing missing missing
19 │ lng DOUBLE YES missing missing missing
20 │ primary_alias_on_fb VARCHAR YES missing missing missing
21 │ slug VARCHAR YES missing missing missing
22 │ website VARCHAR YES missing missing missing
23 │ cts TIMESTAMP YES missing missing missing
Note that we already see how nicely DuckDB is integrated with Julia. You get
a DataFrame as a result, so you can work with it, if you wanted.
Let us check the performance of the requested operation end-to-end in both
ecosystems.
Start with pure Julia solution:
julia> @time begin
posts = CSV.read("instagram_posts.csv", DataFrame,
select=[:profile_id, :location_id, :number_comments])
profiles = CSV.read("instagram_profiles.csv", DataFrame,
select=[:profile_id, :n_posts])
locations = CSV.read("instagram_locations.csv", DataFrame,
select=[:id, :dir_country_name])
posts2 = dropmissing(posts)
profiles2 = combine(groupby(dropmissing(profiles), :profile_id),
:n_posts => maximum => :n_posts)
locations2 = dropmissing(locations)
leftjoin!(posts2, profiles2, on="profile_id")
leftjoin!(posts2, locations2, on="location_id" => "id")
gdf = groupby(posts2, "dir_country_name", sort=true, skipmissing=true)
combine(gdf, [:n_posts, :number_comments] .=> mean∘skipmissing)
end
23.323866 seconds (514.97 k allocations: 5.622 GiB, 3.30% gc time)
235×3 DataFrame
Row │ dir_country_name n_posts_mean_skipmissing number_comments_mean_skipmissing
│ String? Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────
1 │ Afghanistan 445.58 8.10767
2 │ Albania 524.789 6.83736
3 │ Algeria 353.933 9.9979
4 │ American Samoa 117.692 7.87879
5 │ Andorra 486.206 7.01296
6 │ Angola 469.621 7.48282
7 │ Anguilla 1206.22 4.77193
8 │ Antarctica 1165.97 20.3073
9 │ Antigua 638.995 8.05999
10 │ Argentina 727.673 11.0821
11 │ Armenia 663.196 4.76728
⋮ │ ⋮ ⋮ ⋮
226 │ Uruguay 560.936 8.00079
227 │ Uzbekistan 445.975 8.84245
228 │ Vanuatu 539.349 4.63688
229 │ Vatican City 583.498 7.05353
230 │ Venezuela 551.79 9.4976
231 │ Vietnam 582.829 4.46671
232 │ Western Sahara 643.888 4.6676
233 │ Yemen 411.5 27.9416
234 │ Zambia 411.779 7.09239
235 │ Zimbabwe 518.17 8.3875
214 rows omitted
Now compare it to DuckDB:
julia> @time DBInterface.execute(con,
"""
SELECT dir_country_name,
mean(n_posts) FILTER (WHERE n_posts IS NOT NULL) n_posts_mean,
mean(number_comments) FILTER (WHERE number_comments IS NOT NULL) number_comments_mean
FROM (SELECT po.number_comments, pr.n_posts, loc.dir_country_name
FROM (SELECT profile_id, location_id, number_comments FROM 'instagram_posts.csv'
WHERE profile_id IS NOT NULL
AND location_id IS NOT NULL
AND number_comments IS NOT NULL
) po
LEFT JOIN (SELECT profile_id, max(n_posts) n_posts
FROM 'instagram_profiles.csv'
WHERE profile_id IS NOT NULL
AND n_posts IS NOT NULL
GROUP BY profile_id) pr
ON po.profile_id = pr.profile_id
LEFT JOIN (SELECT id, dir_country_name FROM 'instagram_locations.csv'
WHERE id IS NOT NULL
AND dir_country_name IS NOT NULL) loc
ON po.location_id = loc.id)
WHERE dir_country_name IS NOT NULL
GROUP BY dir_country_name
ORDER BY dir_country_name
""")
93.220900 seconds (166.60 k allocations: 8.578 MiB, 0.11% compilation time)
235×3 DataFrame
Row │ dir_country_name n_posts_mean number_comments_mean
│ String? Float64? Float64?
─────┼──────────────────────────────────────────────────────────
1 │ Afghanistan 445.58 8.10767
2 │ Albania 524.789 6.83736
3 │ Algeria 353.933 9.9979
4 │ American Samoa 117.692 7.87879
5 │ Andorra 486.206 7.01296
6 │ Angola 469.621 7.48282
7 │ Anguilla 1206.22 4.77193
8 │ Antarctica 1165.97 20.3073
9 │ Antigua 638.995 8.05999
10 │ Argentina 727.673 11.0821
11 │ Armenia 663.196 4.76728
⋮ │ ⋮ ⋮ ⋮
226 │ Uruguay 560.936 8.00079
227 │ Uzbekistan 445.975 8.84245
228 │ Vanuatu 539.349 4.63688
229 │ Vatican City 583.498 7.05353
230 │ Venezuela 551.79 9.4976
231 │ Vietnam 582.829 4.46671
232 │ Western Sahara 643.888 4.6676
233 │ Yemen 411.5 27.9416
234 │ Zambia 411.779 7.09239
235 │ Zimbabwe 518.17 8.3875
214 rows omitted
We see that we get the same result. The performance is worse. However, let
us split out the data ingest into RAM part to a separate process.
Here we will use a very nice feature of DuckDB that you can register a data
frame to be a table.
We start with Julia again:
julia> @time begin
posts2 = dropmissing(posts)
profiles2 = combine(groupby(dropmissing(profiles), :profile_id),
:n_posts => maximum => :n_posts)
locations2 = dropmissing(locations)
leftjoin!(posts2, profiles2, on="profile_id")
leftjoin!(posts2, locations2, on="location_id" => "id")
gdf = groupby(posts2, "dir_country_name", sort=true, skipmissing=true)
combine(gdf, [:n_posts, :number_comments] .=> mean∘skipmissing)
end;
5.632483 seconds (1.61 k allocations: 4.065 GiB, 2.94% gc time)
Now check DuckDB:
julia> DuckDB.register_data_frame(con, posts, "posts")
julia> DuckDB.register_data_frame(con, profiles, "profiles")
julia> DuckDB.register_data_frame(con, locations, "locations")
julia> @time DBInterface.execute(con,
"""
SELECT dir_country_name,
mean(n_posts) FILTER (WHERE n_posts IS NOT NULL) n_posts_mean,
mean(number_comments) FILTER (WHERE number_comments IS NOT NULL) number_comments_mean
FROM (SELECT po.number_comments, pr.n_posts, loc.dir_country_name
FROM (SELECT profile_id, location_id, number_comments
FROM posts
WHERE profile_id IS NOT NULL
AND location_id IS NOT NULL
AND number_comments IS NOT NULL
) po
LEFT JOIN (SELECT profile_id, max(n_posts) n_posts
FROM (SELECT profile_id, n_posts
FROM profiles
WHERE n_posts IS NOT NULL)
GROUP BY profile_id) pr
ON po.profile_id = pr.profile_id
LEFT JOIN (SELECT id, dir_country_name
FROM locations
WHERE id IS NOT NULL
AND dir_country_name IS NOT NULL) loc
ON po.location_id = loc.id)
GROUP BY dir_country_name
ORDER BY dir_country_name
""");
1.818248 seconds (616.46 k allocations: 24.675 MiB, 14.41% compilation time)
As you can see this time DuckDB is faster than Julia. Also we see that the
majority of original time was spent in reading the data from disk and that
the analysis part in both ecosystems is reasonably fast.
I have investigated into this further and the major conclusions for the
performance difference are:
For me the key takeaways from this test were:
Happy hacking with Julia and DuckDB!
Re-posted from: https://bkamins.github.io/julialang/2022/12/17/empty.html
When designing solutions for data analysis one is often faced with a tough
choice between doing the correct thing and the convenient thing.
One particular case of such a situation is computing summary statistics of
empty collections. The reason is that often such statistics are not properly
defined for empty data (so Julia throws an error), but data scientist instead
would want to get some flag value instead. Today I want to discuss typical
cases of such situations and possible solutions.
The post was written under Julia 1.8.2 and DataFrames.jl 1.4.4.
In this case the situation is least problematic. Typically you will get what
you expect (i.e. respectively zero or one of the domain of values you
aggregate):
julia> sum(Int[])
0
julia> prod(Float64[])
1.0
julia> sum(skipmissing(Union{Int, Missing}[missing, missing]))
0
julia> prod(skipmissing(Union{Float64, Missing}[missing, missing]))
1.0
The only case that is problematic is when you want to work with an empty
container of a too-wide element type:
julia> sum([])
ERROR: MethodError: no method matching zero(::Type{Any})
A standard solution, since both sum and prod are reductions is to provide an
initialization value manually in this case:
julia> sum([], init=0)
0
julia> prod([], init=1.0)
1.0
When computing minimum and maximum by default you get an error with empty
collections:
julia> minimum(Int[])
ERROR: MethodError: reducing over an empty collection is not allowed;
consider supplying `init` to the reducer
As you can see, we get and error are prompted to pass the init value for the
reduction, so the situation is less convenient.
Indeed, if such init value can be reasonably passed this is a good solution:
julia> minimum(Float64[], init=Inf)
Inf
julia> maximum(Int[], init=typemin(Int))
-9223372036854775808
or e.g. if we know we are working with values that must be in some range, we can
provide this range. A common case is with probabilities:
julia> minimum([], init=1.0)
1.0
julia> maximum([], init=0.0)
0.0
Sometimes, however, we might want to have a special signal value. In this case
you have two options. One is to check if the collection is empty, the other is
to catch exception:
julia> x = Int[]
Int64[]
julia> isempty(x) ? missing : minimum(x)
missing
julia> try
minimum(x)
catch e
isa(e, MethodError) ? missing : rethrow(e)
end
missing
You could wrap both solutions with a function for convenience, if you use them
often in your code. Their downside is that they add a bit of computational
overhead. The isempty check in some cases is not a O(1) operation. The most
common case is skipmissing. The try–catch approach introduces the cost of
handling of the exception.
In case of the extrema function the situation is analogous to minimum and
maximum. The only difference is that you pass two values to init if you
want to use this method. Here is an example assuming we are processing data
that are probabilities:
julia> extrema(Float64[], init=(1.0, 0.0))
(1.0, 0.0)
Note that in this case minimum is greater than the maximum, so we can
immediately see that the passed collection was empty.
When computing mean, var, or std, we get NaN when working with an empty
collection:
julia> using Statistics
julia> mean(Int[])
NaN
julia> var(Float64[])
NaN
julia> std(Float64[])
NaN
This is expected, as we are performing division by zero in their computation.
Also, similarly to sum and prod, when the collection has a too-wide element
type we get an error:
julia> mean([])
ERROR: MethodError: no method matching zero(::Type{Any})
If we do not like this default behavior and want to handle an empty collection
in a special case checking if the container is empty is a standard solution:
julia> x = Float64[]
Float64[]
julia> isempty(x) ? missing : var(x)
missing
Computing quantiles, and median in particular, is the least convenient case
as for them we always get an error and cannot use init value (as they are not
reductions):
julia> median(Int[])
ERROR: ArgumentError: median of an empty array is undefined, Int64[]
julia> quantile(Float64[], 0.1)
ERROR: ArgumentError: empty data vector
Here, currently, the only solution is to either check if the collection is empty
or catch the exception:
julia> x = Float64[]
Float64[]
julia> isempty(x) ? missing : median(x)
missing
julia> try
quantile(x, 0.1)
catch e
isa(e, ArgumentError) ? missing : rethrow(e)
end
missing
My post today was meant to be a quick reference for Julia users who sometimes
hit these issues when working with their data. My experience is that the most
common scenario of this kind is connected with missing data. Here is a typical
problematic case:
julia> using DataFrames
julia> using Random
julia> Random.seed!(1234);
julia> df = DataFrame(id=rand(1:10^6, 10^6),
value=rand([1:10; missing], 10^6))
1000000×2 DataFrame
Row │ id value
│ Int64 Int64?
─────────┼─────────────────
1 │ 325977 4
2 │ 549052 9
3 │ 218587 9
4 │ 894246 8
5 │ 353112 1
6 │ 394256 10
7 │ 953125 missing
8 │ 795547 5
9 │ 494250 1
⋮ │ ⋮ ⋮
999993 │ 967428 9
999994 │ 557085 1
999995 │ 353965 5
999996 │ 590548 10
999997 │ 657727 2
999998 │ 928733 3
999999 │ 884126 missing
1000000 │ 587503 2
999983 rows omitted
I on purpose generated the data in a way that has quite a few missing values:
julia> combine(groupby(df, :value), proprow)
11×2 DataFrame
Row │ value proprow
│ Int64? Float64
─────┼───────────────────
1 │ 1 0.091109
2 │ 2 0.091387
3 │ 3 0.091394
4 │ 4 0.090954
5 │ 5 0.090504
6 │ 6 0.091412
7 │ 7 0.090809
8 │ 8 0.090844
9 │ 9 0.090254
10 │ 10 0.090795
11 │ missing 0.090538
Now notice that some groups will only have missing values (in the output below
group with :id equal to 12 in row 7 is such a case):
julia> combine(groupby(df, :id),
:value => (x -> mean(ismissing, x)) => :propmissing)
632166×2 DataFrame
Row │ id propmissing
│ Int64 Float64
────────┼──────────────────────
1 │ 2 0.0
2 │ 3 0.25
3 │ 4 0.0
4 │ 6 0.0
5 │ 8 0.0
6 │ 9 0.333333
7 │ 12 1.0
8 │ 15 0.5
9 │ 16 0.0
⋮ │ ⋮ ⋮
632159 │ 999990 0.0
632160 │ 999991 0.0
632161 │ 999992 0.0
632162 │ 999993 0.0
632163 │ 999994 0.0
632164 │ 999996 0.0
632165 │ 999997 0.0
632166 │ 1000000 0.0
632149 rows omitted
If we now try to compute e.g. median value per group while skipping missing
we fail:
julia> combine(groupby(df, :id), :value => median∘skipmissing)
ERROR: ArgumentError: median of an empty array is undefined, Int64[]
The solution, as we discussed in this post is to handle the case of an empty
collection in a special way. I typically prefer the isempty check.
So first define a helper function:
julia> withempty(f, default) = x -> isempty(x) ? default : f(x)
withempty (generic function with 1 method)
and now we can write:
julia> combine(groupby(df, :id),
:value => withempty(median, missing)∘skipmissing)
632166×2 DataFrame
Row │ id value_function_skipmissing
│ Int64 Union{Missing, Float64}
────────┼─────────────────────────────────────
1 │ 2 3.5
2 │ 3 5.0
3 │ 4 8.0
4 │ 6 4.0
5 │ 8 1.0
6 │ 9 9.0
7 │ 12 missing
8 │ 15 3.0
9 │ 16 7.0
⋮ │ ⋮ ⋮
632159 │ 999990 4.0
632160 │ 999991 3.0
632161 │ 999992 8.0
632162 │ 999993 2.0
632163 │ 999994 7.0
632164 │ 999996 4.0
632165 │ 999997 7.0
632166 │ 1000000 7.0
632149 rows omitted
As you can see we get missing in row 7 for group with :id value equal to
12 as expected.
If you have some thoughts about pros and cons of the approaches I discussed
today please check out this issue and comment there. Thank you!