Author Archives: Blog by Bogumił Kamiński

How to make your joins faster in DataFrames.jl?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/07/30/joins.html

Introduction

Joining tables is one of the fundamental operations when wrangling data.
Therefore people using DataFrames.jl would understandably want them
to be as fast as possible.

In this post I will show you some tricks that can make joins faster in case
that is currently a bottleneck, which is when you have many Strings in your
data. This scenario is relevant as it is quite common in data science workflows.

This post was written under Julia 1.6.1, DataFrames.jl 1.2.1, BenchmarkTools.jl
1.1.1, and WeakRefStrings.jl 1.1.0. I am running the tests on a laptop
under Linux and with 16GB of RAM using a single thread.

In the timings below I report the memory footprint of julia process in
respective scenarios. This is relevant, as when we would be very close to
available memory limit Julia might struggle with memory management. I have
chosen the size of the tests that they easily fit into memory (but of course
they are large enough to cause issues).

The baseline

In the baseline scenario I join tables that do not contain any strings.

Here is my benchmark:

julia> using DataFrames

julia> using BenchmarkTools

julia> df1 = DataFrame(id=1:5*10^7, left=1:5*10^7)
50000000×2 DataFrame
      Row │ id        left
          │ Int64     Int64
──────────┼────────────────────
        1 │        1         1
        2 │        2         2
    ⋮     │    ⋮         ⋮
 49999999 │ 49999999  49999999
 50000000 │ 50000000  50000000
          49999996 rows omitted

julia> df2 = DataFrame(id=1:5*10^7, right=1:5*10^7)
50000000×2 DataFrame
      Row │ id        right
          │ Int64     Int64
──────────┼────────────────────
        1 │        1         1
        2 │        2         2
    ⋮     │    ⋮         ⋮
 49999999 │ 49999999  49999999
 50000000 │ 50000000  50000000
          49999996 rows omitted

julia> GC.gc() # julia process memory footprint: 1.7GB

julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 28 samples with 1 evaluation.
 Range (min … max):  1.963 s …   2.255 s  ┊ GC (min … max): 0.02% … 9.21%
 Time  (median):     2.190 s              ┊ GC (median):    9.49%
 Time  (mean ± σ):   2.188 s ± 47.369 ms  ┊ GC (mean ± σ):  9.12% ± 1.79%

                                             █ ▆
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁███▅▄▁▁▁▁▁▁▁▁▅ ▁
  1.96 s         Histogram: frequency by time        2.26 s <

 Memory estimate: 1.86 GiB, allocs estimate: 198.

As you can see the baseline timing is around 2 seconds. The GC
(garbage collector) gets triggered in the benchmarks but it does not take more
than 10% of total run time.

This gives us a baseline for our tests.

The issue of many small allocated objects

Now consider that :left column in df1 and :right column in df2 are
instead containing elements of type String. Note that this will not affect
the joining process as I do not touch the column on which we perform the join.

Here is what we get:

julia> df1.left = string.(df1.left);

julia> df2.right = string.(df2.right);

julia> GC.gc() # julia process memory footprint: 4.7GB

julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 9 samples with 1 evaluation.
 Range (min … max):  4.670 s … 8.088 s  ┊ GC (min … max): 54.57% … 73.43%
 Time  (median):     7.764 s            ┊ GC (median):    73.47%
 Time  (mean ± σ):   7.457 s ± 1.051 s  ┊ GC (mean ± σ):  72.17% ±  6.31%

                                                   █
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▄ ▁
  4.67 s        Histogram: frequency by time        890 s <

 Memory estimate: 1.86 GiB, allocs estimate: 198.

As you can see now we are roughly 4 times slower on the average. In this case GC
typically takes around a bit over 70% of total run time of join operation. We
observe this issue although the memory footprint of julia process is 4.7GB,
which is still well below the memory I have available on my machine.

The issue is that GC, if triggered, takes very long time because it has to
traverse very many small allocated objects (Strings in our case).

We try inlining our strings

Being aware of the issue @quinnj has implemented a set of InlineString*
types in the WeakRefStrings.jl package (and soon CSV.jl will use
them by default).

Let us check if using them resolves our issues:

julia> using WeakRefStrings

julia> df1.left = InlineString15.(df1.left);

julia> df2.right = InlineString15.(df2.right);

julia> GC.gc()  # julia process memory footprint: 2.5GB

julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 24 samples with 1 evaluation.
 Range (min … max):  2.258 s …   2.544 s  ┊ GC (min … max):  0.03% … 10.36%
 Time  (median):     2.534 s              ┊ GC (median):    10.27%
 Time  (mean ± σ):   2.521 s ± 56.830 ms  ┊ GC (mean ± σ):   9.83% ±  2.11%

                                                       ▁▃█▃
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁████ ▁
  2.26 s         Histogram: frequency by time        2.54 s <

 Memory estimate: 2.61 GiB, allocs estimate: 198.

Wow! We are almost back to the timings with the integer columns we had
originally. The memory footprint and the timings are a bit bigger than in the
baseline scenario because the width of the strings in our case requires us to
use InlineString15 type.

Does using Symbol solve the issue?

One of the alternative practices that people use to work around the issues with
Strings is to use Symbols instead. Since Symbols are interned they are
faster for certain operations. This is not a free lunch though as, in particular
the memory used by them cannot be reclaimed when they are no longer referenced to.

Let us check if using Symbol instead of String helps in our case:

julia> df1.left = Symbol.(df1.left);

julia> df2.right = Symbol.(df2.right);

julia> GC.gc() # julia process memory footprint: 4.0GB

julia> @benchmark innerjoin($df1, $df2, on=:id) seconds=60
BenchmarkTools.Trial: 16 samples with 1 evaluation.
 Range (min … max):  1.879 s …    4.831 s  ┊ GC (min … max):  0.04% … 59.90%
 Time  (median):     3.940 s               ┊ GC (median):    51.65%
 Time  (mean ± σ):   3.868 s ± 574.918 ms  ┊ GC (mean ± σ):  50.72% ± 13.20%

                                          █
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁
  1.88 s         Histogram: frequency by time         4.83 s <

 Memory estimate: 1.86 GiB, allocs estimate: 198.

As you can see the benchmarks are better than for String. However, still
GC time is typically taking 50% of execution time. This shows that although
Symbols are interned they are tracked by the GC (which in theory could be
avoided since we know that the memory they occupy cannot be reclaimed).

Conclusions

Here is a summary of what we have learned:

  • using String type can lead to significant issues with GC;
  • using Symbol instead is somewhat better, but does not resolve the GC issues
    in full;
  • using InlineString15 gave very good results.

One just has to remember that InlineString* types are limited and can hold
small strings only. Fortunately in typical scenarios when we have a lot of
strings they are not super long.

Also note that I use String type as an example in our tests because it is
common in data science workflows. However, one would run to similar issues if
one would use many objects that have to be tracked by GC. In fact it does not
even matter if they would be processed by e.g. join operation that we used in
our examples. What matters is that they reside in memory and thus need to be
tracked by GC.

Let me also highlight that the issues would be even more visible if I used
multiple threads. The reason is that currently the GC in Julia does not handle
this scenario as efficiently as it potentially could. This is an issue that is
planned to be resolved by the Julia core devs, as we could learn during
this talk given at JuliaCon2021.

I believe in the future it will be possible to make everything to be efficient
out-of-the-box. However, till the issues with GC when many small objects need
to be tracked are present it is good to know how they can be resolved.

Basic DataFrames.jl: getting the data

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/07/23/getdata.html

Introduction

During JuliaCon 2021 I got some reqests to show some more basic
functionalities of DataFrames.jl. I started writing such a post.
It seems it ended up not so basic, but I hope it is useful.

What we will want to do is to create a plot of PLN/USD buy and sell exchange
rates for the year 2020. The key challenge is getting the data. I fetch it from
a Web API provided by NBP (Narodowy Bank Polski).

Getting ready

Before writing our solutoin let us play a bit with the NBP Web API.

Here is a sample a query that you can submit to get information about bid
and ask rates of PLN/USD on 2020-01-03:

https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-03/?format=json

(you can copy-paste it into your browser to see the rendered result)

We have to be careful though as sometimes the prices are not available, and
we get the dreaded 404 error. It happens e.g. for 2020-01-01:

https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-01/?format=json

Before we start let us learn how to handle such queries in Julia.

The first step is getting the query result as a string:

julia> using HTTP

julia> q = "https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-03/?format=json"
"https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-03/?format=json"

julia> r = HTTP.get(q)
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Date: Fri, 23 Jul 2021 17:04:48 GMT
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 147
Content-Type: application/json; charset=utf-8
Expires: -1
ETag: "ahlzfBWI/hBaFs4mrcmnRvKkGdL1gH6hS2LIHIfppJI="
Set-Cookie: ee3la5eizeiY4Eix=ud5ahSho; path=/

{"table":"C","currency":"dolar amerykański","code":"USD","rates":[{"no":"002/C/NBP/2020","effectiveDate":"2020-01-03","bid":3.7667,"ask":3.8427}]}"""

julia> s = String(r.body)
"{\"table\":\"C\",\"currency\":\"dolar amerykański\",\"code\":\"USD\",\"rates\":[{\"no\":\"002/C/NBP/2020\",\"effectiveDate\":\"2020-01-03\",\"bid\":3.7667,\"ask\":3.8427}]}"

We can see that the result of the request is a JSON file, so let us parse it and inspect a bit:

julia> using JSON3

julia> j = JSON3.read(s)
JSON3.Object{Base.CodeUnits{UInt8, String}, Vector{UInt64}} with 4 entries:
  :table    => "C"
  :currency => "dolar amerykański"
  :code     => "USD"
  :rates    => JSON3.Object[{…

julia> j.code
"USD"

julia> j.rates
1-element JSON3.Array{JSON3.Object, Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}:
 {
              "no": "002/C/NBP/2020",
   "effectiveDate": "2020-01-03",
             "bid": 3.7667,
             "ask": 3.8427
}

julia> j.rates[1].effectiveDate
"2020-01-03"

julia> j.rates[1].bid
3.7667

julia> j.rates[1].ask
3.8427

Now we know enough about the structure of the response to write a full solution.

Before moving forward let us just check what happens if we try to fetch data
in a day for which there are no quotations:

julia> q = "https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-01/?format=json"
"https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-01/?format=json"

julia> r = HTTP.get(q)
ERROR: HTTP.ExceptionRequest.StatusError(404, "GET", "/api/exchangerates/rates/c/usd/2020-01-01/?format=json", HTTP.Messages.Response:
"""
HTTP/1.1 404 Not Found
Date: Fri, 23 Jul 2021 17:09:07 GMT
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 38
Content-Type: text/plain; charset=utf-8
Expires: -1
Set-Cookie: ee3la5eizeiY4Eix=Naew5Ohp; path=/

404 NotFound - Not Found - Brak danych""")

We can see that in this case we get a 404 error thrown as
HTTP.ExceptionRequest.StatusError exception.

We are ready to write the solution we wanted.

Julia in action

First define a function that returns a NamedTuple with the date and bid and
ask rates. If the rates are not available it should return them as missing.

Here it goes:

julia> using Chain

julia> using Dates

julia> function get_rate(date::Date)
           q = "https://api.nbp.pl/api/exchangerates/rates/c/usd/$date/?format=json"

           try
               return @chain q begin
                   HTTP.get
                   String(_.body)
                   JSON3.read
                   @aside @assert _.code == "USD"
                   _.rates[1]
                   @aside @assert date == Date(_.effectiveDate)
                   (date=date, bid=_.bid, ask=_.ask)
               end
           catch e
               if e isa HTTP.ExceptionRequest.StatusError
                   return (date=date, bid=missing, ask=missing)
               else
                   rethrow(e)
               end
           end
       end
get_rate (generic function with 1 method)

Let us note a few things:

  1. In order to streamline processing I used @chain macro from the Chain.jl
    package (a lot of people are asking for more examples how it can be used in
    practice).
  2. In particular I used the @aside statement twice to perform validation of
    the result. Note that this is a useful way to inject into a chain of commands
    an operation whose result should not be stored but it is still useful to be
    performed.
  3. I had to use trycatch block to handle the 404 error. Note though, that
    if for some strange reason we got an exception of type other than
    HTTP.ExceptionRequest.StatusError I rethrow it to avoid covering such error
    (and many bad things can happen in the wild e.g. we could have
    run out of memory).

Let us test our function before moving forward:

julia> get_rate(Date("2020-01-03"))
(date = Date("2020-01-03"), bid = 3.7667, ask = 3.8427)

julia> get_rate(Date("2020-01-01"))
(date = Date("2020-01-01"), bid = missing, ask = missing)

I am sure everyone is waiting now to get to working with DataFrames.jl. Here
it goes, we collect the data from the whole year to a DataFrame using the
push! function:

julia> using DataFrames

julia> usd_pln = DataFrame()
0×0 DataFrame

julia> for date in Date("2020-01-01"):Day(1):Date("2020-12-31")
           push!(usd_pln, get_rate(date), promote=true)
       end

julia> usd_pln
366×3 DataFrame
 Row │ date        bid           ask
     │ Date        Float64?      Float64?
─────┼────────────────────────────────────────
   1 │ 2020-01-01  missing       missing
   2 │ 2020-01-02        3.7597        3.8357
   3 │ 2020-01-03        3.7667        3.8427
   4 │ 2020-01-04  missing       missing
   5 │ 2020-01-05  missing       missing
   6 │ 2020-01-06  missing       missing
   7 │ 2020-01-07        3.7679        3.8441
  ⋮  │     ⋮            ⋮             ⋮
 360 │ 2020-12-25  missing       missing
 361 │ 2020-12-26  missing       missing
 362 │ 2020-12-27  missing       missing
 363 │ 2020-12-28        3.6611        3.7351
 364 │ 2020-12-29  missing       missing
 365 │ 2020-12-30        3.6833        3.7577
 366 │ 2020-12-31        3.6714        3.7456
                              352 rows omitted

For me all worked cleanly. Note that I am using the promote=true keyword
argument, as some of the rows of :bid and :ask columns contain missing
values while other contain Float64 values.

We are ready to produce the plot we wanted:

julia> using Plots
julia> plot(usd_pln.date, [usd_pln.bid usd_pln.ask], label=["bid" "ask"],
            xlabel="date", ylabel="PLN/USD", xrotation=45);

And here is the plot:

PLN/USD bid and ask rates

Now feel free to perform any analyses that you find interesting on this data.

From my side let me just show you how to compute the fraction of days in which
the bid data is missing by day of the week.

julia> using Statistics

julia> @chain usd_pln begin
           transform(:date => ByRow(dayofweek))
           groupby(:date_dayofweek)
           combine(:bid => (x -> ismissing.(x) |> mean) => :fraction_missing)
       end
7×2 DataFrame
 Row │ date_dayofweek  fraction_missing
     │ Int64           Float64
─────┼──────────────────────────────────
   1 │              1         0.0769231
   2 │              2         0.0769231
   3 │              3         0.113208
   4 │              4         0.0754717
   5 │              5         0.0961538
   6 │              6         1.0
   7 │              7         1.0

As you probably might have expected on Saturdays and on Sundays we never
have any information on bid and ask rates.

Conclusion

I hope you enjoyed this post. I have chosen the topic because, at least in my
experience, these days fetching JSON data from the Internet is bread and butter
of virtually every data scientist, so I think it is well worth learning how to
do it.

DataFrames.jl at JuliaCon2021

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/07/16/juliacon2021.html

Introduction

JuliaCon 2021 schedule looks awesome.
The breadth and depth of topics covered is really impressive and I am sure
everyone will find many interesting talks to attend.

There are so many valuable presentations that it is super cool that everything
is recorded as otherwise I would not be able to attend every session I would
like to learn from.

The conference is divided into two blocks.
From July 20 to July 27 we have two 3-hour workshops each day. I think this
is a super nice to have them. Then from July 28 to July 30 we have talks,
posters, and BoF/Mini Track sessions.

DataFrames.jl during workshops

From the perspective of learning how to use
DataFrames.jl in your data science workflows I will give a
DataFrames.jl 1.2 tutorial on Tuesday, July 20, 2021.
The workshop is live session.

All the material that I will use in the tutorial is available on GitHub. If you want to follow my examples live I recommend you to download
it and read the prerequisite instructions before the workshop.

There are already many on-line materials that walk users though basic
functionality and syntax of DataFrames.jl. You can find the appropriate
links here. Therefore this year I have decided to prepare a tutorial
that shows how a simple data science project can be done end-to-end using
DataFrames.jl and other packages from the JuliaData and
JuliaStats ecosystem.

I would also drive your attention to the
Statistics with Julia from the ground up workshop that will be held
on Wednesday, July 21, 2021, where also DataFrames.jl will be featured.

Presentations about DataFrames.jl

I will give two DataFrames.jl related presentations during the conference.

The first one is on Thursday, July 29, 2021 and is about
Release management – lessons learned in JuliaData ecosystem.
In this presentation I discuss my experiences from serving as one of the
maintainers of DataFrames.jl. I hope that the talk will be useful both for
regular users as I discuss typical issues that people starting to use
DataFrames.jl face, and for package contributors since I share the package
maintenance practices we I have learned to be useful. This talk is pre-recorded.

The second talk on Friday, July 30, 2021 will present a summary about
the state and plans for DataFrames.jl
package and is not pre-recorded.

Conclusion

I am really excited about JuliaCon 2021 and cannot wait to
virtually meet you during the conference!