Author Archives: Blog by Bogumił Kamiński

How to get head or tail of a data frame in DataFrames.jl?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/06/18/first.html

Introduction

This post is meant to help people who know R or Python
and are starting to use DataFrames.jl.

The point is that DataFrames.jl does not define head and tail functions,
but rather first and last. In other words, this post is for those
of you who have searched for "DataFrames.jl head" and landed on this page.

All code was tested under Julia 1.6.1 and DataFrames.jl 1.1.1.

What Julia Base offers

In Julia Base, if you have a collection, e.g. a vector, you can use the
first and last functions to get its head/tail as follows:

julia> x = [1, 2, 3, 4, 5]
5-element Vector{Int64}:
 1
 2
 3
 4
 5

julia> first(x)
1

julia> last(x)
5

julia> first(x, 2)
2-element Vector{Int64}:
 1
 2

julia> last(x, 2)
2-element Vector{Int64}:
 4
 5

As you can see, these functions work in two modes:

  • if you pass just a collection, then its first/last element is
    returned;
  • if you pass a collection and a non-negative integer, then you get a collection
    holding the requested number of elements from the head/tail of the source
    collection.

Additionally, the first and last functions have a nice feature: if you
pass them an integer larger than the length of the collection, they do not
fail but return all the elements that are available, e.g.:

julia> first(x, 10)
5-element Vector{Int64}:
 1
 2
 3
 4
 5

This is a very convenient feature, especially when working with data, where you
do not know for sure that you will have enough observations. Consider a simple
scenario: we want to pick the top three products per product category, and some
product categories might have fewer than three entries (we will soon go back to
this example when we switch to DataFrames.jl).

The point is that with ordinary indexing if you request more elements than the
collection can hold you get:

julia> x[1:10]
ERROR: BoundsError: attempt to access 5-element Vector{Int64} at index [1:10]

so you have to write something like:

julia> x[1:min(10, end)]
5-element Vector{Int64}:
 1
 2
 3
 4
 5

which, while still nice, is not so easy to reason about. A more advanced issue
with indexing that first and last resolve is that not all collections are
1-based, so you have to be careful if you write generic code.
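
To make the comparison concrete, here is a minimal sketch that checks that first and last agree with clamped indexing written generically using firstindex and lastindex. The helper names head_by_index and tail_by_index are hypothetical and introduced only for this illustration; they are not part of Julia Base.

```julia
x = [1, 2, 3, 4, 5]

# Hypothetical helpers: clamped indexing that does not assume 1-based indices.
head_by_index(v, n) = v[firstindex(v):min(firstindex(v) + n - 1, lastindex(v))]
tail_by_index(v, n) = v[max(lastindex(v) - n + 1, firstindex(v)):lastindex(v)]

head_ok = first(x, 10) == head_by_index(x, 10)   # both return all five elements
tail_ok = last(x, 2) == tail_by_index(x, 2)      # both return [4, 5]
empty_ok = isempty(first(x, 0))                  # asking for zero elements is allowed
```

The index arithmetic in the helpers is exactly the kind of detail that first and last let you avoid.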

How DataFrames.jl works

Since getting the head/tail of a data frame is essentially the operation that
the first and last functions from Julia Base offer, these functions have
special methods defined in DataFrames.jl (and thus there is no need to introduce
new head and tail functions).

Let us see this at work:

julia> using DataFrames

julia> df = DataFrame(group=[1, 1, 1, 1, 2, 2], id=1:6)
6×2 DataFrame
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     1      4
   5 │     2      5
   6 │     2      6

julia> first(df)
DataFrameRow
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1

julia> last(df)
DataFrameRow
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   6 │     2      6

julia> first(df, 2)
2×2 DataFrame
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2

julia> last(df, 2)
2×2 DataFrame
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     2      5
   2 │     2      6

As you can see all is consistent with Julia Base. If you just pass a single
data frame as an argument to first/last you get a DataFrameRow. If you
additionally pass an integer you get a DataFrame.
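
The difference between the two return types can be checked programmatically. The sketch below reuses the df defined above; the variable names are chosen only for this illustration.

```julia
using DataFrames

df = DataFrame(group=[1, 1, 1, 1, 2, 2], id=1:6)

row = first(df)          # a single row: DataFrameRow
one_row = first(df, 1)   # a data frame that happens to have one row

row_is_row = row isa DataFrameRow
one_is_df = one_row isa DataFrame
all_rows = nrow(first(df, 10))   # asking for more rows than exist is not an error
```

In particular, first(df, 1) and first(df) are not interchangeable: the former is a table, the latter a row.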

Let us now show why the behavior “take at most n elements” (instead of “take
exactly n elements”) is useful. Suppose we want to pick the top 3 ids
per group. This is easily achievable as follows (we assume that the table was
already sorted in some meaningful way):

julia> combine(groupby(df, :group), sdf -> first(sdf, 3))
5×2 DataFrame
 Row │ group  id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     1      2
   3 │     1      3
   4 │     2      5
   5 │     2      6

As you can see, the operation worked smoothly, picking the top three rows for group
1, but only two for group 2 (as this group held only two rows). This is the
behavior that I think is most convenient when following the split-apply-combine
strategy.
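
The same pattern works with last. Here is a minimal sketch, on the same df, that picks the bottom three rows per group using the do-block form of combine; the variable name bottom is chosen only for this example.

```julia
using DataFrames

df = DataFrame(group=[1, 1, 1, 1, 2, 2], id=1:6)

# For each group keep at most the last three rows.
bottom = combine(groupby(df, :group)) do sdf
    last(sdf, 3)
end

bottom_ids = sort(bottom.id)   # ids 2, 3, 4 from group 1 and 5, 6 from group 2
```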

Conclusions

This post is aimed mostly at people starting to work with DataFrames.jl.
There are two messages I wanted to convey:

  • a direct one: use first and last functions if you want to get a head/tail
    of your data frame;
  • a more general one: DataFrames.jl is designed to follow the API provided by
    Julia Base. So if some functions exist there (like first and last) you can
    in general expect that these functions will have methods defined also in
    DataFrames.jl that work consistently (of course provided that some operation
    makes sense in this context).
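
As a small illustration of the second point, the sketch below calls a few other functions from Julia Base on the data frame used earlier. It only shows that such methods exist; it is not an exhaustive list.

```julia
using DataFrames

df = DataFrame(group=[1, 1, 1, 1, 2, 2], id=1:6)

dims = size(df)                    # (rows, columns), as for a matrix
big = filter(:id => >(4), df)      # filter with a column => predicate pair
groups = unique(df.group)          # columns are ordinary vectors
rev = sort(df, :id, rev=true)      # sort by a column, largest id first
```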

How much do collections of allocated objects cost?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/06/11/vecvec.html

Introduction

In my recent comment on StackOverflow I said that using a vector of
vectors: a) is slow, b) uses more memory than needed, and c) puts much more
stress on garbage collection. Having read it, one of my students asked me to
expand on this issue. In this post I want to give you some examples
designed to clarify it.

The post was tested under Julia 1.6.1 on Linux with a machine having 16 GB of
RAM (the last point affects the frequency of triggering GC).

In the post we will compare the same operations for vector of vectors and
vector of tuples defined using the following functions:

test1(n) = [rand(2) for _ in 1:n]
test2(n) = [(rand(), rand()) for _ in 1:n]

The difference is that test1 creates a vector of references to dynamically
allocated objects, while in test2 the tuples are stored directly in the vector.
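
The storage difference can be inspected directly. The following is a small sketch using isbitstype and sizeof from Julia Base on tiny inputs; the variable names are chosen only for this illustration.

```julia
test1(n) = [rand(2) for _ in 1:n]
test2(n) = [(rand(), rand()) for _ in 1:n]

t1 = test1(3)
t2 = test2(3)

# Tuples of Float64 are plain bits, so they can be stored inline in the vector.
inline_tuples = isbitstype(eltype(t2))    # eltype is Tuple{Float64, Float64}
# Vectors are mutable heap objects, so the outer vector stores references.
inline_vectors = isbitstype(eltype(t1))   # eltype is Vector{Float64}

bytes_t2 = sizeof(t2)   # 3 elements * 2 Float64 values * 8 bytes each
```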

Let us test all three claims. In the comparisons I use a fresh session in
each code block (as the examples given use a lot of memory). This means that
the timings will include compilation time, but the size of the computations is
large enough so that this is relatively negligible.

Examples

Start with a vector of vectors:

julia> test1(n) = [rand(2) for _ in 1:n]
test1 (generic function with 1 method)

julia> @time t1 = test1(10^8);
 21.907014 seconds (100.00 M allocations: 9.686 GiB, 67.93% gc time)

julia> @time sum(x -> x[1], t1);
  0.768779 seconds (179.18 k allocations: 11.105 MiB, 7.44% compilation time)

julia> @time Base.summarysize(test1(10^7))
262.832256 seconds (50.06 M allocations: 2.864 GiB, 4.58% gc time, 0.02% compilation time)
640000040

julia> @time GC.gc()
  4.631474 seconds (100.00% gc time)

julia> @time GC.gc()
  2.647404 seconds (100.00% gc time)

julia> @time GC.gc(false)
  0.004821 seconds (99.89% gc time)

julia> @time GC.gc(false)
  0.005526 seconds (99.89% gc time)

And now vector of tuples (fresh Julia session):

julia> test2(n) = [(rand(), rand()) for _ in 1:n]
test2 (generic function with 1 method)

julia> @time t2 = test2(10^8);
  1.458052 seconds (25 allocations: 1.490 GiB, 0.36% gc time)

julia> @time sum(x -> x[1], t2);
  0.164072 seconds (178.23 k allocations: 11.036 MiB, 35.07% compilation time)

julia> @time Base.summarysize(test2(10^7))
  0.190496 seconds (24.27 k allocations: 153.937 MiB, 3.25% gc time, 17.60% compilation time)
160000040

julia> @time GC.gc()
  0.070696 seconds (99.99% gc time)

julia> @time GC.gc()
  0.102222 seconds (99.99% gc time)

julia> @time GC.gc(false)
  0.000517 seconds (99.13% gc time)

julia> @time GC.gc(false)
  0.000523 seconds (98.97% gc time)

As you can see:

  • creation of a vector of vectors is much slower; in particular, a lot of small
    allocations happen (which is expensive) and in total more memory is also
    allocated;
  • a simple aggregation with sum is also slower, because for a vector of vectors
    we have to follow references to objects (which takes time), which also
    means that it is less CPU cache friendly;
  • with Base.summarysize we can check that a vector of vectors also uses much more
    memory; as a side issue, we learn that functions like Base.summarysize,
    which traverse the tree of object references, are much, much slower for a
    vector of vectors;
  • finally, both full sweep GC.gc() and incremental sweep GC.gc(false) are
    slower with a vector of vectors; this is especially visible in the full sweep case
    (fortunately it is triggered less often in normal usage). The important thing
    to note here is that for garbage collection time to be affected it is enough
    that the vector of vectors is somewhere in memory; it does not have to be
    used in the operation you perform.
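
If you want to see the same effects without allocating gigabytes, the comparison can be repeated on a much smaller input. This is only a sketch: the size 10^4 is an arbitrary choice for illustration, and the exact numbers printed depend on the Julia version and the machine.

```julia
test1(n) = [rand(2) for _ in 1:n]
test2(n) = [(rand(), rand()) for _ in 1:n]

# Warm up so that compilation does not distort the measurements.
test1(10); test2(10)

n = 10^4   # arbitrary small size, chosen only to keep the example fast
alloc1 = @allocated test1(n)   # bytes allocated: one small array per element
alloc2 = @allocated test2(n)   # bytes allocated: essentially one big buffer

size1 = Base.summarysize(test1(n))
size2 = Base.summarysize(test2(n))

println("allocated:   ", alloc1, " vs ", alloc2, " bytes")
println("summarysize: ", size1, " vs ", size2, " bytes")
```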

Conclusions

The conclusion is something that seasoned Julia developers know very well:
avoid having many small allocated objects in your Julia programs. Having read
this post, I hope you now have a better understanding of which aspects of the
performance of your code can be affected when you have to use such data.
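
If you already have a vector of small vectors, one possible way out is to convert it to a layout that stores the numbers inline. The sketch below shows two common options, a vector of tuples and a matrix; it assumes every inner vector has exactly two elements, and the variable names are chosen only for this example.

```julia
# Small input so the example runs instantly.
vv = [rand(2) for _ in 1:1000]    # vector of vectors

# Option 1: a vector of tuples (elements stored inline).
vt = [(v[1], v[2]) for v in vv]

# Option 2: a 2×n matrix, one column per original inner vector.
m = reduce(hcat, vv)

# The aggregation from the examples gives the same result in every layout.
s_vv = sum(v -> v[1], vv)
s_vt = sum(t -> t[1], vt)
s_m = sum(@view m[1, :])
```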

Before I finish, let me add that collections of any mutable objects will be
affected by the same issue, and even some special immutable objects (like
String, or to some extent e.g. Symbol) can cause issues like those presented
in this post.

DataFrames.jl joins: matchmissing=:notequal

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/06/05/notequal.html

Introduction

In DataFrames.jl we have recently added in this PR a new
option for the matchmissing keyword argument in joins. This functionality will be
made available in the 1.2 release. In this post I want to discuss this new feature
before we release it.

The post was tested under Julia 1.6.1 and the DataFrames.jl main branch
(which includes the relevant PR).

How the matchmissing keyword argument works

The matchmissing keyword argument allows the user to decide how missing
values in the on columns are handled in joins. After this PR you have
three options to choose from:

  • :error (the default): throw an error if a missing value is present in any of
    the on columns; the rationale is that missing indicates an unknown value, so
    if we knew it, it could match any of the non-missing values in the on
    columns of the other data frame we join;
  • :equal: missing values are allowed and they are matched to missing values
    only; in this scenario we treat missing as any other value without giving
    it a special treatment;
  • :notequal (a new option): in this case missing is considered to be not
    equal to any other value (including missing).
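
These options can be related to how Julia Base itself compares missing. The sketch below shows the two built-in comparison behaviors: == propagates the uncertainty (the reasoning behind :error), while isequal treats missing as an ordinary value (as :equal does). This is an informal analogy, not a description of how the joins are implemented, and :notequal has no direct counterpart among these two functions.

```julia
# Three-valued logic: comparing with an unknown value gives an unknown result.
eq_missing = missing == missing            # missing, not true
eq_value = 1 == missing                    # also missing

# isequal treats missing as an ordinary value that is equal only to itself.
iseq_missing = isequal(missing, missing)   # true
iseq_value = isequal(1, missing)           # false
```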

Let me comment a bit more on the consequences of the :notequal rule. In
innerjoin this means that rows with missing values will be dropped from both the
left and the right table. In leftjoin, semijoin and antijoin they are dropped
from the right table only (which means that if missing is present in the left
table it is retained in processing but considered not to match any row in the right
table). Similarly, in rightjoin rows with missing are dropped from the left table
only. The case that is most difficult to handle is outerjoin. The reason is
that if missing were present in both the left and the right table, the two
occurrences would be considered not equal and would produce separate rows in the
output table. We considered this behavior potentially confusing and therefore
decided not to allow :notequal in outerjoin.
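
The innerjoin rule described above can be emulated by hand, which may help to build intuition. The following sketch drops the rows with missing in the on column from both tables using dropmissing and then performs an ordinary inner join; it illustrates the semantics only and is not how the new option is implemented. The two small tables are made up for this example.

```julia
using DataFrames

df1 = DataFrame(id=[1, missing, 3, 4], x=1:4)
df2 = DataFrame(id=[1, 2, missing, 4], y=1:4)

# Drop rows with missing in the on column from both tables, then join as usual.
manual = innerjoin(dropmissing(df1, :id), dropmissing(df2, :id), on=:id)

matched_ids = sort(manual.id)   # only ids present in both tables remain
```

One difference to keep in mind is that dropmissing by default also narrows the element type of the id column so that it no longer allows missing.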

Let me move to the examples showing matchmissing=:notequal at work.

Examples

Here is a simple example code showing how the new option works:

julia> using DataFrames

julia> df1 = DataFrame(id=[1, missing, 3, 4], x=1:4)
4×2 DataFrame
 Row │ id       x
     │ Int64?   Int64
─────┼────────────────
   1 │       1      1
   2 │ missing      2
   3 │       3      3
   4 │       4      4

julia> df2 = DataFrame(id=[1, 2, missing, 4], y=1:4)
4×2 DataFrame
 Row │ id       y
     │ Int64?   Int64
─────┼────────────────
   1 │       1      1
   2 │       2      2
   3 │ missing      3
   4 │       4      4

Now we investigate all the possible join operations:

julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
2×3 DataFrame
 Row │ id      x      y
     │ Int64?  Int64  Int64
─────┼──────────────────────
   1 │      1      1      1
   2 │      4      4      4

As you can see for innerjoin only rows with :id equal to 1 and 4 were
retained. Let us move forward:

julia> leftjoin(df1, df2, on=:id, matchmissing=:notequal, source=:source)
4×4 DataFrame
 Row │ id       x      y        source
     │ Int64?   Int64  Int64?   String
─────┼────────────────────────────────────
   1 │       1      1        1  both
   2 │       4      4        4  both
   3 │ missing      2  missing  left_only
   4 │       3      3  missing  left_only

julia> rightjoin(df1, df2, on=:id, matchmissing=:notequal, source=:source)
4×4 DataFrame
 Row │ id       x        y      source
     │ Int64?   Int64?   Int64  String
─────┼─────────────────────────────────────
   1 │       1        1      1  both
   2 │       4        4      4  both
   3 │       2  missing      2  right_only
   4 │ missing  missing      3  right_only

For leftjoin and rightjoin we retain missing but only in the table for
which all rows must be retained. Therefore in leftjoin for :id equal to
missing we have :x equal to 2, but :y equal to missing (signaling that
there was no match which we can also see in :source column). The same
happens for :id equal to missing in rightjoin, but then :x is set to
missing.

The same rules work with semijoin and antijoin as you can see here:

julia> semijoin(df1, df2, on=:id, matchmissing=:notequal)
2×2 DataFrame
 Row │ id      x
     │ Int64?  Int64
─────┼───────────────
   1 │      1      1
   2 │      4      4

julia> antijoin(df1, df2, on=:id, matchmissing=:notequal)
2×2 DataFrame
 Row │ id       x
     │ Int64?   Int64
─────┼────────────────
   1 │ missing      2
   2 │       3      3

Finally outerjoin just throws an error:

julia> outerjoin(df1, df2, on=:id, matchmissing=:notequal)
ERROR: ArgumentError: matchmissing == :notequal for `outerjoin` is not allowed
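
For contrast, here is a sketch of what outerjoin does on the same tables with the :equal option, which is allowed: the two missing ids are matched to each other, so no confusing duplicate rows with missing appear. The variable names are chosen only for this illustration.

```julia
using DataFrames

df1 = DataFrame(id=[1, missing, 3, 4], x=1:4)
df2 = DataFrame(id=[1, 2, missing, 4], y=1:4)

# With :equal the missing id in df1 is matched to the missing id in df2.
res = outerjoin(df1, df2, on=:id, matchmissing=:equal, source=:source)

n_rows = nrow(res)                        # ids 1, 4, missing, 3 and 2
n_both = count(==("both"), res.source)    # ids 1, 4 and missing are in both tables
```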

Conclusions

I hope this post helped you to learn the rationale and design of the new option
for the matchmissing keyword argument in joins. If you have any comments on
the functionality or its documentation, please open an issue on the DataFrames.jl GitHub repository.

Finally I would like to thank pstorozenko, nilshg, and
nalimilan for working on this functionality.