Author Archives: Blog by Bogumił Kamiński

Clip your data with ClipData.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/11/26/clip.html

Introduction

Occasionally I write posts about Julia tools that are not commonly
known, but are useful in practice. Today I want to talk about
the ClipData.jl package.

The post was written under Julia 1.6.3, DataFrames.jl 1.2.2, and
ClipData.jl 0.2.1.

What is ClipData.jl?

The package does one thing and does it well: it allows you to move
tabular data between your Julia session and the system clipboard both ways.

The two major use cases are:

  1. You have a table e.g. in Google Sheets, you copy it to the system clipboard,
    and want to interactively ingest it in the Julia session as a table
    (in my examples I will use DataFrame).
  2. You have a DataFrame in your Julia session and you want to copy it to the
    system clipboard so that you can later paste it e.g. into Google Sheets.

Many data scientists need to do both operations virtually every day, and
ClipData.jl comes to the rescue. This package is not only nice, but it also
has excellent visuals explaining how things work. Therefore, since they are
MIT licensed, I just link to the videos prepared by Peter Deffebach here.
Let us get to action.

First you need to know if your data has a header or not. If it has a
header we will work with a DataFrame, if it does not we will work with a
Matrix.

Clipping tables

To work with tabular data (having a header) use the cliptable function. To
copy data from the system clipboard and store it in a DataFrame called df
just write:

df = cliptable() |> DataFrame

On the other hand if you want to copy your df data frame to the system
clipboard use:

cliptable(df)

All this is very nicely presented in the following video (in particular notice
that column element types are automatically detected):

Clipping matrices

To work with arrays use the cliparray function. To copy data from
the system clipboard and store it in a Matrix called mat just write:

mat = cliparray()

On the other hand if you want to copy your mat matrix to the system clipboard
use:

cliparray(mat)

Here is a video showing the process:
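To get a feeling for what travels through the clipboard, here is a minimal sketch that uses only the DelimitedFiles standard library. It is not how ClipData.jl is implemented; it merely assumes that a spreadsheet puts copied cells on the clipboard as tab-separated text, and the clip_text string is a made-up stand-in for such contents:

```julia
using DelimitedFiles

# Hypothetical clipboard contents: a 2×3 block of cells copied from a
# spreadsheet, assumed to be tab-separated with one row per line.
clip_text = "1\t2\t3\n4\t5\t6\n"

# Parse the text into a Matrix{Int}; this only mimics the idea behind cliparray().
mat = readdlm(IOBuffer(clip_text), '\t', Int)

# Going the other way: render the matrix as tab-separated text ready for pasting.
out_text = sprint(io -> writedlm(io, mat, '\t'))
```

The round trip gives back the same text, which is the property that makes copying to and pasting from a spreadsheet work.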

Conclusions

There are several additional features that ClipData.jl provides (like handling
how table cells should be parsed). If you want to know more details please
refer to the ClipData.jl homepage.

I am sure you will find this little package quite useful in your data science
projects!

Welcome to DataFramesMeta.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/11/19/dfm.html

Introduction

If you start using Julia for data science you might get overwhelmed by the
number of available options and features. Today I want to write about the
DataFramesMeta.jl package that greatly simplifies one of the most
difficult parts of the DataFrames.jl package to learn, namely – performing
data transformations.

In this post I will omit all advanced features of both DataFramesMeta.jl
and DataFrames.jl and focus on simple issues to help you build a correct
mental model of how things should be used.

The post was written under Julia 1.6.3, DataFrames.jl 1.2.2, and
DataFramesMeta.jl 0.10.0.

Setting up the stage

Let us first load the required packages and create some simple data frame we
will want to work with:

julia> using DataFramesMeta

julia> using Statistics

julia> df = DataFrame(x=1:5, y=11:15)
5×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15

Notice that when we load DataFramesMeta.jl, DataFrames.jl is also automatically
loaded into your working environment. Additionally, I have loaded the Statistics
module, as we will soon use it in our examples.

Understanding data transformations

When you want to perform some transformation of your data the first thing you
need to answer is if you want to aggregate data or manipulate columns.

Data aggregation is a simple concept – I take a column as input and produce e.g.
its mean, which is a single aggregated value. In DataFrames.jl we call this
operation combine, as we are combining rows.

When I talk about column manipulation I mean operations in which we take a column
and produce output that is also a column with the same number of elements
as the source, e.g. I multiply the column by 2. In DataFrames.jl we call this
operation either select or transform. What is the difference between
select and transform? When you perform a select operation you keep in the
result only the results of the operations you performed. On the other hand,
when you transform a data frame you additionally keep all the columns from
the source data frame.

Let us now have a look at examples of these three operations. Start with
aggregation:

julia> @combine(df, :sum_x = sum(:x), :mean_y = mean(:y))
1×2 DataFrame
 Row │ sum_x  mean_y
     │ Int64  Float64
─────┼────────────────
   1 │    15     13.0

As you can see we used the combine word and prepended it with @ which
signals that this is a DataFramesMeta.jl operation. As a first argument in our
call we passed the source data frame. Next we specified the aggregations we
want to perform. Note that each aggregation is specified just as you would
write normal Julia code using variables. There is only one rule to learn. When
you prefix the variable name with : it means that you are referring to a
column of a data frame.
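For comparison, the same kind of aggregation can be sketched with plain DataFrames.jl, where each operation is written as a source => function => target pair. The output column names total_x and avg_y are arbitrary names picked for this illustration:

```julia
using DataFrames, Statistics

df = DataFrame(x=1:5, y=11:15)

# each pair reads: take this column, apply this function, store under this name
agg = combine(df, :x => sum => :total_x, :y => mean => :avg_y)
```

Here agg is a one-row data frame holding the sum of :x and the mean of :y.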

Now let us perform selection and transformation side by side to see the
difference:

julia> @select(df, :z = :x + :y)
5×1 DataFrame
 Row │ z
     │ Int64
─────┼───────
   1 │    12
   2 │    14
   3 │    16
   4 │    18
   5 │    20

julia> @transform(df, :z = :x + :y)
5×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11     12
   2 │     2     12     14
   3 │     3     13     16
   4 │     4     14     18
   5 │     5     15     20

As you can see both operations create a new column :z. The difference is that
@transform also keeps the :x and :y variables, while @select drops them.

Let us write another transformation:

julia> @transform(df, :z = :x * :y)
ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64})

This time the operation failed. Most Julia users know why. You cannot multiply a
vector by a vector – this is not a properly defined mathematical operation.
Instead you have to broadcast the multiplication operation like this (this is
often called a vectorized operation):

julia> @transform(df, :z = :x .* :y)
5×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11     11
   2 │     2     12     24
   3 │     3     13     39
   4 │     4     14     56
   5 │     5     15     75

In more complex scenarios adding the . for broadcasting can easily get
annoying, e.g.:

julia> @transform(df, :a = 2 .* :x, :b = :x .* :y .^ 2)
5×4 DataFrame
 Row │ x      y      a      b
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1     11      2    121
   2 │     2     12      4    288
   3 │     3     13      6    507
   4 │     4     14      8    784
   5 │     5     15     10   1125

On the other hand, practice shows that such broadcasted operations are quite
common. In DataFrames.jl parlance they are called by-row operations.
DataFramesMeta.jl offers an easy way to tell @select and @transform that
all operations that the user passes to them should be applied by-row. Just prefix
the name of the transformation function with the r character (r stands for
row). Therefore we have @rselect and @rtransform:

julia> @rselect(df, :a = 2 * :x, :b = :x * :y ^ 2)
5×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     2    121
   2 │     4    288
   3 │     6    507
   4 │     8    784
   5 │    10   1125

julia> @rtransform(df, :a = 2 * :x, :b = :x * :y ^ 2)
5×4 DataFrame
 Row │ x      y      a      b
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1     11      2    121
   2 │     2     12      4    288
   3 │     3     13      6    507
   4 │     4     14      8    784
   5 │     5     15     10   1125

As you can see we got rid of the dots, paying the cost of having all operations
applied by-row to our data.
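What by-row means can be sketched on plain vectors, without any data frame. Assuming that x and y below stand for the two columns, the broadcasted expression and an explicit element-by-element map give the same values:

```julia
x = collect(1:5)
y = collect(11:15)

# broadcasting: the dots apply the scalar operations to every element
b_broadcast = x .* y .^ 2

# by-row: a scalar function is applied to each pair of elements in turn
b_byrow = map((xi, yi) -> xi * yi ^ 2, x, y)
```

Both results match the :b column shown above.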

As an exercise think how you would subtract the mean from column :x in our
data frame. Can we use @rselect or must we use @select? You can use either:

julia> @select(df, :x, :x2 = :x .- mean(:x))
5×2 DataFrame
 Row │ x      x2
     │ Int64  Float64
─────┼────────────────
   1 │     1     -2.0
   2 │     2     -1.0
   3 │     3      0.0
   4 │     4      1.0
   5 │     5      2.0

julia> @rselect(df, :x, :x2 = :x - mean(df.x))
5×2 DataFrame
 Row │ x      x2
     │ Int64  Float64
─────┼────────────────
   1 │     1     -2.0
   2 │     2     -1.0
   3 │     3      0.0
   4 │     4      1.0
   5 │     5      2.0

I would say, however, that this time using @select is more natural. Although
we have to use the . in :x2 = :x .- mean(:x) it is pretty easy to understand
what is going on there.

When we used @rselect we had to pass the df.x column to mean (this is a
value computed like any other Julia code; DataFramesMeta.jl does not touch it as
it does not have : in front). Note that just passing :x would be incorrect,
as mean would also be applied by-row to it, so we would broadcast mean over
the :x column and the result would be:

julia> @rselect(df, :x, :x2 = :x - mean(:x))
5×2 DataFrame
 Row │ x      x2
     │ Int64  Float64
─────┼────────────────
   1 │     1      0.0
   2 │     2      0.0
   3 │     3      0.0
   4 │     4      0.0
   5 │     5      0.0

and this is most likely not what we want (unless we wanted to check that
subtracting some number from itself is equal to zero). In summary, putting an r
prefix broadcasts the operation with respect to the columns of a data frame
(i.e. parts of the passed expression that contain names with a : prefix).
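The difference between the two ways of calling mean can be reproduced on a plain vector. This is only an illustration of the idea using the Statistics standard library; x stands for the :x column:

```julia
using Statistics

x = [1, 2, 3, 4, 5]

# the mean of the whole column is computed once, then subtracted from every element
centered = x .- mean(x)

# mean applied to one element at a time returns that element, so we get zeros
zeros_only = [xi - mean(xi) for xi in x]
```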

So now we know that if we prefix select or transform with r we switch to
by-row mode. Is there anything more to learn? Indeed there is one more thing
you need to know: the ! suffix that these operations can take. It
makes the operation update the passed data frame in place. Note that
above when we performed transformations we were getting a fresh data frame, but
our df source data frame was untouched. When you add the ! suffix you get exactly
the same result but it gets stored in the data frame you passed to the
operation. Here are some examples:

julia> df
5×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15

julia> @transform!(df, :z = :x + :y)
5×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11     12
   2 │     2     12     14
   3 │     3     13     16
   4 │     4     14     18
   5 │     5     15     20

julia> df
5×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1     11     12
   2 │     2     12     14
   3 │     3     13     16
   4 │     4     14     18
   5 │     5     15     20

julia> @select!(df, :s = :x + :y + :z)
5×1 DataFrame
 Row │ s
     │ Int64
─────┼───────
   1 │    24
   2 │    28
   3 │    32
   4 │    36
   5 │    40

julia> df
5×1 DataFrame
 Row │ s
     │ Int64
─────┼───────
   1 │    24
   2 │    28
   3 │    32
   4 │    36
   5 │    40

Why might we want such in-place operations? Consider a large data frame
with 10,000 columns. If you perform a @transform of such a data frame adding
one column to it you will copy a lot of data (which takes time and RAM). By
doing @transform! you will be faster and more memory efficient, at the expense
of mutating the source data frame.
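The copying behavior can be checked with the plain DataFrames.jl functions that the macros build on. The sketch below compares column identity with ===; the variable names are picked for this illustration only:

```julia
using DataFrames

df = DataFrame(x=1:5, y=11:15)
x_col = df.x                      # the vector stored in df (no copy is made here)

# transform returns a new data frame and, by default, copies the source columns
df2 = transform(df, [:x, :y] => (+) => :z)
copied = df2.x !== x_col          # df2 holds its own copy of :x

# transform! adds :z to df itself and leaves the existing columns in place
transform!(df, [:x, :y] => (+) => :z)
reused = df.x === x_col           # the same vector is still stored in df
```

With many columns the copies made by transform add up, which is where the in-place variant saves time and memory.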

Conclusions

Today as a conclusion let me present the following flowchart summarizing
the basic available data transformation options in DataFramesMeta.jl
that I have covered:

Transformations guideline flowchart

There are many more features of DataFramesMeta.jl that I have not covered like:
subsetting rows of a data frame, sorting it, or performing operations on
grouped data. You can find all the details in the documentation of
DataFramesMeta.jl.

One thousand and one stories

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/11/12/so.html

Introduction

I have just hit 1000 answers for the [julia] tag on Stack Overflow so I felt
like writing about it. In order to have a complete one thousand and one stories
collection, today I thought of writing about a feature of Julia that will show
you how challenging the decisions that have to be made when designing
functions can be. We will work with one of the most fundamental functions,
sum.

Before I go to the technical details let me go back to my recent post about a
model that Alan Edelman prepared for one of the classes during his studies.
Recently I had an opportunity to discuss with him the exact context in which
the model was created. Here you can read the summary which I was lacking when
I was writing the original post:

In 1980, Alan Edelman was a 17 year old freshman at Yale University where he
met his social science distribution requirement by taking Psychology 101
with Dr. Kenji Hakuta. There he learned about the famous
Kitty Genovese murder and the concept of diffusion of responsibility.
Having to write a paper for this freshman class, and being a “math person” he
figured why not take the idea of diffusion literally and write a paper about
that. He does not recall if he still has a copy of that paper, but it may be
in a box in the attic. He thinks he got an A on that paper.

The post was written under Julia 1.6.3.

Are you sure you understand how the sum function works?

I will test your knowledge by example. The first task is to perform row-wise
summation of the following matrix:

julia> mat = fill(Int8(100), 5, 10)
5×10 Matrix{Int8}:
 100  100  100  100  100  100  100  100  100  100
 100  100  100  100  100  100  100  100  100  100
 100  100  100  100  100  100  100  100  100  100
 100  100  100  100  100  100  100  100  100  100
 100  100  100  100  100  100  100  100  100  100

julia> sum(eachcol(mat))
5-element Vector{Int8}:
 -24
 -24
 -24
 -24
 -24

julia> sum.(eachrow(mat))
5-element Vector{Int64}:
 1000
 1000
 1000
 1000
 1000

If you are surprised by the result let me explain the situation. In the first
case sum operates on whole vectors. In the second case sum operates on
scalars. Why would this make a difference? The reason is that sum does not
use + for aggregation, but employs Base.add_sum as the reduction
operator, which is defined as follows:

add_sum(x, y) = x + y
add_sum(x::SmallSigned, y::SmallSigned) = Int(x) + Int(y)
add_sum(x::SmallUnsigned, y::SmallUnsigned) = UInt(x) + UInt(y)
add_sum(x::Real, y::Real)::Real = x + y

As you can see, add_sum promotes the result to Int or UInt only if
it is passed two scalar integers of small types (both signed or both unsigned).
Therefore if we pass it vectors of integers no promotion happens.
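The effect of these methods can be checked directly. Note that Base.add_sum is an internal function, so the sketch below is meant for exploration only:

```julia
a = Int8(100)
b = Int8(100)

plain = a + b                        # Int8 arithmetic wraps around: -56
widened = Base.add_sum(a, b)         # small signed scalars are promoted to Int: 200
on_vectors = Base.add_sum([a], [b])  # vectors hit the generic method: Int8[-56]
```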

Now you surely know what the sum of the following vector will be:

julia> v = Integer[0x64; fill(Int8(100), 9)]
10-element Vector{Integer}:
 0x64
  100
  100
  100
  100
  100
  100
  100
  100
  100

Let us check:

julia> sum(v)
0xe8

Could you have guessed it? If yes, you probably assume that sum is using
foldl and you have noticed that Base.add_sum does not perform promotion
to Int or UInt when you mix signed and unsigned integers.

Unfortunately, if this was your guess, you are wrong in general. Consider this
scenario:

julia> sum(Integer[0x64; fill(Int8(100), 9999)])
937536

The result we get is quite surprising. We could have expected:

julia> foldl(+, Integer[0x64; fill(Int8(100), 9999)])
0x40

as we know that Base.add_sum will always fall back to + in this case. This
would be consistent with the previous result.

What is the reason for the difference? Actually sum does not use foldl but
reduce, and reduce does not have a guaranteed order of summation. This
means that in the latter case we must have summed two Int8(100) values
using Base.add_sum at some point, which promoted the result to Int.
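How the grouping of additions changes the result can be seen on a shorter, hypothetical vector. The two groupings below are written by hand for illustration; they are not the order that sum actually uses:

```julia
v = Integer[0x64; fill(Int8(100), 3)]

# strictly left to right: UInt8 meets Int8 at every step, so nothing is widened
left = Base.add_sum(Base.add_sum(Base.add_sum(v[1], v[2]), v[3]), v[4])

# pairwise: the second pair is Int8 with Int8, which Base.add_sum promotes to Int
pairwise = Base.add_sum(Base.add_sum(v[1], v[2]), Base.add_sum(v[3], v[4]))
```

The first grouping stays in UInt8 and wraps around to 0x90, while the second one ends up as the Int value 400.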

Maybe above you thought of calling foldl with Base.add_sum like this?

julia> foldl(Base.add_sum, Integer[0x64; fill(Int8(100), 9999)])
0x00000000000f4240

0x00000000000f4240 is just 1000000 (which is a correct result if we were
always widening types when doing the summation), but why do we get such a weird
value? The reason is that foldl differs from sum in the way it initializes
the summation. It uses the first element of the collection (0x64 in our case)
and promotes it to the type that would be produced if this element were added
to itself using Base.add_sum, and we know that adding UInt8 to UInt8 invokes
promotion to UInt.
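The initialization step can be inspected with Base.reduce_first, assuming the Julia 1.6 internals, where this helper turns the first element into the starting value of a fold. It is not part of the public API and may change between releases, so treat this as a sketch:

```julia
first_elem = 0x64

# with Base.add_sum a small unsigned first element is widened to UInt
start = Base.reduce_first(Base.add_sum, first_elem)

# with plain + the first element is used as it is
start_plus = Base.reduce_first(+, first_elem)
```

Once the accumulator is a UInt, every later addition of an Int8 value is carried out in UInt, which is why the final result is printed as a wide hexadecimal number.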

What else could go wrong? Try this:

julia> using Random

julia> Random.seed!(1234)
MersenneTwister(1234)

julia> x = rand(1000)
1000-element Vector{Float64}:
 0.5908446386657102
 0.7667970365022592
 0.5662374165061859
 0.4600853424625171
 0.7940257103317943
 0.8541465903790502
 0.20058603493384108
 0.2986142783434118
 ⋮
 0.5762976355934157
 0.08831200391130656
 0.8994769043886504
 0.8232831225471882
 0.37869007913520947
 0.7812366659068535
 0.4651012221417914

julia> sum(x)
496.84883432553806

julia> sum(x, init=0.0)
496.84883432553846

As you can see the results are not identical (they differ at the second least
significant digit). What has gone wrong this time? By specifying the init keyword
argument, although 0.0 is a neutral element of summation, we forced sum to
use a different summation order again, and for floats the order of summation is
known to affect the result.
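That the order of additions matters for floats can be seen on three numbers already. This is a standard illustration, independent of how sum groups its operations internally:

```julia
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)

# exact comparison fails because the two results differ in the last bit
same = left_first == right_first

# comparison up to rounding error succeeds
nearly_equal = isapprox(left_first, right_first)
```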

Wait, have I said that 0.0 is a neutral element of summation? I have lied. See:

julia> isequal(sum([-0.0, -0.0]), sum([-0.0, -0.0], init=0.0))
false

because:

julia> sum([-0.0, -0.0])
-0.0

julia> sum([-0.0, -0.0], init=0.0)
0.0
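The behavior follows from how signed zeros add up in floating point arithmetic, which can be checked on scalars:

```julia
neg_sum = -0.0 + -0.0     # adding two negative zeros keeps the negative sign
mixed_sum = 0.0 + -0.0    # adding zeros of opposite sign gives positive zero

equal_by_value = neg_sum == mixed_sum          # == treats both zeros as equal
equal_by_isequal = isequal(neg_sum, mixed_sum) # isequal tells them apart
```

So passing init=0.0 adds one positive zero to the chain of additions, and that is enough to flip the sign of the result.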

At this point I start asking myself why in my math classes I was taught to use
real numbers and not the IEEE 754 standard, which seems to be at play much
more often in practice (at least if you are using computers). I will have to
pose this question to Alan Edelman, who is a professor of both mathematics and
computer science, the next time I have the privilege of talking with him.

Conclusions

In this blog and on Stack Overflow I try to show users that Julia is a nice
language to work with (of which I am really convinced).

However, having written these 1000 stories that end well, one wants to show at
least one story in which the dark side shows up (of course the problems I have
discussed are not Julia specific, but their particular manifestation is a
consequence of design decisions that Julia developers made).

Can we do anything about the problems I have shown? There is one remedy and one
warning to keep in mind.

The remedy is the following: when working with collections in Julia always take
care to make sure their elements have a homogeneous type and that this type is
chosen appropriately for the operation you want to perform. Except for very rare
situations there is little sense in mixing numbers of different types in one
collection so just do not do it.
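A minimal sketch of this remedy for the examples from this post: convert the elements to one sufficiently wide type before (or while) summing. Using Int as the target type is an assumption that fits these small values; choose whatever type suits your data:

```julia
v = Integer[0x64; fill(Int8(100), 9999)]

# sum(f, itr) applies f to each element first, so only Int values are added
total = sum(Int, v)

# the same idea for the Int8 matrix: widen it once, then sum along rows
mat = fill(Int8(100), 5, 10)
row_sums = vec(sum(Int.(mat), dims=2))
```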

The warning is a consequence of the IEEE floating point arithmetic standard: if you
work with floats, better treat the results of operations on them as only
approximate.