Tag Archives: julialang

Dropping columns from a data frame

Re-posted from: https://bkamins.github.io/julialang/2023/09/08/dropcols.html

Introduction

One of the common tasks when working with a data frame is dropping some of its columns.
There are two ways to do it. You can either specify
which columns you want to keep or which columns you want to drop.

One of the frequent questions I get is how to do these operations with
DataFrames.jl when the list of columns to keep or drop might not
be a subset of the columns of the data frame. This is the topic I want to cover in today's post.

The post was tested using Julia 1.9.2 and DataFrames.jl 1.6.1.

Standard column selection

First, create an example data frame:

julia> using DataFrames

julia> df = DataFrame(a=1, b=2, c=3, d=4)
1×4 DataFrame
 Row │ a      b      c      d
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      3      4

Now assume we want to keep columns :a and :c from it. We can do it by writing, for example:

julia> select(df, :a, :c)
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

You could also pass the columns as a variable using e.g., a vector:

julia> keep1 = [:a, :c]
2-element Vector{Symbol}:
 :a
 :c

julia> select(df, keep1)
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

Now, let us discuss dropping columns. Assume we want to keep all columns except columns :b and :d.
We can achieve this by using the Not selector:

julia> select(df, Not(:b, :d))
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

Also in this case we can use a helper variable:

julia> drop1 = [:b, :d]
2-element Vector{Symbol}:
 :b
 :d

julia> select(df, Not(drop1))
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

The problematic case: the selected column is not present in a data frame

In some scenarios, we might want to provide a list of columns of which not
all are present in the data frame. For example, assume we want to keep
columns :a and :x. We see that the :x column is not present in our
df data frame.

Before we move forward, let me comment on when such a situation occurs most often.
Assume you have 100 data frames that describe your data. Each data frame is similar,
but not identical. For example, a single data frame might represent data from one country
and the list of information for the countries does not have to be identical (for some
countries we might have more information, which results in more columns in a data frame).
When processing such data we might want to write one general condition on which columns
we want to keep or drop, and some of these columns might be present in only a subset of
all data frames.

Now let us go back to our example. Let us try keeping columns :a and :x:

julia> select(df, :a, :x)
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"

julia> keep2 = [:a, :x]
2-element Vector{Symbol}:
 :a
 :x

julia> select(df, keep2)
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"

We get an error. DataFrames.jl is designed to check, by default, that the operation you want to perform
on your data frame is valid. This is a conscious design decision. The reason is that in production
settings, when you say that you want to keep columns :a and :x, you most often assume that they are both present in df.
Thus you want to get an error if any of them is missing.

The same behavior can be observed for dropping columns. Assume we want to drop columns :b and :x:

julia> select(df, Not(:b, :x))
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"

julia> drop2 = [:b, :x]
2-element Vector{Symbol}:
 :b
 :x

julia> select(df, Not(drop2))
ERROR: ArgumentError: column name "x" not found in the data frame; existing most similar names are: "a", "b", "c" and "d"

So what we saw here is a default behavior that was designed to be safe.
In what follows let me discuss how to perform a flexible selection.

Performing column selection when some of them are not present in a data frame

There are several solutions for column selection when some of them are not present in a data frame.
Let me present the one that I find the most convenient.
For this operation I typically use the Cols selector. The reason is that you can pass
a condition function (a predicate) as an argument to Cols that will select columns
whose names meet a passed condition.

Therefore the following operation:

julia> select(df, keep1)
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

Is the same as:

julia> keep1s = string.(keep1)
2-element Vector{String}:
 "a"
 "c"

julia> select(df, Cols(in(keep1s)))
1×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     1      3

Note what we did here. The in(keep1s) expression produces a function that checks if a value passed to it is in the keep1s vector.
It is important to note that although column selection in DataFrames.jl accepts both symbols (like :a) and strings (like "a")
as column names, the Cols-based selector will perform the check against strings. Therefore I had to convert the keep1 vector
of symbols to a keep1s vector of strings.
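To see what the predicate does on its own, here is a minimal sketch that uses only Base Julia, without any data frame. The names keep, is_kept, and colnames are introduced just for this illustration.

```julia
keep = ["a", "x"]

# in(keep) is the one-argument form of in: it returns a function
# equivalent to name -> name in keep.
is_kept = in(keep)

is_kept("a")   # true
is_kept("b")   # false
is_kept(:a)    # false, because the Symbol :a is not equal to the string "a"

# A stand-in for the column names of the example data frame, as strings.
colnames = ["a", "b", "c", "d"]

# Keeping only the names that pass the predicate ignores "x",
# since it never appears among the names being tested.
filter(is_kept, colnames)   # ["a"]
```

The third call shows why the conversion to strings matters: a vector of symbols would never match names that are tested as strings.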

So far select(df, Cols(in(keep1s))) is more verbose than just writing select(df, keep1). However, the benefit of Cols
is that when the in(keep1s) check is done, the keep1s vector can hold whatever values we like; in particular,
they do not have to be valid column names of our df.

Therefore to keep columns :a and :x, if we are unsure if these columns are present in df we can write:

julia> select(df, Cols(in(["a", "x"])))
1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1

Note that this time the operation works without an error. And again, please keep in mind that in this selector we need to pass
column names as strings.

Now you can probably already tell how to pass a list of columns to drop, without requiring that they are present in df.
The only thing to do is to use the !in function (not-in) instead of in. Let us drop columns :b and :x from our
data frame (keeping in mind that :x is not present in it):

julia> select(df, Cols(!in(["b", "x"])))
1×3 DataFrame
 Row │ a      c      d
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      3      4

All worked as expected – the :b column was dropped and the :x column was ignored in the column dropping operation.
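Going back to the scenario with many similar data frames, the two patterns above can be wrapped in small helpers. This is only a sketch: keep_existing and drop_existing are hypothetical names (they are not part of DataFrames.jl), and df1 and df2 are made-up inputs.

```julia
using DataFrames

# Hypothetical helpers: keep or drop the listed columns, ignoring any
# names that are not present in the data frame. string.(cols) lets the
# caller pass either symbols or strings.
keep_existing(df, cols) = select(df, Cols(in(string.(cols))))
drop_existing(df, cols) = select(df, Cols(!in(string.(cols))))

# Two data frames with partly different columns.
df1 = DataFrame(a=1, b=2, c=3, d=4)
df2 = DataFrame(a=5, c=6, x=7)

# The same column lists are applied to every data frame without errors.
kept = [keep_existing(d, [:a, :x]) for d in (df1, df2)]
dropped = [drop_existing(d, [:b, :x]) for d in (df1, df2)]

map(names, kept)      # [["a"], ["a", "x"]]
map(names, dropped)   # [["a", "c", "d"], ["a", "c"]]
```

The trade-off is the one discussed earlier: a typo in a column name will no longer raise an error, so such helpers are best used only where missing columns are really expected.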

Conclusions

I hope you find the examples I gave today useful.

In general, the whole design of DataFrames.jl is similar to
what was discussed in this post. The default behavior is picked to be safe (as in our example: by default,
select checks if the columns you pass are present in a data frame), but it is possible to switch to an unsafe
mode relatively easily (in our example: using Cols with a predicate function).

How does Tables.jl handle schema-less tables?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/09/01/tables.html

Introduction

Tables.jl is a fundamental package in the JuliaData ecosystem.

One of the key concepts used in Tables.jl is the schema of a table.
A schema is information about the column names and types of a table.
Having access to the schema is useful, as querying these properties
is constantly needed in practice.

In this post I want to discuss in what cases a table might not have
schema information associated with it and how Tables.jl handles this case.

The post was written using Julia 1.9.2, Tables.jl 1.10.1, and DataFrames.jl 1.6.1.

What is a schema of a table?

Let us create some example tables and investigate their schema:

julia> using DataFrames, Tables

julia> df = DataFrame(a=[1, 2], b=[3.0, 4.0], c=[1, "a"])
2×3 DataFrame
 Row │ a      b        c
     │ Int64  Float64  Any
─────┼─────────────────────
   1 │     1      3.0  1
   2 │     2      4.0  a

julia> Tables.schema(df)
Tables.Schema:
 :a  Int64
 :b  Float64
 :c  Any

julia> ntv = Tables.columntable(df)
(a = [1, 2], b = [3.0, 4.0], c = Any[1, "a"])

julia> Tables.schema(ntv)
Tables.Schema:
 :a  Int64
 :b  Float64
 :c  Any

julia> vnt = Tables.rowtable(df)
2-element Vector{NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Any}}}:
 NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Any}}((1, 3.0, 1))
 NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Any}}((2, 4.0, "a"))

julia> Tables.schema(vnt)
Tables.Schema:
 :a  Int64
 :b  Float64
 :c  Any

I have created a data frame df with three columns and then converted it into
a named tuple of vectors ntv and a vector of named tuples vnt.
All these three objects are Tables.jl tables.

In all cases we have checked that Tables.schema properly identifies the schema
of these tables. They have three columns :a, :b, and :c, and these columns
have types Int64, Float64, and Any.

So in what case might a table not have a schema? To understand this consider the
following transformation:

julia> vnt2 = [(; collect(pairs(r))...) for r in vnt]
2-element Vector{NamedTuple{(:a, :b, :c)}}:
 (a = 1, b = 3.0, c = 1)
 (a = 2, b = 4.0, c = "a")

julia> Tables.schema(vnt2)

julia> isnothing(Tables.schema(vnt2))
true

Let us understand what happens in this code. With the
(; collect(pairs(r))...) operation I rebuild each named tuple stored in the vnt vector
so that the types of its fields are concrete.
To understand this better note:

julia> typeof(vnt[1])
NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Any}}

julia> typeof((; collect(pairs(vnt[1]))...))
NamedTuple{(:a, :b, :c), Tuple{Int64, Float64, Int64}}

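Note that the type of the :c field is Any in the first case and Int64 in the second.

Because the :c field is Int64 in the first row and String in the second, the rebuilt rows
have different types and, in consequence, the vnt2 vector itself does not have a concrete element type: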

julia> eltype(vnt2)
NamedTuple{(:a, :b, :c)}

This type only specifies column names, but not their types. In such a situation
the Tables.schema function returns nothing signaling to the user that the schema
of a table is unknown.
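The difference between a type that fixes only the names and one that also fixes the field types can be checked with Base Julia alone. The sketch below uses illustrative names (full, names_only, r1, r2) and does not call Tables.jl.

```julia
# Fully parameterized type: both the names and the field types are known.
full = NamedTuple{(:a, :b), Tuple{Int64, Int64}}

# Only the names are fixed; the field types are left open.
names_only = NamedTuple{(:a, :b)}

isconcretetype(full)         # true
isconcretetype(names_only)   # false

r1 = (a=1, b=2)
r2 = (a=missing, b=3)

r1 isa full         # true
r2 isa full         # false, the :a field is Missing, not Int64
r1 isa names_only   # true
r2 isa names_only   # true, so a vector with this element type can hold both rows
```

A vector whose element type is like names_only can therefore store rows with differing field types, which is exactly why the column types cannot be read off the element type alone.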

You might think that this is an artificial example but consider the following common case:

julia> vnt3 = [(a=1, b=2), (a=missing, b=3)]
2-element Vector{NamedTuple{(:a, :b)}}:
 (a = 1, b = 2)
 (a = missing, b = 3)

julia> Tables.schema(vnt3)

julia> isnothing(Tables.schema(vnt3))
true

As you can see, when we have a vector of named tuples, mixing in missing values
makes it schema-less.

How does Tables.jl handle schema-less tables? Part 1

If Tables.jl detects a schema-less table it tries to dynamically detect its schema
when functions defined in Tables.jl are called. The three key functions here are:

Tables.columns: returns an object that can be queried column-wise;
Tables.rows: returns an object that can be queried row-wise;
Tables.dictrowtable: returns a row-wise vector that performs “column unioning”.

You might wonder what this “column unioning” part means. We will get to it in a second.

First let us investigate how these functions work on our vnt2 object that is schema-less:

julia> Tables.columns(vnt2)
Tables.CopiedColumns{NamedTuple{(:a, :b, :c), Tuple{Vector{Int64}, Vector{Float64}, Vector{Any}}}} with 2 rows, 3 columns, and schema:
 :a  Int64
 :b  Float64
 :c  Any

julia> Tables.schema(Tables.columns(vnt2))
Tables.Schema:
 :a  Int64
 :b  Float64
 :c  Any

julia> Tables.rows(vnt2)
2-element Vector{NamedTuple{(:a, :b, :c)}}:
 (a = 1, b = 3.0, c = 1)
 (a = 2, b = 4.0, c = "a")

julia> Tables.schema(Tables.rows(vnt2))

julia> isnothing(Tables.schema(Tables.rows(vnt2)))
true

julia> Tables.dictrowtable(vnt2)
Tables.DictRowTable([:a, :b, :c], Dict{Symbol, Type}(:a => Int64, :b => Float64, :c => Union{Int64, String}), Dict{Symbol, Any}[Dict(:a => 1, :b => 3.0, :c => 1), Dict(:a => 2, :b => 4.0, :c => "a")])

julia> Tables.schema(Tables.dictrowtable(vnt2))
Tables.Schema:
 :a  Int64
 :b  Float64
 :c  Union{Int64, String}

As you can see, Tables.columns performed eltype detection and determined the type of column :c as Any.
Tables.rows produces a schema-less object, because for vnt2 calling Tables.rows just returns the input.
Finally, Tables.dictrowtable detects a narrower eltype for column :c, which is Union{Int64, String}.
All these results are technically correct, but it is important to note that they start to matter when we create
a data frame from the result:

julia> DataFrame(Tables.columns(vnt2))
2×3 DataFrame
 Row │ a      b        c
     │ Int64  Float64  Any
─────┼─────────────────────
   1 │     1      3.0  1
   2 │     2      4.0  a

julia> DataFrame(vnt2)
2×3 DataFrame
 Row │ a      b        c
     │ Int64  Float64  Any
─────┼─────────────────────
   1 │     1      3.0  1
   2 │     2      4.0  a

julia> DataFrame(Tables.dictrowtable(vnt2))
2×3 DataFrame
 Row │ a      b        c
     │ Int64  Float64  Union…
─────┼────────────────────────
   1 │     1      3.0  1
   2 │     2      4.0  a

In this case for Tables.columns and Tables.rows (recall that Tables.rows(vnt2) is just vnt2)
we have Any as the element type for :c, while for Tables.dictrowtable we get the Union.
This difference might matter to you if you later wanted to change the data stored in the :c column.
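Why the element type matters for later changes can be illustrated with plain vectors that have the same element types as the two :c columns. This is a sketch in Base Julia; c_any, c_union, and union_rejected are names made up for the example.

```julia
c_any = Any[1, "a"]                      # same element type as :c built via Tables.columns
c_union = Union{Int64, String}[1, "a"]   # same element type as :c built via Tables.dictrowtable

# A vector with element type Any accepts every value.
c_any[1] = 2.5

# The Union-typed vector accepts only Int64 and String values,
# so storing a Float64 throws a MethodError during conversion.
union_rejected = try
    c_union[1] = 2.5
    false
catch err
    err isa MethodError
end

union_rejected   # true
```

Neither choice is better in general: the narrower type catches unexpected values early, while Any is more permissive.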

However, the key difference between Tables.dictrowtable and the other methods is visible
when the rows of the data have heterogeneous column lists.

How does Tables.jl handle schema-less tables? Part 2

What is data that has a heterogeneous column list? It is data in which different rows
have different sets of columns, and it is relatively common in practice. Let me create two examples:

julia> h1 = [(; a=1), (; a=2, b=3)]
2-element Vector{NamedTuple}:
 (a = 1,)
 (a = 2, b = 3)

julia> h2 = [(; a=1), (; b=3)]
2-element Vector{NamedTuple{names, Tuple{Int64}} where names}:
 (a = 1,)
 (b = 3,)

In both h1 and h2 the rows have different sets of columns. How does
Tables.jl handle them? Let us check by creating a data frame from them:

julia> DataFrame(Tables.columns(h1))
2×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2

julia> DataFrame(Tables.rows(h1))
2×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2

julia> DataFrame(Tables.dictrowtable(h1))
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64?
─────┼────────────────
   1 │     1  missing
   2 │     2        3

For the h1 input we see that using Tables.columns and Tables.rows drops the :b
column while Tables.dictrowtable keeps it and fills missing entries with missing.
This behavior is called “column unioning”.
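The idea behind “column unioning” can be sketched in a few lines of Base Julia. This is a conceptual illustration only, not how Tables.dictrowtable is implemented; allnames and unioned are names chosen for the example.

```julia
h1 = [(; a=1), (; a=2, b=3)]

# Collect every column name seen in any row, in order of first appearance.
allnames = Symbol[]
for row in h1, name in keys(row)
    name in allnames || push!(allnames, name)
end

# Rebuild each row over the full set of names, using missing
# where a row does not have a given field.
unioned = [NamedTuple{Tuple(allnames)}(Tuple(get(row, name, missing) for name in allnames))
           for row in h1]

allnames   # [:a, :b]
unioned    # [(a = 1, b = missing), (a = 2, b = 3)]
```

Note that the whole input has to be scanned before the set of columns is known, which is the price of not trusting the first row.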

Now let us check the h2 case:

julia> DataFrame(Tables.columns(h2))
ERROR: type NamedTuple has no field a

julia> DataFrame(Tables.rows(h2))
ERROR: type NamedTuple has no field a

julia> DataFrame(Tables.dictrowtable(h2))
2×2 DataFrame
 Row │ a        b
     │ Int64?   Int64?
─────┼──────────────────
   1 │       1  missing
   2 │ missing        3

Now we have an even stranger behavior. The first two calls error, while
Tables.dictrowtable works by “column unioning”. Why do the first two cases error?
The reason is that the DataFrame constructor internally always calls Tables.columns
on a passed table. And Tables.columns, when doing column detection, assumes that
the list of columns for a row-wise table is given by its first row. So even
without calling DataFrame we get an error when we do:

julia> Tables.columns(h2)
ERROR: type NamedTuple has no field a
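The first-row assumption can be mimicked in Base Julia to see where the failure comes from. This is a simplified sketch under that assumption, not the actual Tables.columns code; first_names and first_row_failed are illustrative names.

```julia
h2 = [(; a=1), (; b=3)]

# Take the column names from the first row only.
first_names = keys(first(h2))   # (:a,)

# Reading those names from every row fails as soon as some row lacks
# one of them: here the second row has no field a.
first_row_failed = try
    [getproperty(row, name) for row in h2, name in first_names]
    false
catch
    true
end

first_row_failed   # true
```

With h1 the same approach would not fail, because every row has the :a field; the extra :b field of the second row would simply never be read.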

In summary we have two cases of behaviors for row-wise tables:

assume the columns are given in the first row of a table; this is what Tables.columns does, and, in consequence, the DataFrame constructor;
perform “column unioning”; this is what Tables.dictrowtable does.

Conclusions

The take-aways from the examples we have given today are as follows:

by default DataFrame(row_wise_table) inherits the Tables.columns behavior and assumes that the set of columns of the passed row-wise table is given by its first row;
if you want to override the default behavior and perform “column unioning” instead, call DataFrame(Tables.dictrowtable(row_wise_table)); this option is useful if you have data that has heterogeneous column lists in different rows.

The examples given today were a bit low-level, but I hope they improved your understanding
of how the Tables.jl functions handle different tabular data sources.

juliabloggers.com

A Julia Language Blog Aggregator
