Author Archives: Blog by Bogumił Kamiński

New features in DataFrames.jl 1.3: conclusion

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/01/07/release13.html

Introduction

This is the last post from the series introducing features added in DataFrames.jl 1.3. There are many changes I have not covered yet. I have selected
the ones that I think are most relevant in typical data wrangling workflows.

The topics I plan to discuss are:

  • ordering of groups in groupby;
  • unstack now supports fill keyword argument;
  • deprecations in deleting rows and sorting API.

The post was written under Julia 1.7.0, DataFrames.jl 1.3.1,
Chain.jl 0.4.10, and FreqTables.jl 0.4.5.

Ordering of groups in groupby

Let me start by highlighting that GroupedDataFrame objects produced by the
groupby function are indexable. This means that you can flexibly subset groups
or re-order them. Here is an example:

julia> using DataFrames

julia> df = DataFrame(a=[1,1,2,2,2,3])
6×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     1
   3 │     2
   4 │     2
   5 │     2
   6 │     3

julia> gdf = groupby(df, :a, sort=true)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = 1
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     1
⋮
Last Group (1 row): a = 3
 Row │ a
     │ Int64
─────┼───────
   1 │     3

julia> gdf[[3, 1]]
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = 3
 Row │ a
     │ Int64
─────┼───────
   1 │     3
⋮
Last Group (2 rows): a = 1
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     1

Here the gdf[[3, 1]] operation picked two groups from gdf, putting the group
with original index 3 first and the group with original index 1 second.

This feature is often useful and gives a lot of flexibility to the users. Here
is an example showing how you can sort groups based on non-key column values:

julia> df = DataFrame(a=[1,1,2,2,2,3], x=6:-1:1)
6×2 DataFrame
 Row │ a      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      6
   2 │     1      5
   3 │     2      4
   4 │     2      3
   5 │     2      2
   6 │     3      1

julia> gdf = groupby(df, :a, sort=true)
GroupedDataFrame with 3 groups based on key: a
First Group (2 rows): a = 1
 Row │ a      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      6
   2 │     1      5
⋮
Last Group (1 row): a = 3
 Row │ a      x
     │ Int64  Int64
─────┼──────────────
   1 │     3      1

julia> gdf[sortperm([sum(sdf.x) for sdf in gdf])]
GroupedDataFrame with 3 groups based on key: a
First Group (1 row): a = 3
 Row │ a      x
     │ Int64  Int64
─────┼──────────────
   1 │     3      1
⋮
Last Group (2 rows): a = 1
 Row │ a      x
     │ Int64  Int64
─────┼──────────────
   1 │     1      6
   2 │     1      5

However, this means that you cannot assume that the groups in a
GroupedDataFrame keep any particular order. For this reason, apart from integer
indexing, GroupedDataFrame also supports indexing using values of grouping
columns (in the example I show Tuple indexing, but NamedTuple and dictionary
indexing are also supported):

julia> df = DataFrame(name=["Alice", "Bob"])
2×1 DataFrame
 Row │ name
     │ String
─────┼────────
   1 │ Alice
   2 │ Bob

julia> gdf = groupby(df, :name, sort=true)
GroupedDataFrame with 2 groups based on key: name
First Group (1 row): name = "Alice"
 Row │ name
     │ String
─────┼────────
   1 │ Alice
⋮
Last Group (1 row): name = "Bob"
 Row │ name
     │ String
─────┼────────
   1 │ Bob

julia> gdf[("Bob",)]
1×1 SubDataFrame
 Row │ name
     │ String
─────┼────────
   1 │ Bob

or you can use a special GroupKey object that is produced by the keys
function (this option is fastest):

julia> keys(gdf)
2-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (name = "Alice",)
 GroupKey: (name = "Bob",)
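
As a minimal sketch of the lookup styles mentioned above, the snippet below indexes the same grouped data frame with a GroupKey, a Tuple, a NamedTuple, and a dictionary. The variable names (bob_key, by_key, and so on) are only illustrative and do not come from the original post.

```julia
using DataFrames

df = DataFrame(name=["Alice", "Bob"])
gdf = groupby(df, :name, sort=true)

# keys(gdf) holds one GroupKey per group, in the current group order
bob_key = keys(gdf)[2]

# Each of the four lookups below selects the group with name == "Bob"
by_key = gdf[bob_key]
by_tuple = gdf[("Bob",)]
by_namedtuple = gdf[(name="Bob",)]
by_dict = gdf[Dict(:name => "Bob")]
```

Unlike integer indexing, none of these lookups depends on the position of the group, so they keep working after the groups are reordered or subset.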

So what is new in DataFrames.jl 1.3? Previously the user was not able to
fully control the initial ordering of groups produced by groupby in
all cases. Now this is controlled by the sort keyword argument, and the
API follows these rules:

  • if you pass sort=true the groups will be sorted by values of grouping columns;
  • if you pass sort=false the groups will be produced in order of their first
    appearance in the source data frame;
  • if you omit the sort keyword argument the ordering of groups is
    undefined and will depend on the grouping algorithm used (DataFrames.jl has
    several grouping algorithms and tries to choose the fastest available).

To see that these options matter let me show two examples of grouping on an
integer column:

julia> df = DataFrame(id=[2, 3, 1])
3×1 DataFrame
 Row │ id
     │ Int64
─────┼───────
   1 │     2
   2 │     3
   3 │     1

julia> keys(groupby(df, :id))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 1,)
 GroupKey: (id = 2,)
 GroupKey: (id = 3,)

julia> keys(groupby(df, :id, sort=true))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 1,)
 GroupKey: (id = 2,)
 GroupKey: (id = 3,)

julia> keys(groupby(df, :id, sort=false))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 2,)
 GroupKey: (id = 3,)
 GroupKey: (id = 1,)

julia> df = DataFrame(id=[2, 30, 1])
3×1 DataFrame
 Row │ id
     │ Int64
─────┼───────
   1 │     2
   2 │    30
   3 │     1

julia> keys(groupby(df, :id))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 2,)
 GroupKey: (id = 30,)
 GroupKey: (id = 1,)

julia> keys(groupby(df, :id, sort=true))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 1,)
 GroupKey: (id = 2,)
 GroupKey: (id = 30,)

julia> keys(groupby(df, :id, sort=false))
3-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 2,)
 GroupKey: (id = 30,)
 GroupKey: (id = 1,)

As you can see, passing the sort keyword argument produces a consistent
ordering. However, when it is not passed the two examples give different
orderings: the groups are sorted in the first example and follow the order of
first appearance in the second.
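
If downstream code relies on group order, a safe pattern is to always pass sort explicitly. The sketch below reuses the second data frame from above; group_ids is a hypothetical helper introduced only for this illustration.

```julia
using DataFrames

df = DataFrame(id=[2, 30, 1])

# Hypothetical helper: collect the value of the :id key for every group
group_ids(gdf) = [key.id for key in keys(gdf)]

# sort=true: groups ordered by the values of the grouping column
sorted_ids = group_ids(groupby(df, :id, sort=true))

# sort=false: groups ordered by first appearance in df
appearance_ids = group_ids(groupby(df, :id, sort=false))
```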

unstack now supports fill keyword argument

The change in unstack is simple, but I think it will be useful in many common
scenarios. Now you can specify what value should be used to fill missing
combinations of data.

Let me give a practical example. Assume you have a data frame where you have
several observations of people's hair color and eye color:

julia> df = DataFrame(hair=["brown", "yellow", "brown", "brown"],
                      eyes=["blue", "blue", "green", "blue"])
4×2 DataFrame
 Row │ hair    eyes
     │ String  String
─────┼────────────────
   1 │ brown   blue
   2 │ yellow  blue
   3 │ brown   green
   4 │ brown   blue

You can create a frequency table of this data with the FreqTables.jl package:

julia> using FreqTables

julia> freqtable(df, :hair, :eyes)
2×2 Named Matrix{Int64}
hair ╲ eyes │  blue  green
────────────┼─────────────
brown       │     2      1
yellow      │     1      0

You got a matrix with the desired result. However, what if you wanted to get
a DataFrame instead? In the past you would do:

julia> using Chain

julia> @chain df begin
           groupby([:hair, :eyes], sort=true)
           combine(nrow)
           unstack(:hair, :eyes, :nrow)
       end
2×3 DataFrame
 Row │ hair    blue    green
     │ String  Int64?  Int64?
─────┼─────────────────────────
   1 │ brown        2        1
   2 │ yellow       1  missing

The only problem is that you get missing instead of 0 in the cell where
there were no observations. To get 0 you would write:

julia> @chain df begin
           groupby([:hair, :eyes], sort=true)
           combine(nrow)
           unstack(:hair, :eyes, :nrow)
           coalesce.(0)
       end
2×3 DataFrame
 Row │ hair    blue   green
     │ String  Int64  Int64
─────┼──────────────────────
   1 │ brown       2      1
   2 │ yellow      1      0

Since DataFrames.jl 1.3 the pipeline is easier, as you can pass the fill=0
keyword argument to unstack:

julia> @chain df begin
           groupby([:hair, :eyes], sort=true)
           combine(nrow)
           unstack(:hair, :eyes, :nrow, fill=0)
       end
2×3 DataFrame
 Row │ hair    blue   green
     │ String  Int64  Int64
─────┼──────────────────────
   1 │ brown       2      1
   2 │ yellow      1      0
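
For readers who do not use Chain.jl, here is a sketch of the same comparison written with plain function calls. It checks that the older coalesce approach and the fill keyword argument give the same table; the intermediate names (counts, wide_coalesce, wide_fill) are illustrative only.

```julia
using DataFrames

df = DataFrame(hair=["brown", "yellow", "brown", "brown"],
               eyes=["blue", "blue", "green", "blue"])

# Long-format counts of each observed (hair, eyes) combination
counts = combine(groupby(df, [:hair, :eyes], sort=true), nrow)

# Older approach: unstack first, then replace missing with 0
wide_coalesce = coalesce.(unstack(counts, :hair, :eyes, :nrow), 0)

# DataFrames.jl 1.3 approach: fill the absent combinations directly
wide_fill = unstack(counts, :hair, :eyes, :nrow, fill=0)
```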

Deprecations in deleting rows and sorting

The deprecation in row deletion is simple. The delete! function is deprecated
in favor of the deleteat! function. This change was made to make the
DataFrames.jl API consistent with the Julia Base API (where delete! is defined
to remove a mapping for the given key in a collection, while deleteat! removes
items at the given indices).
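
The parallel with Julia Base can be sketched as follows. The data is made up for illustration; the point is only that deleteat! is index-based for both vectors and data frames, while delete! is key-based.

```julia
using DataFrames

# Index-based removal of rows from a data frame
df = DataFrame(id=1:5)
deleteat!(df, [2, 4])      # drops rows 2 and 4 in place

# The same index-based operation on a plain Vector
v = collect(1:5)
deleteat!(v, [2, 4])

# Key-based removal, which is what delete! means in Base
d = Dict(:x => 1, :y => 2)
delete!(d, :x)
```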

The deprecation in sorting API is more subtle. Consider the following data
frame:

julia> df = DataFrame(x=[1, 2, 2, 1], y =[2, 2, 1, 1], z=1:4)
4×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      1
   2 │     2      2      2
   3 │     2      1      3
   4 │     1      1      4

If you sort it without passing the list of columns on which it should be sorted,
then by default a lexicographic sort on all columns is performed:

julia> sort(df)
4×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      4
   2 │     1      2      1
   3 │     2      1      3
   4 │     2      2      2

This is the same as:

julia> sort(df, All())
4×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      4
   2 │     1      2      1
   3 │     2      1      3
   4 │     2      2      2

However, to our surprise, currently when you ask for sorting on no columns
you also get a data frame sorted on all columns:

julia> sort(df, Cols())
┌ Warning: When empty column selector is passed ordering is done on all colums. This behavior is deprecated and will change in the future.
│   caller = sortperm(df::DataFrame, cols::Cols{Tuple{}}; alg::Nothing, lt::typeof(isless), by::typeof(identity), rev::Bool, order::Base.Order.ForwardOrdering) at sort.jl:579
└ @ DataFrames ~/.julia/packages/DataFrames/BM4OQ/src/abstractdataframe/sort.jl:579
4×3 DataFrame
 Row │ x      y      z
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      1      4
   2 │     1      2      1
   3 │     2      1      3
   4 │     2      2      2

We think that this behavior is incorrect, and in the future sorting on no
columns will produce a result identical to the input data frame (no sorting
will be performed).
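
Until that change lands, one way to avoid depending on the deprecated behavior is to always name the sort columns explicitly. This is a small sketch of that suggestion, not an official recommendation from the package:

```julia
using DataFrames

df = DataFrame(x=[1, 2, 2, 1], y=[2, 2, 1, 1], z=1:4)

# Explicitly request a lexicographic sort on all columns
sorted_all = sort(df, All())

# Or list exactly the columns that should drive the ordering
sorted_xy = sort(df, [:x, :y])
```

In this data frame the pairs of :x and :y values are unique, so both calls produce the same row order.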

Conclusions

This post concludes a series of reviews of new features in DataFrames.jl release
1.3. I have not covered everything that was introduced. A complete list of
changes can be found in the NEWS.md file.

I hope you will enjoy using the package! Happy data wrangling in year 2022!

New features in DataFrames.jl 1.3: part 4

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/12/31/leftjoin.html

Introduction

This post continues the presentation of new features added in DataFrames.jl 1.3. This time I will discuss the leftjoin! function.

The post was written under Julia 1.7.0 and DataFrames.jl 1.3.1.
For performance comparison I have used R 4.1.2 and data.table 1.14.2.
Both in R and Julia I ran the computations on 4 threads.

An in place left join

Before release 1.3, DataFrames.jl already offered a rich set of efficient
join functions:
innerjoin, leftjoin, rightjoin, outerjoin, semijoin, antijoin,
and crossjoin.

However, they all have a common limitation: their result is a freshly allocated
data frame.

A common usage scenario in practice is that we would like to add some
new columns to an existing table in-place. This is more efficient and uses less
memory (which is relevant if we work with very large data frames).

Since DataFrames.jl 1.3 this option is available with the addition of the
leftjoin! function. If you run leftjoin!(df1, df2; on=...) then the contract
is that the df1 data frame is updated in-place with columns coming from df2
based on matching rows of both data frames using the columns passed in the on
keyword argument.

It is important to remember that the design of leftjoin! assumes that the
columns of df1 are left unchanged (this is crucial for performance of the
operation). However, this implies that each row in df1 must have at most one
match in df2. Otherwise, leftjoin! would not be able to execute the
operation in-place since new rows would need to be added to df1. If you have
matching duplicate rows in df2 then just use leftjoin.
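
The at-most-one-match requirement can be checked on a tiny made-up case. In the sketch below the key 2 appears twice in df2, so leftjoin returns an extra row, while leftjoin! is expected to throw (the exact exception type and message are not asserted here).

```julia
using DataFrames

df1 = DataFrame(a=[1, 2, 3])
df2 = DataFrame(a=[2, 2], c=[10, 20])   # key 2 appears twice

# leftjoin allocates a new data frame, so it can repeat the row with a == 2
joined = leftjoin(df1, df2, on=:a)

# leftjoin! cannot add rows to its first argument, so this call should fail;
# a copy is used so that df1 itself is never touched
inplace_failed = try
    leftjoin!(copy(df1), df2, on=:a)
    false
catch
    true
end
```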

Here are two minimal examples of leftjoin!.

julia> using DataFrames

julia> using Random

julia> df1 = DataFrame(a=1:6, b=1:6)
6×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      4
   5 │     5      5
   6 │     6      6

julia> df2 = DataFrame(a=[2, 4, 6], c=1:3)
3×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     4      2
   3 │     6      3

julia> leftjoin!(df1, df2, on=:a)
6×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64?
─────┼───────────────────────
   1 │     1      1  missing
   2 │     2      2        1
   3 │     3      3  missing
   4 │     4      4        2
   5 │     5      5  missing
   6 │     6      6        3

julia> Random.seed!(1234);

julia> df1 = DataFrame(a=randperm(6), b=1:6)
6×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     3      1
   2 │     2      2
   3 │     6      3
   4 │     5      4
   5 │     1      5
   6 │     4      6

julia> df2 = DataFrame(a=shuffle!([2, 4, 6]), c=1:3)
3×2 DataFrame
 Row │ a      c
     │ Int64  Int64
─────┼──────────────
   1 │     4      1
   2 │     2      2
   3 │     6      3

julia> leftjoin!(df1, df2, on=:a)
6×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64?
─────┼───────────────────────
   1 │     3      1  missing
   2 │     2      2        2
   3 │     6      3        3
   4 │     5      4  missing
   5 │     1      5  missing
   6 │     4      6        1
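
To make the in-place contract concrete, the following sketch (with made-up data) checks that leftjoin! returns the very same object it was given, keeps the row order of df1, and only appends the new column.

```julia
using DataFrames

df1 = DataFrame(a=[3, 1, 2], b=[30, 10, 20])
df2 = DataFrame(a=[1, 2], c=["x", "y"])

# The return value is df1 itself, now carrying the extra column :c
result = leftjoin!(df1, df2, on=:a)
```

Rows of df1 without a match in df2 (here the row with a equal to 3) get missing in the new column.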

Performance benchmarks

Now let me run two performance benchmarks of DataFrames.jl against data.table.
In the benchmarks I use 32-bit integers to ensure comparability of memory
footprint of objects between R and Julia.

The first test is on a sorted key column. We start with Julia:

julia> df1 = DataFrame(a=Int32.(1:10^8));

julia> df2 = DataFrame(a=Int32.(1:10^8), x = true);

julia> @time leftjoin!(df1, df2, on=:a);
  2.867632 seconds (221.45 k allocations: 2.433 GiB, 6.47% gc time, 3.98% compilation time)

julia> df1 = DataFrame(a=Int32.(1:10^8));

julia> @time leftjoin!(df1, df2, on=:a);
  2.934633 seconds (150 allocations: 2.421 GiB, 8.19% gc time)

And now the data.table:

> library(data.table)
> df1 = data.table(a=1:10^8)
> df2 = data.table(a=1:10^8, x=TRUE)
> system.time(df1[df2, on = 'a', x := i.x])
   user  system elapsed
  8.067   1.106   5.679
> df1 = data.table(a=1:10^8)
> system.time(df1[df2, on = 'a', x := i.x])
   user  system elapsed
  9.305   1.184   6.652

As you can see for sorted data DataFrames.jl timings are competitive. Let us
now check shuffled data.

We start with DataFrames.jl:

julia> df1 = DataFrame(a=shuffle!(Int32.(1:10^8)));

julia> df2 = DataFrame(a=shuffle!(Int32.(1:10^8)), x = true);

julia> @time leftjoin!(df1, df2, on=:a);
 23.881552 seconds (175 allocations: 3.167 GiB, 1.43% gc time)

julia> df1 = DataFrame(a=Int32.(1:10^8));

julia> @time leftjoin!(df1, df2, on=:a);
 18.909113 seconds (175 allocations: 3.167 GiB, 1.40% gc time)

and now the timing for data.table:

> df1 = data.table(a=sample(1:10^8))
> df2 = data.table(a=sample(1:10^8), x=TRUE)
> system.time(df1[df2, on = 'a', x := i.x])
   user  system elapsed
 30.778   1.791  23.153
> df1 = data.table(a=sample(1:10^8))
> system.time(df1[df2, on = 'a', x := i.x])
   user  system elapsed
 30.586   1.695  22.893

Again the timing of DataFrames.jl is competitive.

(Let me stress here that this is just one set of examples and relative
performance of different packages can vary depending on the data and the
operating environment. The point of these tests is to show that currently
DataFrames.jl is not much worse than data.table in execution of joins, as this
was a performance bottleneck of DataFrames.jl in the past.)

Conclusions

In this post I have focused on a single function: leftjoin!.
The reason is that I believe that the addition of an in-place left join to
DataFrames.jl is quite significant, since it is needed in many data processing
scenarios, especially when working with large tables.

New features in DataFrames.jl 1.3: part 3

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/12/24/selection.html

Introduction

This post continues the presentation of new features added in DataFrames.jl 1.3.0. This time I will discuss what is new in indexing syntax.

The post was written under Julia 1.7.0 and DataFrames.jl 1.3.1.
When running the examples use the --depwarn=yes option when starting Julia.

Adding columns in views

Since DataFrames.jl 1.3 it is possible to add columns to views, which was a
long requested feature. Since, in general, a view can reorder and/or drop
columns, this feature is only allowed if the view was created with : as the
column selector (remember that when : is used as the column selector the view
always reflects the list of columns of its parent DataFrame). Here is an example:

julia> using DataFrames

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> dfv = @view df[[1,3], :]
2×1 SubDataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     3

julia> dfv[:, :b] = 4:5
4:5

julia> dfv
2×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64?
─────┼───────────────
   1 │     1       4
   2 │     3       5

julia> df
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64?
─────┼────────────────
   1 │     1        4
   2 │     2  missing
   3 │     3        5

Note that in column :b of df the rows that were filtered out of the view were
filled with missing.

As noted, creating new columns is not allowed if a column selector other than :
is passed when creating a view:

julia> dfv = @view df[[1,3], 1:2]
2×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64?
─────┼───────────────
   1 │     1       4
   2 │     3       5

julia> dfv[:, :c] = 4:5
ERROR: ArgumentError: creating new columns in a SubDataFrame that subsets columns of its parent data frame is disallowed

Additionally, it is allowed to replace columns in a view when the ! selector is
used (here it works for any view as we are not creating new columns):

julia> dfv.a = ["111", "113"]
2-element Vector{String}:
 "111"
 "113"

julia> df
3×2 DataFrame
 Row │ a    b
     │ Any  Int64?
─────┼──────────────
   1 │ 111        4
   2 │ 2    missing
   3 │ 113        5

As you can see, the values that were present in filtered-out rows are retained.
If the new values have a type that is not allowed by the current element type
of the column, an appropriate type promotion is performed. Here is another example:

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> dfv = @view df[[1,3], :]
2×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     3      6

julia> dfv.a = 11.0:12.0
11.0:1.0:12.0

julia> dfv.b = 'a':'b'
'a':1:'b'

julia> df
3×2 DataFrame
 Row │ a        b
     │ Float64  Any
─────┼──────────────
   1 │    11.0  a
   2 │     2.0  5
   3 │    12.0  b

Why does adding columns in views matter?

The huge benefit of allowing adding columns in views is as follows: we can make
all standard functions like insertcols!, select!, transform! work on
SubDataFrames. This is very useful if you want to perform some operation only
under some condition. Here is a simple example:

julia> df = DataFrame(x = -1:0.2:1)
11×1 DataFrame
 Row │ x
     │ Float64
─────┼─────────
   1 │    -1.0
   2 │    -0.8
   3 │    -0.6
   4 │    -0.4
   5 │    -0.2
   6 │     0.0
   7 │     0.2
   8 │     0.4
   9 │     0.6
  10 │     0.8
  11 │     1.0

julia> transform!(subset(df, :x => ByRow(>(0)), view=true), :x => ByRow(log))
5×2 SubDataFrame
 Row │ x        x_log
     │ Float64  Float64?
─────┼────────────────────
   1 │     0.2  -1.60944
   2 │     0.4  -0.916291
   3 │     0.6  -0.510826
   4 │     0.8  -0.223144
   5 │     1.0   0.0

julia> df
11×2 DataFrame
 Row │ x        x_log
     │ Float64  Float64?
─────┼─────────────────────────
   1 │    -1.0  missing
   2 │    -0.8  missing
   3 │    -0.6  missing
   4 │    -0.4  missing
   5 │    -0.2  missing
   6 │     0.0  missing
   7 │     0.2       -1.60944
   8 │     0.4       -0.916291
   9 │     0.6       -0.510826
  10 │     0.8       -0.223144
  11 │     1.0        0.0

In DataFrames.jl you have to do this in two steps: subset to a view and then
transform!. However, I hope that the DataFramesMeta.jl and
DataFrameMacros.jl packages in the coming releases will provide a nicer
syntax for this, allowing you to combine transformation and filtering in one step.
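
The two steps can also be written out separately, which makes it easier to see what each one does. This is a minimal sketch with made-up data; the names nonneg and :x_sqrt are chosen only for this illustration.

```julia
using DataFrames

df = DataFrame(x=[-4.0, 0.0, 9.0])

# Step 1: a view of the rows that satisfy the condition (all columns kept)
nonneg = subset(df, :x => ByRow(>=(0)), view=true)

# Step 2: transform the view; the new column is created in the parent df
transform!(nonneg, :x => ByRow(sqrt) => :x_sqrt)
```

Rows of df that are not part of the view get missing in the new column, so sqrt is never called on the negative value.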

Hard deprecation period for broadcasted assignment

Since Julia 1.7 is out, a long missing feature is now available: it is allowed
to add new columns to a data frame using broadcasting assignment with the
setproperty! (df.col) syntax:

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df.b .= 1
3-element Vector{Int64}:
 1
 1
 1

julia> df
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      1
   3 │     3      1

Views are also supported, in the way described earlier:

julia> dfv = view(df, [1, 3], :)
2×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     3      1

julia> dfv.c .= 2
2-element Vector{Int64}:
 2
 2

julia> df
3×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Int64?
─────┼───────────────────────
   1 │     1      1        2
   2 │     2      1  missing
   3 │     3      1        2

In essence, using setproperty! is made almost the same as using the ! row
selector and assignment. Why do I say almost? The only place where it is
inconsistent is broadcasted assignment to an existing column:

julia> df.a .= 10
┌ Warning: In the 1.4 release of DataFrames.jl this operation will allocate a new column instead of performing an in-place assignment. To perform an in-place assignment use `df[:, col] .= ...` instead.
│   caller = top-level scope at REPL[8]:1
└ @ Core REPL[8]:1
3-element Vector{Int64}:
 10
 10
 10

As you can see in the warning message, this inconsistency (that was known and
discussed for some time already) will be fixed in DataFrames.jl 1.4.
We have delayed this change for several releases in order to clearly
inform users about it in advance.

This change is not only about consistency, but also to make sure we do not
perform accidental conversions where users most likely do not expect them:

julia> df2 = DataFrame(x = 'a':'c')
3×1 DataFrame
 Row │ x
     │ Char
─────┼──────
   1 │ a
   2 │ b
   3 │ c

julia> df2.x .= 104
┌ Warning: In the 1.4 release of DataFrames.jl this operation will allocate a new column instead of performing an in-place assignment. To perform an in-place assignment use `df[:, col] .= ...` instead.
│   caller = top-level scope at REPL[18]:1
└ @ Core REPL[18]:1
3-element Vector{Char}:
 'h': ASCII/Unicode U+0068 (category Ll: Letter, lowercase)
 'h': ASCII/Unicode U+0068 (category Ll: Letter, lowercase)
 'h': ASCII/Unicode U+0068 (category Ll: Letter, lowercase)

julia> df2
3×1 DataFrame
 Row │ x
     │ Char
─────┼──────
   1 │ h
   2 │ h
   3 │ h

In the future, when DataFrames.jl 1.4 is released, instead you will get a data
frame with column :x having element type Int and storing three 104
values.
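
Code that should not depend on this change can state its intent with the row selector. The sketch below reuses the Char example: the : selector writes into the existing column (so 104 is converted to 'h'), while the ! selector replaces the column with a freshly allocated one. The variable name original is used only for this illustration.

```julia
using DataFrames

df2 = DataFrame(x='a':'c')
original = df2.x            # the vector currently stored in the data frame

# In-place broadcast: values are converted to the existing element type Char
df2[:, :x] .= 104
same_vector = df2.x === original
chars_after = copy(df2.x)

# Column replacement: a new Vector{Int} is allocated for :x
df2[!, :x] .= 104
```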

Conclusions

The changes I have described today are not something that a new user starts to
use on the first day of working with DataFrames.jl. However, once one has
learned the basics, more and more advanced queries are needed in practice.
Improvements in functionality and consistency of the design of the core
indexing mechanisms in DataFrames.jl are hopefully going to make these complex
requirements easier to meet.