Author Archives: Blog by Bogumił Kamiński

Knight’s tour puzzle

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/11/11/knights2.html

Introduction

Before I start I would like to make a small announcement. If you are interested
in the Julia language you are welcome to participate in a 4-day “Introduction
to Julia for Data Science” short course that is organized at MIT on Jan 17-20, 2023.
Everyone is invited. You can find a PDF with the schedule here.

Now we can get back to the usual blogging business.

After my recent post about the knight covering puzzle I was asked for
more puzzle-solving content. Therefore, today I want to show you
how the Knight’s tour problem can be cracked using Julia.

The code in this post was tested under Julia 1.8.2 and Plots.jl 1.35.0.

The puzzle

Consider a rectangular grid. We place a chess knight in one of the squares on
this grid. A chess knight can move two squares vertically and one square
horizontally, or two squares horizontally and one square vertically.

We want to find a sequence of moves of a knight so that it visits each
square on a grid exactly once and goes back to a starting position.

I will show you how to find a solution to this problem (or learn that the task
is impossible) for arbitrary grid sizes. Next, we will see the solution for
a standard chessboard that has 8 rows and 8 columns.

The code

In the solution the key object that we will track is a grid. It will
be a matrix storing consecutive moves of a knight. In this matrix a 0
entry means that the square has not been visited yet, and a positive entry
indicates the move number at which the square was visited. So number 1 is the
starting position of the knight.

To track location of the knight on a grid we will use a 2-tuple holding
current row and column location of the knight. It is called p in the code.

We first create the listoptions helper function. It takes grid and p
as arguments and returns a vector of possible moves of the knight from p
to squares that have not been visited yet. Here is its implementation:

listoptions(grid, p) =
    [p .+ d for d in ((1, 2), (-1, 2), (1, -2), (-1, -2),
                      (2, 1), (-2, 1), (2, -1), (-2, -1))
     if get(grid, p .+ d, -1) == 0]

Notice how nicely the get function works in this case. We use it to get
a -1 value in case p .+ d is not within bounds of grid (so such invalid
moves are discarded).
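
To see the bounds handling in isolation, here is a small sketch on a 3×3 grid.
The grid size and the marked square are arbitrary choices made for
illustration only:

```julia
# Same helper as above, repeated so that the snippet is self-contained.
listoptions(grid, p) =
    [p .+ d for d in ((1, 2), (-1, 2), (1, -2), (-1, -2),
                      (2, 1), (-2, 1), (2, -1), (-2, -1))
     if get(grid, p .+ d, -1) == 0]

grid = fill(0, 3, 3)
grid[2, 3] = 1                      # mark one square as visited

free = get(grid, (3, 2), -1)        # 0: inside the grid and not visited
visited = get(grid, (2, 3), -1)     # 1: inside the grid, already visited
outside = get(grid, (0, 2), -1)     # -1: out of bounds, so the default is returned

# From the corner (1, 1) the knight could jump to (2, 3) or (3, 2),
# but (2, 3) is already visited, so only (3, 2) is listed.
opts = listoptions(grid, (1, 1))
```

Because get falls back to -1 for out-of-bounds tuples, no explicit bounds
check is needed in the comprehension.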

It is time to present a key function that will handle the traversal of the
grid by the knight:

function knight_jump!(grid=fill(0, 8, 8), p=(1, 1), i=1)
    grid[p...] = i
    if i == length(grid)
        p1 = Tuple(findfirst(==(1), grid))
        return extrema(abs, p .- p1) == (1, 2) ? grid : nothing
    end
    v = listoptions(grid, p)
    sort!(v, by=np -> length(listoptions(grid, np)))
    for np in v
        knight_jump!(grid, np, i + 1) !== nothing && return grid
    end
    grid[p...] = 0
    return nothing
end

Let me explain how it works. The grid and p arguments were already
discussed. The extra i argument stores the move number. The function
returns grid in case it found a feasible solution and nothing if no
feasible solution is found. This invariant is crucial, as we will use it
to perform depth first search for a valid knight tour.

First we set the grid at location p to i to record the current placement
of the knight.

Next in i == length(grid) check we verify that we have hit the last free spot
on a grid. If this is the case we check if the current position p is
knight-jump away from the initial position of the knight (denoted by p1 in
the code). If this is the case we return grid. Otherwise the tour is invalid
and we return nothing.

If our tour is not finished yet we store in the v vector the possible moves
we can make next. Now a crucial part of the algorithm is applied. We sort v
using Warnsdorff’s rule, that is, we put the squares with the fewest
onward moves at the front of the vector. We then recursively visit them
and try to solve the puzzle. If we succeed, i.e. when the recursive call
to knight_jump! does not return nothing, we are done and return the result.
If we fail for all values in v, this means that an attempt to visit p was
an incorrect choice. In this case we need to reset grid so that in position
p it has 0 and return nothing (to signal a problem).
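
The ordering step can be looked at on its own. The sketch below repeats the
listoptions helper and applies the same sort! call to the candidates available
from square (3, 3) on an otherwise empty board. The starting square and the
names candidates and onward are chosen only for this illustration:

```julia
listoptions(grid, p) =
    [p .+ d for d in ((1, 2), (-1, 2), (1, -2), (-1, -2),
                      (2, 1), (-2, 1), (2, -1), (-2, -1))
     if get(grid, p .+ d, -1) == 0]

grid = fill(0, 8, 8)
p = (3, 3)
grid[p...] = 1          # the knight stands on p, so p counts as visited

# candidate squares ordered by the number of onward moves (fewest first)
candidates = listoptions(grid, p)
sort!(candidates, by=np -> length(listoptions(grid, np)))

# number of onward moves for each candidate, in the sorted order
onward = [length(listoptions(grid, np)) for np in candidates]
```

Squares close to the corner, such as (1, 2) and (2, 1), have only two onward
moves here, so they end up at the front of candidates.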

The result

Let us check how the solution looks on an 8×8 grid (which is the default in
our code):

julia> res = knight_jump!()
8×8 Matrix{Int64}:
  1  16  51  34   3  18  21  36
 50  33   2  17  52  35   4  19
 15  64  49  56  45  20  37  22
 32  55  44  63  48  53  42   5
 61  14  57  54  43  46  23  38
 28  31  62  47  58  41   6   9
 13  60  29  26  11   8  39  24
 30  27  12  59  40  25  10   7

First we see that indeed the solution was found (as we did not get nothing
from the call). Second, a visual inspection of the solution shows that indeed
the solution is correct.
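
Visual inspection can also be backed by a programmatic check. The helper
below, isclosedtour, is a hypothetical function that is not part of the
original code; it is a minimal sketch that reuses the same knight-jump test
as knight_jump! and is applied to the matrix printed above:

```julia
# Hypothetical helper: checks that grid encodes a closed knight's tour.
function isclosedtour(grid)
    # every move number from 1 to length(grid) must appear exactly once
    sort(vec(grid)) == 1:length(grid) || return false
    # squares ordered by move number
    moves = Tuple.(CartesianIndices(grid)[sortperm(vec(grid))])
    n = length(moves)
    # consecutive squares (including last -> first) must be a knight's jump apart
    return all(extrema(abs, moves[mod1(k + 1, n)] .- moves[k]) == (1, 2)
               for k in 1:n)
end

# The solution printed above, copied as a matrix literal.
res = [ 1 16 51 34  3 18 21 36
       50 33  2 17 52 35  4 19
       15 64 49 56 45 20 37 22
       32 55 44 63 48 53 42  5
       61 14 57 54 43 46 23 38
       28 31 62 47 58 41  6  9
       13 60 29 26 11  8 39 24
       30 27 12 59 40 25 10  7]

ok = isclosedtour(res)

# Swapping two move numbers breaks the tour.
broken = copy(res)
broken[1, 1], broken[1, 2] = broken[1, 2], broken[1, 1]
notok = isclosedtour(broken)
```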

Let us additionally visualize it to make the analysis easier. We first convert
the res matrix into the moves vector of consecutive knight locations stored
as tuples. Next we add first location to the end of this vector (as we have
a cycle). Finally we plot a chessboard with consecutive knight moves presented
by lines.

julia> using Plots

julia> moves = Tuple.(CartesianIndices(res)[sortperm(vec(res))])
64-element Vector{Tuple{Int64, Int64}}:
 (1, 1)
 (2, 3)
 (1, 5)
 ⋮
 (6, 3)
 (4, 4)
 (3, 2)

julia> push!(moves, first(moves))
65-element Vector{Tuple{Int64, Int64}}:
 (1, 1)
 (2, 3)
 (1, 5)
 ⋮
 (4, 4)
 (3, 2)
 (1, 1)

julia> plot(getindex.(moves, 1), getindex.(moves, 2);
            legend=false, size=(400, 400), color=:blue,
            marker=:o, markerstrokecolor=:blue,
            xlim=(0.5, 8.5), ylim=(0.5, 8.5),
            minorticks=2, minorgrid=true, grid=false,
            xticks=(0:9, [""; 'A':'H'; ""]), yticks=0:9,
            minorgridalpha=1.0, showaxis=false)

The resulting plot looks as follows:

[Figure: Knight's tour]

Homework

Since I have started the post with an announcement of a short course, let me
switch to lecturing mode for a moment. For this puzzle I have the following
exercises for you to train your Julia muscle:

  • Check if we ever need to backtrack in the algorithm as maybe we solve the
    problem on the first attempt thanks to Warnsdorff’s rule?
  • Measure the performance of the code.
  • Do you have ideas how its speed could be improved? (Hint: think how you can
    reduce the number of allocations and avoid sorting.)
  • Check what would happen if we did not use Warnsdorff’s rule. Two natural
    candidate rules are: visiting squares in random order and visiting squares in
    the order produced by the listoptions function.
  • The proposed solution uses recursion. Since Julia has a recursion depth limit
    the code will not work as expected for large grids. First, find this limit.
    Second, think how you could rewrite the program so that it avoids using
    recursion.

Conclusions

I hope you enjoyed the puzzle and the presented solution. If you get stuck on
any parts of the homework we can discuss them in January, 2023 at MIT during
the short course or just contact me on Julia Discourse or
Julia Slack and I will gladly answer your questions.

DataFrames.jl 1.4: operation specification syntax news

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/11/04/minilanguage.html

Introduction

Operation specification syntax in DataFrames.jl is used to pass information
about how functions like select, transform, or combine should process
data frames or grouped data frames.

If you have never used it I recommend that you first read an
introductory post about it.

Today I want to discuss what additions to operation specification language we
made in DataFrames.jl 1.4.

The presented code was tested under Julia 1.8.2 and DataFrames.jl 1.4.2.

Preliminaries

Operation specification syntax is built around the ETL (extract-transform-load)
process, which you might know from data integration. Its general form is:

[source columns] => [operation] => [target column names]

Here is a simple example:

julia> using DataFrames

julia> df = DataFrame(customer=[1, 1, 2, 2, 2, 3],
                      transaction_id=1:6,
                      volume=[2, 3, 1, 4, 5, 9])
6×3 DataFrame
 Row │ customer  transaction_id  volume
     │ Int64     Int64           Int64
─────┼──────────────────────────────────
   1 │        1               1       2
   2 │        1               2       3
   3 │        2               3       1
   4 │        2               4       4
   5 │        2               5       5
   6 │        3               6       9

julia> combine(df, :volume => sum => :total_volume)
1×1 DataFrame
 Row │ total_volume
     │ Int64
─────┼──────────────
   1 │           24

julia> combine(groupby(df, :customer), :volume => sum => :total_volume)
3×2 DataFrame
 Row │ customer  total_volume
     │ Int64     Int64
─────┼────────────────────────
   1 │        1             5
   2 │        2            10
   3 │        3             9

In these examples we first aggregated volume column to get total volume
for the whole data frame, and next we computed total volume per customer.

In both cases we used the same operation specification syntax:

:volume => sum => :total_volume

Which says:

  • extract column :volume;
  • transform it using sum;
  • load it to :total_volume column.
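
It may help to see that such a specification is an ordinary Julia value built
from nested Pair objects. The short sketch below uses only Base Julia; the
variable names are illustrative:

```julia
spec = :volume => sum => :total_volume

# `=>` is right-associative, so the expression above is parsed as
# :volume => (sum => :total_volume)
source = spec.first              # the "extract" part
operation = spec.second.first    # the "transform" part
target = spec.second.second      # the "load" part
```

Since the specification is a plain value, it can be stored in a variable and
passed to combine, select, or transform later.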

However, there are cases when there is no natural source column on which we
might want to perform computations. One such common case is getting the
number of rows per group. For this special case we have a short syntax nrow
or nrow => [target column] to compute the number of rows in a data frame or in
each group of a data frame. Notice that there is no extract part in this
syntax as the number of rows is not a property of a specific column, but of a
data frame as a whole.

Here is an example of how it works:

julia> combine(df, nrow)
1×1 DataFrame
 Row │ nrow
     │ Int64
─────┼───────
   1 │     6

julia> combine(groupby(df, :customer), nrow => :transactions_per_customer)
3×2 DataFrame
 Row │ customer  transactions_per_customer
     │ Int64     Int64
─────┼─────────────────────────────────────
   1 │        1                          2
   2 │        2                          3
   3 │        3                          1

There are three other common operations that have the same nature:

  • adding a column with row number;
  • adding a column with group number (makes sense only for working with grouped
    data frame);
  • computing fraction of rows (also for grouped data frames only).

In DataFrames.jl 1.4 these three operations are now supported through
eachindex, groupindices, and proprow operations. Let me show you how they
work.

Adding a column with row number

This is the simplest functionality. The eachindex operation adds row number
in a data frame or per group in a grouped data frame. Here is an example:

julia> combine(df, eachindex, :transaction_id)
6×2 DataFrame
 Row │ eachindex  transaction_id
     │ Int64      Int64
─────┼───────────────────────────
   1 │         1               1
   2 │         2               2
   3 │         3               3
   4 │         4               4
   5 │         5               5
   6 │         6               6

julia> combine(groupby(df, :customer),
               eachindex => :transaction_number,
               :transaction_id)
6×3 DataFrame
 Row │ customer  transaction_number  transaction_id
     │ Int64     Int64               Int64
─────┼──────────────────────────────────────────────
   1 │        1                   1               1
   2 │        1                   2               2
   3 │        2                   1               3
   4 │        2                   2               4
   5 │        2                   3               5
   6 │        3                   1               6

Note that when we worked on a whole data frame we got the same column as
:transaction_id. However, when working on a grouped data frame we got
transaction numbers per customer.

Adding a column with group number

The eachindex operation adds a row number within each group. So it is natural
to ask for a function that produces a group number. The groupindices operation
is designed to achieve this goal. Here is an example:

julia> combine(groupby(df, :customer), groupindices)
3×2 DataFrame
 Row │ customer  groupindices
     │ Int64     Int64
─────┼────────────────────────
   1 │        1             1
   2 │        2             2
   3 │        3             3

julia> combine(groupby(df, :customer), groupindices => :customer_id, :customer)
6×2 DataFrame
 Row │ customer  customer_id
     │ Int64     Int64
─────┼───────────────────────
   1 │        1            1
   2 │        1            1
   3 │        2            2
   4 │        2            2
   5 │        2            2
   6 │        3            3

Note that in our example the produced numbers are the same as values in the
customer column. However, in general it does not have to be the case.
Let us subset the grouped data frame before the operation:

julia> gdf = groupby(df, :customer)[[3, 2]]
GroupedDataFrame with 2 groups based on key: customer
First Group (1 row): customer = 3
 Row │ customer  transaction_id  volume
     │ Int64     Int64           Int64
─────┼──────────────────────────────────
   1 │        3               6       9
⋮
Last Group (3 rows): customer = 2
 Row │ customer  transaction_id  volume
     │ Int64     Int64           Int64
─────┼──────────────────────────────────
   1 │        2               3       1
   2 │        2               4       4
   3 │        2               5       5

julia> combine(gdf, groupindices)
2×2 DataFrame
 Row │ customer  groupindices
     │ Int64     Int64
─────┼────────────────────────
   1 │        3             1
   2 │        2             2

As you can see groupindices returns the number of a group within the grouped
data frame.
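
The error message shown further below mentions that groupindices can also be
called directly on a GroupedDataFrame. As a minimal sketch (assuming
DataFrames.jl is installed; the variable names are illustrative), such a call
returns one group number per row of the parent data frame:

```julia
using DataFrames

df = DataFrame(customer=[1, 1, 2, 2, 2, 3], transaction_id=1:6)
gdf = groupby(df, :customer)

# one group number for every row of df
all_groups = groupindices(gdf)

# after subsetting, group numbers refer to positions in the subsetted
# grouped data frame; rows that belong to no kept group get missing
sub_groups = groupindices(gdf[[3, 2]])
```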

As I have mentioned earlier this operation is not supported for data frames:

julia> combine(df, groupindices)
ERROR: ArgumentError: groupindices only supports `GroupedDataFrame` as an
argument. Additionally it can be used in transformation functions (combine,
select, etc.) when processing a `GroupedDataFrame`, using the syntax
`groupindices => target_col_name` or just `groupindices`

Computing the fraction of rows per group

DataFrames.jl has supported the nrow convenience function for a long time, as
counting rows is a common use case. An almost as frequent use case is
to get a fraction of rows per group. This can be achieved using the proprow
operation:

julia> combine(groupby(df, :customer), nrow, proprow)
3×3 DataFrame
 Row │ customer  nrow   proprow
     │ Int64     Int64  Float64
─────┼───────────────────────────
   1 │        1      2  0.333333
   2 │        2      3  0.5
   3 │        3      1  0.166667

julia> combine(groupby(df, :customer), nrow => :count, proprow => :proportion)
3×3 DataFrame
 Row │ customer  count  proportion
     │ Int64     Int64  Float64
─────┼─────────────────────────────
   1 │        1      2    0.333333
   2 │        2      3    0.5
   3 │        3      1    0.166667
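
As a sanity check, the proportions can be compared with a manual computation
from the nrow column. This is only a sketch assuming DataFrames.jl is
installed; manual is an illustrative name:

```julia
using DataFrames

df = DataFrame(customer=[1, 1, 2, 2, 2, 3])
res = combine(groupby(df, :customer), nrow, proprow)

# group sizes divided by the total number of rows
manual = res.nrow ./ nrow(df)

same = res.proprow ≈ manual
```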

Similarly to groupindices the proprow operation is only supported for
grouped data frames:

julia> combine(df, proprow)
ERROR: ArgumentError: proprow can only be used in transformation functions
(combine, select, etc.) when processing a `GroupedDataFrame`, using the syntax
`proprow => target_col_name` or just `proprow`

Conclusions

I hope you will find the eachindex, groupindices and proprow operations
useful in your daily data wrangling with DataFrames.jl.

DataFrames.jl indexing rules

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/10/28/indexing.html

Introduction

In DataFrames.jl 1.4 release we reached the target state of data frame indexing
that we aimed for when we designed 1.0 release.

In this post I want to present the mental model you should have when
thinking about indexing into a data frame. Learning this is important
because we needed to extend the standard indexing rules that are defined
for arrays in Julia to cover all scenarios that users need when working with
data frames.

I will focus on working with a single column of a data frame as this is the most
common indexing scenario.

The code was tested under Julia 1.8.2 and DataFrames.jl 1.4.2.

Rule 1: indexing always requires passing both row and column selector

When you index into a data frame you must pass exactly two dimensions: a row
selector and a column selector like df[row_selector, column_selector].
When indexing you can think of a data frame as a matrix.

Rule 2: all indexing that works on matrices works the same way for data frames

The benefit of the rule that a data frame follows matrix indexing is that it is
easy to remember. If you know how matrix indexing works then it all translates
directly to a data frame.

Here are some examples of extracting a column or its fragment from a data frame.

julia> using DataFrames

julia> df = DataFrame(a=1:3, b=4:6)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6

julia> df[1, 1]
1

julia> df[1:2, 1]
2-element Vector{Int64}:
 1
 2

julia> df[:, 2]
3-element Vector{Int64}:
 4
 5
 6

(As I commented in the Introduction in this post I concentrate on getting a
single column from a data frame so the second index was always an integer.)

One important rule of Julia Base indexing is that when you get a column or its
part from a matrix a copy is made, except if you extract a single element. This
is exactly what happens for a data frame: df[:, 2] is a copy of the second
column stored in df. You can check it by writing e.g.:

julia> df[:, 2] === eachcol(df)[2]
false

julia> df[:, 2] == eachcol(df)[2]
true

We see that we get the same data, but not the same object.

If you instead want a view of a column without copying the data, you can use
@view exactly like you would do it with matrices:

julia> @view df[1, 1]
0-dimensional view(::Vector{Int64}, 1) with eltype Int64:
1

julia> @view df[1:2, 1]
2-element view(::Vector{Int64}, 1:2) with eltype Int64:
 1
 2

julia> @view df[:, 2]
3-element view(::Vector{Int64}, :) with eltype Int64:
 4
 5
 6

As you can see, again all worked as if df were a matrix.
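
For comparison, here is the same copy versus view distinction on a plain
matrix, using only Base Julia. The matrix m is made up for this illustration:

```julia
m = [1 4; 2 5; 3 6]

col = m[:, 2]          # indexing with : makes a copy of the column
col[1] = 100           # changes only the copy
after_copy = m[1, 2]   # the matrix is untouched

v = @view m[:, 2]      # a view shares memory with the matrix
v[1] = 100             # writes through to m
after_view = m[1, 2]
```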

The same rules that worked for getting data from a data frame work for setting
data in a data frame. You have two options here: normal assignment and
broadcasted assignment.

julia> df[1:2, 1] = [11, 12]
2-element Vector{Int64}:
 11
 12

julia> df[:, 2] .= 100
3-element view(::Vector{Int64}, :) with eltype Int64:
 100
 100
 100

julia> df
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │    11    100
   2 │    12    100
   3 │     3    100

These operations, again, work exactly like for matrices. In particular they
are in-place, that is, no memory is allocated when performing them. The data
is written into an already allocated column. This rule is important as it means
that by such assignment you cannot change the element type of a column:

julia> df[:, 1] = ["a", "b", "c"]
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64

julia> df[:, 2] .= "x"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64

In summary, standard array indexing works the same way for matrices and for data
frames.
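
The same element type restriction can be observed on a plain matrix. The
sketch below uses only Base Julia, and the names m and failed are illustrative:

```julia
m = [1 2; 3 4]

# In-place broadcasted assignment converts the value to the element type
# of the matrix, so m remains a Matrix{Int}.
m[:, 1] .= 10.0

# A String cannot be converted to Int, so the assignment throws.
failed = try
    m[:, 2] .= "x"
    false
catch err
    err isa MethodError
end
```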

In what follows I describe the extensions to the indexing rules that are
DataFrames.jl specific.

Rule 3: you can use strings or symbols to pass column names

The first extension is related to the fact that standard matrices have to be
indexed by integers. In DataFrames.jl you can alternatively use string or symbol
to select a column when indexing. Here are some basic examples:

julia> df[1:2, "a"]
2-element Vector{Int64}:
 11
 12

julia> df[:, :b] .= 1000
3-element view(::Vector{Int64}, :) with eltype Int64:
 1000
 1000
 1000

This is intuitive so far. However, this rule leads to one extension. It is
related to the fact that you can pass a column name that does not exist yet
in a data frame. In this case if you pass : as a row selector a new column
in a data frame will be allocated (that is, the source data will be
copied) and the new column will be added at the end of the data frame:

julia> df[:, "d"] = [-1, -2, -3]
3-element Vector{Int64}:
 -1
 -2
 -3

julia> df
3×3 DataFrame
 Row │ a      b      d
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │    11   1000     -1
   2 │    12   1000     -2
   3 │     3   1000     -3

This rule has two additional special cases. The first is when a data frame
has no columns yet. In such a situation you can add any vector to a data frame:

julia> df2 = DataFrame()
0×0 DataFrame

julia> df2[:, :x] = [1, 2]
2-element Vector{Int64}:
 1
 2

julia> df2
2×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2

What is non-standard with this rule? We allow changing the number of rows
in a data frame from zero to whatever is needed. For a data frame having some
columns changing their number of rows is not allowed with indexing.

The second special case is when you try to create a new column in a view of
a data frame. It is not allowed in general, but if the view was created with
: as a column selector we accept it (the reason is that in this case view
subsets only rows, but does not change columns; this is a common use case
in practice). Here is an example:

julia> df
3×3 DataFrame
 Row │ a      b      d
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │    11   1000     -1
   2 │    12   1000     -2
   3 │     3   1000     -3

julia> dfv = @view df[[3, 1], :]
2×3 SubDataFrame
 Row │ a      b      d
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     3   1000     -3
   2 │    11   1000     -1

julia> dfv[:, :e] = [1, 2]
2-element Vector{Int64}:
 1
 2

julia> dfv
2×4 SubDataFrame
 Row │ a      b      d      e
     │ Int64  Int64  Int64  Int64?
─────┼─────────────────────────────
   1 │     3   1000     -3       1
   2 │    11   1000     -1       2

julia> df
3×4 DataFrame
 Row │ a      b      d      e
     │ Int64  Int64  Int64  Int64?
─────┼──────────────────────────────
   1 │    11   1000     -1        2
   2 │    12   1000     -2  missing
   3 │     3   1000     -3        1

As you can see in this case a new column gets missing for rows that are not
present in the dfv data frame.

Rule 4: You can use Not to negate row selection

This is a pretty simple rule. Instead of selecting rows, you can use Not
to say which rows you want to drop when indexing:

julia> df
3×4 DataFrame
 Row │ a      b      d      e
     │ Int64  Int64  Int64  Int64?
─────┼──────────────────────────────
   1 │    11   1000     -1        2
   2 │    12   1000     -2  missing
   3 │     3   1000     -3        1

julia> df[Not(2), :d]
2-element Vector{Int64}:
 -1
 -3

Rule 5: There is a special ! row selector

The special ! row selector selects all rows similarly to :, but it has
a different behavior in relation to column allocation.

When extracting the data from a data frame if you use ! no copying of data is
made. Instead, you just get a column as it is stored in the source data frame
(or an appropriate view if you work with a view of a data frame):

julia> df[!, 1]
3-element Vector{Int64}:
 11
 12
  3

julia> df[!, 1] === eachcol(df)[1]
true

julia> dfv[!, 1]
2-element view(::Vector{Int64}, [3, 1]) with eltype Int64:
  3
 11

The reason why ! is allowed when extracting a column from a data frame
is performance. In general it is not safe to use !. You should prefer :
as copying data (in R-like style) is safer. However, in some cases, when your
operations are memory-bound or you need to maximize performance, you have !
at hand to avoid unnecessary operations.

For writing data into a data frame the ! selector also behaves differently
than :. Recall that : is always in-place. Instead, ! stores a fresh column
in a data frame.

julia> df
3×4 DataFrame
 Row │ a      b      d      e
     │ Int64  Int64  Int64  Int64?
─────┼──────────────────────────────
   1 │    11   1000     -1        2
   2 │    12   1000     -2  missing
   3 │     3   1000     -3        1

julia> df[!, "a"] = ["a", "b", "c"]
3-element Vector{String}:
 "a"
 "b"
 "c"

julia> dfv[!, "e"] .= "x"
2-element Vector{String}:
 "x"
 "x"

julia> df
3×4 DataFrame
 Row │ a       b      d      e
     │ String  Int64  Int64  Any
─────┼───────────────────────────────
   1 │ a        1000     -1  x
   2 │ b        1000     -2  missing
   3 │ c        1000     -3  x

How the fresh column is stored depends on the type of data frame you use and the
type of operation:

  • if you use standard assignment on a data frame (df[!, "a"] = ["a", "b", "c"])
    then source vector is not copied, but is stored as-is (so this kind of storage
    is the fastest way to store a column in a data frame).
  • if you use broadcasted assignment (df[!, "a"] .= ["a", "b", "c"]) or assign
    into a view of a data frame (dfv[!, "a"] = ["x", "x"]) then a new column is
    freshly allocated (this kind of behavior is needed if you want to change
    element type of the column already present in a data frame, but want a copy
    of the source vector to be made for safety reasons).

So for assignment with ! the basic rule to remember is that it replaces the
existing column in a data frame. Then the additional rule is that normal
assignment on a data frame is non-copying, and broadcasted assignment or
using a view of a data frame allocates a copy.
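
The difference between the non-copying and the copying variants can be checked
with ===. This is a minimal sketch assuming DataFrames.jl is installed; the
column names stored, copied, and bcast are made up for the example:

```julia
using DataFrames

x = [1, 2, 3]
df = DataFrame()

df[!, :stored] = x     # normal assignment with !: x itself becomes the column
df[:, :copied] = x     # new column created with : is freshly allocated
df[!, :bcast] .= x     # broadcasted assignment with ! also allocates

stored_is_x = df.stored === x
copied_is_x = df.copied === x
bcast_is_x = df.bcast === x

# Mutating x is visible only through the column that aliases it.
x[1] = 100
first_values = (df.stored[1], df.copied[1], df.bcast[1])
```

This aliasing is the reason why the non-copying form is fast but should be
used with care.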

These rules were designed to cover all possible scenarios that users might need
when working with columns of a data frame.

Rule 6: property access works the same as if you used ! as row selector

With this rule we are getting to the point that changed in DataFrames.jl 1.4
(but was planned and announced much earlier).

When you write df.a it is always treated the same as df[!, "a"], for
extracting a column from a data frame, for assignment, and for broadcasted
assignment.

Since DataFrames.jl 1.4 this is all you need to remember about property access
to a data frame. We decided on this rule as it is easy to remember and it does
not add any new concepts or exceptions for users to learn.

However, unfortunately, this is the place where we had to deviate from how
property access works in Base Julia. Let me explain the issue step by step.

If you write df.a = ["a", "b", "c"] then you expect that column a in df
data frame is replaced by ["a", "b", "c"]. And this is what happens. Recall
that this is what df[!, "a"] = ["a", "b", "c"] also does.

Now this means that, if we want df.a .= ["a", "b", "c"] to work the same as
df[!, "a"] .= ["a", "b", "c"], then both operations replace column a in df.

And here we have a slight inconsistency, as in Base Julia users would expect
that df.a .= ["a", "b", "c"] would work in-place, like
df[:, "a"] .= ["a", "b", "c"]. This is not the case.
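
To make the Base Julia expectation concrete, here is what broadcasted
assignment through property access does for a NamedTuple. Only Base Julia is
used, and holder is an illustrative name:

```julia
holder = (a = [1, 2, 3],)
before = holder.a

# In Base Julia this writes into the existing vector; nothing is replaced.
holder.a .= [10, 20, 30]

same_object = holder.a === before
```

The vector stored in holder is the same object before and after the
operation, which is the in-place behavior that df.a .= ... deliberately does
not follow.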

Therefore you need to remember that in DataFrames.jl if you use property access
syntax for setting data it always replaces the column: both when doing assignment
and broadcasted assignment.

The reason why we decided on this design is twofold. First is learning. We have
a simple rule: ! selector and property access work the same way always.
The second reason is convenience. Most users preferred the following operation:

julia> df3 = DataFrame(c=1:3)
3×1 DataFrame
 Row │ c
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df3.c .= 'x'
3-element Vector{Char}:
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
 'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)

julia> df3
3×1 DataFrame
 Row │ c
     │ Char
─────┼──────
   1 │ x
   2 │ x
   3 │ x

to replace column c with a vector of 'x' values. Note that instead if you
use : as row selector you get a bit surprising result:

julia> df3 = DataFrame(c=1:3)
3×1 DataFrame
 Row │ c
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df3[:, "c"] .= 'x'
3-element view(::Vector{Int64}, :) with eltype Int64:
 120
 120
 120

julia> df3
3×1 DataFrame
 Row │ c
     │ Int64
─────┼───────
   1 │   120
   2 │   120
   3 │   120

The reason for this behavior is that Char is silently converted to Int:

julia> convert(Int, 'x')
120
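
The same silent conversion happens for a plain vector in Base Julia, which
shows that this is not specific to data frames (v is an illustrative name):

```julia
v = [1, 2, 3]

# In-place broadcasted assignment converts 'x' to the element type Int.
v .= 'x'
```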

What is the general benefit of the behavior we adopted? When you write
df[!, "a"] .= 'x' or df.a .= 'x' you are sure that as a result you will
have a freshly allocated column "a" in df containing 'x' values as all
its elements, independent of the fact if column "a" was already present
in the data frame or not. So this is like git push --force operation.
It is guaranteed to succeed no matter what the starting state of the data
frame was (of course assuming that the object on the right hand side supports
broadcasting and has proper dimensions).

Why has it taken us so long to reach this state?

The reason why we landed with these rules only in DataFrames.jl 1.4 is that
before Julia 1.7 it was not possible to support the behavior we wanted.
Therefore we had to wait until Julia 1.7 was released to achieve consistency
between property access and ! row selector behavior in DataFrames.jl in all
cases.

So people porting code from DataFrames.jl earlier than 1.4 or from Julia earlier
than 1.7 will notice a change that previously df.col .= value was in-place,
and currently it allocates a new column. We have considered the risks that this
change will be problematic for users, and the conclusion was that:

  • In a vast majority of code this change will not affect the result and users
    will not even notice this change.
  • It might affect the code that is performance critical (as an extra copy is now
    made). However, if this is the case it is easy to fix the issue (and it will
    not cause the code to be broken).
  • It might affect the code that relied on the fact that previously a conversion
    during in-place assignment was made (and e.g. relied on the fact that such
    assignment does not change element type of the column). In this case such code
    would indeed be broken. In this situation the conclusion was that such cases,
    although possible in practice, are most likely quite rare and if present would
    affect only experienced Julia users who have a good grasp of conversion upon
    assignment rules in Base Julia. We concluded that such users would be able
    to identify the problematic cases and change df.col .= value to
    df[:, :col] .= value in their code to get back the old behavior.

Conclusions

I hope you found this post useful in building your understanding of how
the details of indexing in DataFrames.jl work and what their intended
use cases are.

Additionally, I wanted to present the mental process we went through when we
were making hard design decisions in the DataFrames.jl development team. In this
process we had to balance three things:

  • user convenience (especially taking into account the fact that target audience
    of DataFrames.jl are data scientists, who sometimes are not computer science
    experts);
  • internal (within DataFrames.jl) consistency of rules that one needs to learn;
  • minimization of cases when we diverge from what Base Julia defines for similar
    objects (the challenge was that data frame is neither a matrix nor a struct,
    and has different design requirements, but we borrow the syntax from both).

In this post I have not covered all the details of DataFrames.jl indexing rules.
If you want to learn about the indexing behavior in every supported scenario
please check the Indexing section of the DataFrames.jl manual.