Tips and tricks of broadcasting in DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/07/07/dfbroadcast.html

Introduction

Broadcasting, a.k.a. the . operator, is a powerful feature of Julia that allows you to
easily vectorize any function. Today I want to write about some common broadcasting
patterns that can be used in DataFrames.jl.

The post was written under Julia 1.9.2 and DataFrames.jl 1.5.0.

Conditional replacement of values in a data frame

Let us create a data frame with some random contents:

julia> using DataFrames

julia> using Random

julia> df = DataFrame(rand(5, 6), :auto)
5×6 DataFrame
 Row │ x1         x2         x3         x4        x5        x6
     │ Float64    Float64    Float64    Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ 0.364225   0.690894   0.240867   0.720774  0.331573  0.549766
   2 │ 0.225226   0.241412   0.0793279  0.418206  0.775367  0.35275
   3 │ 0.19913    0.0633375  0.767805   0.280096  0.721995  0.917259
   4 │ 0.708132   0.230088   0.702677   0.947402  0.928979  0.66101
   5 │ 0.0267573  0.0122425  0.549734   0.331788  0.32658   0.00476749

Assume we want to get a new data frame that has true when the value stored
in the cell is greater than 0.5 and false otherwise. This is easy.
We just broadcast the > operator:

julia> df .> 0.5
5×6 DataFrame
 Row │ x1     x2     x3     x4     x5     x6
     │ Bool   Bool   Bool   Bool   Bool   Bool
─────┼──────────────────────────────────────────
   1 │ false   true  false   true  false   true
   2 │ false  false  false  false   true  false
   3 │ false  false   true  false   true   true
   4 │  true  false   true   true   true   true
   5 │ false  false   true  false  false  false

Now assume we want to replace all values greater than 0.5 with 0.5 and
keep the lower values untouched. This can be done with ifelse:

julia> ifelse.(df .> 0.5, 0.5, df)
5×6 DataFrame
 Row │ x1         x2         x3         x4        x5        x6
     │ Float64    Float64    Float64    Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ 0.364225   0.5        0.240867   0.5       0.331573  0.5
   2 │ 0.225226   0.241412   0.0793279  0.418206  0.5       0.35275
   3 │ 0.19913    0.0633375  0.5        0.280096  0.5       0.5
   4 │ 0.5        0.230088   0.5        0.5       0.5       0.5
   5 │ 0.0267573  0.0122425  0.5        0.331788  0.32658   0.00476749

Or with clamp:

julia> clamp.(df, -Inf, 0.5)
5×6 DataFrame
 Row │ x1         x2         x3         x4        x5        x6
     │ Float64    Float64    Float64    Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ 0.364225   0.5        0.240867   0.5       0.331573  0.5
   2 │ 0.225226   0.241412   0.0793279  0.418206  0.5       0.35275
   3 │ 0.19913    0.0633375  0.5        0.280096  0.5       0.5
   4 │ 0.5        0.230088   0.5        0.5       0.5       0.5
   5 │ 0.0267573  0.0122425  0.5        0.331788  0.32658   0.00476749

Similarly we could clamp values to the [0.1, 0.9] interval:

julia> clamp.(df, 0.1, 0.9)
5×6 DataFrame
 Row │ x1        x2        x3        x4        x5        x6
     │ Float64   Float64   Float64   Float64   Float64   Float64
─────┼────────────────────────────────────────────────────────────
   1 │ 0.364225  0.690894  0.240867  0.720774  0.331573  0.549766
   2 │ 0.225226  0.241412  0.1       0.418206  0.775367  0.35275
   3 │ 0.19913   0.1       0.767805  0.280096  0.721995  0.9
   4 │ 0.708132  0.230088  0.702677  0.9       0.9       0.66101
   5 │ 0.1       0.1       0.549734  0.331788  0.32658   0.1
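As a quick sanity check, here is a minimal sketch showing that the ifelse and clamp formulations of capping agree. It uses a small hand-written data frame (the column names a and b and the values are my own illustration, not the random data above) and adds min as a third spelling of the same cap:

```julia
using DataFrames

# Small fixed data frame so the result is reproducible (illustrative values).
df = DataFrame(a = [0.2, 0.8, 0.5], b = [0.9, 0.1, 0.6])

capped_ifelse = ifelse.(df .> 0.5, 0.5, df)   # explicit conditional replacement
capped_clamp  = clamp.(df, -Inf, 0.5)         # clamp with no lower bound
capped_min    = min.(df, 0.5)                 # capping from above is just a minimum

# All three produce the same data frame.
capped_ifelse == capped_clamp == capped_min
```

Which form to use is mostly a matter of taste; ifelse is the most general, since the replacement value does not have to be the threshold.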

Importantly, we do not need to keep the element type of the source column fixed.
Assume that we want to set values greater than 0.5 to missing:

julia> ifelse.(df .> 0.5, missing, df)
5×6 DataFrame
 Row │ x1               x2               x3               x4              x5              x6
     │ Float64?         Float64?         Float64?         Float64?        Float64?        Float64?
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       0.364225   missing                0.240867   missing               0.331573  missing
   2 │       0.225226         0.241412         0.0793279        0.418206  missing               0.35275
   3 │       0.19913          0.0633375  missing                0.280096  missing         missing
   4 │ missing                0.230088   missing          missing         missing         missing
   5 │       0.0267573        0.0122425  missing                0.331788        0.32658         0.00476749

Note that the operation performed an automatic promotion of column element types.
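To see the promotion directly, one can compare the element type of a column before and after the operation. This is a small sketch on a hand-written data frame (names and values are illustrative assumptions):

```julia
using DataFrames

# Each column contains one value above the 0.5 threshold.
df = DataFrame(a = [0.2, 0.8], b = [0.9, 0.1])

masked = ifelse.(df .> 0.5, missing, df)

eltype(df.a)        # element type of the source column
eltype(masked.a)    # element type after introducing missing
```

The source data frame df is not modified; only the newly created masked data frame has the wider column types.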

As a final operation consider taking sign(log(x) + 1) on our data frame:

julia> sign.(log.(df) .+ 1)
5×6 DataFrame
 Row │ x1       x2       x3       x4       x5       x6
     │ Float64  Float64  Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────
   1 │    -1.0      1.0     -1.0      1.0     -1.0      1.0
   2 │    -1.0     -1.0     -1.0      1.0      1.0     -1.0
   3 │    -1.0     -1.0      1.0     -1.0      1.0      1.0
   4 │     1.0     -1.0      1.0      1.0      1.0      1.0
   5 │    -1.0     -1.0      1.0     -1.0     -1.0     -1.0

Again, things are easy and intuitive. A data frame behaves just like a matrix in all operations.
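The matrix analogy can be checked on a small example. The sketch below (with illustrative values of my own) applies the same chain of broadcasted functions to a data frame and to its Matrix conversion and compares the results:

```julia
using DataFrames

df = DataFrame(a = [0.2, 0.8], b = [0.9, 0.1])
m = Matrix(df)                          # same values as a plain matrix

from_df     = sign.(log.(df) .+ 1)      # broadcasting over the data frame
from_matrix = sign.(log.(m) .+ 1)       # the same expression over the matrix

Matrix(from_df) == from_matrix
```

The difference is only in the container: the data frame result keeps the column names, while the matrix result does not.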

I hope you are now comfortable creating a new data frame using broadcasting.

We can turn to in-place operations on a data frame.

In-place update of values in a data frame

In general it is enough to put a data frame on the left-hand side of a broadcasted assignment
operator to update it in-place:

julia> df2 = copy(df)
5×6 DataFrame
 Row │ x1         x2         x3         x4        x5        x6
     │ Float64    Float64    Float64    Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ 0.364225   0.690894   0.240867   0.720774  0.331573  0.549766
   2 │ 0.225226   0.241412   0.0793279  0.418206  0.775367  0.35275
   3 │ 0.19913    0.0633375  0.767805   0.280096  0.721995  0.917259
   4 │ 0.708132   0.230088   0.702677   0.947402  0.928979  0.66101
   5 │ 0.0267573  0.0122425  0.549734   0.331788  0.32658   0.00476749

julia> df3 = df2
5×6 DataFrame
 Row │ x1         x2         x3         x4        x5        x6
     │ Float64    Float64    Float64    Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ 0.364225   0.690894   0.240867   0.720774  0.331573  0.549766
   2 │ 0.225226   0.241412   0.0793279  0.418206  0.775367  0.35275
   3 │ 0.19913    0.0633375  0.767805   0.280096  0.721995  0.917259
   4 │ 0.708132   0.230088   0.702677   0.947402  0.928979  0.66101
   5 │ 0.0267573  0.0122425  0.549734   0.331788  0.32658   0.00476749

julia> df2 .= log.(df)
5×6 DataFrame
 Row │ x1         x2         x3         x4          x5          x6
     │ Float64    Float64    Float64    Float64     Float64     Float64
─────┼─────────────────────────────────────────────────────────────────────
   1 │ -1.00998   -0.369769  -1.42351   -0.32743    -1.10391    -0.598262
   2 │ -1.49065   -1.42125   -2.53417   -0.87178    -0.254419   -1.04199
   3 │ -1.6138    -2.75928   -0.26422   -1.27262    -0.325737   -0.0863653
   4 │ -0.345124  -1.46929   -0.352858  -0.0540319  -0.0736689  -0.413987
   5 │ -3.62095   -4.40285   -0.598321  -1.10326    -1.11908    -5.34594

julia> df2 === df3
true

Note that with the last check I made sure that df2 .= log.(df) was indeed in-place.
We updated the contents of df2 and did not create a new object.

However, sometimes things are more tricky. Consider the df .> 0.5 operation we did above:

julia> df2 .= df .> 0.5
5×6 DataFrame
 Row │ x1       x2       x3       x4       x5       x6
     │ Float64  Float64  Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────
   1 │     0.0      1.0      0.0      1.0      0.0      1.0
   2 │     0.0      0.0      0.0      0.0      1.0      0.0
   3 │     0.0      0.0      1.0      0.0      1.0      1.0
   4 │     1.0      0.0      1.0      1.0      1.0      1.0
   5 │     0.0      0.0      1.0      0.0      0.0      0.0

Note that there is a difference from creating a new data frame with df .> 0.5.
The issue is that columns of df2 keep their original types. This is expected, as we
wanted a fully in-place operation. However, sometimes you might want to change the
element type of a column when doing broadcasting. This is possible, however,
then you need to use data frame indexing with a special ! row selector which signals
that column replacement is requested:

julia> df2[!, :] .= df .> 0.5
5×6 DataFrame
 Row │ x1     x2     x3     x4     x5     x6
     │ Bool   Bool   Bool   Bool   Bool   Bool
─────┼──────────────────────────────────────────
   1 │ false   true  false   true  false   true
   2 │ false  false  false  false   true  false
   3 │ false  false   true  false   true   true
   4 │  true  false   true   true   true   true
   5 │ false  false   true  false  false  false

julia> df3
5×6 DataFrame
 Row │ x1     x2     x3     x4     x5     x6
     │ Bool   Bool   Bool   Bool   Bool   Bool
─────┼──────────────────────────────────────────
   1 │ false   true  false   true  false   true
   2 │ false  false  false  false   true  false
   3 │ false  false   true  false   true   true
   4 │  true  false   true   true   true   true
   5 │ false  false   true  false  false  false

Indeed we got what we wanted. I showed the df3 variable to convince you that
all operations were still done on the same data frame object and that df2 and df3
still point to it.
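The difference between the two forms is also visible at the level of individual column vectors. Here is a minimal sketch (src, dst and col_before are names I made up for illustration) that keeps a reference to a column and checks whether the data frame still stores that very vector after each kind of assignment:

```julia
using DataFrames

src = DataFrame(a = [0.2, 0.8], b = [0.6, 0.4])
dst = copy(src)
col_before = dst.a                  # reference to the vector stored in dst (no copy)

dst .= src .* 2                     # in-place: writes into the existing vectors
inplace_kept_vector = dst.a === col_before

dst[!, :] .= src .> 0.5             # column replacement: new vectors are stored
replace_kept_vector = dst.a === col_before
```

After the first assignment the old vector is still the column of dst, so col_before sees the new values. After the second one, dst holds freshly allocated Bool columns and col_before is left untouched.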

Let me give an example where the difference between in-place and column replace
operations particularly matters and is a common surprise for new users.
It is a case when we want to introduce missing values to a column that initially does not allow them.

julia> df2 = copy(df)
5×6 DataFrame
 Row │ x1         x2         x3         x4        x5        x6
     │ Float64    Float64    Float64    Float64   Float64   Float64
─────┼─────────────────────────────────────────────────────────────────
   1 │ 0.364225   0.690894   0.240867   0.720774  0.331573  0.549766
   2 │ 0.225226   0.241412   0.0793279  0.418206  0.775367  0.35275
   3 │ 0.19913    0.0633375  0.767805   0.280096  0.721995  0.917259
   4 │ 0.708132   0.230088   0.702677   0.947402  0.928979  0.66101
   5 │ 0.0267573  0.0122425  0.549734   0.331788  0.32658   0.00476749

julia> df2 .= ifelse.(df .> 0.5, missing, df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64

julia> df2[!, :] .= ifelse.(df .> 0.5, missing, df)
5×6 DataFrame
 Row │ x1               x2               x3               x4              x5              x6
     │ Float64?         Float64?         Float64?         Float64?        Float64?        Float64?
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       0.364225   missing                0.240867   missing               0.331573  missing
   2 │       0.225226         0.241412         0.0793279        0.418206  missing               0.35275
   3 │       0.19913          0.0633375  missing                0.280096  missing         missing
   4 │ missing                0.230088   missing          missing         missing         missing
   5 │       0.0267573        0.0122425  missing                0.331788        0.32658         0.00476749

Note that df2 originally does not allow missing values in any of the columns. Therefore
df2 .= ifelse.(df .> 0.5, missing, df) fails. However, replacing df2 .= by df2[!, :] .=
works, because the ! selector explicitly requests overwriting the original columns
with new ones, possibly changing their types.
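If you prefer to keep the plain df2 .= form, another option (not used in the examples above, so treat it as a sketch) is to first widen the column types with allowmissing! and then do the fully in-place assignment. The data below is illustrative:

```julia
using DataFrames

src = DataFrame(a = [0.2, 0.8], b = [0.6, 0.4])
dst = copy(src)

allowmissing!(dst)                           # columns now accept missing
dst .= ifelse.(src .> 0.5, missing, src)     # the in-place form no longer errors
```

This makes the type change an explicit, separate step, which some people find easier to read in production code.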

Conclusions

I hope you found these examples useful and they will help you to work with DataFrames.jl more
easily and confidently.

As a final comment let me explain why df2 .= ifelse.(df .> 0.5, missing, df) is fully in-place
(and does not replace the columns with proper element types like df2[!, :] .=) as this is a common question.
There are three reasons for this:

  • performance: a fully in-place operation is faster and allocates less;
  • safety: in production code we might want to make sure that the type of a column is not changed by mistake;
  • design consistency: the operation was designed to work the same way as broadcasting on matrices
    (broadcasted assignment used with a matrix does not allow changing its element type).
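The matrix behavior referred to in the last point can be reproduced with Base Julia alone. In this small sketch (values are illustrative) the Bool results are converted to the Float64 element type of the matrix, and trying to store missing fails:

```julia
m = [0.2 0.8; 0.6 0.4]

m .= m .> 0.5                  # Bool results are converted to Float64
eltype(m)                      # still Float64

# Storing missing in a Float64 matrix is not possible in-place.
err = try
    m .= ifelse.(m .> 0.5, missing, m)
    nothing
catch e
    e
end

err isa MethodError
```

This mirrors what we saw with df2 .= above: the container keeps its element type, and values that cannot be converted raise an error.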

set up NeoVim + Tmux for a Data Science Workflow with Julia

By: Navi

Re-posted from: https://indymnv.dev/posts/004_nvim/index.html

Date: 2023-07-04

Summary: Notes to start walking with Neovim and Tmux in the Data Science World

tags: #Python #Julia #rstats #tmux #neovim #tooling




Table of Contents

  1. Introduction
  2. Why start using Neovim and Tmux? My motivations
  3. Installation process required
  4. Editing the init.lua and tmux.conf
  5. Some challenges for improving the workflow
  6. Conclusions

Introduction

In this post, I will provide some notes on getting started with Neovim for a Data Science Workflow. This setup is not strictly related to Julia and can also be used with Python and R. The idea is to work with a double panel structure, where one side contains your code and the other side has the REPL, which receives the snippets of code you send from the code side.

In this blog, I will mention the things you can add to make it comfortable for data analysis or more serious development. With this typical kick starter in Neovim and Tmux, I will explain some changes and new packages that are important for this purpose. Finally, I will dive into the details that still need improvement.

Why start using Neovim and Tmux? My motivations

I must say, I like notebooks. I used them extensively in my first job in analytics, and they really helped me dive into the problem and experiment with different use cases. I have also worked with VSCode. Although I am not a big fan of it, it has helped me in some specific use cases where a more "software engineering" perspective is needed.

With that in mind, when I read the book "Approaching Almost Any Machine Learning Problem" by Abhishek Thakur, and then watched the controversial and yet funny conference by Joel Grus on why he dislikes notebooks, I started to think deeply about the perspective of writing software that follows good practices, is expressive, and still easy to prototype. Unfortunately, I think Grus is right about it. Working with notebooks can lead to some weird behaviors, like running cells in different positions, populating your analysis with too much unnecessary information and plots, not creating abstractions when needed, and having issues with reproducibility. On the other hand, the script perspective didn't help me with fast iteration when I needed quick answers to simple questions.

When I moved to Julia, I realized that the REPL is on another level, and I understood that an important part of this community uses (Neo)Vim or Emacs for development. I was curious about using these tools for data science projects. Although there are not many articles about it, and the community using Vim/Emacs is quite small compared to other options, I found it pretty cool because it is minimalistic (though you can customize it extensively), fast, and promises to increase productivity after mastering the Vim keybindings (in 10 years).

What I realized is that you can still prototype like notebooks (but with a perspective closer to RStudio) with one pane for your code and another pane with your REPL open. Then, you can start transforming your code to make it look like serious software, all within one window without moving all the .ipynb file contents to another .py or .jl file. I found this workflow more enjoyable.

Installation process required

First of all, make sure to install Neovim and Tmux. There are plenty of tutorials out there on this topic, so I won't go into the details here.

The important thing is to create an init.lua file. If you don't want to install everything one by one, I recommend following the kick starter provided by nvim-lua/kickstart.nvim. It provides the basic tools for working with Neovim, including a package manager, Treesitter, LSP integration, etc. This kickstarter uses lazy.nvim as its plugin manager, which should be faster and doesn't require frequent updates like packer.nvim. Just create the init.lua file, copy and paste all the content into that file, save it, and quit. When you open it again, it should start installing or upgrading everything.

You also want to create a Tmux config file. The default config in Tmux is already suitable for working with Data Science or the Julia experience, but you may want to edit it to set a colorscheme or modify some shortcuts. Just add ~/.tmux.conf to create the file.

Editing the init.lua and tmux.conf

Here are the things required for editing the init.lua file:

  1. Add Julia to the Treesitter languages in init.lua, and also consider adding julials = {} to the local servers table.

  2. Make sure to set up the sysimage for the language server. In this discussion, they summarize the procedure well. Follow the instructions and add this snippet of code to your init.lua:

-- Run Julia LSP
require'lspconfig'.julials.setup{
    on_new_config = function(new_config, _)
        -- Use the Julia binary from the dedicated nvim-lspconfig environment if it exists
        local julia = vim.fn.expand("~/.julia/environments/nvim-lspconfig/bin/julia")
        if require'lspconfig'.util.path.is_file(julia) then
            new_config.cmd[1] = julia
        end
    end
}

  3. This should be enough for setting up Julia. If you open a Julia file, Neovim should be able to detect the LSP and offer features like jump to definition, etc. For R and Python, this should be a bit more straightforward for now (no need for step 2).

  4. There are other things you want to add for a data science project. One is vim-slime. This package is great for sending snippets of code from your file to a Julia REPL. Make sure to install it and add the following code to your init.lua. This will allow you to open a Tmux pane and start interacting with the file. You can modify the snippet below to change your target_pane (for example, if you prefer the REPL on your left side or above). The default shortcut is Ctrl-c Ctrl-c, which you can modify if you prefer.

vim.g.slime_target = 'tmux'
-- vim.g.slime_default_config = { socket_name = "default", target_pane = "{last}" }
vim.g.slime_default_config = {
  -- Lua doesn't have a string split function, so use Vimscript's split() on $TMUX
  socket_name = vim.api.nvim_eval('get(split($TMUX, ","), 0)'),
  target_pane = '{top-right}',
}

  5. For Tmux, there are some things you can add. Here is my config, which is really simple. However, I encourage you to find your own taste with Tmux.

set -g mouse on
set -g history-limit 102400
set -g base-index 1
set -g pane-base-index 1
set -g renumber-windows on
unbind C-b
set -g prefix C-x
# vim key movements between panes
# smart pane switching with awareness of vim splits
bind h select-pane -L
bind j select-pane -D
bind k select-pane -U
bind l select-pane -R
# reloading for now
unbind r
bind r source-file ~/.tmux.conf \; display "Reloaded ~/.tmux.conf"
# plugin
set -g @plugin 'egel/tmux-gruvbox'
set -g @tmux-gruvbox 'dark' # or 'light'
# Initialize TMUX plugin manager (keep this line at the very bottom of tmux.conf)
run '~/.tmux/plugins/tpm/tpm'

For this purpose, I have considered the following in my Tmux config: activate the mouse, increase the history limit in the panes (this is necessary because the default limit in Tmux is quite constrained), count windows and panes from 1, change the prefix (I found it easier to use Ctrl-x rather than Ctrl-b), and add keybindings to move between panes like in Vim. The r shortcut is used to reload the config file when you add or modify features, so you can use prefix + r to apply your changes. Finally, in Tmux, you can use a package manager called TPM. Make sure to add it to your config file.

Some challenges for improving the workflow

So far, the workflow with Neovim and Tmux has been set up nicely. However, there are some areas that can be improved. One of them is the visualization aspect. As a data scientist, you need to constantly iterate and visualize your data. If you want to have a deep understanding of your dataset and generate plenty of visualizations, the current setup may not be the best. However, in Julia, you can easily switch to Pluto to display all the figures you want.

One thing I have tried is to display the plots I am working on directly inside the terminal. One way to do this is by using UnicodePlots. If you like working with Plots.jl, you can change your backend from gr() to unicodeplots(). In my opinion, the quality of the visualization may not be the best, but it allows for instant plots in your terminal without the need for third-party software. For fast iteration, it is good enough.
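As a rough sketch of what that backend switch looks like in practice, the snippet below assumes that both Plots.jl and UnicodePlots.jl are installed in the active environment. The plotted data and labels are only an illustration:

```julia
using Plots

unicodeplots()      # switch from the default GR backend to the UnicodePlots backend

# Illustrative data: one period of a sine wave.
x = range(0, 2π; length = 50)
plt = plot(x, sin.(x); label = "sin(x)", title = "Terminal plot")

display(plt)        # renders the plot as text in the terminal pane
```

Sent to the REPL pane with vim-slime, this prints the plot right next to your code. Calling gr() switches back to the default backend when you need higher quality output.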

Another important point to consider is maintaining consistency in the workflow between Julia code and the Julia REPL. Currently, I have a workflow with Vim and my code, but the REPL follows a different logic. This is where the aforementioned repository could potentially help, as it aims to bridge the gap and maintain homogeneity between the two panels. Integrating Vim keybindings into the REPL would provide a seamless experience, allowing for a smoother transition and enhancing the overall workflow. It is definitely an area I look forward to exploring in the future to further improve my development process.

Conclusions

In this blog, I have explained how to set up Neovim and Tmux with Julia (or any other data science programming language). This setup provides a minimalist perspective. For people who like to have a variety of tools at hand, it may feel a bit lacking. However, if you are looking for a lightweight tool with a minimalistic design and enjoy working within the terminal, I highly recommend giving it a try.