Category Archives: Julia

Polyglot Sorting

By: Jonathan Carroll

Re-posted from: https://jcarroll.com.au/2022/10/08/polyglot-sorting/

I’ve had the impression lately that everyone is learning Rust and there’s plenty of great material out there to make that easier. {gifski} is perhaps the most well-known example of an R package wrapping a Rust Cargo crate. I don’t really know any system language particularly well, so I figured I’d wade into it and see what it’s like.

The big advantages I’ve heard are that it’s more modern than C++, is “safe” (in the sense that you can’t compile something that tries to read out of bounds memory), and it’s super fast (it’s a compiled, strictly-typed language, so one would hope so).

I had a browse through some beginner material, and watched some videos on Youtube. Just enough to have some understanding of the syntax and keywords so I could actually search for things once I inevitably hit problems.

Getting everything up and running went surprisingly smoothly. Installing the toolchain went okay on my Linux (Pop!_OS) machine, and the getting started guide was straightforward enough to follow along with. I soon enough had Ferris welcoming me to the world of Rust

----------------------------
< Hello fellow Rustaceans! >
----------------------------
              \
               \
                 _~^~^~_
             \) /  o o  \ (/
               '_   -   _'
               / '-----' \

Visual Studio Code works nicely as a multi-language editor, and while it’s great to have errors visible to you immediately, I can imagine that gets annoying pretty quick (especially if you write as much bad Rust code as I do).

Next I needed to actually code something up myself. I love small, silly problems for learning – you don’t know exactly what problems you’ll solve along the way. This one ended up being really helpful.

I had this tweet

in my bookmarks because I wanted to try to solve this with R (naturally) but I decided it was a reasonable candidate for trying to solve a problem and learn some language at the same time, so I decided to give it a go with Rust. This is slightly more complicated than an academic “sort some strings” because it’s “natural sorting” (2 before 10) and has a complicating character in the middle.

The first step was to get Rust to read in and just print back the ‘data’ (strings). I managed to copy some “print a vector of strings” code and got that working. I’ll figure out later what’s going with the format string here

println!("{:?}", x);

After that, I battled errors in converting between String, &str, and i32 types; returning a Result (error) rather than a value; dealing with obscure errors (“cannot move out of borrowed content”, “expected named lifetime parameter” – ???); and a lack of method support for a struct I just created (which didn’t have any inherited ‘type’). All in all, nothing too surprising given I know approximately 0 Rust, but I got there in the end!

Now, this won’t be anything “good”, but it does compile and appears to give the right answer, so I’m led to believe that means it’s “right”.

// enable printing of the struct
#[derive(Debug)]
// create a struct with a String and an integer
// not using &str due to lifetime issues
struct Pair {
    x: String,
    y: i32
}

fn main() {
    // input data vector
    let v = vec!["aa-2", "ab-100", "aa-10", "ba-25", "ab-3"];
    // create an accumulating vector of `Pair`s
    let mut res: Vec<Pair> = vec![];
    // for each string, split at '-', 
    //  convert the first part to String and the second to integer.
    //  then push onto the accumulator
    for s in v {
        let a: Vec<&str> = s.split("-").collect();
        let tmp_pair = Pair {x: a[0].to_string(), y: a[1].parse::<i32>().unwrap() };
        res.push(tmp_pair);
    }
    // sort by Pair.x then Pair.y
    res.sort_by_key(|k| (k.x.clone(), k.y.clone()));
    // start building a new vector for the final result
    let mut res2: Vec<String> = vec![];
    // paste together Pair.x, '-', and Pair.y (as String)
    for s2 in res {
        res2.push(s2.x + "-" + &s2.y.to_string());
    }

    // ["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
    println!("{:?}", res2);
}

Running

cargo run --release

produces the expected output

["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]

Feel free to suggest anything that could be improved, I’m sure there’s plenty.

That might have been an okay place to stop, but I did still want to see if I could solve the problem with R, and how that might compare (in approach, readability, and speed), so I coded that up as

# input vector
s <- c("aa-2", "ab-100", "aa-10", "ba-25", "ab-3")
# split into pairs of strings
x <- strsplit(s, "-")
# take elements of s sorted by the first elements of x then
#  the second (as integers)
s[order(sapply(x, `[[`, 1), as.integer(sapply(x, `[[`, 2)))]
## [1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25"

I don’t love that I had to use sapply() twice, but the only other alternative I could think of was to strip out the first and second element lists and use those in a do.call()

s[do.call(order, list(unlist(x)[c(T, F)], as.integer(unlist(x)[c(F,T)])))]
## [1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25"

which… isn’t better.

I also had an idea to shoehorn dplyr::arrange() into this, but that requires a data.frame. One idea I had was to read in the data, using "-" as a delimiter, explicitly stating that I wanted to read it as character and integer data. That seemed to work, which means I can try what I hoped

suppressMessages(library(dplyr, quietly = TRUE))
# input vector
s <- c("aa-2", "ab-100", "aa-10", "ba-25", "ab-3")

# read strings as fields delimited by '-', 
#  expecting character and integer
s %>% read.delim(
    text = .,
    sep = "-",
    header = FALSE,
    colClasses = c("character", "integer")
) %>%
    # sort by first then second column
    arrange(V1, V2) %>%
    # collapse to single string per row
    mutate(res = paste(V1, V2, sep = "-")) %>%
    pull()
## [1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25"

Why stop there? I know other languages! Okay, the Python and Julia examples I found in other Tweets.

In Julia, two options were offered. This one

strings = String["aa-2", "ab-100", "aa-10", "ba-25", "ab-3"];
print(join.(sort(split.(strings, "-"), by = x -> (x[1], parse(Int, x[2]))), "-"))
## ["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]

(I added a type to the input and an explicit print), and this one

strings = String["aa-2", "ab-100", "aa-10", "ba-25", "ab-3"];
print(sort(strings, by = x->split(x, "-") |> v->(v[1], parse(Int, v[2]))))
## ["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]

The Python example offered by the original author of the challenge was

def parts(s):
    letters, nums = s.split("-")
    return letters, int(nums)

strings = ["aa-2", "ab-100", "aa-10", "ba-25", "ab-3"]

print(sorted(strings, key=parts))
## ['aa-2', 'aa-10', 'ab-3', 'ab-100', 'ba-25']

I actually really like this one – it’s the approach I wanted to use for R; provide sort with a function returning the keys to use. Alas.

Lastly, I remembered that there’s a sort function in bash that can do natural sorting with the -V flag. I’m reminded of this anecdote (“More shell, less egg”) about using a very simple bash script when it’s possible. That came together okay

#!/bin/bash 

v=("aa-2" "ab-100" "aa-10" "ba-25" "ab-3")
readarray -t a_out < <(printf '%s\n' "${v[@]}" | sort -V)
printf '%s ' "${a_out[@]}"
echo 

exit 0
## aa-2 aa-10 ab-3 ab-100 ba-25

By the way, aside from the Rust example, all of these were run directly in the Rmd source of this post with knitr’s powerful engines… multi-language support FTW!

So, how do all these compare? I haven’t tuned any of these for performance; they’re how I would have written them as a developer trying to achieve something. Sure, if performance was an issue, I’d do some optimization, but I was curious just how the performance compares ‘out of the box’.

Mainly for my own posterity, I’ll add how I tracked this. I wrote the relevant code for each language in a file with suffix/filetype appropriate to each language. They’re all here, in case anyone is interested. Then I wanted to run each of them a few times, keeping track of the timing in a file. The solution I went with was to echo into a file (appending each time) both the input and output, with e.g.

echo "Rust (optimised/release)" >> timing
{time cargo run --release} >> timing 2>&1
{time cargo run --release} >> timing 2>&1
{time cargo run --release} >> timing 2>&1

(yes, trivial to loop 3 times, but whatever).

Doing this for all the languages (with both versions for R and Julia) I get

Rust (optimized/release)
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/sort`
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
cargo run --release  0.04s user 0.02s system 99% cpu 0.066 total
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/sort`
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
cargo run --release  0.07s user 0.01s system 99% cpu 0.087 total
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/sort`
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
cargo run --release  0.06s user 0.02s system 98% cpu 0.084 total

R1
[1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25" 
Rscript sort1.R  0.15s user 0.05s system 102% cpu 0.197 total
[1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25" 
Rscript sort1.R  0.17s user 0.05s system 102% cpu 0.206 total
[1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25" 
Rscript sort1.R  0.16s user 0.05s system 103% cpu 0.202 total

R2
[1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25" 
Rscript sort2.R  0.72s user 0.05s system 100% cpu 0.774 total
[1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25" 
Rscript sort2.R  0.67s user 0.06s system 100% cpu 0.720 total
[1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25" 
Rscript sort2.R  0.69s user 0.04s system 99% cpu 0.737 total

Python
['aa-2', 'aa-10', 'ab-3', 'ab-100', 'ba-25']
python3 sort.py  0.03s user 0.00s system 98% cpu 0.032 total
['aa-2', 'aa-10', 'ab-3', 'ab-100', 'ba-25']
python3 sort.py  0.02s user 0.01s system 98% cpu 0.034 total
['aa-2', 'aa-10', 'ab-3', 'ab-100', 'ba-25']
python3 sort.py  0.03s user 0.02s system 98% cpu 0.059 total

Julia1
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
julia sort1.jl  1.10s user 0.68s system 236% cpu 0.750 total
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
julia sort1.jl  1.14s user 0.64s system 233% cpu 0.765 total
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
julia sort1.jl  1.13s user 0.62s system 241% cpu 0.725 total

Julia2
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
julia sort2.jl  0.97s user 0.64s system 270% cpu 0.596 total
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
julia sort2.jl  1.00s user 0.58s system 259% cpu 0.607 total
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
julia sort2.jl  0.96s user 0.63s system 276% cpu 0.578 total

Bash
aa-2 aa-10 ab-3 ab-100 ba-25 
./sort.sh  0.01s user 0.00s system 109% cpu 0.013 total
aa-2 aa-10 ab-3 ab-100 ba-25 
./sort.sh  0.00s user 0.01s system 108% cpu 0.015 total
aa-2 aa-10 ab-3 ab-100 ba-25 
./sort.sh  0.01s user 0.00s system 99% cpu 0.009 total

This wouldn’t be much of a coding/benchmark post without a plot, so I also did a visual comparison

library(ggplot2)
d <- tibble::tribble(
  ~language, ~version, ~run, ~time,
  "Rust", "", 1, 0.066,
  "Rust", "", 2, 0.087,
  "Rust", "", 3, 0.084,
  "R", "1", 1, 0.197,
  "R", "1", 2, 0.206,
  "R", "1", 3, 0.202,
  "R", "2", 1, 0.774,
  "R", "2", 2, 0.720,
  "R", "2", 3, 0.737,
  "Julia", "1", 1, 0.750,
  "Julia", "1", 2, 0.756,
  "Julia", "1", 3, 0.725,
  "Julia", "2", 1, 0.596,
  "Julia", "2", 2, 0.607,
  "Julia", "2", 3, 0.578,
  "Python", "", 1, 0.032,
  "Python", "", 2, 0.034,
  "Python", "", 3, 0.059,
  "Bash", "", 1, 0.013,
  "Bash", "", 2, 0.015,
  "Bash", "", 3, 0.009
)

d$language <- factor(
  d$language, 
  levels = c("Rust", "R", "Julia", "Python", "Bash")
)

ggplot(d, aes(language, time, fill = language, group = run)) + 
  geom_col(position = position_dodge(0.9)) + 
  facet_grid(
    ~language + version, 
    scales = "free_x", 
    labeller = label_wrap_gen(multi_line = FALSE), 
    switch = "x"
  ) + 
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(
    title = "Performance of sort functions by language", 
    y = "Time [s]", 
    x = "Language, Version"
  ) + 
  scale_fill_brewer(palette = "Set1")

It’s true – Rust does pretty well, even with my terrible coding. My R implementation (the sensible one) isn’t too bad – perhaps over many strings it would be a bit slow. Surprisingly, the Julia implementations are actually quite slow. I don’t have a good explanation for that. I’m using Julia 1.5.0 which is slightly out of date, so perhaps that needs an update. The Python implementation does particularly well – I really should learn more python. The syntax there isn’t the worst, either. Oh, no – do I like that?

The big winner, though, is the simplest of all – Bash crushes the rest of the languages with a 2 liner, and calling it doesn’t involve compiling anything.

As I said, I’m not particularly interested in optimizing any of these – this is how they compare as written.

In summary, I learned some Rust – enough to actually manipulate some data. I’ll keep trying and hopefully some day I’ll be semi literate in it.

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-06-17
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version date (UTC) lib source
##  assertthat     0.2.1   2019-03-21 [3] CRAN (R 4.0.1)
##  blogdown       1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown       0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib          0.4.1   2022-11-02 [3] CRAN (R 4.2.2)
##  cachem         1.0.6   2021-08-19 [3] CRAN (R 4.2.0)
##  callr          3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli            3.4.1   2022-09-23 [3] CRAN (R 4.2.1)
##  colorspace     2.0-3   2022-02-21 [3] CRAN (R 4.2.0)
##  crayon         1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  DBI            1.1.3   2022-06-18 [3] CRAN (R 4.2.1)
##  devtools       2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest         0.6.30  2022-10-18 [3] CRAN (R 4.2.1)
##  dplyr        * 1.0.10  2022-09-01 [3] CRAN (R 4.2.1)
##  ellipsis       0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate       0.18    2022-11-07 [3] CRAN (R 4.2.2)
##  fansi          1.0.3   2022-03-24 [3] CRAN (R 4.2.0)
##  farver         2.1.1   2022-07-06 [3] CRAN (R 4.2.1)
##  fastmap        1.1.0   2021-01-25 [3] CRAN (R 4.2.0)
##  fs             1.5.2   2021-12-08 [3] CRAN (R 4.1.2)
##  generics       0.1.3   2022-07-05 [3] CRAN (R 4.2.1)
##  ggplot2      * 3.4.1   2023-02-10 [1] CRAN (R 4.1.2)
##  glue           1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  gtable         0.3.1   2022-09-01 [3] CRAN (R 4.2.1)
##  here           1.0.1   2020-12-13 [1] CRAN (R 4.1.2)
##  highr          0.9     2021-04-16 [3] CRAN (R 4.1.1)
##  htmltools      0.5.3   2022-07-18 [3] CRAN (R 4.2.1)
##  htmlwidgets    1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv         1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib      0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite       1.8.3   2022-10-21 [3] CRAN (R 4.2.1)
##  JuliaCall      0.17.5  2022-09-08 [1] CRAN (R 4.1.2)
##  knitr          1.40    2022-08-24 [3] CRAN (R 4.2.1)
##  labeling       0.4.2   2020-10-20 [3] CRAN (R 4.2.0)
##  later          1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lattice        0.20-45 2021-09-22 [4] CRAN (R 4.2.0)
##  lifecycle      1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr       2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  Matrix         1.5-3   2022-11-11 [4] CRAN (R 4.2.2)
##  memoise        2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime           0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI         0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  munsell        0.5.0   2018-06-12 [3] CRAN (R 4.0.1)
##  pillar         1.8.1   2022-08-19 [3] CRAN (R 4.2.1)
##  pkgbuild       1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgconfig      2.0.3   2019-09-22 [3] CRAN (R 4.0.1)
##  pkgload        1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  png            0.1-7   2013-12-03 [1] CRAN (R 4.1.2)
##  prettyunits    1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx       3.8.0   2022-10-26 [3] CRAN (R 4.2.1)
##  profvis        0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises       1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps             1.7.2   2022-10-26 [3] CRAN (R 4.2.2)
##  purrr          1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6             2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  RColorBrewer   1.1-3   2022-04-03 [3] CRAN (R 4.2.0)
##  Rcpp           1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes        2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  reticulate     1.26    2022-08-31 [1] CRAN (R 4.1.2)
##  rlang          1.0.6   2022-09-24 [1] CRAN (R 4.1.2)
##  rmarkdown      2.18    2022-11-09 [3] CRAN (R 4.2.2)
##  rprojroot      2.0.3   2022-04-02 [1] CRAN (R 4.1.2)
##  rstudioapi     0.14    2022-08-22 [3] CRAN (R 4.2.1)
##  sass           0.4.2   2022-07-16 [3] CRAN (R 4.2.1)
##  scales         1.2.1   2022-08-20 [3] CRAN (R 4.2.1)
##  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny          1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi        1.7.8   2022-07-11 [3] CRAN (R 4.2.1)
##  stringr        1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  tibble         3.1.8   2022-07-22 [3] CRAN (R 4.2.2)
##  tidyselect     1.2.0   2022-10-10 [3] CRAN (R 4.2.1)
##  urlchecker     1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis        2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  utf8           1.2.2   2021-07-24 [3] CRAN (R 4.2.0)
##  vctrs          0.5.2   2023-01-23 [1] CRAN (R 4.1.2)
##  withr          2.5.0   2022-03-03 [3] CRAN (R 4.2.0)
##  xfun           0.34    2022-10-18 [3] CRAN (R 4.2.1)
##  xtable         1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml           2.3.6   2022-10-18 [3] CRAN (R 4.2.1)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ─ Python configuration ───────────────────────────────────────────────────────
##  python:         /usr/bin/python3
##  libpython:      /usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so
##  pythonhome:     //usr://usr
##  version:        3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
##  numpy:          /home/jono/.local/lib/python3.10/site-packages/numpy
##  numpy_version:  1.24.1
##  
##  NOTE: Python version was forced by RETICULATE_PYTHON_FALLBACK
## 
## ──────────────────────────────────────────────────────────────────────────────

Is DataFrames.jl Hamiltonian?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/10/07/metadata.html

Introduction

Hamilton is an interesting general purpose micro-framework for
creating dataflows from python functions. It has been developed by Stitch Fix
to help managing complex DAGs, where the resulting data frames are wide
(1000+ columns).

While DataFrames.jl is not a framework for DAG execution it is a natural tool
for performing single steps in such processes. When I was reading the
explanation of motivation and design of Hamilton I was struck
by the fact that it shares similarities with some concepts in DataFrames.jl.

In this post I want to discuss some of these principles. Additionally, I want to
highlight how I believe that having metadata support in DataFrames.jl nicely
combines with them.

In the post I use Julia 1.8.2, and DataFrames.jl 1.4.1.

The principles

I do not list here all design choices that Hamilton makes. Instead I
want to discuss ideas that are similar between this framework and DataFrames.jl.

These concepts are simple:

  • if you use some column name it should have a single meaning in your pipeline;
  • every transformation should be a function with a name.

Such an approach helps with understanding of the code, its maintenance,
and documentation of the pipeline.

Let me discuss these two concepts one by one in the context of DataFrames.jl
design.

Single column name has a single meaning

This rule seems pretty intuitive but is often violated in practice. For example
users often transform a column (e.g. by taking log) and still keep its name.
The general recommendation is that such practices should be avoided. If you
significantly transform a column a recommended thing to do is to give it a new
name. In this way you are sure that you will not be confused later.

As an additional benefit of following this rule you can safely annotate your
columns with metadata to keep important information about their meaning. In
DataFrames.jl column-level metadata design works as follows:

  • you can add a keyvalue pair as metadata to any column of a data frame;
  • you can choose one of two styles of metadata propagation; it is either :note
    (signaling it is safe to retain metadata under certain transformations) and :default
    (signaling that metadata should be stripped after any transformation).

Let us see it in action. You can use colmetadata! to set column-level metadata
and colmetadata to retrieve it:

julia> using DataFrames

julia> df = DataFrame(sales=[1.2, 3.4, 2.1], year=[2001, 2002, 2003])
3×2 DataFrame
 Row │ sales    year
     │ Float64  Int64
─────┼────────────────
   1 │     1.2   2001
   2 │     3.4   2002
   3 │     2.1   2003

julia> colmetadata!(df, :sales, "label", "Yearly sales in USD", style=:note);

julia> colmetadata!(df, :year, "label", "Fiscal year (ending on June 30)", style=:note);

julia> colmetadata!(df, :sales, "sum", sum(df.sales), style=:default);

julia> colmetadata(df, :sales, "label")
"Yearly sales in USD"

julia> colmetadata(df, :year, "label")
"Fiscal year (ending on June 30)"

julia> colmetadata(df, :sales, "sum")
6.699999999999999

You might ask what is the use of :note and :default metadata distinction.

Most of the time metadata of :default style are things that are specific to
only a given instance of a data frame. An example of such metadata is column
creation time. In some scenarios having such information might be used, e.g.
for auditing purposes. However, such information should not be propagated under
transformations.

On the other hand :note meatadata is kept when data frame is transformed.
The rules, in short, are:

  • if the column is not changed under transformation (data it contains remains
    unchanged) :note-style metadata is kept;
  • if you change column data, but decide to keep the original column name
    metadata is also kept.

What is the rationale behind these rules? We could be super strict and say that
metadata is propagated only if the column data does not change and its name and
does not change. However, such a rule in practice seems to be overly strict.
When you get some source data you usually might want to e.g. rename the column
to some more descriptive name or perform data cleaning (e.g. dropping rows with
missing values). In such cases users usually feel that the contents of the colum
conceptually remains the same. However, some information about data in the
column might change, like for example its length. Therefore :note-style
metadata should not be used to store information that is strictly attached to a
concrete instance of a column (:default-style metadata should be used
instead).

Let us now look back at our example code.
A typical :note-style metadata is descriptive label of a column. We used
"label" key for it. Now we made a "sum" metadata to have style :default.
Someone might have wanted to store the sum information for easy access later.
However, such metadata should not be propagated under any transformation of
a data frame as it might potentially be invalidated, so we made its style
:default.

If you ask why such metadata might be useful have a look at this example:

julia> show(df, header=colmetadata.(Ref(df), names(df), "label"))
3×2 DataFrame
 Row │ Yearly sales in USD  Fiscal year (ending on June 30)
─────┼──────────────────────────────────────────────────────
   1 │                 1.2                             2001
   2 │                 3.4                             2002
   3 │                 2.1                             2003

Note that we have substituted column names (which might be not clear for end
users) with descriptive labels.

The metadata propagation rules work nice and will help you documenting your
data frame assuming that you follow the first principle: do not use a single
column name to store different information.

Every transformation should be a function with a name

The principle to define named functions for transformations, combined with
the principle of single name for single meaning, is a core of
operation specification syntax in DataFrames.jl.

Consider the following example code:

julia> usd2¢(x) = 100x
usd2¢ (generic function with 1 method)

julia> transform(df, :sales => usd2¢)
3×3 DataFrame
 Row │ sales    year   sales_usd2¢
     │ Float64  Int64  Float64
─────┼─────────────────────────────
   1 │     1.2   2001        120.0
   2 │     3.4   2002        340.0
   3 │     2.1   2003        210.0

Note that the auto-generated column name sales_usd2¢ gives us information
about a source column and transformation function applied to it. Of course you
could use a custom output name. However, using automatically generated column
name, assuming that you pass a names function as a transformation, has a big
benefit that it helps in automatic visual documentation of the transformation
applied to data (and looking up the code of the transformation function used).
This is a big feature if you work with hundreds or thousands of columns.

As with the previous principle this works if you follow a simple rule –
define transformations you want to apply as functions with a name.

Conclusions

In this post I have highlighted two principles that are useful when implementing
complex data transformation jobs:

  • if you use some column name it should have a single meaning;
  • every transformation should be a function with a name.

DataFrames.jl was designed to work nicely if you want to follow these rules, as
I have tried to explain in this post. However, nothing is carved in stone in
DataFrames.jl. You can write your code whatever way you want and nothing is
forced on the user. This is especially important in quick-and-dirty one time
interactive sessions, where users want flexibility.

If you would like to learn more about the design of operation specification
syntax I recommend you start with JuliaCon 2022 tutorial.

Similarly, we tried to design handling of metadata in a way that is easy to
understand and convenient to use. You can find more details about metadata
support in DataFrames.jl in Metadata section of the DataFrames.jl
documentation. I also plan to write a separate post in which I will give a more
in-depth example how metadata can be used in DataFrames.jl. Also there is a plan
to release TableMetadataTools.jl package, which will add even more
convenience to most common operations.

Using evotrees.jl for time series prediction

By: Navi

Re-posted from: https://indymnv.dev/posts/003_publish/index.html

Using evotrees.jl for time series prediction

Date: 2022-10-05

Summary: Some basic experimentations with the Gradient Boosting Machine implemented in Julia in a Ts problem.

tags: #Julia #TimeSeries #DataScience #EDA

Table of contents

  1. Introduction
  2. Dataset Preparation
    1. Data Extraction
    2. Dataset Cleaning
  3. Preliminary Visualizations
  4. A brief clustering with kmeans
  5. Using EvoTrees for prediction.
  6. Conclusions

Introduction

In this post, I want to show an analysis of a time series that I've been working on. Usually, when dealing with time series, it is not so common to use machine learning algorithms (without at least trying more traditional models like the ARIMA family), but I still wanted to test how well a GBM model fits for these kinds of problems that are so popular.

NOTE: I don't recommend starting with models of this type for time series problems. There are simpler models to understand that are less computationally expensive.

Dataset Preparation

Data Extraction

You can find the repository here, The codes you will see here, I prototyped in notebooks/tutorial.jl.

Now we start by making the corresponding imports.

using DataFrames
using Plots
using MLJ
using EvoTrees
using UrlDownload
using ZipFile
using HTTP
using CSV
using Dates
using Statistics
using MLJClusteringInterface
using Clustering
using FreqTables
using StatsPlots
using RollingFunctions
using StatsBase
using ShiftedArrays

There are several libraries in this section, and I must admit it took me some time to use each one. But anyway to start reading the dataframe, we can get it directly from its repository.

data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip"
f = download(data_url)
z = ZipFile.Reader(f)
z_by_filename = Dict( f.name => f for f in z.files)
data = CSV.read(z_by_filename["household_power_consumption.txt"], DataFrame,)

The dataframe looks more or less like this:

Row │ Date        Time      Global_active_power  Global_reactive_power  Voltage  Global_intensity  Sub_metering_1  Sub_metering_2  Sub_metering_3
         │ String15    Time      String7              String7                String7  String7           String7         String7         Float64?
─────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       1 │ 16/12/2006  17:24:00  4.216                0.418                  234.840  18.400            0.000           1.000                     17.0
       2 │ 16/12/2006  17:25:00  5.360                0.436                  233.630  23.000            0.000           1.000                     16.0
       3 │ 16/12/2006  17:26:00  5.374                0.498                  233.290  23.000            0.000           2.000                     17.0
       4 │ 16/12/2006  17:27:00  5.388                0.502                  233.740  23.000            0.000           1.000                     17.0
       5 │ 16/12/2006  17:28:00  3.666                0.528                  235.680  15.800            0.000           1.000                     17.0
       6 │ 16/12/2006  17:29:00  3.520                0.522                  235.020  15.000            0.000           2.000                     17.0
       7 │ 16/12/2006  17:30:00  3.702                0.520                  235.090  15.800            0.000           1.000                     17.0
       8 │ 16/12/2006  17:31:00  3.700                0.520                  235.220  15.800            0.000           1.000                     17.0
       9 │ 16/12/2006  17:32:00  3.668                0.510                  233.990  15.800            0.000           1.000                     17.0
      10 │ 16/12/2006  17:33:00  3.662                0.510                  233.860  15.800            0.000           2.000                     16.0
      11 │ 16/12/2006  17:34:00  4.448                0.498                  232.860  19.600            0.000           1.000                     17.0
      12 │ 16/12/2006  17:35:00  5.412                0.470                  232.780  23.200            0.000           1.000                     17.0
      13 │ 16/12/2006  17:36:00  5.224                0.478                  232.990  22.400            0.000           1.000                     16.0
      14 │ 16/12/2006  17:37:00  5.268                0.398                  232.910  22.600            0.000           2.000                     17.0
    ⋮    │     ⋮          ⋮               ⋮                     ⋮               ⋮            ⋮                ⋮               ⋮               ⋮
 2075246 │ 26/11/2010  20:49:00  0.948                0.000                  238.160  4.000             0.000           1.000                      0.0
 2075247 │ 26/11/2010  20:50:00  1.198                0.128                  238.110  5.000             0.000           1.000                      0.0
 2075248 │ 26/11/2010  20:51:00  1.024                0.106                  238.840  4.200             0.000           1.000                      0.0
 2075249 │ 26/11/2010  20:52:00  0.946                0.000                  239.050  4.000             0.000           0.000                      0.0
 2075250 │ 26/11/2010  20:53:00  0.944                0.000                  238.720  4.000             0.000           0.000                      0.0
 2075251 │ 26/11/2010  20:54:00  0.946                0.000                  239.310  4.000             0.000           0.000                      0.0
 2075252 │ 26/11/2010  20:55:00  0.946                0.000                  239.740  4.000             0.000           0.000                      0.0
 2075253 │ 26/11/2010  20:56:00  0.942                0.000                  239.410  4.000             0.000           0.000                      0.0
 2075254 │ 26/11/2010  20:57:00  0.946                0.000                  240.330  4.000             0.000           0.000                      0.0
 2075255 │ 26/11/2010  20:58:00  0.946                0.000                  240.430  4.000             0.000           0.000                      0.0
 2075256 │ 26/11/2010  20:59:00  0.944                0.000                  240.000  4.000             0.000           0.000                      0.0
 2075257 │ 26/11/2010  21:00:00  0.938                0.000                  239.820  3.800             0.000           0.000                      0.0
 2075258 │ 26/11/2010  21:01:00  0.934                0.000                  239.700  3.800             0.000           0.000                      0.0
 2075259 │ 26/11/2010  21:02:00  0.932                0.000                  239.550  3.800             0.000           0.000                      0.0
                                                                                                                                   2075231 rows omitted

Dataset Cleaning

As can be seen, it is a quite large dataset and we can take the opportunity to create new variables, so we have the possibility to obtain relevant information.

#Create a variable 
date_time = [DateTime(d, t) for (d,t) in zip(data[!,1], data[!,2])]data[!,:date_time] = date_time#Create variable for date
data[!,:year] = Dates.value.(Year.(data[!,1]))
data[!,:month] = Dates.value.(Month.(data[!,1]))
data[!,:day] = Dates.value.(Day.(data[!,1]))#Create variable for time
data[!, :hour] = Dates.value.(Hour.(data[!,2]))
data[!, :minute] = Dates.value.(Minute.(data[!,2]))#Create variable for weekends
data[!, :dayofweek] = [dayofweek(date) for date in data.Date]
data[!, :weekend] = [day in [6, 7] for day in data.dayofweek]

In addition, we notice that the variables are in String format. We can make some changes to put them in the appropriate form.

for i in 3:8
    data[!,i] = parse.(Float64, data[!,i])
end
data[!,1] = replace.(data[!,1], "/" => "-")
data[!,1] = Date.(data[!,1], "d-m-y")

Preliminary Visualizations

A classic way to plot all the variables is with the following code:

plot([plot(data[1:50000, :date_time],data[1:50000,col]; label = col, xrot=30) for col in ["Global_active_power",  "Global_reactive_power", "Global_intensity", "Voltage", "Sub_metering_1",  "Sub_metering_2", "Sub_metering_3"]]...)

Line Plot All

Note that we only take a sample of 50,000 data points to avoid overloading the graphs with information, and in the same way, we can create histograms.

plot([histogram(data[1:50000, col],label = col, bins = 20 ) for col in ["Global_active_power",  "Global_reactive_power", "Global_intensity", "Voltage", "Sub_metering_1",  "Sub_metering_2", "Sub_metering_3"]]...)

Line Plot All

For now, we can recognize that the time series in its global variables have a white noise behavior, and Voltage also has it, however, it is the only one that seems to have a distribution that is similar to a normal distribution, while the sub-metering, are signs of use of household appliances.

A brief clustering with kmeans

In this section, we are interested in building a clustering model on the time series. The purpose? It is simply a way of evaluating behavior patterns over time, one hypothesis would be to see irregular behavior patterns over time, given that greater consumption would be seen at specific periods of the day or season.

An interesting issue that I was unaware of was that time series clustering is possible and you can use k-means, however in these cases, they cannot be treated from the same perspective, and other types of variants of these algorithms should be used to consider the temporality of neighboring observations when clustering. But since this project is just a toy, and the use of this technique is only for EDA, we will stick with the classical algorithm.

If you want to know more about this topic, yu can read this articule

Continuing with the problem, we can cluster by applying the following code.

X = data[!, 3:9]
transformer_instance = Standardizer()
transformer_model = machine(transformer_instance, X)
fit!(transformer_model)
X = MLJ.transform(transformer_model, X);
KMeans= @load KMeans pkg=Clustering
kmeans = KMeans(k=3)mach = machine(kmeans, X) |> fit!# cluster X into 3 clusters using K-means
Xsmall = MLJ.transform(mach);
selectrows(Xsmall, 1:4) |> pretty
yhat = MLJ.predict(mach)
data[!,:cluster] = yhat

In this case, we have 3 clusters that are ordered as follows.

cluster	nrow
CategoricalValue	Int64
1	1	741077
2	2	1257309
3	3	50894

And if we try to plot the clusters, we would have the following.

plot([scatter(data[1:20000, :date_time],data[1:20000,col]; group=data[1:20000,:].cluster, size=(1200, 1000), title = col, xrot=30) for col in ["Global_active_power",  "Global_reactive_power", "Global_intensity", "Voltage", "Sub_metering_1",  "Sub_metering_2", "Sub_metering_3"]]...)

scatter plot cluster

It looks a bit confusing, although if we look at the voltage variable, we can already size up a certain trend. For now, let's consider a boxplot of the main variables but considering the clusters.

b1 =@df data boxplot(string.(:cluster), :Global_active_power, fillalpha=0.75, linewidth=2, title ="Global active power")
b2 =@df data boxplot(string.(:cluster), :Global_reactive_power, fillalpha=0.75, linewidth=2, title = "Global reactive power")
b3 = @df data boxplot(string.(:cluster), :Global_intensity, fillalpha=0.75, linewidth=2, title ="Global intensity")
b4 = @df data boxplot(string.(:cluster), :Voltage, fillalpha=0.75, linewidth=2, title = "Voltage")
plot(b1, b2, b3, b4 ,layout=(2,2), legend=false)

scatter plot cluster

The truth is that we notice slight differences between the clusters, where we have certain consumption patterns in each category, but in some of their variables these do not necessarily lead us to any conclusion. However, as we had mentioned at the beginning, the idea of ​​clustering was to study consumption patterns during time intervals, so we add the following.

h1 =heatmap(freqtable(data,:cluster,:dayofweek)./freqtable(data,:cluster), title = "day of week")
h2 =heatmap(freqtable(data,:cluster,:hour)./freqtable(data,:cluster), title = "hour")
h3 = heatmap(freqtable(data,:cluster,:month)./freqtable(data,:cluster), title = "month")
h4 = heatmap(freqtable(data,:cluster,:day)./freqtable(data,:cluster), title = "day")plot(h1, h2, h3, h4 ,layout=(2,2), legend=false)

scatter plot cluster

It might be a bit confusing initially, but let me take an example that might help you understand. If you take into account cluster 2, it corresponds to the lowest use of the global intensity used. If we go to the heatmap that represents the hours, we will see that the time where this pattern of behavior is most present is at night, which corresponds to the hours we are usually sleeping. I hope this make more sense.

This might give us a slight hint that time frames might be necessary, we'll take this information for featuer engineering at this point later. Now let's start with the next phase.

Using EvoTrees for prediction.

For now, we want to predict voltage. I'm not an expert in the field of electricity and consumption, but for a simple exercise, we will use the MLJ library (for Python users it would be equivalent to Scikit-Learn). Due to the amount of data and the algorithm we are going to use, it is not practical to perform training with cross-validation, this will take too much time, so we will prefer to only use a train/test split as a strategy.

let's generate a lag and cut the data in the following way:

data[!, :lag_30] = Array(ShiftedArray(data.Voltage, 30))
replace!(data.lag_30, missing => 0);

And to assign the training and testing, we use the following.

train = copy(filter(x -> x.Date < Date(2010,10,01), data))
test = copy(filter(x -> x.Date >= Date(2010,10,01), data))

Then, we remove some variables that we won't use to train the model, and we save our voltage variable.

select!(train, Not([:Date, :Time, :date_time, :cluster, ]))
select!(test, Not([:Date, :Time, :date_time, :cluster, ]))
y_train = copy(train[!,:Voltage])
y_test = copy(test[!,:Voltage])

Now we are going to apply a cyclical encoder to be able to work with the data, we have several new variables related to time (month, day, hour, among others), and all these variables will be more helpful if we allow extracting their cyclical character, that is why we use a trigonometric transformation

function cyclical_encoder(df::DataFrame, columns::Union{Array, Symbol}, max_val::Union{Array, Int} )
    for (column, max) in zip(columns, max_val)        df[:, Symbol(string(column) * "_sin")] = sin.(2*pi*df[:, column]/max)
        df[:, Symbol(string(column) * "_cos")] = cos.(2*pi*df[:, column]/max)
    end
    return df
end

Finally, we can apply this new function to our dataset.

columns_selected = [:day, :year, :month, :hour, :minute, :dayofweek]
max_val = [31, 2010, 12, 23, 59, 7]
train_cyclical = cyclical_encoder(train, columns_selected, max_val)
test_cyclical = cyclical_encoder(test, columns_selected, max_val)

And finally, let's train the model.

EvoTreeRegressor = @load EvoTreeRegressor pkg=EvoTrees verbosity=0
etr_start = EvoTreeRegressor(max_depth =15)machreg = machine(etr_start, train_cyclical[!,14:end], y_train);
fit!(machreg);
pred_etr_train = MLJ.predict(machreg, train_cyclical[!,14:end]);
rms_score_train = rms(pred_etr_train, y_train)
println("The rms in train is $rms_score_train")pred_etr = MLJ.predict(machreg, test_cyclical[!,14:end]);
rms_score = rms(pred_etr, y_test)
println("The rms in test is $rms_score")

This is our result:

  • The rms in train is 2.5364451392238085

  • The rms in test is 3.438565163838837

In this section, we plot the residual left by our model, and here we can detect some signs of overfitting, considering that our model has a much better score in the training dataset than in the test dataset. On the other hand, the plots are showing us that our model has biases in its predictions, it is not being able to recognize trends.

prediction

Finally, we can see how the predictions compare to the test data.

pred-vs-real

As we have confirmed earlier, the prediction does not seem to have been able to determine the magnitudes of the voltage in the testing of the dataset. Despite the fact that our variable is fairly stable over time, the model was trained with different parameters, but ultimately none of the options managed to show a significant improvement.

Conclusions

With this small exercise, we only tried to test that GBM, while a powerful tool and popular in places like Kaggle, requires a certain level of expertise both in the model and in the use case to achieve good performance. A naive approach may not generate results that satisfy the users. This, on one hand, requires:

  1. Understanding how to perform feature engineering for a time series, such as obtaining the decomposition of the time series. This can help capture trends that cannot always be obtained solely with the time horizon.

  2. Applying smoothing strategies like moving averages could help recognize the underlying pattern, but then you will need to estimate that moving average into the future.

Overall, time series analysis requires a deep understanding of the data, proper preprocessing techniques, feature engineering, and selecting appropriate models that can capture the specific patterns and dynamics of the data.