DataFrames.jl vs Pandas, dplyr, and Stata

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/09/25/comparisons.html

New content in DataFrames.jl documentation

Many people moving to DataFrames.jl from other data-management ecosystems are
interested in learning how to map their favorite code patterns to Julia.

This has been a long-standing issue. Fortunately, thanks to the recent efforts
of Matthieu Gomez and Tom Kwong (with the usual major support from
Peter Deffebach and Milan Bouchet-Valat, and a few other
contributors), we finally have a section in the manual on comparisons
against Pandas, dplyr, and Stata.

In parallel, Tom Kwong also prepared a DataFrames.jl cheat sheet, which
excellently shows the key functionalities that we currently provide.

We all hope that these materials will be useful for people to get started with
DataFrames.jl. If you would like to see some additional content in the
comparisons section of the DataFrames.jl manual, please do not hesitate
to open an issue or pull request.

Lessons learned

As an afterword, let me comment that preparing the dplyr and Stata material was
much smoother than preparing the Pandas material. This is also reflected in the
volume of the material covered (though the dplyr and Stata coverage could
probably be improved). The main reason is that Pandas differs from DataFrames.jl
in many more ways than dplyr or Stata do. A few of the notable differences are:

  • the type of the value returned by loc in Pandas depends on the value
    (not only the type) of its arguments;
  • 0-based indexing (Pandas) vs 1-based indexing (DataFrames.jl);
  • NaN in Pandas plays the role that missing plays in Julia, but Pandas skips
    it by default, whereas in Julia you have to skip missing values explicitly;
  • Pandas has an inplace argument to functions, while in Julia we have functions
    without and with ! to distinguish between non-mutating and mutating operations;
  • Pandas provides a row index, while in DataFrames.jl you need a separate column
    (or columns) in a DataFrame to hold it and then run the groupby function on
    them to get efficient row lookup through a GroupedDataFrame object (note, in
    particular, that in this way you can have many different sets of row-indexing
    columns for the same data frame).
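
The DataFrames.jl side of several of these differences can be sketched in a few
lines. This is only a minimal illustration, not material taken from the manual:
the data frame and its column names (id, x) are made up for the example, and it
assumes a reasonably recent DataFrames.jl release in which a GroupedDataFrame
can be indexed by a NamedTuple of key values.

```julia
using DataFrames
using Statistics

# Hypothetical toy data; the column names :id and :x are assumptions
# chosen only for this illustration.
df = DataFrame(id = [3, 1, 2], x = [1.0, missing, 3.0])

# Missing values propagate unless you skip them explicitly.
m_all  = mean(df.x)                # propagates: the result is missing
m_skip = mean(skipmissing(df.x))   # explicit skip: mean of 1.0 and 3.0

# sort returns a new data frame, while sort! mutates its argument in place.
sorted = sort(df, :id)             # df itself is still ordered 3, 1, 2
sort!(df, :id)                     # now df is ordered 1, 2, 3

# Indexing is 1-based: this is the first row of the sorted data frame.
first_id = df[1, :id]

# Row lookup by key: group by the "index" column, then index the
# GroupedDataFrame with the key value.
gdf = groupby(df, :id)
hit = gdf[(id = 2,)]               # SubDataFrame holding the rows with id == 2
```

Because groupby can be called again with a different set of columns, the same
data frame can be looked up by several different keys without changing how the
data itself is stored.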