Advanced econometrics with Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/12/22/mars.html

Introduction

Working on DataFrames.jl has been a great experience for me.
One of the by-products of this work is the paper “DataFrames.jl: Flexible and Fast Tabular Data in Julia”, about the design of this package, which I co-authored with Milan Bouchet-Valat.
Recently I got a notification about a citation of this paper. The citing article is “MarSwitching.jl: A Julia package for Markov Switching Dynamic Models” by Mateusz Dadej. I found the MarSwitching.jl package an interesting contribution to the Julia package ecosystem, so I decided to devote this post to it.

Econometrics in Julia

The basic packages for doing econometrics in Julia that I have been using are GLM.jl and MixedModels.jl.
These packages are quite powerful. However, one could say that they provide the “standard” functionality in an econometrician’s toolbox.
Users often ask for specialized packages implementing various more advanced econometric models.

A basic answer is that with Julia they are often not needed. For a person who knows the language it is usually easy to implement an appropriate estimation procedure from scratch. The reasons are the following:

  1. Julia is expressive – such code is usually short; often learning the API of the package could take longer than coding the model.
  2. Julia is fast – you can expect that your custom implementation will be efficient; unlike in R/Python, you do not need to implement the compute engine in, e.g., C++.
  3. Advanced econometric models are often “open ended”. What I mean by this is that the authors of such models describe some specific case in the paper, while your problem at hand requires some modifications. When you implement things yourself you can easily include such custom changes.
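To illustrate the first two points, here is a minimal sketch of ordinary least squares written from scratch using only the standard library. The function name `ols` and the simulated data are assumptions made for this illustration; the sketch uses classical (homoskedastic) standard errors and is not a replacement for GLM.jl.

```julia
using LinearAlgebra
using Random

# Hypothetical helper: OLS estimates with classical standard errors.
function ols(X::AbstractMatrix, y::AbstractVector)
    n, k = size(X)
    beta = X \ y                         # least-squares solution
    resid = y - X * beta
    sigma2 = sum(abs2, resid) / (n - k)  # unbiased error-variance estimate
    vcov = sigma2 * inv(X' * X)          # classical covariance matrix
    se = sqrt.(diag(vcov))
    return (; beta, se, sigma2)
end

# Simulated data (assumed for the example): true coefficients are 1.0 and 2.0.
Random.seed!(1234)
n = 1_000
x = randn(n)
X = [ones(n) x]                          # intercept and one regressor
y = X * [1.0, 2.0] + 0.5 * randn(n)

est = ols(X, y)
```

Changing the estimator, for example to use a different covariance formula, only requires editing a line or two of this function, which is the kind of custom modification mentioned in the third point.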

Having said that, there are some classes of models that have become widely used in econometric practice and the specification of their API is already well established.
In such a case I believe it is worth having a package that implements them.
The MarSwitching.jl package is in my opinion a good example of such a case.

Markov switching dynamic models

So what are Markov switching dynamic models? Let me explain with two examples (taken from the documentation).

The general idea is that you can have an economic phenomenon where the parameters of the relationship between the target
variable and the features depend on some unobservable state, often called “regime”.
The objective of the estimation is to find: (a) the probabilities of switching between regimes, and (b) the parameters of the relationship in each regime. If this sounds abstract, it is best to learn it from examples.
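To make the idea concrete, here is a minimal sketch that simulates data from a two-regime process using only the standard library. It does not use the MarSwitching.jl API; the function name `simulate_switching` and all parameter values are assumptions chosen for illustration. An estimation procedure would try to recover `P` and the regime-specific parameters from `x` and `y` alone, since the regime path `s` is not observed in practice.

```julia
using Random

# Hypothetical helper: simulate y[t] = beta[s[t]] * x[t] + sigma[s[t]] * noise,
# where the regime s[t] follows a two-state Markov chain with transition
# matrix P (P[i, j] is the probability of moving from regime i to regime j).
function simulate_switching(rng, x, P, beta, sigma)
    n = length(x)
    s = Vector{Int}(undef, n)
    y = Vector{Float64}(undef, n)
    s[1] = 1                              # start in regime 1 (an assumption)
    for t in 1:n
        if t > 1
            # draw the next regime from row s[t-1] of P
            s[t] = rand(rng) < P[s[t-1], 1] ? 1 : 2
        end
        y[t] = beta[s[t]] * x[t] + sigma[s[t]] * randn(rng)
    end
    return (; y, s)
end

rng = MersenneTwister(42)
P = [0.95 0.05;    # regime 1 is persistent
     0.10 0.90]    # regime 2 is slightly less persistent
x = randn(rng, 500)
sim = simulate_switching(rng, x, P, [0.2, 1.5], [0.5, 1.0])
```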

A stylized fact in economics is that inflation falls during recessions and rises during booms.
This relationship is called the Phillips curve.
However, it is argued that the strength of this relationship is not constant over time.
In the “Regime switching Phillips curve” example you can find how to estimate and interpret a model that assumes that the relationship can indeed be in either a “weak” or a “strong” regime.

The second example is related to stock markets.
The hypothesis tested in the “Time-varying transition probabilities – modelling stock market” example is that the market can be in two states: “bull” and “bear”.
The bull market is characterized by low volatility and positive drift, while the bear market is characterized by high volatility and negative drift.
The model also shows that the chance of switching between the two regimes depends on an option-implied measure of the expected volatility of the S&P 500 index.
Again, I leave out the details as they are nicely explained in the documentation.

The point is that in economics we repeatedly hypothesize that some observed phenomena depend on some unobserved state of the market.
Markov switching dynamic models are a perfect tool for studying such processes and it is really nice that we have a package that implements them.

Conclusions

We are getting to the end of the year 2023 and I have been using Julia on a daily basis for over five years now.
It is really encouraging that I constantly discover new high-quality and well-documented packages in the Julia ecosystem, like MarSwitching.jl.

Have a happy holiday break!

DataFrame vs NamedTuple: a comparison

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/12/15/table.html

Introduction

In Julia we have a common interface for working with tabular data. It is provided by the Tables.jl package.

The fact that such an interface is defined greatly simplifies interoperability between packages.
However, it also introduces a challenge: the user needs to decide which concrete type of table to use.

In my (biased) experience the two most common types of tables used in practice are:

  • a NamedTuple;
  • a DataFrame.

In this post I want to compare them so that you get some guidance on which one to choose, and when, in your projects.

The post was tested under Julia 1.9.2 and DataFrames.jl 1.6.1.

Dependency level

NamedTuple is a type provided by Base Julia. DataFrame is defined in DataFrames.jl.

This means that you always have access to NamedTuple, while for DataFrame you need
to install DataFrames.jl and later load it.

DataFrames.jl is a relatively big package. Its installation and precompilation take over 1 minute.
This is not a huge amount of time, but if for some reason your project environment requires
frequent recompilation, it can start to feel cumbersome.

The other aspect is package load time:

julia> @time using DataFrames
  1.092920 seconds (1.27 M allocations: 78.653 MiB, 6.52% gc time, 0.46% compilation time)

Again, one second is not much, but in some applications users might want to avoid it.

In summary, NamedTuple wins here as a more lightweight option.

Conformance with Tables.jl interface

DataFrame is always a Tables.jl table.
NamedTuple is considered to be a table only if all its fields are AbstractVector.
This limitation introduces an extra level of effort: the user needs to ensure and check this property.

Let me give you one example where it is relevant:

julia> DataFrame(a=0, b=[1, 2]) # a table
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     0      1
   2 │     0      2

julia> (a=0, b=[1, 2]) # not a table
(a = 0, b = [1, 2])

julia> using Tables # needed to call Tables.istable

julia> Tables.istable((a=0, b=[1,2]))
false

Additionally, NamedTuple does not automatically check whether the lengths of all columns match:

julia> Tables.istable((a=[0], b=[1,2]))
true

julia> DataFrame((a=[0], b=[1,2]))
ERROR: DimensionMismatch: column :a has length 1 and column :b has length 2

In summary, DataFrame wins here as it is safer. With NamedTuple you need to do additional manual checks to verify that the data you are working with is indeed a table.
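Such a manual check could be sketched as follows. The helper `is_consistent_table` is hypothetical (it is not part of Tables.jl or DataFrames.jl) and uses only Base Julia; `allequal` assumes Julia 1.8 or later.

```julia
# Hypothetical helper: returns true only if every field is an AbstractVector
# and all fields have the same length.
function is_consistent_table(nt::NamedTuple)
    all(col -> col isa AbstractVector, nt) || return false
    return isempty(nt) || allequal(map(length, values(nt)))
end

is_consistent_table((a=0, b=[1, 2]))        # false: a is not a vector
is_consistent_table((a=[0], b=[1, 2]))      # false: column lengths differ
is_consistent_table((a=[0, 1], b=[1, 2]))   # true
```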

Data safety

Creating a DataFrame copies the data by default (this can be overridden by passing copycols=false to the constructor):

julia> x = [1, 2]
2-element Vector{Int64}:
 1
 2

julia> y = [3, 4]
2-element Vector{Int64}:
 3
 4

julia> df = DataFrame(; x, y)
2×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> df.x === x
false

julia> df.y === y
false

This is not the case for a NamedTuple:

julia> nt = (; x, y)
(x = [1, 2], y = [3, 4])

julia> nt.x === x
true

julia> nt.y === y
true

This design means that DataFrame is safer. Once you create it you can mostly forget about the potential risks of modifying the source data that was used to create it.
You might think that this is an issue of minor relevance. However, in practice, not making a copy has led to many hard-to-find bugs (and that is why we copy data by default when creating a DataFrame).

In summary, DataFrame wins here as it is safer (especially since you can disable the safe behavior by passing copycols=false to the DataFrame constructor if you wish).
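The kind of bug that aliasing can cause is easy to reproduce with Base Julia alone. The sketch below, with variable names chosen for illustration, shows a change to a source vector leaking into a NamedTuple, and one possible defensive copy using `map(copy, ...)`.

```julia
x = [1, 2]
y = [3, 4]

nt_shared = (; x, y)               # columns alias the source vectors
nt_copied = map(copy, nt_shared)   # defensive copy of every column

x[1] = 100                         # mutate the source vector

nt_shared.x[1]   # 100: the change leaks into the table
nt_copied.x[1]   # 1: the copied table is unaffected
```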

Flexibility

A big practical difference between NamedTuple and DataFrame is that NamedTuple is immutable. You cannot add, remove, or rename its columns in place.
On the other hand, DataFrame allows for such operations, which makes it more convenient when you need to manipulate your data.

In the flexibility dimension DataFrame is a clear winner.
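With a NamedTuple, the closest you can get is to build a new object each time. Here is a minimal sketch using only Base Julia; the variable names are chosen for illustration.

```julia
nt = (x = [1, 2], y = [3, 4])

# "Add" a column: merge returns a new NamedTuple; nt itself is unchanged.
nt_added = merge(nt, (z = [5, 6],))

# "Remove" a column: keep only the listed fields in a new NamedTuple.
nt_removed = NamedTuple{(:x,)}(nt)

# "Rename" columns: rebuild with new names over the same column vectors.
nt_renamed = NamedTuple{(:a, :b)}(values(nt))
```

Note that each of these results has a different type than `nt`, so code working with them may need to be compiled again for every new set of columns.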

Performance

There is a flip side to the flexibility coin. DataFrame is not type stable, while NamedTuple is.
This means that, if you want performance, you need to either use separate kernel functions or higher-order functions provided by DataFrames.jl (like combine or select).

Here is an example of performance-unfriendly and performance-friendly code for DataFrame:

julia> function sum1(table)
           s = 0
           for v in table.x
               s += v
           end
           return s
       end
sum1 (generic function with 1 method)

julia> function sum2(table)
           function kernel(x)
               s = 0
               for v in x
                   s += v
               end
               return s
           end
           return kernel(table.x)
       end
sum2 (generic function with 1 method)

julia> df = DataFrame(x=1:10^8);

julia> @time sum1(df) # after compilation
  5.810585 seconds (400.00 M allocations: 7.451 GiB, 1.78% gc time)
5000000050000000

julia> @time sum2(df) # after compilation
  0.041333 seconds (1 allocation: 16 bytes)
5000000050000000

Note that there is no such issue with NamedTuple:

julia> nt = NamedTuple(pairs(eachcol(df)))
(x = [1, 2,  …  99999999, 100000000],)

julia> @time sum1(nt) # after compilation
  0.042012 seconds (1 allocation: 16 bytes)
5000000050000000

julia> @time sum2(nt) # after compilation
  0.051452 seconds (1 allocation: 16 bytes)
5000000050000000

The winner is NamedTuple. It is easier to have good performance using it.
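For completeness, here is a minimal sketch of the higher-order-function route mentioned at the start of this section. It assumes DataFrames.jl is installed; the names `kernel` and `total` are chosen for illustration. The function passed to `combine` receives the column as a concrete vector, so the call acts as a function barrier.

```julia
using DataFrames

# The kernel is compiled for the concrete column type it receives.
function kernel(x)
    s = zero(eltype(x))
    for v in x
        s += v
    end
    return s
end

df = DataFrame(x = 1:10^6)

# combine extracts column :x and passes it to kernel.
res = combine(df, :x => kernel => :total)
```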

Compilation

The downside of NamedTuple being type stable (its column names and types are part of its type) is that it can take a lot of time to compile a function taking
it (or even to create it) if the number of columns is large. DataFrame does not have such issues.

Here is an example:

julia> @time df = DataFrame(transpose(1:10_000), :auto)
  0.008328 seconds (39.53 k allocations: 2.431 MiB)
1×10000 DataFrame
 Row │ x1     x2     x3     x4     x5     x6     x ⋯
     │ Int64  Int64  Int64  Int64  Int64  Int64  I ⋯
─────┼──────────────────────────────────────────────
   1 │     1      2      3      4      5      6    ⋯
                                 9994 columns omitted

julia> @time nt = NamedTuple(pairs(eachcol(df)));
  4.660914 seconds (604.77 k allocations: 27.920 MiB, 0.20% gc time, 99.15% compilation time)

As you can see, there was a significant compilation overhead when creating nt.

For wide tables DataFrame is clearly the preferred option.

Display

DataFrame uses a nicely formatted PrettyTables.jl display. NamedTuple is not that readable.
Try displaying the nt object I created in the previous section. You will get several pages of
hard-to-read output.

DataFrame has a clearly superior default display mechanism.

Convenience

NamedTuple is a generic data type, while DataFrame was designed for working with tabular data specifically.
Therefore DataFrame provides numerous convenience functionalities that NamedTuple lacks. Let me give two
examples:

  • You can select a column of a DataFrame using a string or a Symbol as its name; with NamedTuple you have to use a Symbol.
    Allowing for strings has two big advantages: first, it is slightly easier to generate column names as strings programmatically;
    second, it is easier to type column names containing special characters such as whitespace. For instance, "some column name"
    is inconvenient to work with using NamedTuple.
  • You have convenient column selectors like Cols or regular expressions which work with DataFrame and are not supported by
    NamedTuple.
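Both points can be sketched as follows, assuming DataFrames.jl is installed. The column names and values are made up for the example.

```julia
using DataFrames

df = DataFrame("some column name" => [1, 2], "x1" => [3, 4], "x2" => [5, 6])

col = df."some column name"               # column access by string
xs = df[:, r"^x"]                         # columns whose names match a regex
sel = select(df, Cols("x2", r"^some"))    # several selectors combined with Cols

# With NamedTuple the same name has to be turned into a Symbol by hand.
nt = NamedTuple{(Symbol("some column name"),)}(([1, 2],))
nt_col = nt[Symbol("some column name")]
```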

If you need convenience, DataFrame should be your preference.

Functionality

Last, but not least, DataFrame comes with dozens of convenience functions provided by the DataFrames.jl
package. These include split-apply-combine, joining, sorting, subsetting, broadcasting, etc. of
DataFrame objects. None of this is available for NamedTuple out of the box.
There are indeed extra packages that work nicely with NamedTuple, but this means that you need
to install and load them separately (and usually you will need several to get your job done).
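As a small illustration of a few of these operations, here is a sketch assuming DataFrames.jl is installed. The tables `sales` and `managers` and their contents are invented for the example.

```julia
using DataFrames

sales = DataFrame(region = ["N", "S", "N", "S"], amount = [10, 20, 30, 40])
managers = DataFrame(region = ["N", "S"], manager = ["Ann", "Bob"])

# split-apply-combine: total amount per region
totals = combine(groupby(sales, :region), :amount => sum => :total)

# joining: attach the manager of each region
joined = innerjoin(totals, managers, on = :region)

# sorting: largest total first
sorted = sort(joined, :total, rev = true)

# subsetting: keep regions with a total above 45
big = subset(sorted, :total => ByRow(>(45)))
```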

Additionally there is a host of convenience packages (like DataFramesMeta.jl or Tidier.jl) that make it easier
to work with DataFrame objects.

When it comes to functionality, DataFrame is the winner.

Metadata

DataFrame supports storing table-level and column-level metadata (attributes in R or labels in Stata are a similar
concept). NamedTuple does not provide such functionality.

Therefore, if you want to annotate your data, DataFrame is preferable.
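A minimal sketch of annotating a data frame, assuming DataFrames.jl 1.4 or later is installed. The keys and values are invented for the example.

```julia
using DataFrames

df = DataFrame(price = [1.5, 2.5])

# table-level metadata
metadata!(df, "caption", "Hypothetical price list", style = :note)

# column-level metadata
colmetadata!(df, :price, "label", "Unit price in EUR", style = :note)

caption = metadata(df, "caption")
label = colmetadata(df, :price, "label")
```

The `style = :note` setting is meant for metadata that should be propagated by operations that transform the data frame; see the DataFrames.jl manual for the exact propagation rules.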

Concluding remarks

Let us summarize our findings:

  • NamedTuple wins in: dependency level, performance
  • DataFrame wins in: conformance with Tables.jl, data safety, flexibility, compilation, display, convenience, functionality, metadata

Given these considerations I would say that most of the time DataFrame is a safe default choice as a storage format for tabular data.
This is especially true for interactive workflows.

However, there are cases where you will find NamedTuple preferable. In the Julia world performance usually gets a high priority.
NamedTuple is preferable here, especially if you have millions of small tables, as in this case the overhead of the larger DataFrame
object will be noticeable.

Julia User Group in Mainz (Germany)

By: Hendrik Ranocha -- Julia blog

Re-posted from: https://ranocha.de/blog/Julia_user_group/

We have initiated a new Julia User Group in Mainz (Germany). It provides a platform
for everybody interested in Julia – both people new to Julia and people having worked
with Julia for years. Our meetings are meant to provide opportunities to learn something new
and to discuss all issues related to Julia. Thus, we will start with a presentation
from our group. Afterwards, we will have time for open discussion.

Please find more information about the meetings on our website.

As a related event, we will host a talk by Valentin Churavy on December 21,
as announced online.