World Happiness Report – EDA & clustering with Julia

By: Navi

Re-posted from: https://indymnv.dev/posts/005_happines/index.html

World Happiness Report – EDA & clustering with Julia

Date: 2023-11-23

Summary: An exploration of the World Happiness Report using Julia

tags: #Julia #economy #clustering #EDA




Table of Contents

  1. Introduction
  2. Packages used
  3. Clustering
  4. Conclusions

Introduction

The purpose of this post is to show Julia as a language for data analysis and Machine Learning. Sadly, Kaggle does not support Julia kernels (hopefully they will add them in the future), so I wanted to take advantage of this space to show a reimplementation of Python/R notebooks in Julia. In this context, I took data on happiness in countries in 2021, along with some of the factors considered in this exciting survey.

  • You can get the dataset in Kaggle

  • The full code is in my Github

Packages used

I'm using Julia version 1.8.0 in this project, and the library versions are in the Project.toml. Some installed packages didn't end up being used for this analysis, but these are the important ones:

using DataFrames
using DataFramesMeta
using CSV
using Plots
using StatsPlots
using Statistics
using HypothesisTests
Plots.theme(:ggplot2)

Let's start by reading the file.

df_2021 = DataFrame(CSV.File("./data/2021.csv", normalizenames=true))

You can see the dataset in the REPL.

julia> df_2021 = DataFrame(CSV.File("./data/2021.csv", normalizenames=true))
149×20 DataFrame
 Row │ Country_name    Regional_indicator            Ladder_score  Standard_error_of_ladder_score  upperwhi ⋯
     │ String31        String                        Float64       Float64                         Float64  ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Finland         Western Europe                       7.842                           0.032         7 ⋯
   2 │ Denmark         Western Europe                       7.62                            0.035         7
   3 │ Switzerland     Western Europe                       7.571                           0.036         7
   4 │ Iceland         Western Europe                       7.554                           0.059         7
   5 │ Netherlands     Western Europe                       7.464                           0.027         7 ⋯
   6 │ Norway          Western Europe                       7.392                           0.035         7
   7 │ Sweden          Western Europe                       7.363                           0.036         7
   8 │ Luxembourg      Western Europe                       7.324                           0.037         7
   9 │ New Zealand     North America and ANZ                7.277                           0.04          7 ⋯
  10 │ Austria         Western Europe                       7.268                           0.036         7
  11 │ Australia       North America and ANZ                7.183                           0.041         7
  12 │ Israel          Middle East and North Africa         7.157                           0.034         7
  13 │ Germany         Western Europe                       7.155                           0.04          7 ⋯
  14 │ Canada          North America and ANZ                7.103                           0.042         7
  ⋮  │       ⋮                      ⋮                     ⋮                      ⋮                      ⋮   ⋱
 136 │ Togo            Sub-Saharan Africa                   4.107                           0.077         4
 137 │ Zambia          Sub-Saharan Africa                   4.073                           0.069         4
 138 │ Sierra Leone    Sub-Saharan Africa                   3.849                           0.077         4 ⋯
 139 │ India           South Asia                           3.819                           0.026         3
 140 │ Burundi         Sub-Saharan Africa                   3.775                           0.107         3
 141 │ Yemen           Middle East and North Africa         3.658                           0.07          3
 142 │ Tanzania        Sub-Saharan Africa                   3.623                           0.071         3 ⋯
 143 │ Haiti           Latin America and Caribbean          3.615                           0.173         3
 144 │ Malawi          Sub-Saharan Africa                   3.6                             0.092         3
 145 │ Lesotho         Sub-Saharan Africa                   3.512                           0.12          3
 146 │ Botswana        Sub-Saharan Africa                   3.467                           0.074         3 ⋯
 147 │ Rwanda          Sub-Saharan Africa                   3.415                           0.068         3
 148 │ Zimbabwe        Sub-Saharan Africa                   3.145                           0.058         3
 149 │ Afghanistan     South Asia                           2.523                           0.038         2

To see the column names, simply use

names(df_2021)

getting a vector with all column names

julia> names(df_2021)
20-element Vector{String}:
 "Country_name"
 "Regional_indicator"
 "Ladder_score"
 "Standard_error_of_ladder_score"
 "upperwhisker"
 "lowerwhisker"
 "Logged_GDP_per_capita"
 "Social_support"
 "Healthy_life_expectancy"
 "Freedom_to_make_life_choices"
 "Generosity"
 "Perceptions_of_corruption"
 "Ladder_score_in_Dystopia"
 "Explained_by_Log_GDP_per_capita"
 "Explained_by_Social_support"
 "Explained_by_Healthy_life_expectancy"
 "Explained_by_Freedom_to_make_life_choices"
 "Explained_by_Generosity"
 "Explained_by_Perceptions_of_corruption"
 "Dystopia_residual"

To see what a regional indicator is, we can look at how the countries are grouped.

julia> unique(df_2021.Regional_indicator)
10-element Vector{String}:
 "Western Europe"
 "North America and ANZ"
 "Middle East and North Africa"
 "Latin America and Caribbean"
 "Central and Eastern Europe"
 "East Asia"
 "Southeast Asia"
 "Commonwealth of Independent States"
 "Sub-Saharan Africa"
 "South Asia"

Let's do a simple operation with the dataframe: get the number of countries by regional indicator and sort the result.

sort(
    combine(groupby(df_2021, :Regional_indicator), nrow), 
    :nrow
)

Getting this output

julia> sort(
           combine(groupby(df_2021, :Regional_indicator), nrow),
           :nrow
       )
10×2 DataFrame
 Row │ Regional_indicator                 nrow
     │ String                             Int64
─────┼──────────────────────────────────────────
   1 │ North America and ANZ                  4
   2 │ East Asia                              6
   3 │ South Asia                             7
   4 │ Southeast Asia                         9
   5 │ Commonwealth of Independent Stat…     12
   6 │ Middle East and North Africa          17
   7 │ Central and Eastern Europe            17
   8 │ Latin America and Caribbean           20
   9 │ Western Europe                        21
  10 │ Sub-Saharan Africa                    36

With this, we can see that Sub-Saharan Africa has the largest number of countries, while North America and ANZ has only a small group.

Now, let's try to slice our data. We will create a dataframe called float_df that contains only the Float64 variables but excludes the "Explained" variables. This new dataframe will help us with some operations later.

# Get all Float64 columns
float_df = select(df_2021, findall(col -> eltype(col) <: Float64, eachcol(df_2021)))
# Take away the Explained variables
float_df = float_df[:, Not(names(select(float_df, r"Explained")))]

Let's make our first plot.

scatter(
    df_2021.Social_support,
    df_2021.Ladder_score,
    size = (1000,800),
    label="country",
    xaxis = "Social Support",
    yaxis = "Ladder Score",
    title = "Relation between Social Support and Happiness Index Score by country"
)

![scatterplot with ladder score and social support](/assets/005_happines/scatterplot.png)

If we want a view of all float variables in several histograms, we can add this code using StatsPlots.

N = ncol(float_df)
numerical_cols = Symbol.(names(float_df,Real))
@df float_df Plots.histogram(cols();
                             layout=N,
                             size=(1400,800),
                             title=permutedims(numerical_cols),
                             label = false)

Histogram of all variables

And if we want to compare them with boxplots:

@df float_df boxplot(cols(), 
                     fillalpha=0.75, 
                     linewidth=2,
                     title = "Comparing distribution for all variables in dataset",
                     legend = :topleft)

Boxplot all variables

Without going into too much detail, we can say that the Ladder score is the variable holding the result of the survey on the degree of happiness in each country (our dependent variable). The "Explained" variables correspond to the preprocessing used to build the Ladder score; for this reason, we removed them from the dataframe and will keep only the raw data.

What are the top 5 countries and bottom 5?

# Top 5 and bottom 5 countries by ladder score
sort!(df_2021, :Ladder_score, rev=true)
plot(
    bar(
        first(df_2021.Country_name, 5 ),
        first(df_2021.Ladder_score, 5 ),
        color= "green",
        title = "Top 5 countries by Happiness score",
        legend = false,
    ),
    bar(
        last(df_2021.Country_name, 5 ),
        last(df_2021.Ladder_score, 5 ),
        color ="red",
        title = "Bottom 5 countries by Happiness score",
        legend = false,
    ),
size=(1000,800),
yaxis = "Happines Score",
)

top5 and bottom 5

And the classic heatmap for correlation with the following function.

function heatmap_cor(df)
    cm = cor(Matrix(df))
    cols = Symbol.(names(df))
    (n, m) = size(cm)
    display(
    heatmap(cm, 
        fc = cgrad([:white,:dodgerblue4]),
        xticks = (1:m,cols),
        xrot= 90,
        size= (800, 800),
        yticks = (1:m,cols),
        yflip=true))
    display(
    annotate!([(j, i, text(round(cm[i,j],digits=3),
                       8,"Computer Modern",:black))
           for i in 1:n for j in 1:m])
    )
end

heatmap

And now, we can build a function where we can get the mean ladder score by regional indicator and compare it with the distribution of all countries.

function distribution_plot(df)
    display(
        @df df density(:Ladder_score,
        legend = :topleft, size=(1000,800) , 
        fill=(0, .3,:yellow),
        label="Distribution" ,
        xaxis="Happiness Index Score", 
        yaxis ="Density", 
        title ="Comparison Happiness Index Score by Region 2021") 
    )
    display(
        plot!([mean(df.Ladder_score)],
        seriestype="vline",
        line = (:dash), 
        lw = 3,
        label="Mean")
    )
    for element in unique(df.Regional_indicator)
        display(
            plot!(
            [mean(filter(row -> row["Regional_indicator"] == element, df).Ladder_score)],
            seriestype="vline",
            lw = 3,
            label="$element") 
        )
    end
end

distribution region

Suppose we want to try the same idea but with countries. In that case, we can take advantage of multiple dispatch and create a function that receives a list of countries and creates a variation of the distribution with countries.

function distribution_plot(df, var_filter, list_elements)
    display(
        @df df density(:Ladder_score,
        legend = :topleft, size=(1000,800) , 
        fill=(0, .3,:yellow),
        label="Distribution" ,
        xaxis="Happiness Index Score", 
        yaxis ="Density", 
        title ="Happiness index score compare by countries 2021") 
    )
    display(
        plot!([mean(df.Ladder_score)],
        seriestype="vline",
        line = (:dash), 
        lw = 3,
        label="Mean")
    )
    for element in list_elements
        display(
            plot!(
            [mean(filter(row -> row[var_filter] == element, df).Ladder_score)],
            seriestype="vline",
            lw = 3,
            label="$element") 
        )
    end
end

Let's test our new function, comparing three countries.

distribution_plot(df_2021, "Country_name", ["Chile",
                                            "United States",
                                            "Japan",
                                           ])

distribution countries

Here we can see how the USA has the highest score, followed by Chile and Japan.

To end the first part, let's apply some statistical tests. We will use an equal-variance t-test to compare distributions from different regions. The function is as follows.

# Perform a simple test to compare distributions
# This function performs a two-sample t-test of the null hypothesis that s1 and s2 
# come from distributions with equal means and variances 
# against the alternative hypothesis that the distributions have different means 
# but equal variances.
function t_test_sample(df, var, x , y)
    x = filter(row ->row[var] == x, df).Ladder_score
    y = filter(row ->row[var] == y, df).Ladder_score
    EqualVarianceTTest(vec(x), vec(y))
end

We will have this output if we compare Western Europe and North America and ANZ.

julia> t_test_sample(df_2021, "Regional_indicator", "Western Europe", "North America and ANZ")
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          -0.213595
    95% confidence interval: (-0.9068, 0.4796)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.5301

Details:
    number of observations:   [21,4]
    t-statistic:              -0.6374218416101513
    degrees of freedom:       23
    empirical standard error: 0.3350924366753546

We don't have enough evidence to reject the hypothesis that these samples come from distributions with equal means and variances. On the other hand, if we compare Western Europe with South Asia, we see this:

julia> t_test_sample(df_2021, "Regional_indicator", "South Asia", "Western Europe")
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          -2.47305
    95% confidence interval: (-3.144, -1.802)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-07

Details:
    number of observations:   [7,21]
    t-statistic:              -7.576776118465833
    degrees of freedom:       26
    empirical standard error: 0.32639840222022687

In this case, we can reject that hypothesis.
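
As a side note, the equal-variance assumption will not always hold. HypothesisTests also provides Welch's t-test for that case; here is a minimal sketch reusing the same filtering logic (the function name welch_test_sample is mine, not part of the original analysis):

# Welch's two-sample t-test: does not assume equal variances
function welch_test_sample(df, var, x, y)
    a = filter(row -> row[var] == x, df).Ladder_score
    b = filter(row -> row[var] == y, df).Ladder_score
    UnequalVarianceTTest(a, b)
end

welch_test_sample(df_2021, "Regional_indicator", "South Asia", "Western Europe")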

Clustering

Now we will cluster the countries using the popular K-means algorithm. My first option was to use Clustering.jl. However, to determine the ideal number of clusters with the elbow method, we need the WCSS (within-cluster sum of squares), so I used the Scikit-learn wrapper instead (I also filed an issue about it). Well, let's continue with the last part. I started by adding some libraries.

using Random
using ScikitLearn
using PyCall
@sk_import preprocessing: StandardScaler
@sk_import cluster: KMeans

Let's remove from float_df all the variables related to Ladder_score, keeping only the variables considered in the survey.

select!(float_df, Not([:Standard_error_of_ladder_score, 
                           :Ladder_score, 
                           :Ladder_score_in_Dystopia, 
                           :Dystopia_residual]))

To train our model, we need to standardize the data, and then we will create a list to retrieve the wcss in every iteration. The function is as follows:

function kmeans_train(df)
    X = fit_transform!(StandardScaler(), Matrix(df))
    wcss = []
    for n in 1:10
        Random.seed!(123)
        cluster = KMeans(n_clusters=n,
                        init = "k-means++",
                        max_iter = 20,
                        n_init = 10,
                        random_state = 0)
        cluster.fit(X)
        push!(wcss, cluster.inertia_)
    end
    return wcss
end

Let's invoke the function and plot the wcss.

wcss = kmeans_train(float_df)
plot(wcss, title = "wcss in each cluster",
     xaxis = "cluster",
     yaxis = "Wcss")

Elbow Method

In this case, I decided to go for three clusters. We can ~~abuse~~ make use of multiple dispatch again, adding n for a defined number of clusters.

function kmeans_train(df, n)
    X = fit_transform!(StandardScaler(), Matrix(df))
    Random.seed!(123)
    cluster = KMeans(n_clusters=n,
                    init = "k-means++",
                    max_iter = 20,
                    n_init = 10,
                    random_state = 0)
    cluster.fit(X)
    return cluster
end

cluster = kmeans_train(float_df, 3)

If we take the first plot we did at the beginning of the post, but now we add the cluster labels, we have this plot.

scatter(df_2021.Social_support,
        df_2021.Ladder_score,
        marker_z = cluster.labels_,
        legend = false,
        size = (1000,800),
        xaxis = "Social Support",
        yaxis = "Ladder Score",
        title = "Comparison between social support and ladder score by country incorporating clustering")

Scatter with cluster

With these clusters, we have a group of developed countries with the highest happiness index scores (for example Finland, Australia, and Germany), followed by a group of emerging countries, and finally a group of countries that still have a significant debt to the well-being of their population.

# Attach the cluster labels to the dataframe (scikit-learn labels are 0-based, so shift to 1-based)
df_2021.cluster = cluster.labels_ .+ 1
histogram(filter(row -> row.cluster == 1, df_2021).Ladder_score, label = "cluster 1", title = "Distribution of Happiness Score by Cluster", xaxis = "Ladder Score", yaxis = "n° countries")
histogram!(filter(row -> row.cluster == 2, df_2021).Ladder_score, label = "cluster 2")
histogram!(filter(row -> row.cluster == 3, df_2021).Ladder_score, label = "cluster 3")

histogram happiness cluster

Finally, we can compare how this cluster affects all the variables.

@df float_df Plots.density(cols();
                             layout=N,
                             size=(1600,1200),
                             title=permutedims(numerical_cols),
                             group = df_2021.cluster,
                             label = false)

Distribution by variables with cluster

Conclusions

From my experience using Python for about two years in data analysis and recently dabbling with Julia, I can say that the ecosystem generally seems quite mature for this purpose. I had some questions that the community immediately answered on Julia Discourse. More content like this is needed so that the data science community can more widely adopt this technology.

ChatGPT performs better on Julia than Python (and R) for Large Language Model (LLM) Code Generation. Why?

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/chatgpt-performs-better-on-julia-than-python-and-r-for-large-language-model-llm-code-generation-why/

Machine learning is all about examples. The more data you have, the better it should perform, right? With the rise of ChatGPT and Large Language Models (LLMs) as code-helping tools, it was thus just an assumption that the most popular languages like Python would likely be the best for LLMs. But because of the increased productivity, I tend to use a lot of Julia, a language with an estimated user base of around a million programmers. For this reason, people have often asked me how it fares with ChatGPT, GitHub Copilot, etc., and so I checked out those pieces and… was stunned. It's really good. It seemed better than Python, actually?

The data is in: Julia does well with ChatGPT

This question was recently put to the test by a researcher named Alessio Buscemi in A Comparative Study of Code Generation using ChatGPT 3.5 across 10 Programming Languages. Basically, he set up a bunch of example problems, asked ChatGPT for the code, executed it, and documented the results. The main result of the paper, in the words of the author:

“Overall, 1833 runs, or 45.8% of the total number, lead to executable code. However, this percentage varies greatly according to the tested language. ChatGTP performs the best on Julia, with a 81.5% of generated code being successfully executed, and performs the worst on C++, with only 7.3% of the executions being successful. Specifically, the model seems to perform better on high-level dynamically typed languages (Javascript, Julia, Perl, Python, R, Ruby, Smalltalk) rather than lower level statically typed languages (C, C++, Go).”

That is right: of all languages, "ChatGTP performs the best on Julia". In fact, ChatGPT generally only performs well on slow languages: Julia is the only fast language that it did well on. The author went on to do a podcast at DataSkeptic where, at 25:20, this is addressed: he was unable to answer why ChatGPT was able to make him more successful in Julia than in Python, even though he himself had only used Python before.

Is it unexpected that Julia would outperform Python in generated code execution and correctness?

But the real question is: is this really unexpected? I don't think so; this aligns with what I have seen. I am an educator (currently at MIT researching/teaching in machine learning) who has run many workshops over the last 10 years, with the languages mainly being Julia, Python, MATLAB, R, and C. I have also recently been a power user of this translation because I have been updating the bindings and documentation of diffeqpy, a fast differential equation solver package in Python which uses Julia as the backend. This has required a lot of translating Julia DifferentialEquations.jl code into a Python format for diffeqpy, and thus there was some ChatGPT involved.

[A quick intro to the "why of diffeqpy" is that we recently had a peer-reviewed study demonstrating that the generated GPU kernels from Julia for ODE solvers were about 20x-100x faster than the leading PyTorch and Jax libraries. Given these results, we wanted to create a Julia backend for scientists who use Python, and could demonstrate in Google Colab notebooks that even when used through Python with automated language translation, it's still about 10x faster than Jax. The future is working together: if you happen to be at PyData Eindhoven or PyData Global, come chat with me about building bridges between languages!]

In both of these scenarios, the development of diffeqpy with the help of ChatGPT and teaching new students Julia vs Python, one of the things that really stuck out was that new developers and AI get caught on the API inconsistencies of Python. "But Python is so Pythonic, it doesn't have inconsistencies?"… ehh, have you tried to teach it? I think it's an interesting thing in the psyche of a programmer. If you have used a programming language for 10 years, then everything the programming language does feels natural. It's like an English speaker using the word "do": do-support is very difficult for new language learners, but someone who natively learned English may have never thought twice about how weird the word is.

There is a lot of this “it’s always been like this and therefore it makes sense” in Python. In the workshops, it always got best highlighted when using a cheatsheet which shows Julia vs MATLAB vs Python syntax side-by-side. One of the examples that comes to mind is the syntax for the simplest things, like “let’s create an array of ones, zeros, or random numbers”. What does that look like side-by-side?
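
Roughly, the rows of that cheatsheet look like the following (a sketch from memory, with Julia as code and the NumPy forms as comments; not the original image):

# Julia: the same call syntax for all three constructors
ones(2, 2)    # 2×2 array of ones
zeros(2, 2)   # 2×2 array of zeros
rand(2, 2)    # 2×2 array of uniform random numbers

# NumPy equivalents (note the shifting conventions):
# np.ones((2, 2))         <- shape passed as a tuple
# np.zeros((2, 2))        <- shape passed as a tuple
# np.random.rand(2, 2)    <- shape passed as separate arguments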

The Python form is undoubtedly a bit weird, requires extra packages ("what is np?"), etc., but it's not "awful". However, it's then the rand part that gets students:

And that’s when it all becomes clear. np.zeros((2, 2)) to np.random.rand(2, 2), students always ask “why the ((” and the answer is “well there’s a tuple and …” but no, no matter how much we as educators try to explain it, it’s just an inconsistency. It’s just harder. And it happens over and over in the teaching. Sometimes it’s that the syntax is more complex or opaque:

while in other cases it’s inconsistency, unconventional wording, or something else.

So what happened with ChatGPT? It tripped up on exactly these same points that new learners commonly tripped up on. Common issues I noticed when developing diffeqpy was:

  • Results that were “right” but contextually wrong because of standard library inconsistencies or lack of a standard library. For example, “create an array of random numbers in Python” does “import random”, then “random_numbers = [random.randint(1, 100) for _ in range(5)]”. Okay, that’s not wrong, but “obviously” the right thing to do is to create a numpy array in any context where I am using a differential equation solver. But ChatGPT doesn’t know these contextual differences, it does not guess well, and if a user is not already familiar in Python they will get tripped up. The solution is to be more specific “create an array of random numbers in Python with numpy”, i.e. prompt engineering. But notably, this prompt engineering was not required with the same examples in Julia, and is generally a result of Python having a lack of core standardization, i.e. having many different ways to do basic things (numpy vs PyTorch vs standard libraries and methods built into CPython).
  • Results that were "close" but tripped up due to extra complexity. "A = coo_matrix(([4, 1],([0, 1], [1, 1])),shape=(2, 2))" is undeniably hard syntax to guess (a Julia counterpart is sketched after this list), and ChatGPT messed up quite a bit on examples like these sparse matrix constructions and forced vectorizations. Interestingly, it had some difficulties in translating code to zero-based indexing in a few cases. Mathematical models in texts tend to be written one-based (x_1), and converting from one-based to zero-based gave ChatGPT an extra step, which I found tripped it up in a few cases.
  • Results that were too inefficient to be used in practice. This is something that tag-teamed the last one. Sometimes the code that would be correct would be a for loop (that would often times have difficulty compiling in numba), but that code would inhibit performance enough that I would ask for a vectorized version, in which case it would create a weird matrix with some syntax errors (which is also not fast after fixing the syntax errors).
  • Results that were clearly pulling from “bad data”. More data is not always better. One other interesting point to note was that, in the case of differential equations, it was clear that ChatGPT was relying on data from tutorials and web responses I had written because its examples match my favorite examples (Lotka-Volterra, ROBER, Bruss, etc.) and match a lot of the coding styles I prefer (for reference, I wrote and maintain the differential equation libraries in Julia). In the Python examples I could find some oddities of sometimes inappropriately chosen solvers (stiff vs non-stiff) with some odd statements that you wouldn’t expect an expert to say. What I mean is, the Python code clearly had a larger set of training data, but not everyone in that training data seemed to really know the ins and outs of numerical differential equations to be actually trustworthy data. This likely is one of the major parts impacting the results.
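
For reference, here is a sketch of the Julia counterpart to that SciPy coo_matrix example, using the standard-library SparseArrays (note the 1-based indices):

using SparseArrays

# SciPy (0-based): A = coo_matrix(([4, 1], ([0, 1], [1, 1])), shape=(2, 2))
# Julia (1-based): sparse(rows, cols, values, m, n)
A = sparse([1, 2], [2, 2], [4, 1], 2, 2)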

These are some of the reasons why people tend to say you need to double-check your code when it is generated by LLMs (or that they aren't quite ready): these errors tend to happen (often). However, what I found is that these classes of errors were dramatically more frequent with Python than with Julia, and where it tripped up is exactly where I would have expected new students to get tripped up. That's not to say it does perfectly (or well) on Julia, but it clearly did better, and thus after trying hard I tended to only use ChatGPT to convert Python examples to Julia and not Julia examples to Python in the end, simply because the Python-generated examples required too much work to fix.

Conclusion

It’s interesting what you can get used to in life. Discomfort can become the norm if it’s what you experience every day. I had a back pain that I thought was just normal because I had gotten older, until I got a new office chair and realized it went away. In the same way, programming languages add their own discomforts to your programming life. You get used to them, but it becomes very obvious where the pain points are when you have to teach somebody new, since they have not become accustomed to the pain. When I first started using Julia, I came because I needed a high level language that generated fast code. When I started moving workshops from MATLAB and Python to Julia, I was shocked at how much easier it was to train newcomers to the language because of the syntactic simplicity. But now when using LLMs, it’s interesting to see how that syntactic simplicity and overall uniformity also reduced errors from AI generated code in precisely the same spots. My “eye check” from the diffeqpy project has now been confirmed by a larger study that indeed Julia works well with LLMs.

Are LLMs ready for doing all things code generation? No. But it does do surprisingly well with Julia, and it will be interesting to watch how that evolves as we get more and more Julia code as training data.


Is vector pooling considered harmful?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/11/17/pooling.html

Introduction

PooledArrays.jl is a package providing a custom array type that can be used to compress your data if it has few unique elements.
In this post I want to explain the design of the PooledArray object and discuss how it affects its performance.

The post was written under Julia 1.9.2, PooledArrays.jl 1.4.3, and DataFrames.jl 1.6.1.

How pooling works

The definition of the PooledArray type is the following:

mutable struct PooledArray{T, R<:Integer, N, RA} <: AbstractArray{T, N}
    refs::RA
    pool::Vector{T}
    invpool::Dict{T,R}
    # refcount[] is 1 if only one PooledArray holds a reference to pool and invpool
    refcount::Threads.Atomic{Int}
end

It represents an N-dimensional array with element type T.
The internal representation is that each unique value in a PooledArray gets an integer
representation (reference) of type R. The refs field is an AbstractArray{R, N} that keeps
track of which reference value is stored for each entry of the array.

To be able to efficiently work with these reference numbers, PooledArray stores two fields:

  • pool, which tells how reference numbers are mapped to the actual values of type T stored in the array;
  • invpool, which gives the inverse information: the reference number of a given value of type T.
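
Conceptually (a sketch with hypothetical helper names, not the package's actual methods), indexing resolves through these fields:

# reading element i resolves the reference through pool (what pa[i] effectively does)
value_at(pa, i) = pa.pool[pa.refs[i]]

# finding the reference number of a value goes through invpool
ref_of(pa, v) = pa.invpool[v]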

Since many operations on pooled arrays do not change pool and invpool, PooledArray has an extra
optimization that automatically ensures that two arrays that can be proven to have the same pool
and invpool share them. The refcount field keeps track of how many such arrays exist. This is used
in case we modify the pool and invpool of some array and want to make sure that we do not modify
another PooledArray by accident.

Let us check these properties in practice.

Test driving pooled arrays

Let us create a simple PooledArray first:

julia> using PooledArrays

julia> pa1 = PooledArray(["a", "b", "a", "b", "a", "b"])
6-element PooledVector{String, UInt32, Vector{UInt32}}:
 "a"
 "b"
 "a"
 "b"
 "a"
 "b"

julia> pa1.refs
6-element Vector{UInt32}:
 0x00000001
 0x00000002
 0x00000001
 0x00000002
 0x00000001
 0x00000002

julia> pa1.pool
2-element Vector{String}:
 "a"
 "b"

julia> pa1.invpool
Dict{String, UInt32} with 2 entries:
  "b" => 0x00000002
  "a" => 0x00000001

julia> pa1.refcount
Base.Threads.Atomic{Int64}(1)

We can see that, by default, each entry is recorded as a 4-byte UInt32.
Additionally, pa1.refcount tells us that only this pooled array
uses the pool and invpool objects that it references.

Let us first check what happens when we operate on this array:

julia> pa2 = pa1[[1, 2, 3]]
3-element PooledVector{String, UInt32, Vector{UInt32}}:
 "a"
 "b"
 "a"

julia> pa2.refcount
Base.Threads.Atomic{Int64}(2)

julia> pa1.refcount
Base.Threads.Atomic{Int64}(2)

julia> pa2.pool === pa1.pool
true

julia> pa2.invpool === pa1.invpool
true

As you can see, since pa2 is a subset of pa1, we know that they can share their pool and invpool.
The refcount field tells us that two objects reuse them.

Let us now modify the pool of pa2:

julia> pa2[1] = "c"
"c"

julia> pa2.refcount
Base.Threads.Atomic{Int64}(1)

julia> pa1.refcount
Base.Threads.Atomic{Int64}(1)

julia> pa2.pool
3-element Vector{String}:
 "a"
 "b"
 "c"

julia> pa1.pool
2-element Vector{String}:
 "a"
 "b"

As you can see the pools got automatically decoupled and refcount is adjusted accordingly.

In summary, the benefit of pool-sharing is that we can subset PooledArrays very quickly, without having to re-create pool and invpool.
This makes working with PooledArray fast as long as we do not change the set of values we store in it.

The second important design aspect of PooledArray is the R type. As I have said, by default it is UInt32. However,
for small pools this is inefficient. Therefore you can write:

julia> pa3 = PooledArray(pa1, compress=true)
6-element PooledVector{String, UInt8, Vector{UInt8}}:
 "a"
 "b"
 "a"
 "b"
 "a"
 "b"

julia> Base.summarysize(pa1)
570

julia> Base.summarysize(pa3)
504

As you can see, you can use the compress=true keyword argument to automatically pick the minimal size of R type that is able to keep the pool at hand.
In our case it is UInt8, which would save a lot of memory in case of large arrays.
Why do we use UInt32 by default then?
The reason is that this type is typically efficient enough memory-wise and at the same time it ensures a pool that is large enough in most scenarios.
For example, the limitation of the UInt8 pool is that it can store up to 255 values only:

julia> pa4 = PooledArray(1:255, compress=true);

julia> pa4[1]=256
ERROR: You're using a PooledArray with ref type UInt8, which can only hold 255 values,
and you just tried to add the 256th reference.  Please change the ref type
to a larger int type, or use the default ref type (UInt32).

So you have a tradeoff here. If you are sure you will not change your pool, then compress=true is a safe option.
If you know you might need to change the pool, you need to pick the R type more carefully.
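
For example (a sketch, assuming the two-argument PooledArray(values, reftype) constructor), picking UInt16 by hand leaves room for the pool to grow:

julia> pa5 = PooledArray(1:255, UInt16);

julia> pa5[1] = 256  # fine now: UInt16 references allow up to 65535 pool values
256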

What are the benefits of pooled arrays?

There are two types of benefits of PooledArray. The first is memory footprint, the second is performance.
Let me explain them by example.

First create two large vectors storing strings:

julia> v1 = ["x$(isodd(i))" for i in 1:10^6];

julia> v2 = PooledArray(v1, compress=true);

julia> Base.summarysize(v1)
21500040

julia> Base.summarysize(v2)
1000507

As you can see there is a significant compression gain in our example by using a PooledArray.

The second benefit is performance, especially in combination with DataFrames.jl:

julia> using DataFrames

julia> df = DataFrame(; v1, v2)
1000000×2 DataFrame
     Row │ v1      v2
         │ String  String
─────────┼────────────────
       1 │ xtrue   xtrue
       2 │ xfalse  xfalse
       3 │ xtrue   xtrue
    ⋮    │   ⋮       ⋮
  999998 │ xfalse  xfalse
  999999 │ xtrue   xtrue
 1000000 │ xfalse  xfalse
       999994 rows omitted

Now let us perform two example operations (I am measuring the second timing to avoid counting compilation time).

The first is aggregation:

julia> combine(groupby(df, :v1), nrow);

julia> @time combine(groupby(df, :v1), nrow);
  0.025122 seconds (201 allocations: 31.271 MiB)

julia> combine(groupby(df, :v2), nrow);

julia> @time combine(groupby(df, :v2), nrow);
  0.002766 seconds (227 allocations: 7.643 MiB)

As you can see doing aggregation when grouping by PooledArray is much faster.

The second example is innerjoin:

julia> df_ref = df[1:2, :]
2×2 DataFrame
 Row │ v1      v2
     │ String  String
─────┼────────────────
   1 │ xtrue   xtrue
   2 │ xfalse  xfalse

julia> df_ref.val = 1:2
1:2

julia> innerjoin(df, df_ref, on=:v1, makeunique=true);

julia> @time innerjoin(df, df_ref, on=:v1, makeunique=true);
  0.057885 seconds (248 allocations: 36.741 MiB, 23.00% gc time)

julia> innerjoin(df, df_ref, on=:v2, makeunique=true);

julia> @time innerjoin(df, df_ref, on=:v2, makeunique=true);
  0.024692 seconds (265 allocations: 43.416 MiB)

And again we see that joining on :v2 is faster than on :v1.

When might using pooled arrays not be a good idea?

There are three cases when you might not see benefits from using PooledArray.

The first is when you have many unique values in your data. Then you have to pay the price
of storing the refs, pool, and invpool objects, and all of them will be large.
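
A quick sketch of that effect (all values distinct, so pool and invpool are as large as the data itself):

julia> u1 = string.(1:10^6);  # one million unique strings

julia> u2 = PooledArray(u1);

julia> Base.summarysize(u2) > Base.summarysize(u1)  # refs + pool + invpool cost extra
true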

The second is when the values you store have a small memory footprint, e.g. Bool,
and you did not use compress=true. In such a case refs will take more memory than the original data would.
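
Again a small sketch: with the default UInt32 references, each 1-byte Bool entry is replaced by a 4-byte reference:

julia> b1 = rand(Bool, 10^6);

julia> b2 = PooledArray(b1);  # default R is UInt32

julia> Base.summarysize(b2) > Base.summarysize(b1)  # 4-byte refs vs 1-byte Bools
true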

The third case is when you create multiple copies of a PooledArray and modify its pool. In such a case
the cost of copying the pool and invpool fields might be non-negligible. Let me show you a practical
example of the third situation:

julia> df2 = DataFrame(id=1:10^6, v=PooledArray(repeat(["x$i" for i in 1:1000], 1000)))
1000000×2 DataFrame
     Row │ id       v
         │ Int64    String
─────────┼─────────────────
       1 │       1  x1
       2 │       2  x2
       3 │       3  x3
    ⋮    │    ⋮       ⋮
  999998 │  999998  x998
  999999 │  999999  x999
 1000000 │ 1000000  x1000
        999994 rows omitted

julia> df3 = DataFrame(id=1:10^6, v=repeat(["x$i" for i in 1:1000], 1000));

Note that df2 and df3 are identical except that in df2 the :v column is pooled and in df3 it is not.

Now let us test outerjoin on this data:

julia> outerjoin(df2, df2, on=:id, makeunique=true);

julia> @time outerjoin(df2, df2, on=:id, makeunique=true);
  0.065559 seconds (326 allocations: 30.951 MiB)

julia> outerjoin(df3, df3, on=:id, makeunique=true);

julia> @time outerjoin(df3, df3, on=:id, makeunique=true);
  0.036927 seconds (274 allocations: 38.400 MiB)

Note that working with non-pooled data is faster. If we check innerjoin, this is not the case:

julia> innerjoin(df2, df2, on=:id, makeunique=true);

julia> @time innerjoin(df2, df2, on=:id, makeunique=true);
  0.018188 seconds (210 allocations: 30.528 MiB)

julia> innerjoin(df3, df3, on=:id, makeunique=true);

julia> @time innerjoin(df3, df3, on=:id, makeunique=true);
  0.029364 seconds (206 allocations: 38.157 MiB)

What is going on here? Let us look at the output:

julia> outerjoin(df2, df2, on=:id, makeunique=true)
1000000×3 DataFrame
     Row │ id       v        v_1
         │ Int64    String?  String?
─────────┼───────────────────────────
       1 │       1  x1       x1
       2 │       2  x2       x2
       3 │       3  x3       x3
    ⋮    │    ⋮        ⋮        ⋮
  999998 │  999998  x998     x998
  999999 │  999999  x999     x999
 1000000 │ 1000000  x1000    x1000
                  999994 rows omitted

julia> innerjoin(df2, df2, on=:id, makeunique=true)
1000000×3 DataFrame
     Row │ id       v       v_1
         │ Int64    String  String
─────────┼─────────────────────────
       1 │       1  x1      x1
       2 │       2  x2      x2
       3 │       3  x3      x3
    ⋮    │    ⋮       ⋮       ⋮
  999998 │  999998  x998    x998
  999999 │  999999  x999    x999
 1000000 │ 1000000  x1000   x1000
                999994 rows omitted

Note that outerjoin changes the element type of the v and v_1 columns, so pool and invpool need to be re-created twice,
which takes time and memory.
In innerjoin the element type is not changed, so the pool and invpool in the output are reused from the source data frame.
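
We can verify this directly (a sketch relying on the PooledArray internals shown earlier):

julia> out = outerjoin(df2, df2, on=:id, makeunique=true);

julia> eltype(out.v)  # promoted to allow missing, so the pool must be rebuilt
Union{Missing, String}

julia> inn = innerjoin(df2, df2, on=:id, makeunique=true);

julia> inn.v.pool === df2.v.pool  # element type unchanged: the pool is reused
true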

Conclusions

In summary:

  • PooledArray is useful if you have data that has many duplicates.
  • The benefits of using PooledArray are lower memory footprint and the fact that some operations on it can be faster (e.g. groupby in DataFrames.jl).
  • The biggest benefit of PooledArray is when you do not change its pool of values. In such a case the pool and invpool objects are created only once
    and are reused in arrays derived from the source array.
  • Remember to carefully choose the type of the reference used in PooledArray. By default it is UInt32, but you can pick a smaller type to get even better compression
    at the expense of a smaller number of unique values that your pooled array can store.

I hope you find the tips shared in this post useful in your data analysis processes.