Author Archives: Blog by Bogumił Kamiński

Metadata in DataFrames.jl: why and how?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/12/02/metadata.html

Introduction

In my recent post I have discussed new features added in DataFrames.jl.
One of them is handling of metadata. Today I want to discuss in more detail
why metadata is useful in practice and how it is supported. I am going to cover
the following topics:

  • Why having metadata support is useful.
  • Reading and writing data frames with metadata to disk.
  • Rules of metadata propagation.

This post uses many features of data management ecosystem of Julia, so
the list of its dependencies is long. I used:

  • Julia 1.8.2
  • CSV.jl 0.10.7
  • DataFrames.jl 1.4.3
  • Parquet2.jl 0.2.2
  • Plots.jl 1.36.6
  • ReadStatTables.jl 0.2.0
  • StatsBase.jl 0.33.21
  • TableMetadataTools.jl @ 9ce81f87
  • ZipFile.jl 0.10.1

Note that the TableMetadataTools.jl package is still in experimental
phase, where we are waiting for users’ feedback about its functionality. That
is why it is not released yet, and its state is indicated by Git hash:

(@v1.8) pkg> st TableMetadataTools
Status `C:\Users\bogum\.julia\environments\v1.8\Project.toml`
  [9ce81f87] TableMetadataTools v0.1.0
  `https://github.com/JuliaData/TableMetadataTools.jl#main`

Before we start load all the required packages:

julia> using CSV

julia> using DataFrames

julia> import Downloads

julia> using Parquet2

julia> using Plots

julia> import ReadStatTables

julia> using TableMetadataTools

julia> using Serialization

julia> using StatsBase

julia> using ZipFile

The task

The problem for today is to create a plot how percentage of health expenditures
in GDP of the United States of America changed over several years.

Assume someone told you that this information can be found in World Bank’s
World Development Indicators dataset that is available on this website.

In the post I will follow all the steps that are needed to make the plot.

Getting the data

We first fetch the data and decompress the file storing it:

julia> WDI_URL = "https://sites.google.com/site/md4stata/linked/\
                  world-bank-s-world-development-indicators-wdi/\
                  WDI2009.zip";

julia> if !isfile("WDI2009.dta")
           isfile("WDI2009.zip") || Downloads.download(WDI_URL, "WDI2009.zip");
           zip = ZipFile.Reader("WDI2009.zip")
           open("WDI2009.dta", "w") do io
               write(io, read(only(zip.files)))
           end
           close(zip)
       end

Note that in the code I download and decompress the data only if it is needed.

As a result of our operation we have written to disk the WDI2009.dta file
with the data. The file is in Stata format.

Loading the data into a data frame and inspecting it

julia> df = "WDI2009.dta" |> ReadStatTables.readstat |> DataFrame
11123×872 DataFrame
   Row │ country   wbcode  year   AG_AGR_TRAC_NO  AG_CON_FERT_MT  AG_CON_FER ⋯
       │ String    String  Int16  Float64?        Float64?        Float64?   ⋯
───────┼──────────────────────────────────────────────────────────────────────
     1 │ Aruba     ABW      1960       missing         missing       missing ⋯
     2 │ Aruba     ABW      1961       missing         missing       missing
     3 │ Aruba     ABW      1962       missing         missing       missing
     4 │ Aruba     ABW      1963       missing         missing       missing
     5 │ Aruba     ABW      1964       missing         missing       missing ⋯
     6 │ Aruba     ABW      1965       missing         missing       missing
   ⋮   │    ⋮        ⋮       ⋮          ⋮               ⋮               ⋮    ⋱
 11118 │ Zimbabwe  ZWE      2003         24000.0        146032.0         453
 11119 │ Zimbabwe  ZWE      2004         24000.0         86352.0         268
 11120 │ Zimbabwe  ZWE      2005         24000.0         85018.0         264 ⋯
 11121 │ Zimbabwe  ZWE      2006         24000.0        132661.0     missing
 11122 │ Zimbabwe  ZWE      2007       missing         missing       missing
 11123 │ Zimbabwe  ZWE      2008       missing         missing       missing
                                            867 columns and 11111 rows omitted

julia> describe(df, :min, :max, :nmissing)
872×4 DataFrame
 Row │ variable           min          max         nmissing
     │ Symbol             Any          Any         Int64
─────┼──────────────────────────────────────────────────────
   1 │ country            Afghanistan  Zimbabwe           0
   2 │ wbcode             ABW          ZWE                0
   3 │ year               1960         2008               0
   4 │ AG_AGR_TRAC_NO     1.0          2.79193e7       2294
   5 │ AG_CON_FERT_MT     0.0          5.59256e7       3755
   6 │ AG_CON_FERT_ZS     0.0          1.57833e5       3295
  ⋮  │         ⋮               ⋮           ⋮          ⋮
 867 │ TX_VAL_OTHR_ZS_WT  -3.84e-13    100.0           5792
 868 │ TX_VAL_SERV_CD_WT  0.0          3.81092e12      5713
 869 │ TX_VAL_TECH_CD     0.0          1.80719e12      8696
 870 │ TX_VAL_TECH_MF_ZS  0.0          74.9541         8574
 871 │ TX_VAL_TRAN_ZS_WT  0.000332059  100.0           5938
 872 │ TX_VAL_TRVL_ZS_WT  0.113351     100.0           5913
                                            860 rows omitted

We note that the data set is quite wide. It has almost 900 columns.
Additionally we see that column names are not very informative.

This is a common situation when working with wide tables. Authors of such
datasets typically try to use relatively short variable names.

How can we find the column that represents percentage of health expenditures
in GDP of the United States of America? We clearly need metadata for this table.

Fortunately it is present in this dataset, so let us investigate it. Before
we start let me highlight that there are two types of metadata in DataFrames.jl:

  • table-level metadata: key-value pairs that are attached to a data frame as a
    whole; keys must be strings, and values can be arbitrary data (but it is
    recommended to use strings if interoperability is important, as some storage
    formats are only able to store string values);
  • column-level metadata: key-value pairs that are attached to a concrete column
    in a data frame (the same considerations about data types of keys and values
    apply as for table-level metadata).

First check table-level metadata:

julia> metadata(df)
Dict{String, Any} with 13 entries:
  "file_ext"             => ".dta"
  "modified_time"        => DateTime("2010-01-08T11:17:00")
  "file_format_version"  => 114
  "file_format_is_64bit" => false
  "table_name"           => ""
  "notes"                => ["2", "dataset coded for stata as in Catini, Pani…
  "file_encoding"        => ""
  "file_label"           => ""
  "var_count"            => 872
  "row_count"            => 11123
  "creation_time"        => DateTime("2010-01-08T11:17:00")
  "endianness"           => READSTAT_ENDIAN_LITTLE
  "compression"          => READSTAT_COMPRESS_NONE

julia> metadata(df, "notes")
3-element Vector{String}:
 "2"
 "dataset coded for stata as in Catini, Panizza and Saade. Macro Data 4 Stata"
 "December 2009"

We can see that the data set was created on Jan 9, 2010 by Catini, Panizza and
Saade in Macro Data 4 Stata project. We also have information that the data
frame should have 11123 rows and 872 columns, and indeed above we see that this
information is consistent.

Let me comment here on one important distinction:

  • metadata such as "var_count or "row_count" is volatile; most likely
    transformation of the df data frame will invalidate it;
  • metadata such as "notes" most likely will not be invalidated when df
    is transformed.

DataFrames.jl distinguishes these both scenarios and I discuss how it is done
below.

Now check column-level metadata:

julia> colmetadata(df)
Dict{Symbol, Dict{String, Any}} with 872 entries:
  :GC_XPN_OTHR_ZS       => Dict("label"=>"Other expense (% of expense)", "for…
  :SH_DYN_AIDS_ZS       => Dict("label"=>"Prevalence of HIV, total (% of popu…
  :EP_PMP_DESL_CD       => Dict("label"=>"Pump price for diesel fuel (US\$ pe…
  :SL_TLF_PRIM_ZS       => Dict("label"=>"Labor force with primary education …
  :NE_CON_PRVT_PC_KD_ZG => Dict("label"=>"Household final consumption expendi…
  :PA_NUS_PPP           => Dict("label"=>"PPP conversion factor, GDP (LCU per…
  :DT_NFL_WFPG_CD       => Dict("label"=>"UN net multilateral flows, WFP (cur…
  :AG_LND_CREL_HA       => Dict("label"=>"Land under cereal production (hecta…
  :EE_BOD_TOTL_KG       => Dict("label"=>"Organic water pollutant (BOD) emiss…
  :SH_DYN_CHLD_MA       => Dict("label"=>"Mortality rate, male child (per 1,0…
  :ER_LND_PTLD_ZS       => Dict("label"=>"Nationally protected areas (% of to…
  :IP_JRN_ARTC_SC       => Dict("label"=>"Scientific and technical journal ar…
  :NY_GDP_DEFL_KD_ZG    => Dict("label"=>"Inflation, GDP deflator (annual %)"…
  :IT_NET_SECR_P6       => Dict("label"=>"Secure Internet servers (per 1 mill…
  :SE_SEC_ENRL_FE_ZS    => Dict("label"=>"Secondary education, pupils (% fema…
  :FR_INR_LNDP          => Dict("label"=>"Interest rate spread (lending rate …
  ⋮                     => ⋮

julia> colmetadata(df, :country)
Dict{String, Any} with 8 entries:
  "label"         => "CountryName"
  "format"        => "%44s"
  "display_width" => 44
  "measure"       => READSTAT_MEASURE_UNKNOWN
  "alignment"     => READSTAT_ALIGNMENT_RIGHT
  "type"          => READSTAT_TYPE_STRING
  "storage_width" => 0x000000000000002d
  "vallabel"      => Symbol("")

As you can see there is a lot of metadata for each column, but most of it is
technical (and most likely volatile) except for "label" which gives us useful
information about the interpretation of data in the column (and this information
is not invalidated under simple operations on a data frame like dropping rows
or columns).

TableMetadataTools.jl provides a convenience functions label and labels that
extract "label" metadata from a column and all columns respectively:

julia> label(df, :country)
"CountryName"

julia> labels(df)
872-element Vector{String}:
 "CountryName"
 ""
 ""
 "Agricultural machinery, tractors"
 "Fertilizer consumption (metric tons)"
 "Fertilizer consumption (100 grams per hectare of arable land)"
 "Agricultural land (sq. km)"
 "Agricultural land (% of land area)"
 ⋮
 "Merchandise exports (current US\$)"
 "Export value index (2000 = 100)"
 "Computer, communications and other services (% of commercial service exports)"
 "Commercial service exports (current US\$)"
 "High-technology exports (current US\$)"
 "High-technology exports (% of manufactured exports)"
 "Transport services (% of commercial service exports)"
 "Travel services (% of commercial service exports)"

Now we could search in the vector returned by labels for a column containing
information about percentage of health expenditures in GDP. The easiest way
to do it is to use the findlabels function from TableMetadataTools.jl:

julia> findlabels(contains(r"health"i), df)
11-element Vector{Pair{Symbol, String}}:
    :SH_MED_CMHW_P3 => "Community health workers (per 1,000 people)"
    :SH_STA_ARIC_ZS => "ARI treatment (% of children under 5 taken to a health provider)"
    :SH_STA_BRTC_ZS => "Births attended by skilled health staff (% of total)"
    :SH_XPD_EXTR_ZS => "External resources for health (% of total expenditure on health)"
    :SH_XPD_OOPC_ZS => "Out-of-pocket health expenditure (% of private expenditure on health)"
       :SH_XPD_PCAP => "Health expenditure per capita (current US\$)"
    :SH_XPD_PRIV_ZS => "Health expenditure, private (% of GDP)"
       :SH_XPD_PUBL => "Health expenditure, public (% of total health expenditure)"
 :SH_XPD_PUBL_GX_ZS => "Health expenditure, public (% of government expenditure)"
    :SH_XPD_PUBL_ZS => "Health expenditure, public (% of GDP)"
    :SH_XPD_TOTL_ZS => "Health expenditure, total (% of GDP)"

We can see that there are 11 columns whose metadata contains health substring
(case insensitive). Now we can relatively easily find that the column we are
interested in is :SH_XPD_TOTL_ZS:

julia> label(df, :SH_XPD_TOTL_ZS)
"Health expenditure, total (% of GDP)"

Let us inspect it:

julia> df.SH_XPD_TOTL_ZS
11123-element Vector{Union{Missing, Float64}}:
  missing
  missing
  missing
  missing
  missing
  missing
  missing
  missing
 ⋮
  missing
 8.1
 7.6
 8.4
 8.9
 9.3
  missing
  missing

julia> describe(df.SH_XPD_TOTL_ZS)
Summary Stats:
Length:         11123
Missing Count:  10098
Mean:           6.310835
Minimum:        1.300000
1st Quartile:   4.550528
Median:         5.900000
3rd Quartile:   7.800000
Maximum:        17.700000
Type:           Union{Missing, Float64}

Understanding metadata styles: part 1

For our analysis we will only need :country, :year, and :SH_XPD_TOTL_ZS
columns. Let us create a data frame holding them:

julia> health_df = select(df, :country, :year, :SH_XPD_TOTL_ZS)
11123×3 DataFrame
   Row │ country   year   SH_XPD_TOTL_ZS
       │ String    Int16  Float64?
───────┼─────────────────────────────────
     1 │ Aruba      1960       missing
     2 │ Aruba      1961       missing
     3 │ Aruba      1962       missing
     4 │ Aruba      1963       missing
     5 │ Aruba      1964       missing
     6 │ Aruba      1965       missing
   ⋮   │    ⋮        ⋮          ⋮
 11118 │ Zimbabwe   2003             7.6
 11119 │ Zimbabwe   2004             8.4
 11120 │ Zimbabwe   2005             8.9
 11121 │ Zimbabwe   2006             9.3
 11122 │ Zimbabwe   2007       missing
 11123 │ Zimbabwe   2008       missing
                       11111 rows omitted

Now check the column-level metadata of this data frame:

julia> colmetadata(health_df)
Dict{Any, Any}()

It is empty. What is the reason for this. We need to go back to df data frame
and check the style of metadata stored there:

julia> colmetadata(df, :country, style=true)
Dict{String, Tuple{Any, Symbol}} with 8 entries:
  "label"         => ("CountryName", :default)
  "format"        => ("%44s", :default)
  "display_width" => (44, :default)
  "measure"       => (READSTAT_MEASURE_UNKNOWN, :default)
  "alignment"     => (READSTAT_ALIGNMENT_RIGHT, :default)
  "type"          => (READSTAT_TYPE_STRING, :default)
  "storage_width" => (0x000000000000002d, :default)
  "vallabel"      => (Symbol(""), :default)

We see that all metadata for :country column their style is :default (we
could check that this style is set for all metadata in df). The :default
style means that metadata is considered volatile, that is it is dropped under
any transformation of df since it could get invalidated. What if we want to
keep column labels? We need to change their style to :note, which indicates
that metadata is safe to be propagated when data frame is transformed. This can
be conveniently done using the setcolmetadatastyle! function from
TableMetadataTools.jl package:

julia> setcolmetadatastyle!(==("label"), df);

julia> colmetadata(df, :country, style=true)
Dict{String, Tuple{Any, Symbol}} with 8 entries:
  "label"         => ("CountryName", :note)
  "format"        => ("%44s", :default)
  "display_width" => (44, :default)
  "measure"       => (READSTAT_MEASURE_UNKNOWN, :default)
  "alignment"     => (READSTAT_ALIGNMENT_RIGHT, :default)
  "type"          => (READSTAT_TYPE_STRING, :default)
  "storage_width" => (0x000000000000002d, :default)
  "vallabel"      => (Symbol(""), :default)

julia> health_df = select(df, :country, :year, :SH_XPD_TOTL_ZS);

julia> colmetadata(health_df)
Dict{Symbol, Dict{String, String}} with 3 entries:
  :year           => Dict("label"=>"")
  :SH_XPD_TOTL_ZS => Dict("label"=>"Health expenditure, total (% of GDP)")
  :country        => Dict("label"=>"CountryName")

Note that the predicate ==("label") selected only "label" metadata and
by default :setcolmetadatastyle! changes the style to :note (the reason is
that most data frame storage formats will produce :default style by default).

We now see that the select operation nicely kept the column labels. However,
we note that for the :year column the label is not nice, as it is an empty
string. We can change it using the label! function:

julia> label!(health_df, :year, "Year");

julia> labels(health_df)
3-element Vector{String}:
 "CountryName"
 "Year"
 "Health expenditure, total (% of GDP)"

The meta2toml function provides a convenient way to dump all table-level and
column-level metadata in TOML format:

julia> print(meta2toml(health_df))
style = true

[colmetadata.SH_XPD_TOTL_ZS]
label = ["Health expenditure, total (% of GDP)", "note"]

[colmetadata.country]
label = ["CountryName", "note"]

[colmetadata.year]
label = ["Year", "note"]

[metadata]

julia> print(meta2toml(health_df, style=false))
style = false

[colmetadata.SH_XPD_TOTL_ZS]
label = "Health expenditure, total (% of GDP)"

[colmetadata.country]
label = "CountryName"

[colmetadata.year]
label = "Year"

[metadata]

The use of this function is twofold. First, you can use it to get
human-readable information about data frame metadata. Second use is to make
it easy to save metadata to disk.

Storing metadata persistently

Let us start with writing our data frame to a CSV file. Unfortunately CSVs do
not support metadata. One of the workarounds is to save a TOML file with
metadata generated using the meta2toml function along the CSV file. The
benefit of this approach is that both files are human-readable (and editable).
Let us try saving such files and then reading them back:

julia> CSV.write("health.csv", health_df)
"health.csv"

julia> open("healthmeta.toml", "w") do io
           print(io, meta2toml(health_df))
       end

julia> tmp_df = CSV.read("health.csv", DataFrame);

julia> print(meta2toml(tmp_df))
style = true

[colmetadata]

[metadata]

julia> open("healthmeta.toml") do io
           toml2meta!(tmp_df, io)
       end;

julia> print(meta2toml(tmp_df))
style = true

[colmetadata.SH_XPD_TOTL_ZS]
label = ["Health expenditure, total (% of GDP)", "note"]

[colmetadata.country]
label = ["CountryName", "note"]

[colmetadata.year]
label = ["Year", "note"]

[metadata]

julia> isequal(tmp_df, health_df)
true

We indeed see that both the data and metadata are correctly recovered.

Another common data storage format is Parquet. One of the many benefits of
Parquet is that it allows for storing metadata. So let us save healpth_df to
disk and load back a Parquet file:

julia> Parquet2.writefile("health.parquet", health_df)
✏ Parquet2.FileWriter{IOStream}(health.parquet)

julia> tmp2_df = "health.parquet" |> Parquet2.readfile |> DataFrame;

julia> tmp2_df |> meta2toml |> print
style = true

[colmetadata.SH_XPD_TOTL_ZS]
label = ["Health expenditure, total (% of GDP)", "default"]

[colmetadata.country]
label = ["CountryName", "default"]

[colmetadata.year]
label = ["Year", "default"]

[metadata]

julia> isequal(tmp2_df, health_df)
true

All looks good except for metadata style. Unfortunately the limitation of
Parquet format is that it does not have a notion of metadata style. We need
to set it manually:

julia> setcolmetadatastyle!(==("label"), tmp2_df);

julia> tmp2_df |> meta2toml |> print
style = true

[colmetadata.SH_XPD_TOTL_ZS]
label = ["Health expenditure, total (% of GDP)", "note"]

[colmetadata.country]
label = ["CountryName", "note"]

[colmetadata.year]
label = ["Year", "note"]

[metadata]

Finally let us try serialization, if short-term storage is enough for us:

julia> tmp3_df = deserialize("health.bin");

julia> tmp3_df |> meta2toml |> print
style = true

[colmetadata.SH_XPD_TOTL_ZS]
label = ["Health expenditure, total (% of GDP)", "note"]

[colmetadata.country]
label = ["CountryName", "note"]

[colmetadata.year]
label = ["Year", "note"]

[metadata]

julia> isequal(tmp3_df, health_df)
true

In case of serialization both data and metadata are fully preserved.

Understanding metadata styles: part 2

We already know that :default metadata style is never propagated (except the
copy function and the DataFrame constructor, which produce a full copy of
all data and metadata in a data frame). What are the propagation rules for
:note-style metadata? Let me discuss the most common cases by example.

The first rule is that dropping rows or columns does not invalidate
:note-style metadata:

julia> dropmissing!(health_df)
1025×3 DataFrame
  Row │ country      year   SH_XPD_TOTL_ZS
      │ String       Int16  Float64
──────┼────────────────────────────────────
    1 │ Andorra       2002             7.0
    2 │ Andorra       2003             7.1
    3 │ Andorra       2004             7.1
    4 │ Andorra       2005             7.5
    5 │ Andorra       2006             7.4
    6 │ Afghanistan   2002             8.3
  ⋮   │      ⋮         ⋮          ⋮
 1020 │ Zambia        2006             6.2
 1021 │ Zimbabwe      2002             8.1
 1022 │ Zimbabwe      2003             7.6
 1023 │ Zimbabwe      2004             8.4
 1024 │ Zimbabwe      2005             8.9
 1025 │ Zimbabwe      2006             9.3
                          1013 rows omitted

julia> health_df |> meta2toml |> print
style = true

[colmetadata.SH_XPD_TOTL_ZS]
label = ["Health expenditure, total (% of GDP)", "note"]

[colmetadata.country]
label = ["CountryName", "note"]

[colmetadata.year]
label = ["Year", "note"]

[metadata]

We dropped rows from our data frame using the dropmissing! function and we
see that metadata is not changed.

The second rule is that column renaming does keep metadata. In our table the
:SH_XPD_TOTL_ZS name of the column is not very easy to remember. Let us
change it:

julia> select!(health_df, :country, :year, :SH_XPD_TOTL_ZS => :health2gdp);

julia> health_df |> meta2toml |> print
style = true

[colmetadata.country]
label = ["CountryName", "note"]

[colmetadata.health2gdp]
label = ["Health expenditure, total (% of GDP)", "note"]

[colmetadata.year]
label = ["Year", "note"]

[metadata]

Again we see that :note-style metadata is kept.

The third rule is that if we transform a column and keep its name then also
:note-style metadada is kept:

julia> tmp4_df = transform(health_df, :country => ByRow(uppercase) => :country)
1025×3 DataFrame
  Row │ country      year   health2gdp
      │ String       Int16  Float64
──────┼────────────────────────────────
    1 │ ANDORRA       2002         7.0
    2 │ ANDORRA       2003         7.1
    3 │ ANDORRA       2004         7.1
    4 │ ANDORRA       2005         7.5
    5 │ ANDORRA       2006         7.4
    6 │ AFGHANISTAN   2002         8.3
  ⋮   │      ⋮         ⋮        ⋮
 1020 │ ZAMBIA        2006         6.2
 1021 │ ZIMBABWE      2002         8.1
 1022 │ ZIMBABWE      2003         7.6
 1023 │ ZIMBABWE      2004         8.4
 1024 │ ZIMBABWE      2005         8.9
 1025 │ ZIMBABWE      2006         9.3
                      1013 rows omitted

julia> tmp4_df |> meta2toml |> print
style = true

[colmetadata.country]
label = ["CountryName", "note"]

[colmetadata.health2gdp]
label = ["Health expenditure, total (% of GDP)", "note"]

[colmetadata.year]
label = ["Year", "note"]

[metadata]

The above rule is important to remember. It assumes that the user does not
use the same column name to store qualitatively different information. In this
case the transformation was just changing country names to uppercase, which did
not change the information it carries.

However, if we wanted to change the :health2gdp column from percentages to
proportions we should change its name, and then metadata for this column is
dropped (you could add an appropriate label for this column manually later):

julia> tmp5_df = select(health_df, :country, :year,
                        :health2gdp =>
                        ByRow(x -> x / 100) =>
                        :health2gdp_prop)
1025×3 DataFrame
  Row │ country      year   health2gdp_prop
      │ String       Int16  Float64
──────┼─────────────────────────────────────
    1 │ Andorra       2002            0.07
    2 │ Andorra       2003            0.071
    3 │ Andorra       2004            0.071
    4 │ Andorra       2005            0.075
    5 │ Andorra       2006            0.074
    6 │ Afghanistan   2002            0.083
  ⋮   │      ⋮         ⋮           ⋮
 1020 │ Zambia        2006            0.062
 1021 │ Zimbabwe      2002            0.081
 1022 │ Zimbabwe      2003            0.076
 1023 │ Zimbabwe      2004            0.084
 1024 │ Zimbabwe      2005            0.089
 1025 │ Zimbabwe      2006            0.093
                           1013 rows omitted

julia> tmp5_df |> meta2toml |> print
style = true

[colmetadata.country]
label = ["CountryName", "note"]

[colmetadata.year]
label = ["Year", "note"]

[metadata]

Let me stress this rule again: we changed the meaning of the data in the column
(here by dividing it by 100) so we use another column name to store it. As you
can see metadata was dropped for :healthe2gdp_prop column.

As a side note: not using the same column name for qualitatively different
values is a good practice in general, just as not using the same variable name
to store different values in a single function.

Now let us change the shape of our data from long to wide:

julia> health_wide = unstack(health_df, :country, :year, :health2gdp)
205×6 DataFrame
 Row │ country               2002      2003      2004      2005      2006
     │ String                Float64?  Float64?  Float64?  Float64?  Float64?
─────┼────────────────────────────────────────────────────────────────────────
   1 │ Andorra                7.0        7.1      7.1       7.5       7.4
   2 │ Afghanistan            8.3        8.5      8.5       9.5       9.2
   3 │ Angola                 2.3        2.6      2.0       1.9       2.6
   4 │ Albania                6.3        6.3      6.8       6.6       6.5
   5 │ United Arab Emirates   3.4        3.2      2.9       2.7       2.5
   6 │ Argentina              8.9        8.3      9.6      10.2      10.1
  ⋮  │          ⋮               ⋮         ⋮         ⋮         ⋮         ⋮
 200 │ World                  9.94274   10.0342   9.92719   9.87598   9.76724
 201 │ Samoa                  4.9        5.1      4.9       4.8       5.0
 202 │ Yemen, Rep.            4.7        5.1      5.0       5.2       4.5
 203 │ South Africa           8.3        8.0      8.0       8.0       8.0
 204 │ Zambia                 6.6        6.6      6.5       6.8       6.2
 205 │ Zimbabwe               8.1        7.6      8.4       8.9       9.3
                                                              193 rows omitted

julia> health_wide |> meta2toml |> print
style = true

[colmetadata.country]
label = ["CountryName", "note"]

[metadata]

Note that the :country column retains its metadata (we have not transformed
it), but all other newly generated columns do not have any metadata.

As a last example let us keep only rows related to United States in our data
frame and additionally add caption metadata for this table:

julia> us = health_df[health_df.country .== "United States", :]
5×3 DataFrame
 Row │ country        year   health2gdp
     │ String         Int16  Float64
─────┼──────────────────────────────────
   1 │ United States   2002        14.7
   2 │ United States   2003        15.1
   3 │ United States   2004        15.2
   4 │ United States   2005        15.2
   5 │ United States   2006        15.3

julia> caption!(us, "United States");

julia> us |> meta2toml |> print
style = true

[colmetadata.country]
label = ["CountryName", "note"]

[colmetadata.health2gdp]
label = ["Health expenditure, total (% of GDP)", "note"]

[colmetadata.year]
label = ["Year", "note"]

[metadata]
caption = ["United States", "note"]

We again see that dropping rows keeps :note-style metadata. Also note that
the caption! function creates metadata having :note style.

Plotting the data

We are finally ready to perform the task we were asked to perform. Let us,
using the us data frame, create a plot of how percentage of health
expenditures in GDP of the United States of America changed over several years:

julia> plot(us.year, us.health2gdp,
            xlabel=label(us, :year), ylabel=label(us, :health2gdp),
            title=caption(us), legend=false)

Note that we used table metadata to provide information about axes and title
for the plot. By running the above command you should get the following result:

US health expenditure

Conclusions

Using metadata in DataFrames.jl makes it easier to organize, find and
understand data stored in it.

In this post I mostly concentrated on note metadata (like column labels).
As you saw this kind of metadata is useful when you have wide tables (or many
tables) to help you easily navigate them.

In general it is possible to manage such metadata in a separate data structure
that keeps it outside of data frame object. The benefit of storing metadata
within a data frame is that its propagation is automatically handled when
data frame is transformed. Therefore DataFrames.jl introduces two styles of
metadata:

  • :default style: this is meant for metadata that is volatile, and can get
    invalidated under transformations (in our case number of rows or columns of
    a data frame were such metadata in the original df data frame);
  • :note style: this is meant for metadata that you want to be retained as
    the data frame is transformed (in our example we wanted to keep column
    labels).

TableMetadataTools.jl provides convenience methods helping you to
perform most common operations on metadata. Of course you can always use
DataAPI.jl low-level metadata API if you need a fine-grained control over
metadata. TableMetadataTools.jl is still in experimental development phase if
you have any comments or feature requests related to it please leave an issue
on GitHub.

Apart from various note-like metadata can be useful also in programmatic
contexts. Here are some examples:

  • GeoDataFrames.jl uses metadata to store information which columns are
    geometry columns and about coordinate reference system; note that such
    approach has a benefit that this package does not have to create its own data
    type; instead it was enough to store metadata information in a data frame
    to allow for extra features (this is much easier to do and much cheaper to
    maintain – assuming the package likes DataFrame as an underlying object).
  • TableMetadataTools.jl provides @track macro. This macro allows
    you to automatically store in data frame metadata a time-stamped information
    about all operations that were applied to this data frame. This is a
    low-overhead method to ensure that you can perform lineage analysis of your
    data (of course more complicated mechanisms are possible to be implemented,
    but I wanted to provide something that is as lightweight as possible that
    does the job). Such functionalities become more and more important in the
    MLOps and DataOps era.

The mighty trio: Project Euler, Simpson’s paradox, and DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/11/25/paradox.html

Introduction

Recently, following users’ feedback, I started solving more puzzles in my blog.
To choose a problem for today I went to Project Euler page, which
provides hundreds of very nice puzzles. I wanted to avoid solving some very
easy problem, so I decided to pick the first puzzle from the list that has
less than 1000 people who solved it. This lead me to problem 236.

To my surprise the problem is an example of the Simpson’s paradox,
so hopefully if you are a data scientist, you will find it interesting.

In general Project Euler does not encourage posting solutions to the problems
given in that page. For this reason, on purpose, I have not optimized my code
and the solution is brute force (to encourage you to find an elegant solution;
the presented solution takes several minutes to produce the result).
Also, I decided to show the code, but not the value of the final solution
(to encourage you to try solving it).

So, why I think it is interesting to read this post? I solved the puzzle
mostly using DataFrames.jl, so I hope you can find some useful data
wrangling patterns when reading it.

The presented codes were run under Julia 1.8.2, DataFrames.jl 1.4.3, and
DataFramesMeta.jl 0.12.0. The solution requires 32GB of RAM (this is, again, on
purpose; if you have less RAM I encourage you to optimize my code as a part of
the Project Euler challenge).

Simpson’s paradox

Simpson’s paradox is a phenomenon in which data analyzed in groups
presents a different direction of relationships than if you analyze the data
after aggregating it.

To show you the paradox in practice let me introduce the
Project Euler problem 236. The description is copied from Project Euler
page:


Suppliers ‘A’ and ‘B’ provided the following numbers of products for the luxury
hamper market:

Product             'A'     'B'
Beluga Caviar       5248     640
Christmas Cake      1312    1888
Gammon Joint        2624    3776
Vintage Port        5760    3776
Champagne Truffles  3936    5664

Although the suppliers try very hard to ship their goods in perfect condition,
there is inevitably some spoilage – i.e. products gone bad.

The suppliers compare their performance using two types of statistic:

  • The five per-product spoilage rates for each supplier are equal to the number
    of products gone bad divided by the number of products supplied, for each of
    the five products in turn.
  • The overall spoilage rate for each supplier is equal to the total number of
    products gone bad divided by the total number of products provided by that
    supplier.

To their surprise, the suppliers found that each of the five per-product
spoilage rates was worse (higher) for ‘B’ than for ‘A’ by the same factor
(ratio of spoilage rates), m>1; and yet, paradoxically, the overall spoilage
rate was worse for ‘A’ than for ‘B’, also by a factor of m.

What’s the largest possible value of m?


This situation is an example of Simpson’s paradox. Within-groups supplier
‘B’ has more spoilage than supplier ‘A’, however in aggregate the situation is
reversed. Before we go and try solving the puzzle let us check if such a
situation is possible.

Assume that the amount of spoilage per product and supplier is as follows:

Product             'A'     'B'
Beluga Caviar       2478    630
Christmas Cake        16     48
Gammon Joint          33     99
Vintage Port         180    246
Champagne Truffles    23     69

We can check that the spoilage percentages are approximately:

Product             'A'     'B'
Beluga Caviar       47.22%  98.44%
Christmas Cake       1.22%   2.54%
Gammon Joint         1.26%   2.62%
Vintage Port         3.13%   6.51%
Champagne Truffles   0.58%   1.22%

As you can see supplier ‘B’ has over 2 times higher spoilage percentage on
every product (you can check that the spoilage percentage ratio is the same for
all products and is approximately equal to 2.085).

Now let us check aggregates. Supplier ‘A’ delivered 18880 products of which
2730 were spoiled. Supplier ‘B’ delivered 15744 products of which 1092 were
spoiled. So the aggregate spoilage percentage for supplier ‘A’ is approximately
14.46%, and for supplier ‘B’ 6.94%. Interestingly, supplier ‘A’ has over two
times higher spoilage percentage than supplier ‘B’ (astonishingly the ratio
of spoilage percentage is the same as above, but in reverse direction).

Solving the puzzle

Let us now solve the puzzle and find the largest possible ratio m of spoilage
percentages between suppliers ‘A’ and ‘B’ that meets the conditions of the
puzzle.

Start by loading the required DataFramesMeta.jl package and the data:

julia> using DataFramesMeta

julia> const pv = [(a=5248, b=640),
                   (a=1312, b=1888),
                   (a=2624, b=3776),
                   (a=5760, b=3776),
                   (a=3936, b=5664)]
5-element Vector{NamedTuple{(:a, :b), Tuple{Int64, Int64}}}:
 (a = 5248, b = 640)
 (a = 1312, b = 1888)
 (a = 2624, b = 3776)
 (a = 5760, b = 3776)
 (a = 3936, b = 5664)

julia> const pt = (a=sum(x -> x.a, pv), b=sum(x -> x.b, pv))
(a = 18880, b = 15744)

The pv vector holds the amounts of products delivered by suppliers ‘A’ and
‘B’ and the pt named tuple stores their totals.

We know that for each product-supplier combination spoilage is at least 1.
Using the allcombinations function we create data frames containing all
possible compositions of spoilage amounts for each product and for total, and
store them in dfs vector:

julia> dfs = [allcombinations(DataFrame, a=1:p.a, b=1:p.b) for p in pv];

julia> push!(dfs, allcombinations(DataFrame, a=1:pt.a, b=1:pt.b));

julia> nrow.(dfs)
6-element Vector{Int64}:
   3358720
   2477056
   9908224
  21749760
  22293504
 297246720

We can see that the number of options is large, but seems within range what
current computers can handle.

Before we proceed let us peek at the contents of one of the data frames to
be sure that its structure is as expected:

julia> dfs[1]
3358720×2 DataFrame
     Row │ a      b
         │ Int64  Int64
─────────┼──────────────
       1 │     1      1
       2 │     2      1
       3 │     3      1
    ⋮    │   ⋮      ⋮
 3358718 │  5246    640
 3358719 │  5247    640
 3358720 │  5248    640
    3358714 rows omitted

All looks good.

Now for each combination we need to compute the m ratio, remembering that for
totals it should be reversed. Additionally, we want to collect columns :a and
:b into one column holding a tuple of values stored in them, as it will be
easier to work with such data later and only keep options for which m is
greater than 1. We can easily do both tasks using macros provided by
DataFramesMeta.jl:

julia> ms = Function[(x, y) -> (y * p.a) // (x * p.b) for p in pv];

julia> push!(ms, (x, y) -> (x * pt.b) // (y * pt.a));

julia> foreach(dfs, ms) do df, m
           @rselect!(df, :m = m(:a, :b), :ab = (:a, :b))
           @rsubset!(df, :m > 1)
       end

Note that for each data frame I defined its own function computing the m
ratio and stored it in ms vector.

Again, let us check how the data frames look after the transformation:

julia> nrow.(dfs)
6-element Vector{Int64}:
   1681600
   1238224
   4953504
  10875840
  11145840
 148621760

julia> dfs[1]
1681600×2 DataFrame
     Row │ m           ab
         │ Rational…   Tuple…
─────────┼─────────────────────────
       1 │      41//5  (1, 1)
       2 │     41//10  (2, 1)
       3 │     41//15  (3, 1)
    ⋮    │     ⋮            ⋮
 1681598 │ 5248//5245  (5245, 640)
 1681599 │ 2624//2623  (5246, 640)
 1681600 │ 5248//5247  (5247, 640)
               1681594 rows omitted

We notice that we dropped around 50% of rows (this is expected). Observe that
using the // operator in the functions calculating m we get rational number
as a result, so our computations are exact, without any rounding.

Let us now group the data frames by the m ratio value (remember we want
a solution where this ratio is the same for all products and for total):

julia> gdfs = groupby.(dfs, :m);

julia> length.(gdfs)
6-element Vector{Int64}:
  1021210
   753076
  3012264
  6612288
  6777256
 90353608

We see that we have quite a lot groups. However we are interested only in cases
where m matches in all the tables, so let us now find such candidate m
values:

julia> key_tuples = [@combine(gdf, :mt = (first(:m),)).mt for gdf in gdfs];

julia> match_keys = intersect(key_tuples...);

julia> sort!(match_keys, rev=true)
7040-element Vector{Tuple{Rational{Int64}}}:
 (1230//1,)
 (738//1,)
 (615//1,)
 ⋮
 (1230//1229,)
 (1476//1475,)
 (3895//3894,)

Fortunately we see that only 7040 possible values of m are available.
I have sorted the matching keys in a descending order to have highest m
values first.

You might wonder why I used @combine to convert the ratios into tuples
holding one-element with this ratio. The reason is that GroupedDataFrame
supports a fast group lookup by the value of the grouping variable and one of
the options of such lookup is to pass a tuple storing it. Therefore we can
create groups vector that stores 6-element vectors of data frames
representing the five products and the total that have matching m:

julia> groups = [[gdfs[i][key] for i in 1:6] for key in match_keys];

Note that this operation in DataFrames.jl is memory efficient. Indexing into
a GroupedDataFrame to get a single group returns a SubDataFrame, which
is a view of the source data frame.

As before, let us peek at the result:

julia> groups[1]
6-element Vector{SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}}:
 4×2 SubDataFrame
 Row │ m          ab
     │ Rational…  Tuple…
─────┼─────────────────────
   1 │   1230//1  (1, 150)
   2 │   1230//1  (2, 300)
   3 │   1230//1  (3, 450)
   4 │   1230//1  (4, 600)
 1×2 SubDataFrame
 Row │ m          ab
     │ Rational…  Tuple…
─────┼──────────────────────
   1 │   1230//1  (1, 1770)
 2×2 SubDataFrame
 Row │ m          ab
     │ Rational…  Tuple…
─────┼──────────────────────
   1 │   1230//1  (1, 1770)
   2 │   1230//1  (2, 3540)
 1×2 SubDataFrame
 Row │ m          ab
     │ Rational…  Tuple…
─────┼──────────────────────
   1 │   1230//1  (3, 2419)
 3×2 SubDataFrame
 Row │ m          ab
     │ Rational…  Tuple…
─────┼──────────────────────
   1 │   1230//1  (1, 1770)
   2 │   1230//1  (2, 3540)
   3 │   1230//1  (3, 5310)
 12×2 SubDataFrame
 Row │ m          ab
     │ Rational…  Tuple…
─────┼────────────────────────
   1 │   1230//1  (1475, 1)
   2 │   1230//1  (2950, 2)
   3 │   1230//1  (4425, 3)
   4 │   1230//1  (5900, 4)
   5 │   1230//1  (7375, 5)
   6 │   1230//1  (8850, 6)
   7 │   1230//1  (10325, 7)
   8 │   1230//1  (11800, 8)
   9 │   1230//1  (13275, 9)
  10 │   1230//1  (14750, 10)
  11 │   1230//1  (16225, 11)
  12 │   1230//1  (17700, 12)

So the highest potentially possible m ratio is 1230. But is it valid? Note
that we have not checked one condition yet. The total spoilage (one of the
options in the last data frame displayed above) must be a sum of five
per-product spoilages.

So we need to drop these m ratios for which this is not possible. To achieve
this we define the test_options helper function:

julia> function test_options(inv, outs)
           v = copy(inv[1])
           for i in 2:4
               s = Set{Tuple{Int32, Int32}}()
               for a in v, b in inv[i]
                   push!(s, a .+ b)
               end
               v = collect(s)
           end
           for a in v, b in inv[5]
               a .+ b in outs && return true
           end
           return false
       end
test_options (generic function with 1 method)

This function takes two arguments inv which is a vector of :ab spoilages
of the five products and outs which is a Set of total spoilages.

The function works as follows. We check what possible total sums of five
product spoilages we can get (we do it iteratively, starting from product one,
then adding options for consecutive products). Then we check if the totals of
spoilages for both suppliers can be found in the outs collection (and that is
the reason this is a Set, as sets support fast lookup). If this is the case
we return true. If it is not possible we return false.

Note that the trick in this function, to make it feasible computationally
(again good enough – the solution is not optimal, this is a challenge for you),
is the incremental computation of possible (a, b) product total sums and
reduction of possible options after each iteration in a Set. From the Julia
programming perspective two lines of code are relevant. The first is
v = copy(inv[1]) and the second is v = collect(s). They make the function
type-stable, as they ensure that the v variable over which we iterate a lot
within a function is always bound to the object that has
Vector{Tuple{Int, Int}} type that can be inferred by the compiler.

Let use use the test_options function to filter out the m values that
meet the condition of our puzzle:

julia> answer = [g[1].m[1] for g in groups
                 if test_options([g[i].ab for i in 1:5], Set(g[6].ab))];

julia> first(answer);

Since we have sorted m in descending order above first(answer) is a solution
to our puzzle. I have put ; at the end of all expressions to hide it.

Though, let me show you the data that was disclosed by the puzzle authors on
the Project Euler problem 236 page:

julia> length(answer)
35

julia> last(answer)
1476//1475

Indeed we see that there are 35 unique values of m that are feasible
and that the smallest of them is 1476/1575, so it is quite likely that the
code works correctly.

Conclusions

I hope you enjoyed the puzzle, the Simpson’s paradox example, and the solution
using DataFrames.jl. If you would like to check your programming skills I
encourage you to improve over my solution (an efficient code solving the puzzle
runs in under 1 second).

Pivot tables in DataFrames.jl 1.4

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/11/18/unstack.html

Introduction

Some time ago I have written a post about the unstack function
in DataFrames.jl. Since DataFrames.jl 1.4 release this function has received
a major update of offered functionality. Currently it is possible to
easily create pivot tables using unstack as I am going to show you today.

The example we will use is simulation analysis of graphs, as in May 2023
I am co-organizing
18th Workshop on Algorithms and Models for the Web Graph.
If you are interested in this topic please come and join us. If you would
like to present your work, here is the Call For Papers.

The post was written under Julia 1.8.2, Graphs.jl 1.7.4,
ProgressMeter.jl 1.7.2, Plots.jl 1.36.2, and DataFrames.jl 1.4.3.

Generating the data

We will consider Erdős–Rényi random graphs. These graphs take two
parameters n (the number of nodes in the graph) and p edge probability
(independently for each edge). It can be calculated thus, that the expected
number of edges in this graph is pn(n-1)/2, so expected average node degree
is d=p(n-1). So, given n and d we can calculate p=d/(n-1). In Graphs.jl
you can generate these graphs using the erdos_renyi function.

Connected components of a graph are equivalence classes of a
reachability relation between pairs of nodes. Informally: in a connected
component all nodes can be reached from each other and no nodes outside
from a connected component can be reached from it.

Let me explain it by example:

julia> using Graphs

julia> using Random

julia> Random.seed!(1234);

julia> er = erdos_renyi(7, 0.2)
{7, 4} undirected simple Int64 graph

julia> collect(edges(er))
4-element Vector{Graphs.SimpleGraphs.SimpleEdge{Int64}}:
 Edge 1 => 6
 Edge 3 => 7
 Edge 4 => 6
 Edge 5 => 6

julia> connected_components(er)
3-element Vector{Vector{Int64}}:
 [1, 4, 5, 6]
 [2]
 [3, 7]

We created a random graph on 7 nodes with edge probability equal to 0.2.
It has 4 edges. Note, that our graph is undirected, so although Graphs.jl
displays edges like this: 1 => 6, the edges are bi-directional.
We note that node 6 is connected to nodes 1, 4, and 5. So they
form one connected component. There is also an edge between nodes 3 and 7,
so this is a second connected component. Finally node 2 is isolated, giving
us a third component. We can easily find the connected components of a graph
using the connected_components function.

We will want to investigate how the size of the largest connected component
of the Erdős–Rényi random graph depends on n and d using simulation.

We generate the data for n having values [100, 1000, 10_000, 100_000]
and d having values 0.5:0.1:2.0. For each combination
of values we repeat the experiment 64 times.

First create an initial data frame with the setup of the experiment:

julia> using DataFrames

julia> df = allcombinations(DataFrame,
                            d=0.5:0.1:2.0,
                            n=[100, 1000, 10_000, 100_000],
                            rep=1:64)
4096×3 DataFrame
  Row │ d        n       rep   
      │ Float64  Int64   Int64 
──────┼────────────────────────
    1 │     0.5     100      1
    2 │     0.6     100      1
    3 │     0.7     100      1
    4 │     0.8     100      1
    5 │     0.9     100      1
  ⋮   │    ⋮       ⋮       ⋮
 4092 │     1.6  100000     64
 4093 │     1.7  100000     64
 4094 │     1.8  100000     64
 4095 │     1.9  100000     64
 4096 │     2.0  100000     64
              4086 rows omitted

Now we compute p for each experiment:

julia> df.p = df.d ./ (df.n .- 1);

julia> df
4096×4 DataFrame
  Row │ d        n       rep    p
      │ Float64  Int64   Int64  Float64     
──────┼─────────────────────────────────────
    1 │     0.5     100      1  0.00505051
    2 │     0.6     100      1  0.00606061
    3 │     0.7     100      1  0.00707071
    4 │     0.8     100      1  0.00808081
    5 │     0.9     100      1  0.00909091
  ⋮   │    ⋮       ⋮       ⋮         ⋮
 4092 │     1.6  100000     64  1.60002e-5
 4093 │     1.7  100000     64  1.70002e-5
 4094 │     1.8  100000     64  1.80002e-5
 4095 │     1.9  100000     64  1.90002e-5
 4096 │     2.0  100000     64  2.00002e-5
                           4086 rows omitted

Finally for each row we compute the fraction of nodes in the largest component
(the computation takes some time, so I use @showprogress to monitor it):

julia> using ProgressMeter

julia> df.largest = @showprogress map(eachrow(df)) do row
           er = erdos_renyi(row.n, row.p)
           cc = connected_components(er)
           return maximum(length, cc) / row.n
       end;
Progress: 100%|███████████████████████████████████████████████| Time: 0:00:39

julia> df
4096×5 DataFrame
  Row │ d        n       rep    p            largest 
      │ Float64  Int64   Int64  Float64      Float64 
──────┼──────────────────────────────────────────────
    1 │     0.5     100      1  0.00505051   0.04
    2 │     0.6     100      1  0.00606061   0.07
    3 │     0.7     100      1  0.00707071   0.22
    4 │     0.8     100      1  0.00808081   0.29
    5 │     0.9     100      1  0.00909091   0.35
  ⋮   │    ⋮       ⋮       ⋮         ⋮          ⋮
 4092 │     1.6  100000     64  1.60002e-5   0.64289
 4093 │     1.7  100000     64  1.70002e-5   0.69333
 4094 │     1.8  100000     64  1.80002e-5   0.73421
 4095 │     1.9  100000     64  1.90002e-5   0.77021
 4096 │     2.0  100000     64  2.00002e-5   0.79805
                                    4086 rows omitted

Now we are ready to analyze this data using unstack.

Analyzing the results

First let me recall the general structure of the unstack call
(please refer to its documentation for more details):

unstack([data_frame],
        [columns to use as row keys],
        [column storing column names],
        [column storing the values];
        combine=[function to apply to the values])

The structure is easy to remember. After a data frame, we sequentially pass
what we want in rows, what we want in columns, what values we want to unstack,
and finally, a combine keyword argument (introduced in DataFrames.jl 1.4)
specifies what operation we want to apply to these values.

By default combine is only, which means that we want a unique value in
row-column combination and otherwise we get an error. Let us check how
unstack works by default by using [:d, :rep] for rows, :n for columns,
and :largest for values:

julia> unstack(df, [:d, :rep], :n, :largest)
1024×6 DataFrame
  Row │ d        rep    100       1000      10000     100000   
      │ Float64  Int64  Float64?  Float64?  Float64?  Float64? 
──────┼────────────────────────────────────────────────────────
    1 │     0.5      1      0.04     0.009    0.0017   0.00021
    2 │     0.6      1      0.07     0.046    0.0024   0.00035
    3 │     0.7      1      0.22     0.011    0.0042   0.00048
    4 │     0.8      1      0.29     0.019    0.01     0.00139
    5 │     0.9      1      0.35     0.027    0.0109   0.00166
  ⋮   │    ⋮       ⋮       ⋮         ⋮         ⋮         ⋮
 1020 │     1.6     64      0.63     0.665    0.6419   0.64289
 1021 │     1.7     64      0.69     0.683    0.6941   0.69333
 1022 │     1.8     64      0.33     0.746    0.7389   0.73421
 1023 │     1.9     64      0.84     0.761    0.7696   0.77021
 1024 │     2.0     64      0.83     0.818    0.7991   0.79805
                                               1104 rows omitted

Everything worked, because for each cell we had a unique combination of row and
column keys. However, if we, for example, dropped :rep, we would get an
error:

julia> unstack(df, :d, :n, :largest)
ERROR: ArgumentError: Duplicate entries in unstack
at row 65 for key (0.5,) and variable 100.
Pass `combine` keyword argumen t to specify how they should be handled.

Let us start with passing combine=copy to just store all values of
:largest per row-column combination:

julia> unstack(df, :d, :n, :largest, combine=copy)
16×5 DataFrame
 Row │ d        100                                1000                              
     │ Float64  Array…?                            Array…?                          
─────┼───────────────────────────────────────────────────────────────────
   1 │     0.5  [0.04, 0.06, 0.05, 0.05, 0.08, 0…  [0.009, 0.015, 0.009,
   2 │     0.6  [0.07, 0.07, 0.05, 0.07, 0.04, 0…  [0.046, 0.013, 0.018,
   3 │     0.7  [0.22, 0.13, 0.05, 0.09, 0.18, 0…  [0.011, 0.019, 0.024,
   4 │     0.8  [0.29, 0.11, 0.08, 0.09, 0.1, 0.…  [0.019, 0.068, 0.018,
   5 │     0.9  [0.35, 0.19, 0.19, 0.19, 0.11, 0…  [0.027, 0.029, 0.057,
   6 │     1.0  [0.42, 0.12, 0.15, 0.22, 0.16, 0…  [0.11, 0.068, 0.042, 
   7 │     1.1  [0.26, 0.32, 0.23, 0.69, 0.36, 0…  [0.146, 0.085, 0.266,
   8 │     1.2  [0.45, 0.09, 0.35, 0.42, 0.25, 0…  [0.232, 0.175, 0.338,
   9 │     1.3  [0.49, 0.47, 0.12, 0.17, 0.67, 0…  [0.376, 0.504, 0.434,
  10 │     1.4  [0.36, 0.45, 0.49, 0.49, 0.47, 0…  [0.448, 0.527, 0.443,
  11 │     1.5  [0.72, 0.6, 0.65, 0.57, 0.69, 0.…  [0.577, 0.474, 0.581,
  12 │     1.6  [0.31, 0.77, 0.41, 0.42, 0.73, 0…  [0.639, 0.584, 0.644,
  13 │     1.7  [0.78, 0.62, 0.68, 0.7, 0.67, 0.…  [0.673, 0.651, 0.653,
  14 │     1.8  [0.86, 0.91, 0.7, 0.71, 0.76, 0.…  [0.721, 0.739, 0.748,
  15 │     1.9  [0.8, 0.76, 0.79, 0.83, 0.76, 0.…  [0.791, 0.794, 0.82, 
  16 │     2.0  [0.7, 0.74, 0.87, 0.88, 0.83, 0.…  [0.809, 0.776, 0.792,
                                                        3 columns omitted

Sometimes such operation is useful, but often we are interested in some
aggregates.

First let us check if indeed the data was generated correctly, i.e. if the
grid indeed has 64 experiments per combination of parameters. For this we use
the combine=length keyword argument (as we just want to check the number
of elements per dn combination:

julia> unstack(df, :d, :n, :largest, combine=length)
16×5 DataFrame
 Row │ d        100     1000    10000   100000 
     │ Float64  Int64?  Int64?  Int64?  Int64?
─────┼─────────────────────────────────────────
   1 │     0.5      64      64      64      64
   2 │     0.6      64      64      64      64
   3 │     0.7      64      64      64      64
   4 │     0.8      64      64      64      64
   5 │     0.9      64      64      64      64
   6 │     1.0      64      64      64      64
   7 │     1.1      64      64      64      64
   8 │     1.2      64      64      64      64
   9 │     1.3      64      64      64      64
  10 │     1.4      64      64      64      64
  11 │     1.5      64      64      64      64
  12 │     1.6      64      64      64      64
  13 │     1.7      64      64      64      64
  14 │     1.8      64      64      64      64
  15 │     1.9      64      64      64      64
  16 │     2.0      64      64      64      64

All looks good. So now we can check the mean and standard deviation of size of
the largest connected component for each combination:

julia> using Statistics

julia> unstack(df, :d, :n, :largest, combine=mean)
16×5 DataFrame
 Row │ d        100        1000       10000       100000      
     │ Float64  Float64?   Float64?   Float64?    Float64?
─────┼────────────────────────────────────────────────────────
   1 │     0.5  0.0525     0.0109375  0.00178125  0.000269219
   2 │     0.6  0.0776562  0.0159844  0.00266563  0.00040125
   3 │     0.7  0.0982813  0.0222969  0.00409688  0.000635625
   4 │     0.8  0.134375   0.031875   0.00668906  0.00121281
   5 │     0.9  0.161562   0.0485625  0.0136969   0.00312172
   6 │     1.0  0.205937   0.0905312  0.0445984   0.0187338
   7 │     1.1  0.265781   0.150547   0.149169    0.175325
   8 │     1.2  0.355625   0.272266   0.307183    0.313707
   9 │     1.3  0.375156   0.403047   0.421645    0.423478
  10 │     1.4  0.455313   0.503437   0.507891    0.510792
  11 │     1.5  0.594062   0.57925    0.580353    0.582628
  12 │     1.6  0.596719   0.63575    0.6428      0.641092
  13 │     1.7  0.697656   0.689906   0.691744    0.691409
  14 │     1.8  0.738281   0.730672   0.732986    0.732874
  15 │     1.9  0.76125    0.763125   0.768134    0.767024
  16 │     2.0  0.8025     0.797609   0.796216    0.796993

julia> agg_std = unstack(df, :d, :n, :largest, combine=std)
16×5 DataFrame
 Row │ d        100        1000        10000        100000      
     │ Float64  Float64?   Float64?    Float64?     Float64?
─────┼──────────────────────────────────────────────────────────
   1 │     0.5  0.0184305  0.00263598  0.000366829  4.57843e-5
   2 │     0.6  0.0422292  0.00599071  0.000623411  8.65108e-5
   3 │     0.7  0.0538238  0.00819539  0.00102446   0.000125444
   4 │     0.8  0.0635928  0.0138844   0.00234887   0.000429261
   5 │     0.9  0.0928426  0.0202656   0.00613093   0.00137704
   6 │     1.0  0.102581   0.0406045   0.026229     0.0111048
   7 │     1.1  0.141442   0.0748327   0.058437     0.0145289
   8 │     1.2  0.15247    0.0863053   0.0289068    0.00968446
   9 │     1.3  0.15323    0.0694691   0.0170002    0.00647732
  10 │     1.4  0.148399   0.0527765   0.0118846    0.00492798
  11 │     1.5  0.118987   0.0439715   0.0121344    0.00403464
  12 │     1.6  0.150407   0.0286661   0.00957977   0.00351468
  13 │     1.7  0.113833   0.0408923   0.00952214   0.00307108
  14 │     1.8  0.0969627  0.021692    0.00890106   0.00240749
  15 │     1.9  0.0829324  0.0224821   0.00788557   0.00210828
  16 │     2.0  0.0755929  0.0213895   0.0066356    0.00233577

What we can notice (I will not go into mathematics of these results; however,
if you find such computations interesting please be sure to join us during
WAW2023 conference):

  • as we increase average node degree the average size of the largest component
    increases;
  • there seems to be a sharp change for average degree above 1; it is especially
    visible as we increase number of nodes in the graph;
  • the results have largest standard deviation around average degree equal to 1;
    it drops for low and high values of average degree; also the standard
    deviation in general decreases as we increase number of nodes in the graph.

Let me additionally visualize the last observation:

julia> using Plots

julia> plot(agg_std.d, Matrix(agg_std[:, 2:end]),
            xlabel="average degree",
            ylabel="std of largest component relative size",
            labels=permutedims(names(agg_std, Not(:d))))

Here is a generated plot:

Standard deviation of largest component relative size

Conclusions

I hope you will find the new combine capabilities of unstack useful.

In DataFrames.jl 1.5 release we plan to add even more capabilities to unstack
(like allowing to use multiple columns to generate columns, or allowing to
process multiple value columns). I will post about it when it is done.