Category Archives: Julia

TableMetadataTools.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/01/13/tmt.html

Introduction

Before I begin let me make a small announcement. In my last post I
have advertised an introductory Julia course for data science that will start
next week at MIT. If someone is interested in the material we are going to
cover you can find it here. There is also an option to attend the course
remotely. The content will be streamed via MS Teams. If you would like to join
please register yourself here.

Now, back to the usual business. This week I have registered the
TableMetadataTools.jl package. It aims to make common operations
on table metadata more convenient.

Recently I have written about how one can work with metadata in
this post. Today, I want to give another example of working with
table metadata and discuss the major functionalities that
TableMetadataTools.jl offers. The key difference is that this time
we will use JSON source.

The post was written under Julia 1.8.5, DataFrames.jl 1.4.4, JSON3.jl 1.12.0,
TableMetadataTools.jl 0.1.0, and Parquet2.jl 0.2.5.

The data

This time I chose the Electric Vehicle Population Data data to
analyze. This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in
Hybrid Electric Vehicles (PHEVs) that are currently registered through
Washington State Department of Licensing (DOL).

We will want with the JSON file rows.json that can be downloaded from
that website, so before proceeding please download it to your working folder.
The analysis I present is done for the file that I downloaded on Jan 12, 2023.

In the post we will:

  1. Read the JSON file and load both data and metadata contained in it into a
    data frame.
  2. Perform some simple analysis of the data.
  3. Save the data as Parquet file (since it supports metadata).

(note – some of the output in this post will be wide, so in some displays
it might not look very nice; however, it looks OK if you try it in the terminal
or Jupyter Notebook)

Data ingest

First load the required packages:

julia> using DataFrames

julia> using JSON3

julia> using Parquet2

julia> using Statistics

julia> using TableMetadataTools

Next we read the JSON file to a dictionary-like structure:

julia> data_json = open(JSON3.read, "rows.json")
JSON3.Object{Base.CodeUnits{UInt8, String}, Vector{UInt64}} with 2 entries:  :meta => {…
  :data => JSON3.Array[Any["row-briu~kmbv-8jap", "00000000-0000-0000-6480-…

It is instructive to investigate the structure of this object, as it is complex:

julia> function traverse(obj; level=0)
           if level > 4 || !(obj isa AbstractDict)
               print(" ...")
               return
           end
           for key in keys(obj)
               print("\n", " " ^ level, key, ":")
               traverse(obj[key], level=level+2)
           end
       end
traverse (generic function with 2 methods)

julia> traverse(data_json)

meta:
  view:
    id: ...
    name: ...
    assetType: ...
    attribution: ...
    averageRating: ...
    category: ...
    createdAt: ...
    description: ...
    displayType: ...
    downloadCount: ...
    hideFromCatalog: ...
    hideFromDataJson: ...
    newBackend: ...
    numberOfComments: ...
    oid: ...
    provenance: ...
    publicationAppendEnabled: ...
    publicationDate: ...
    publicationGroup: ...
    publicationStage: ...
    rowsUpdatedAt: ...
    rowsUpdatedBy: ...
    tableId: ...
    totalTimesRated: ...
    viewCount: ...
    viewLastModified: ...
    viewType: ...
    approvals: ...
    clientContext: ...
    columns: ...
    grants: ...
    metadata: ...
    owner: ...
    query: ...
    rights: ...
    tableAuthor: ...
    tags: ...
    flags: ...
data: ...

As you can see we have descended only two levels in the hierarchy and already
there is a lot of information.

Actual data is stored in :data field. The rest is metadata,
of which interesting for us will be :name, :description, and :columns.

Before move forward, let us peek at data_json.meta.view.columns vector:

julia> data_json.meta.view.columns[[1; end]]
2-element Vector{JSON3.Object}:
 {
               "id": -1,
             "name": "sid",
     "dataTypeName": "meta_data",
        "fieldName": ":sid",
         "position": 0,
   "renderTypeName": "meta_data",
           "format": {},
            "flags": [
                       "hidden"
                     ]
}
 {
                    "id": 561974359,
                  "name": "WAOFM - GIS - Legislative District Boundary",
          "dataTypeName": "number",
           "description": "",
             "fieldName": ":@computed_region_8ddd_yn5v",
              "position": 21,
        "renderTypeName": "number",
         "tableColumnId": 87270518,
   "computationStrategy": {
                             "source_columns": [
                                                 "geocoded_column"
                                               ],
                                       "type": "georegion_match_on_point",
                                 "parameters": {
                                                       "region": "_8ddd-yn5v",
                                                  "primary_key": "_feature_id"
                                               }
                          },
                "format": {}
}

(I have displayed only first and last element of the vector contained in this
metadata entry as the rest are similar). This vector stores column-level
metadata. The interesting fields are :name, :fieldName, and :description
(the last is not always present).

Finally, let us investigate data_json.data object. Again, we pick only
the first and last element:

julia> data_json.data[[1; end]]
2-element Vector{JSON3.Array}:
 Any["row-briu~kmbv-8jap", "00000000-0000-0000-6480-436D376BD73A", 0,
 1670541195, nothing, 1670541238, nothing, "{ }", "JTMEB3FV6N",
 "Monroe"  …  "42", "0", nothing, "198968248",
 "POINT (-81.80023 24.5545)", nothing, "12087972100", "51", nothing, nothing]
 Any["row-acyi~zx2n-aq3x", "00000000-0000-0000-BA64-2573B5CAA8F7", 0,
 1670541195, nothing, 1670541258, nothing, "{ }", "YV4BR0CL8N",
 "King"  …  "18", "0", "47", "194673692",
 "POINT (-122.09124 47.33778)", "PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA)",
 "53033032005", "3009", "8", "46"]

We see that the data is stored in rows.

Loading data to a data frame

Having this knowledge let us load the data into a data frame.

First we extract the column-level metadata into a data frame for convenience:

julia> data_colmeta = DataFrame(name=getproperty.(cols, :fieldName),
                                label=getproperty.(cols, :name),
                                note=get.(cols, :description, nothing))
28×3 DataFrame
 Row │ name                         label                              note
     │ String                       String                             Union…
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ :sid                         sid
   2 │ :id                          id
   3 │ :position                    position
   4 │ :created_at                  created_at
   5 │ :created_meta                created_meta
   6 │ :updated_at                  updated_at
   7 │ :updated_meta                updated_meta
   8 │ :meta                        meta
   9 │ vin_1_10                     VIN (1-10)                         The 1st 10 characters of each ve…
  10 │ county                       County                             The county in which the register…
  11 │ city                         City                               The city in which the registered…
  12 │ state                        State                              The state in which the registere…
  13 │ zip_code                     Postal Code                        The 5 digit zip code in which th…
  14 │ model_year                   Model Year                         The model year of the vehicle, d…
  15 │ make                         Make                               The manufacturer of the vehicle,…
  16 │ model                        Model                              The model of the vehicle, determ…
  17 │ ev_type                      Electric Vehicle Type              This distinguishes the vehicle a…
  18 │ cafv_type                    Clean Alternative Fuel Vehicle (…  This categorizes vehicle as Clea…
  19 │ electric_range               Electric Range                     Describes how far a vehicle can …
  20 │ base_msrp                    Base MSRP                          This is the lowest Manufacturer'…
  21 │ legislative_district         Legislative District               The specific section of Washingt…
  22 │ dol_vehicle_id               DOL Vehicle ID                     Unique number assigned to each v…
  23 │ geocoded_column              Vehicle Location                   The center of the ZIP Code for t…
  24 │ electric_utility             Electric Utility                   This is the electric power retai…
  25 │ _2020_census_tract           2020 Census Tract                  The census tract identifier is a…
  26 │ :@computed_region_x4ys_rtnd  Counties
  27 │ :@computed_region_fny7_vc3j  Congressional Districts
  28 │ :@computed_region_8ddd_yn5v  WAOFM - GIS - Legislative Distri…

We now, in a structured way, see that the metadata contains: column name,
column descriptive label, and, optionally, note about column contents.

Assume that from all the available columns we want to keep only few most
interesting. The DATA_COLSKEEP vector keeps column numbers we want to store
in a data frame.

julia> const DATA_COLSKEEP = [9, 12, 14, 15, 17, 18, 19];

julia> keepat!(data_colmeta, DATA_COLSKEEP)
7×3 DataFrame
 Row │ name            label                              note
     │ String          String                             Union…
─────┼──────────────────────────────────────────────────────────────────────────────────────
   1 │ vin_1_10        VIN (1-10)                         The 1st 10 characters of each ve…
   2 │ state           State                              The state in which the registere…
   3 │ model_year      Model Year                         The model year of the vehicle, d…
   4 │ make            Make                               The manufacturer of the vehicle,…
   5 │ ev_type         Electric Vehicle Type              This distinguishes the vehicle a…
   6 │ cafv_type       Clean Alternative Fuel Vehicle (…  This categorizes vehicle as Clea…
   7 │ electric_range  Electric Range                     Describes how far a vehicle can …

Now we are ready to create a data frame with our data. First create its schema
and populate the metadata:

julia> data_df = DataFrame(data_colmeta.name .=> Ref(String[]))
0×7 DataFrame
 Row │ vin_1_10  state   model_year  make    ev_type  cafv_type  electric_range
     │ String    String  String      String  String   String     String
─────┴──────────────────────────────────────────────────────────────────────────

julia> caption!(data_df, data_json.meta.view.name);

julia> note!(data_df, data_json.meta.view.description);

julia> for row in eachrow(data_colmeta)
           label!(data_df, row.name, row.label)
           note!(data_df, row.name, row.note)
       end

Let us check if indeed metadata got properly stored. We can either extract
dictionaries:

julia> metadata(data_df)
Dict{String, String} with 2 entries:
  "note"    => "This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL)."
  "caption" => "Electric Vehicle Population Data"

julia> colmetadata(data_df)
Dict{Symbol, Dict{String, String}} with 7 entries:
  :ev_type        => Dict("label"=>"Electric Vehicle Type", "note"=>"This distinguishes the vehicle as all electric or a plug-in hybrid.")
  :make           => Dict("label"=>"Make", "note"=>"The manufacturer of the vehicle, determined by decoding the Vehicle Identification Number (VIN).")
  :cafv_type      => Dict("label"=>"Clean Alternative Fuel Vehicle (CAFV) Eligibility", "note"=>"This categorizes vehicle as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-o…
  :state          => Dict("label"=>"State", "note"=>"The state in which the registered owner resides.")
  :vin_1_10       => Dict("label"=>"VIN (1-10)", "note"=>"The 1st 10 characters of each vehicle's Vehicle Identification Number (VIN).")
  :electric_range => Dict("label"=>"Electric Range", "note"=>"Describes how far a vehicle can travel purely on its electric charge.")
  :model_year     => Dict("label"=>"Model Year", "note"=>"The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).")

Or get a TOML representation of metadata (the first entry of TOML output means
that we did not store metadata style information):

julia> println(meta2toml(data_df, style=false))
style = false

[colmetadata.cafv_type]
label = "Clean Alternative Fuel Vehicle (CAFV) Eligibility"
note = "This categorizes vehicle as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-only range requirement in House Bill 2042 as passed in the 2019 legislative session."

[colmetadata.electric_range]
label = "Electric Range"
note = "Describes how far a vehicle can travel purely on its electric charge."

[colmetadata.ev_type]
label = "Electric Vehicle Type"
note = "This distinguishes the vehicle as all electric or a plug-in hybrid."

[colmetadata.make]
label = "Make"
note = "The manufacturer of the vehicle, determined by decoding the Vehicle Identification Number (VIN)."

[colmetadata.model_year]
label = "Model Year"
note = "The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN)."

[colmetadata.state]
label = "State"
note = "The state in which the registered owner resides."

[colmetadata.vin_1_10]
label = "VIN (1-10)"
note = "The 1st 10 characters of each vehicle's Vehicle Identification Number (VIN)."

[metadata]
caption = "Electric Vehicle Population Data"
note = "This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL)."

All looks good so far.

Additionally you can easily extract all column labels using the labels function:

julia> labels(data_df)
7-element Vector{String}:
 "VIN (1-10)"
 "State"
 "Model Year"
 "Make"
 "Electric Vehicle Type"
 "Clean Alternative Fuel Vehicle (CAFV) Eligibility"
 "Electric Range"

Now we store the data in our data frame:

julia> foreach(row -> push!(data_df, row[DATA_COLSKEEP]), data_json.data)

julia> show(data_df, truncate=10)
112634×7 DataFrame
    Row │ vin_1_10    state   model_year  make         ev_type      cafv_type    electric_range
        │ String      String  String      String       String       String       String
────────┼───────────────────────────────────────────────────────────────────────────────────────
      1 │ JTMEB3FV6N  FL      2022        TOYOTA       Plug-in Hy…  Clean Alte…  42
      2 │ 1G1RD6E45D  NV      2013        CHEVROLET    Plug-in Hy…  Clean Alte…  38
      3 │ JN1AZ0CP8B  WA      2011        NISSAN       Battery El…  Clean Alte…  73
      4 │ 1G1FW6S08H  WA      2017        CHEVROLET    Battery El…  Clean Alte…  238
      5 │ 3FA6P0SU1K  WA      2019        FORD         Plug-in Hy…  Not eligib…  26
      6 │ 5YJ3E1EB5J  WA      2018        TESLA        Battery El…  Clean Alte…  215
   ⋮    │     ⋮         ⋮         ⋮            ⋮            ⋮            ⋮             ⋮
 112630 │ 7SAYGDEF2N  WA      2022        TESLA        Battery El…  Eligibilit…  0
 112631 │ 1N4BZ1CP7K  WA      2019        NISSAN       Battery El…  Clean Alte…  150
 112632 │ 1FMCU0KZ4N  WA      2022        FORD         Plug-in Hy…  Clean Alte…  38
 112633 │ KNDCD3LD4J  WA      2018        KIA          Plug-in Hy…  Not eligib…  26
 112634 │ YV4BR0CL8N  WA      2022        VOLVO        Plug-in Hy…  Not eligib…  18
                                                                             112623 rows omitted

julia> describe(data_df, :min, :max, :nmissing)
7×4 DataFrame
 Row │ variable        min                                max                                nmissing
     │ Symbol          String                             String                             Int64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ vin_1_10        1C4JJXP60M                         YV4H60DZ9N                                0
   2 │ state           AK                                 WY                                        0
   3 │ model_year      1997                               2023                                      0
   4 │ make            AUDI                               VOLVO                                     0
   5 │ ev_type         Battery Electric Vehicle (BEV)     Plug-in Hybrid Electric Vehicle …         0
   6 │ cafv_type       Clean Alternative Fuel Vehicle E…  Not eligible due to low battery …         0
   7 │ electric_range  0                                  97                                        0

We are almost done. It looks like :model_year and :electric_range are
integer, so we perform an appropriate conversion:

julia> transform!(data_df,
                  [:model_year, :electric_range] .=> ByRow(x -> parse(Int, x)),
                  renamecols=false);

Here one important comment has to be made. Since we keep the old column names
for the transformed columns their "label" and "note" metadata is kept.
The reason is that this metadata originally had :note style. Let us check
that indeed metadata is kept:

julia> colmetadata(data_df, :model_year)
Dict{String, String} with 2 entries:
  "label" => "Model Year"
  "note"  => "The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN)."

julia> colmetadata(data_df, :electric_range)
Dict{String, String} with 2 entries:
  "label" => "Electric Range"
  "note"  => "Describes how far a vehicle can travel purely on its electric charge."

It is an important rule when working with :note-style metadata to keep old
column name only if we do not change the meaning of the column (like in our
case, where we just parsed it).

Doing some analysis of the data

Let us check how many observations we have for combination of :ev_type
and :cafv_type columns:

julia> data_gdf = groupby(data_df, [:ev_type, :cafv_type], sort=true)

julia> data_agg1 = combine(data_gdf, nrow, :electric_range .=> extrema)
5×4 DataFrame
 Row │ ev_type                            cafv_type                          nrow   electric_range_extrema
     │ String                             String                             Int64  Tuple{Int64, Int64}
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Battery Electric Vehicle (BEV)     Clean Alternative Fuel Vehicle E…  46799  (39, 337)
   2 │ Battery Electric Vehicle (BEV)     Eligibility unknown as battery r…  39236  (0, 0)
   3 │ Battery Electric Vehicle (BEV)     Not eligible due to low battery …      9  (29, 29)
   4 │ Plug-in Hybrid Electric Vehicle …  Clean Alternative Fuel Vehicle E…  11840  (30, 153)
   5 │ Plug-in Hybrid Electric Vehicle …  Not eligible due to low battery …  14750  (6, 29)

The results seem interesting, but we most likely cannot immediately understand
them. Let us check the column metadata:

julia> colmetadata(data_agg1)
Dict{Symbol, Dict{String, String}} with 2 entries:
  :ev_type   => Dict("label"=>"Electric Vehicle Type", "note"=>"This distinguishes the vehicle as all electric or a plug-in hybrid.")
  :cafv_type => Dict("label"=>"Clean Alternative Fuel Vehicle (CAFV) Eligibility", "note"=>"This categorizes vehicle as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-only r…

We can see that only :ev_type and :cafv_type columns have metadata. This is
intended. These columns were not transformed so their metadata is kept because
it had :note style. On the other hand the :electric_range column was
transformed, by taking its extrema so its metadata was not propagated.

Now we can more easily understand the results. The :ev_type column tells us
if the vehicle is electric only or hybrid. On the other hand :cafv_type
column tells us if the vehicle meets the requirements of Clean Alternative
Fuel Vehicles.

Interestingly, for some BEV cars it is not known if they meet CAFV condition,
and this is indicated by 0 in their :electric_range. For other cases we see
that 30 is a minimal range that allows to classify vehicle as meeting CAFV.
Surprisingly six BEV do not meet it, as their battery range is too low.

Let us check how does the frequency of electric range equal to 0 behaves:

julia> data_agg2 = sort(unstack(data_df,
                                :model_year,
                                :ev_type,
                                :electric_range,
                                combine = x -> mean(x .> 0)))
20×3 DataFrame
 Row │ model_year  Plug-in Hybrid Electric Vehicle (PHEV)  Battery Electric Vehicle (BEV)
     │ Int64       Union{Missing, Float64}                 Union{Missing, Float64}
─────┼────────────────────────────────────────────────────────────────────────────────────
   1 │       1997                               missing                         1.0
   2 │       1998                               missing                         1.0
   3 │       1999                               missing                         1.0
   4 │       2000                               missing                         1.0
   5 │       2002                               missing                         1.0
   6 │       2008                               missing                         1.0
   7 │       2010                               missing                         1.0
   8 │       2011                                     1.0                       1.0
   9 │       2012                                     1.0                       1.0
  10 │       2013                                     1.0                       1.0
  11 │       2014                                     1.0                       1.0
  12 │       2015                                     1.0                       1.0
  13 │       2016                                     1.0                       1.0
  14 │       2017                                     1.0                       1.0
  15 │       2018                                     1.0                       1.0
  16 │       2019                                     1.0                       0.999763
  17 │       2020                                     1.0                       0.994388
  18 │       2021                                     1.0                       0.0327439
  19 │       2022                                     1.0                       0.0
  20 │       2023                                     1.0                       0.0

We see that indeed the newest BEV do not have the range data (because it was not
collected yet). Also we learn that hybrids were introduced only in year 2011.

As above we can check that column metadata is only kept for :model_year column
since we did not transform it:

julia> colmetadata(data_agg2)
Dict{Symbol, Dict{String, String}} with 1 entry:
  :model_year => Dict("label"=>"Model Year", "note"=>"The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).")

Saving the data frame while keeping metadata

The meta2toml function was designed to allow for easy saving of metadata
to an independent file. However, I find it more convenient to save metadata
along with data. Here Parquet2.jl comes handy.

Let us work with data_agg2 data frame:

julia> Parquet2.writefile("data_agg2.parquet", data_agg2)
✏ Parquet2.FileWriter{IOStream}(data_agg2.parquet)

julia> data_pq = Parquet2.Dataset("data_agg2.parquet")
≔ Parquet2.Dataset (928 bytes)
        1. "model_year": Int64
        2. "Plug-in Hybrid Electric Vehicle (PHEV)": Union{Missing, Float64}
        3. "Battery Electric Vehicle (BEV)": Union{Missing, Float64}


julia> data_pq_df = DataFrame(data_pq);

We can check that both data_pq and data_pq_df have metadata information:

julia> println(meta2toml(data_pq))
style = true

[colmetadata."Battery Electric Vehicle (BEV)"]

[colmetadata."Plug-in Hybrid Electric Vehicle (PHEV)"]

[colmetadata.model_year]
label = ["Model Year", "default"]
note = ["The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).", "default"]

[metadata]
caption = ["Electric Vehicle Population Data", "default"]
note = ["This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL).", "default"]

julia> println(meta2toml(data_pq_df))
style = true

[colmetadata.model_year]
label = ["Model Year", "default"]
note = ["The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).", "default"]

[metadata]
caption = ["Electric Vehicle Population Data", "default"]
note = ["This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL).", "default"]

We note that the metadata has :default style, because Parquet does not
support metadata styles. Therefore, if we wanted to turn all metadata to :note
style, so that it would be propagated we can write e.g.:

julia> setallmetadatastyle!(data_pq_df);

julia> print(meta2toml(data_pq_df))
style = true

[colmetadata.model_year]
label = ["Model Year", "note"]
note = ["The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).", "note"]

[metadata]
caption = ["Electric Vehicle Population Data", "note"]
note = ["This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL).", "note"]

What features does TableMetadataTools.jl provide?

Here is the list of functionalities that you can find in
TableMetadataTools.jl that are added on top of standard metadata
management functions defined in DataAPI.jl:

  • the label, label!, labels, and findlabels functions for
    convenient work with column label metadata;
  • the caption and caption! functions for convenient work with
    table caption metadata;
  • note and note! functions for convenient work with note metadata both
    on table and column level;
  • unit (re-exported from Unitful.jl), unit!, and units functions for
    convenient work with column unit metadata;
  • setmetadatastyle!, setcolmetadatastyle!, setallmetadatastyle! for group
    setting style for keys matching a passed pattern; usually needed when working
    with metadata that initially has :default style set and one wants it to have
    :note style (common when reading metadata from storage formats that do not
    support metadata style information);
  • meta2toml and toml2meta! for storing and loading metadata in TOML format;
  • dict2metadata!, dict2colmetadata! for setting table and column level
    metadata stored in a dictionary (e.g. earlier retrieved from some storage
    format or by using metadata or colmetadata functions);
  • the @track macro for tracking operations applied to table and the tracklog
    function for convenient printing of tracking metadata.

Conclusions

I hope you will find this post useful as a tutorial to both working with
metadata and ingesting data from a complex JSON file into a data frame.

My general feeling is that, while metadata is commonly used by Stata or SAS
users, it had less adoption in other ecosystems. The reason is likely that
there was no decent support for storing and working with tabular metadata.
Now, I hope the combination of DataFrames.jl + Parquet2.jl helps to solve
this issue.

Admittedly in many cases you do not need table metadata. However, I am convinced
that if you work with tables that have hundreds or thousands of columns or that
you want to share your tables between teams having proper metadata attached
to data is essential.

Asynchronous Workflow Using Julia Tutorial

By: Jacob Zelko

Re-posted from: https://jacobzelko.com/01082023043553-julia-async-workflow/index.html

Asynchronous Workflow Using Julia Tutorial

Date: January 7 2023

Summary: A thorough tutorial on how to use Julia for asynchronous workflows to do many things, at once!

Keywords: #blog #archive #julia #programming #literate #workflow #asynchronous #weave #browser #sync #node

Bibliography

Not Available

Table of Contents

    1. Motivation
    2. Set-Up
    3. Simple Workflow Process
      1. Blocking Workflow
      2. Asynchronous Workflow
    4. Asynchronous Workflow for Literate Programming
    5. Conclusion
  1. References
  2. References
  3. Discussion:

Motivation

I sometimes find myself wanting to execute a Julia file I have just edited. This is different than what the excellent Revise.jl package (created by Tim Holy) does in that I want to execute the full file and not just update function definitions. Moreover, I may just want Julia to do something else entirely upon updating a file or directory. This can be very helpful in analysis workflows as well as literate programming. Creating an extended post on my process for this was inspired by a Julia Discourse post I provided an answer to.

Set-Up

Here is how I recommend setting up your Julia environment to follow this tutorial (I assume you have Julia installed correctly):

  1. Create an empty directory (folder) somewhere on your computer. (NOTE: For sake of this tutorial, I will refer to it as the "test" folder).

  2. Within your favorite text editor (like Notepad, Vim, VSCode, etc.), open this directory.

  3. Open and create a file called "tmp.jmd" (NOTE: Keep the file and your editor open as we will be coming back to this file quite a lot).

  4. Open your Julia REPL within the directory and activate a temporary environment within package mode:

pkg> activate --temp
  1. Add into this temporary environment the following packages:

pkg> add FileWatching, Weave
  1. Install node onto your machine.

I suggest using the great tool, nvm, which allows easy management of node versions.

  1. Install the node package, browser-sync, which allows for syncing of webpages based on updates.

Here is the command you can use in your command line:

npm i -g browser-sync

NOTE: Steps 6 and 7 are not necessary if you are not interested in the section on literate programming

Simple Workflow Process

There are two ways to start this workflow. One blocking (i.e. you cannot use the Julia REPL while working) and another spawning an asynchronous background process to enable you to work with the same REPL. Here is gif showing these workflows in action:

Blocking Workflow

This creates a synchronous task in your Julia REPL that can then run whatever command you want while you monitor a specific file (or folder). In this case, I am making the task print a statement that says “Change Detected!” to my REPL whenever there is a change in the file called “tmp.jmd”.

using FileWatchingwhile true 
    watch_file("./tmp.jmd")
    println("Change Detected!")
end

This workflow is not optimal as this loop continues and locks your REPL from doing anything else.

Asynchronous Workflow

This creates an asynchronous task in the background of your Julia REPL that can then run whatever command you want while you monitor a specific file (or folder). In this case, I am making the task print a statement that says “Change Detected!” to my REPL whenever there is a change in the file called “tmp.jmd”.

using FileWatching@async while true 
    watch_file("./tmp.jmd")
    println("Change Detected!")
end

From here, you could then still interact with this REPL while this task is running in the background. Furthermore, you can change the line that says println("Change Detected!") to whatever command (or commands) you want to trigger.

Asynchronous Workflow for Literate Programming

This workflow allows one to work with a Julia Markdown document with Weave.jl to preview your work in real time. There a few steps to start the workflow but here is how it looks when in operation:

First, execute the following snippet in your Julia REPL to start the asynchronous process to watch for changes and tell Julia to weave the file we will modify:

using FileWatching
using Weave@async while true 
    watch_file("./tmp.jmd")
    weave("./tmp.jmd")
end

Next, run this in a separate terminal:

browser-sync start --server --start --file "./*.html"

or this command within your Julia REPL (NOTE: This will make your REPL session very cluttered but you can still use your REPL):

@async run(`browser-sync start --server --start --file "./*.html"`)

At this point, you should see browser-sync spit out a lot of information that looks something like this:

[Browsersync] Access URLs:
 --------------------------------------
       Local: http://localhost:3000
    External: http://192.168.1.186:3000
 --------------------------------------
          UI: http://localhost:3001
 UI External: http://localhost:3001
 --------------------------------------
[Browsersync] Serving files from: ./

You'll want to grab that URL that looks like http://localhost:3000 in the Local spot and then open that in your web browser. With the URL that you get (it could be on a different port), navigate to http://localhost:3000/tmp.html.

Finally, when you have this all in place, go ahead and add this code block into the tmp.jmd file that was created:

# Hello World This is a Julia Markdown file. ```julia 
2 + 2
```

You should see a few things happen. You'll see a few messages from the Julia REPL stating that Weave.jl is weaving your document to your desired output. Then, you'll see browser-sync say something about updating or syncing. Finally, you will see your browser update to the latest version of your weaved document.

Feel free to play around with this more and see the dynamism of the workflow! I tend to do this when I am iteratively developing reports within Julia and want to tinker within the REPL at the same time I am creating documents. Having to wait for rendering Weave.jl documents and the like was a pain and this took the pain away.

Conclusion

I hope you appreciated this post! If you have any suggestions, comments, or additional workflows this could be used for, please comment below. May your Julia sessions now be even more supercharged!

References

References

Discussion:

‘Introduction to Julia for Data Science’ course at MIT

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/01/06/mit.html

Introduction

From January 17 to 20, 2023 my friends and I are going to give a short
course on introduction to Julia for data scientists at MIT.
The course is open so everyone is invited to join us. You can find the
detailed schedule and location information here.

We plan to cover various areas of data science tools and techniques including
predictive models, optimization, and simulation. One of the modules in the
course covers mining complex networks. Interestingly, some of the techniques
used there are similar to the methods I discussed last week in my
Flea circus for the New Year celebration post.

Therefore today I thought to present an example related to the material we are
planning to teach during the course that is covers analysis of Markov
models.

The post was written under Julia 1.8.4. As last week, I want to do the whole
analysis just using the functionality available in Base Julia.

The problem

The question I want to answer today is the following.

Assume we have an undirected graph and have an agent traveling on it. We assume
that the graph is connected, i.e. that every node is reachable from every other
node. When agent is in some node it moves to one of the nodes that is connected
by an edge with the source node. If a node is an end of multiple edges then one
of them is picked uniformly at random.

Now we are ready to formulate the question:

In the long run how often each node in the analyzed graph is going to be
visited?

We are going to try to solve this problem numerically.

The data

In order to concentrate on some concrete graph let us analyze the GitHub social
network graph, whose description can be found here.

First download the musae_git_edges.csv to your working directory.

Start with inspecting its contents:

julia> raw = collect(eachline("musae_git_edges.csv"))
289004-element Vector{String}:
 "id_1,id_2"
 "0,23977"
 "1,34526"
 "1,2370"
 "1,14683"
 "1,29982"
 "1,21142"
 ⋮
 "37519,37678"
 "19093,2347"
 "37527,37596"
 "37529,37601"
 "37644,2347"
 "25879,2347"
 "25616,2347"

We see that every line, except the first, contains information about ends of
edges in our graph. Let us first parse it to numbers:

julia> edges = [parse.(Int, line) for line in split.(raw[2:end], ',')]
289003-element Vector{Vector{Int64}}:
 [0, 23977]
 [1, 34526]
 [1, 2370]
 [1, 14683]
 [1, 29982]
 [1, 21142]
 [1, 20363]
 ⋮
 [37519, 37678]
 [19093, 2347]
 [37527, 37596]
 [37529, 37601]
 [37644, 2347]
 [25879, 2347]
 [25616, 2347]

Next we check the minimum and maximum edge number in the data:

julia> extrema(Iterators.flatten(edges))
(0, 37699)

We see that we have 37700 nodes that are numbered using 0-based indexing.

Let us now create adjacency matrix am of our graph.
We will put 1 in cell am[i, j] if there is an edge between node i and j.
Otherwise we will leave am[i, j] to be 0.
In the process of creation of am matrix, we will fix node numbers to start
with 1:

julia> using SparseArrays

julia> from, to = getindex.(edges, 1) .+ 1, getindex.(edges, 2) .+ 1;

julia> am = sparse([from; to], [to; from], fill(1.0, 2 * length(edges)));

Checking if graph is connected

Let us check if the graph is connected using breadth first search
(this function is a bit slow; during the course at MIT you will see how this
process can be implemented faster using Graphs.jl):

julia> function allconnected(am)
           seen = falses(size(am, 1))
           to_visit = [1]
           seen[1] = true
           while !isempty(to_visit)
               i = popfirst!(to_visit)
               new_neighbors = [j for j in findnz(am[i, :])[1] if !seen[j]]
               seen[new_neighbors] .= true
               append!(to_visit, new_neighbors)
           end
           return all(seen)
       end
allconnected (generic function with 1 method)

julia> allconnected(am)
true

Indeed the graph is connected.

Creating a transition matrix

Using the am matrix it is easy to compute the transition matrix for our
process. Let an entry tm[i, j] in the tm matrix be the probability that
agent moves from node i to node j in one step. Clearly this probability
is 1 / deg[i] where deg[i] is degree of node i (that is its number of
neighbors.

Let us compute the tm matrix:

julia> deg = sum(am, dims=2);

julia> invdeg = 1 ./ deg;

julia> tm = am .* invdeg;

Let us check that indeed the probabilities in rows of tm matrix add up to 1:

julia> sum(tm, dims=2)
37700×1 Matrix{Float64}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 0.9999999999999999
 1.0
 1.0
 0.9999999999999998
 ⋮
 0.9999999999999974
 1.0
 1.0000000000000007
 1.0
 1.0000000000000002
 1.0
 1.0
 1.0
 1.0
 1.0

Computing stationary probability of visiting a node

We are almost done computing the long-run probability of visiting each node in
the analyzed graph.

We can start with any distribution of probabilities of visiting nodes. Assume,
for example, that it is uniform:

julia> p = fill(1 / 37700, 1, 37700)
1×37700 Matrix{Float64}:
 2.65252e-5  2.65252e-5  2.65252e-5  …  2.65252e-5  2.65252e-5

From stochastic matrix theory we know that p * tm is a distribution of
location of the agent after one step:

julia> p * tm
1×37700 Matrix{Float64}:
 8.28912e-7  2.8377e-5  4.65354e-7  …  1.54045e-5  7.46794e-7

Now let us repeat this process many times until the vector p stabilizes:

julia> function stationary(tm; eps=sqrt(eps()))
           p = fill(1 / 37700, 1, 37700)
           while true
               newp = p * tm
               if sum(abs(o - n) for (o, n) in zip(p, newp)) < eps
                   return newp
               end
               p = newp
           end
       end
stationary (generic function with 1 method)

julia> p = stationary(tm)
1×37700 Matrix{Float64}:
 1.73009e-6  1.38407e-5  1.73009e-6  …  5.19026e-6  6.92034e-6

We know the stationary distribution of probability of visiting each node.
However, it would be interesting to understand it. As we will see it is
proportional to the degree of the node. Let us check it:

julia> extrema(vec(p) .- deg ./ sum(deg))
(-3.7003669919877247e-10, 9.529364854058532e-9)

Indeed, the deviation of p from the distribution given by the node degree is
low as promised.

We have learned that in the long run how often each node is going to be
visited with probability proportional to its degree. This property can be
verified analytically to hold for any undirected connected graph. I encourage
you to perform the required computations.

Conclusions

I hope you found the example interesting. During the short course at MIT this
and many other examples will be explained in detail and we will discuss which
Julia packages can be used to do data analysis conveniently. Still, I wanted
to highlight, that a very appealing feature of Julia is related to the fact
how far one can go using just its standard functionality, without having to
install any packages.