Category Archives: Julia

Strings vs symbols in DataFrames.jl column indexing

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/08/05/symbol.html

Introduction

In DataFrames.jl you can use both symbols and strings for column indexing. Which
to choose is one of the topics that new users ask about most frequently. In this
post I will explain why both options are supported and what the difference
between them is. Note that this is an entry-level post, so I will omit many
details of the discussed topic and focus on the most important aspects only.

The post was written under Julia 1.7.2, DataFrames.jl 1.3.4,
DataFramesMeta.jl 0.12.0, BenchmarkTools.jl 1.3.1.

What are strings and symbols?

In Julia a string allows users to store sequences of characters. The simplest
way to create a string is to write some text between double quotation marks:

julia> "an example string"
"an example string"

Symbols are objects used in Julia to create identifiers. You can think of them
as labels. Symbols are normally created by prefixing some label with : like
this:

julia> :label
:label

In this way you can only create symbols that are valid variable names.
So, for example, you cannot create a symbol that contains a space using the : prefix:

julia> :my label
ERROR: syntax: extra token "label" after end of expression

Instead, in such cases, you need to call Symbol passing it a string as
an argument:

julia> Symbol("my label")
Symbol("my label")

How are strings and symbols different?

To understand the difference between symbols and strings it is easiest to
think of them as follows:

  • symbols are labels;
  • strings are sequences of characters.

So symbols are indivisible – they are always considered as a whole,
while strings consist of multiple characters. The most important consequences of
this distinction are the following:

  • symbols are faster than strings when you compare them for equality using ==;
  • you can manipulate strings (e.g. uppercase, chop, perform substring matching etc.)
    while no such operations are supported for symbols.

Let us have a look at these two characteristics by example. First we check
comparison speed. We create 1000-element vectors with unique values and compare
all pairs of their entries, so we make 1 million comparisons and expect 1000
matches.

julia> using BenchmarkTools

julia> string_vec = string.("s", 1:1000)
1000-element Vector{String}:
 "s1"
 "s2"
 "s3"
 "s4"
 ⋮
 "s997"
 "s998"
 "s999"
 "s1000"

julia> symbol_vec = Symbol.("s", 1:1000)
1000-element Vector{Symbol}:
 :s1
 :s2
 :s3
 :s4
 ⋮
 :s997
 :s998
 :s999
 :s1000

julia> test_cmp(v) = count(x == y for x in v, y in v)
test_cmp (generic function with 1 method)

julia> @btime test_cmp($string_vec)
  3.038 ms (0 allocations: 0 bytes)
1000

julia> @btime test_cmp($symbol_vec)
  635.400 μs (0 allocations: 0 bytes)
1000

Indeed symbol comparison is faster.

Now let us look at manipulation:

julia> str = "example"
"example"

julia> uppercase(str)
"EXAMPLE"

julia> chop(str)
"exampl"

julia> match(r"ex", str)
RegexMatch("ex")

julia> sym = :example
:example

julia> uppercase(sym)
ERROR: MethodError: no method matching uppercase(::Symbol)

julia> chop(sym)
ERROR: MethodError: no method matching chop(::Symbol)

julia> match(r"ex", sym)
ERROR: MethodError: no method matching match(::Regex, ::Symbol)

So in summary we could conclude that:

  • one can use a symbol if the value stored in it is not manipulated
    (i.e. is treated as a label); symbols are faster in comparisons than strings
    and a bit easier to type (only the : prefix is needed) provided that they do
    not contain characters like spaces (in which case they are not convenient
    to type);
  • strings support manipulation as opposed to symbols; the cost is that
    comparing them is slower than comparing symbols.
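
If a label stored as a symbol does need to be modified, one option is to convert it to a string, transform it, and convert it back. Here is a minimal sketch using only Base Julia; the helper name upper_symbol is made up for this illustration.

```julia
# Hypothetical helper (not part of Julia or DataFrames.jl):
# round-trip through String to transform a symbol.
upper_symbol(s::Symbol) = Symbol(uppercase(String(s)))

sym = :example
str = String(sym)             # the label as a string
new_sym = upper_symbol(sym)   # a new symbol built from the transformed string

# The same text always yields the same symbol, however it was constructed.
same = Symbol("example") === :example
```

The round trip allocates a string, so it makes sense for occasional renaming rather than inside a tight loop.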

Let us now discuss how these considerations translate to the DataFrames.jl realm.

Strings vs symbols in DataFrames.jl

Column names in a DataFrame are labels. For this reason both symbols and
strings are allowed to be used when referencing them without introducing
an ambiguity. Here is an example. We start with strings:

julia> using DataFrames

julia> df = DataFrame("col1" => 1, "col 2" => 2)
1×2 DataFrame
 Row │ col1   col 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df."col1"
1-element Vector{Int64}:
 1

julia> df."col 2"
1-element Vector{Int64}:
 2

julia> df[:, "col1"]
1-element Vector{Int64}:
 1

julia> df[:, "col 2"]
1-element Vector{Int64}:
 2

Now we try the same with symbols:

julia> df = DataFrame(:col1 => 1, Symbol("col 2") => 2)
1×2 DataFrame
 Row │ col1   col 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df.col1
1-element Vector{Int64}:
 1

julia> getproperty(df, Symbol("col 2"))
1-element Vector{Int64}:
 2

julia> df[:, :col1]
1-element Vector{Int64}:
 1

julia> df[:, Symbol("col 2")]
1-element Vector{Int64}:
 2

We now see the first difference, which we have already discussed. If all column
names are valid variable names, symbols are more convenient. However,
if they are not (e.g. they contain spaces), then using strings is more convenient.
As an extreme case, note that the convenience syntax for getproperty using
the . accessor does not work for symbols containing spaces and we need to make
an explicit getproperty call.

The second important aspect is that all functions that manipulate column
names in DataFrames.jl work with strings. This is natural, as symbol
manipulation is not supported by Julia. Here is a combo showing this in action:

julia> select(df, Cols(startswith("c")) .=> identity .=> uppercase)
1×2 DataFrame
 Row │ COL1   COL 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

The Cols(startswith("c")) .=> identity .=> uppercase operation specification
syntax means that we want to pick all columns whose name starts with "c"
(note that the startswith function expects a string as an input), keep them
unchanged (the identity function) and uppercase their names in the output
(note that uppercase expects a string as an input).
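
The two string-based steps can also be tried separately. The sketch below uses a small data frame with an extra column x, added here only so that the predicate has something to exclude. It relies on names(df, predicate) and rename(f, df) as I understand them from DataFrames.jl 1.x, so check the documentation for your version.

```julia
using DataFrames

# Illustrative data; the "x" column is an addition for this sketch.
df = DataFrame("col1" => 1, "col 2" => 2, "x" => 3)

# The predicate receives each column name as a String.
c_cols = names(df, startswith("c"))

# rename applies the function to every column name, again passed as a String.
df_upper = rename(uppercase, df)
upper_names = names(df_upper)
```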

Finally, you might ask about comparison of speed of column lookup using strings
vs symbols. Here is a simple test:

julia> @btime $df.col1
  7.500 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
 1

julia> @btime $df."col1"
  38.446 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
 1

As you can see there is a noticeable performance difference. However, please
note that both these operations are very fast. Therefore, in practice,
column lookup is almost never a performance bottleneck in operations on
data frames (usually what you do with the column picked from a data frame
is more expensive by several orders of magnitude). So a practical recommendation
is that performance should not be a reason for choosing symbols over strings
most of the time.

If you really need speed then column lookup using an integer index is fastest:

julia> @btime $df[!, 1]
  4.100 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
 1

However, this way of picking columns is not recommended and you should use it
only if you are sure what column is stored under a given number in a data frame.
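
If integer indexing is needed in a hot loop, one hedged option is to resolve the column name to its position once and reuse that position, instead of hard-coding the number. This sketch assumes the columnindex function from DataFrames.jl, which returns 0 when the column does not exist.

```julia
using DataFrames

df = DataFrame("col1" => 1, "col 2" => 2)

# Look the position up by name once; a string or a symbol both work.
idx = columnindex(df, "col 2")
idx == 0 && error("column not found")

# Same vector as df[!, "col 2"], returned without copying.
col = df[!, idx]
```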

Additional practical considerations of using strings and symbols in DataFrames.jl

The first tip is that you can get a list of column names of a data frame as
strings and as symbols in DataFrames.jl using the names and propertynames
functions respectively:

julia> names(df)
2-element Vector{String}:
 "col1"
 "col 2"

julia> propertynames(df)
2-element Vector{Symbol}:
 :col1
 Symbol("col 2")
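
As a small sketch of how the two lists relate: they hold the same names in the same order, so broadcasting a conversion over one reproduces the other.

```julia
using DataFrames

df = DataFrame("col1" => 1, "col 2" => 2)

# Convert each list into the other representation element by element.
as_symbols = Symbol.(names(df))
as_strings = String.(propertynames(df))
```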

The second important consideration is that in DataFramesMeta.jl only symbols are
considered to be column identifiers in operations by default.
Therefore you can write:

julia> using DataFramesMeta

julia> @rselect(df, :out = :col1 + 1)
1×1 DataFrame
 Row │ out
     │ Int64
─────┼───────
   1 │     2

If you want to use strings instead you have to escape them with $:

julia> @rselect(df, $"out" = $"col1" + 1)
1×1 DataFrame
 Row │ out
     │ Int64
─────┼───────
   1 │     2

Conclusions

The post today was long, but the conclusion is simple. In DataFrames.jl
you can use both symbols and strings to get access to a column of a data frame.
The main consideration when picking one or the other should be your
convenience.

Optimising FPL with Julia and JuMP

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2022/08/05/FPL-Optimising.html

One of my talks for JuliaCon 2022 explored the use of JuMP to optimise a Fantasy Premier League (FPL) team. You can watch my presentation here: Optimising Fantasy Football with JuMP. This blog post is an accompaniment and extension to that talk. I’ve used FPL Review’s free expected points model and their tools to generate the team images, so go check them out.


Now last season was my first time playing this game. I started with an analytical approach but didn’t write the team optimising routines until later in the season, by which time it was too late to make too much of a difference. I finished at 353k, so not too bad for a first attempt, but quite a way off from that 100k “good player” milestone. I won’t be starting a YouTube channel for FPL anytime soon.

Still, a new season awaits, and with more knowledge, a hand-crafted optimiser, and some expected points, let’s see if I can do any better.

A Quick Overview of FPL

FPL is a fantasy football game where you need to choose a team of 15 players that consists of:

  • 2 goalkeepers
  • 5 defenders
  • 5 midfielders
  • 3 forwards

Then from these 15 players, you choose a team of 11 each week that must conform to:

  • 1 goalkeeper
  • Between 3 and 5 defenders
  • Between 2 and 5 midfielders
  • Between 1 and 3 forwards

You have a budget of £100 million and you can have at most 3 players from a given team. So no more than 3 Liverpool players etc.
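
The rules above can be written down as a pair of checks. This is a minimal sketch in plain Julia, not code from my optimiser: the function names, the position codes "G", "D", "M", "F" and the choice to express costs in £ millions are all assumptions made for the illustration.

```julia
# Hypothetical helper: count how often each value occurs.
count_by(xs) = Dict(x => count(==(x), xs) for x in unique(xs))

# A 15-player squad: 2 G, 5 D, 5 M, 3 F, within budget, at most 3 per club.
function valid_squad(pos, team, cost; budget=100.0, max_per_team=3)
    c = count_by(pos)
    required = Dict("G" => 2, "D" => 5, "M" => 5, "F" => 3)
    length(pos) == 15 &&
        all(get(c, p, 0) == n for (p, n) in required) &&
        sum(cost) <= budget &&
        maximum(values(count_by(team))) <= max_per_team
end

# A weekly starting 11: 1 G, 3 to 5 D, 2 to 5 M, 1 to 3 F.
function valid_lineup(pos)
    c = count_by(pos)
    length(pos) == 11 &&
        get(c, "G", 0) == 1 &&
        3 <= get(c, "D", 0) <= 5 &&
        2 <= get(c, "M", 0) <= 5 &&
        1 <= get(c, "F", 0) <= 3
end

# Made-up example data: five clubs with three players each, all costing 6.5.
pos  = vcat(fill("G", 2), fill("D", 5), fill("M", 5), fill("F", 3))
team = repeat(["A", "B", "C", "D", "E"], 3)
cost = fill(6.5, 15)

squad_ok  = valid_squad(pos, team, cost)
lineup_ok = valid_lineup(vcat("G", fill("D", 4), fill("M", 4), fill("F", 2)))
```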

You then score points based on how many goals a player scores, how many assists, and other ways. Each week you can transfer one player out of your squad of 15 for a new player.

That’s the long and short of it: you want to score the most points each week while being forward-looking, so that you are set up to get the most points in future weeks too.

A Quick Overview of JuMP

JuMP is an optimisation library for Julia. You write out your problem in the JuMP language, supply an optimiser and let it work its magic. For a detailed explanation of how you can solve the FPL problem in JuMP I recommend you watch my JuliaCon talk, Optimising Fantasy Football with JuMP.

But in short, we want to maximise the number of points based on the above constraints while sticking to the overall budget. The code is easy to interpret and there is just the odd bit of code massage to make it do what we want.
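
To give a flavour of what such a model can look like, here is a minimal sketch of the squad selection problem. It is not the squad_selector function from my file: the name pick_squad, the choice of the HiGHS solver, positions coded as integers 1 to 4, and costs in tenths of £ millions (so the budget is 1000) are all assumptions made for this illustration.

```julia
using JuMP, HiGHS

# Hypothetical function: choose the 15-player squad with the most expected points.
function pick_squad(points, cost, pos, team; budget=1000)
    n = length(points)
    model = Model(HiGHS.Optimizer)
    set_silent(model)

    # x[i] = 1 if player i is in the squad, 0 otherwise.
    @variable(model, x[1:n], Bin)
    @objective(model, Max, sum(points[i] * x[i] for i in 1:n))

    # Stay within the overall budget.
    @constraint(model, sum(cost[i] * x[i] for i in 1:n) <= budget)

    # Squad composition: 2 goalkeepers, 5 defenders, 5 midfielders, 3 forwards.
    for (p, k) in zip(1:4, (2, 5, 5, 3))
        @constraint(model, sum(x[i] for i in 1:n if pos[i] == p) == k)
    end

    # At most 3 players from any one club.
    for t in unique(team)
        @constraint(model, sum(x[i] for i in 1:n if team[i] == t) <= 3)
    end

    optimize!(model)
    termination_status(model) == MOI.OPTIMAL || error("no optimal squad found")
    return findall(>(0.5), value.(x))   # indices of the selected players
end
```

The function returns the indices of the selected players, so something like df[pick_squad(...), :] would show the chosen rows. Picking the weekly starting 11 and weighting the bench need extra variables and are left out of this sketch.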

All my optimising functions are in the file below, which will be hosted on GitHub shortly so you can keep up to date with my tweaks.

include("team_optim_functions.jl")

FPL Review Expected Points

To start, we need some indication of each player’s ability. This is an expected points model and will take into account the player’s position, form, and overall ability to score FPL points. Rather than build my own expected points model, I’m going to be using FPL Review’s numbers. They are a very popular site for this type of data, and the amount of time I would have to invest to come up with a better model would not be worth the effort. Plus, I feel that the amount of variance in FPL points makes it a tough job anyway, so it’s better to crowdsource the effort and use other people’s results.

That being said, once you’ve set your team, there might be some edge in interpreting the statistics. But that’s a problem for another day.

FPL Review is nice enough to make their free model available as a downloadable CSV, so you can head there, download the file and pull it into Julia.

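# Packages used below (harmless if team_optim_functions.jl already loads them);
# recode comes from CategoricalArrays
using CSV, DataFrames, DataFramesMeta, CategoricalArrays

df = CSV.read("fplreview_1658563959", DataFrame)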

To verify the numbers they have produced we can look at the total number of points each team is expected to score over the 5 game weeks they provide.

# Team totals for the first two game weeks and for all five, best teams first
sort(@combine(groupby(df, :Team),
       :TotalPoints_1_2 = sum(cols(Symbol("1_Pts")) .+ cols(Symbol("2_Pts"))),
       :TotalPointsAll = sum(cols(Symbol("1_Pts")) .+ cols(Symbol("2_Pts")) .+
                             cols(Symbol("3_Pts")) .+ cols(Symbol("4_Pts")) .+
                             cols(Symbol("5_Pts")))
        ), :TotalPointsAll, rev=true)

20 rows × 3 columns

 Row │ Team            TotalPoints_1_2  TotalPointsAll
     │ String15        Float64          Float64
─────┼─────────────────────────────────────────────────
   1 │ Man City        117.63           296.58
   2 │ Liverpool       115.52           284.36
   3 │ Chelsea         94.37            243.1
   4 │ Arsenal         90.38            241.0
   5 │ Spurs           90.85            237.81
   6 │ Man Utd         92.77            215.76
   7 │ Brighton        79.35            205.93
   8 │ Wolves          87.02            203.23
   9 │ Aston Villa     88.65            202.32
  10 │ Brentford       74.75            199.51
  11 │ Leicester       77.99            197.34
  12 │ West Ham        76.19            197.33
  13 │ Leeds           80.61            194.82
  14 │ Newcastle       89.81            190.98
  15 │ Everton         68.46            189.73
  16 │ Crystal Palace  64.95            180.66
  17 │ Southampton     72.23            176.97
  18 │ Fulham          63.02            172.57
  19 │ Bournemouth     62.31            161.9
  20 │ Nott'm Forest   69.65            161.05

This looks pretty sensible: Man City and Liverpool are up at the top and the newly promoted teams are at the bottom. So it looks like FPL Review know what they are doing.

With that done, let’s move on to optimising. I have to take the dataframe and prepare the inputs for my optimising functions.

expPoints1 = df[!, "1_Pts"]
expPoints2 = df[!, "2_Pts"]
expPoints3 = df[!, "3_Pts"]
expPoints4 = df[!, "4_Pts"]
expPoints5 = df[!, "5_Pts"]

cost = df.BV*10
position = df.Pos
team = df.Team

#currentSquad = rawData.Squad

posInt = recode(position, "M" => 3, "G" => 1, "F" => 4, "D" => 2)
df[!, "PosInt"] = posInt
df[!, "TotalExpPoints"] = expPoints1 + expPoints2 + expPoints3 + expPoints4 + expPoints5
teamDict = Dict(zip(sort(unique(team)), 1:20))
teamInt = get.([teamDict], team, NaN);

I have to multiply the buy values (BV) by 10 to get the values in the same units as my optimising code.

The Set and Forget Team

In this scenario, we add up all the expected points for the five game weeks and run the optimiser to select the highest scoring team over the 5 weeks. No transfers and we set the bench-weighting to 0.5.

# Best set and forget
modelF, resF = squad_selector(expPoints1 + expPoints2 + expPoints3 + expPoints4 + expPoints5, 
    cost, posInt, teamInt, 0.5, false)

Set and forget

It’s a pretty strong-looking team. Big at the back with all the premium defenders which is a slight danger as one conceded goal by either Liverpool or Man City could spell disaster for your rank. Plus no Salah is a bold move.

To add some human input, we can look at the other £5 million defenders to assess who to swap Walker with.

first(sort(@subset(df[!, [:Name, :Team, :Pos, :BV, :TotalExpPoints]], :BV .<= 5.0, :Pos .== "D", :Team .!= "Arsenal"), 
     :TotalExpPoints, rev=true), 5)

 Row │ Name      Team         Pos      BV       TotalExpPoints
     │ String31  String15     String1  Float64  Float64
─────┼─────────────────────────────────────────────────────────
   1 │ Walker    Man City     D        5.0      16.53
   2 │ Digne     Aston Villa  D        5.0      15.25
   3 │ Doherty   Spurs        D        5.0      15.18
   4 │ Romero    Spurs        D        5.0      15.03
   5 │ Dunk      Brighton     D        4.5      14.74

So Doherty or Digne seems like a decent shout. This just goes to show though that you can’t blindly follow the optimiser and you can add some alpha by tweaking as you see fit.

Update After Two Game Weeks

What about if we now allow transfers? We will optimise for the first two game weeks and then see how many transfers are needed afterward to maximise the number of points.

model, res1 = squad_selector(expPoints1 + expPoints2, cost, posInt, teamInt, 0.5)
currentSquad = zeros(nrow(df))
currentSquad[res1[1]["Squad"] .> 0.5] .= 1

res = Array{Dict{String, Any}}(undef, length(0:5))

expPoints = zeros(length(0:5))

for (i, t) in enumerate(0:5)
    model, res3 = transfer_test(expPoints3 + expPoints4 + expPoints5, cost, posInt, teamInt, 0.5, currentSquad, t, true)
    res[i] = res3[1]
    expPoints[i] = res3[1]["ExpPoints"]
end

Checking the expected points of the teams and adjusting for any transfers after the first two free ones gives us:

expPoints .- [0,0,0,1,2,3]*4
6-element Vector{Float64}:
 162.385
 164.295
 167.987
 165.767
 164.084
 161.726
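
The adjustment is simple arithmetic: the first two transfers are free and each further one costs 4 points. As a small sketch, with a made-up helper name and the raw optimiser totals back-computed from the adjusted values printed above:

```julia
# Hypothetical helper: points deducted for making t transfers.
transfer_penalty(t; free=2, hit=4) = hit * max(t - free, 0)

# Raw expected points for 0 to 5 transfers, back-computed from the adjusted
# output above by adding the penalty back on.
raw = [162.385, 164.295, 167.987, 169.767, 172.084, 173.726]
transfers = 0:5

adjusted = raw .- transfer_penalty.(transfers)

# Number of transfers with the highest net expected points.
best = transfers[argmax(adjusted)]
```

Writing the penalty as a function makes it easy to rerun the comparison with a different number of free transfers.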

So making two transfers improves our score by over 5 points, which seems worth it. If we go beyond two transfers, we pay a 4 point penalty for each extra transfer and the adjusted totals fall, so it does not seem worth it.

Update after 2 GWs

So Botman and Watkins are switched out for Gabriel and Toney. Again, not a bad-looking team, and making these transfers improves the expected points by 5.

Shortcomings

The FPL community can be split into two camps: those who think data helps and those who think watching the games and the players helps. So what are the major issues with these teams?

Firstly, Spurs players are a glaring omission from all of the results. Given their strong finish to last season and the high expectations coming into this one, this is potentially a problem.

Things can change very quickly. After the first week, we will have some information on how different players are looking and by that time these teams could be very wrong with little flexibility to change them to adjust to the new information. I am reminded of last year where Luke Shaw was a hot pick in lots of initial teams and look how that turned out.

How off-meta these teams are. It’s hard to judge what the current template team is going to be at these early stages in the pre-season, but if you aren’t accounting for who other people will be owning you can find yourself being left behind all for the sake of being contrarian. For example, this team has put lots of money into goalkeepers when you could potentially spend that elsewhere.

Some of the players in the teams listed might not get that many minutes. Especially for the cheaper players, I could be selecting fringe players rather than the reliable starters for the lower teams. Again, similar to the last point, there are ‘enablers’ that the wider community believes to be the most reliable at the lower price points.

And finally variance. FPL is a game of variance. Haaland is projected to score 7 points in his first match, which is the equivalent of playing the full 90 minutes and getting a goal or an assist. He could quite easily score only 1 point after not starting and coming on for the last 10 minutes, and you are then panicking about the future game weeks. Relying on these optimised teams can sometimes mean you forget about the variance and how easy it is for a player to not get close to the number of points they are predicted.

Conclusion and What Next

Overall, using the optimiser helps reduce the manual process of working out if there is a better player at each price point. Instead, you can use it to inspire some teams and build on them from there, adjusting accordingly. There are still some tweaks that I can build into the optimiser, such as making sure it doesn’t overload the defence with players from the same team and seeing if I can simulate week-by-week what the optimal transfer, if there is one, should be.

I also want to try and make this a bit more interactive so I’m less reliant on the notebooks and have something more production-ready that other people can play with.

Also, given we get a free wildcard over Christmas, I can do a mid-season review and essentially start again! So check back here in a few months’ time.

My PhD in a Nutshell Part I: Motivation

By: julia on MAXIMILIAN KOEHLER

Re-posted from: https://www.maximiliankoehler.de/posts/phd-1/

In this post I’ll introduce one open research question in computational mechanics as well as one research direction that tries to resolve the problem. All figures are produced by a small Julia code that resides inside a Pluto.jl notebook. The link to the notebook file can be found here or at the very bottom of this post.
My PhD Topic is a research project of the Chair of Mechanics – Continuum Mechanics, Ruhr-University Bochum in collaboration with Institute of Mathematics, University of Augsburg under the hood of the SPP 2256.