By: Ole Kröger
Re-posted from: https://opensourc.es/blog/2022-05-28-tspsolver.jl-2-opt/index.html
Using 2-opt to find a better upper bound
Re-posted from: https://bkamins.github.io/julialang/2022/05/27/strings.html
Recently chapters 5 and 6 of my Julia for Data Analysis book were
published in MEAP. Finally we are getting to some more fun stuff like parametric
types or discussion of options for working with strings in Julia.
In the comments from readers who opted in for early access to the book,
I was asked how the performance of DataFrames.jl compares to pandas.
Therefore, today I decided to run a small benchmark.
The code was run under:
I want to:
In the tests, I want to compare the performance of the different ways in which
strings can be stored.
In the Julia tests, I check the following storage modes for strings:
Vector{String}, Vector{String15} (inline string), Vector{Symbol},
PooledVector{String}, and CategoricalVector{String}. Here is the benchmark:
julia> using DataFrames
julia> using Random
julia> using InlineStrings
julia> using BenchmarkTools
julia> using PooledArrays
julia> using CategoricalArrays
julia> Random.seed!(1234);
julia> df = transform!(DataFrame(str=[randstring() for _ in 1:10^6]),
                       :str .=>
                       [inlinestrings, ByRow(Symbol),
                        PooledArray, CategoricalArray] .=>
                       [:istr, :sym, :pstr, :cstr])
1000000×5 DataFrame
     Row │ str       istr      sym       pstr      cstr
         │ String    String15  Symbol    String    Cat…
─────────┼──────────────────────────────────────────────────
       1 │ KYDtLOxn  KYDtLOxn  KYDtLOxn  KYDtLOxn  KYDtLOxn
       2 │ UkZj0CRg  UkZj0CRg  UkZj0CRg  UkZj0CRg  UkZj0CRg
    ⋮    │    ⋮         ⋮         ⋮         ⋮         ⋮
  999999 │ CZL6fcG5  CZL6fcG5  CZL6fcG5  CZL6fcG5  CZL6fcG5
 1000000 │ 1MqvPpVb  1MqvPpVb  1MqvPpVb  1MqvPpVb  1MqvPpVb
                                        999996 rows omitted
julia> @belapsed combine(groupby($df, :str), nrow)
0.0486384
julia> @belapsed combine(groupby($df, :istr), nrow)
0.0461695
julia> @belapsed combine(groupby($df, :sym), nrow)
0.0422065
julia> @belapsed combine(groupby($df, :pstr), nrow)
0.0096023
julia> @belapsed combine(groupby($df, :cstr), nrow)
0.0304362
As you can see, PooledVector is by far the fastest. CategoricalVector comes
next, but it is significantly slower. The performance of String, String15,
and Symbol is comparable; String is the slowest and Symbol is the fastest.
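One way to build intuition for why the pooled and categorical columns group
faster: both keep each distinct string once in a pool and store a small integer
code per row, so grouping can work on integers computed when the column was
created. The sketch below illustrates that idea in plain Python. The names
`encode` and `count_by_code` are hypothetical helpers for illustration only,
and this is not how PooledArrays.jl or CategoricalArrays.jl are implemented.

```python
def encode(values):
    """Return (pool, codes): distinct values in first-seen order, one int code per element."""
    index = {}   # value -> integer code
    codes = []
    for v in values:
        # Assign the next free code the first time a value is seen.
        code = index.setdefault(v, len(index))
        codes.append(code)
    pool = list(index)  # dicts preserve insertion order
    return pool, codes


def count_by_code(pool, codes):
    """Count rows per group using only the integer codes."""
    counts = [0] * len(pool)
    for c in codes:
        counts[c] += 1  # plain list indexing, no string hashing or comparison
    return dict(zip(pool, counts))


values = ["a", "b", "a", "c", "b", "a"]
pool, codes = encode(values)
print(pool)                        # ['a', 'b', 'c']
print(codes)                       # [0, 1, 0, 2, 1, 0]
print(count_by_code(pool, codes))  # {'a': 3, 'b': 2, 'c': 1}
```

The encoding cost is paid once, when the column is built, so it does not show
up in the grouping timings.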
In Python, I compared a Series containing str values with a Categorical one.
Here is the code:
import pandas as pd
import random
import string
random.seed(1234)
s = [''.join(random.choice(string.ascii_letters) for _ in range(8)) for _ in range(10**6)]
cs = pd.Categorical(s)
df = pd.DataFrame({'str': s, 'cstr': cs})
%time df.groupby(['str']).size()
CPU times: total: 1.56 s
Wall time: 1.56 s
Out[8]:
str
AAAGINxY 1
AAAIavpP 1
AAASsiaU 1
AAAfnfqH 1
AAAxQZTv 1
..
zzzNaWBc 1
zzzTCZdT 1
zzzmmTcP 1
zzzoySQG 1
zzzwzAsA 1
Length: 1000000, dtype: int64
%time df.groupby(['cstr']).size()
CPU times: total: 109 ms
Wall time: 105 ms
Out[9]:
cstr
AAAGINxY 1
AAAIavpP 1
AAASsiaU 1
AAAfnfqH 1
AAAxQZTv 1
..
zzzNaWBc 1
zzzTCZdT 1
zzzmmTcP 1
zzzoySQG 1
zzzwzAsA 1
Length: 1000000, dtype: int64
As you can see, using Categorical is much faster. However, both options
are visibly slower than any of the Julia variants we have considered.
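Note that `%time` is an IPython magic that times a single run, while
`@belapsed` reports the minimum over many runs. Outside a notebook, a roughly
comparable "best of N" measurement could be taken with the standard library,
as in the sketch below. `best_time` is a hypothetical helper name, and the
workload is a stand-in so that the example runs without pandas.

```python
import timeit


def best_time(fn, repeat=5):
    """Return the fastest wall-clock time in seconds over `repeat` single calls to fn."""
    # number=1 times one call per repetition; min() keeps the best run.
    return min(timeit.repeat(fn, number=1, repeat=repeat))


# Stand-in workload. With pandas available, one could pass
# lambda: df.groupby(['str']).size() instead.
elapsed = best_time(lambda: sorted(range(100_000), reverse=True))
print(f"best of 5: {elapsed:.6f} s")
```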
First, I would like to thank all reviewers and readers of my book in MEAP. They
really help to improve it, and I greatly appreciate the feedback I receive.
Now let me turn to a comment on the benchmark results. I want to remark that
I usually avoid running such comparisons, as it is really hard to do them
comprehensively and objectively. Therefore, I recommend that you treat my
conclusions as general guidance inferred from the tests I have performed:
I hope you will find the presented benchmarks useful in your daily data
wrangling tasks.
Re-posted from: https://bkamins.github.io/julialang/2022/05/20/whyjulia.html
It has been two years since I started writing this blog. Therefore, I
thought of writing a more high-level post today and sharing with you
my thoughts on why I use Julia.
This post does not aim to give a comprehensive review of strengths and
weaknesses of Julia in general. Instead, I want to collect some notes on my
story of working with this language (and I think it is important that I am an
economist by training, as people with, e.g., a computer science background might
have different thoughts or expectations).
My starting point with Julia was around the time when I was finishing my
habilitation degree (this is roughly equivalent to getting tenure in the US).
At that time I was implementing a lot of agent-based simulations. Studying such
models usually requires quite a lot of computation. The reason is that not only
does a single simulation run take time, but you also usually need to run these
models many times under different parameterizations.
If you would like to learn more about how such models are designed and used, you
can have a look, for example, at this paper where you can find
a description of a model and experiments run using it and a link to a GitHub
repository with the Julia source code.
So what requirements did I have for a programming language? Here they are:
Julia fitted these requirements perfectly.
In short: I started using Julia because it solved an important problem I had
in my work.
Over the years I have started using Julia in numerous projects. Since most of
the problems I needed to solve involved some numerical computing, I rarely found
that Julia was missing important features. Sometimes Julia does lack some
functionality, but, fortunately, it is then really easy to integrate it with
C, Python, or R.
Another nice aspect of Julia is that it allowed me over the years to easily
collaborate with other non-programmers. You might be surprised by this
statement, but there are the following aspects of Julia that make it possible:
After using Julia for a while, I decided to start contributing to packages. This
is another area where I enjoy Julia a lot. In my early days the selling point
was that I found it extremely easy to learn how to create a package, register
and release it, set up CI on GitHub, and find advice on good coding style or
writing tests. The package I ended up contributing to most is DataFrames.jl. I
have found the community of people involved with this package extremely
welcoming, engaged, and knowledgeable. I have learned a lot from them and at the
same time I made a lot of friends.
This friendly social aspect of the Julia community is extremely important in the
long run. Please do not get me wrong: it does not mean that any PR or issue is
just accepted. I would say it is the opposite. Many times the discussions take
months of going back and forth on the design. The point is that if you feel
that the people reacting to your thoughts are welcoming, it is much easier to
get through all the obstacles you encounter along the way. In short: in the
Julia community, even though I have often felt like this, I have never felt
like this.
Do I think Julia is the one language to rule them all? Certainly not.
Different languages shine for different target audiences.
However, if you are looking for a language that:
you may consider giving Julia a try, like I have done.
Will you encounter problems or missing features along the way? For sure you will
(as you most likely would with any programming language). If you want to see a
record of the PRs and issues I opened for Julia and how they got resolved, you
can easily find it here.