Tag Archives: julialang

5 important talks you might have missed at JuliaCon 2022

JuliaCon 2022 wrapped up on Saturday July 30th with the annual virtual hackathon. As one of the conference organizers, the live conference…

Continue reading on Towards Data Science »

How to safely use the vec and reshape functions in Julia?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/08/12/vec.html

Introduction

Julia users often want to squeeze maximum performance out of their programs.
In the search for efficiency, they soon discover the vec and reshape
functions, which change the shape of an input array without
copying data. In this post I want to discuss how these functions work
and share with you the rules I use when deciding whether to use them.

The post was written under Julia 1.7.2.

The contract

When you first learn a function, you should look up its contract in its
docstring. Let us check vec and reshape (I abbreviated the docstrings to
focus on the key parts):

help?> vec
  vec(a::AbstractArray) -> AbstractVector

  Reshape the array a as a one-dimensional column vector.
  Return a if it is already an AbstractVector.
  The resulting array shares the same underlying data as a,
  so it will only be mutable if a is mutable,
  in which case modifying one will also modify the other.

help?> reshape
search: reshape promote_shape

  reshape(A, dims...) -> AbstractArray

  Return an array with the same data as A,
  but with different dimension sizes or number of dimensions.
  The two arrays share the same underlying data, so that the result is mutable
  if and only if A is mutable,
  and setting elements of one alters the values of the other.

In short, both functions allow you to change the shape of an array without
copying the data. vec always returns a vector, while reshape is more
flexible and allows you to produce an array of any dimension.
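To make the "shares the same underlying data" part of the contract concrete, here is a minimal sketch. The variable names are illustrative and not taken from the docstrings. It writes through the results of vec and reshape and then reads the change back from the source matrix:

```julia
A = [1 2; 3 4]          # a mutable 2×2 Matrix{Int64}

v = vec(A)              # 4-element vector in column-major order: 1, 3, 2, 4
v[1] = 10               # write through the vector...
A[1, 1]                 # ...and the matrix sees it: 10

r = reshape(A, 1, 4)    # 1×4 matrix over the same data
r[1, 4] = 40            # the fourth element in column-major order
A[2, 2]                 # 40
```

No element was copied at any point; A, v and r are three views of the same memory.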

Let me show you some use cases of these functions. First, assume I want to
produce a Cartesian product of two collections:

julia> collect(Iterators.product('a':'b', 1:3))
2×3 Matrix{Tuple{Char, Int64}}:
 ('a', 1)  ('a', 2)  ('a', 3)
 ('b', 1)  ('b', 2)  ('b', 3)

By default, the collect function produces a matrix. If for some reason
I needed a vector instead, I could write:

julia> vec(collect(Iterators.product('a':'b', 1:3)))
6-element Vector{Tuple{Char, Int64}}:
 ('a', 1)
 ('b', 1)
 ('a', 2)
 ('b', 2)
 ('a', 3)
 ('b', 3)

The important benefit of this operation is that vec is non-copying, so adding
this step is cheap. Let me give you another example, this time using
broadcasting:

julia> string.(['a' 'b'], 1:3)
3×2 Matrix{String}:
 "a1"  "b1"
 "a2"  "b2"
 "a3"  "b3"

julia> vec(string.(['a' 'b'], 1:3))
6-element Vector{String}:
 "a1"
 "a2"
 "a3"
 "b1"
 "b2"
 "b3"

Now let us have a look at reshape:

julia> reshape(1:6, 2, 3)
2×3 reshape(::UnitRange{Int64}, 2, 3) with eltype Int64:
 1  3  5
 2  4  6

Why would reshape be useful? Consider, for example, a simple function that turns
pairs of consecutive elements of a vector into tuples. One of the ways
(for sure not the only way) to implement this would be:

julia> totuples(x::AbstractVector) = Tuple.(eachcol(reshape(x, 2, :)))
totuples (generic function with 1 method)

julia> totuples(1:6)
3-element Vector{Tuple{Int64, Int64}}:
 (1, 2)
 (3, 4)
 (5, 6)
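To see how the one-liner works, it may help to run its steps separately. The sketch below is only an illustration; the intermediate names m and cols are mine and do not appear in the original function:

```julia
x = 1:6

m = reshape(x, 2, :)          # the colon lets reshape infer the second dimension
size(m)                       # (2, 3)

cols = collect(eachcol(m))    # three column views: [1, 2], [3, 4], [5, 6]
Tuple(cols[1])                # (1, 2)

Tuple.(cols)                  # [(1, 2), (3, 4), (5, 6)]
```

Each column of the 2×3 matrix holds one pair of consecutive elements, which is why converting the columns to tuples gives the desired result.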

The dangers

While the vec and reshape functions can be useful, using them carries some
risks. Let me discuss some common pitfalls.

The first is that when you reshape a collection, you may leave a permanent mark
on the source recording that it was used in reshape (even though reshape has no !
suffix). This can lead to hard-to-catch bugs. Let us check the following
code:

julia> x = [1, 2, 3, 4]
4-element Vector{Int64}:
 1
 2
 3
 4

julia> totuples(x)
2-element Vector{Tuple{Int64, Int64}}:
 (1, 2)
 (3, 4)

julia> push!(x, 5)
ERROR: cannot resize array with shared data

As you can see, although reshape was called inside the totuples
function, and the reshaped matrix created there with reshape(x, 2, :)
is already out of scope, the fact that we used reshape on x permanently
disallows resizing it.
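One way to sidestep this problem is not to call reshape on the argument at all. The following is a minimal sketch, not the function from this post: totuples_indexed is a hypothetical name, and the explicit even-length check is my own addition.

```julia
# Hypothetical alternative to totuples: pairs up consecutive elements by
# indexing, so the argument is never passed to reshape.
function totuples_indexed(x::AbstractVector)
    iseven(length(x)) || throw(ArgumentError("length of x must be even"))
    return [(x[i], x[i + 1]) for i in firstindex(x):2:lastindex(x) - 1]
end

x = [1, 2, 3, 4]
totuples_indexed(x)   # [(1, 2), (3, 4)]
push!(x, 5)           # x was never reshaped, so resizing it is not blocked
```

The trade-off is that the function is a bit longer, but it only reads from x and leaves no trace on it.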

The second risk is that vec and reshape may or may not create a new
object, as they might just return the source object. Let us check the following
code, which extends the original totuples function to accept any
AbstractArray. In the code I write y = vec(x), but the same behavior
would be present with y = reshape(x, :).

julia> function totuples2(x::AbstractArray)
           y = vec(x)
           isodd(length(x)) && push!(y, last(y))
           return totuples(y)
       end
totuples2 (generic function with 1 method)

julia> x = [1;;]
1×1 Matrix{Int64}:
 1

julia> totuples2(x)
1-element Vector{Tuple{Int64, Int64}}:
 (1, 1)

julia> x
1×1 Matrix{Int64}:
 1

julia> x = [1]
1-element Vector{Int64}:
 1

julia> totuples2(x)
1-element Vector{Tuple{Int64, Int64}}:
 (1, 1)

julia> x
2-element Vector{Int64}:
 1
 1

As you can see, our totuples2 function left [1;;] unchanged, since it is
a matrix, but [1] was updated. The reason is that vec(x) in this case just
returned its argument, so push! modified x itself.
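The difference between the two cases can be checked directly with the identity operator ===. This is a small illustrative sketch with variable names of my choosing:

```julia
v = [1, 2, 3]
m = [1 2; 3 4]

vec(v) === v    # true: a Vector is returned as-is, so push! on the result grows v
vec(m) === m    # false: a new Vector object, although it shares data with m
```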

Finally, as another application of the same rule, note that the vec
function (and similarly reshape when reshaping to a vector) may or may not
produce a vector that can be resized:

julia> totuples2([1 2; 3 4])
2-element Vector{Tuple{Int64, Int64}}:
 (1, 3)
 (2, 4)

julia> totuples2(reshape(1:4, 2, 2))
2-element Vector{Tuple{Int64, Int64}}:
 (1, 2)
 (3, 4)

julia> totuples2([1;;])
1-element Vector{Tuple{Int64, Int64}}:
 (1, 1)

julia> totuples2(reshape(1:1, 1, 1))
ERROR: MethodError: no method matching resize!(::UnitRange{Int64}, ::Int64)

As you can see, this is tricky, as the function you use, like totuples2 in our
case, might throw an error only in some cases and work in others. In the
totuples2 case the reason is that push! is called only if the length of the
collection is odd.

The point of all these examples is that code using vec or reshape can lead
to hard-to-diagnose errors. The reason is that you might notice the problems
caused by using them much later in the code than when you used them.
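A defensive way to write such a function, assuming the cost of one copy is acceptable, is to resize only a private copy of the data. The sketch below is my own variant, not code from this post, and totuples_safe is a hypothetical name:

```julia
# Hypothetical variant of totuples2 that works on a private copy of the data.
function totuples_safe(x::AbstractArray)
    y = collect(vec(x))                      # fresh Vector; never aliases x
    isodd(length(y)) && push!(y, last(y))    # resizing the copy is always allowed
    return [(y[i], y[i + 1]) for i in 1:2:length(y)]
end

x = [1]
totuples_safe(x)                    # [(1, 1)]
x                                   # still [1]
totuples_safe(reshape(1:1, 1, 1))   # [(1, 1)], no MethodError
```

Because collect always returns a new Vector, the function behaves the same way for vectors, matrices and reshaped ranges, at the price of giving up the non-copying benefit of vec.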

Conclusions

The vec and reshape functions are nice utilities and I use them quite often.
However, to use them safely I always follow these two rules:

when writing a function, never use reshape on an array that is an argument
of the function; the reason is that in some cases reshape will silently
leave the “no resize” mark on the source vector; if I use reshape, I make
sure that its source is always some short-lived object from a local scope;

do not resize the vector produced by vec/reshape, as such resizing may
or may not affect the source, and keeping track of whether it does
is hard (and most likely readers of such code will not be able to easily
tell); as a softer rule, I generally avoid any mutation of the output of
vec/reshape, as it is not guaranteed to be mutable.

In short: the best uses for vec and reshape are situations when their source
is a short-lived object and you do not want to mutate their output in any way.

Strings vs symbols in DataFrames.jl column indexing

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/08/05/symbol.html

Introduction

In DataFrames.jl you can use both symbols and strings for column indexing. Which
to choose is one of the questions that new users ask most frequently. In this
post I will explain why both options are supported and what the difference
between them is. Note that this is an entry-level post, so I will omit many details
of the discussed topic and focus on the most important aspects only.

The post was written under Julia 1.7.2, DataFrames.jl 1.3.4,
DataFramesMeta.jl 0.12.0, BenchmarkTools.jl 1.3.1.

What are strings and symbols?

In Julia a string allows users to store sequences of characters. The simplest
way to create a string is to write some text between double quotation marks:

julia> "an example string"
"an example string"

Symbols are objects used in Julia to create identifiers. You can think of them
as labels. Symbols are normally created by prefixing some label with : like
this:

julia> :label
:label

This syntax only works for symbols that are valid variable names.
So, for example, you cannot create a symbol that contains a space using the : prefix:

julia> :my label
ERROR: syntax: extra token "label" after end of expression

Instead, in such cases, you need to call Symbol passing it a string as
an argument:

julia> Symbol("my label")
Symbol("my label")

How are strings and symbols different?

To understand the difference between symbols and strings it is easiest to
think of them as follows:

symbols are labels;
strings are sequences of characters.

So symbols are indivisible: they are always considered as a whole,
while strings consist of multiple characters. The most important consequences of
this distinction are the following:

symbols are faster than strings when you compare them for equality using ==;

you can manipulate strings (e.g. uppercase, chop, perform substring matching, etc.),
while no such operations are supported for symbols.

Let us have a look at these two characteristics by example. First we check
comparison speed. We create 1000-element vectors with unique values and compare
all pairs of their entries, so we make 1 million comparisons and expect 1000
matches.

julia> using BenchmarkTools

julia> string_vec = string.("s", 1:1000)
1000-element Vector{String}:
 "s1"
 "s2"
 "s3"
 "s4"
 ⋮
 "s997"
 "s998"
 "s999"
 "s1000"

julia> symbol_vec = Symbol.("s", 1:1000)
1000-element Vector{Symbol}:
 :s1
 :s2
 :s3
 :s4
 ⋮
 :s997
 :s998
 :s999
 :s1000

julia> test_cmp(v) = count(x == y for x in v, y in v)
test_cmp (generic function with 1 method)

julia> @btime test_cmp($string_vec)
  3.038 ms (0 allocations: 0 bytes)
1000

julia> @btime test_cmp($symbol_vec)
  635.400 μs (0 allocations: 0 bytes)
1000

Indeed symbol comparison is faster.
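One way to picture why comparing symbols is cheap, as a rough sketch and not a statement about the exact implementation: Julia interns symbols, so two symbols with the same name are the very same object, no matter how each was created. The variable names below are illustrative:

```julia
a = Symbol("s", 1)    # built at run time from parts
b = :s1               # written as a literal

a === b               # true: both names refer to the same interned symbol
```

Comparing two strings for equality, in contrast, has to take their contents into account.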

Now let us look at manipulation:

julia> str = "example"
"example"

julia> uppercase(str)
"EXAMPLE"

julia> chop(str)
"exampl"

julia> match(r"ex", str)
RegexMatch("ex")

julia> sym = :example
:example

julia> uppercase(sym)
ERROR: MethodError: no method matching uppercase(::Symbol)

julia> chop(sym)
ERROR: MethodError: no method matching chop(::Symbol)

julia> match(r"ex", sym)
ERROR: MethodError: no method matching match(::Regex, ::Symbol)
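If a label held in a symbol does need to be manipulated, a common workaround is to convert it to a string, do the work there, and convert the result back. The following is a minimal sketch with illustrative variable names:

```julia
sym = :example

str = String(sym)                # "example"
upper = Symbol(uppercase(str))   # :EXAMPLE
chopped = Symbol(chop(str))      # :exampl
startswith(str, "ex")            # true
```

The round trip works, but it shows that the actual manipulation always happens on the string side.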

So, in summary, we can conclude that:

you can use a symbol if the value stored in it is not manipulated
(i.e. it is treated as a label); symbols are faster to compare than strings
and a bit easier to type (only the : prefix is needed), provided that they do
not contain characters like spaces (in which case they are not convenient
to type);

strings support manipulation, as opposed to symbols; the cost is that
comparing them is slower than comparing symbols.

Let us now discuss how these considerations translate to the DataFrames.jl realm.

Strings vs symbols in DataFrames.jl

Column names in a DataFrame are labels. For this reason both symbols and
strings are allowed to be used when referencing them without introducing
an ambiguity. Here is an example. We start with strings:

julia> using DataFrames

julia> df = DataFrame("col1" => 1, "col 2" => 2)
1×2 DataFrame
 Row │ col1   col 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df."col1"
1-element Vector{Int64}:
 1

julia> df."col 2"
1-element Vector{Int64}:
 2

julia> df[:, "col1"]
1-element Vector{Int64}:
 1

julia> df[:, "col 2"]
1-element Vector{Int64}:
 2

Now we try the same with symbols:

julia> df = DataFrame(:col1 => 1, Symbol("col 2") => 2)
1×2 DataFrame
 Row │ col1   col 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df.col1
1-element Vector{Int64}:
 1

julia> getproperty(df, Symbol("col 2"))
1-element Vector{Int64}:
 2

julia> df[:, :col1]
1-element Vector{Int64}:
 1

julia> df[:, Symbol("col 2")]
1-element Vector{Int64}:
 2

We now see the first difference, which we have already discussed. If column
names are all valid variable names, symbols are more convenient; however,
if they are not (e.g. they contain spaces), then using strings is more convenient.
As an extreme case, note that the convenience syntax for getproperty using
the . accessor does not work for symbols containing spaces, and we need to make
an explicit getproperty call.

The second important aspect is that all functions that manipulate column
names in DataFrames.jl work with strings. This is natural, as symbol
manipulation is not supported by Julia. Here is a combo showing this in action:

julia> select(df, Cols(startswith("c")) .=> identity .=> uppercase)
1×2 DataFrame
 Row │ COL1   COL 2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

The Cols(startswith("c")) .=> identity .=> uppercase operation specification
syntax means that we want to pick all columns whose names start with "c"
(note that the startswith function expects a string as input), keep them
unchanged (the identity function), and uppercase their names in the output
(note that uppercase expects a string as input).
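The two halves of this operation can also be tried separately. The sketch below assumes DataFrames.jl is installed and rebuilds the same two-column data frame; it uses names with a column selector and rename with a function, both of which pass column names to the function as strings:

```julia
using DataFrames

df = DataFrame("col1" => 1, "col 2" => 2)

# Column names selected by a predicate that receives each name as a string.
selected = names(df, Cols(startswith("c")))   # ["col1", "col 2"]

# A new data frame whose column names were passed through uppercase.
renamed = rename(uppercase, df)
names(renamed)                                # ["COL1", "COL 2"]
```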

Finally, you might ask about comparison of speed of column lookup using strings
vs symbols. Here is a simple test:

julia> @btime $df.col1
  7.500 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
 1

julia> @btime $df."col1"
  38.446 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
 1

As you can see, there is a noticeable performance difference. However, please
note that both these operations are very fast. Therefore, in practice,
column lookup is almost never a performance bottleneck in operations on
data frames (usually what you do with the column picked from a data frame
is more expensive by several orders of magnitude). So a practical recommendation
is that, most of the time, performance should not be the reason for choosing
symbols over strings.

If you really need speed then column lookup using an integer index is fastest:

julia> @btime $df[!, 1]
  4.100 ns (0 allocations: 0 bytes)
1-element Vector{Int64}:
 1

However, this way of picking columns is not recommended and you should use it
only if you are sure what column is stored under a given number in a data frame.
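If integer indexing is needed for speed, one hedged way to avoid hard-coding a position is to resolve the column name to an index once and reuse that index afterwards. The sketch below assumes DataFrames.jl is installed; the variable idx is my own name:

```julia
using DataFrames

df = DataFrame("col1" => 1, "col 2" => 2)

idx = findfirst(==("col1"), names(df))   # 1; would be nothing if the column were absent
df[!, idx] === df.col1                   # true: both return the same column vector
```

This keeps the name as the source of truth while the repeated lookups use the integer.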

Additional practical considerations of using strings and symbols in DataFrames.jl

The first tip is that you can get a list of column names of a data frame as
strings and as symbols in DataFrames.jl using the names and propertynames
functions respectively:

julia> names(df)
2-element Vector{String}:
 "col1"
 "col 2"

julia> propertynames(df)
2-element Vector{Symbol}:
 :col1
 Symbol("col 2")
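The two lists describe the same columns, so they can be converted into each other by broadcasting Symbol or string. A short sketch, assuming DataFrames.jl is installed and using the same two-column data frame:

```julia
using DataFrames

df = DataFrame("col1" => 1, "col 2" => 2)

Symbol.(names(df)) == propertynames(df)    # true
string.(propertynames(df)) == names(df)    # true
```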

The second important consideration is that in DataFramesMeta.jl only symbols are
considered to be column identifiers in operations by default.
Therefore you can write:

julia> using DataFramesMeta

julia> @rselect(df, :out = :col1 + 1)
1×1 DataFrame
 Row │ out
     │ Int64
─────┼───────
   1 │     2

If you want to use strings instead you have to escape them with $:

julia> @rselect(df, $"out" = $"col1" + 1)
1×1 DataFrame
 Row │ out
     │ Int64
─────┼───────
   1 │     2

Conclusions

The post today was long, but the conclusion is simple. In DataFrames.jl
you can use both symbols and strings to access a column of a data frame.
The main consideration when picking one or the other should be your
convenience.

juliabloggers.com

A Julia Language Blog Aggregator
