Author Archives: Blog by Bogumił Kamiński

The String, or There and Back Again

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/08/13/strings.html

Introduction

The String type in the Julia language supports the full range of Unicode
characters, which is great in practice. Other features that are important
when working with the String type (especially when you need performance)
are the following:

  • String is immutable, but it is not interned (notably Symbols are
    interned and thus can be compared using === fast);
  • taking a single character from a String produces a Char, which is a 32-bit
    value;
  • Julia Base normally assumes that String is UTF-8 encoded which is handy
    because most likely your source is UTF-8 encoded and in general UTF-8 has
    a reasonably compact memory footprint (but Strings that contain invalid
    encodings are allowed to be constructed, and it is possible to
    transcode strings).
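
The properties listed above can be checked directly. The snippet below is a minimal sketch; the variable names are only illustrative, and the Polish sample word is an arbitrary choice of mine:

```julia
# Symbols are interned, so identity comparison with === is cheap.
sym_same = Symbol("abc") === :abc

# A Char always occupies 4 bytes, regardless of its width in UTF-8.
char_bytes = sizeof(Char)

# A String holding an invalid UTF-8 byte can still be constructed.
bad = "\xff"
bad_is_valid = isvalid(bad)

# transcode converts between UTF-8 and UTF-16 code units.
utf16 = transcode(UInt16, "zażółć")
round_trip = transcode(String, utf16)

println((sym_same, char_bytes, bad_is_valid, length(utf16), round_trip))
```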

While I think that the Julia manual section on Strings does a very
good job of explaining how they work, I often find that people are confused by
the consequences of the UTF-8 encoding of String, and this post is intended to
cover this ground in a bit more depth.

Everything I write here was tested under Julia 1.5.

Two types of indices for String

Since String is UTF-8 encoded one character is represented by 1, 2, 3, or 4
bytes in it, see e.g. Wikipedia for the details of the encoding.
In Julia you can check that indeed one code unit of String is one byte by
writing (I am storing the String in the str variable as we will soon use it
again):

julia> str = "😄 Hello! 👋"
"😄 Hello! 👋"

julia> codeunit(str)
UInt8

Now this string contains 10 characters, which you can check using the
length function:

julia> length(str)
10

but it is actually stored in more bytes, as the ncodeunits function tells us:

julia> ncodeunits(str)
16

The reason is that the first and last characters in this string are not ASCII
(note that in UTF-8 every ASCII character is stored in one byte), which we can
check in the following way:

julia> foreach(c -> println(repr(c), ":\t", ncodeunits(c)), str)
'😄':    4
' ':    1
'H':    1
'e':    1
'l':    1
'l':    1
'o':    1
'!':    1
' ':    1
'👋':    4
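
For comparison, here is a small sketch covering all four possible UTF-8 widths. The sample characters are my own picks and do not come from the string above:

```julia
# One sample character for each possible UTF-8 width (1 to 4 bytes).
samples = ['a', 'ą', '€', '😄']

for c in samples
    code = uppercase(string(UInt32(c), base = 16))
    println(repr(c), " U+", code, ": ", ncodeunits(c), " byte(s)")
end

widths = [ncodeunits(c) for c in samples]
```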

Given these observations the natural questions are:

  • how to get the i-th code unit in the String (so called byte index);
  • how to get the i-th character in the String (so called character index);
  • is it easy to go ‘There and Back Again’ between byte and character indices;
  • which functions expect byte indices, which expect character indices and what
    is the cost of using these functions.

Below I try to answer these questions.

Getting code units

Getting the i-th code unit is simple (but in practice rarely needed, unless
you are working with strings at a low level): you just use the codeunit
function, e.g.:

julia> codeunit(str, 1)
0xf0

julia> codeunit(str, 2)
0x9f

julia> codeunit(str, 3)
0x98

julia> codeunit(str, 4)
0x84

julia> codeunit(str, 5)
0x20

or you can use the codeunits function to get them as a vector:

julia> codeunits(str)
16-element Base.CodeUnits{UInt8,String}:
 0xf0
 0x9f
 0x98
 0x84
 0x20
 0x48
 0x65
 0x6c
 0x6c
 0x6f
 0x21
 0x20
 0xf0
 0x9f
 0x91
 0x8b

Getting characters

Now, this is more tricky:

julia> str[1]
'😄': Unicode U+1F604 (category So: Symbol, other)

julia> str[5]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

julia> str[2]
ERROR: StringIndexError("😄 Hello! 👋", 2)

and you see that the getindex function in the str[i] syntax does not give
you the i-th character in the string but rather the character that starts at
byte index i in the string (and errors if no character starts at the given
byte index).

So how should one get the i-th character in the string? Use the nextind
function in the following way:

julia> str[nextind(str, 0, 1)]
'😄': Unicode U+1F604 (category So: Symbol, other)

julia> str[nextind(str, 0, 2)]
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

where you pass 0 as the first argument and the desired character index as a
second argument.

You might ask why this is so awkward. The reason is that computing the location
of the i-th character in the string is expensive, so normally one should use byte
indexing. Notably, most functions working on strings take byte indices and only
one function, length, returns the number of characters; all other functions
return byte indices or a number of bytes (more on this below in the glossary
section).

For example (we store loc as we will use it later also):

julia> loc = findfirst(==('H'), str)
6

julia> str[loc]
'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)

and this is fast.
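
If you do need character positions for the whole string, one pass over it is enough. The following is only a sketch of that idea, using the same example string; the variable names are mine:

```julia
str = "😄 Hello! 👋"

# eachindex walks the valid byte indices in order, so a single O(n) pass
# yields the byte index of every character.
byte_index_of_char = collect(eachindex(str))

# pairs(str) yields byte index => character; enumerate adds the character index.
for (char_idx, (byte_idx, c)) in enumerate(pairs(str))
    println(char_idx, " => ", byte_idx, " => ", repr(c))
end
```

This avoids calling nextind(str, 0, i) separately for every i, where each call has to scan from the start of the string.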

Be warned though that when you use byte indices you should not do arithmetic
on them (unless your string is ASCII only, which you can check using the
isascii function). For instance, if you want to go back two characters from H,
do not write:

julia> str[loc - 2]
ERROR: StringIndexError("😄 Hello! 👋", 4)

but rather write:

julia> str[prevind(str, loc, 2)]
'😄': Unicode U+1F604 (category So: Symbol, other)

and let prevind do the calculation of an appropriate byte index. Trying to do
arithmetic on byte indices is the most common error when working with Strings
in Julia.
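
A short sketch of the safe alternatives to index arithmetic, using only functions from Julia Base on the same example string (the variable names are illustrative):

```julia
str = "😄 Hello! 👋"
loc = findfirst(==('H'), str)           # byte index 6

ascii_only = isascii(str)               # false, so index arithmetic is unsafe
starts_char = isvalid(str, loc - 2)     # false: byte 4 is inside the first character
covering = thisind(str, loc - 2)        # 1: start of the character covering byte 4
prev1 = prevind(str, loc)               # 5: one character back
next1 = nextind(str, loc)               # 7: one character forward
```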

Finally, it is easy to get all byte indices that point to the start of a
character in the string with the eachindex function:

julia> foreach(i -> println("$i:\t$(repr(str[i]))"),  eachindex(str))
1:  '😄'
5:  ' '
6:  'H'
7:  'e'
8:  'l'
9:  'l'
10: 'o'
11: '!'
12: ' '
13: '👋'

Going there and back again between byte and character indices

If you have a byte index and want to find the character index of the character
that covers this code unit, write:

julia> length(str, 1, loc)
3

Note that the byte index does not have to be a valid start of a character. In
this case the index of the character that contains this byte is returned:

julia> length(str, 1, 1)
1

julia> length(str, 1, 2)
1

julia> length(str, 1, 3)
1

julia> length(str, 1, 4)
1

julia> length(str, 1, 5)
2

(byte index 5 corresponds to the second character in the string)

If you have a character index and want to learn the byte index of this character
in the String, we already know that we should use the nextind function. For example:

julia> nextind(str, 0, 3)
6

gets you the byte index of the 'H' character in the string.
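
The two conversions can be wrapped in a pair of small helpers. This is a sketch only: char_to_byte and byte_to_char are hypothetical names of mine, not functions from Julia Base:

```julia
# Hypothetical helpers wrapping the two conversions described above.
char_to_byte(s::AbstractString, i::Integer) = nextind(s, 0, i)
byte_to_char(s::AbstractString, b::Integer) = length(s, 1, b)

str = "😄 Hello! 👋"

# There and back again: character index -> byte index -> character index.
round_trip_ok = all(byte_to_char(str, char_to_byte(str, i)) == i
                    for i in 1:length(str))

println(char_to_byte(str, 3))    # byte index of 'H'
println(byte_to_char(str, 16))   # the last byte belongs to the last character
```

Note that both helpers scan the string, so each call costs O(n).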

The glossary

Below I present the list of functions available in Julia Base that work with
strings, with a note on whether they work with byte or character indices and
information about their time complexity for the String type. I omit the
description of what the functions do to keep the table brief, and I list only
the functions that either take or return an index or a number of bytes/characters.

Function       arguments                     return value       complexity
-------------  ----------------------------  -----------------  ----------
length         byte index                    characters         O(n)
ncodeunits     -                             bytes              O(1)
sizeof         -                             bytes              O(1)
codeunit       byte index                    -                  O(1)
isvalid        byte index                    -                  O(1)
getindex       byte index                    -                  O(1)
SubString      byte index                    -                  O(1)
view, @view    byte index                    -                  O(1)
unsafe_string  byte index                    -                  O(1)
match          byte index                    -                  O(n)
findfirst      -                             byte index         O(n)
findlast       -                             byte index         O(n)
findnext       -                             byte index         O(n)
findprev       -                             byte index         O(n)
firstindex     -                             byte index         O(1)
lastindex      -                             byte index         O(1)
thisind        byte index                    byte index         O(1)
prevind        i: byte index, n: characters  byte index         O(n)
nextind        i: byte index, n: characters  byte index         O(n)
chop           character index               -                  O(n)
first          character index               -                  O(n)
last           character index               -                  O(n)
lpad           characters                    -                  O(n)
rpad           characters                    -                  O(n)
textwidth      -                             screen characters  O(n)

(note that nextind and prevind are O(n) in n, the number of characters to
skip, but O(1) in the starting index i, as UTF-8 is self-synchronizing)
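
To make the distinction in the table concrete, here is a brief sketch contrasting functions that count characters with indexing by a byte range, again on the example string from this post:

```julia
str = "😄 Hello! 👋"

# first, last and chop count characters, not bytes.
head2 = first(str, 2)                    # first two characters
tail2 = last(str, 2)                     # last two characters
inner = chop(str, head = 1, tail = 1)    # drop one character from each end

# The same prefix expressed with byte indices needs the range 1:5,
# because the second character starts at byte 5.
head2_bytes = str[1:5]

println(repr(head2), " ", repr(tail2), " ", repr(inner))
```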

Conclusions

I must admit that working with the String type can sometimes be tricky, but I
hope that the summary I have presented in this post will help you navigate
the options more easily.

Finally, String is not the only string type available in the Julia language.
Actually, most functions just work with any AbstractString, and there are
alternative string types developed in the community; you can check them out
for example here.
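
As a small illustration of code that works with any AbstractString, here is a sketch in which the same function accepts both a String and a SubString. The function wide_char_indices is a hypothetical helper written for this example:

```julia
# Hypothetical helper: byte indices at which characters wider than one byte start.
wide_char_indices(s::AbstractString) =
    [i for i in eachindex(s) if ncodeunits(s[i]) > 1]

str = "😄 Hello! 👋"
sub = SubString(str, 6, 13)    # "Hello! 👋", a view into str without copying

println(wide_char_indices(str))
println(wide_char_indices(sub))   # indices are relative to the SubString
```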

JuliaCon2020: Julia is production ready!

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/08/07/production-ready.html

Introduction

Last time I posted about the take-aways for DataFrames.jl from
JuliaCon 2020. This time I want to share my general conclusions
from attending different talks at this extremely successful event.

I have spent 20 years now deploying data science related projects in corporate
environments (back then it was not called data science, but we were already
training neural networks to make predictions) and have many colleagues who are
deeply into enterprise software development. Let me quote from a discussion
I had some time ago with Tomasz Olczak, who is a genuine one-man army in
delivering complex enterprise projects:

Julia is fast, and has a very nice syntax, but its ecosystem is not mature
enough for use in serious production projects.

For many years I would have agreed with this, but after JuliaCon 2020
I believe we can confidently announce that

Julia is production ready!

Let me now give a list of key (in my opinion) presentations given during
JuliaCon 2020 that make me draw this conclusion.

I will not comment here on functionalities related to number crunching, as it
is clear that Julia shines here, but rather I want to focus on the things that
make Julia a great tool for deployment in production (still I skip many
interesting talks in this area — check out the detailed agenda to
learn more).

Building microservices and applications

In this talk Jacob Quinn gives an end-to-end tutorial on how
to build and deploy a microservice using Julia in an enterprise setting.
He gives ready recipes for typical tasks that need to be handled in
such contexts: logging, context management, middleware setup, authentication,
caching, connection pooling, dockerization, and many others that
are the bread and butter of enterprise projects.

In addition, be sure to check out:

  • the shippable apps talk, where Kristoffer Carlsson guides you
    through creating executables which can be run on machines that do not have Julia
    installed.
  • the Julia for scripting presentation, during which
    Fredrik Ekre discusses the best practices for using Julia in contexts
    where you need to execute short code snippets many times.
  • the Genie.jl talk, in which Adrian Salceanu shows that it is currently
    a mature, stable, performant, and feature-rich Julia web development framework.

Dependency management

The two talks Pkg.update() and What’s new in Pkg show that
Julia currently has best-in-class functionality for enterprise-grade dependency
management for your projects. The list of provided features is so long
that it is hard to list them all here.

Let me just mention one particular tool in this ecosystem, presented in the
BinaryBuilder.jl talk, which explains how to take software written in compiled
languages such as C, C++, Fortran, Go or Rust, and build precompiled artifacts
that can be used from Julia packages with ease (which means that no compilation
has to take place on the client side when you install packages with such dependencies).

Integration with external libraries

A natural topic related to dependency management is how to integrate Julia with
external tools. This area of functionality is really mature. Here is a list of
talks that cover this topic:

It is worth adding that Julia has had great integration with Python for many
years now; see JuliaPy.

A good end-to-end example of doing some real work in Julia that requires integration
is the Creating a multichannel wireless speaker setup with Julia
talk, which shows how to easily stitch things together (in particular featuring
ZMQ.jl, Opus.jl, PortAudio.jl, and DSP.jl).

Another interesting talk showcasing integration capabilities is
JSServe: Websites & Dashboards in Julia
that shows a high performance framework to easily combine interactive plots,
markdown, widgets and plain HTML/Javascript in Jupyter / Atom / Nextjournal and
on websites.

Developer tooling

The two great talks Juno 1.0 and Using VS Code show that current IDE
support for Julia in VS Code is first class. You have all the tools that you
would normally expect to get: code analysis (static and dynamic), a debugger, workspaces,
integration with Jupyter Notebooks, and remote capabilities.

Managing ML workflows

I do not want to cover many different ML algorithms that are available in Julia
natively, as there are just too many of them (and if something is missing you
can easily integrate it — see the integration capabilities section above).

However, on top of particular models you need frameworks that let you manage
ML workflows. In this area there are two interesting talks, one about
MLJ: a machine learning toolbox for Julia and the other showing
AutoMLPipeline: A ToolBox for Building ML Pipelines. From my experience
such tools are crucial when you want to move your ML models from a data scientist’s
sandbox to real production usage.

Conclusion

Obviously, I have omitted many interesting things that were shown during
JuliaCon 2020. However, I hope that the aspects I have covered here,
that is:

  • enterprise grade patterns to create microservices and applications in Julia,
  • robust dependency management tools,
  • very flexible and powerful capabilities to integrate Julia with existing code
    bases that were not written in Julia,
  • excellent developer tooling in VS Code,
  • mature packages that help you to create production-grade code for ML solutions
    deployment,

show that Julia can already (and should) be considered a serious option
for your next project in an enterprise environment.

What I believe is crucially important is that not only do we have the required
tools ready, but we also have great practical showcases of how they can be used
to build robust production code.

JuliaCon2020: conclusions for DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2020/08/02/post_juliacon_1.html

Introduction

JuliaCon 2020 was a great event. It has opened my eyes to many
fantastic things that happen in the ecosystem and for sure I will write at least
one more post with the summary of my take-aways.

In this post I want to summarize my conclusions from the discussions around
DataFrames.jl and the related ecosystem. In particular, the
Julia & Data: An Evolving Ecosystem BOF was a great gathering to discuss
the future directions. Thank you to all who participated.

The survey

Before the BOF I made a quick survey to check with the community
where the development effort for DataFrames.jl should focus. While many
topics were found important, the top issue is performance, with a particular
emphasis on adding threading support and improving the performance of joins
(which are currently sub-par in comparison to aggregation).

Therefore this will be the area that I plan to focus most of the development
effort in the short term (of course all contributors are encouraged to open
issues/PRs in all potential areas of improvement and they will be handled).

In particular, regarding performance, I have opened an issue
related to joins. Everyone is welcome to comment there with thoughts on how things
could be improved. I believe that the major reason for the current bad performance
is that we have only one join algorithm, which treats the left and right
joined data frames differently, and in some cases this leads to severe performance
bottlenecks.

To give an example of the problem consider the following timings in
DataFrames.jl 0.21.4:

julia> using DataFrames, BenchmarkTools

julia> df1 = DataFrame(id=1:10^6, x1=1:10^6);

julia> df2 = DataFrame(id=1:10^3, x2=1:10^3);

julia> @benchmark innerjoin($df1, $df2, on=:id)
BenchmarkTools.Trial:
  memory estimate:  76.41 MiB
  allocs estimate:  1999686
  --------------
  minimum time:     215.176 ms (0.00% GC)
  median time:      229.573 ms (0.00% GC)
  mean time:        228.554 ms (2.15% GC)
  maximum time:     241.558 ms (6.11% GC)
  --------------
  samples:          22
  evals/sample:     1

julia> @benchmark innerjoin($df2, $df1, on=:id)
BenchmarkTools.Trial:
  memory estimate:  61.54 MiB
  allocs estimate:  1692
  --------------
  minimum time:     115.506 ms (0.00% GC)
  median time:      122.250 ms (0.00% GC)
  mean time:        123.309 ms (0.29% GC)
  maximum time:     133.132 ms (0.63% GC)
  --------------
  samples:          41
  evals/sample:     1

julia> df2 = DataFrame(id=1:10, x2=1:10);

julia> @benchmark innerjoin($df1, $df2, on=:id)
BenchmarkTools.Trial:
  memory estimate:  76.30 MiB
  allocs estimate:  1999673
  --------------
  minimum time:     55.207 ms (0.00% GC)
  median time:      69.426 ms (0.00% GC)
  mean time:        68.312 ms (7.17% GC)
  maximum time:     81.201 ms (17.13% GC)
  --------------
  samples:          74
  evals/sample:     1

julia> @benchmark innerjoin($df2, $df1, on=:id)
BenchmarkTools.Trial:
  memory estimate:  61.42 MiB
  allocs estimate:  201
  --------------
  minimum time:     117.681 ms (0.00% GC)
  median time:      121.471 ms (0.00% GC)
  mean time:        122.413 ms (0.26% GC)
  maximum time:     131.358 ms (0.64% GC)
  --------------
  samples:          41
  evals/sample:     1

As you can see, the order of arguments matters and influences the performance
in a non-trivial way. An additional challenge for managing the deprecation process
when we change the implementation is that the row order of the result of a join
depends on the order in which the data frames were passed (and it is
possible that faster algorithms will produce different row orderings of the
resulting joined table).
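
To illustrate why row order can depend on the argument order, here is a toy hash join in plain Julia. It is a sketch under my own assumptions and is not the algorithm used in DataFrames.jl; toy_innerjoin and the sample keys are hypothetical:

```julia
# Toy inner join on integer keys; NOT the DataFrames.jl implementation.
# The `build` side is hashed and the `probe` side is scanned in order,
# so the output rows follow the order of `probe`.
function toy_innerjoin(probe::Vector{Int}, build::Vector{Int})
    positions = Dict{Int,Vector{Int}}()
    for (j, key) in enumerate(build)
        push!(get!(positions, key, Int[]), j)
    end
    out = Tuple{Int,Int,Int}[]    # (key, row in probe, row in build)
    for (i, key) in enumerate(probe)
        for j in get(positions, key, Int[])
            push!(out, (key, i, j))
        end
    end
    return out
end

left = [3, 1, 2]
right = [2, 3, 4]

keys_lr = first.(toy_innerjoin(left, right))    # follows the order of `left`
keys_rl = first.(toy_innerjoin(right, left))    # follows the order of `right`
println(keys_lr, " vs ", keys_rl)
```

Both calls match the same keys, but in a different order, which is the kind of difference a change of join algorithm could expose to users.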

The ecosystem

For the things that happen around DataFrames.jl I would like to highlight two
out of many interesting efforts:

  • It can be expected that Apache Arrow will soon have full support in Julia.
    I think this is super important, and when we have it, it will be much
    easier to use Julia in enterprise applications.
  • There is a significant amount of work done to make DataFramesMeta.jl even more
    user friendly than it is now. I am really looking forward to it, as then
    in DataFrames.jl we will be able to concentrate on the internals and making
    things fast, and the bells and whistles that make daily work with data
    frames smooth will be provided in DataFramesMeta.jl.

The last point relates to the tension around how much DataFrames.jl should
follow the Unix convention of "do one thing, and do it right" vs the approach where
we would like to see it as a Swiss Army knife for all tabular data
manipulation tasks. There are pros and cons to both approaches, and soon I will
write a separate post explaining my current thinking about this issue.

What is next?

In conclusion, I would like to describe what to expect in DataFrames.jl
development in the coming months. Please consider it my personal view, as the
community might disagree:

  • In 1-2 months we shall have a 0.22 release that will introduce new breaking
    changes.
  • The 1.0 release will probably happen in early 2021, with the major target
    of incorporating performance improvements.

Now what is the rationale behind this:

  • Several corner cases of functionality were found in 0.21 that we should
    change (like making sure transform does not reorder existing columns and
    properly handles data frames with zero rows, see this PR for details).
    So we need a minor release relatively soon.
  • When introducing performance fixes we might need to change how the rows of
    the requested operations are ordered (e.g. in joins). This means that making
    performance improvements might introduce changes that will be breaking.
    We should not expect to fix all performance issues (e.g. providing
    decent threading support) sooner than the end of 2020, and such things then
    require detailed tests, as the algorithms that are fast are usually complex.

Having said that, I am committed to the contract we stated when releasing
version 0.21: we do not want to make breaking changes after this release.
Therefore, as users you can expect that this promise is taken very seriously and
that if we break something there is a strong reason for it. In particular, I very
strongly want to avoid API breakage (we can extend the API, but not break
things that already worked). However, as you can see from this post, what might
change is the column or row order of the result of some operations. From a
database perspective such changes mostly would not be considered breaking, but
because DataFrames.jl is treated as a matrix-like structure by some operations
in users’ code, row and column order matters.