Category Archives: Julia

Adjusting to Julia: Generating the Fibonacci sequence

By: Jan Vanhove

Re-posted from: https://janhove.github.io/posts/2022-12-20-julia-fibonacci/

I’m currently learning a bit of Julia and I thought I’d share with you a couple of my attempts at writing Julia code. I’ll spare you the sales pitch, and I’ll skip straight to the goal of this blog post: writing three different Julia functions that can generate the Fibonacci sequence.

The Fibonacci sequence

The famous Fibonacci sequence is an infinite sequence of natural numbers, the first of which are 1, 1, 2, 3, 5, 8, 13, …. The sequence is defined as follows:

Let’s write some Julia functions that can generate this sequence.

Julia

You can download Julia from julialang.org. I’m currently using the Pluto.jl package that allows you to write Julia code in a reactive notebook. Check out the Pluto.jl page for more information.

First alternative: A purely recursive function

The Fibonacci sequence is defined recursively: To obtain the nth Fibonacci number, you first need to compute the n-1th and n-2th Fibonacci number and then add them. We can write a Julia function that exactly reflects the definition of the Fibonacci sequence like so:

function fibonacci(n)
  if n <= 2
    return 1
  end
  return fibonacci(n - 1) + fibonacci(n - 2)
end
fibonacci (generic function with 1 method)

This function tacitly assumes that n is a non-zero natural number. If n is equal to or lower than 2, i.e., if n is 1 or 2, it immediately returns 1, as per the definition of the sequence. If this condition isn’t met, the output is computed recursively. The function can be run as follows:

fibonacci(10)
55

Checks out! But from a computational point of view, the fibonacci() function is quite wasteful. In order to obtain fibonacci(10), we need to compute fibonacci(9) and fibonacci(8). But in order to compute fibonacci(9), we also need to compute fibonacci(8). For both fibonacci(9) and fibonacci(8), we need to compute fibonacci(7), etc. In fact, we need to compute the value of fibonacci(8) two times, that of fibonacci(7) three times, that of fibonacci(6) five times, and that of fibonacci(5) seven times. So we’d be doing lots of computations over and over again. For this reason, the fibonacci() function is hopelessly inefficient: While you can compute fibonacci(10) in a fraction of a second, it may take minutes to compute, say, fibonacci(60). Luckily, we can speed up our function considerably.

Second alternative: Recursion with memoisation

Memoisation is a programming technique where any intermediate result that you computed is stored in an array. Before computing any further intermediate results, you then first look up in the array if you haven’t in fact already computed it, saving you a lot of unnecessary computations. The following Julia function is a bit more involved that the previous one, but it’s much more efficient.

function fib_memo(n)
  known = zeros(Int64, n)
  function memoize(k)
    if known[k] != 0
      # do nothing
    elseif k == 1 || k == 2
      known[k] = 1
    else
      known[k] = memoize(k-1) + memoize(k-2)
    end
    return known[k]
  end
  return memoize(n)
end
fib_memo (generic function with 1 method)

The overall function that we’ll actually call is fib_memo(). It creates an array called known with n zeroes. Then it defines an inner function memoize(). This latter function obtains an integer k that in practice will range from 0 to n and does the following. First, it checks if the kth value in the array known is still 0. If it got changed, the function just returns the kth value in known. Otherwise, if k is equal to either 1 or 2, it sets the first or second value of known to 1. If k is greater than 2, the kth value of known is computed recursively. In all cases, the memoize() function returns the k value of the known array. The outer fib_memo() function then just returns the result of memoize(n).

Perhaps by now, your computer has finished running fibonacci(60) and you can try out the alternative implementation:

fib_memo(60)
1548008755920

Notice how much faster this new function is! Even the 200th Fibonacci number can be computed in a fraction of a second:

fib_memo(200)
-1123705814761610347

Unfortunately, we’ve ran into a different problem now: integer overflow. The result of the computations has become so large that it exceeded the range of 64-bit integers. To fix this problem, we can work with BigIntegers instead:

function fib_memo(n)
  known = zeros(BigInt, n)
  function memoize(k)
    if known[k] != 0
      # do nothing
    elseif k == 1 || k == 2
      known[k] = 1
    else
      known[k] = memoize(k-1) + memoize(k-2)
    end
    return known[k]
  end
  return memoize(n)
end
fib_memo (generic function with 1 method)
fib_memo(200)
280571172992510140037611932413038677189525

Nice!

Third alternative: Using Binet’s formula

The third alternative is more of a mathematical solution rather than a programming solution. According to Binet’s formula, the nth Fibonacci number can be computed as where , the Golden Ratio, and , its conjugate. In Julia:

function fib_binet(n)
  φ = (1 + sqrt(5))/2
  ψ = (1 - sqrt(5))/2
  fib_n = 1/sqrt(5) * (φ^n - ψ^n)
  return BigInt(round(fib_n))
end
fib_binet (generic function with 1 method)

Note that you can use mathematical symbols like and in Julia. This function runs very fast, too:

fib_binet(60)
1548008755920
fib_binet(200)
280571172992512015699912586503521287798784

Notice, however, that the result for the 200th Fibonacci number differs by 27 orders of magnitude from the one obtained using fib_memo():

fib_binet(200) - fib_memo(200)
1875662300654090482610609259

By using Binet’s formula, we’ve left the fairly neat world of integer arithmetic and entered the realm of floating point arithmetic that is rife with approximation errors. While we’re at it, we might as well compute and plot the size of these approximation errors. In the snippet below, I first use list comprehension in order to compute the first 200 Fibonacci numbers using both fib_memo() and fib_binet(). Note that I added a dot (.) to both function names. This is Julia notation for running vectorised computations. Further note that I end all lines with a semi-colon so that the results don’t get printed to the prompt. Then, I compute the absolute values of the differences between the numbers obtained by both computation methods. Note again the use of a dot in both abs.() and .- that is required to have both of these functions work on vectors. Finally, I convert these absolute differences to differences relative to the correct answers;

fib_integer = fib_memo.(1:200);
fib_math    = fib_binet.(1:200);
abs_diff = abs.(fib_math .- fib_integer);
rel_diff = abs_diff ./ fib_integer;

To wrap off this blog post, let’s now plot these absolute and relative differences using the Plots.jl package. While Figure 1 shows that the absolute error becomes huge, Figure 2 shows that these discrepancies only amount to a negligble fraction of the correct answers.

using Plots
plot(1:200, abs_diff, seriestype=:scatter,
     xlabel = "n",
     ylabel = "absolute difference",
     label = "")

Figure 1. Absolute difference between the Fibonacci numbers obtained using fib_binet() and those obtained using fib_memo().

plot(1:200, rel_diff, seriestype=:scatter, 
     xlabel = "n",
     ylabel = "relative difference",
     label = "")

Figure 2. Relative difference between the Fibonacci numbers obtained using fib_binet() and those obtained using fib_memo().

Handling summary statistics of empty collections

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2022/12/17/empty.html

Introduction

When designing solutions for data analysis one is often faced with a tough
choice between doing the correct thing and the convenient thing.

One particular case of such a situation is computing summary statistics of
empty collections. The reason is that often such statistics are not properly
defined for empty data (so Julia throws an error), but data scientist instead
would want to get some flag value instead. Today I want to discuss typical
cases of such situations and possible solutions.

The post was written under Julia 1.8.2 and DataFrames.jl 1.4.4.

Sum and product

In this case the situation is least problematic. Typically you will get what
you expect (i.e. respectively zero or one of the domain of values you
aggregate):

julia> sum(Int[])
0

julia> prod(Float64[])
1.0

julia> sum(skipmissing(Union{Int, Missing}[missing, missing]))
0

julia> prod(skipmissing(Union{Float64, Missing}[missing, missing]))
1.0

The only case that is problematic is when you want to work with an empty
container of a too-wide element type:

julia> sum([])
ERROR: MethodError: no method matching zero(::Type{Any})

A standard solution, since both sum and prod are reductions is to provide an
initialization value manually in this case:

julia> sum([], init=0)
0

julia> prod([], init=1.0)
1.0

Minimum and maximum

When computing minimum and maximum by default you get an error with empty
collections:

julia> minimum(Int[])
ERROR: MethodError: reducing over an empty collection is not allowed;
consider supplying `init` to the reducer

As you can see, we get and error are prompted to pass the init value for the
reduction, so the situation is less convenient.

Indeed, if such init value can be reasonably passed this is a good solution:

julia> minimum(Float64[], init=Inf)
Inf

julia> maximum(Int[], init=typemin(Int))
-9223372036854775808

or e.g. if we know we are working with values that must be in some range, we can
provide this range. A common case is with probabilities:

julia> minimum([], init=1.0)
1.0

julia> maximum([], init=0.0)
0.0

Sometimes, however, we might want to have a special signal value. In this case
you have two options. One is to check if the collection is empty, the other is
to catch exception:

julia> x = Int[]
Int64[]

julia> isempty(x) ? missing : minimum(x)
missing

julia> try
           minimum(x)
       catch e
           isa(e, MethodError) ? missing : rethrow(e)
       end
missing

You could wrap both solutions with a function for convenience, if you use them
often in your code. Their downside is that they add a bit of computational
overhead. The isempty check in some cases is not a O(1) operation. The most
common case is skipmissing. The trycatch approach introduces the cost of
handling of the exception.

Extrema

In case of the extrema function the situation is analogous to minimum and
maximum. The only difference is that you pass two values to init if you
want to use this method. Here is an example assuming we are processing data
that are probabilities:

julia> extrema(Float64[], init=(1.0, 0.0))
(1.0, 0.0)

Note that in this case minimum is greater than the maximum, so we can
immediately see that the passed collection was empty.

Mean, variance, and standard deviation

When computing mean, var, or std, we get NaN when working with an empty
collection:

julia> using Statistics

julia> mean(Int[])
NaN

julia> var(Float64[])
NaN

julia> std(Float64[])
NaN

This is expected, as we are performing division by zero in their computation.
Also, similarly to sum and prod, when the collection has a too-wide element
type we get an error:

julia> mean([])
ERROR: MethodError: no method matching zero(::Type{Any})

If we do not like this default behavior and want to handle an empty collection
in a special case checking if the container is empty is a standard solution:

julia> x = Float64[]
Float64[]

julia> isempty(x) ? missing : var(x)
missing

Median and quantile

Computing quantiles, and median in particular, is the least convenient case
as for them we always get an error and cannot use init value (as they are not
reductions):

julia> median(Int[])
ERROR: ArgumentError: median of an empty array is undefined, Int64[]

julia> quantile(Float64[], 0.1)
ERROR: ArgumentError: empty data vector

Here, currently, the only solution is to either check if the collection is empty
or catch the exception:

julia> x = Float64[]
Float64[]

julia> isempty(x) ? missing : median(x)
missing

julia> try
           quantile(x, 0.1)
       catch e
           isa(e, ArgumentError) ? missing : rethrow(e)
       end
missing

Conclusions

My post today was meant to be a quick reference for Julia users who sometimes
hit these issues when working with their data. My experience is that the most
common scenario of this kind is connected with missing data. Here is a typical
problematic case:

julia> using DataFrames

julia> using Random

julia> Random.seed!(1234);

julia> df = DataFrame(id=rand(1:10^6, 10^6),
                      value=rand([1:10; missing], 10^6))
1000000×2 DataFrame
     Row │ id      value
         │ Int64   Int64?
─────────┼─────────────────
       1 │ 325977        4
       2 │ 549052        9
       3 │ 218587        9
       4 │ 894246        8
       5 │ 353112        1
       6 │ 394256       10
       7 │ 953125  missing
       8 │ 795547        5
       9 │ 494250        1
    ⋮    │   ⋮        ⋮
  999993 │ 967428        9
  999994 │ 557085        1
  999995 │ 353965        5
  999996 │ 590548       10
  999997 │ 657727        2
  999998 │ 928733        3
  999999 │ 884126  missing
 1000000 │ 587503        2
        999983 rows omitted

I on purpose generated the data in a way that has quite a few missing values:

julia> combine(groupby(df, :value), proprow)
11×2 DataFrame
 Row │ value    proprow
     │ Int64?   Float64
─────┼───────────────────
   1 │       1  0.091109
   2 │       2  0.091387
   3 │       3  0.091394
   4 │       4  0.090954
   5 │       5  0.090504
   6 │       6  0.091412
   7 │       7  0.090809
   8 │       8  0.090844
   9 │       9  0.090254
  10 │      10  0.090795
  11 │ missing  0.090538

Now notice that some groups will only have missing values (in the output below
group with :id equal to 12 in row 7 is such a case):

julia> combine(groupby(df, :id),
               :value => (x -> mean(ismissing, x)) => :propmissing)
632166×2 DataFrame
    Row │ id       propmissing
        │ Int64    Float64
────────┼──────────────────────
      1 │       2     0.0
      2 │       3     0.25
      3 │       4     0.0
      4 │       6     0.0
      5 │       8     0.0
      6 │       9     0.333333
      7 │      12     1.0
      8 │      15     0.5
      9 │      16     0.0
   ⋮    │    ⋮          ⋮
 632159 │  999990     0.0
 632160 │  999991     0.0
 632161 │  999992     0.0
 632162 │  999993     0.0
 632163 │  999994     0.0
 632164 │  999996     0.0
 632165 │  999997     0.0
 632166 │ 1000000     0.0
            632149 rows omitted

If we now try to compute e.g. median value per group while skipping missing
we fail:

julia> combine(groupby(df, :id), :value => median∘skipmissing)
ERROR: ArgumentError: median of an empty array is undefined, Int64[]

The solution, as we discussed in this post is to handle the case of an empty
collection in a special way. I typically prefer the isempty check.

So first define a helper function:

julia> withempty(f, default) = x -> isempty(x) ? default : f(x)
withempty (generic function with 1 method)

and now we can write:

julia> combine(groupby(df, :id),
               :value => withempty(median, missing)∘skipmissing)
632166×2 DataFrame
    Row │ id       value_function_skipmissing
        │ Int64    Union{Missing, Float64}
────────┼─────────────────────────────────────
      1 │       2                         3.5
      2 │       3                         5.0
      3 │       4                         8.0
      4 │       6                         4.0
      5 │       8                         1.0
      6 │       9                         9.0
      7 │      12                   missing
      8 │      15                         3.0
      9 │      16                         7.0
   ⋮    │    ⋮                 ⋮
 632159 │  999990                         4.0
 632160 │  999991                         3.0
 632161 │  999992                         8.0
 632162 │  999993                         2.0
 632163 │  999994                         7.0
 632164 │  999996                         4.0
 632165 │  999997                         7.0
 632166 │ 1000000                         7.0
                           632149 rows omitted

As you can see we get missing in row 7 for group with :id value equal to
12 as expected.

If you have some thoughts about pros and cons of the approaches I discussed
today please check out this issue and comment there. Thank you!

? WIP ? From Julia to BQN

By: PathToPerformance

Re-posted from: https://miguelraz.github.io/blog/fromjuliatobqn/index.html

Here's the rough sketch of this blogpost:

  1. I will give a brief intro to BQN and talk about its pros and cos

  2. I will show a few Julia vs BQN code problems side by syde

  3. I will argue that there's areas for Julians to draw inspiration from BQN

  4. I will give a few resources at the end for you to dive deeper.

As usual, if you want to support my writings and open source work, please consider sponsoring me on GitHub. I'm reaaaaally close to 50 monthly sponsors, and it makes a huuuuge difference in how much time/worries/resources I have for working on stuff like this.

Alright, on with the blogpost.


Why is BQN is cool

  1. It has fast multidimensional arrays

  2. They love unicode

  3. It has a REPL!

  4. It's super at code golfing ?

  5. It's self hosted

  6. They use JuliaMono! ?

  7. They're building a JIT!

Name: funny bacon puns. Also "APL" .+ 1 == "BQM", but people noticed too late before the "bacon" puns began. BQN vs APL:

Getting started

  • BQN keyoard

  • Tutorial

  • Note: If you are going to try to use your terminal with the CQBN REPL (the fastest implementation), note that you will want to do rlwrap -r BQN to fire it up.

  • The nvim-bqn interface is the best local option I've found.

  • Remember to change your terminal font to JuliaMono, otherwiseyour code will be even more unreadable (at first!).

Range:

  • 15 reshape range 10 # cycles!

  • transpose 3_3

Scripting

online REPL or download BQN repo and open with browser BQN/docs/try.html from their github repo.

Everything in green is a function Everything in yellow is a 1 modifier Everything in purple/pink is a 2 modifier

Defining Hi function

REPL Duel

  • Problems: palindromes, count different words, Remove HTML Tags

Tutorial

Julia vs BQN problems:

Here's a few "classic" problems in both Julia and BQN

  1. Find the Hamming/edit distance between 2 strings:

julia> dist(s1, s2) = count(((x,y),) -> x != y, zip(s1, s2))
dist (generic function with 1 method)julia> dist("ACCAGGG", "ACTATGG")
2julia> dist(s1, s2) = sum(map(!=, s1, s2)) # kudos to Michael Abbot for the broadcasting tip
dist (generic function with 1 method)julia> dist(s1, s2) = mapreduce(!=, +, s1, s2) # kudos to J. Ling for this one

And in BQN:

s1 ← "XXXXGGG"
s2 ← "ACTAGGG"
Sol ← +´≠
s1 Sol s2 # 4

This is a neat 3 char solution that Asher will no doubt be very proud of.

  1. Increasing Array

You should take 3 minutes to go read the problem statement.

I like that after seeing the problem (you should go and click the link), I didn't think about a C++ but a BQN solution. Here's my attempt:

a ← 3‿2‿5‿1‿7
+´a-˜⌈`a
Sol ← {+´?-˜⌈`?}
Sol ← +´∘(⌈`-⊢) # Asher's solution
Sol a

which in Julia I would write like

julia> x = [3 2 5 1 7];
julia> sol(x) = accumulate(max, x) - x |> sum;
julia> sol(x);

Which took be a bit because scanl is called accumulate in Julia. Not too shabby. (Extra kudos if you can get a non-allocating version working)

  1. Maximum parenthesis depth

  1. Remove HTML Tags

Problem spec:

Given the string "<div>Hello <b>CppNorth!</b></div>", remove the HTML tags (underne

The following snippets are thanks to dzaima on the BQN Discord:

)t:10000000 {?/˜¬(≠`∨⊢)(?='>')∨?='<'} "<div>Hello <b>CppNorth!</b></div>"
179.01ns

On a make o3n-singeli build (built for SIMD speedups):

)t:10000000 ¬∘(≠`∨⊢)∘(=⟜'>'∨=⟜'<')⊸/ "<div>Hello <b>CppNorth!</b></div>"
165.19ns

On a 100x longer input it rips at about 0.24ns/character (without a block):

≠a←∾100⥊<"<div>Hello <b>CppNorth!</b></div>"
3300
   )t:1000000 ¬∘(≠`∨⊢)∘(=⟜'>'∨=⟜'<')⊸/ a
782.1ns
  1. "LURD" robot

What Julians can learn from BQN

  1. Broadcasting semantics, Each ([¨](https://mlochbaum.github.io/BQN/doc/map.html)), and Taking Arrays Seriously™.

  2. Data parallelism techniques.

  3. Bit vector optimizations.

  4. Flattening data recursive structures for performance.

  5. Array-ify all the things.

  6. Algorithmic thinking

Notes and words of caution

The syntax and symbols of BQN is a big "love it or hate it" part of the deal. I won't try to convince you to like it, but I have found it much easier to take a silly, mnemonic based approach to what each symbol does: * ≡"abc" will give you the "depth" of something, because it looks like a little ladder that you descend * is taking the "highest" value (and is thus the max), is taking the "lowest" * will be dragging all the stuff to the right of the tick towards the +, so it's a reduction * +` If you use the tick the other way, you will be dragging the +towards the stuff on the right, so it's a scan, from left to right. These are just the examples that come to mind, but I've found (completely subjectively) for BQN's symbology to be a bit friendlier/more consistent than APL's. Be mindful that the character to denote lists is not the same as that of arrays. The docs say that newbies usually start out with these for easy manipulation examples and gradually move on to explicit array notation with the fancy brackets:

3 1⊸+⊸× 5
20
    3‿1⊸+⊸× 5
⟨ 40 30 ⟩

As stated in the page, general array notation is a thorny problem in APL, and it took Julia about 10 years to finally nail down the tools and syntax to land it in Base..

  • Reading BQN/APL is likely where the learning difficulty curve hits hardest when starting out – this docs page was very useful to grok that ˜ is a 1-modifier (as all symbols that "float higher up") and (like all symbols with unbroken circles) are 2-modifiers. Concretely, having a context free grammar removes ambiguity

  • When I'm struggling to find out how to write my solutions to problems like the Increasing Array, this is my workflow:

    1. Start with thinking "I should propagate the max function" like ⌈`a. I'll press Shift+Enter on the online BQN REPL and build up the solution

    2. "I should now try to subtract it from the original array" and write a-⌈`a

    3. "Ah, right – I need to add a flip thingy" and evaluate a-˜⌈`a

    4. "Sweet, just have to reduce with a sum now" +´a-˜⌈`a

    5. "Ok, to make it tacit I had to use those ⊣ ⊢ thingies." (I go and review the modifiers diagrams docs)

    6. (After much plugging away at the REPL) … "Dammit, I forgot I can use the Explain button!"

    7. (Fiddle around some more) "OK, I think I got it" and write

Sol ← +´∘(⌈`-⊢)
  • The next big step up in BQN skills is identifying function trains, which took me a bit of spelunking about in the manual before finding it. For example, going from the first line to the second in this snippet ??

"whatsin" {(?∊?)/?} "intersect"
"whatsin" (∊/⊣) "intersect"

proficiently will really up your game in code-golfing powers, should you be interested in that. This APL Wiki page and the Trainspotting links and videos at the end are also useful resources.

Useful idioms

At some point, any seasoned array programmer develops a good collection of known code snippets. Here's a few to save you some headaches:

  • Reading lines from a file can be done via:

lines ← •FLines "day01-a-test.txt"
nums ← •BQN¨ lines

The •BQN¨ isn't optimal, but it's good enough to get going with AdventOfCode problems.

  • Benchmarking! (I am a Julia REPL stan after all.) If you're in the REPL, you can use )t:X

)t:1000 3+3
14.666 ns

This will run your code X times and tell you how long it took on average. Elsewhere, •_timed can be bound to the left argument How does one get a finer performance report though? Dzaima kindly posted:

perf report of that, with some comments – most of the time is in replicate (which is implemented with pdep & pext, i.e. SWAR; 8 bytes per iteration), followed by the ≠-scan, which is more SWAR, but it processes 64 items per iteration so it's fast

To get that, you do

sudo perf record rlwrap ./BQN

type in your script and then

sudo perf report
  • TODO Generating random arrays:

If we need to repeat a string 100 times, we can use:

100⥊"<div>Hello <b>CppNorth!</b></div>"

That will give you the above string repeated 100 times.

Interesting resources

For those that truly want to stare into the abyss and have it stare right back at them, there's some ~university level courses that are written in APL/BQN/J.

What's next?

…Well, I think I want to learn a bit from the people that took parallelism and performance seriously in ML aka, "What if Haskell wasn't slow and they wanted to dunk on MATLAB"?

Don't forget…

miguelito:

If you want to see more blogposts, sponsor me on GitHub