
Can Python with Julia be faster than low-level code?

By: Abel Soares Siqueira

Re-posted from: https://blog.esciencecenter.nl/can-python-with-julia-be-faster-than-low-level-code-cd71a72fbcf4?source=rss----ab3660314556--julia

Part 3 of the series on achieving high performance with high-level code

By Abel Soares Siqueira and Faruk Diblen

Here comes a new challenger: Julia. Photo by Joran Quinten on Unsplash (https://unsplash.com/photos/MR9xsNWVKvo), modified by us.

Introduction

In our last post, we improved the speed of our Python code using a few lines of Julia. We achieved a very interesting result without optimizing prematurely or resorting to low-level code. But what if we want more? In this blog post, we will investigate exactly that.

It is quite common that a developer prototypes with a high-level language, but when the need for speed arises, they eventually move to a low-level language. This is called the “two-language problem”, and Julia was created with the objective of solving this issue (read more on their blog post from 2012). Unfortunately, achieving the desired speedup is not always easy. It depends highly on the problem, and on how much previous work was done trying to tackle it. Today we find out how much more we can speed up our Julia code, and how much effort it took.

Previously

  • Patrick Bos presented the problem of reading irregular data, or non-tabular data, in this blog post.
  • He also presented his original solution to the problem using just Python with pandas, which we are calling Pure Python in our benchmarks.
  • Finally, he presented a faster strategy, which consists of calling C++ from Python; we denote this strategy C++.
  • In the previous blog post of this series, we created two strategies with Python calling Julia code. Our first strategy, Basic Julia, wasn’t that great, but our second strategy, Prealloc Julia, was noticeably faster than Pure Python, though not as fast as C++.

Remember that we have set up a GitHub repository with all of our code, as well as a Docker image for reproducibility.

For the C fans

Our first approach to speeding things up is to simulate what C++ is doing. We believe that the C++ version is faster because it reads the data directly as the desired data type. In Julia, we had to read the data as String and then convert it to Int. We don’t know how to avoid that conversion in pure Julia, but we do know how to do it in C.

Using Julia’s built-in ccall function, we can directly call the C functions to open and close a file, namely fopen and fclose, and call fscanf to read and parse the file at the same time. Our updated Julia code which uses these C functions is below.
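The original post embedded this code as a gist, which is not reproduced here, so the sketch below is a hypothetical reconstruction. It assumes the row format described in part 2 of the series (an integer index, a ‘#’, then comma-separated values) and that the file ends with a newline; the function name and the two returned vectors are our choices.

function read_arrays_ccall(filename)
    # Open the file with C's fopen; fp is a raw FILE* pointer
    fp = @ccall fopen(filename::Cstring, "r"::Cstring)::Ptr{Cvoid}
    fp == C_NULL && error("could not open $filename")
    indices, values = Int[], Int[]
    num = Ref{Cint}(0)   # the integer that fscanf just parsed
    sep = Ref{Cchar}(0)  # the character right after it: '#', ',' or '\n'
    row = 0
    # "%d%c" parses one integer plus the single separator character after it
    while (@ccall fscanf(fp::Ptr{Cvoid}, "%d%c"::Cstring;
                         num::Ref{Cint}, sep::Ref{Cchar})::Cint) == 2
        if sep[] == Cchar('#')   # start of a row: num is the row index
            row = Int(num[])
        else                     # ',' or '\n': num is an element of that row
            push!(indices, row)
            push!(values, Int(num[]))
        end
    end
    @ccall fclose(fp::Ptr{Cvoid})::Cint
    return indices, values
end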

Let’s see if that helped increase the speed of our code. We include in our benchmark the previous strategies as well. This new strategy will be called Julia + C parsing.

Run time of Pure Python, C++, Basic Julia, Prealloc Julia, and Julia + C parsing strategies. (a) Time per element in the log-log scale. (b) Time per element, relative to the time of the C++ version in the log-log scale.

Our code is much more C-like now, so understanding it requires more knowledge of how C works. However, it is way faster than our previous implementation. For files with more than 1 million elements, the Julia + C parsing strategy has a 10.38× speedup over the Pure Python strategy, on average. This is almost double the speedup we got with Prealloc Julia, which is an amazing result. For comparison, C++ has a 16.37× speedup, on average.

No C for me, thanks

Our C approach was very fast, and we would like to replicate it with pure Julia. Unfortunately, we could not find anything in Julia to perform the same type of reading as fscanf. However, after some investigation, we found an alternative.

Using the read function of Julia, we can parse the file as a stream of bytes. This way we can manually walk through the file and parse the integers. This is the code:
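As with the previous strategy, the original code was embedded as a gist; the following is a hypothetical reconstruction of the idea. It builds each integer digit by digit as the bytes stream past, and assumes every number is terminated by ‘#’, ‘,’, or a newline (including the last one):

function read_arrays_bytes(filename)
    bytes = read(filename)   # the whole file as a Vector{UInt8}
    indices, values = Int[], Int[]
    row, num = 0, 0
    for b in bytes
        if UInt8('0') <= b <= UInt8('9')
            num = 10 * num + (b - UInt8('0'))  # accumulate the next digit
        elseif b == UInt8('#')                 # end of a row index
            row = num
            num = 0
        else                                   # ',' or '\n': end of an element
            push!(indices, row)
            push!(values, num)
            num = 0
        end
    end
    return indices, values
end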

We denote this strategy Optimized Julia. This version of the code manually keeps track of the sequence of bytes related to integers, so it is much less readable. However, this version achieves an impressive speedup, surpassing the C++ version:

Run time of Pure Python, C++, Basic Julia, Prealloc Julia, Julia + C parsing, and Optimized Julia strategies. (a) Time per element in the log-log scale. (b) Time per element, relative to the time of the C++ version in the log-log scale.

It was not easy to get to this point, and the code itself is convoluted, but we managed to achieve a large speedup in relation to Python using only Julia, another high-level language. The average speedup for files with over 1 million elements is 40.25×, more than twice what we got with the C++ strategy. We remark again that the Pure Python and C++ strategies have not been optimized; readers can let us know in the comments if they find a better strategy.

So yes, we can achieve a speedup equivalent to a low-level language using Julia.

Conclusions: We won, but at what cost?

One thing to keep in mind is that the higher speedups required putting in correspondingly more effort. This effort comes in diverse ways:

  • To write and use the C++ strategy, we had to know sufficient C++, as well as understand the libraries used. If you don’t have enough C++ knowledge, the effort is higher, since what needs to be done is quite different from what Python developers are used to. If you already know C++, then the effort is that of searching for the right keywords and using the right libraries.
  • To write and use any of the Julia strategies, you need to put some effort into having the correct environment. Using Julia from Python is still an experimental feature, so your experience may vary.
  • To write the Basic Julia and Prealloc Julia strategies, not much previous knowledge is required. So, we can classify this as a small effort.
  • To write the Julia + C and Optimized Julia strategies, we need more specialized knowledge. This is again a high-effort task if you do not already know the language.

Here’s our conclusion. To achieve a high speedup, we need specialized knowledge, which requires a big effort. However, if you are not familiar with either C++ or Julia, then acquiring just some knowledge of Julia already gets you a decent speedup for a small effort. You can prototype quickly in Julia, get a reasonable result, and keep improving that version over time to reach C-like speeds.

Speedup gain relative to the effort of moving the code to a different language.

We hope you have enjoyed the series and that it helps you with your code in any way. Let us know what you think and what you missed. Follow us for more research software content.

Many thanks to our proofreaders and reviewers, Elena Ranguelova, Jason Maassen, Jurrian Spaaks, Patrick Bos, Rob van Nieuwpoort, and Stefan Verhoeven.



Speed up your Python code using Julia

By: Abel Soares Siqueira

Re-posted from: https://blog.esciencecenter.nl/speed-up-your-python-code-using-julia-f97a6c155630?source=rss----ab3660314556--julia

Part two of the series on achieving high performance with high-level code

By Abel Soares Siqueira and Faruk Diblen

Python holds the steering wheel, but we can make it faster with other languages. Photo by Spencer Davis on Unsplash (https://unsplash.com/photos/QUfxuCqdpH0), modified by us.

In part 1 of this series, we set up an environment so that we can run Julia code in Python. You can also check our Docker image with the complete environment if you want to follow along. We also have a GitHub repository with the complete code if you want to see the result.

Background

In the blog post 50 times faster data loading for Pandas: no problem, our colleague and Senior Research Software Engineer Patrick Bos discussed improving the speed of reading non-tabular data into a DataFrame in Python. Since the data is not tabular, one must read, split, and stack it. All of that can be done with pandas in a few lines of code. However, since the data files are large, performance issues with Python and pandas become visible and prohibitive. So, instead of doing all those operations with pandas, Patrick shows a nice way of doing it with C++ and Python bindings. Well done, Patrick!

In this blog post, we will look into improving the Python code in a similar fashion. However, instead of moving to C++, a low-level language considerably harder to learn than Python, we will move the heavy lifting to Julia and compare the results.

A very short summary of Patrick’s blog post

Before anything, we recommend checking Patrick’s blog post to read more into the problem, the data, and the approach of using Python with C++. The short version is that we have a file where each row is an integer, followed by the character #, followed by an unknown number of comma-separated values, which we call elements. Each row can have a different number of elements, and that’s why we say the data is non-tabular, or irregular. An example file is below:
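The example file from the original post was embedded as a gist; the stand-in below illustrates the same structure with made-up numbers. The first row has index 1 and four elements, the second has index 2 and two elements, and so on:

1#10,40,60,20
2#500,300
3#1,2,3,4,5,6,7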

From now on, we refer to the initial approach of solving the problem with Python and pandas as the Pure Python strategy, and we will call the strategy of solving the problem with Python and C++ as the C++ strategy.

We will compare the strategies using a dataset we generated. The dataset has 180 files, generated randomly, varying the number of rows, the maximum number of elements per row, and the distribution of the number of elements per row.

Adding some Julia spice to Python

The version below is our first approach to solving the problem using Julia. There are shorter alternatives, but this one is sufficiently descriptive. We start with a very basic approach so it is easier to digest.
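The gist with this function is not embedded here, so the following is a hypothetical reconstruction of the Basic Julia strategy; the function name read_array and the two returned vectors are our choices. It reads the file line by line, splits on ‘#’ and ‘,’, and grows the output arrays as it goes:

function read_array(filename)
    indices, values = Int[], Int[]
    for line in readlines(filename)
        idx_str, values_str = split(line, '#')
        idx = parse(Int, idx_str)
        for v in split(values_str, ',')
            push!(indices, idx)   # repeat the row index for each element
            push!(values, parse(Int, v))
        end
    end
    return indices, values
end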

You can test this function in Julia directly to see that it works independently of Python. After doing that, we want to call it from Python. As you should know by now, that is fairly easy to do, especially if you use the Docker image we created for part 1.

The next code snippet includes the file that we created above into Julia’s Main namespace and defines two functions in Python. The first, load_external, is used to read the arrays that were parsed by either C++ or Julia. The second Python function, read_arrays_julia_basic, is just a wrapper around the Julia function defined in the included file.
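Since that snippet was also embedded as a gist, here is a hypothetical reconstruction. It assumes PyJulia is configured as in part 1, that the Julia code above lives in a file called read_arrays.jl, and that load_external simply assembles the two parsed arrays into a DataFrame (the file name and column names are our guesses):

import pandas as pd
from julia import Main

# Make the Julia function available in Julia's Main namespace
Main.include("read_arrays.jl")

def load_external(indices, values):
    # Assemble the two parsed arrays into the final DataFrame
    return pd.DataFrame({"index": indices, "value": values})

def read_arrays_julia_basic(filename):
    # Thin wrapper around the Julia function defined in the included file
    return Main.read_array(filename)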

Now we will benchmark this strategy, which we will call the Basic Julia strategy, against the Pure Python and C++ strategies. We are using Python 3.10.1 and Julia 1.6.5. We run each strategy three times and take the average time. Our hardware is a Dell Precision 5530 notebook, with 16 GB of RAM and an i7-8850H CPU, and we are using a Docker image based on Ubuntu Linux 21.10 to run the tests (from inside another Linux machine). You can reproduce the results by pulling the abelsiqueira/faster-python-with-julia-blogpost Docker image, downloading the dataset, and running the following command in your terminal:

$ docker run --rm --volume "$PWD/dataset:/app/dataset" --volume "$PWD/out:/app/out" abelsiqueira/faster-python-with-julia-post2

See the figure below for the results.

Run time of Pure Python, C++, and Basic Julia strategies. (a) Time per element in the log-log scale. (b) Time per element, relative to the time of the C++ strategy in the log-log scale.

A few interesting things happen in the image. First, both Pure Python and Basic Julia have a lot of variability with respect to the number of elements. We believe this happens because the code’s performance depends on the number of rows as well as on the distribution of elements per row. The code allocates a new array for each row, so even if the number of elements is small, a large number of rows makes the execution slow. Remember that our dataset has a lot of variability in the number of rows, the maximum number of elements per row, and the distribution of elements per row. This means that some files are close in the number of elements but may have vastly different structures. Second, Basic Julia and Pure Python have different efficiency profiles. Our Julia code must move all stored elements into a new array for each new row that it reads, meaning it allocates a new array for every row.

The code for Basic Julia is simple and does what is expected, but it does not pre-allocate the memory that will be used, and that really hurts its performance. In low-level languages, that would be one of the first things we would have to worry about. Indeed, if we look into the C++ code, we can see that it starts by figuring out the size of the output vectors and allocating them. We need to improve our Julia code at least a little bit.

Basic improvements for the Julia Code

The first version of our Julia code is inefficient in a few ways, as explained above. With that in mind, our first change is to compute the number of elements a priori and allocate our output vectors. Here is our improved Julia code:
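The improved code was also embedded as a gist; the sketch below is a hypothetical reconstruction based on the description that follows (a dictionary comprehension over the rows, plus the Parsers package), with the function name being our choice:

using Parsers

function read_array_prealloc(filename)
    # One pass to split the rows; the Dict mirrors the structure of the data:
    # row index => the elements of that row, still as strings
    rows = Dict(
        begin
            idx, elts = split(line, '#')
            parse(Int, idx) => split(elts, ',')
        end
        for line in readlines(filename)
    )
    # Knowing the total number of elements, allocate the outputs up front
    n = sum(length(e) for e in values(rows))
    indices = Vector{Int}(undef, n)
    vals = Vector{Int}(undef, n)
    k = 0
    for (idx, elements) in rows
        for e in elements
            k += 1
            indices[k] = idx
            vals[k] = Parsers.parse(Int, e)  # slightly faster than Base.parse
        end
    end
    return indices, vals
end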

Here, we use a dictionary comprehension, which closely mirrors the structure of the data. This allows us to count the number of elements and keep the values to be stored later. We also use the package Parsers, which provides a slightly faster parser for integers. Here is the updated figure comparing the three previous strategies with the new Prealloc Julia strategy that we just created:

Run time of the Pure Python, C++, Basic Julia, and Prealloc Julia strategies. (a) Time per element in the log-log scale. (b) Time per element, relative to the time of the C++ strategy in the log-log scale.

Now we have made a nice improvement. The results depend more consistently on the number of elements, like the C++ strategy. We can also see a stabilization of the trend that Prealloc Julia follows. It appears to be the same as C++, which is expected, since the performance should be linearly dependent on the number of elements. For files with more than 1 million elements, the Prealloc Julia strategy has a 5.83× speedup over the Pure Python strategy, on average, while C++ has a 16.37× speedup, on average.

Next steps

We have achieved an amazing result today. Using only high-level languages, we were able to achieve a significant speedup in relation to the Pure Python strategy. We remark that we have not optimized the Python or the C++ strategies, simply using what was already available from Patrick’s blog post. Let us know in the comments if you have optimized versions of these codes to share with the community.

In the next post, we will optimize our Julia code even further. It is said that Julia’s speed sometimes rivals low-level code. Can we achieve that for our code? Let us know what you think and stay tuned for more!

Many thanks to our proofreaders and reviewers, Elena Ranguelova, Jason Maassen, Jurrian Spaaks, Patrick Bos, Rob van Nieuwpoort, and Stefan Verhoeven.



Alien facehugger wasps, a pandemic, webcrawlers and julia

By: Ömür Özkir

Re-posted from: https://medium.com/oembot/alien-facehugger-wasps-a-pandemic-webcrawlers-and-julia-c1f136925f8?source=rss----386c2bd736a1--julialang

collect and analyze covid-19 numbers for Hamburg, Germany

TL;DR

  1. Build a webcrawler in julia.
  2. Use the data for a simple plot.

Motivation

The web is full of interesting bits and pieces of information. Maybe it’s the current weather, stock prices, or the wikipedia article about the wasp that goes all alien parasite facehugger on other insects, which you vaguely remember from one of those late night documentaries (already sorry you are reading this?).

If you are lucky, that data is available via an API, which usually makes it pretty easy to get to (not always tho; if API developers come up with byzantine authentication schemes, required headers, or other elegant/horrible designs, the fun is over).

A lot of the shiny nuggets are not available via a nice API, tho. Which means, we have to crawl webpages.

Pick something that you are really interested in / want to use for a project of yours, that’ll make it less of a chore and far more interesting!

Local = Relevant

For me, currently, that’s the Covid-19 pandemic. And more specifically, how it is developing close to me. In my case, that means the city of Hamburg in Germany.

Chances are, these specific numbers/case is not relevant to you. But that’s a good thing, you can use what you learned here and mine the website of your home city maybe (or whatever you are interested in).

Nothing helps your brain absorb new things better than generalizing those new skills and using them to solve related problems!

The city has an official website, hamburg.de, with a page for the covid-19 numbers.

The page with the numbers is in german, but don’t worry, that’s what our webcrawler will hopefully help us with — we can get to the numbers without having to understand all the surrounding text. I will try to help out and translate what is relevant, but that will only be a minor detail when we try to find the right text to extract from.

If you like, you can check out the notebook or even code along in binder.

First, let’s get some of the dependencies out of the way:

Aside from HTTP.jl to request the website, we will also use Gumbo to parse html and Cascadia to extract data from the html document via selectors.

using HTTP
using Gumbo, Cascadia
using Cascadia: matchFirst

We need to fetch the website with HTTP.jl and parse it, which is easily done with Gumbo.

url = "https://www.hamburg.de/corona-zahlen"
response = HTTP.get(url)
html = parsehtml(String(response.body))
# =>
HTML Document:
<!DOCTYPE >
HTMLElement{:HTML}:<HTML lang="de">
<head></head>
<body class="no-ads">
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta charset="utf-8"/>
<meta content="text/html" http-equiv="content-type"/>
<script type="text/javascript">window.JS_LANG='de'; </script>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
...

Alright, we can now start to parse data from the html document, by using query selectors.

You know, they might actually be called css selectors, I don’t know. How precise is the frontend terminology anyways, right?

Oh look, a pack of wild frontenders! Hmm, what are they doing, are they encircling us? They do look kinda angry, don’t they? I guess they just tried to vertically align some div the whole day or something.

Ok… I guess we should leave now.

Seems important

hamburg.de: Visual hierarchy, the topmost information is usually something important

We could start with the first information we see on the page, after all, there must hopefully be a reason that it is at the top of the page.

The three bullet points with the numbers mean confirmed cases, recovered, and new cases. Now the trick is to find the best selectors. There are a few plugins for the different browsers that help find the right selectors quickly.

But it is also pretty easy to do by hand. When right-click/inspecting an element on the page (this requires the developer tools) one can pretty quickly find a decently close selector.

If you want to test it out in the browser first, you can write something like this document.querySelectorAll(".c_chart.one .chart_legend li") in the browser console. Some browsers even highlight the element on the page when you hover over the array elements of the results.

Using the selectors in julia is pretty neat:

eachmatch(sel".c_chart.one .chart_legend li", html.root)
# => 
3-element Array{HTMLNode,1}:
HTMLElement{:li}:<li>
<span style="display:inline-block;width:.7em;height:.7em;margin-right:5px;background-color:#003063"></span>
Bestätigte Fälle 5485
</li>


HTMLElement{:li}:<li>
<span style="display:inline-block;width:.7em;height:.7em;margin-right:5px;background-color:#009933"></span>
Davon geheilt 5000
</li>


HTMLElement{:li}:<li>
<span style="display:inline-block;width:.7em;height:.7em;margin-right:5px;background-color:#005ca9"></span>
Neuinfektionen 25
</li>

Ok, we need to extract the numbers from the text of each html element. Using a simple regex seems like the easiest solution in this case. Check this out, it looks very similar to the previous selector matching:

match(r"\d+", "Neuinfektionen 25")
# =>
RegexMatch("25")

Nice, huh? Ok, but we only need the actual match.

match(r"\d+", "Neuinfektionen 25").match
# =>
"25"

And we need to cast it to a number:

parse(Int, match(r"\d+", "Neuinfektionen 25").match)
# =>
25

We want to do this for each element now, so we extract the text from the second node (the span element is the first, see the elements above).

Then we do all the previously done matching and casting and we got our numbers!

function parsenumbers(el)
    text = el[2].text
    parse(Int, match(r"\d+", text).match)
end

map(parsenumbers, eachmatch(sel".c_chart.one .chart_legend li", html.root))
# =>
3-element Array{Int64,1}:
5485
5000
25

Learning how to Date in german

We should also extract the date when those numbers were published. The selector for the date on the page is very easy this time: .chart_publication.

In the end, we want numbers that we can use to instantiate a Date object, something like Date(year, month, day).

We are starting out with this, however:

date = matchFirst(sel".chart_publication", html.root)[1].text
# =>
"Stand: Mittwoch, 5. August 2020"

Oh dear, it’s in german again. We need "5. August 2020" from this string.

parts = match(r"(\d+)\.\s*(\S+)\s*(\d{4})", date).captures
# =>
3-element Array{Union{Nothing, SubString{String}},1}:
"5"
"August"
"2020"

Better, but it’s still in german!

Ok, last bit of german lesson, promised, how about we collect all the month names in a tuple?

Then we can find its index in the tuple. That would be the perfect input for our Date constructor.

const MONTHS = ("januar", "februar", "märz", "april", "mai", "juni",
                "juli", "august", "september", "oktober", "november", "dezember")
findfirst(m -> m == lowercase(parts[2]), MONTHS) # => 8

using Dates
Date(parse(Int, parts[3]),
     findfirst(m -> m == lowercase(parts[2]), MONTHS),
     parse(Int, parts[1]))
# => 2020-08-05

More local = more relevant!

There are a few more interesting nuggets of information. I think the hospitalization metrics would be very interesting, especially to investigate the correlation between when cases are confirmed and the delayed hospitalizations.

But one thing that is especially interesting (and I don’t think such locally detailed information is available anywhere else) is the number of cases in the last 14 days, for each borough.

Speaking of local, this is probably the most local we can get.

List of boroughs, number of new infections aggregated for the last 14 days

By now, you probably start to see a pattern:

  1. find good selector
  2. extract content
  3. parse/collect details

rows = eachmatch(sel".table-article tr", html.root)[17:end]
df = Dict()
foreach(rows) do row
    name = matchFirst(sel"td:first-child", row)[1].text
    num = parse(Int, matchFirst(sel"td:last-child", row)[1].text)
    df[name] = num
end
df
# =>
Dict{Any,Any} with 7 entries:
  "Bergedorf"     => 17
  "Harburg"       => 28
  "Hamburg Nord"  => 26
  "Wandsbek"      => 63
  "Altona"        => 14
  "Eimsbüttel"    => 12
  "Hamburg Mitte" => 41

great, that’s it?

No! No, now the real fun begins. Do something with the data! You will probably already have some idea what you want to do with the data.

How about ending this with something visual?

Visualizations, even a simple plot, can help a lot with getting a feel for the structure of the data:

using Gadfly
Gadfly.set_default_plot_size(700px, 300px)

There are a lot of great plotting packages for julia; I personally really like Gadfly.jl for its beautiful plots.

plot(x=collect(keys(df)),
     y=collect(values(df)),
     Geom.bar,
     Guide.xlabel("Boroughs"),
     Guide.ylabel("New Infections"),
     Guide.title("New infections in the last 14 days"),
     Theme(bar_spacing=5mm))

Even such a simple plot already helps us understand the data better, right?

The end! Right?

Ha ha ha ha- nope. Webcrawlers are notoriously brittle, simply because the crawled websites tend to change over time. And with it, the selectors. It’s a good idea to test if everything works, once in a while, depending on how often you use your webcrawler.

Be prepared to maintain your webcrawler more often than other pieces of software.
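For example, a minimal smoke test for the crawler above could look like this (a sketch, assuming the same HTTP/Gumbo/Cascadia setup as before; the test names are our own). Run it once in a while, or in CI:

using Test

@testset "hamburg.de crawler still works" begin
    html = parsehtml(String(HTTP.get("https://www.hamburg.de/corona-zahlen").body))
    # The three headline numbers should still be there...
    @test length(eachmatch(sel".c_chart.one .chart_legend li", html.root)) == 3
    # ...and so should the publication date
    @test matchFirst(sel".chart_publication", html.root) !== nothing
end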

A few things to check out

Very close to the topic: I created a little package, Hamburg.jl, that has a few datasets about Hamburg, including all the covid-19 related numbers that we scraped a little earlier.

The official julia docs should get you up and running with your local julia dev setup.

One more crawler

Ok, one more thing, before I let you off to mine the web for all its information:

html = parsehtml(String(HTTP.get("https://en.wikipedia.org/wiki/Emerald_cockroach_wasp").body))
ptags = eachmatch(sel".mw-parser-output p", html.root)[8:9]
join(map(nodeText, ptags))
# =>
"Once the host is incapacitated, the wasp proceeds to chew off half of each of the roach's antennae, after which it carefully feeds from exuding hemolymph.[2][3] The wasp, which is too small to carry the roach, then leads the victim to the wasp's burrow, by pulling one of the roach's antennae in a manner similar to a leash. In the burrow, the wasp will lay one or two white eggs, about 2 mm long, between the roach's legs[3]. It then exits and proceeds to fill in the burrow entrance with any surrounding debris, more to keep other predators and competitors out than to keep the roach in.\nWith its escape reflex disabled, the stung roach simply rests in the burrow as the wasp's egg hatches after about 3 days. The hatched larva lives and feeds for 4–5 days on the roach, then chews its way into its abdomen and proceeds to live as an endoparasitoid[4]. Over a period of 8 days, the final-instar larva will consume the roach's internal organs, finally killing its host, and enters the pupal stage inside a cocoon in the roach's body.[4] Eventually, the fully grown wasp emerges from the roach's body to begin its adult life. Development is faster in the warm season.\n"

…the wasp proceeds to chew off half of each of the roach’s antennae, after which it carefully feeds from exuding…

…what…

…The hatched larva lives and feeds for 4–5 days on the roach, then chews its way into its abdomen…

…the…

…Over a period of 8 days, the final-instar larva will consume the roach’s internal organs, finally killing its host…

…hell mother nature, what the hell…

