Random Forest Regression from scratch, applied to a Kaggle competition

I normally write some code, and days, weeks, or months later, when everything is done, I write a blog post about it. I think the post is more structured that way, but it probably has fewer details, and the errors I made along the way aren’t visible anymore (unless one of you comments afterwards 😉). Thanks for that. It really helps improve my code and this blog.
In the past couple of weeks I wanted to understand random forests better, not just how to use them, so I’ve built my own package.
For me it’s not satisfying enough to use a machine learning package and tune the parameters. I really like to build stuff from scratch.
That’s basically the reason for this blog. I see too many posts like “How to use random forest?” or “How to achieve an awesome accuracy for MNIST?”, but as mentioned in my post about the latter, there aren’t many that dive deeper and actually use it for real (well, and blog about it).

I learn a lot doing this, also when I blog about it, and especially through your comments, but it also takes quite some time to write these posts. If you’ve read several of my posts and learned a good amount, please consider a donation via Patreon to keep it going.
Thank you very much!

Of course I’ll keep writing for everyone else as well, and you can’t pay everyone on the internet just because you read one of their posts. It’s more about whether you’ve enjoyed this blog for a while now and found something here that you didn’t find elsewhere.

Back on track. I’m using random forest for my current Kaggle challenge about predicting earthquakes. I might publish an extra article about that one when it’s done, but for now I want to use my random forest package on a different challenge that is more for fun (I mean, I do the other one for fun as well, but Kaggle/LANL are paying money to the winners), whereas this one is purely for training.
It’s about predicting house prices based on some features.

This post covers several aspects:

  • Creating a Julia package (basics)
  • Explaining random forest
  • Simple features first
  • What can be improved?

I only have my random forest code at the moment, so my score in the competition might (and probably will) turn out quite bad. Anyway, the general goal is to learn random forest (by coding it) and somehow apply it.

Creating a Julia package

First let’s create our RandomForestRegression package. Start Julia in your favorite projects folder, press ] to enter Pkg mode, and run:

(v1.1) pkg> generate RandomForestRegression

That creates a folder RandomForestRegression with:

src/
    - RandomForestRegression.jl
Project.toml

and the RandomForestRegression.jl looks like this:

module RandomForestRegression

greet() = print("Hello World!")

end # module

We will add some dependencies later, but for now we are done. Normally you should also add a test folder etc…
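As a rough idea of what such a test folder could contain, here is a minimal sketch of a hypothetical test/runtests.jl. The file name follows the usual Julia convention, but the test itself is only my assumption of a sensible first check for the generated greet function. So that the snippet runs on its own, it defines a stand-in copy of the module; inside the real package you would write `using RandomForestRegression` instead.

```julia
# Hypothetical test/runtests.jl (a sketch, not part of the generated package).
using Test

# Stand-in copy of the generated module so this snippet is self-contained.
# In the real package this block would be replaced by:
#   using RandomForestRegression
module RandomForestRegression
greet() = print("Hello World!")
end # module

@testset "RandomForestRegression" begin
    # The module should define the generated function.
    @test isdefined(RandomForestRegression, :greet)
    # `print` returns `nothing`, so `greet()` returns `nothing` as well.
    @test RandomForestRegression.greet() === nothing
end
```

Test is part of Julia's standard library, but it still has to be declared as a test dependency of the package before the tests can be run through Pkg.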
You probably want to have a look at the official documentation: Creating