
Index Funds with Mixed-Integer Programming


We will analyze daily price data for the stocks in the Dow Jones index and then try to build an accurate index fund using a small number of those stocks.

Similar material was already used in a presentation at PyData Berlin 2017. See the “Tour of popular packages” notebook. Back then, we worked with Julia 0.6 and used the packages DataFrames, Plots and JuMP. Now, we work with Julia 1.0 and use packages from the Queryverse, VegaLite and IndexedTables for data prep and visualization. Also, I added an alternative model.

Loading the Data

In [1]:
using Queryverse
using IndexedTables: ndsparse

The data consists of the daily closing prices from 2016 for the 30 stocks that make up the Dow Jones and is given in a CSV file:

In [2]:
;head -n4 ../files/dowjones2016.csv
Date,Symbol,Price
2016-01-04,AAPL,105.349997999999999
2016-01-04,AXP,67.589995999999999
2016-01-04,BA,140.500000000000000
In [3]:
price = load("../files/dowjones2016.csv") |> ndsparse
price |> @take(3)
Out[3]:
Date Symbol Price
2016-01-04 “AAPL” 105.35
2016-01-04 “AXP” 67.59
2016-01-04 “BA” 140.5
In [4]:
price |> @vlplot(:line, x=:Date, y=:Price, color={"Symbol:o", scale={scheme="viridis"}},
                 width=600, height=450)
Out[4]:

[Line chart: daily closing Price vs. Date for all 30 Dow Jones symbols, colored by Symbol.]

Computing the Dow Jones Index

The Dow Jones index is computed from the prices of its stocks as a weighted average, where each stock's weight is in turn derived from its price.

We will compute the average price of each stock over the days. Then we will normalize these values by dividing by the total of the average prices. The normalized weights are then multiplied with the daily prices to get the daily value of the index.
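Written out, with $\bar{p}_s$ denoting the average price of stock $s$ over the days and $p_{s,d}$ its price on day $d$, the weights and the index are

\begin{align*}
w_s = \frac{\bar{p}_s}{\sum_{s'} \bar{p}_{s'}}, \qquad I_d = \sum_{s} w_s \, p_{s,d}.
\end{align*}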

In [5]:
using Statistics: mean
In [6]:
avgprice = price |>
    @groupby(_.Symbol) |>
    @map({Symbol=key(_), AvgPrice=mean(_.Price)}) |> ndsparse
avgprice |> @take(4)
Out[6]:
Symbol AvgPrice
“AAPL” 104.604
“AXP” 63.7933
“BA” 133.112
“CAT” 78.698
In [7]:
avgprice |> @vlplot(:bar, x=:Symbol, y=:AvgPrice)
Out[7]:

[Bar chart: AvgPrice per Symbol.]

In [8]:
totalavg = avgprice |> @map(_.AvgPrice) |> sum
Out[8]:
2617.739246769842
In [9]:
weight = avgprice |> @map({Symbol=_.Symbol, Weight=_.AvgPrice / totalavg}) |> ndsparse
weight |> @take(4)
Out[9]:
Symbol Weight
“AAPL” 0.0399597
“AXP” 0.0243696
“BA” 0.0508498
“CAT” 0.0300634
In [10]:
dowjones = price |> 
    @join(weight, _.Symbol, _.Symbol, {_.Date, _.Symbol, Contrib=_.Price * __.Weight}) |>
    @groupby(_.Date) |>
    @map({Date=key(_), Value=sum(_.Contrib)}) |> ndsparse
dowjones |> @take(4)
Out[10]:
Date Value
2016-01-04 100.573
2016-01-05 100.511
2016-01-06 99.0142
2016-01-07 96.606
In [11]:
@vlplot(width=600, height=450) +
@vlplot(mark={:line, strokeWidth=1, opacity=0.8}, data=price, x=:Date, y=:Price,
        color={"Symbol:n", scale={scheme="blues"}}) +
@vlplot(mark={:line, strokeWidth=3, color=:orange}, data=dowjones, x=:Date, y=:Value)
Out[11]:

[Line chart: all 30 stock prices (blue lines) with the computed Dow Jones index overlaid; Price/Value vs. Date.]

In the previous plot, the Dow Jones index is marked in orange. In the following, it will be our target, which we want to approximate by using a small number of the stocks.

Model for Linear Fit

Let us start with an optimization model similar to an ordinary linear regression, but using the l1-norm, which is readily formulated as a linear program. In compact vector form, this reads

\begin{align*}
\text{minimize} \quad & \lVert w^T P - I \rVert_1 \\
\text{subject to} \quad & w \ge 0
\end{align*}

where $P$ stands for the prices of the individual stocks, $I$ for our index (the target) and $w$ for the weights we use in our fund. We only allow non-negative weights for use in the portfolio.

We can use a standard linear programming trick and introduce auxiliary variables for the positive and negative parts inside the absolute values and minimize their sum. This is formulated using JuMP:
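Explicitly, with deviation variables $u_d$ (posdev) and $v_d$ (negdev) for each training day $d$, the linear program reads

\begin{align*}
\text{minimize} \quad & \sum_d (u_d + v_d) \\
\text{subject to} \quad & \sum_s w_s \, p_{s,d} - I_d = u_d - v_d \quad \text{for all } d, \\
& w, u, v \ge 0
\end{align*}

where at an optimum at most one of $u_d, v_d$ is positive, so that $u_d + v_d = \lvert \sum_s w_s \, p_{s,d} - I_d \rvert$.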

In [12]:
using JuMP
using SCIP
In [13]:
dates = unique(price.index.columns.Date)
symbols = unique(price.index.columns.Symbol)

length(dates), length(symbols)
Out[13]:
(252, 30)

We could use all days of the year as input for our model, but for our evaluation we only use the days from the first three quarters. That way, we can see how our fit extrapolates through the remaining quarter.

In [14]:
training_days = 189
traindates = dates[1:training_days]
testdates = dates[training_days + 1:end]
@show traindates[end] testdates[1]

# for visualization
datebreak = [(Date=traindates[end],)] |> @take(1)

length(traindates), length(testdates)
traindates[end] = 2016-09-30
testdates[1] = 2016-10-03
Out[14]:
(189, 63)
In [15]:
function solve_index_linear()
    m = Model(solver=SCIPSolver("display/verblevel", 2))
    
    @variable(m, weight[symbols] >= 0)
    @variable(m, posdev[traindates] >= 0)
    @variable(m, negdev[traindates] >= 0)
    
    for d in traindates
        @constraint(m, sum(weight[s]*price[d,s][1] for s in symbols) - dowjones[d][1] == posdev[d] - negdev[d])
    end
    
    @objective(m, :Min, sum(posdev[d] + negdev[d] for d in traindates))
    
    status = solve(m)
    return (
        status = status,
        weight = getvalue(weight),
    )
end
Out[15]:
solve_index_linear (generic function with 1 method)
In [16]:
sol = solve_index_linear()
sol.status
Out[16]:
:Optimal
In [17]:
solweight = [(Symbol=s, SolWeight=sol.weight[s]) for s in symbols]
weight |>
    @join(solweight, _.Symbol, _.Symbol, {_.Symbol, _.Weight, __.SolWeight}) |>
    @vlplot(:point, x=:Weight, y=:SolWeight)
Out[17]:

[Scatter plot: SolWeight vs. Weight per Symbol.]

In the scatter chart above, we compare the weights from the Dow Jones index with the weights found by the LP model. As the points lie on the diagonal, we recover the actual weights perfectly, even though we are only using a subset of the data.

Model with Sparsity Constraint

The last model gave us an exact fit, but used all available stocks.

We now change it to allow only a given number of stocks. To this end, we introduce additional binary variables that select which stocks become active in our index fund.

The weight of an inactive stock must be forced to 0, which we achieve with so-called big-M constraints. The constant is chosen in such a way that any active stock could single-handedly approximate the target index, as formalized below.
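In formulas, with binary variables $y_s$ indicating whether stock $s$ is active and a limit of $k$ stocks, we add to the previous model the constraints

\begin{align*}
& w_s \le M y_s \quad \text{for all } s, \\
& \sum_s y_s \le k, \qquad y_s \in \{0, 1\}.
\end{align*}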

In [18]:
bigM = 1 / minimum(weight.data.columns[1])
Out[18]:
90.926297738811

We start by using only a single stock and then go up to subsets of 6 stocks out of the 30.

In [19]:
nstocks_range = 1:6
Out[19]:
1:6

The model seems to be relatively difficult for SCIP to solve, probably because of the big-M constraints, so we set a gap limit of 5% rather than solving to proven optimality.

In [20]:
function solve_index_sparse(nstocks)
    m = Model(solver=SCIPSolver("display/verblevel", 2, "limits/gap", 0.05))
    
    @variable(m, active[symbols], Bin)
    @variable(m, weight[symbols] >= 0)
    @variable(m, posdev[traindates] >= 0)
    @variable(m, negdev[traindates] >= 0)
    
    for d in traindates
        @constraint(m, sum(weight[s]*price[d,s][1] for s in symbols) - dowjones[d][1] == posdev[d] - negdev[d])
    end

    for s in symbols
        @constraint(m, weight[s] <= bigM*active[s])
    end
    
    @constraint(m, sum(active[s] for s in symbols) <= nstocks)
    
    @objective(m, :Min, sum(posdev[d] + negdev[d] for d in traindates))
    
    status = solve(m, suppress_warnings=true)
    return (
        status = status,
        weight = getvalue(weight)
    )
end
Out[20]:
solve_index_sparse (generic function with 1 method)
In [21]:
sols = [solve_index_sparse(n) for n in nstocks_range];

We extract the model solutions and concatenate them into a table for visualization.

In [22]:
solweight = [(NStocks=n, Symbol=s, SolWeight=sols[n].weight[s]) for n in nstocks_range for s in symbols] |>
            @filter(_.SolWeight > 0.0)
solweight |> @take(3)
Out[22]:
NStocks Symbol SolWeight
1 “TRV” 0.91443
2 “BA” 0.394428
2 “MMM” 0.313139
In [23]:
indices = price |>
    @join(solweight, _.Symbol, _.Symbol, {__.NStocks, _.Date, _.Symbol, Contrib=_.Price * __.SolWeight}) |>
    @groupby({_.NStocks, _.Date}) |>
    @map({NStocks=_.NStocks[1], Date=_.Date[1], Value=sum(_.Contrib)}) |> ndsparse
indices |> @take(3)
Out[23]:
NStocks Date Value
1 2016-01-04 100.56
1 2016-01-05 101.017
1 2016-01-06 99.7094
In [24]:
@vlplot(width=600, height=300) +
@vlplot(mark={:line, color=:orange, strokeWidth=3},
        data=dowjones, x=:Date, y={:Value, scale={domain=[80, 125]}}) +
@vlplot(mark={:line, strokeWidth=1, opacity=0.8},
        data=indices, x=:Date, y=:Value, color="NStocks:o") +
@vlplot(:rule, data=datebreak, x=:Date)
Out[24]:

[Line chart: index fund Value vs. Date for 1-6 stocks, with the Dow Jones target in orange and a rule marking the end of the training period.]

As we can see, our solutions stick to the target fund quite closely during the training period, but then diverge from it quickly. Allowing more stocks gives us a better fit.

Computing Returns

In the previous model, we tried to fit our index fund to the absolute day-to-day closing prices. The next model, in contrast, is based on picking representative stocks that are similar to the others, where similarity is defined through the correlation of returns.

So let’s start by computing daily returns from the closing prices (for the training period).
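That is, the return of stock $s$ on day $d$ is computed from consecutive closing prices as

\begin{align*}
r_{s,d} = \frac{p_{s,d}}{p_{s,d-1}} - 1.
\end{align*}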

In [25]:
prevdates = (Date=dates[2:end], PrevDate=dates[1:end-1])
prevdates |> @take(3)
Out[25]:
Date PrevDate
2016-01-05 2016-01-04
2016-01-06 2016-01-05
2016-01-07 2016-01-06
In [26]:
price[:, "AAPL"] |> @take(3)

returns = prevdates |>
    @filter(_.Date <= traindates[end]) |>
    @join(price, _.Date, _.Date, {_.Date, _.PrevDate, __.Symbol, __.Price}) |>
    @join(price, {Date=_.PrevDate, _.Symbol}, {_.Date, _.Symbol},
                 {__.Symbol, _.Date, Return=_.Price / __.Price - 1.0}) |>
    ndsparse
returns |> @take(3)
Out[26]:
Symbol Date Return
“AAPL” 2016-01-05 -0.0250593
“AAPL” 2016-01-06 -0.0195697
“AAPL” 2016-01-07 -0.0422046
In [27]:
returns |>
    @vlplot(mark={:line, opacity=0.8, strokeWidth=1}, width=600, height=450,
            x=:Date, y=:Return, color={"Symbol:o", scale={scheme="viridis"}})
Out[27]:

[Line chart: daily Return vs. Date per Symbol, training period.]

Analyzing Correlation

We compute pairwise correlations of the 30 stocks, but first need to compute the means and standard deviations of the returns.
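For each pair of stocks $l$ and $r$, the correlation of their returns is

\begin{align*}
\rho_{l,r} = \frac{\operatorname{cov}(r_l, r_r)}{\sigma_l \, \sigma_r}.
\end{align*}

Note that the queries below use the plain mean of the products for the covariance, but the sample normalization $n - 1$ for the standard deviations, which is why the self-correlation of AAPL in the output comes out as $187/188 \approx 0.9947$ rather than exactly 1.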

In [28]:
returnagg = returns |>
    @groupby(_.Symbol) |>
    @map({Symbol=key(_), MeanReturn=mean(_.Return)}) |>
    @join(returns, _.Symbol, _.Symbol,
          {_.Symbol, __.Date, _.MeanReturn, ShiftedSqr=(__.Return - _.MeanReturn)^2}) |>
    @groupby(_.Symbol) |>
    @map({Symbol=key(_), MeanReturn=_.MeanReturn[1],
          StdDevReturn=sqrt(sum(_.ShiftedSqr)/(length(_) - 1))})
returnagg |> @take(3)
Out[28]:
Symbol MeanReturn StdDevReturn
“AAPL” 0.000503948 0.0160686
“AXP” -0.000165808 0.0153556
“BA” -0.000207316 0.0163469

The mean and standard deviation of the stock returns can be used as performance and risk markers. In the following scatter chart, we give an overview of these (bottom right is best).

In [29]:
returnagg |>
    @vlplot(:point, x=:MeanReturn, y=:StdDevReturn, color={"Symbol:o", scale={scheme="viridis"}},
            width=450, height=450)
Out[29]:

[Scatter plot: StdDevReturn vs. MeanReturn per Symbol.]

In [30]:
shiftedreturns = returns |>
    @join(returnagg, _.Symbol, _.Symbol, {_.Symbol, _.Date, ShiftedReturn=_.Return - __.MeanReturn})
shiftedreturns |> @take(3)
Out[30]:
Symbol Date ShiftedReturn
“AAPL” 2016-01-05 -0.0255633
“AAPL” 2016-01-06 -0.0200736
“AAPL” 2016-01-07 -0.0427085
In [31]:
correlation = shiftedreturns |>
    @join(shiftedreturns, _.Date, _.Date,
          {Left=_.Symbol, Right=__.Symbol, Product=_.ShiftedReturn * __.ShiftedReturn}) |>
    @groupby({_.Left, _.Right}) |>
    @map({Left=_.Left[1], Right=_.Right[1], Covariance=mean(_.Product)}) |>
    @join(returnagg, _.Left, _.Symbol, {_.Left, _.Right, _.Covariance, LeftStdDev=__.StdDevReturn}) |>
    @join(returnagg, _.Right, _.Symbol,
          {_.Left, _.Right, Correlation=_.Covariance / (_.LeftStdDev * __.StdDevReturn)}) |>
    ndsparse
correlation |> @take(3)
Out[31]:
Left Right Correlation
“AAPL” “AAPL” 0.994681
“AAPL” “AXP” 0.121703
“AAPL” “BA” 0.400597
In [32]:
correlation |>
    @vlplot(:rect, x=:Left, y=:Right, color=:Correlation, width=450, height=450)
Out[32]:

[Heatmap: pairwise Correlation between all symbols (Left vs. Right).]

As we can see here, all of the stocks are either uncorrelated or positively correlated, but never negatively correlated, which is not at all obvious to me.

Model for Picking Representatives

The following model is taken from the book Optimization Methods in Finance.

Rather than looking at the day-to-day values, it precomputes a similarity for each pair of stocks and then selects a representative for each stock while limiting the total number of representatives. The actual weight given to the stocks is not part of the model, but computed in a post-processing step.

The authors note that this model can be solved more efficiently using a Lagrangian Relaxation approach, but I found that this is not necessary for our small dataset.
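In compact form, mirroring the JuMP model below, with binary variables $y_s$ for selecting stock $s$ into the fund and $x_{r,s}$ for representing stock $r$ by stock $s$, the model reads

\begin{align*}
\text{maximize} \quad & \sum_{r,s} \rho_{r,s} \, x_{r,s} \\
\text{subject to} \quad & \sum_s y_s \le k, \\
& x_{r,s} \le y_s \quad \text{for all } r, s, \\
& \sum_s x_{r,s} = 1 \quad \text{for all } r, \\
& x, y \in \{0, 1\}.
\end{align*}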

In [33]:
function solve_index_repr(nstocks)
    m = Model(solver=SCIPSolver("display/verblevel", 2))
    
    @variable(m, active[symbols], Bin)       # is stock in index fund?
    @variable(m, repr[symbols,symbols], Bin) # is stock 'r' represented by 's'?
    
    @constraint(m, sum(active[s] for s in symbols) <= nstocks)
    
    for r in symbols
        for s in symbols
            @constraint(m, repr[r,s] <= active[s])
        end
    end
    
    for r in symbols
        @constraint(m, sum(repr[r,s] for s in symbols) == 1)
    end
    
    @objective(m, :Max, sum(correlation[r,s][1] * repr[r,s] for r in symbols for s in symbols))
    
    status = solve(m, suppress_warnings=true)
    
    # post-processing: determine weights for representatives
    reprsol = getvalue(repr)
    accweight = [sum(weight[r][1] * reprsol[r,s] for r in symbols)/avgprice[s][1] for s in symbols]
                    
    return (
        status = status,
        active = getvalue(active),
        weight = accweight * mean(dowjones.data.columns.Value)
    )
end
Out[33]:
solve_index_repr (generic function with 1 method)
In [34]:
sols_repr = [solve_index_repr(n) for n in nstocks_range];
In [35]:
reprweight = [(NStocks=n, Symbol=s, SolWeight=w) for n in nstocks_range
              for (s,w) in zip(symbols,sols_repr[n].weight)] |>
            @filter(_.SolWeight > 0.0)
reprweight |> @take(3)
Out[35]:
NStocks Symbol SolWeight
1 “GE” 3.47077
2 “MMM” 0.409427
2 “V” 0.466415
In [36]:
reprindices = price |>
    @join(reprweight, _.Symbol, _.Symbol, {__.NStocks, _.Date, _.Symbol, Contrib=_.Price * __.SolWeight}) |>
    @groupby({_.NStocks, _.Date}) |>
    @map({NStocks=_.NStocks[1], Date=_.Date[1], Value=sum(_.Contrib)}) |> ndsparse
reprindices |> @take(3)
Out[36]:
NStocks Date Value
1 2016-01-04 106.587
1 2016-01-05 106.691
1 2016-01-06 104.991
In [37]:
@vlplot(width=600, height=300) +
@vlplot(mark={:line, color=:orange, strokeWidth=3},
        data=dowjones, x=:Date, y={:Value, scale={domain=[80, 125]}}) +
@vlplot(mark={:line, strokeWidth=1, opacity=0.8},
        data=reprindices, x=:Date, y=:Value, color="NStocks:o") +
@vlplot(:rule, data=datebreak, x=:Date)
Out[37]:

[Line chart: representative index fund Value vs. Date for 1-6 stocks, with the Dow Jones target in orange and a rule marking the end of the training period.]

We can see that the solution funds are not following the target index as closely as before, on a day-to-day basis, but the general shape of the curve is very similar. On the other hand, there seems to be hardly any difference in performance between the training and the test period, so it extrapolates quite well!

Comparison of methods

We could compare the approaches more systematically by computing the losses under different norms (l1 or l2) on the training or test data, but I feel the visualization already gives a good enough impression.

But as a final comparison, let us count how often the different stocks are used for the index funds.

In [38]:
count1 = solweight |>
    @groupby(_.Symbol) |>
    @map({Symbol=key(_), Method="L1Fit", Count=length(_)}) |>
    @orderby_descending(_.Count)
count2 = reprweight|>
    @groupby(_.Symbol) |>
    @map({Symbol=key(_), Method="Repr", Count=length(_)}) |>
    @orderby_descending(_.Count)

counts = [count1...; count2...] |> ndsparse

counts |>
    @vlplot(:bar, column=:Symbol, x={field=:Method, axis={title=""}}, y=:Count, color=:Method)
Out[38]:

[Grouped bar chart: Count per Symbol, split by Method (L1Fit vs. Repr).]

It looks like the two models have very different tastes in stocks and can only agree on 2-3 of them!

Notes on packages used

Since this blog was also an exercise in using Queryverse and VegaLite, I would like to comment on my experience now.
For Query itself, building the queries was relatively straightforward after watching the tutorial given by David Anthoff at JuliaCon 2018.

Some of the computations look quite awkward and are probably not efficient either, such as the twofold join of price with itself and another table that links each day with the previous day. Here, an array-based view using [2:end] and [1:end-1] would have been more natural. Also, after a while I found it a little annoying that I had to repeatedly name all of the columns that I wanted to keep, in @join or @map calls.

Finally, the combination of Query and ndsparse of IndexedTables.jl is a little unfortunate when the latter is used as a sink for some query. Typically, ndsparse is constructed with a set of columns to be used as an index, and another (single, unnamed) column as a value. Individual values can then be referenced just like an array, that is, table[i,j]. But the constructor of ndsparse used on query results takes the first $n - 1$ columns as index and the last as value, but wrapped in a singleton NamedTuple. So, table[i,j] will actually be a (k=23,). This meant that I had to use price[d,s][1] as a coefficient in my model, rather than simply price[d,s]. I could not figure out how to avoid this without creating essentially another copy of the data.
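To illustrate, here is a minimal sketch of the two behaviors (assuming the basic two-argument ndsparse constructor; the column names i and v are made up for this example):

using IndexedTables: ndsparse

# values given as a plain vector: indexing returns a scalar
t1 = ndsparse((i=[1, 2],), [10.0, 20.0])
t1[1]       # 10.0

# values wrapped in a single-column named tuple, as a query sink produces:
t2 = ndsparse((i=[1, 2],), (v=[10.0, 20.0],))
t2[1]       # (v = 10.0,)
t2[1][1]    # 10.0 -- hence price[d,s][1] in the models above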

Marginal Net Income in Germany

We investigate the tax burden and insurance cost for different scenarios of gross income using a simplified model of the situation in Germany. The goal is to compute the marginal net income for additional gross income. From that, we can also derive the net hourly income for part-time employment.

This analysis is done using Julia in a Jupyter notebook.

In [1]:
# assume a typical monthly income
gross_monthly = 37873 / 12
Out[1]:
3156.0833333333335

Social costs

For simplicity, we ignore the possibility of private health insurance and assume that our typical employee participates in the public social system. This includes health insurance, unemployment insurance and payments to the pension system.

These payments are split evenly between the employer and employee. This means that the labor cost of the employer is higher than the gross salary paid.

Source: nettolohn.de

In [2]:
# total contribution rates, split between employer and employee below
pension = 0.186
unemployment = 0.03
health = 0.146
longterm_care = 0.028
base_rate = 0.5 * (pension + unemployment + health + longterm_care)
health_extra = 0.011
employer_rate = base_rate
employee_rate = base_rate + health_extra
@show employer_rate, employee_rate;
(employer_rate, employee_rate) = (0.195, 0.20600000000000002)
In [3]:
# effective labor cost for employer
real_monthly = (1.0 + employer_rate) * gross_monthly
Out[3]:
3771.5195833333337

However, these relative rates are only applied up to specific bounds on the gross income.
So, let’s define some functions to compute the correct costs.

In the case of self-employment, there is also a virtual minimum income that is used as a reference, which we will ignore for simplicity.

Source: Beitragsbemessungsgrenze

In [4]:
function social_cost(gross_monthly)
    pension       = 0.186 * min(gross_monthly, 6500.0)
    unemployment  = 0.030 * min(gross_monthly, 6500.0)
    health        = 0.146 * min(gross_monthly, 4425.0)
    longterm_care = 0.028 * min(gross_monthly, 4425.0)
    pension + unemployment + health + longterm_care
end

employer_cost(gross_monthly) = 0.5 * social_cost(gross_monthly)
employee_cost(gross_monthly) = 0.5 * social_cost(gross_monthly) + 0.011 * gross_monthly
total_cost(gross_monthly) = employer_cost(gross_monthly) + employee_cost(gross_monthly)
employer_total(gross_monthly) = employer_cost(gross_monthly) + gross_monthly;

Let’s visualize the total social cost relative to the real monthly labor cost.

In [5]:
using Plots
gr()
Out[5]:
Plots.GRBackend()
In [6]:
gross_range = 400.0:20.0:10000.0
real_income = employer_total.(gross_range)
relative_cost = total_cost.(gross_range) ./ real_income
plot(real_income, relative_cost, xlim=(0,11000), ylim=(0.0, 0.5), legend=false,
     xlabel="effective monthly income", ylabel="rel social cost")
Out[6]:

[Plot: rel social cost vs. effective monthly income.]

So, higher gross (or effective) income leads to a smaller relative social cost.
We can repeat that plot for the employee’s point of view:

In [7]:
plot(gross_range, employee_cost.(gross_range)./gross_range, xlim=(0,10000), ylim=(0.0, 0.3), legend=false,
     xlabel="gross monthly income", ylabel="employee's share of social cost")
Out[7]:

[Plot: employee's share of social cost vs. gross monthly income.]

The income tax applies only to the part of the gross income from which the social cost has already been subtracted.
Further, some portion of that remainder is also free from taxes.

In [8]:
taxable_income(gross_monthly) = gross_monthly - employee_cost(gross_monthly)
Out[8]:
taxable_income (generic function with 1 method)

Income tax

The tax rate is a piece-wise defined function.
We assume an unmarried employee with no kids who is not taxable by any church.

The source below actually contains flow charts with detailed conditions, thresholds and rounding of intermediate results.
I can’t be bothered to understand all of that, so I will try to extract the essential information.

Source: BMF Steuerrechner

In [9]:
function income_tax(yearly)
    if yearly <= 9000
        return 0
    elseif yearly < 13997
        y = (yearly - 9000) / 10000
        rw = y * 997.8 + 1400
        return ceil(rw * y)
    elseif yearly < 54950
        y = (yearly - 13996) / 10000
        rw = y * 220.13 + 2397
        return ceil(rw * y + 948.49)
    elseif yearly < 260533
        return ceil(yearly * 0.42 - 8621.75)
    else
        return ceil(yearly * 0.45 - 16437.7)
    end
end
Out[9]:
income_tax (generic function with 1 method)
In [10]:
yearly_range = 5000:500:100000
plot(yearly_range, income_tax.(yearly_range), legend=false, xlim=(0,100000), ylim=(0,35000),
    xlabel="yearly taxable income", ylabel="income tax")
Out[10]:

[Plot: income tax vs. yearly taxable income.]

In [11]:
plot(yearly_range, income_tax.(yearly_range) ./ yearly_range, legend=false, xlim=(0,100000), ylim=(0,0.4),
    xlabel="yearly taxable income", ylabel="income tax rate")
Out[11]:

[Plot: income tax rate vs. yearly taxable income.]

Net income

Now, let’s combine the social costs and taxes to compute the net income.

In [12]:
function net(gross_monthly)
    taxable = taxable_income(gross_monthly)
    taxes = income_tax(ceil(12 * taxable)) / 12
    taxable - taxes
end

net_income = net.(gross_range)

plot(gross_range, net_income, legend=false, xlim=(0,10000), ylim=(0,6000),
     xlabel="monthly gross income", ylabel="monthly net income")
Out[12]:

[Plot: monthly net income vs. monthly gross income.]

This looks surprisingly linear. Let’s also plot the relative net income compared to the effective cost.

In [13]:
rel_net = net_income ./ real_income
plot(real_income, rel_net, legend=false, xlim=(0,12000), ylim=(0,0.7),
     xlabel="effective monthly income", ylabel="rel net income")
Out[13]:

[Plot: rel net income vs. effective monthly income.]

Interestingly, this curve is not monotonically decreasing. In fact, there seems to be a minimum at a gross income of 4425 (real income of about 5300), which is the upper assessment limit for the health insurance.

Marginal income

The analysis above can be slightly misleading. If we are currently earning a certain income and have an opportunity to raise it, the raise might also decrease our relative net income. However, the higher rates for social cost and income tax are applied equally to the previous and the additional income.

If we want to show how much net income we can retain for each additional EUR earned, we have to take some more care.
Here, we approximate the slope of the net income as a function of real income using a symmetric difference quotient.
There are some jumps in that curve since our net income is not smooth.
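With grid points $x_i$ (real income) and net income values $f(x_i)$, the symmetric difference quotient is

\begin{align*}
f'(x_i) \approx \frac{f(x_{i+1}) - f(x_{i-1})}{x_{i+1} - x_{i-1}},
\end{align*}

which is what the index shifts [3:end] and [1:end-2] below compute for all interior points at once.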

In [14]:
finite_diffs = (net_income[3:end] - net_income[1:end-2]) ./ (real_income[3:end] - real_income[1:end-2])
plot(real_income[3:end], [finite_diffs rel_net[3:end]], xlim=(0,12000), ylim=(0,0.65),
    xlabel="effective monthly income", label=["marginal income" "relative net income"],
    legend=:bottomright)
Out[14]:

[Plot: marginal income and relative net income vs. effective monthly income.]

Part-time work and hourly income

We have seen that a lower gross income often corresponds to a higher relative net income.
If we have the option to reduce our working hours, we should be able to increase our income per hour.

Let’s assume a 40 hour work-week and 4 weeks per month.

In [15]:
monthly_hours = 40.0 * 4.0
Out[15]:
160.0
In [16]:
# median income
median_net = net(gross_monthly)
Out[16]:
2058.3468333333334
In [17]:
median_net / monthly_hours
Out[17]:
12.864667708333334
In [18]:
using ColorBrewer
In [19]:
# hourly income for different monthly gross incomes
hours = 10:40
steps = 6
gross = range(1000, stop=8000, length=steps)'
colors = ColorBrewer.palette("YlGnBu", steps + 3)[4:end]
parttime_net = net.((hours / 40)*gross)
hourly_net = parttime_net ./ (hours * 4)
plot(hours, hourly_net, labels=gross, ylim=(0,40),
     color_palette=colors,
     xlabel="hours per week", ylabel="hourly net income",
     title="Hourly net income for part-time work")
Out[19]:

[Plot: hourly net income vs. hours per week, for monthly gross incomes from 1000 to 8000.]

As we can see above, we can increase our hourly income by working fewer hours every week.
This effect is clearer if we look at the hourly net income of part-time work relative to the hourly net income for full-time work:

In [20]:
rel_hourly = hourly_net ./ hourly_net[end, :]';
In [21]:
plot(hours, rel_hourly, labels=gross, color_palette=colors,
     xlabel="hours per week", ylabel="relative hourly net income",
     title="Relative hourly net income for part-time work")
Out[21]:

[Plot: relative hourly net income vs. hours per week, for monthly gross incomes from 1000 to 8000.]

For low values of gross income, reducing the weekly work hours hardly affects the hourly income at all.
Medium values show the most promise, with the largest increase in hourly income.
With a high enough income, reducing the work load yields less and less increase in hourly income.