By: Robert Schwarz

Re-posted from: https://leethargo.github.io/posts/index-funds-with-mixed-integer-programming/

## Index Funds with Mixed-Integer-Programming¶

We will analyze daily price data for stocks in the Dow Jones index and then try to build an accurate index fund using a small numbers of stocks therein.

Similar material was already used in a presentation at PyData Berlin 2017. See the “Tour of popular packages” notebook. Back then, we worked with Julia 0.6 and used the packages `DataFrames`

, `Plots`

and `JuMP`

. Now, we work with Julia 1.0 and use packages from the `Queryverse`

, `VegaLite`

and `IndexedTables`

for data prep and visualization. Also, I added an alternative model.

### Loading the Data¶

```
using Queryverse
using IndexedTables: ndsparse
```

The data consists of daily closing prices from 2016 of the 30 stocks that participate in the Dow Jones and are given in a CSV file:

```
;head -n4 ../files/dowjones2016.csv
```

```
price = load("../files/dowjones2016.csv") |> ndsparse
price |> @take(3)
```

```
price |> @vlplot(:line, x=:Date, y=:Price, color={"Symbol:o", scale={scheme="viridis"}},
width=600, height=450)
```

### Computing the Dow Jones Index¶

The Dow Jones index is computed from the prices of its stocks, as a weighted average, where the weight is itself defined through the price of the stock.

We will compute the average price of each stock over the days. Then we will normalize these values by dividing through the total of the average prices. The normalized weights are then multiplied to the daily prices to get the daily value of the index.

```
using Statistics: mean
```

```
avgprice = price |>
@groupby(_.Symbol) |>
@map({Symbol=key(_), AvgPrice=mean(_.Price)}) |> ndsparse
avgprice |> @take(4)
```

```
avgprice |> @vlplot(:bar, x=:Symbol, y=:AvgPrice)
```

```
totalavg = avgprice |> @map(_.AvgPrice) |> sum
```

```
weight = avgprice |> @map({Symbol=_.Symbol, Weight=_.AvgPrice / totalavg}) |> ndsparse
weight |> @take(4)
```

```
dowjones = price |>
@join(weight, _.Symbol, _.Symbol, {_.Date, _.Symbol, Contrib=_.Price * __.Weight}) |>
@groupby(_.Date) |>
@map({Date=key(_), Value=sum(_.Contrib)}) |> ndsparse
dowjones |> @take(4)
```

```
@vlplot(width=600, height=450) +
@vlplot(mark={:line, strokeWidth=1, opacity=0.8}, data=price, x=:Date, y=:Price,
color={"Symbol:n", scale={scheme="blues"}}) +
@vlplot(mark={:line, strokeWidth=3, color=:orange}, data=dowjones, x=:Date, y=:Value)
```

In the previous plot, the Dow Jones index is marked in orange. In the following, it will be our target, which we want to approximate by using a small number of the stocks.

## Model for Linear Fit¶

Let us start with an optimization model similar to an ordinary linear regression, but using the l1-norm, which is readily formulated as a linear program. In compact vector form, this reads

\begin{align*}

\text{minimize} \quad & \lVert w^T P – I \rVert_1 \\

\text{subject to} \quad & w \ge 0

\end{align*}

where $P$ stands for the prices of the individual st, $I$ for our index (the target) and $w$ for the weights we use in our fund. We only allow non-negative weights for use in the portfolio.

We can use a standard linear programming trick and introduce auxiliary variables for the positive and negative parts inside the absolute values and minimize their sum. This is formulated using JuMP:

```
using JuMP
using SCIP
```

```
dates = unique(price.index.columns.Date)
symbols = unique(price.index.columns.Symbol)
length(dates), length(symbols)
```

We could use all days of the year as input for our model, but for our evaluation we only use the days from the first three quarters. That way, we can see how our fit extrapolates through the remaining quarter.

```
training_days = 189
traindates = dates[1:training_days]
testdates = dates[training_days + 1:end]
@show traindates[end] testdates[1]
# for visualization
datebreak = [(Date=traindates[end],)] |> @take(1)
length(traindates), length(testdates)
```

```
function solve_index_linear()
m = Model(solver=SCIPSolver("display/verblevel", 2))
@variable(m, weight[symbols] >= 0)
@variable(m, posdev[traindates] >= 0)
@variable(m, negdev[traindates] >= 0)
for d in traindates
@constraint(m, sum(weight[s]*price[d,s][1] for s in symbols) - dowjones[d][1] == posdev[d] - negdev[d])
end
@objective(m, :Min, sum(posdev[d] + negdev[d] for d in traindates))
status = solve(m)
return (
status = status,
weight = getvalue(weight),
)
end
```

```
sol = solve_index_linear()
sol.status
```

```
solweight = [(Symbol=s, SolWeight=sol.weight[s]) for s in symbols]
weight |>
@join(solweight, _.Symbol, _.Symbol, {_.Symbol, _.Weight, __.SolWeight}) |>
@vlplot(:point, x=:Weight, y=:SolWeight)
```

In the scatter chart above, we compare the weights from the Dow Jones index, with the weights found by the LP model. As we can see (the points lie on the diagonal), we recover the actual weights perfectly, even though we are only using a subset of the data.

## Model with Sparsity Constraint¶

The last model gave us an exact fit, but used all available stocks.

We are changing it now, and allow it to use only a given number of stocks. For this end, we will introduce additional binary variables to select stocks to become active in our index fund.

The weight of inactive variables must be forced to 0, which we do with a so-called *big-M* constraint. The constant is chosen in such a way, that any active stock could single-handedly approximate the target index.

```
bigM = 1 / minimum(weight.data.columns[1])
```

We start by using only a single stock and then go up to subsets of 6 stocks out of the 30.

```
nstocks_range = 1:6
```

The model seems to be relatively difficult to solve for SCIP, probably because of the *big-M* constraints, so we set a gap limit of 5%, rather than solving to proven optimality.

```
function solve_index_sparse(nstocks)
m = Model(solver=SCIPSolver("display/verblevel", 2, "limits/gap", 0.05))
@variable(m, active[symbols], Bin)
@variable(m, weight[symbols] >= 0)
@variable(m, posdev[traindates] >= 0)
@variable(m, negdev[traindates] >= 0)
for d in traindates
@constraint(m, sum(weight[s]*price[d,s][1] for s in symbols) - dowjones[d][1] == posdev[d] - negdev[d])
end
for s in symbols
@constraint(m, weight[s] <= bigM*active[s])
end
@constraint(m, sum(active[s] for s in symbols) <= nstocks)
@objective(m, :Min, sum(posdev[d] + negdev[d] for d in traindates))
status = solve(m, suppress_warnings=true)
return (
status = status,
weight = getvalue(weight)
)
end
```

```
sols = [solve_index_sparse(n) for n in nstocks_range];
```

We extract the model solutions and concatenate them into a table for visualization.

```
solweight = [(NStocks=n, Symbol=s, SolWeight=sols[n].weight[s]) for n in nstocks_range for s in symbols] |>
@filter(_.SolWeight > 0.0)
solweight |> @take(3)
```

```
indices = price |>
@join(solweight, _.Symbol, _.Symbol, {__.NStocks, _.Date, _.Symbol, Contrib=_.Price * __.SolWeight}) |>
@groupby({_.NStocks, _.Date}) |>
@map({NStocks=_.NStocks[1], Date=_.Date[1], Value=sum(_.Contrib)}) |> ndsparse
indices |> @take(3)
```

```
@vlplot(width=600, height=300) +
@vlplot(mark={:line, color=:orange, strokeWidth=3},
data=dowjones, x=:Date, y={:Value, scale={domain=[80, 125]}}) +
@vlplot(mark={:line, strokeWidth=1, opacity=0.8},
data=indices, x=:Date, y=:Value, color="NStocks:o") +
@vlplot(:rule, data=datebreak, x=:Date)
```

As we can see, our solutions stick to the target fund quite closely during the training period, but then diverge from it quickly. Allowing more stocks gives us a better fit.

## Computing Returns¶

In the previous model, we tried to fit our index fund to the absolute day-to-day closing prices. But the next model will be based on picking representative that are similar, which is defined through correlation of returns.

So let’s start by computing daily returns from the closing prices (for the training period).

```
prevdates = (Date=dates[2:end], PrevDate=dates[1:end-1])
prevdates |> @take(3)
```

```
price[:, "AAPL"] |> @take(3)
returns = prevdates |>
@filter(_.Date <= traindates[end]) |>
@join(price, _.Date, _.Date, {_.Date, _.PrevDate, __.Symbol, __.Price}) |>
@join(price, {Date=_.PrevDate, _.Symbol}, {_.Date, _.Symbol},
{__.Symbol, _.Date, Return=_.Price / __.Price - 1.0}) |>
ndsparse
returns |> @take(3)
```

```
returns |>
@vlplot(mark={:line, opacity=0.8, strokeWidth=1}, width=600, height=450,
x=:Date, y=:Return, color={"Symbol:o", scale={scheme="viridis"}})
```