Tag Archives: modeling

Julia and MATLAB can coexist. Let us show you how.

By: Great Lakes Consulting

Re-posted from: https://blog.glcs.io/juliacon-2025-preview

This post was written by Steven Whitaker.

Have you ever wished you could start using the Julia programming language to develop custom models? Does the idea of replacing outdated MATLAB code and models seem overwhelming?

Or maybe you don’t plan to replace all MATLAB code, but wouldn’t it be exciting to integrate Julia code into existing workflows?

Also, technicalities aside, how do you convince your colleagues to make the leap into the Julia ecosystem?

I’m excited to share an announcement! At this year’s JuliaCon, I will be speaking about a small but significant step you can take to start adding Julia to your MATLAB codebase.

Great news! You can transition to Julia smoothly without completely abandoning MATLAB. There’s a straightforward method to embrace the best of both worlds, so you won’t need to rewrite your legacy models from scratch.

I’ll give my full talk in July, but if you don’t want to wait, keep reading for a sneak peek!

Background

The GLCS.io team has been developing Julia-based solutions since 2015. Over the past 4 years, we’ve had the pleasure of redesigning and enhancing Julia models for our clients in the finance, science, and engineering sectors. Julia’s incredible speed and versatility have transformed how we tackle complex computations together. However, we also fully acknowledge the reality: MATLAB continues to hold a significant place in countless companies and research labs worldwide.

For decades, MATLAB has been the benchmark for data analysis, modeling, and simulation across scientific and engineering fields. There are likely hundreds of thousands of MATLAB licenses in use, with millions of users supporting an unimaginable number of models and codebases.

Even for a single company, fully transitioning to Julia often feels insurmountable. The vast amount of existing MATLAB code presents a significant challenge for any team considering adopting Julia.

Yet, unlocking Julia’s power is vital for companies aiming to excel in today’s competitive landscape. The question isn’t if companies should adopt Julia; it’s how to do it.

Companies should blend Julia with their MATLAB environments, ensuring minimal disruption and optimal resource use. This strategic integration delivers meaningful gains in accuracy, performance, and scalability to transform operations and drive success.

JuliaCon Preview

At JuliaCon, I’m excited to share how you can seamlessly integrate Julia into existing MATLAB workflows, a process that has delivered up to 100x performance improvements while enhancing code quality and functionality. Through a real-world model, I’ll highlight design patterns, benchmark comparisons, and valuable business case insights to demonstrate the transformative potential of integrating Julia.

(Spoiler alert: the performance improvement is more than 100x for the example I will show at JuliaCon.)

What We Offer

Unlock high-performance modeling! Our dedicated team is here to integrate Julia into your MATLAB workflows. Experience a strategic, step-by-step process tailored for seamless Julia-MATLAB integration, focused on efficiency and delivering measurable results:

  1. Tailored Assessment: Pinpoint challenges and opportunities for Julia to address.
  2. MATLAB Benchmarking: Establish a performance baseline to measure progress and impact.
  3. Julia Model Development: Convert MATLAB models to Julia or assist your team in doing so.
  4. Julia Integration: Combine Julia’s capabilities with your existing MATLAB workflows for optimal results.
  5. Roadmap Alignment: Validate performance improvements, create a strong business case for leadership, and agree on future support and innovation.

Check out our website for more details.

Summary

By attending my JuliaCon talk, you will learn how to seamlessly integrate Julia into your existing MATLAB codebase. And by leveraging our support at GLCS, you can adopt Julia without disruption, unlocking faster computations, improved models, and better scalability while retaining the strengths of your MATLAB codebase.

Are you or someone you know excited about harnessing the power of Julia and MATLAB together? Let’s connect! Schedule a consultation today to discover incredible performance gains of 100x or more.

Additional Links

MATLAB is a registered trademark of The MathWorks, Inc.

Cover image: The JuliaCon 2025 logo was obtained from https://juliacon.org/2025/.


ModelingToolkit, Modelica, and Modia: The Composable Modeling Future in Julia

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/modelingtoolkit-modelica-and-modia-the-composable-modeling-future-in-julia/

Let me take a bit of time here to write out a complete canonical answer to ModelingToolkit and how it relates to Modia and Modelica. This question comes up a lot: why does ModelingToolkit exist instead of building on tooling for Modelica compilers? I’ll start out by saying I am a huge fan of Martin and Hilding’s work, I respect them a ton, and they have made major advances in this space. But I think ModelingToolkit tops what they have developed in a not-so-subtle way. And it all comes down to the founding principle, the foundational philosophy, of what a modeling language needs to do.

Composable Abstractions for Model Transformations

There is a major philosophical difference which is seen in both the development and usage of the tools. Everything in the SciML organization is built around a principle of confederated modular development: let other packages influence the capabilities of your own. This is highlighted in a paper about the package structure of DifferentialEquations.jl. The underlying principle is that not everyone wants or needs to be a developer of the package, but they may still want to contribute. For example, it’s not uncommon that a researcher in ODE solvers wants to build a package that adds one solver to the SciML ecosystem. They can do this in their own package for their own academic credit, with the free bonus that it now exists in the multiple dispatch world. In the design of DifferentialEquations.jl, solve(prob, IRKGL16()) now exists because of their package, and so we add it to the documentation. Some of this work is not even inside the organization, but we still support it. The philosophy is to include every researcher as a budding artist in the space of computational research, including all of the possible methods, and to build an infrastructure that promotes a free research atmosphere in the methods. Top level defaults and documentation may lead people to the most stable aspects of the ecosystem, but with a flip of a switch you can be testing out the latest research.
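To make that confederated pattern concrete, here is a minimal sketch of how an externally developed solver plugs into the common solve interface. This assumes DifferentialEquations.jl and the independently maintained IRKGaussLegendre.jl (which provides IRKGL16) are installed; the toy problem is mine, not from the post.

using DifferentialEquations, IRKGaussLegendre

# A toy ODE: u' = -u with u(0) = 1
f(u, p, t) = -u
prob = ODEProblem(f, [1.0], (0.0, 1.0))

sol_core = solve(prob, Tsit5())    # solver maintained inside the SciML organization
sol_ext  = solve(prob, IRKGL16())  # solver contributed by an external package,
                                   # reached through the same multiple-dispatch interface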

When approaching modeling languages like Modelica, I noticed this idea was completely foreign to them. Modelica is created by a committee, but the implementations that people use are closed like Dymola, or monolithic like OpenModelica. This is not a coincidence but instead a fact of the design of the language. In the Modelica language, there is no reference to what transformations are being done to your models in order to make them “simulatable”. People know about Pantelides algorithm and “singularity elimination”, but this is outside the language. It’s something that the compiler maybe gives you a few options for, but not something the user or the code actively interacts with. Every compiler is different; advances in one compiler do not help your model when you use another compiler, and the whole world is siloed. By this design, it is impossible for an external user to write compiler passes in Modelica which affect this model lowering process. You can tweak knobs, or write a new compiler. Or fork OpenModelica and hack on the whole compiler to just do the change you wanted.

I do not think that the symbolic transformations that are performed by Modelica are the complete set that everyone will need for all models for all time. I think in many cases you might want to write your own. For example, on SDEs there’s a Lamperti transformation which transforms general SDEs into SDEs with additive noise. It doesn’t always apply, but when it does it can greatly enhance solver speed and stability. This is niche enough that it’ll never be in a commercial Modelica compiler (in fact, they don’t even have SDEs), but it’s something that some user might want to be able to add to the process.
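For reference, the scalar version of that transformation is a standard result (stated here in my own notation, not quoted from any particular implementation): for an Itô SDE

$$dX_t = f(X_t)\,dt + g(X_t)\,dW_t, \qquad g > 0,$$

the change of variables $Z = \psi(X)$ with $\psi(x) = \int^x \frac{du}{g(u)}$ gives, by Itô’s lemma,

$$dZ_t = \left(\frac{f(X_t)}{g(X_t)} - \frac{1}{2}\,g'(X_t)\right)dt + dW_t,$$

i.e. an SDE with additive unit noise.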

ModelingToolkit: Opening the Development Process

So the starting goal of ModelingToolkit is to give an open and modular transformation system on which a whole modeling ecosystem can thrive. My previous blog post exemplified how unfamiliar use cases for code transformations can solve many difficult mathematical problems, and my goal is to give this power to the whole development community. structural_simplify is something built into ModelingToolkit to do “the standard transformations” on the standard systems, but the world of transformations is so much larger. Log-transforming a few variables? Exponentiating a few to ensure positivity? Lamperti transforms of SDEs? Transforming to the sensitivity equations? And not just transformations, but functionality for inspecting and analyzing models. Are the equations linear? Which parameters are structurally identifiable?
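To make that concrete, here is a minimal sketch of calling structural_simplify on a toy DAE. The example is mine, and the exact API details vary across ModelingToolkit.jl versions; this follows the style in use around the time of this post.

using ModelingToolkit

@variables t x(t) y(t)
D = Differential(t)

# One differential equation plus one algebraic constraint (a small DAE)
eqs = [D(x) ~ -x + y,
       0 ~ x + y]

@named sys = ODESystem(eqs, t)
simplified = structural_simplify(sys)  # runs the "standard transformations"
                                       # (alias elimination, tearing, index reduction, ...)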

From that you can see that Modia was a major inspiration for ModelingToolkit, but Modia did not go in this direction of decomposing the modeling language: it essentially is a simplified Modelica compiler in Julia. But ModelingToolkit is a deconstruction of what a modeling language is. It pulls it down to its component pieces and then makes it easy to build new modeling languages like Catalyst.jl which internally use ModelingToolkit for all of the difficult transformations. The deconstructed form is a jumping point for building new domain-based languages, along with new transformations which optimize the compiler for specific models. And then in the end, everybody who builds off of it gets improved stability, performance, and parallelism as the core MTK passes improve.

Bringing the Power to the People

Now there are two major aspects that need to be handled to fully achieve such a vision, though. If you want people to be able to reuse code between transformations, what you want is to expose how you are changing code. To achieve this goal, a new Computer Algebra System (CAS), Symbolics.jl, was created for ModelingToolkit.jl. The idea is that if we want everyone writing code transformations, they should all have easy access to a general mathematical toolset for doing such code transformations. We shouldn’t have everyone building new code for differentiation, simplification, and substitution. And we shouldn’t have everyone relying on undocumented internals of ModelingToolkit.jl either: this should be something that is open, well-tested, documented, and a well-known system so that everyone can easily become a “ModelingToolkit compiler developer”. By building a CAS and making it a Julia standard, we can bridge that developer gap because now everyone knows how to easily manipulate models: they are just Symbolics.jl expressions.
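For a flavor of what that shared toolset looks like, a minimal sketch (the expressions are arbitrary examples of mine):

using Symbolics

@variables x y
expr = x^2 + 2x*y + y^2

Symbolics.derivative(expr, x)   # symbolic differentiation: 2x + 2y
substitute(expr, Dict(y => 2))  # substitution: x^2 + 4x + 4
simplify(x + x + y)             # simplification: 2x + y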

The second major aspect is to achieve a natural embedding into the host language. Modelica is not a language in which people can write compiler passes, which introduces a major gap between the modeler and the developer of extensions to the modeling language. If we want to bridge this gap, we need to ensure the whole modeling language is used from a host which is a complete imperative programming language. And you need to do so in a language that is interactive, high performance, and has a well-developed ecosystem for modeling and simulation. Martin and Hilding had seen this fact as the synthesis for Modia with how Julia uniquely satisfies this need, but I think we need to take it a step further. To really make the embedding natural, you should be able to on the fly automatically convert code to and from the symbolic form. In the previous blog post I showcased how ModelingToolkit.jl could improve people’s code by automatically parallelizing it and performing index reduction even if the code was not written in ModelingToolkit.jl. This grows the developer audience of the transformation language from “anyone who wants to transform models” to “anyone who wants to automate improving models and general code”. This expansion of the audience is thus pulling in developers who are interested in things like automating parallelism and GPU codegen and bringing them into the MTK developer community.
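One concrete entry point for that on-the-fly conversion is modelingtoolkitize, which lifts a numerically defined problem into symbolic form. A hedged sketch using the classic Lorenz system (the problem-regeneration step may differ between versions):

using OrdinaryDiffEq, ModelingToolkit

function lorenz!(du, u, p, t)
    du[1] = p[1] * (u[2] - u[1])
    du[2] = u[1] * (p[2] - u[3]) - u[2]
    du[3] = u[1] * u[2] - p[3] * u[3]
end

prob = ODEProblem(lorenz!, [1.0, 0.0, 0.0], (0.0, 100.0), [10.0, 28.0, 8 / 3])
sys  = modelingtoolkitize(prob)                       # plain Julia code -> symbolic system
fast = ODEProblem(sys, [], (0.0, 100.0), jac = true)  # regenerate with a symbolic Jacobian
sol  = solve(fast, Rodas5())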

In turn, since all of these advances then apply to the MTK internals and code generation tools such as Symbolics.jl’s build_function, new features are coming all of the time because of how the community is composed. The CTarget build_function was first created to transpile Julia code to C, and thus ModelingToolkit models can generate C outputs for compiling into embedded systems. This is serendipity when seeing one example, but it’s design when you notice that this is how the entire system is growing so fast.
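A rough sketch of that C code generation path (my own toy expressions; the exact argument and keyword interface of build_function may differ between Symbolics.jl versions):

using Symbolics

@variables x y
exprs = [x^2 + sin(y), x * y]

# Returns C source code (as a string) that fills an output array with these expressions
c_source = build_function(exprs, [x, y]; target = Symbolics.CTarget())
println(c_source)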

But Can Distributed Development Be As Good As Specialized Code?

Now, one of the questions we received early on was: won’t you be unable to match the performance of a specialized compiler that was only made to work on Modelica? While at face value it may seem like hyperspecialization could be beneficial, the true effect of hyperspecialization is that algorithms are simply less efficient because less work has been put into them. Symbolics.jl has become a phenomenon of its own, with multiple hundred-comment threads digging through many aspects of the pros and cons of its designs, and that’s not even including the 200-person chat channel which has had tens of thousands of messages in the less than 2 months since the CAS was released. Tons of people are advising how to improve every single plus and multiply operation.

So it shouldn’t be a surprise that there are many details that have quickly been added which are still years away from a Modelica implementation. It automatically multithreads tree traversals and rewrite rules. It automatically generates fast parallelized code, and can do so in a way that composes with tearing of nonlinear equations. It lets users define their own high-performance and parallelized functions, register them, and stick them into the right hand side. And that is even excluding the higher level results, like the fact that it is fully differentiable and thus allows training neural networks decomposed within the models, and automatically discovering equations from data.

Just at the very basic level we can see that the CAS is transforming the workflows of scientists and engineers in many aspects of the modeling process. By distributing the work of improving symbolic computing, we have already taken examples that were essentially unobtainable and made them instant with Symbolics.jl.

We are building out a full benchmarking system for the symbolic ecosystem to track performance over time and ensure it reaches the top level. It’s integrating pieces from The OSCAR project, getting lots of people tracking performance in their own work, and building a community. Each step is another major improvement and this ecosystem is making these steps fast. It will be hard for a few people working on the internals of a single Modelica compiler to keep up with such an environment, let alone repeating this work to every new Modelica-based project.

But How Do You Connect To Modelica?

This is a rather good question because there are a lot of models already written in Modelica, and it would be a shame for us to not be able to connect with that ecosystem. I will hint that there is tooling coming as part of JuliaSim for connecting to many pre-existing model libraries. In addition, we hope tooling like Modia.jl and TinyModia.jl will help us build a bridge.

Conclusion: Designing Around the Developer Community Has Many Benefits

The composability and distributed development nature of ModelingToolkit.jl is its catalyst. This is why ModelingToolkit.jl looks like it has rocket shoes on: it is fast and it is moving fast. And it’s because of the thought put into the design. It’s because ModelingToolkit.jl is including the entire research community as its asset instead of just its user. I plan to keep moving forward from here, looking back to learn from the greats, but building it in our own image. We’re taking the idea of a modeling language, distributing it throughout one of the most active developer communities in modeling and simulation, in a language which is made to build fast and parallelized code. And you’re invited.

PS: what about Simulink?

I’m just going to post a self-explanatory recent talk by Jonathan at the NASA Launch Services Program who saw a 15,000x acceleration by moving from Simulink to ModelingToolkit.jl.

Enough said.

The post ModelingToolkit, Modelica, and Modia: The Composable Modeling Future in Julia appeared first on Stochastic Lifestyle.

Index Funds with Mixed-Integer-Programming


We will analyze daily price data for stocks in the Dow Jones index and then try to build an accurate index fund using a small number of the stocks therein.

Similar material was already used in a presentation at PyData Berlin 2017. See the “Tour of popular packages” notebook. Back then, we worked with Julia 0.6 and used the packages DataFrames, Plots and JuMP. Now, we work with Julia 1.0 and use packages from the Queryverse, VegaLite and IndexedTables for data prep and visualization. Also, I added an alternative model.

Loading the Data

In [1]:
using Queryverse
using IndexedTables: ndsparse

The data consists of daily closing prices from 2016 of the 30 stocks that participate in the Dow Jones and are given in a CSV file:

In [2]:
;head -n4 ../files/dowjones2016.csv
Date,Symbol,Price
2016-01-04,AAPL,105.349997999999999
2016-01-04,AXP,67.589995999999999
2016-01-04,BA,140.500000000000000
In [3]:
price = load("../files/dowjones2016.csv") |> ndsparse
price |> @take(3)
Out[3]:
Date Symbol Price
2016-01-04 “AAPL” 105.35
2016-01-04 “AXP” 67.59
2016-01-04 “BA” 140.5
In [4]:
price |> @vlplot(:line, x=:Date, y=:Price, color={"Symbol:o", scale={scheme="viridis"}},
                 width=600, height=450)
Out[4]:

[Figure: line chart of daily closing Price by Date, one line per Symbol (viridis color scale)]

Computing the Dow Jones Index

The Dow Jones index is computed from the prices of its stocks, as a weighted average, where the weight is itself defined through the price of the stock.

We will compute the average price of each stock over the days. Then we will normalize these values by dividing by the total of the average prices. The normalized weights are then multiplied by the daily prices to get the daily value of the index.
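In symbols (my notation, matching the code below): with $p_{s,d}$ the closing price of stock $s$ on day $d$ and $D$ the set of days,

$$\bar{p}_s = \frac{1}{|D|}\sum_{d \in D} p_{s,d}, \qquad \omega_s = \frac{\bar{p}_s}{\sum_{s'} \bar{p}_{s'}}, \qquad I_d = \sum_s \omega_s\, p_{s,d}.$$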

In [5]:
using Statistics: mean
In [6]:
avgprice = price |>
    @groupby(_.Symbol) |>
    @map({Symbol=key(_), AvgPrice=mean(_.Price)}) |> ndsparse
avgprice |> @take(4)
Out[6]:
Symbol AvgPrice
“AAPL” 104.604
“AXP” 63.7933
“BA” 133.112
“CAT” 78.698
In [7]:
avgprice |> @vlplot(:bar, x=:Symbol, y=:AvgPrice)
Out[7]:

[Figure: bar chart of AvgPrice per Symbol]

In [8]:
totalavg = avgprice |> @map(_.AvgPrice) |> sum
Out[8]:
2617.739246769842
In [9]:
weight = avgprice |> @map({Symbol=_.Symbol, Weight=_.AvgPrice / totalavg}) |> ndsparse
weight |> @take(4)
Out[9]:
Symbol Weight
“AAPL” 0.0399597
“AXP” 0.0243696
“BA” 0.0508498
“CAT” 0.0300634
In [10]:
dowjones = price |> 
    @join(weight, _.Symbol, _.Symbol, {_.Date, _.Symbol, Contrib=_.Price * __.Weight}) |>
    @groupby(_.Date) |>
    @map({Date=key(_), Value=sum(_.Contrib)}) |> ndsparse
dowjones |> @take(4)
Out[10]:
Date Value
2016-01-04 100.573
2016-01-05 100.511
2016-01-06 99.0142
2016-01-07 96.606
In [11]:
@vlplot(width=600, height=450) +
@vlplot(mark={:line, strokeWidth=1, opacity=0.8}, data=price, x=:Date, y=:Price,
        color={"Symbol:n", scale={scheme="blues"}}) +
@vlplot(mark={:line, strokeWidth=3, color=:orange}, data=dowjones, x=:Date, y=:Value)
Out[11]:

[Figure: daily Price per Symbol (blue lines) with the computed Dow Jones index value overlaid as a thick orange line]

In the previous plot, the Dow Jones index is marked in orange. In the following, it will be our target, which we want to approximate by using a small number of the stocks.

Model for Linear Fit

Let us start with an optimization model similar to an ordinary linear regression, but using the l1-norm, which is readily formulated as a linear program. In compact vector form, this reads

$$\min_{w} \; \lVert w^{T} P - I \rVert_1 \quad \text{subject to} \quad w \ge 0$$

where $P$ stands for the prices of the individual stocks, $I$ for our index (the target) and $w$ for the weights we use in our fund. We only allow non-negative weights for use in the portfolio.

We can use a standard linear programming trick and introduce auxiliary variables for the positive and negative parts inside the absolute values and minimize their sum. Written out with the notation above (and with $d$ ranging over the days), the reformulation reads:
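$$\begin{aligned}
\min_{w,\,\delta^+,\,\delta^-} \quad & \sum_{d} \left(\delta^+_d + \delta^-_d\right) \\
\text{s.t.} \quad & \sum_{s} w_s\, P_{s,d} - I_d = \delta^+_d - \delta^-_d \quad \text{for all } d, \\
& w \ge 0, \quad \delta^+ \ge 0, \quad \delta^- \ge 0.
\end{aligned}$$

This is formulated using JuMP: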

In [12]:
using JuMP
using SCIP
In [13]:
dates = unique(price.index.columns.Date)
symbols = unique(price.index.columns.Symbol)

length(dates), length(symbols)
Out[13]:
(252, 30)

We could use all days of the year as input for our model, but for our evaluation we only use the days from the first three quarters. That way, we can see how our fit extrapolates through the remaining quarter.

In [14]:
training_days = 189
traindates = dates[1:training_days]
testdates = dates[training_days + 1:end]
@show traindates[end] testdates[1]

# for visualization
datebreak = [(Date=traindates[end],)] |> @take(1)

length(traindates), length(testdates)
traindates[end] = 2016-09-30
testdates[1] = 2016-10-03
Out[14]:
(189, 63)
In [15]:
function solve_index_linear()
    m = Model(solver=SCIPSolver("display/verblevel", 2))
    
    @variable(m, weight[symbols] >= 0)
    @variable(m, posdev[traindates] >= 0)
    @variable(m, negdev[traindates] >= 0)
    
    for d in traindates
        @constraint(m, sum(weight[s]*price[d,s][1] for s in symbols) - dowjones[d][1] == posdev[d] - negdev[d])
    end
    
    @objective(m, :Min, sum(posdev[d] + negdev[d] for d in traindates))
    
    status = solve(m)
    return (
        status = status,
        weight = getvalue(weight),
    )
end
Out[15]:
solve_index_linear (generic function with 1 method)
In [16]:
sol = solve_index_linear()
sol.status
Out[16]:
:Optimal
In [17]:
solweight = [(Symbol=s, SolWeight=sol.weight[s]) for s in symbols]
weight |>
    @join(solweight, _.Symbol, _.Symbol, {_.Symbol, _.Weight, __.SolWeight}) |>
    @vlplot(:point, x=:Weight, y=:SolWeight)
Out[17]:

[Figure: scatter plot of SolWeight vs. Weight; the points lie on the diagonal]

In the scatter chart above, we compare the weights from the Dow Jones index with the weights found by the LP model. As we can see (the points lie on the diagonal), we recover the actual weights perfectly, even though we are only using a subset of the data.

Model with Sparsity Constraint

The last model gave us an exact fit, but used all available stocks.

We now change it to allow only a given number of stocks. To this end, we introduce additional binary variables that select which stocks become active in our index fund.

The weights of inactive stocks must be forced to 0, which we do with a so-called big-M constraint. The constant is chosen in such a way that any active stock could single-handedly approximate the target index.
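In formulas (my notation, matching the code below): with binary variables $a_s$ indicating whether stock $s$ is active and $N$ the allowed number of stocks,

$$w_s \le M\, a_s \ \text{ for all } s, \qquad \sum_s a_s \le N, \qquad M = \frac{1}{\min_s \omega_s},$$

so that any single active stock, scaled by at most $M$, can reach the magnitude of the target index.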

In [18]:
bigM = 1 / minimum(weight.data.columns[1])
Out[18]:
90.926297738811

We start by using only a single stock and then go up to subsets of 6 stocks out of the 30.

In [19]:
nstocks_range = 1:6
Out[19]:
1:6

The model seems to be relatively difficult to solve for SCIP, probably because of the big-M constraints, so we set a gap limit of 5%, rather than solving to proven optimality.

In [20]:
function solve_index_sparse(nstocks)
    m = Model(solver=SCIPSolver("display/verblevel", 2, "limits/gap", 0.05))
    
    @variable(m, active[symbols], Bin)
    @variable(m, weight[symbols] >= 0)
    @variable(m, posdev[traindates] >= 0)
    @variable(m, negdev[traindates] >= 0)
    
    for d in traindates
        @constraint(m, sum(weight[s]*price[d,s][1] for s in symbols) - dowjones[d][1] == posdev[d] - negdev[d])
    end

    for s in symbols
        @constraint(m, weight[s] <= bigM*active[s])
    end
    
    @constraint(m, sum(active[s] for s in symbols) <= nstocks)
    
    @objective(m, :Min, sum(posdev[d] + negdev[d] for d in traindates))
    
    status = solve(m, suppress_warnings=true)
    return (
        status = status,
        weight = getvalue(weight)
    )
end
Out[20]:
solve_index_sparse (generic function with 1 method)
In [21]:
sols = [solve_index_sparse(n) for n in nstocks_range];

We extract the model solutions and concatenate them into a table for visualization.

In [22]:
solweight = [(NStocks=n, Symbol=s, SolWeight=sols[n].weight[s]) for n in nstocks_range for s in symbols] |>
            @filter(_.SolWeight > 0.0)
solweight |> @take(3)
Out[22]:
NStocks Symbol SolWeight
1 “TRV” 0.91443
2 “BA” 0.394428
2 “MMM” 0.313139
In [23]:
indices = price |>
    @join(solweight, _.Symbol, _.Symbol, {__.NStocks, _.Date, _.Symbol, Contrib=_.Price * __.SolWeight}) |>
    @groupby({_.NStocks, _.Date}) |>
    @map({NStocks=_.NStocks[1], Date=_.Date[1], Value=sum(_.Contrib)}) |> ndsparse
indices |> @take(3)
Out[23]:
NStocks Date Value
1 2016-01-04 100.56
1 2016-01-05 101.017
1 2016-01-06 99.7094
In [24]:
@vlplot(width=600, height=300) +
@vlplot(mark={:line, color=:orange, strokeWidth=3},
        data=dowjones, x=:Date, y={:Value, scale={domain=[80, 125]}}) +
@vlplot(mark={:line, strokeWidth=1, opacity=0.8},
        data=indices, x=:Date, y=:Value, color="NStocks:o") +
@vlplot(:rule, data=datebreak, x=:Date)
Out[24]:

[Figure: the target index (orange) and the sparse index funds for 1-6 stocks, with a vertical rule marking the end of the training period]

As we can see, our solutions stick to the target fund quite closely during the training period, but then diverge from it quickly. Allowing more stocks gives us a better fit.

Computing Returns

In the previous model, we tried to fit our index fund to the absolute day-to-day closing prices. But the next model will be based on picking representatives that are similar, where similarity is defined through the correlation of returns.

So let’s start by computing daily returns from the closing prices (for the training period).
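That is, for stock $s$ on day $d$ with previous trading day $d-1$:

$$r_{s,d} = \frac{p_{s,d}}{p_{s,d-1}} - 1.$$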

In [25]:
prevdates = (Date=dates[2:end], PrevDate=dates[1:end-1])
prevdates |> @take(3)
Out[25]:
Date PrevDate
2016-01-05 2016-01-04
2016-01-06 2016-01-05
2016-01-07 2016-01-06
In [26]:
price[:, "AAPL"] |> @take(3)

returns = prevdates |>
    @filter(_.Date <= traindates[end]) |>
    @join(price, _.Date, _.Date, {_.Date, _.PrevDate, __.Symbol, __.Price}) |>
    @join(price, {Date=_.PrevDate, _.Symbol}, {_.Date, _.Symbol},
                 {__.Symbol, _.Date, Return=_.Price / __.Price - 1.0}) |>
    ndsparse
returns |> @take(3)
Out[26]:
Symbol Date Return
“AAPL” 2016-01-05 -0.0250593
“AAPL” 2016-01-06 -0.0195697
“AAPL” 2016-01-07 -0.0422046
In [27]:
returns |>
    @vlplot(mark={:line, opacity=0.8, strokeWidth=1}, width=600, height=450,
            x=:Date, y=:Return, color={"Symbol:o", scale={scheme="viridis"}})
Out[27]:

[Figure: line chart of daily Return by Date, one line per Symbol]

Analyzing Correlation

We compute pair-wise correlation of the 30 stocks, but first need to compute the mean and standard deviations of the returns.
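For reference, the quantities computed by the queries below are (in my notation, with $n$ the number of return days):

$$\mu_s = \frac{1}{n}\sum_d r_{s,d}, \qquad \sigma_s = \sqrt{\frac{1}{n-1}\sum_d (r_{s,d}-\mu_s)^2}, \qquad \rho_{s,s'} = \frac{\frac{1}{n}\sum_d (r_{s,d}-\mu_s)(r_{s',d}-\mu_{s'})}{\sigma_s\,\sigma_{s'}}.$$

Note the mixed $1/n$ and $1/(n-1)$ normalization, which is why the diagonal entries of the correlation table below come out slightly below 1 (e.g. 0.9947 for AAPL with itself).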

In [28]:
returnagg = returns |>
    @groupby(_.Symbol) |>
    @map({Symbol=key(_), MeanReturn=mean(_.Return)}) |>
    @join(returns, _.Symbol, _.Symbol,
          {_.Symbol, __.Date, _.MeanReturn, ShiftedSqr=(__.Return - _.MeanReturn)^2}) |>
    @groupby(_.Symbol) |>
    @map({Symbol=key(_), MeanReturn=_.MeanReturn[1],
          StdDevReturn=sqrt(sum(_.ShiftedSqr)/(length(_) - 1))})
returnagg |> @take(3)
Out[28]:
Symbol MeanReturn StdDevReturn
“AAPL” 0.000503948 0.0160686
“AXP” -0.000165808 0.0153556
“BA” -0.000207316 0.0163469

The mean and standard deviation of the stock returns can be used as performance and risk markers. In the following scatter chart, we give an overview of these (bottom right is best).

In [29]:
returnagg |>
    @vlplot(:point, x=:MeanReturn, y=:StdDevReturn, color={"Symbol:o", scale={scheme="viridis"}},
            width=450, height=450)
Out[29]:

[Figure: scatter plot of StdDevReturn vs. MeanReturn, one point per Symbol]

In [30]:
shiftedreturns = returns |>
    @join(returnagg, _.Symbol, _.Symbol, {_.Symbol, _.Date, ShiftedReturn=_.Return - __.MeanReturn})
shiftedreturns |> @take(3)
Out[30]:
Symbol Date ShiftedReturn
“AAPL” 2016-01-05 -0.0255633
“AAPL” 2016-01-06 -0.0200736
“AAPL” 2016-01-07 -0.0427085
In [31]:
correlation = shiftedreturns |>
    @join(shiftedreturns, _.Date, _.Date,
          {Left=_.Symbol, Right=__.Symbol, Product=_.ShiftedReturn * __.ShiftedReturn}) |>
    @groupby({_.Left, _.Right}) |>
    @map({Left=_.Left[1], Right=_.Right[1], Covariance=mean(_.Product)}) |>
    @join(returnagg, _.Left, _.Symbol, {_.Left, _.Right, _.Covariance, LeftStdDev=__.StdDevReturn}) |>
    @join(returnagg, _.Right, _.Symbol,
          {_.Left, _.Right, Correlation=_.Covariance / (_.LeftStdDev * __.StdDevReturn)}) |>
    ndsparse
correlation |> @take(3)
Out[31]:
Left Right Correlation
“AAPL” “AAPL” 0.994681
“AAPL” “AXP” 0.121703
“AAPL” “BA” 0.400597
In [32]:
correlation |>
    @vlplot(:rect, x=:Left, y=:Right, color=:Correlation, width=450, height=450)
Out[32]:

[Figure: heatmap of the pair-wise Correlation between all pairs of symbols]

As we can see here, all of the stocks are either unrelated or positively related, but never negatively related, which is not at all obvious to me.

Model for Picking Representatives

The following model is taken from the book Optimization Methods in Finance.

Rather than looking at the day-to-day values, it precomputes a similarity for each pair of stocks, and then selects a representative for each stock while limiting the total number of representatives. The actual weight given to the stocks is not part of the model, but computed in a post-processing step.

The authors note that this model can be solved more efficiently using a Lagrangian Relaxation approach, but I found that this is not necessary for our small dataset.
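In formulas (my notation, mirroring the JuMP model below): with binaries $a_s$ (stock $s$ is in the fund) and $x_{r,s}$ (stock $r$ is represented by stock $s$), and $\rho_{r,s}$ the correlations from above,

$$\begin{aligned}
\max \quad & \sum_{r}\sum_{s} \rho_{r,s}\, x_{r,s} \\
\text{s.t.} \quad & \sum_{s} x_{r,s} = 1 \quad \text{for all } r, \\
& x_{r,s} \le a_s \quad \text{for all } r, s, \\
& \sum_{s} a_s \le N, \qquad a_s,\, x_{r,s} \in \{0,1\}.
\end{aligned}$$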

In [33]:
function solve_index_repr(nstocks)
    m = Model(solver=SCIPSolver("display/verblevel", 2))
    
    @variable(m, active[symbols], Bin)       # is stock in index fund?
    @variable(m, repr[symbols,symbols], Bin) # is stock 'r' represented by 's'?
    
    @constraint(m, sum(active[s] for s in symbols) <= nstocks)
    
    for r in symbols
        for s in symbols
            @constraint(m, repr[r,s] <= active[s])
        end
    end
    
    for r in symbols
        @constraint(m, sum(repr[r,s] for s in symbols) == 1)
    end
    
    @objective(m, :Max, sum(correlation[r,s][1] * repr[r,s] for r in symbols for s in symbols))
    
    status = solve(m, suppress_warnings=true)
    
    # post-processing: determine weights for representatives
    reprsol = getvalue(repr)
    accweight = [sum(weight[r][1] * reprsol[r,s] for r in symbols)/avgprice[s][1] for s in symbols]
                    
    return (
        status = status,
        active = getvalue(active),
        weight = accweight * mean(dowjones.data.columns.Value)
    )
end
Out[33]:
solve_index_repr (generic function with 1 method)
In [34]:
sols_repr = [solve_index_repr(n) for n in nstocks_range];
In [35]:
reprweight = [(NStocks=n, Symbol=s, SolWeight=w) for n in nstocks_range
              for (s,w) in zip(symbols,sols_repr[n].weight)] |>
            @filter(_.SolWeight > 0.0)
reprweight |> @take(3)
Out[35]:
NStocks Symbol SolWeight
1 “GE” 3.47077
2 “MMM” 0.409427
2 “V” 0.466415
In [36]:
reprindices = price |>
    @join(reprweight, _.Symbol, _.Symbol, {__.NStocks, _.Date, _.Symbol, Contrib=_.Price * __.SolWeight}) |>
    @groupby({_.NStocks, _.Date}) |>
    @map({NStocks=_.NStocks[1], Date=_.Date[1], Value=sum(_.Contrib)}) |> ndsparse
reprindices |> @take(3)
Out[36]:
NStocks Date Value
1 2016-01-04 106.587
1 2016-01-05 106.691
1 2016-01-06 104.991
In [37]:
@vlplot(width=600, height=300) +
@vlplot(mark={:line, color=:orange, strokeWidth=3},
        data=dowjones, x=:Date, y={:Value, scale={domain=[80, 125]}}) +
@vlplot(mark={:line, strokeWidth=1, opacity=0.8},
        data=reprindices, x=:Date, y=:Value, color="NStocks:o") +
@vlplot(:rule, data=datebreak, x=:Date)
Out[37]:

[Figure: the target index (orange) and the representative-based index funds for 1-6 stocks, with a vertical rule marking the end of the training period]

We can see that the solution funds are not following the target index as closely as before, on a day-to-day basis, but the general shape of the curve is very similar. On the other hand, there seems to be hardly any difference in performance between the training and the test period, so it extrapolates quite well!

Comparison of methods

We could compare the approaches more systematically by computing the losses with different norms (l1 or l2) on the training or test data, but I feel the visualization already gives a good enough impression.

But as a final comparison, let us count how often the different stocks are used for the index funds.

In [38]:
count1 = solweight |>
    @groupby(_.Symbol) |>
    @map({Symbol=key(_), Method="L1Fit", Count=length(_)}) |>
    @orderby_descending(_.Count)
count2 = reprweight|>
    @groupby(_.Symbol) |>
    @map({Symbol=key(_), Method="Repr", Count=length(_)}) |>
    @orderby_descending(_.Count)

counts = [count1...; count2...] |> ndsparse

counts |>
    @vlplot(:bar, column=:Symbol, x={field=:Method, axis={title=""}}, y=:Count, color=:Method)
Out[38]:

[Figure: bar chart counting how often each Symbol is selected, grouped by Method (L1Fit vs. Repr)]

It looks like the two models have very different tastes in stocks and only agree on 2-3 of them!

Notes on packages used

Since this blog was also an exercise in using Queryverse and VegaLite, I would like to comment on my experience now.
For Query itself, building the queries was relatively straightforward after watching the tutorial given by David Anthoff at JuliaCon 2018.

Some of the computations look quite awkward and are probably not efficient either, such as the twofold join of price with itself and another table that links each day with the previous day. Here, an array-based view using [2:end] and [1:end-1] would have been more natural. Also, after a while I found it a little annoying that I had to repeatedly name all of the columns that I wanted to keep, in @join or @map calls.

Finally, the combination of Query and ndsparse of IndexedTables.jl is a little unfortunate, when the latter one is used as a sink for some query. Typically, ndsparse is constructed with a set of columns to be used as an index, and another (single, unnamed) column as a value. Individual values can then be referenced just like an array, that is, table[i,j]. But the constructor of ndsparse used on query results takes the first $n - 1$ columns as index and the last as value, wrapped in a singleton NamedTuple. So, table[i,j] will actually be a (k=23,). This meant that I had to use price[d,s][1] as a coefficient in my model, rather than simply price[d,s]. I could not figure out how to avoid this without creating essentially another copy of the data.