By: Christian Groll
Re-posted from: http://grollchristian.wordpress.com/2014/08/13/missing-data/
1 Missing stock price data
When downloading historic stock price data it happens quite frequently that some observations in the middle of the sample are missing. Hence the question: how should we cope with this? There are several ways how we could process the data, each approach coming with its own advantages and disadvantages, and we want to compare some of the most common approaches in this post.
In any case, however, we want to treat missing values as NA and not as Julia’s built-in NaN (a short justification on why NA is more suitable can be found here). Hence, data best should be treated through DataFrames or – if the data comes with time information – through type Timenum from the TimeData package. In the following, we will use these packages in order to show some common approaches to deal with missing stock price data using an artificially made up data set that shall represent logarthmic prices.
The reason why we are looking at logarthmic prices and returns instead of normal prices and net returns is just that logarithmic returns are defined as simple difference between logarithmic prices of successive days:
This way, our calculations simply involve nicer numbers, and all results equally hold for normal prices and returns as well.
We will be using the following exemplary data set of logarthmic prices for the comparison of different approaches:
using TimeData using Dates using Econometrics df = DataFrame() df[:stock1] = @data([100, 120, 140, 170, 200]) df[:stock2] = @data([110, 120, NA, 130, 150]) dats = [Date(2010, 1, 1):Date(2010, 1, 5)] originalPrices = Timenum(df, dats)
| idx | stock1 | stock2 |
| 2010-01-01 | 100 | 110 |
| 2010-01-02 | 120 | 120 |
| 2010-01-03 | 140 | NA |
| 2010-01-04 | 170 | 130 |
| 2010-01-05 | 200 | 150 |
One possible explanation for such a pattern in the data could be that the two stocks are from different countries, and only the country of the second stock has a holiday at January the 3rd.
Quite often in such a situation, people just refrain from any deviations from the basic calculation formula of logarithmic returns and calculate the associated returns as simple differences. This way, however, each missing observation NA will lead to two NAs in the return series:
simpleDiffRets = originalPrices[2:end, :] .- originalPrices[1:(end-1), :]
| idx | stock1 | stock2 |
| 2010-01-02 | 20 | 10 |
| 2010-01-03 | 20 | NA |
| 2010-01-04 | 30 | NA |
| 2010-01-05 | 30 | 20 |
For example, this also is the approach followed by the PerformanceAnalytics package in R:
library(tseries)
library(PerformanceAnalytics)
stockPrices1 <- c(100, 120, 140, 170, 200)
stockPrices2 <- c(110, 120, NA, 130, 150)
## combine in matrix and name columns and rows
stockPrices <- cbind(stockPrices1, stockPrices2)
dates <- seq(as.Date('2010-1-1'),by='days',length=5)
colnames(stockPrices) <- c("A", "B")
rownames(stockPrices) <- as.character(dates)
(stockPrices)
returns = Return.calculate(exp(stockPrices), method="compound")
| nil | nil |
| 20 | 10 |
| 20 | nil |
| 30 | nil |
| 30 | 20 |
When we calculate returns as the difference between successive closing prices and
, a single return simply represents all price movements that happened at day
, including the opening auction that determines the very first price at this day.
Thinking about returns this way, it obviously makes sense to assign a value of NA to each day of the return series where a stock exchange was closed due to holiday, since there simply are no stock price movements on that day. But why would we set the next day’s return to NA as well?
In other words, we should distinguish between two different cases of NA values for our prices:
NAoccurs because the stock exchange was closed this day and hence there never were any price movements- the stock exchange was open that day, and in reality there were some price changes. However, due to a deficiency of our data set, we do not know the price of the respective day.
For the second case, we really would like to have two consecutive values of NA in our return series. Knowing only the prices in and
, there is no way how we could reasonably deduce the value that the price did take on in
. Hence, there are infinitely many possibilities to allocate a certain two-day price increase to two returns.
For the first case, however, it seems unnecessarily rigid to force the return series to have two NA values: allocating all of the two-day price increase to the one day where the stock exchange was open, and a missing value NA to the day that the stock exchange was closed doesn’t seem to be too arbitrary.
This is how returns are calculated by default in the (not yet registered) Econometrics package.
logRet = price2ret(originalPrices, log = true)
| idx | stock1 | stock2 |
| 2010-01-02 | 20 | 10 |
| 2010-01-03 | 20 | NA |
| 2010-01-04 | 30 | 10 |
| 2010-01-05 | 30 | 20 |
And, the other way round, aggregating the return series again will also keep NAs for the respective days, but otherwise perform the desired aggregation. Without specified initial prices, aggregated prices will all start with value 0 for logarithmic prices, and hence express something like normalized prices that allow a nice comparison of different stock price evolutions.
normedPrices = ret2price(logRet, log = true)
| idx | stock1 | stock2 |
| 2010-01-01 | 0 | 0 |
| 2010-01-02 | 20 | 10 |
| 2010-01-03 | 40 | NA |
| 2010-01-04 | 70 | 20 |
| 2010-01-05 | 100 | 40 |
To regain the complete price series (together with a definitely correct starting date), one simply needs to additionally specify the original starting prices.
truePrices = ret2price(logRet, originalPrices, log = true)
| idx | stock1 | stock2 |
| 2010-01-01 | 100 | 110 |
| 2010-01-02 | 120 | 120 |
| 2010-01-03 | 140 | NA |
| 2010-01-04 | 170 | 130 |
| 2010-01-05 | 200 | 150 |
In some cases, however, missing values NA may not be allowed. This could be a likelihood function that requires real values only, or some plotting function. For these cases NAs easily can be removed through imputation. For log price plots, a meaningful way would be:
impute!(truePrices, "last")
| idx | stock1 | stock2 |
| 2010-01-01 | 100 | 110 |
| 2010-01-02 | 120 | 120 |
| 2010-01-03 | 140 | 120 |
| 2010-01-04 | 170 | 130 |
| 2010-01-05 | 200 | 150 |
However, for log returns, the associated transformation then would artificially incorporate values of 0:
impute!(logRet, "zero")
| idx | stock1 | stock2 |
| 2010-01-02 | 20 | 10 |
| 2010-01-03 | 20 | 0 |
| 2010-01-04 | 30 | 10 |
| 2010-01-05 | 30 | 20 |
As an alternative to replacing NA values, one could also simply remove the respective dates from the data set. Again, there are two options how this could be done.
First, one could remove any missing observations directly in the price series:
originalPrices2 = deepcopy(originalPrices) noNAprices = convert(TimeData.Timematr, originalPrices2, true)
| idx | stock1 | stock2 |
| 2010-01-01 | 100 | 110 |
| 2010-01-02 | 120 | 120 |
| 2010-01-04 | 170 | 130 |
| 2010-01-05 | 200 | 150 |
For the return series, however, we then have a – probably large – multi-day price jump that seems to be a single-day return. In our example, we suddenly observe a return of 50 for the first stock.
logRet = price2ret(noNAprices)
| idx | stock1 | stock2 |
| 2010-01-02 | 20 | 10 |
| 2010-01-04 | 50 | 10 |
| 2010-01-05 | 30 | 20 |
A second way to eliminate NAs would be to remove them from the return series.
logRet = price2ret(originalPrices) noNAlogRet = convert(TimeData.Timematr, logRet, true)
| idx | stock1 | stock2 |
| 2010-01-02 | 20 | 10 |
| 2010-01-04 | 30 | 10 |
| 2010-01-05 | 30 | 20 |
However, deriving the associated price series for this processed return series will then lead to deviating end prices:
noNAprices = ret2price(noNAlogRet, originalPrices)
| idx | stock1 | stock2 |
| 2010-01-01 | 100 | 110 |
| 2010-01-02 | 120 | 120 |
| 2010-01-04 | 150 | 130 |
| 2010-01-05 | 180 | 150 |
as opposed to the real end prices
originalPrices
| idx | stock1 | stock2 |
| 2010-01-01 | 100 | 110 |
| 2010-01-02 | 120 | 120 |
| 2010-01-03 | 140 | NA |
| 2010-01-04 | 170 | 130 |
| 2010-01-05 | 200 | 150 |
The first stock now suddenly ends with a price of only 180 instead of 200.
2 Summary
The first step when facing a missing price observation is to think whether it might make sense to treat only one return as missing, assigning the complete price movement to the other return. This is perfectly reasonable for days where the stock market really was closed. In all other cases, however, it might make more sense to calculate logarithmic returns as simple differences, leading to two NAs in the return series.
Once there are NA values present, we can chose between three options.
2.1 Keeping NAs
Keeping NA values might be cumbersome in some situations, since some functions might only be working for data without NA values.
2.2 Replacing NAs
In cases where NAs may not be present, there sometimes might exist ways of replacing them that perfectly make sense. However, manually replacing observations in some way means messing with the original data, and one should be cautious to not incorporate any artificial patterns this way.
2.3 Removing NAs
Obviously, when dates with NA values for only some variables are eliminated completely, we simultaneously lose data for those variables where observations originally were present. Furthermore, eliminating cases with NAs for returns will lead to price evolutions that are different to the original prices.
2.4 Overview
Possible prices:
| idx | simple diffs | single NA | replace w 0 | rm NA price | rm NA return |
| 2010-01-01 | 100, 110 | 100, 110 | 100, 100 | 100, 110 | 100, 110 |
| 2010-01-02 | 120, 120 | 120, 120 | 120, 120 | 120, 120 | 120, 120 |
| 2010-01-03 | 140, NA | 140, NA | 140, 120 | ||
| 2010-01-04 | 170, 130 | 170, 130 | 170, 130 | 170, 130 | 150, 130 |
| 2010-01-05 | 200, 150 | 200, 150 | 200, 150 | 200, 150 | 180, 150 |
Possible returns:
| idx | simple diffs | single NA | replace w 0 | rm NA price | rm NA return |
| 2010-01-02 | 20, 10 | 20, 10 | 20, 10 | 20, 10 | 20, 10 |
| 2010-01-03 | 20, NA | 20, NA | 20, 0 | ||
| 2010-01-04 | 30, NA | 30, 10 | 30, 10 | 50, 10 | 30, 10 |
| 2010-01-05 | 30, 20 | 30, 20 | 30, 20 | 30, 20 | 30, 20 |
Filed under: financial econometrics, Julia Tagged: data, missing data, returns
![]()