One of the significant challenges faced with creating your own trading models is validation of their correctness. Usually you find a model, implement it in your favourite language, then run some numbers through it to verify it. The verification is often the tricky part, as you simply cannot afford to make mistakes in your calculations.
Ideally you have a reference source where the model has been implemented previously so that you can verify its correctness. This may be a text book or paper, a different programming language, or if you're lucky, the same language as the one you're working with. None of these methods are foolproof, as there might be a typo or bug in your reference source, but at one can argue having more then one implementation increases the likelihood of correctness.
A common scenario is that you'll have a version of the model in Excel, which appears to be where most finance models start off life.
For instance, say we have a spreadsheet containing volatility predictions using 30 day and 60 day lookback periods.
We use calculate the standard deviation of log returns based on the population variance, using the following formula:
Where is our lookback period, is our close price on day , and the term we subtract is simply the mean log return, .
Our spreadsheet contains these calculations for AAPL over 30 day and 60 day periods, using data taken from Yahoo Finance.
We then implement the same formula in Python (the lag function is the same one as I used here):
import numpy as np from numpy import log, sqrt import utils ANNUALISER = sqrt(252.0) def annualise(data): return data * ANNUALISER def population_std_dev(close_prices, lookback): N = float(lookback) prices = log(close_prices / utils.lag(close_prices)) results = np.zeros(np.size(prices)) results[:] = np.NAN for i in range(lookback, len(prices)): bounds = range(i-(lookback-1), i+1) results[i] = sqrt( ((prices[bounds] - prices[bounds].sum() / N)**2).sum() / (N - 1)) return annualise(results)
We can combine tools used in those posts to create an automated test that:
- Reads in the adjusted close data for AAPL from our spreadsheet
- Calculates the rolling standard deviation using our Python code
- Verifies that the results produced by our code match those produced by Excel
The code for this test is as follows:
import unittest import numpy.testing as ntest import pandas as pd import model_validation.volatility as volatility FILENAME = 'data/AAPL_std_dev.xlsx' class TestVolatility(unittest.TestCase): def setUp(self): self.xls_file = pd.ExcelFile(FILENAME) self.data = self.xls_file.parse('AAPL Std Dev', header=0, index_col=0, parse_cols='A:J') # We want to reverse our series so the first index is the oldest # entry not newest self.data.sort_index(inplace=True) def test_population_std_dev(self): vol30 = volatility.population_std_dev(self.data['Adj Close'], 30) ntest.assert_array_almost_equal(self.data['30 Day Vol'], vol30) vol60 = volatility.population_std_dev(self.data['Adj Close'], 60) ntest.assert_array_almost_equal(self.data['60 Day Vol'], vol60)
We use Pandas Excel file parser to load our spreadsheet data. Behind the scenes this uses xlrd which I mentioned previously. You can specify all the specifics of where the data is that you're interested in in the sheet - the tab, which columns, the row offset your data starts, etc.
self.xls_file = pd.ExcelFile(FILENAME) self.data = self.xls_file.parse('AAPL Std Dev', header=0, index_col=0, parse_cols='A:J')
Once we have loaded the data, as it contains the most recent data first, we want to reverse it, such that the first entry is our oldest entry, and the last is our most recent.
We then call our standard deviation calculation, and verify that the result produced by our code is the same as in Excel!
vol30 = volatility.population_std_dev(self.data['Adj Close'], 30) ntest.assert_array_almost_equal(self.data['30 Day Vol'], vol30) vol60 = volatility.population_std_dev(self.data['Adj Close'], 60) ntest.assert_array_almost_equal(self.data['60 Day Vol'], vol60)
We use Numpy's assert_array_almost_equal function to verify the correctness of our data. By default this ensures our data sets are identical to 6 decimal places. Sometimes you may need to lower this precision to 4 or 5 places due to slight differences in Excel versus Python's arithmetic implementations, but if your code is correct, you shouldn't find yourself very far off.
You now have a model implemented in two places that are producing the same results! Although it's not an absolute guarantee of the correctness of your models, it's a pretty cool way of increasing your confidence in them. Packages also exist for interfacing with MATLAB and R from Python, so there's no reason why you couldn't apply similar logic to validate against them also (although you do require a local install of MATLAB/R to use them).
As usual the code for this post is available in its entirety here. I've also added verification against Pandas rolling_mean implementation too for completeness (which is way faster then my implementation as it uses Cython to speed things up).
def pandas_std_dev(close_prices, lookback): prices = log(close_prices / utils.lag(close_prices)) return annualise(pd.rolling_std(prices, window=lookback))