In my last post I demonstrated how you can generate synthetic futures prices. In this post I am going to build on this and show how you can apply principal component analysis (PCA) to determine how much of the variability in returns each of the different futures is responsible for.

Creating our data set was actually the harder part of the work. There are a number of PCA implementations available in Python we can use for the maths.

By the end of the last post we had generated the following constant maturity VIX *prices* dataset.

```
            30.0       60.0       90.0       120.0      150.0      180.0
2009-01-02  39.143667  39.429000  37.445000  36.195917  36.038000  35.935778
2009-01-05  40.053333  39.538667  37.558000  36.401667  36.018000  35.667389
2009-01-06  39.540000  38.969167  37.339556  36.621000  36.058933  36.017000
2009-01-07  42.976000  41.777000  39.765556  38.496167  37.849000  37.387722
2009-01-08  42.828333  41.770000  39.412000  38.150833  37.114400  37.217778
...
2014-12-23  15.951667  16.752500  17.300000  17.745417  18.079000  18.382778
2014-12-24  16.271667  16.955000  17.405000  17.851667  18.230500  18.510417
2014-12-26  15.918333  16.577500  17.096667  17.608333  17.985000  18.290556
2014-12-29  16.226667  16.785000  17.186111  17.668333  18.041000  18.344444
2014-12-30  16.585000  17.083333  17.435000  17.865833  18.193000  18.499444
```

We continue by translating each set of constant maturity prices into log returns.

```python
returns = pd.DataFrame(index=prices.index)
for key in prices:
    returns[key] = calculate_log_returns(prices[key])
```

Here we reuse the log returns calculation from the previous posts.

```python
def lag(data, empty_term=0., axis=0):
    lagged = np.roll(data, 1, axis=axis)
    lagged[0] = empty_term
    return lagged


def calculate_log_returns(prices):
    lagged_pnl = lag(prices)
    returns = np.log(prices / lagged_pnl)

    # All values prior to our position opening in pnl will have a
    # value of inf. This is due to division by 0.0
    returns[np.isinf(returns)] = 0.

    # Additionally, any values of 0 / 0 will produce NaN
    returns[np.isnan(returns)] = 0.
    return returns
```
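As a quick sanity check, the helpers above can be exercised on a tiny made-up price series (the numbers here are purely illustrative):

```python
import numpy as np


def lag(data, empty_term=0., axis=0):
    lagged = np.roll(data, 1, axis=axis)
    lagged[0] = empty_term
    return lagged


def calculate_log_returns(prices):
    lagged_pnl = lag(prices)
    returns = np.log(prices / lagged_pnl)
    # The first slot divides by the 0.0 placeholder, yielding inf;
    # zero it out along with any 0 / 0 NaNs
    returns[np.isinf(returns)] = 0.
    returns[np.isnan(returns)] = 0.
    return returns


prices = np.array([100., 110., 99.])
rets = calculate_log_returns(prices)
print(rets)  # first entry forced to 0, then log(110/100), log(99/110)
```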

There are two stages to performing the principal component analysis. First we need to determine the correlation matrix of our different returns series. We use Pandas' corr function for this.

corr = returns.corr()

The correlation matrix contains the pairwise correlations between each of our returns series. You could alternatively use the covariance matrix in PCA, provided your different data series are on a comparable scale (which in this instance they are).

The correlation between different series is always normalised to be in the range -1 to 1. Whereas the covariance is similar to the variance, except that it is a measure of how much two random variables differ from one another, unlike the variance which is a measure of how much a random variable differs from its mean. Pandas provides a covariance function, so you can experiment with that too.
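The relationship between the two can be seen directly: correlation is just covariance rescaled by the standard deviations of the two series. A minimal sketch with a toy returns frame (the column names and random numbers are illustrative, not from the VIX data):

```python
import numpy as np
import pandas as pd

# Two correlated toy return series sharing a common driver
rng = np.random.default_rng(0)
base = rng.normal(0, 0.02, 500)
returns = pd.DataFrame({
    '30.0': base + rng.normal(0, 0.005, 500),
    '60.0': base + rng.normal(0, 0.005, 500),
})

corr = returns.corr()
cov = returns.cov()

# Rescaling the covariance by the standard deviations recovers the
# correlation matrix exactly
stds = returns.std()
rescaled = cov / np.outer(stds, stds)
print(np.allclose(corr.values, rescaled.values))  # True
```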

```
       30.0      60.0      90.0      120.0     150.0     180.0
30.0   1.000000  0.964496  0.942159  0.920386  0.892322  0.878156
60.0   0.964496  1.000000  0.963110  0.915421  0.895599  0.896812
90.0   0.942159  0.963110  1.000000  0.926136  0.909014  0.933444
120.0  0.920386  0.915421  0.926136  1.000000  0.927316  0.913841
150.0  0.892322  0.895599  0.909014  0.927316  1.000000  0.933129
180.0  0.878156  0.896812  0.933444  0.913841  0.933129  1.000000
```

Second, to perform the PCA itself, we use NumPy's linalg.eig() function to obtain our eigenvalues and eigenvectors. Eigenvalues and eigenvectors are values and vectors that characterise our matrix, and they are used to decompose our data into its principal components.

eig_val, eig_vec = np.linalg.eig(corr)
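One caveat worth noting: `np.linalg.eig` makes no guarantee about the ordering of its eigenvalues, so they are not necessarily returned largest-first (as the output further down shows). Since a correlation matrix is symmetric, `np.linalg.eigh` is a natural alternative, and the pairs can then be sorted explicitly. A sketch, using a small hand-made correlation-like matrix rather than the VIX data:

```python
import numpy as np

# A small symmetric, correlation-like matrix for illustration
corr = np.array([[1.00, 0.90, 0.80],
                 [0.90, 1.00, 0.85],
                 [0.80, 0.85, 1.00]])

# eigh is designed for symmetric/Hermitian matrices and returns
# eigenvalues in ascending order
eig_val, eig_vec = np.linalg.eigh(corr)

# Re-order so the largest eigenvalue (the first principal component)
# comes first; the eigenvector columns must be permuted the same way
order = np.argsort(eig_val)[::-1]
eig_val = eig_val[order]
eig_vec = eig_vec[:, order]

print(eig_val)  # descending order; sums to the trace of the matrix
```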

Armed with our eigenvalues and eigenvectors, we can determine the proportion of the variance explained by the *i*th factor with the following calculation:

```
p_i = λ_i / (λ_1 + λ_2 + ... + λ_k)
```

Where λ_i is the *i*th eigenvalue and k is the number of factors (here, the six maturities); the components of the corresponding eigenvector give each maturity's loading on that factor.

```python
prop_of_variation = eig_val / eig_val.sum()
print('Proportion of variation explained: {}'.format(prop_of_variation))
```

This produces our result:

```
Proportion of variation explained: [ 0.93401632  0.02888954  0.00440268
  0.00637289  0.01540229  0.01091628]
```

This result demonstrates that 93.4% of the variance in the system is attributed to the first factor, associated here with the 30 day synthetic future, 2.89% to the 60 day future, and the remainder to different combinations of the other futures. (Note that np.linalg.eig makes no ordering guarantee, which is why the remaining proportions are not sorted in descending order.)

Of course, this is fairly intuitive - one would expect the nearest-expiring future to be the most volatile. By using PCA, however, we obtain a statistical breakdown of the contribution of each individual component to the result set.

Its potential application is far wider: you could apply a similar technique to identify the key variance contributors of any group of price series. Ernie Chan, in his book, demonstrates how one can use PCA to identify the stocks from an index with the best and worst returns, and base a (poorly-performing) strategy on this.
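To make that "wider application" concrete, the eigenvector belonging to the largest eigenvalue can also be used to extract the dominant factor's time series from any group of returns. A minimal, self-contained sketch on simulated data (one common driver plus noise, loosely mimicking a futures curve - none of this comes from the VIX data set):

```python
import numpy as np

# Four toy return series sharing one common driver
rng = np.random.default_rng(1)
common = rng.normal(0, 0.02, (400, 1))
returns = common + rng.normal(0, 0.004, (400, 4))

# Correlation matrix and sorted eigen-decomposition
corr = np.corrcoef(returns, rowvar=False)
eig_val, eig_vec = np.linalg.eigh(corr)
order = np.argsort(eig_val)[::-1]
eig_val, eig_vec = eig_val[order], eig_vec[:, order]

# Project standardised returns onto the first eigenvector to obtain
# the time series of the dominant factor
z = (returns - returns.mean(axis=0)) / returns.std(axis=0)
factor = z @ eig_vec[:, 0]

prop = eig_val / eig_val.sum()
print(prop[0])  # the dominant factor's share of the variance
```

Because the series here are built from one shared driver, the first component captures the bulk of the variance, much as the front-month future dominates the VIX result above.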

For further background on PCA, I suggest you start with this great visual explanation of PCA, and the accompanying post on eigenvalues and eigenvectors.

The code in its entirety for both posts is available below.

```python
import datetime as dt
import glob
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn

FUTURES_MONTHS = ['F', 'G', 'H', 'J', 'K', 'M', 'N', 'Q', 'U', 'V', 'X', 'Z']
FUTURES_DATA_DIR = '/path/to/data'


def main():
    futures_prices = load_futures_prices()

    n_futures = [30., 60., 90., 120., 150., 180.]

    start = '01/01/2009'
    end = '30/12/2014'
    dates = pd.date_range(start, end, normalize=True)

    prices = pd.DataFrame(index=dates)
    for n in n_futures:
        result = load_vix_future(n, futures_prices, dates)
        # You may want to add some additional validation here looking for NaNs...
        prices[str(n)] = result

    prices = prices.dropna()

    prices.plot()
    plt.title('Constant maturity synthetic VIX futures prices')
    plt.show()

    returns = pd.DataFrame(index=prices.index)
    for key in prices:
        returns[key] = calculate_log_returns(prices[key])

    corr = returns.corr()
    print(corr)

    eig_val, eig_vec = np.linalg.eig(corr)
    prop_of_variation = eig_val / eig_val.sum()
    print(eig_val)
    print(eig_vec)
    print('Proportion of variation explained: {}'.format(prop_of_variation))


def load_futures_prices():
    return load_vix_futures_prices(
        FUTURES_DATA_DIR, price='Settle', start_year=2005)


def load_vix_future(n, futures_prices, dates):
    ratios = np.empty(len(dates))
    for i in range(0, len(ratios)):
        ratios[i] = get_vix_future_ndays(n, futures_prices, dates[i].date())
    return ratios


def get_vix_future_ndays(n, futures_prices, curr_date):
    """
    Weighted future expiring n days from now.
    """
    # We use n-1 as we roll to the next month if our target date falls on an
    # expiry date; we want that month's contract, not the next
    target_date = curr_date + dt.timedelta(days=n - 1)

    prev_expiry_date = get_next_expiry_date(curr_date)
    next_expiry_date = get_next_expiry_date(
        prev_expiry_date + dt.timedelta(days=1))

    while next_expiry_date < target_date:
        prev_expiry_date = next_expiry_date
        next_expiry_date = get_next_expiry_date(
            prev_expiry_date + dt.timedelta(days=1))

    prices = get_futures_prices(
        futures_prices, curr_date, prev_expiry_date, 2)

    near_date_maturity = (prev_expiry_date - curr_date).days

    if prices[0] is np.NaN or prices[1] is np.NaN:
        return np.NaN
    else:
        return (near_date_maturity / n) * prices[0] + \
            ((n - near_date_maturity) / n) * prices[1]


def get_futures_prices(futures_prices, curr_date, expiry_date, month_count):
    def get_price_value(prices, curr_date):
        if curr_date in prices:
            return prices[curr_date]
        else:
            return np.NaN

    year = expiry_date.year
    month = expiry_date.month

    prices = []
    prices.append(
        get_price_value(futures_prices[year][month - 1], curr_date))

    remaining = month_count - 1
    next_month = month
    next_year = year
    while remaining > 0:
        if next_month == 12:
            next_month = 1
            next_year += 1
        else:
            next_month += 1

        prices.append(
            get_price_value(
                futures_prices[next_year][next_month - 1], curr_date))
        remaining -= 1

    return prices


def get_next_expiry_date(curr_date):
    expiry_date = get_expiry_date_for_month(curr_date)
    # It must be less than the expiry date, as on expiry date we only have
    # a settlement price
    if curr_date < expiry_date:
        return expiry_date
    else:
        return get_expiry_date_for_month(curr_date + dt.timedelta(days=30))


def get_expiry_date_for_month(curr_date):
    """
    http://cfe.cboe.com/products/spec_vix.aspx

    TERMINATION OF TRADING:

    Trading hours for expiring VIX futures contracts end at 7:00 a.m.
    Chicago time on the final settlement date.

    FINAL SETTLEMENT DATE:

    The Wednesday that is thirty days prior to the third Friday of the
    calendar month immediately following the month in which the contract
    expires ("Final Settlement Date"). If the third Friday of the month
    subsequent to expiration of the applicable VIX futures contract is a
    CBOE holiday, the Final Settlement Date for the contract shall be
    thirty days prior to the CBOE business day immediately preceding that
    Friday.
    """
    # Date of third Friday of the following month
    if curr_date.month == 12:
        third_friday_next_month = dt.date(curr_date.year + 1, 1, 15)
    else:
        third_friday_next_month = dt.date(curr_date.year,
                                          curr_date.month + 1, 15)

    one_day = dt.timedelta(days=1)
    thirty_days = dt.timedelta(days=30)

    while third_friday_next_month.weekday() != 4:
        # Using += results in a timedelta object
        third_friday_next_month = third_friday_next_month + one_day

    # TODO: Incorporate check that it's a trading day; if so move the 3rd
    # Friday back by one day before subtracting
    return third_friday_next_month - thirty_days


def load_vix_futures_prices(source_dir, price='Close', start_year=2005,
                            end_year=2099):
    """
    Dictionary of dataframe price data in format CFE_[M][YY]_VX.csv where
    M is []

    start_year and end_year parameters refer to futures we are interested
    in, not dates we have price data for.

    :return data[YYYY][M] = dataframe

    Where YYYY is expiry year, M is expiry month in range [0, 11]
    """
    data = {}
    files = glob.glob(os.path.join(source_dir, 'CFE_*'))

    for f in files:
        filename = os.path.basename(f)
        month = FUTURES_MONTHS.index(filename[4])
        year = int('20' + filename[5] + filename[6])

        if year < start_year or year > end_year:
            continue

        try:
            df = load_symbol_data(f, index=0, header_row=0)
        except IndexError:
            df = load_symbol_data(f, index=0, header_row=1)

        if year not in data:
            data[year] = 12 * [None]
        data[year][month] = df[price]

    return data


def load_symbol_data(filename, index=0, header_row=0, order_ascending=True):
    df = pd.read_csv(filename, index_col=index, header=header_row,
                     parse_dates=True)
    if order_ascending:
        # Ensure data is in ascending order by date
        if df.index[0] > df.index[1]:
            df.sort_index(inplace=True)
    return df


def np_print_full(*args, **kwargs):
    from pprint import pprint
    opt = np.get_printoptions()
    np.set_printoptions(threshold='nan')
    pprint(*args, **kwargs)
    np.set_printoptions(**opt)


def lag(data, empty_term=0., axis=0):
    lagged = np.roll(data, 1, axis=axis)
    lagged[0] = empty_term
    return lagged


def calculate_log_returns(prices):
    lagged_pnl = lag(prices)
    returns = np.log(prices / lagged_pnl)

    # All values prior to our position opening in pnl will have a
    # value of inf. This is due to division by 0.0
    returns[np.isinf(returns)] = 0.

    # Additionally, any values of 0 / 0 will produce NaN
    returns[np.isnan(returns)] = 0.
    return returns


if __name__ == '__main__':
    main()
```
