# Variance Factors on VIX Futures II - Principal Component Analysis

In my last post I demonstrated how you can generate synthetic futures prices. In this post I am going to build on this and show how you can apply principal component analysis (PCA) to determine how much of the variability in returns each of the different futures is responsible for.

Creating our data set was actually the harder part of the work. There are a number of PCA implementations available in Python we can use for the maths.

By the end of the last post we had generated the following constant maturity VIX prices dataset.

                 30.0       60.0       90.0      120.0      150.0      180.0
2009-01-02  39.143667  39.429000  37.445000  36.195917  36.038000  35.935778
2009-01-05  40.053333  39.538667  37.558000  36.401667  36.018000  35.667389
2009-01-06  39.540000  38.969167  37.339556  36.621000  36.058933  36.017000
2009-01-07  42.976000  41.777000  39.765556  38.496167  37.849000  37.387722
2009-01-08  42.828333  41.770000  39.412000  38.150833  37.114400  37.217778
...
2014-12-23  15.951667  16.752500  17.300000  17.745417  18.079000  18.382778
2014-12-24  16.271667  16.955000  17.405000  17.851667  18.230500  18.510417
2014-12-26  15.918333  16.577500  17.096667  17.608333  17.985000  18.290556
2014-12-29  16.226667  16.785000  17.186111  17.668333  18.041000  18.344444
2014-12-30  16.585000  17.083333  17.435000  17.865833  18.193000  18.499444


We continue by translating each set of constant maturity prices into log returns.

    returns = pd.DataFrame(index=prices.index)

    for key in prices:
        returns[key] = calculate_log_returns(prices[key])


This uses the log returns calculation we have used in previous posts.

    def lag(data, empty_term=0., axis=0):
        lagged = np.roll(data, 1, axis=axis)
        lagged[0] = empty_term
        return lagged

    def calculate_log_returns(prices):
        lagged_pnl = lag(prices)
        returns = np.log(prices / lagged_pnl)

        # All values prior to our position opening will have a value of inf.
        # This is due to division by 0.0
        returns[np.isinf(returns)] = 0.
        # Additionally, any values of 0 / 0 will produce NaN
        returns[np.isnan(returns)] = 0.
        return returns
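As a quick sanity check of this edge-case handling, here is a standalone sketch (with toy prices, not the VIX data) showing that the first return, which involves a division by zero, is clamped to 0:

```python
import numpy as np

def lag(data, empty_term=0., axis=0):
    lagged = np.roll(data, 1, axis=axis)
    lagged[0] = empty_term
    return lagged

def calculate_log_returns(prices):
    lagged_pnl = lag(prices)
    returns = np.log(prices / lagged_pnl)
    returns[np.isinf(returns)] = 0.  # first element: division by 0.0
    returns[np.isnan(returns)] = 0.  # any 0 / 0 cases
    return returns

prices = np.array([20.0, 22.0, 21.0, 21.0])
print(calculate_log_returns(prices))
# first entry is clamped to 0.0; the rest are log(p_t / p_{t-1})
```

(The division by zero triggers a NumPy RuntimeWarning, which is harmless here.)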


There are two stages to performing the principal component analysis. First we need to determine the correlation matrix of our different returns series. We use Pandas' corr function for this.

    corr = returns.corr()


The correlation matrix contains the pairwise correlations between each of our returns series. You could alternatively use the covariance matrix in PCA, provided your different data series are on a like scale (which in this instance they are).

The correlation between two series is always normalised to the range -1 to 1. The covariance, by contrast, is a measure of how much two random variables vary together, in the same way that the variance measures how much a single random variable varies around its mean; it is not normalised, so it retains the scale of the underlying data. Pandas also provides a covariance function, cov, so you can experiment with that too.

           30.0      60.0      90.0     120.0     150.0     180.0
30.0   1.000000  0.964496  0.942159  0.920386  0.892322  0.878156
60.0   0.964496  1.000000  0.963110  0.915421  0.895599  0.896812
90.0   0.942159  0.963110  1.000000  0.926136  0.909014  0.933444
120.0  0.920386  0.915421  0.926136  1.000000  0.927316  0.913841
150.0  0.892322  0.895599  0.909014  0.927316  1.000000  0.933129
180.0  0.878156  0.896812  0.933444  0.913841  0.933129  1.000000
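To make the scale difference concrete, here is a small sketch with made-up data (not the VIX series): correlation entries stay bounded in [-1, 1] regardless of units, while covariance entries scale with the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
# 'b' carries the same signal as 'a', but on twice the scale plus noise
df = pd.DataFrame({'a': x, 'b': 2 * x + rng.normal(scale=0.5, size=500)})

print(df.corr())  # scale-free: entries bounded by [-1, 1]
print(df.cov())   # scale-dependent: diagonal holds each column's variance
```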


Second, to perform the PCA itself, we use NumPy's linalg.eig() function to obtain the eigenvalues and eigenvectors of the correlation matrix. These characterise the matrix and allow us to decompose our data into its principal components.

    eig_val, eig_vec = np.linalg.eig(corr)
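To see what these two objects are, here is a quick check on a hypothetical 2x2 correlation matrix: each column of eig_vec pairs with the eigenvalue at the same index, and the eigenvalues of an n x n correlation matrix sum to its trace, which is n.

```python
import numpy as np

corr = np.array([[1.0, 0.8],
                 [0.8, 1.0]])
eig_val, eig_vec = np.linalg.eig(corr)

# Each column v of eig_vec satisfies corr @ v = lambda * v
for lam, v in zip(eig_val, eig_vec.T):
    assert np.allclose(corr @ v, lam * v)

print(eig_val.sum())  # equals the trace (= 2) up to floating point
```

Since a correlation matrix is symmetric, np.linalg.eigh is usually preferable in practice: it guarantees real output and returns the eigenvalues in ascending order.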


Armed with our eigenvalues and eigenvectors, we can determine the proportion of the variance for the $i$th factor ($P_i$) with the following calculation:

$P_i = \dfrac{\lambda_i}{\lambda_1 + ... + \lambda_n}$

Where $\lambda_i$ is the $i$th eigenvalue, and $\lambda_1 + ... + \lambda_n$ is the sum of all $n$ eigenvalues.

    prop_of_variation = eig_val / eig_val.sum()
    print('Proportion of variation explained: {}'.format(prop_of_variation))


This produces our result:

Proportion of variation explained: [ 0.93401632  0.02888954  0.00440268  0.00637289  0.01540229  0.01091628]


This result demonstrates that the first principal component, dominated by the 30 day synthetic future, accounts for 93.4% of the variance in the system; the second, dominated by the 60 day future, accounts for 2.89%; and the remainder is spread across combinations of the other futures.
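One caveat: np.linalg.eig makes no guarantee about the order of the eigenvalues it returns (notice the values above are not monotonically decreasing), so it is safer to sort before labelling anything the "first" or "second" component. A sketch using the proportions printed above:

```python
import numpy as np

# The proportions printed above, in the order eig() returned them
prop_of_variation = np.array([0.93401632, 0.02888954, 0.00440268,
                              0.00637289, 0.01540229, 0.01091628])

# Sort descending so index 0 really is the dominant component
order = np.argsort(prop_of_variation)[::-1]
print(prop_of_variation[order])
```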

Of course this is pretty intuitive: one would expect the nearest expiring future to be the most volatile. However, by using PCA we obtain a statistical breakdown of how much each individual component contributes to the overall variance.

Its potential application is far wider: you could apply a similar technique to identify the key variance contributors in any group of price series. Ernie Chan, in his book, demonstrates how one can use PCA to identify the stocks in an index with the best and worst returns, and base a (poorly performing) strategy on this.

For further background on PCA, I suggest you start with this great visual explanation of PCA, and the accompanying post on eigenvalues and eigenvectors.

The code in its entirety for both posts is available below and here.

    import datetime as dt
    import glob
    import os

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import seaborn

    FUTURES_MONTHS = ['F', 'G', 'H', 'J', 'K', 'M', 'N', 'Q', 'U', 'V', 'X', 'Z']
    FUTURES_DATA_DIR = '/path/to/data'

    def main():
        futures_prices = load_futures_prices(
            FUTURES_DATA_DIR, price='Settle', start_year=2005)

        n_futures = [30., 60., 90., 120., 150., 180.]

        start = '01/01/2009'
        end = '30/12/2014'
        dates = pd.date_range(start, end, normalize=True)

        prices = pd.DataFrame(index=dates)

        for n in n_futures:
            result = load_vix_future(n, futures_prices, dates)
            # You may want to add some additional validation here looking for NaNs...
            prices[str(n)] = result

        prices = prices.dropna()

        prices.plot()
        plt.title('Constant maturity synthetic VIX futures prices')
        plt.show()

        returns = pd.DataFrame(index=prices.index)

        for key in prices:
            returns[key] = calculate_log_returns(prices[key])

        corr = returns.corr()
        print(corr)
        eig_val, eig_vec = np.linalg.eig(corr)
        prop_of_variation = eig_val / eig_val.sum()
        print(eig_val)
        print(eig_vec)
        print('Proportion of variation explained: {}'.format(prop_of_variation))

    def load_vix_future(n, futures_prices, dates):
        ratios = np.empty(len(dates))

        for i in range(0, len(ratios)):
            ratios[i] = get_vix_future_ndays(n, futures_prices, dates[i].date())

        return ratios

    def get_vix_future_ndays(n, futures_prices, curr_date):
        """
        Weighted future expiring n days from now.
        """
        # We use n-1 as we roll to the next month if our target date falls on
        # an expiry date; we want that month's contract, not the next
        target_date = curr_date + dt.timedelta(days=n-1)

        prev_expiry_date = get_next_expiry_date(curr_date)
        next_expiry_date = get_next_expiry_date(
            prev_expiry_date + dt.timedelta(days=1))

        while next_expiry_date < target_date:
            prev_expiry_date = next_expiry_date
            next_expiry_date = get_next_expiry_date(
                prev_expiry_date + dt.timedelta(days=1))

        prices = get_futures_prices(
            futures_prices, curr_date, prev_expiry_date, 2)
        near_date_maturity = (prev_expiry_date - curr_date).days

        if np.isnan(prices[0]) or np.isnan(prices[1]):
            return np.NaN
        else:
            return (near_date_maturity / n) * prices[0] + \
                ((n - near_date_maturity) / n) * prices[1]

    def get_futures_prices(futures_prices, curr_date, expiry_date, month_count):

        def get_price_value(prices, curr_date):
            if curr_date in prices:
                return prices[curr_date]
            else:
                return np.NaN

        year = expiry_date.year
        month = expiry_date.month

        prices = []

        prices.append(
            get_price_value(futures_prices[year][month - 1], curr_date))

        remaining = month_count - 1
        next_month = month
        next_year = year
        while remaining > 0:
            if next_month == 12:
                next_month = 1
                next_year += 1
            else:
                next_month += 1

            prices.append(
                get_price_value(
                    futures_prices[next_year][next_month - 1], curr_date))

            remaining -= 1

        return prices

    def get_next_expiry_date(curr_date):
        expiry_date = get_expiry_date_for_month(curr_date)

        # It must be less than the expiry date, as on the expiry date we only
        # have a settlement price
        if curr_date < expiry_date:
            return expiry_date
        else:
            return get_expiry_date_for_month(curr_date + dt.timedelta(days=30))

    def get_expiry_date_for_month(curr_date):
        """
        http://cfe.cboe.com/products/spec_vix.aspx

        Trading hours for expiring VIX futures contracts end at 7:00 a.m.
        Chicago time on the final settlement date.

        FINAL SETTLEMENT DATE:

        The Wednesday that is thirty days prior to the third Friday of the
        calendar month immediately following the month in which the contract
        expires ("Final Settlement Date"). If the third Friday of the month
        subsequent to expiration of the applicable VIX futures contract is a
        CBOE holiday, the Final Settlement Date for the contract shall be
        thirty days prior to the CBOE business day immediately preceding that
        Friday.
        """
        # Date of the third Friday of the following month
        if curr_date.month == 12:
            third_friday_next_month = dt.date(curr_date.year + 1, 1, 15)
        else:
            third_friday_next_month = dt.date(curr_date.year,
                                              curr_date.month + 1, 15)

        one_day = dt.timedelta(days=1)
        thirty_days = dt.timedelta(days=30)
        while third_friday_next_month.weekday() != 4:
            # Using += results in a timedelta object
            third_friday_next_month = third_friday_next_month + one_day

        # TODO: Incorporate check that it's a trading day; if so, move the 3rd
        # Friday back by one day before subtracting
        return third_friday_next_month - thirty_days

    def load_futures_prices(source_dir, price='Settle',
                            start_year=2005, end_year=2099):
        """
        Dictionary of dataframe price data from files named in the format
        CFE_[M][YY]_VX.csv, where M is a month code from FUTURES_MONTHS.

        start_year and end_year parameters refer to futures we are interested
        in, not dates we have price data for.

        :return data[YYYY][M] = dataframe
            Where YYYY is expiry year, M is expiry month in range [0, 11]
        """
        data = {}

        files = glob.glob(os.path.join(source_dir, 'CFE_*'))
        for f in files:
            filename = os.path.basename(f)
            month = FUTURES_MONTHS.index(filename[4])
            year = int('20' + filename[5] + filename[6])

            if year < start_year or year > end_year:
                continue

            try:
                df = load_futures_data(f)
            except IndexError:
                continue

            if year not in data:
                data[year] = 12 * [None]
            data[year][month] = df[price]

        return data

    def load_futures_data(path, order_ascending=True):
        df = pd.read_csv(path, index_col=0, parse_dates=True)

        if order_ascending:
            # Ensure data is in ascending order by date
            if df.index[0] > df.index[1]:
                df.sort_index(inplace=True)
        return df

    def np_print_full(*args, **kwargs):
        from pprint import pprint
        opt = np.get_printoptions()
        np.set_printoptions(threshold=np.inf)
        pprint(*args, **kwargs)
        np.set_printoptions(**opt)

    def lag(data, empty_term=0., axis=0):
        lagged = np.roll(data, 1, axis=axis)
        lagged[0] = empty_term
        return lagged

    def calculate_log_returns(prices):
        lagged_pnl = lag(prices)
        returns = np.log(prices / lagged_pnl)

        # All values prior to our position opening will have a value of inf.
        # This is due to division by 0.0
        returns[np.isinf(returns)] = 0.
        # Additionally, any values of 0 / 0 will produce NaN
        returns[np.isnan(returns)] = 0.
        return returns

    if __name__ == '__main__':
        main()


