Many books and papers explain empirical asset pricing models such as the CAPM and the Fama-French 3-factor model, but without working through the data first-hand and replicating the results myself, it is difficult to build intuition about the subject.
Below we work through the CRSP monthly dataset to get a real feel for what the data looks like, and we also try to replicate some of the standard results in [1].
import pandas as pd
import numpy as np
from numpy.linalg import lstsq
import matplotlib.pyplot as plt
from scipy.stats import describe, linregress
import functools
from IPython.display import display, clear_output
from numba import double, void, cuda   # numbapro's CUDA support was folded into numba itself
import itertools
import time
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
plt.style.use('bmh')
dataLoc = '/Users/unclechu/Dropbox/Datasets/CRSP/CRSP_monthly.csv'
dataLoc2 = '/Users/unclechu/Dropbox/Datasets/CRSP/CRSP_Riskfree_1_Month.csv'
def crsp_riskfree():
    riskfree = pd.read_csv( dataLoc2, parse_dates=[0], infer_datetime_format=True )
    riskfree.rename( columns={'qdate': 'date', 'ave_1': 'RF1M'}, inplace=True )
    riskfree['RF1M'] = pd.to_numeric( riskfree.RF1M, errors='coerce' )
    # Convert the annualized percentage yield into a simple 1-month rate.
    riskfree['RF1M'] = riskfree['RF1M'].apply( lambda x: np.exp(30./365.*(x/100.)) - 1 )
    riskfree.set_index( 'date', inplace=True )
    return riskfree
riskfree = crsp_riskfree() # load data for riskfree rate
riskfree.head()
def excess_return_grouper( sub_df ):
    # Recover this group's date from the first level of the MultiIndex.
    idx = sub_df.index.codes[0][0]
    dt = sub_df.index.levels[0][idx]
    # The risk-free rate applied to month t is the one observed at the end of month t-1.
    if dt.month == 1:
        label = '%s-%s' % ( dt.year - 1, 12 )
    else:
        label = '%s-%.2d' % ( dt.year, dt.month - 1 )
    if not pd.Timestamp(label) > riskfree.index.max():
        rf_rate = riskfree.loc[label]['RF1M'].values[0]
        sub_df['XRET'] = sub_df.RET - rf_rate
    else:
        sub_df['XRET'] = 0.0
    return sub_df
def crsp_stock_data():
    df = pd.read_csv( dataLoc, parse_dates=[1], infer_datetime_format=True )
    df = df.query('SHRCD==10 or SHRCD==11').copy()
    df['RET'] = pd.to_numeric( df.RET, errors='coerce' )
    df['RETX'] = pd.to_numeric( df.RETX, errors='coerce' )
    df.dropna(subset=['TICKER', 'RET', 'RETX'], inplace=True)
    # Compute market capitalization. Note: the number of shares outstanding, in thousands, is given by the SHROUT
    # field. The month-end price of the stock is taken from the ALTPRC field. We define the market cap as the
    # absolute value of the product of the SHROUT and ALTPRC fields, divided by 1000. This gives market cap in
    # millions. The absolute value is taken because when there is no trading activity in a stock, CRSP reports
    # the price as the negative of the average of the most recent bid and ask prices.
    df['MktCap'] = np.abs( df.SHROUT * df.ALTPRC ) / 1000.0
    df.set_index(['date', 'TICKER'], inplace=True)
    df.sort_index(level=0, inplace=True)
    df = df.groupby(level='date').apply( excess_return_grouper )
    return df
df = crsp_stock_data()
df.head()
We calculate the total market capitalization of the CRSP universe for each month $t$.
# calculate total market cap of the universe for each month
mktcap_ts = df.groupby(df.index.get_level_values(0))['MktCap'].sum() / 1000. # in billions
fig = plt.figure( figsize = ( 12, 6 ))
mktcap_ts.plot()
plt.title('Total market capitalization of CRSP universe. (Monthly data)')
plt.show()
We calculate returns and excess returns in each month $t$ for the cross-section of stocks in the CRSP universe. From these we compute the cross-sectional mean, variance, and other descriptive statistics at each time point $t$. We thus end up with a time series of these values, one per month, which we then summarise.
cx_excess_ret_info = df.groupby(level='date').apply( lambda x : x['XRET'].describe() )
cx_ret_info = df.groupby(level='date').apply( lambda x : x['RET'].describe())
# Express all statistics except the observation count in percent.
for col in cx_excess_ret_info.columns:
    if col != 'count':
        cx_excess_ret_info[col] = cx_excess_ret_info[col] * 100
        cx_ret_info[col] = cx_ret_info[col] * 100
cx_excess_ret_info.head()
cx_ret_info.head()
ret_summary = pd.DataFrame([cx_excess_ret_info.apply(np.mean),
cx_ret_info.apply(np.mean)], index=['xret', 'ret'])
ret_summary
From the above, we can see that the average cross-sectional return has a time-series mean of 1.2% per month, versus 0.83% for excess returns. The minimum monthly return can fall below -60% and the maximum can exceed 250%. This illustrates the high risk of investing in individual stocks.
The market factor is one of the most important risk factors, featuring in both the CAPM and the Fama-French 3-factor model. We use the market factor to compute the $\beta$s in the subsequent sections. We now load and preview the data.
def FF_mkt_returns( daily=True ):
if daily:
dataLoc = '/Users/unclechu/Dropbox/Datasets/CRSP/FF-Factors-Daily.csv'
else:
dataLoc = '/Users/unclechu/Dropbox/Datasets/CRSP/FF-Factors-Monthly.csv'
df = pd.read_csv(dataLoc, parse_dates=[0], infer_datetime_format=True)
df.set_index( 'dateff', inplace=True)
df.sort_index( inplace=True )
return df
mkt_returns = FF_mkt_returns(False)
mkt_returns.head()
# Growth of $100 invested in the market portfolio, in excess of the risk-free rate
cum_xret = mkt_returns.mktrf + 1.
cum_xret = np.cumprod( cum_xret ) * 100.
fig = plt.figure( figsize = ( 12, 6))
cum_xret.plot()
plt.title('Cumulative excess return of the market portfolio')
plt.show()
The CAPM postulates that the excess return of asset $i$, $R^i$, can be explained by its sensitivity to market risk $\beta_i$, via the following formulation,

$$ \begin{aligned} &\mathbb{E}_t\left[R^i -R^f \right] = \beta_i\, \mathbb{E}_t\left[R^m -R^f \right] + (\alpha_i)\\ \Rightarrow\; &\mathbb{E}_t\left[R^{ei}\right] = \beta_i\, \mathbb{E}_t \left[R^{em}\right] + (\alpha_i) \end{aligned} $$

What CAPM theory says is that if a stock has high average returns, it should also have a high $\beta$ against the average return of the market portfolio. Additionally, according to the CAPM we should not see any $\alpha_i$, which is why the term appears in parentheses. In reality, $\alpha_i$ is observed and is statistically significant in some instances.
We derive $\beta_i$ from the time-series regression,

$$ R^{ei}_t = \alpha_i + \beta_i R ^{em}_t + \epsilon^i_t $$

The point to stress about the above is that it is not about "predicting returns"! The whole point of the exercise is rather to see whether high average returns are matched by high $\beta_i$. From the properties of least-squares regression we have the following closed-form estimates,

$$ \hat{\beta}_i = \frac{\sum_t \left(R^{em}_t - \bar{R}^{em}\right)\left(R^{ei}_t - \bar{R}^{ei}\right)}{\sum_t \left(R^{em}_t - \bar{R}^{em}\right)^2}, \qquad \hat{\alpha}_i = \bar{R}^{ei} - \hat{\beta}_i\, \bar{R}^{em} $$
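As a quick sanity check that these closed-form estimates are right (a minimal sketch on synthetic data, not CRSP data), we can compare them against scipy.stats.linregress, which was imported above:
# Sanity check: the closed-form slope/intercept should match scipy's linregress.
# Synthetic inputs only -- illustrative, not part of the replication.
rng = np.random.RandomState(0)
x_demo = rng.normal(0.005, 0.04, 24)                       # fake market excess returns
y_demo = 0.01 + 1.3 * x_demo + rng.normal(0., 0.02, 24)    # fake stock excess returns
beta_hat = np.sum((x_demo - x_demo.mean()) * (y_demo - y_demo.mean())) \
           / np.sum((x_demo - x_demo.mean())**2)
alpha_hat = y_demo.mean() - beta_hat * x_demo.mean()
slope, intercept = linregress(x_demo, y_demo)[:2]
assert np.allclose([beta_hat, alpha_hat], [slope, intercept])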
We calculate the cross-sectional values of beta across the entire stock universe to gain some intuition about the range of values of $\beta$ and its behavior over time. To attack this huge computational task, we use the GPU to perform the OLS regressions in parallel.
@cuda.jit( void( double[:,:], double[:,:], double[:,:] ))
def ols( x, y, ret ):
    ''' Batch-compute a large number of least-squares regressions y = a + b*x.
        Thread i fits one (x[i], y[i]) pair and writes (a, b) into ret[i]. '''
    i = cuda.grid(1)
    if i >= y.shape[0]:   # guard against excess threads in the block
        return
    N = y.shape[1]
    # sample means of x and y
    xbar = 0.0
    ybar = 0.0
    for k in range(N):
        xbar += x[i, k]
        ybar += y[i, k]
    xbar = xbar / N
    ybar = ybar / N
    # OLS slope = sample covariance / sample variance
    numer = 0.0
    denom = 0.0
    for k in range(N):
        numer += ( x[i, k] - xbar ) * ( y[i, k] - ybar )
        denom += ( x[i, k] - xbar )**2
    beta = numer / denom
    ret[i, 1] = beta
    ret[i, 0] = ybar - beta * xbar   # intercept alpha
def chunk_iter( iterable, n ):
    ''' Yield successive tuples of at most n items from iterable. '''
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk
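A quick illustration of chunk_iter's behavior on a toy iterable (not part of the pipeline): it yields fixed-size tuples, with a shorter final chunk.
list( chunk_iter( range(5), 2 ) )   # -> [(0, 1), (2, 3), (4,)]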
def batch_ols( xlist, ylist, tickers, dt ):
    ''' Run one GPU batch of OLS regressions; returns a DataFrame of (alpha, beta) per ticker. '''
    stream = cuda.stream()
    griddim = 1, 1
    xs = np.stack( xlist )
    ys = np.stack( ylist )
    cnt = len( ylist )
    ret = np.zeros( ( cnt, 2 ) )
    blockdim = cnt, 1, 1   # one thread per regression; cnt must not exceed 1024
    dxs = cuda.to_device( xs, stream )
    dys = cuda.to_device( ys, stream )
    dret = cuda.to_device( ret, stream )
    ols[griddim, blockdim, stream](dxs, dys, dret)
    dret.to_host(stream)
    stream.synchronize()
    alpha, beta = zip(*ret)   # column 0 holds alpha, column 1 holds beta
    df = pd.DataFrame.from_dict( dict(ticker=tickers, alpha=alpha, beta=beta ) )
    df['date'] = dt
    return df
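Before committing to the long run, it is worth cross-checking batch_ols against NumPy's lstsq on a small batch. This is a minimal sketch that assumes a CUDA-capable device is available; the series, tickers, and date below are made up for illustration.
# Cross-check the GPU batch against numpy's lstsq on synthetic series.
rng = np.random.RandomState(42)
xs_demo = [rng.normal(0., 0.05, 24) for _ in range(8)]
ys_demo = [0.002 + b * x + rng.normal(0., 0.01, 24)
           for b, x in zip(np.linspace(0.5, 2.0, 8), xs_demo)]
check_df = batch_ols(xs_demo, ys_demo, ['S%d' % i for i in range(8)],
                     pd.Timestamp('2014-12-31').date())
for x, y, (_, row) in zip(xs_demo, ys_demo, check_df.iterrows()):
    A = np.vstack([x, np.ones(len(x))]).T
    slope, intercept = np.linalg.lstsq(A, y)[0]
    assert np.allclose([row['beta'], row['alpha']], [slope, intercept])
Since each thread simply evaluates the closed-form OLS formulas in double precision, the GPU results should agree with the CPU solve to floating-point accuracy.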
class Timer():
    ''' Context manager that reports the wall-clock time taken by the enclosed block. '''
    def __init__( self, desc ):
        self.desc = desc
    def __enter__( self ):
        self.start = time.time()
        return self
    def __exit__(self, type, value, traceback):
        self.elapsed = time.time() - self.start
        print('%s took %s seconds' % (self.desc, self.elapsed))
def do_beta():
    CHUNK_SIZE = 1024   # one regression per thread; capped by the 1024-thread CUDA block limit
    monthly_data_points = 24   # number of monthly observations used to compute each beta
stock_data = crsp_stock_data()
riskfree = crsp_riskfree()
mkt_returns = FF_mkt_returns(daily=False)
stock_data['XRET'] = 0.0
# compute excess returns for each month across the whole dataset
for dt in riskfree.index[1:]:
if dt.month == 1:
prev_month = '%s-%.2d' % (dt.year-1,12)
else:
prev_month = '%s-%.2d' % (dt.year,dt.month-1)
riskfree_rate = riskfree.loc[prev_month]['RF1M'].values[0] # only 1 row in the series, pick out the atom value
stock_data.loc[(dt, slice(None)), 'XRET'] = stock_data.loc[(dt, slice(None)), 'RET'] - riskfree_rate
result_dfs = []
for dt in riskfree.index[monthly_data_points:]:
clear_output(wait=True)
display( 'processing %s' % dt.date() )
        start_date = dt + pd.Timedelta(days=-365*2)   # two-year lookback window
        end_date = dt
        period_mkt_returns = mkt_returns.loc[start_date:end_date]
stock_data_window = stock_data.loc[ start_date : end_date ]
tickers = stock_data_window.loc[dt].index.unique()
count = 0
for chunk in chunk_iter( tickers, CHUNK_SIZE ):
# prepare the data for execution on the GPU
symbols, xlist, ylist = [], [], []
for ticker in chunk:
if not isinstance(ticker, str):
continue
ticker_df = stock_data_window.loc(axis=0)[:,ticker]
                if all( np.isfinite( ticker_df.RET ) ):
                    # take the data points for the last `monthly_data_points` (24) months
                    y = ticker_df[-monthly_data_points:]['XRET']
                    x = period_mkt_returns[-monthly_data_points:]['mktrf']
                    if not ( len(x) == monthly_data_points and len(y) == monthly_data_points ):
                        continue
ylist.append(y.values)
xlist.append(x.values)
symbols.append(ticker)
# send everything to the GPU
if len( xlist ) > 0 and len( ylist ) > 0:
cur_df = batch_ols( xlist, ylist, symbols, dt.date() )
result_dfs.append(cur_df)
return result_dfs
# Compute all the beta values if this is the first run; this takes ~2 hours.
# Otherwise, load the previously computed data from disk (next cell).
with Timer('Computing beta') as t:
beta_data = do_beta()
beta_df = pd.concat(beta_data, ignore_index=True)
beta_df.set_index(['date', 'ticker'], inplace=True)
beta_df.sort_index(0, inplace=True)
# read betas from what we've already computed and saved previously.
beta_df = pd.read_csv('/Users/unclechu/Dropbox/Datasets/CRSP/CRSP_beta_monthly_24pts.csv',
parse_dates=[3], infer_datetime_format=True)
beta_df.set_index(['date', 'ticker'], inplace=True)
beta_df.sort_index(0, inplace=True)
We show a plot of the time series of CAPM $\beta$ for selected stocks from the CRSP universe. Note how $\beta$ fluctuates through time for each of the stocks.
fig = plt.figure( figsize = ( 12, 6))
symbols = ['AAPL', 'MSFT', 'XOM', 'DIS']
for sym in symbols:
beta = beta_df.loc[ (slice(None), sym), ][['beta']]
plt.plot(beta.index.get_level_values(0), beta.beta, label = sym)
plt.title('Calculated values of CAPM $\\beta$ for %s' % symbols)
plt.legend()
plt.show()
We replicate below the same results as in [1] for the summary statistics of the cross-sectional measurement of beta for each month $t$ from 1962 to 2014. The symbol $\beta^{2Y}$ refers to the beta value computed using 2 years of monthly returns data for each month $t$.
def beta_group_func( sub_df ):
return sub_df.beta.describe()
beta_info = beta_df.groupby(beta_df.index.get_level_values(0)).apply(beta_group_func)
beta_summary = pd.DataFrame([beta_info.apply(np.mean)], index=['$\\beta^{2Y}$'])
beta_summary
We next show a 3D histogram of the distribution of the cross-sectional values of beta over time. This gives us greater intuition of the distributional characteristics of $\beta$ through time.
import matplotlib.colors as colors
import matplotlib.cm as cmx
bins = np.linspace(-10, 10, 50)
dateHistPairs = []
for dt, sub_df in beta_df.groupby( beta_df.index.get_level_values(0)):
betas = sub_df['beta']
hist, _ = np.histogram( betas, bins=bins )
dateHistPairs.append( ( dt, hist ))
N=50
data = dateHistPairs[-N:]
fig = plt.figure(figsize=(16,12))
ax = fig.add_subplot(111, projection='3d')
cm = plt.get_cmap('Paired')
cNorm = colors.Normalize(vmin=0, vmax=N)
scalarMap = cmx.ScalarMappable(norm=cNorm, cmap=cm)
colorVals = [scalarMap.to_rgba(i % N) for i in range(len(data))]
for i, (dt, hist) in enumerate( data ):
xpos = bins[:-1]
ypos = np.ones_like(xpos) * i * 0.25
zpos = np.zeros_like(xpos)
dx = ( bins[1] - bins[0] - 0.1) * np.ones_like(xpos)
dy = 0.25 * np.ones_like(xpos)
dz = hist
ax.bar3d(xpos, ypos, zpos, dx, dy, dz, zsort='max', color=colorVals[i ])
ys = np.array(range(len(data))) * 0.25
ylabels = beta_df.index.get_level_values(0).unique()[-len(data):]
ylabels = [str(x.date()) for x in ylabels]
plt.yticks(ys[::4], ylabels[::4], rotation=-15)
for label in ax.yaxis.get_ticklabels():
label.set_horizontalalignment('left')
label.set_verticalalignment('bottom')
plt.ylim([-0.5, len(data) * 0.25 + 0.5])
plt.title('Plot of monthly histogram of the distribution of $\\beta$ through time.')
plt.show()
As we can see from the above, the distribution of the values of $\beta$ is quite stable through time and mainly falls in the range of $-0.1$ to $5.0$.
We now investigate the excess returns on decile portfolios sorted on $\beta$. In each month $t$, we sort the CRSP universe according to the computed values of $\beta^{2Y}$.
decile_labels = [ 'D%.2d' % x for x in range(1, 11 )]
def decile_portfolios( beta_df_month_t ):
''' sort the stocks into 10 decile portfolios based on the beta values '''
beta_df_month_t['quantile'] = pd.qcut(beta_df_month_t.beta, 10, labels=decile_labels)
return beta_df_month_t
quantiled_beta = beta_df.groupby(beta_df.index.get_level_values(0)).apply( decile_portfolios )
quantiled_beta.head()
With the above, we now have a quantile column indicating for each date which decile portfolio a ticker belongs to. We can now proceed to compute, for each month $t$, the returns on equal-weighted (or value-weighted) decile portfolios.
The betas calculated for month $t$ are used to construct the 10 decile portfolios. The portfolios are then held from month $t$ to month $t+1$, so the holding-period returns for month $t+1$ are the ones we use to calculate the returns on the decile portfolios. Below, we roll the dates of the beta dataframe one month forward, then join with the securities dataframe to line everything up so we can calculate the decile portfolio returns for each month $t$.
# roll the dates one month forward for the beta dataframe
quantiled_beta.reset_index(inplace=True)
df.reset_index(inplace=True)
beta_datelist = quantiled_beta.date.unique()
maxdate = beta_datelist.max()
mapping = dict( zip( beta_datelist, np.roll(beta_datelist, -1)))
stockdates = df.date.unique()
max_plus_one = stockdates[stockdates > maxdate][0]
mapping[maxdate] = np.datetime64( max_plus_one )
# normalize both keys and values to plain datetime.date, because mixing numpy/pandas datetime types is a nightmare
mapping = dict( ( pd.Timestamp(k).date(), pd.Timestamp(v).date() ) for k, v in mapping.items() )
quantiled_beta['date'] = pd.to_datetime( pd.Series( [ mapping[dt.date()] for dt in quantiled_beta['date'] ] ) )
def decile_return_group_func( sub_df ):
return sub_df.groupby('quantile').apply( lambda q : q['XRET'].mean() * 100. )
m = pd.merge( df, quantiled_beta, left_on=['date','TICKER'], right_on=['date', 'ticker'])
decile_returns = m.groupby('date').apply( decile_return_group_func )
decile_returns['D10-D01'] = decile_returns.D10 - decile_returns.D01
decile_returns.describe()
From the above results, we can see that the claim that higher returns correspond to higher betas is clearly not true here. In fact, the return on the decile-10 portfolio minus the decile-1 portfolio is actually negative!
def decile_beta_group_func( sub_df ):
return sub_df.groupby('quantile').apply( lambda q : q['beta'].mean() )
decile_betas = m.groupby('date').apply( decile_beta_group_func )
decile_betas
fig = plt.figure( figsize = ( 12, 6))
x = decile_betas.apply( np.mean )
y = decile_returns.apply( np.mean )[:-1]
A = np.vstack([x, np.ones(len(x))]).T
b, c = np.linalg.lstsq(A, y)[0]
plt.plot(x, b*x + c, 'black', label='Fitted line')
plt.scatter( x, y, marker='^', s=150, c='pink', edgecolors='black', linewidths=2.0)
plt.title('Graph of $\\beta$ against average excess returns for the 10 decile portfolios (equal-weighted)')
plt.legend()
plt.show()
The graph above presents the results more clearly. As we can see, the betas do not line up (as they should according to the CAPM).
We next show a similar analysis for value-weighted decile portfolios, using each stock's market capitalization as its weight.
def decile_return_group_func( sub_df ):
return sub_df.groupby('quantile').apply( lambda q : np.average(q['XRET'], weights=q['MktCap']) * 100. )
m = pd.merge( df, quantiled_beta, left_on=['date','TICKER'], right_on=['date', 'ticker'])
decile_returns = m.groupby('date').apply( decile_return_group_func )
decile_returns['D10-D01'] = decile_returns.D10 - decile_returns.D01
decile_returns.describe()
Contrary to the results for the equal-weighted decile portfolios, the results for the value-weighted portfolios do show a slight positive return from being long the 10th-decile portfolio and short the 1st-decile portfolio.
fig = plt.figure( figsize = ( 12, 6))
x = decile_betas.apply( np.mean )
y = decile_returns.apply( np.mean )[:-1]
A = np.vstack([x, np.ones(len(x))]).T
b, c = np.linalg.lstsq(A, y)[0]
plt.plot(x, b*x + c, 'black', label='Fitted line')
plt.scatter( x, y, marker='^', s=150, c='pink', edgecolors='black', linewidths=2.0)
plt.title('Graph of $\\beta$ against average excess returns for the 10 decile portfolios (value-weighted)')
plt.legend()
plt.show()
As we can see, the value-weighted portfolios behave much better, but there are still anomalies (outliers too far above the line).
[1] Turan G. Bali, Robert F. Engle, and Scott Murray, Empirical Asset Pricing: The Cross Section of Stock Returns. Wiley Series in Probability and Statistics, 2016.