Theory
In this section, I will show how you can code a prediction model for the return of a company’s stock price based on the returns of a number of peer companies. I loosely follow the approach of Baker and Gelbach (2020). In that paper, peer firms are identified based on the SIC code in the Compustat data, which indicates the industry of the firm’s primary product.
We are using the tidymodels package for the application and I will follow the introduction to tidymodels. The advantage of the tidymodels approach is that you can follow a very similar workflow for machine learning methods other than the one I show here. The code itself is not that important. The goal is more to give you a starting point.
To understand the machine learning approach that we are using, I need to start off with some theory. The fundamental idea is that we want to avoid overfitting to the data that we have, so that when we use the model to predict on new data we still get good predictions. This means that when we estimate our model, we do not want a model that fits the current data as well as possible. We want to regularise the parameters in the model so that we do not get a perfect fit in the current sample but do get better out-of-sample predictions. For instance, if we use 200 trading days and have 200 peer firms, we can perfectly predict the returns within the sample of 200 days¹ but there are no guarantees that we will get good predictions from that model out-of-sample (i.e. after the earnings announcement).
For the linear model that predicts stock returns based on peers, we will use the elastic net regulariser to bias the estimates within the sample data and make it more likely that the linear model will give good predictions out-of-sample. One way to think about the linear model with peers as predictors is that we are creating a bespoke market index for each firm as a weighted average of the returns of its peers.
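To make this concrete, here is a minimal sketch of how such an elastic net model could be specified with tidymodels, assuming a hypothetical data frame `peer_returns` with the firm’s daily return in `ret` and one column per peer firm. One caution: parsnip follows the glmnet convention where `mixture = 1` corresponds to a pure LASSO penalty, which is the reverse of the \(\alpha\) convention in the equations below.

```r
# A minimal sketch, assuming a hypothetical data frame `peer_returns`
# with the firm's daily return in `ret` and one column per peer firm.
library(tidymodels)

elastic_net_spec <- linear_reg(
  penalty = 0.01,  # lambda: the overall size of the penalty
  mixture = 0.5    # the weight between the two penalty terms
) |>
  set_engine("glmnet")

# Fit the bespoke market index: the firm's return as a weighted
# average of its peers' returns.
elastic_net_fit <- elastic_net_spec |>
  fit(ret ~ ., data = peer_returns)
```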
A regular linear model estimates the \(\beta\)s by minimising the following equation, the sum of the squared differences between the outcome (\(y_i\)) and the prediction (\(X_i \beta\)) for each observation \(i\).
\[ \sum^N_{i=1} (y_i - X_i \beta)^2 \]
That is, we want to find estimates that give the best possible fit in the data. The regulariser adds a penalty on larger absolute values of the \(\beta\)s to limit overfitting to the in-sample data. The estimates are now chosen to minimise the following equation.
\[ \sum^N_{i=1} (y_i - X_i \beta)^2 + \lambda \left( \alpha \sum^p_{j=1} \beta^2_j + (1 - \alpha) \sum^p_{j=1} |\beta_j| \right) \]
The size of the penalty is governed by the parameter \(\lambda > 0\) and consists of two parts: the sum of the squared \(\beta\)s and the sum of the absolute values of the \(\beta\)s. The first term is the ridge regulariser and the second one is the LASSO regulariser. Both have been shown to have useful properties as regularisers and they are therefore often used together, with the weight \(0 \leq \alpha \leq 1\). With \(\alpha = 1\), we only use ridge regression and with \(\alpha = 0\), we only use the LASSO.
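As an illustration of what is being minimised, the sketch below writes the penalised objective as a plain R function, following the notation and \(\alpha\) convention of the equation above. In practice, glmnet minimises (a scaled version of) this objective far more efficiently; this is only meant to make the formula tangible.

```r
# An illustrative R version of the penalised objective, following the
# alpha convention of the equation above (alpha = 1 is pure ridge).
# `y` is the outcome vector, `X` the matrix of peer returns, and
# `beta` a candidate coefficient vector.
elastic_net_loss <- function(beta, y, X, lambda, alpha) {
  fit   <- sum((y - X %*% beta)^2)        # in-sample fit
  ridge <- alpha * sum(beta^2)            # squared-beta penalty
  lasso <- (1 - alpha) * sum(abs(beta))   # absolute-beta penalty
  fit + lambda * (ridge + lasso)
}
```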
The final step is that we need to choose good values for \(\lambda\) and \(\alpha\). A common approach is cross validation, where we split the data that we have into roughly equal-sized partitions or folds. For instance, you could have 10 folds, each with 10% of the data. With cross validation, we use the remaining 90% of the data to estimate the \(\beta\)s for a number of different values of \(\lambda\) and \(\alpha\), and predict the 10% fold that we did not use for estimation. In other words, we use the data that we have to carry out a prediction task, with the advantage that we can evaluate which \(\lambda\) and \(\alpha\) give us the best predictions.
The key insight is that if we care about a prediction task, we can set aside some of the data and pretend it is data that we have never seen when we estimate our prediction model. We can then test which model is actually good at predicting on data that it has never seen.
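A sketch of how this tuning could look with tidymodels is shown below, reusing the hypothetical `peer_returns` data frame from above and marking both \(\lambda\) (`penalty`) and \(\alpha\) (`mixture`) for tuning.

```r
# A sketch of tuning lambda and alpha with 10-fold cross validation,
# reusing the hypothetical `peer_returns` data frame from above.
library(tidymodels)

tune_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

folds <- vfold_cv(peer_returns, v = 10)

wflow <- workflow() |>
  add_model(tune_spec) |>
  add_formula(ret ~ .)

# Try 20 candidate (penalty, mixture) pairs and evaluate each one
# on the held-out fold.
tuned <- tune_grid(
  wflow,
  resamples = folds,
  grid = 20,
  metrics = metric_set(rmse)
)

# The combination that gives the best out-of-sample predictions.
select_best(tuned, metric = "rmse")
```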
We do need a measure to evaluate the quality of the predictions. A common choice is the Root Mean Squared Error (RMSE), which is defined as the square root of the mean squared difference between the actual values of the outcome and the predictions for the outcome.
\[ \sqrt{ \frac{\sum^N_{i=1} (y_i - \hat{y}_i)^2}{N} } \]
We will use the RMSE to evaluate the predictions out-of-sample. Note that the RMSE is similar to the first equation, where we choose the \(\beta\)s to minimise the squared difference between the outcome and the fitted values in the in-sample data.
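For reference, the RMSE is straightforward to compute by hand, and the yardstick package (loaded with tidymodels) provides the same metric. The sketch below assumes hypothetical vectors `y` and `y_hat` holding the actual and predicted returns.

```r
# RMSE by hand, assuming vectors `y` (actual out-of-sample returns)
# and `y_hat` (the model's predictions).
rmse_manual <- function(y, y_hat) {
  sqrt(mean((y - y_hat)^2))
}

# yardstick (loaded with tidymodels) provides the same metric:
# rmse_vec(truth = y, estimate = y_hat)
```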
Footnotes
1. It’s a system of linear equations with 200 unknown parameters (the \(\beta\)s) and 200 observations (the trading days).