Abnormal returns

Author

Stijn Masschelein

Setup

The lubridate package is the tidyverse package that helps with time related data. Dates are a specific class of variables and the package helps with managing date variables. There are a bunch of other packages out there that can help with managing date variables. I just picked one.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(here)

here() starts at /Users/stijnmasschelein/Library/CloudStorage/Dropbox/Teaching/lecturenotes/method_package

i_am("freaky_friday/abnormal_returns.qmd")

here() starts at /Users/stijnmasschelein/Library/CloudStorage/Dropbox/Teaching/lecturenotes/method_package

In order to predict the expected market return reaction to earnings, we will want to take into account the general market reaction. Dellavigna and Pollet (2009) only adjust for the overall market return. There are other adjustments possible for specific risk factors. The original factor model is the Fama-French 3 Factors model and the data to run those models is available via the Kenneth French data library.

I have downloaded the data as a .csv file. The code reads the data in skipping 5 lines and reading the variables as numbers (double precision). We then need to transform the date variable to a date type with a function from the lubridate package. Finally, we need to scale the returns by 100 because they are expressed in percentages. For replicating the Dellavigna and Pollet (2009) paper, we only need the market return minus the risk free rate (mkt_rf) and the risk free rate (rf). If I read the paper correctly, we need to use the raw market return which is calculated as mkt.

earn_ann <- readRDS(here("data", "freaky_friday", "earn_ann.RDS"))
analyst <- readRDS(here("data", "freaky_friday", "analyst.RDS"))
all_stocks <- readRDS(here("data", "freaky_friday", "all_stocks.RDS"))
famafrench <- read_csv(file = here("data", "F-F_Research_Data_Factors_daily.csv"),
                       col_names = c("date", "mkt_rf", "smb", "hml", "rf"),
                       skip = 5, col_type = "ddddd") %>%
  mutate(date = ymd(date)) %>%
  mutate_if(is.numeric, ~ . / 100) %>%
  mutate(mkt = mkt_rf + rf) %>% 
  print()

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

# A tibble: 25,400 × 6
   date        mkt_rf     smb     hml      rf      mkt
   <date>       <dbl>   <dbl>   <dbl>   <dbl>    <dbl>
 1 1926-07-01  0.001  -0.0025 -0.0027 0.00009  0.00109
 2 1926-07-02  0.0045 -0.0033 -0.0006 0.00009  0.00459
 3 1926-07-06  0.0017  0.003  -0.0039 0.00009  0.00179
 4 1926-07-07  0.0009 -0.0058  0.0002 0.00009  0.00099
 5 1926-07-08  0.0021 -0.0038  0.0019 0.00009  0.00219
 6 1926-07-09 -0.0071  0.0043  0.0057 0.00009 -0.00701
 7 1926-07-10  0.0062 -0.0053 -0.001  0.00009  0.00629
 8 1926-07-12  0.0004 -0.0003  0.0064 0.00009  0.00049
 9 1926-07-13  0.0048 -0.0028 -0.002  0.00009  0.00489
10 1926-07-14  0.0004  0.0007 -0.0043 0.00009  0.00049
# ℹ 25,390 more rows

shrout is shares outstanding in 1000 shares
market_value is in million USD

Abnormal returns

Remember that we are interested in predicting the counterfactual of what the returns would have been if there were no earnings announcement. We use the following prediction model:

$R_{u, k} = α_{t, k} + β_{t, k} R_{u, m}$

That means that we need to estimate $α$ and $β$ for each announcement based on the market return and firm return from 300 days before the earnings announcement to 46 days before the earnings announcements (Dellavigna and Pollet 2009). That is a lot of computations!

I played around with a lot of different implementations and finally settled on a completely tidyverse style implementation as the fastest on my machine. I did not include the other implementations in this document but I will show how you can test the timing of your code with the microbenchmark package.

The first thing we need to do is to write the code as a function that we can call as many times as we need. ¹ I call this function create_coefs. The input of the function is n, the number of earnings announcements we want to run the code for. The output is a tibble with the coefficients $α$ and $β$ for each permno and anndat combination.

The actual function takes the earnings announcement data and keeps the distinct or unique permno and anndat combinations. Next, the start and end data for the return data is calculated based on the earnings announcement date. Next, we merge the return data and the Fama-French data and delete the missing or infinite observations for the return.

Note

At this point, you can see the flexibility of my approach to keep the data sets separate until we need to join them. The earnings announcement data is one row per earnings announcement date per firm. The stock data has one observation per firm per day. The Fama-French data has one observation per day. Because these data have a different level of analysis, I have kept them separate until the point that we need to analyse the data. If I have to update one of the datasets, I do not necessarily have to update all the other ones. I would strongly encourage you to do the same thing where you keep different parts of the data cleaning separate for as long as possible. This will make your code flexible and easy to adapt.

The next two steps are the most advanced ones. First, remark that we have multiple rows for each earnings announcement. We use the summarise function to put all the returns in a vector y and we make a matrix X with all 1’s in the first column and the mkt return in the second column. Remark that we are wrapping y and X in a list so that we effectively make y and X list columns in our tibble. The tidyverse lets us put almost anything in a dataset as long as we wrap it in a list. Finally, we summarise the y and X not just for the whole dataset but for each combination of permno and anndat. Thus, we end up with a dataset with a row for each earnings announcement with the returns we want to predict (y) from before the earnings announcement and the market returns from the same time period (X).

In the second step, we will run the regression for each row to estimate $α$ and $β$ . We use a lightweight version of lm called lm.fit to speed up the estimation. lm.fit calculates less statistics like the standard errors that we need to calculate the p-values. Because ultimately, we need to this for 150,000 rows lm.fit will save considerable time compared to lm. lm.fit works slightly different in that no regression equation is specified. You need to give the predictors as the first argument and the outcome variable as the second argument. This is the reason why we formed the y and X variable in the previous step. We just need the coefficients form the fit object and we can use the coef function to extract them from the regression.

Finally, we want to apply this lm.fit function to every row (i.e. for every earnings announcement). We use the map-family to apply a function to every element of a column. The pmap function specifically let us apply a function to multiple columns if we wrap them in a list. So, pmap takes the list of columns X and y and applies (~) the function lm.fit to the first (..1) and second (..2) element of that list. At the end, we extract the coefficients.

The microbenchmark function allows us to test the time it takes to run this function for 100, 1000, and 10000 earnings announcements. The computations are done 10 times for each call to get an average/median estimate ². The surprising thing to me was that the function scales very well. The average time per announcement goes down with more announcements.

create_coefs <- function(n = 6){
  earn_ann %>% head(n = n) %>%
    distinct(permno, anndat) %>%
    mutate(start = anndat - 300, end = anndat - 46) %>% 
    left_join(select(all_stocks, permno, date, ret),
               by = join_by(permno == permno, start <= date, end >= date)) %>%
    left_join(select(famafrench, mkt, rf, date),
              by = join_by(date == date)) %>%
    filter(!is.na(ret), !is.infinite(ret)) %>%
    summarise(y = list(cbind(ret)), X = list(cbind(alpha = 1, beta = mkt)),
               .by = c(permno, anndat)) %>%
    mutate(coefs = pmap(list(X, y), ~ lm.fit(..1, ..2) %>% coef()),
            .by = c(permno, anndat)) %>%
    select(-y, -X)
}
create_coefs()

# A tibble: 6 × 3
  permno anndat     coefs    
   <dbl> <date>     <list>   
1  10560 2004-10-21 <dbl [2]>
2  10560 2005-01-26 <dbl [2]>
3  10560 2005-04-27 <dbl [2]>
4  10560 2005-07-27 <dbl [2]>
5  10560 2005-10-26 <dbl [2]>
6  10560 2006-02-01 <dbl [2]>

microbenchmark::microbenchmark(
                  create_coefs(100),
                  create_coefs(1000),
                  create_coefs(10000),
                  times = 10)

Unit: seconds
                expr      min       lq     mean   median       uq      max
   create_coefs(100) 1.504900 1.575041 1.644763 1.644181 1.726109 1.779729
  create_coefs(1000) 1.835891 1.857457 1.905892 1.890821 1.901891 2.161929
 create_coefs(10000) 3.443087 3.594777 3.663977 3.657020 3.752581 3.829242
 neval
    10
    10
    10

Next, we calculate the coefficients for all announcements and save them in the results object. Next, we need to calculate the abnormal returns based on the following formula in Dellavigna and Pollet (2009).

$R_{t, k}^{(h, H)} = [Π_{j = h}^{H} (1 + R_{j, k})] - 1 - {\hat{β}}_{t, k} [Π_{j = h}^{H} (1 + R_{j, m}) - 1]$

In the results, the coef column is a list column where each row has two elements: alpha and beta. The unnest_wider function splits the list columns in two columns alpha and beta. Next we create the date 75 days after the announcement because that is the longest time frame the paper is interested (Dellavigna and Pollet 2009). Just like before, we add the relevant stock price data and Fama-French data. Our dataset now has a row for each day 0 to 75 days after the announcement date. Dellavigna and Pollet (2009) are interested in the short term abnormal return on the day of the announcement and the day after (0-1 days) and the long term abnormal return (2-75 days). I create the time_frame variable to denote whether a day should be counted in the short or long cumulative abnormal return. As an intermediate step, I calculate the products of the (market) returns + 1 (see the equation) by announcement and time frame (and beta because we need it for further calculations).

Finally, I use the pivot_wider function to transform the dataset from the long format to a wider format. Specifically, each announcement has two rows one for the short term car and one for the long term car. The transformation will create a new dataset where every row has a different announcement with a car_long and a car_short. We do that by taking the values_from the car variable and the name for the column from the time_frame variable.

N <- nrow(earn_ann)
results <- create_coefs(N)
# results <- create_coefs(100)

abnormal <- results %>%
  unnest_wider(coefs) %>% 
  mutate(date75 = anndat + 75) %>%
  left_join(select(all_stocks, permno, date, ret),
            by = join_by(permno == permno,
                         date75 >= date, anndat <= date)) %>%
  left_join(select(famafrench, mkt, rf, date),
            by = join_by(date == date)) %>% 
  mutate(time_frame = if_else(date - anndat <= 1, "short", "long")) %>% 
  summarise(raw = prod(1 + ret), mkt = prod(1 + mkt),
            .by = c(permno, anndat, time_frame, beta)) %>% 
  mutate(car = raw - 1 - beta * (mkt - 1)) %>%
  select(-beta, -raw, -mkt) %>%
  filter(!is.na(car)) %>%
  pivot_wider(values_from = car, names_from = time_frame,
              names_prefix = "car_")
glimpse(abnormal)

Rows: 127,724
Columns: 4
$ permno    <dbl> 10560, 10560, 10560, 10560, 10560, 10560, 10560, 10656, 1065…
$ anndat    <date> 2004-10-21, 2005-01-26, 2005-04-27, 2005-07-27, 2005-10-26,…
$ car_short <dbl> 0.097290562, 0.036462040, -0.063605028, -0.004177245, -0.014…
$ car_long  <dbl> 0.01028628, -0.20959780, 0.05661647, -0.18368201, 0.20741783…

Note

The pivot_wider and pivot_longer functions are really, really useful when you are working with poorly structured data. As a rule of thumb when you clean and summarise the data, you will want the data to be in a long format where each row is its own distinct unit of analysis (Wickham 2014). For instance, in this code we had a row for each return day after the announcement for each announcement. The big advantage is that we can easily, with code, change the groups of data that we want to summarise, mutate, or filter. If we want to change what the short time frame means, we can just change that one line of code (e.g. to data - anndat <= 5). The wider format is more useful to present the data. That is why I use it more often at the end of an analysis.

Putting it all together

Finally, we want to put all the data together. You can see that we only merge all the data at the end. This is good practice. I would advise to not even try to build up a complete dataset by starting with for instance the I/B/E/S data and than to gradually add the variables that you need. This is a very error prone process and it is hard to make changes without redoing the whole data cleaning process.

First, we need more stock price data and the market value from the CRSP data. We collect this in the clean_prices dataset. Remark that I remove the data where the price is negative because that means that there were no trades that day. Dellavigna and Pollet (2009) require the price and market value 5 days before the earnings announcement. My guess is that they are trying to avoid any effects of information leaking out into the market before the earnings announcement. I then merge the earnings announcement data with the analyst estimates data, the abnormal return data and the price and market value data. Finally, I calculate the earnings surprise as the difference between the actual and the median predicted earnings per share divided by the price per share to get the earnings surprise per dollar of market value.

clean_prices <- all_stocks %>%
  filter(prc > 0) %>%
  select(permno, date, prc, shrout) %>%
  mutate(market_value = prc * shrout) %>%
  select(-shrout) %>%
  print()

# A tibble: 14,300,084 × 4
   permno date         prc market_value
    <int> <date>     <dbl>        <dbl>
 1  10002 1994-03-10 13          38987 
 2  10010 1994-03-10  7.62       71278.
 3  10011 1994-03-10  7.12       37784.
 4  10012 1994-03-10  2.38       34630.
 5  10019 1994-03-10  8.75       45832.
 6  10025 1994-03-10 18.1       132838.
 7  10026 1994-03-10 18.2       188723.
 8  10032 1994-03-10 16.2       104910 
 9  10035 1994-03-10 18.2       237779.
10  10042 1994-03-10  3.56       90195.
# ℹ 14,300,074 more rows

surprise <- earn_ann %>%
  left_join(analyst,
            by = join_by(ticker, anndat, actual, pdf)) %>%
  left_join(abnormal,
            by = join_by(permno, anndat)) %>%
  mutate(date_minus5 = anndat - 5) %>%
  left_join(clean_prices,
            by = join_by(permno, closest(date_minus5 >= date))) %>%
  mutate(surprise = (actual - median) / prc) 
glimpse(surprise)

Rows: 154,235
Columns: 20
$ ticker       <chr> "A2", "A2", "A2", "A2", "A2", "A2", "A2", "AA0A", "AA0A",…
$ actual       <dbl> -0.0400, -0.1100, -0.1100, -0.0500, -0.0700, -0.1000, -0.…
$ pdf          <chr> "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D", "D…
$ anndats_act  <date> 2004-10-21, 2005-01-26, 2005-04-27, 2005-07-27, 2005-10-…
$ gvkey        <chr> "001081", "001081", "001081", "001081", "001081", "001081…
$ permno       <dbl> 10560, 10560, 10560, 10560, 10560, 10560, 10560, 10656, 1…
$ cusip        <chr> "00392410", "00392410", "00392410", "00392410", "00392410…
$ rdq          <date> 2004-10-21, 2005-01-26, 2005-04-27, 2005-07-27, 2005-10-…
$ anndat       <date> 2004-10-21, 2005-01-26, 2005-04-27, 2005-07-27, 2005-10-…
$ N            <int> 1, 2, 3, 2, 4, 5, 4, 1, NA, 1, 1, NA, NA, NA, 1, 1, 1, NA…
$ median       <dbl> -0.09000, -0.08965, -0.05740, -0.03700, -0.10610, -0.0783…
$ mean         <dbl> -0.090000, -0.089650, -0.068000, -0.037000, -0.100550, -0…
$ mean_days    <dbl> 21.000000, 8.000000, 6.666667, 13.500000, 7.000000, 9.200…
$ car_short    <dbl> 0.097290562, 0.036462040, -0.063605028, -0.004177245, -0.…
$ car_long     <dbl> 0.01028628, -0.20959780, 0.05661647, -0.18368201, 0.20741…
$ date_minus5  <date> 2004-10-16, 2005-01-21, 2005-04-22, 2005-07-22, 2005-10-…
$ date         <date> 2004-10-15, 2005-01-21, 2005-04-22, 2005-07-22, 2005-10-…
$ prc          <dbl> 5.630, 5.890, 4.530, 5.000, 3.260, 4.020, 4.280, 16.750, …
$ market_value <dbl> 2478185.2, 2592630.8, 1993992.8, 2200875.0, 1434970.5, 17…
$ surprise     <dbl> 0.0088809947, -0.0034550085, -0.0116114790, -0.0026000000…

Data Cleaning

Finally, I apply the data cleaning rules from Dellavigna and Pollet (2009). I also include the a print statement so that we can track how many observations are removed after each data cleaning step. The following rules are applied:

Remove the observations where surprise is missing.
Remove the observations where the median or actual estimate is larger than the price.
Remove penny stocks which I arbitrarily and after some googling decide to be stocks with a price lower than $2 per stock.
Remove the observations with earnings announcement on Saturday and Sunday.
I remove the observations with a car in the top and bottom .05% of observations.

winsorise <- 5/10000
main <- surprise %>%
  filter(!is.na(surprise)) %>% print(n = 0) %>%
  filter(abs(median) < prc, abs(actual) < prc) %>% print(n = 0) %>%
  filter(prc > 2) %>% print(n = 0) %>%
  mutate(weekday = wday(anndat, label = TRUE)) %>%
  filter(! weekday %in% c("Sat", "Sun")) %>% print(n = 0) %>%
  filter(percent_rank(car_long) >= winsorise,
         percent_rank(car_long) <= 1 - winsorise,
         percent_rank(car_short) >= winsorise,
         percent_rank(car_short) <= 1 - winsorise) %>%
  print()

# A tibble: 137,038 × 20
# ℹ 137,038 more rows
# ℹ 20 variables: ticker <chr>, actual <dbl>, pdf <chr>, anndats_act <date>,
#   gvkey <chr>, permno <dbl>, cusip <chr>, rdq <date>, anndat <date>, N <int>,
#   median <dbl>, mean <dbl>, mean_days <dbl>, car_short <dbl>, car_long <dbl>,
#   date_minus5 <date>, date <date>, prc <dbl>, market_value <dbl>,
#   surprise <dbl>
# A tibble: 135,566 × 20
# ℹ 135,566 more rows
# ℹ 20 variables: ticker <chr>, actual <dbl>, pdf <chr>, anndats_act <date>,
#   gvkey <chr>, permno <dbl>, cusip <chr>, rdq <date>, anndat <date>, N <int>,
#   median <dbl>, mean <dbl>, mean_days <dbl>, car_short <dbl>, car_long <dbl>,
#   date_minus5 <date>, date <date>, prc <dbl>, market_value <dbl>,
#   surprise <dbl>
# A tibble: 133,246 × 20
# ℹ 133,246 more rows
# ℹ 20 variables: ticker <chr>, actual <dbl>, pdf <chr>, anndats_act <date>,
#   gvkey <chr>, permno <dbl>, cusip <chr>, rdq <date>, anndat <date>, N <int>,
#   median <dbl>, mean <dbl>, mean_days <dbl>, car_short <dbl>, car_long <dbl>,
#   date_minus5 <date>, date <date>, prc <dbl>, market_value <dbl>,
#   surprise <dbl>
# A tibble: 133,154 × 21
# ℹ 133,154 more rows
# ℹ 21 variables: ticker <chr>, actual <dbl>, pdf <chr>, anndats_act <date>,
#   gvkey <chr>, permno <dbl>, cusip <chr>, rdq <date>, anndat <date>, N <int>,
#   median <dbl>, mean <dbl>, mean_days <dbl>, car_short <dbl>, car_long <dbl>,
#   date_minus5 <date>, date <date>, prc <dbl>, market_value <dbl>,
#   surprise <dbl>, weekday <ord>
# A tibble: 130,358 × 21
   ticker actual pdf   anndats_act gvkey  permno cusip    rdq        anndat    
   <chr>   <dbl> <chr> <date>      <chr>   <dbl> <chr>    <date>     <date>    
 1 A2     -0.04  D     2004-10-21  001081  10560 00392410 2004-10-21 2004-10-21
 2 A2     -0.11  D     2005-01-26  001081  10560 00392410 2005-01-26 2005-01-26
 3 A2     -0.11  D     2005-04-27  001081  10560 00392410 2005-04-27 2005-04-27
 4 A2     -0.05  D     2005-07-27  001081  10560 00392410 2005-07-27 2005-07-27
 5 A2     -0.07  D     2005-10-26  001081  10560 00392410 2005-10-26 2005-10-26
 6 A2     -0.1   D     2006-02-01  001081  10560 00392410 2006-02-01 2006-02-01
 7 A2     -0.08  D     2006-04-26  001081  10560 00392410 2006-04-26 2006-04-26
 8 AA0A    0.107 D     2003-05-08  001094  10656 00444610 2003-05-08 2003-05-08
 9 AA0A    0.129 D     2003-11-06  001094  10656 00444610 2003-11-06 2003-11-06
10 AA0A    0.127 D     2004-02-11  001094  10656 00444610 2004-02-11 2004-02-11
# ℹ 130,348 more rows
# ℹ 12 more variables: N <int>, median <dbl>, mean <dbl>, mean_days <dbl>,
#   car_short <dbl>, car_long <dbl>, date_minus5 <date>, date <date>,
#   prc <dbl>, market_value <dbl>, surprise <dbl>, weekday <ord>

saveRDS(main, here("data", "freaky_friday", "main.RDS"))

Overview of the datasets

File	Description
`earn_ann.RDS`	The information on the earnings announcements
`analyst.RDS`	The analyst estimates data
`main.RDS`	The main dataset to replicate the paper

References

Dellavigna, Stefano, and Joshua M. Pollet. 2009. “Investor Inattention and Friday Earnings Announcements.” The Journal of Finance 64 (2): 709–49. https://doi.org/10.1111/j.1540-6261.2009.01447.x.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.

Footnotes

R is fundamentally a programming language that is build around functions. The tidyverse partially works around that by making tibbles the primary object. Nevertheless, when you need to do some more advanced programming, creating functions is quite natural in R.↩︎
For comparison on my 2020 Mac Mini, it takes about 1.5 seconds for 100 announcements and 4.0 seconds for 10000 announcements. On my 2014 Macbook pro, that is respectively 4.3 seconds and 11.2 seconds.↩︎