This is a not necessarily coherent collection of code that I used to develop my course material to teach the Baker, Larcker, and Wang (2022) paper on staggered difference-in-differences, specifically simulation 6. I used the opportunity to familiarise myself with some of the fastverse packages and I sprinkled some timing tests in the code here and there.1 If you are going to work with large datasets and just summarising the data takes a lot of computing time, switching over from the tidyverse to (some of) the fastverse packages might be a worthwhile investment of your time. My other interest in writing this all up is that it will allow me to test other estimators and other data generating processes.
R Setup
The here package helps with managing the different files associated with my teaching material. The tidyverse is what I normally use for data management and what I teach because I like the pedagogy, but it's not as fast as data.table and collapse. cowplot is my preference for decent defaults in plots. RPostgres is necessary to get the data from the WRDS server. fixest is the gold standard for regression estimators with fixed effects as of the time of writing. modelsummary is a package to create tables from regression objects. gof_omit and stars will override some of the defaults in modelsummary. run_wrds is just a boolean to control whether I want to rerun the WRDS query.2
library(here)
here() starts at /Users/stijnmasschelein/Dropbox/github/site
library(tidyverse)
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cowplot)
Attaching package: 'cowplot'
The following object is masked from 'package:lubridate':
stamp
theme_set(theme_cowplot())
library(collapse)
collapse 2.1.5, see ?`collapse-package` or ?`collapse-documentation`
Attaching package: 'collapse'
The following object is masked from 'package:tidyr':
replace_na
The following object is masked from 'package:stats':
D
library(data.table)
Attaching package: 'data.table'
The following object is masked from 'package:collapse':
fdroplevels
The following objects are masked from 'package:lubridate':
hour, isoweek, mday, minute, month, quarter, second, wday, week,
yday, year
The following objects are masked from 'package:dplyr':
between, first, last
The following object is masked from 'package:purrr':
transpose
library(RPostgres)
library(fixest)
Attaching package: 'fixest'
The following object is masked from 'package:collapse':
fdim
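The remaining setup that the introduction mentions is not shown in the output above. A minimal sketch, assuming the modelsummary defaults are stored in variables to be passed along later; the exact values are my guesses, not the original settings.
library(modelsummary)
# goodness-of-fit rows to omit from regression tables (assumed pattern)
gof_omit <- "R2|AIC|BIC|Log.Lik."
# significance stars (assumed thresholds)
stars <- c("*" = 0.1, "**" = 0.05, "***" = 0.01)
# only rerun the WRDS query when explicitly asked
run_wrds <- FALSE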
The data comes from WRDS, which has pretty good documentation on how to set up R and RStudio to work with it. You do need a WRDS account though and they are not cheap. The code in this section does not run by default.3
I used Andrew Baker’s Github repository for the paper to replicate the underlying data, but I implemented the steps in my own way. I don’t fully replicate the sample but it’s good enough for the example. I have slightly more observations and one day I will figure out where I went wrong.
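The connection object wrds used below is not created in the code shown here. This is a sketch following the WRDS documentation for RPostgres; the username is a placeholder.
if (run_wrds) {
  wrds <- dbConnect(Postgres(),
                    host = "wrds-pgdata.wharton.upenn.edu",
                    port = 9737,
                    dbname = "wrds",
                    sslmode = "require",
                    user = "my_username")  # placeholder, use your WRDS account
}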
sql1 <- "SELECT fyear, gvkey, oibdp, at, sich
         FROM COMP.FUNDA
         WHERE fyear > 1978 AND fyear < 2016
           AND indfmt = 'INDL' AND datafmt = 'STD'
           AND popsrc = 'D' AND consol = 'C'
           AND fic = 'USA'"
res <- dbSendQuery(wrds, sql1)
dt <- as_tibble(dbFetch(res)) %>%
  rename_all(tolower)
dbClearResult(res)
roas <- dt %>%
  rename(year = fyear, firm = gvkey) %>%
  filter(!is.na(year), !is.na(at), !is.na(oibdp), at > 0) %>%
  group_by(firm) %>%
  arrange(year) %>%
  mutate(roa = oibdp / lag(at)) %>%
  ungroup() %>%
  select(year, firm, roa, sich) %>%
  filter(!is.na(roa), year > 1979)
glimpse(roas)
# winsorize ROA at the 99th percentile, and censor at -1
wins <- function(x) {
  case_when(
    is.na(x) ~ NA_real_,
    x < -1 ~ -1,
    x > quantile(x, 0.99, na.rm = TRUE) ~ quantile(x, 0.99, na.rm = TRUE),
    TRUE ~ x
  )
}
# winsorize ROA by year
roas <- roas %>%
  group_by(year) %>%
  mutate(roa = wins(roa)) %>%
  arrange(firm, year) %>%
  ungroup()
The data was last modified on 2025-12-19 16:30:40.63027
Decomposition
If you are interested in creating your own Baker file, this is less relevant. It’s just my experimentation with the collapse package to mimic a naive calculation of firm and year fixed effects with dplyr. Long story short, decomp_dplyr is almost 20 times slower than decomp_fast. With decent sized data and simulation exercises, I am definitely more motivated to use the fastverse packages now. decomp_fixest is the correct decomposition into year and firm fixed effects. It’s not directly comparable to the other decompositions, but given that it is actually the correct thing to do, it holds up pretty well in the timing.
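The three functions are not reproduced in the output, so here is a hedged reconstruction of what they could look like; the bodies are my best guess at the naive dplyr and collapse calculations and the correct fixest decomposition, not the original code.
# naive firm and year means with dplyr
decomp_dplyr <- function() {
  roas %>%
    group_by(firm) %>%
    mutate(ff = mean(roa)) %>%
    group_by(year) %>%
    mutate(fy = mean(roa - ff)) %>%
    ungroup() %>%
    mutate(resid = roa - ff - fy)
}
# the same naive calculation with collapse's grouped transformations
decomp_fast <- function() {
  ff <- fbetween(roas$roa, g = roas$firm)
  fy <- fbetween(roas$roa - ff, g = roas$year)
  list(ff = ff, fy = fy, resid = roas$roa - ff - fy)
}
# the correct two-way decomposition via a fixed effects regression
decomp_fixest <- function() {
  est <- feols(roa ~ 1 | firm + year, data = roas)
  fe <- fixef(est)
  list(ff = fe$firm, fy = fe$year, resid = resid(est))
}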
Warning in microbenchmark::microbenchmark(dplyr = decomp_dplyr(), collapse =
decomp_fast(), : less accurate nanosecond times to avoid potential integer
overflows
Unit: milliseconds
expr min lq mean median uq max neval
dplyr 33.014553 34.856130 39.561814 37.013652 38.133669 99.135540 20
collapse 1.641148 1.830896 2.272443 1.895389 2.139277 6.198052 20
fixest 36.879295 39.413751 45.631204 41.339624 45.942407 98.203979 20
decomp <- decomp_fixest()
glimpse(decomp)
List of 3
$ ff : Named num [1:12655] 0.0801 0.1526 -0.1387 0.2701 0.1107 ...
..- attr(*, "names")= chr [1:12655] "1003" "1004" "1007" "1009" ...
$ fy : Named num [1:36] 0.033907 0.028517 -0.002745 0 -0.000281 ...
..- attr(*, "names")= chr [1:36] "1980" "1981" "1982" "1983" ...
$ resid: num [1:180397] 0.2995 0.0169 0.0626 0.1225 -0.0511 ...
Sampling
Now we can create new samples with the fixed effects for the ROAs derived above. I use the data.table package for most of the data manipulation. The first bit of code prepares everything that does not change across conditions in the observations data. We keep the identifiers for the firm and year. The new firms data keeps the firm fixed effects and the years data keeps the year fixed effects from the decomposition above. The residuals are stored as a vector and represent the empirical distribution of the unexplained part of the ROAs. The remaining variables reflect the treatment effects, which are applied to three groups of states; states are randomly assigned to one of the three groups.4
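A minimal sketch of this preparation step, using the decomposition from above; the staggered treatment years are assumptions on my part (group 3 being treated in 2007 is consistent with the plotting code below).
# firm-year identifiers that stay fixed across simulated samples
observations <- as.data.table(roas)[, .(firm, year)]
# empirical distributions of the firm and year fixed effects
firms <- data.table(firm = names(decomp$ff), ff = decomp$ff)
years <- data.table(year = as.integer(names(decomp$fy)), fy = decomp$fy)
# empirical distribution of the unexplained part of ROA
residuals <- decomp$resid
# staggered treatment years for the three groups of states (illustrative)
treatment_groups <- c(1989, 1998, 2007)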
print(effect_plot)
Warning: Removed 105 rows containing missing values or values outside the scale range
(`geom_label()`).
The actual data simulation is in the new_sample function, which again mostly uses data.table. The function goes through the following steps; a sketch of the full function follows the list. I think I am adding more variation than the Baker et al. paper, so this is by no means a one-for-one replication, but it allows me to build extensions if I want to.
Each firm in the firms data gets allocated a random firm fixed effect based on the empirical distribution of the firm fixed effects and gets allocated to one of 50 states.
Based on the state, I assign the firm to a treatment group and treatment effect.
Each year gets a randomly sampled year fixed effect based on the empirical distribution.
The two fixed effects are merged to the firm-year observations.
The simulated ROA is calculated based on the fixed effects, the treatment effect, and random residuals.
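A minimal sketch of new_sample() implementing these steps; the size of the treatment effect and its linear growth are my assumptions, not the original parameters.
new_sample <- function() {
  # 1. random firm fixed effects and a random state for every firm
  sim_firms <- data.table(
    firm = firms$firm,
    ff = sample(firms$ff, nrow(firms), replace = TRUE),
    state = sample(1:50, nrow(firms), replace = TRUE))
  # 2. states are randomly assigned to one of the three treatment groups
  state_groups <- data.table(state = 1:50,
                             group = sample(1:3, 50, replace = TRUE))
  sim_firms <- state_groups[sim_firms, on = "state"]
  sim_firms[, treatment_group := treatment_groups[group]]
  # 3. random year fixed effects from the empirical distribution
  sim_years <- data.table(year = years$year,
                          fy = sample(years$fy, nrow(years), replace = TRUE))
  # 4. merge both fixed effects to the firm-year observations
  sim <- sim_years[sim_firms[observations, on = "firm"], on = "year"]
  # 5. ROA = fixed effects + treatment effect + random residual; the
  # treatment effect grows linearly after treatment (assumed slope of
  # 0.01 * group per year)
  sim[, rel_year := year - treatment_group]
  sim[, treatment := fifelse(rel_year >= 0, (rel_year + 1) * 0.01 * group, 0)]
  sim[, roa := ff + fy + treatment + sample(residuals, .N, replace = TRUE)]
  sim[]
}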
During the development of the function I kept track of its timing. Overall, I am quite impressed with how fast this is.
Unit: milliseconds
expr min lq mean median uq max neval
sample 14.45795 17.47049 23.27825 19.10744 21.25053 90.0704 100
Analyse
For now I focus on three estimators: the traditional two-way fixed effects estimator, the Sun and Abraham (2021) estimator, and the Callaway and Sant’Anna (2021) estimator. The Sun and Abraham (2021) estimator is easy to implement as a fixed effects regression and conceptually easy to explain, because the hard work is done by only using the data that allows for the appropriate comparisons. The Callaway and Sant’Anna (2021) estimator is more robust and is easier to extend to include selection effects, automatic selection of control groups, and additional control variables.
Two-way Fixed Effects
new <- new_sample()
twoway <- feols(roa ~ ifelse(year >= treatment_group, 1, 0) | year + firm,
                cluster = "state", data = new)
etable(twoway)
The hard work is done by making sure that we make no invalid comparisons. I try to explain this with the following figure. I assume we are only interested in the treatment effect up to 5 years after the implementation of the treatment. For group 3, which gets the treatment last, there is no appropriate control group (no_control_group), so we do not use this group after the treatment. For the other groups, we delete the data more than 5 years after treatment (no_interest). We are left with controls, which are all pre-treatment (control), and the 5 years of treatment in groups 1 and 2 (treatment).
plot_dt[, `:=`(
  roa = roa - 0.002 * group,
  sample_groups = case_when(
    year >= 2007 - 1 ~ "no_control_group",
    year - treatment_groups[group] > 5 - 1 ~ "no_interest",
    year < treatment_groups[group] - 1 ~ "control",
    year >= treatment_groups[group] - 1 ~ "treatment"
  )
)]
sun_plot <- ggplot(plot_dt, aes(y = roa, x = year, group = group,
                                label = label, colour = sample_groups)) +
  geom_line(linewidth = 2) +
  viridis::scale_colour_viridis(discrete = TRUE)
The actual code to run the Sun and Abraham (2021) estimator is available in fixest. We first need to construct the cleaned data set following the figure above. Next, we run the regression with fixest and use the sunab function, which automatically creates indicators for each relative year and treatment group combination. It excludes rel_year = -1, the year before the treatment, which serves as the benchmark to which all the estimated coefficients are compared.
The default output for the Sun and Abraham estimator is to report the aggregated5 relative time coefficients. The overall average treatment effect can be extracted with the summary function.
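The fixest code is not reproduced in the output above, so here is a minimal sketch, assuming the variable names from new_sample(); the filter mirrors the figure, while sunab() and agg = "att" are the actual fixest interface.
# keep controls and the first 5 treatment years of groups 1 and 2
sun_data <- new[year < 2007 - 1 & year - treatment_group < 5]
# sunab() builds the relative year x cohort indicators,
# with rel_year = -1 as the benchmark
sun <- feols(roa ~ sunab(treatment_group, year) | year + firm,
             cluster = "state", data = sun_data)
etable(sun)
# the overall average treatment effect
summary(sun, agg = "att")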
For the Callaway and Sant’Anna code to work, we first need to set the treatment group variable to 0 for the group that is never treated (in the subsample), and the code needs to exclude all firms that don’t have any pre-treatment observations.
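A sketch of that preparation and the estimation with the did package; the data.table names follow the sketches above and I assume the firm identifier is numeric, as did requires.
library(did)
cs_data <- copy(sun_data)
# group 3 (treated in 2007) is never treated in the subsample
cs_data[treatment_group == 2007, treatment_group := 0]
# drop firms without any pre-treatment observations
cs_data[, min_year := min(year), by = firm]
cs_data <- cs_data[treatment_group == 0 | min_year < treatment_group]
cs <- att_gt(yname = "roa", tname = "year", idname = "firm",
             gname = "treatment_group", data = cs_data)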
Warning in did_standarization(data, args): 8987 units are missing in some
periods. Converting to balanced panel by dropping them.
ATT_cs <- aggte(cs, type = "simple")
ATT_cs$overall.att
[1] 0.09987021
True ATT
The true ATT will be different if we use the full sample, restrict the treated observations to 5 years after the treatment for groups 1 and 2, or apply any other exclusion. The following code calculates the expected true ATT for the full sample or by cohort. The true_att calculation exploits the linear increase in the treatment effect.
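A hedged reconstruction of that calculation, assuming the treatment effect (rel_year + 1) * delta from the sampling sketch above: a linearly increasing effect averages out to its midpoint.
true_att <- function(delta, n_post) {
  # mean effect over n_post treated years with effect (k + 1) * delta in
  # event year k; equals delta * (n_post + 1) / 2
  mean(seq_len(n_post) * delta)
}
# e.g. group 1 (treated in 1989, assumed delta of 0.01) in the full sample
true_att(0.01, 2015 - 1989 + 1)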
Everything is now in place to run a simulation that simulates new data and runs the two-way fixed effect OLS estimator, the Sun and Abraham (2021) estimator, and the Callaway and Sant’Anna (2021) estimator. I extract the estimate of the ATT from each estimator and save the estimates as the results.
The rest of the code runs the simulation nsim times and plots the estimate with the expected true ATT.
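A minimal sketch of one simulation draw, combining the estimators from the sections above; the wrapper name one_sim and the value of nsim are illustrative.
one_sim <- function() {
  new <- new_sample()
  twoway <- feols(roa ~ ifelse(year >= treatment_group, 1, 0) | year + firm,
                  cluster = "state", data = new)
  sun_data <- new[year < 2007 - 1 & year - treatment_group < 5]
  sun <- feols(roa ~ sunab(treatment_group, year) | year + firm,
               cluster = "state", data = sun_data)
  cs_data <- copy(sun_data)[treatment_group == 2007, treatment_group := 0]
  cs <- att_gt(yname = "roa", tname = "year", idname = "firm",
               gname = "treatment_group", data = cs_data)
  data.table(twoway = coef(twoway)[[1]],
             sun_abraham = summary(sun, agg = "att")$coeftable[1, 1],
             callaway_santanna = aggte(cs, type = "simple")$overall.att)
}
nsim <- 100  # illustrative number of simulation runs
results <- rbindlist(lapply(seq_len(nsim), \(i) one_sim()))
summary(results)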
twoway sun_abraham callaway_santanna
Min. :-0.019456 Min. :0.03004 Min. :0.0764
1st Qu.:-0.014987 1st Qu.:0.04075 1st Qu.:0.1090
Median :-0.013805 Median :0.04365 Median :0.1166
Mean :-0.013747 Mean :0.04359 Mean :0.1164
3rd Qu.:-0.012435 3rd Qu.:0.04654 3rd Qu.:0.1238
Max. :-0.006342 Max. :0.05650 Max. :0.1517
Baker, Andrew C., David F. Larcker, and Charles C. Y. Wang. 2022. “How Much Should We Trust Staggered Difference-in-Differences Estimates?” Journal of Financial Economics 144 (2): 370–95. https://doi.org/10.1016/j.jfineco.2022.01.004.
Callaway, Brantly, and Pedro H. C. Sant’Anna. 2021. “Difference-in-Differences with Multiple Time Periods.” Journal of Econometrics 225 (2): 200–230. https://doi.org/10.1016/j.jeconom.2020.12.001.
Sun, Liyang, and Sarah Abraham. 2021. “Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects.” Journal of Econometrics 225 (2): 175–99. https://doi.org/10.1016/j.jeconom.2020.09.006.
Footnotes
You may or may not be interested in this but there you go. It’s free.↩︎
This is slightly different from the Baker et al. paper. I allow that by random chance some of the groups are larger than the others, while the Baker et al. simulation keeps the groups (almost) equal in size. I was too lazy to change it when I noticed the difference, but I am also interested in what happens to the average treatment effect when some group effects are estimated more precisely, so my simulation is at least consistent with that goal.↩︎