library(here)
i_am("auxilary/introduction_to_rstudio_for_accfin.qmd")
here() starts at /Users/stijnmasschelein/Library/CloudStorage/Dropbox/Teaching/lecturenotes/method_package
library(tidyverse)
library(dbplyr)
This setup code loads a package, here
, that helps to navigate the folder structure in which I will create files. The tidyverse
package is the main package, we will use to manipulate datasets. There are other ways to program in R
. I think that to start of the tidyverse
way of looking at data is quite intuitive and it is very well supported. The intuition and how quickly you can do meaningful things will hopefully be clear by the end of this document.
library(here)
i_am("auxilary/introduction_to_rstudio_for_accfin.qmd")
here() starts at /Users/stijnmasschelein/Library/CloudStorage/Dropbox/Teaching/lecturenotes/method_package
library(tidyverse)
library(dbplyr)
We establish a connection with the WRDS databases with my account. WRDS is the data provider the university subscribes to and it has a lot of good tutorials on how to connect to their databases from R. Important, you will first have to ask for an account from the person in charge of the databases in the department. I will give a more extensive example later in the unit to really go over some of the details. This is meant as an illustration of what is possible with these type of files and code.
The code below establishes a connection with the WRDS servers under my username. The WRDS website has instructions on how to do set up the connection and how you can safely store your WRDS password so you do not have to type it. The settings I used here are all given by the WRDS instructions.
library(RPostgres)
<- dbConnect(Postgres(),
wrds host='wrds-pgdata.wharton.upenn.edu',
port=9737,
dbname='wrds',
user='stimas',
sslmode='require')
Next, I download the data that we need from the wrds
connection that I established before. I use the tbl
function from the dbplyr
package. This basically allows me to use the tidyverse
verbs to to interact with almost any database. I give the downloaded data a name wrds_data
so that I can use it further down the line. The only thing to remember with the tbl
function is that you need to run the collect
function at the end of the query so that the data is downloaded. Obviously, you need to know where the data is on the servers and how the variables are named. Sometimes, you can ask colleagues but other times, you will have to read the documentation on the WRDS website.
<- tbl(wrds, in_schema("execcomp", "anncomp")) %>%
wrds_data filter(year > 2010 & ceoann == "CEO") %>%
select(year, gvkey, cusip, exec_fullname, coname, ceoann,
execid, bonus, salary, stock_awards_fv, stock_unvest_val,
eip_unearn_num, eip_unearn_val, option_awards,
option_awards_blk_value, option_awards_num, tdc1, tdc2,%>%
shrown_tot_pct, becameceo, joined_co, reason) collect()
Here, I clean up the data and have a quick look after which I save it in my data
folder to share it with you.
library(tidyverse)
<- as_tibble(wrds_data) %>%
compensation rename_all(tolower)
glimpse(compensation)
Rows: 24,717
Columns: 22
$ year <dbl> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018…
$ gvkey <chr> "001004", "001004", "001004", "001004", "00100…
$ cusip <chr> "00036110", "00036110", "00036110", "00036110"…
$ exec_fullname <chr> "David P. Storch", "David P. Storch", "David P…
$ coname <chr> "AAR CORP", "AAR CORP", "AAR CORP", "AAR CORP"…
$ ceoann <chr> "CEO", "CEO", "CEO", "CEO", "CEO", "CEO", "CEO…
$ execid <chr> "09249", "09249", "09249", "09249", "09249", "…
$ bonus <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0…
$ salary <dbl> 867.000, 877.838, 906.449, 906.449, 755.250, 8…
$ stock_awards_fv <dbl> 2664.745, 619.200, 1342.704, 1695.200, 1150.50…
$ stock_unvest_val <dbl> 4227.273, 6018.000, 4244.165, 4103.283, 1334.0…
$ eip_unearn_num <dbl> 32.681, 56.681, 66.929, 83.415, 91.969, 157.96…
$ eip_unearn_val <dbl> 393.806, 1137.021, 1626.375, 2464.079, 2244.96…
$ option_awards <dbl> 578.460, 695.520, 1622.016, 0.000, 1150.500, 1…
$ option_awards_blk_value <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ option_awards_num <dbl> 49.022, 144.000, 158.400, 0.000, 153.810, 225.…
$ tdc1 <dbl> 5786.400, 4182.832, 5247.779, 5234.648, 4674.4…
$ tdc2 <dbl> 6105.117, 3487.312, 3809.626, 10428.375, 3523.…
$ shrown_tot_pct <dbl> 2.964, 2.893, 3.444, 3.877, 4.597, 5.417, 3.71…
$ becameceo <date> 1996-10-09, 1996-10-09, 1996-10-09, 1996-10-0…
$ joined_co <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ reason <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
saveRDS(compensation, here("data", "us-compensation-new.RDS"))
Now, we do the same for the market value of the company. Again, we can do this completely in R
. I don’t want to download all market values. Only for the firms that are already in our compensation data. What I do is to extract all the identifiers of the firms in our original data.
<- unique(compensation$gvkey)
gvkeys length(gvkeys)
[1] 2579
The remainder of the code follows a fairly similar pattern as before. I extract three variables from the WRDS databases fyear
and gvkey
so that we can match it with the compensation data and the mkvalt
at the end of the fiscal year.
<- tbl(wrds, in_schema("comp", "funda")) %>%
value filter(gvkey %in% gvkeys, fyear > 2010) %>%
select(fyear, gvkey, mkvalt) %>%
collect()
glimpse(value)
Rows: 55,377
Columns: 3
$ fyear <int> 2011, 2011, 2012, 2013, 2013, 2014, 2014, 2015, 2015, 2016, 201…
$ gvkey <chr> "001004", "001004", "001004", "001004", "001004", "001004", "00…
$ mkvalt <dbl> 485.2897, 485.2897, 790.0029, 961.3080, 961.3080, 1046.3954, 10…
<- distinct(value)
value glimpse(value)
Rows: 26,684
Columns: 3
$ fyear <int> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 202…
$ gvkey <chr> "001004", "001004", "001004", "001004", "001004", "001004", "00…
$ mkvalt <dbl> 485.2897, 790.0029, 961.3080, 1046.3954, 842.5112, 1200.3288, 1…
saveRDS(value, here("data", "us-value-new.RDS"))
<- readRDS(here("data", "us-compensation-new.RDS"))
us_comp glimpse(us_comp)
Rows: 23,096
Columns: 22
$ year <dbl> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018…
$ gvkey <chr> "001004", "001004", "001004", "001004", "00100…
$ cusip <chr> "00036110", "00036110", "00036110", "00036110"…
$ exec_fullname <chr> "David P. Storch", "David P. Storch", "David P…
$ coname <chr> "AAR CORP", "AAR CORP", "AAR CORP", "AAR CORP"…
$ ceoann <chr> "CEO", "CEO", "CEO", "CEO", "CEO", "CEO", "CEO…
$ execid <chr> "09249", "09249", "09249", "09249", "09249", "…
$ bonus <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.00…
$ salary <dbl> 867.000, 877.838, 906.449, 906.449, 755.250, 8…
$ stock_awards_fv <dbl> 2664.745, 619.200, 1342.704, 1695.200, 1150.50…
$ stock_unvest_val <dbl> 4227.273, 6018.000, 4244.165, 4103.283, 1334.0…
$ eip_unearn_num <dbl> 32.681, 56.681, 66.929, 83.415, 91.969, 157.96…
$ eip_unearn_val <dbl> 393.806, 1137.021, 1626.375, 2464.079, 2244.96…
$ option_awards <dbl> 578.460, 695.520, 1622.016, 0.000, 1150.500, 1…
$ option_awards_blk_value <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ option_awards_num <dbl> 49.022, 144.000, 158.400, 0.000, 153.810, 225.…
$ tdc1 <dbl> 5786.400, 4182.832, 5247.779, 5234.648, 4674.4…
$ tdc2 <dbl> 6105.117, 3487.312, 3809.626, 10428.375, 3523.…
$ shrown_tot_pct <dbl> 2.964, 2.893, 3.444, 3.877, 4.597, 5.417, 3.71…
$ becameceo <date> 1996-10-09, 1996-10-09, 1996-10-09, 1996-10-0…
$ joined_co <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ reason <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
filter(us_comp, gvkey == "001004")
# A tibble: 11 × 22
year gvkey cusip exec_fullname coname ceoann execid bonus salary
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 2011 001004 00036110 David P. Storch AAR C… CEO 09249 0 867
2 2012 001004 00036110 David P. Storch AAR C… CEO 09249 0 878.
3 2013 001004 00036110 David P. Storch AAR C… CEO 09249 0 906.
4 2014 001004 00036110 David P. Storch AAR C… CEO 09249 0 906.
5 2015 001004 00036110 David P. Storch AAR C… CEO 09249 0 755.
6 2016 001004 00036110 David P. Storch AAR C… CEO 09249 0 835
7 2017 001004 00036110 David P. Storch AAR C… CEO 09249 0 941
8 2018 001004 00036110 John McClain Holmes,… AAR C… CEO 48195 0 750
9 2019 001004 00036110 John McClain Holmes,… AAR C… CEO 48195 0 801.
10 2020 001004 00036110 John McClain Holmes,… AAR C… CEO 48195 0 781.
11 2021 001004 00036110 John McClain Holmes,… AAR C… CEO 48195 0 925
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
# eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
# option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
# tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
# reason <chr>
filter(us_comp, bonus == 0, year == 2012)
# A tibble: 1,678 × 22
year gvkey cusip exec_fullname coname ceoann execid bonus salary
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 2012 001004 00036110 David P. Storch AAR C… CEO 09249 0 878.
2 2012 001045 02376R10 Thomas W. Horton AMERI… CEO 26059 0 618.
3 2012 001075 72348410 Donald E. Brandt, CPA PINNA… CEO 05835 0 1146
4 2012 001076 74319R10 Ronald W. Allen PROG … CEO 00283 0 850
5 2012 001078 00282410 Miles D. White, M.B.… ABBOT… CEO 14300 0 1900
6 2012 001094 00444610 Albert L. Eilender ACETO… CEO 46204 0 626.
7 2012 001161 00790310 Rory P. Read ADVAN… CEO 42390 0 1000.
8 2012 001177 00817Y10 Mark Thomas Bertolini AETNA… CEO 31029 0 977.
9 2012 001209 00915810 John E. McGlade AIR P… CEO 27315 0 1200
10 2012 001230 01165910 Bradley D. Tilden ALASK… CEO 21308 0 420.
# ℹ 1,668 more rows
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
# eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
# option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
# tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
# reason <chr>
gvkey
of Gamestop. Then use filter
with exactly that key.filter(us_comp, str_detect(tolower(coname),
"gamestop"))
# A tibble: 11 × 22
year gvkey cusip exec_fullname coname ceoann execid bonus salary
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 2011 145049 36467W10 J. Paul Raines GAMESTOP … CEO 35719 2254. 1028.
2 2012 145049 36467W10 J. Paul Raines GAMESTOP … CEO 35719 1515 1050.
3 2013 145049 36467W10 J. Paul Raines GAMESTOP … CEO 35719 975 1059.
4 2014 145049 36467W10 J. Paul Raines GAMESTOP … CEO 35719 0 1201.
5 2015 145049 36467W10 J. Paul Raines GAMESTOP … CEO 35719 0 1247.
6 2016 145049 36467W10 J. Paul Raines GAMESTOP … CEO 35719 0 1285.
7 2017 145049 36467W10 J. Paul Raines GAMESTOP … CEO 35719 0 1293.
8 2018 145049 36467W10 Shane S. Kim GAMESTOP … CEO 56940 25 963.
9 2019 145049 36467W10 George E. Sherman GAMESTOP … CEO 46080 150 846.
10 2020 145049 36467W10 George E. Sherman GAMESTOP … CEO 46080 0 1005.
11 2021 145049 36467W10 Matthew Furlong GAMESTOP … CEO 61979 1595. 115.
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
# eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
# option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
# tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
# reason <chr>
filter(us_comp, gvkey == "145049")
# A tibble: 11 × 22
year gvkey cusip exec_fullname coname ceoann execid bonus salary
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 2011 145049 36467W10 J. Paul Raines GAMESTOP … CEO 35719 2254. 1028.
2 2012 145049 36467W10 J. Paul Raines GAMESTOP … CEO 35719 1515 1050.
3 2013 145049 36467W10 J. Paul Raines GAMESTOP … CEO 35719 975 1059.
4 2014 145049 36467W10 J. Paul Raines GAMESTOP … CEO 35719 0 1201.
5 2015 145049 36467W10 J. Paul Raines GAMESTOP … CEO 35719 0 1247.
6 2016 145049 36467W10 J. Paul Raines GAMESTOP … CEO 35719 0 1285.
7 2017 145049 36467W10 J. Paul Raines GAMESTOP … CEO 35719 0 1293.
8 2018 145049 36467W10 Shane S. Kim GAMESTOP … CEO 56940 25 963.
9 2019 145049 36467W10 George E. Sherman GAMESTOP … CEO 46080 150 846.
10 2020 145049 36467W10 George E. Sherman GAMESTOP … CEO 46080 0 1005.
11 2021 145049 36467W10 Matthew Furlong GAMESTOP … CEO 61979 1595. 115.
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
# eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
# option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
# tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
# reason <chr>
select(us_comp, year, coname, bonus, salary, total = tdc1)
# A tibble: 23,096 × 5
year coname bonus salary total
<dbl> <chr> <dbl> <dbl> <dbl>
1 2011 AAR CORP 0 867 5786.
2 2012 AAR CORP 0 878. 4183.
3 2013 AAR CORP 0 906. 5248.
4 2014 AAR CORP 0 906. 5235.
5 2015 AAR CORP 0 755. 4674.
6 2016 AAR CORP 0 835 6073.
7 2017 AAR CORP 0 941 6284.
8 2018 AAR CORP 0 750 3344.
9 2019 AAR CORP 0 801. 4736.
10 2020 AAR CORP 0 781. 4123.
# ℹ 23,086 more rows
We first build up the dataset with select
and mutate
and the pipe %>%
. When we are satisfied with the result, we can save the dataset as an R object with the name us_comp_small
.
<-
us_comp_small select(us_comp, year, coname, bonus,
total = tdc1) %>%
salary, mutate(salary_percentage = salary/total)
print(us_comp_small)
# A tibble: 23,096 × 6
year coname bonus salary total salary_percentage
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 2011 AAR CORP 0 867 5786. 0.150
2 2012 AAR CORP 0 878. 4183. 0.210
3 2013 AAR CORP 0 906. 5248. 0.173
4 2014 AAR CORP 0 906. 5235. 0.173
5 2015 AAR CORP 0 755. 4674. 0.162
6 2016 AAR CORP 0 835 6073. 0.138
7 2017 AAR CORP 0 941 6284. 0.150
8 2018 AAR CORP 0 750 3344. 0.224
9 2019 AAR CORP 0 801. 4736. 0.169
10 2020 AAR CORP 0 781. 4123. 0.189
# ℹ 23,086 more rows
group_by(us_comp, gvkey) %>%
summarise(N = n(), N_CEO = n_distinct(execid),
average = mean(salary),
sd = sd(salary),
med = median(salary),
min = min(salary),
max = max(salary)) %>%
ungroup() %>%
filter(med > 1000)
# A tibble: 599 × 8
gvkey N N_CEO average sd med min max
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001075 12 2 1219. 111. 1222. 1091 1395
2 001078 12 2 1788. 224. 1900 1298. 1973.
3 001161 12 3 961. 151. 1000. 566. 1149.
4 001209 12 2 1239. 131. 1200 905. 1402.
5 001274 11 1 1082. 101. 1030 1000 1250
6 001300 12 2 1714. 152. 1750 1415. 1890
7 001380 12 1 1500 0 1500 1500 1500
8 001440 12 2 1339. 176. 1350. 903. 1522.
9 001447 12 2 1794. 262. 2000 1488. 2038.
10 001449 12 1 1438. 12.1 1441. 1399. 1441.
# ℹ 589 more rows
summarise(us_comp, N = n(), N_CEO = n_distinct(execid),
average = mean(salary),
sd = sd(salary),
med = median(salary),
min = min(salary),
max = max(salary),
.by = gvkey) %>%
filter(med > 1000)
# A tibble: 599 × 8
gvkey N N_CEO average sd med min max
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001075 12 2 1219. 111. 1222. 1091 1395
2 001078 12 2 1788. 224. 1900 1298. 1973.
3 001161 12 3 961. 151. 1000. 566. 1149.
4 001209 12 2 1239. 131. 1200 905. 1402.
5 001274 11 1 1082. 101. 1030 1000 1250
6 001300 12 2 1714. 152. 1750 1415. 1890
7 001380 12 1 1500 0 1500 1500 1500
8 001440 12 2 1339. 176. 1350. 903. 1522.
9 001447 12 2 1794. 262. 2000 1488. 2038.
10 001449 12 1 1438. 12.1 1441. 1399. 1441.
# ℹ 589 more rows