Introduction to Rstudio for Accounting and Finance

Author

Stijn Masschelein

This setup code loads a package, here, that helps to navigate the folder structure in which I will create files. The tidyverse package is the main package, we will use to manipulate datasets. There are other ways to program in R. I think that to start of the tidyverse way of looking at data is quite intuitive and it is very well supported. The intuition and how quickly you can do meaningful things will hopefully be clear by the end of this document.

library(here) 
i_am("auxilary/introduction_to_rstudio_for_accfin.qmd")
here() starts at /Users/stijnmasschelein/Library/CloudStorage/Dropbox/Teaching/lecturenotes/method_package
library(tidyverse)
library(dbplyr)

Get the data

We establish a connection with the WRDS databases with my account. WRDS is the data provider the university subscribes to and it has a lot of good tutorials on how to connect to their databases from R. Important, you will first have to ask for an account from the person in charge of the databases in the department. I will give a more extensive example later in the unit to really go over some of the details. This is meant as an illustration of what is possible with these type of files and code.

The code below establishes a connection with the WRDS servers under my username. The WRDS website has instructions on how to do set up the connection and how you can safely store your WRDS password so you do not have to type it. The settings I used here are all given by the WRDS instructions.

library(RPostgres)
wrds <- dbConnect(Postgres(),
              host='wrds-pgdata.wharton.upenn.edu',
              port=9737,
              dbname='wrds',
              user='stimas',
              sslmode='require')

Next, I download the data that we need from the wrds connection that I established before. I use the tbl function from the dbplyr package. This basically allows me to use the tidyverse verbs to to interact with almost any database. I give the downloaded data a name wrds_data so that I can use it further down the line. The only thing to remember with the tbl function is that you need to run the collect function at the end of the query so that the data is downloaded. Obviously, you need to know where the data is on the servers and how the variables are named. Sometimes, you can ask colleagues but other times, you will have to read the documentation on the WRDS website.

wrds_data <- tbl(wrds, in_schema("execcomp", "anncomp")) %>%
  filter(year > 2010 & ceoann == "CEO") %>%
  select(year, gvkey, cusip, exec_fullname, coname, ceoann,
    execid, bonus, salary, stock_awards_fv, stock_unvest_val,
    eip_unearn_num, eip_unearn_val, option_awards,
    option_awards_blk_value, option_awards_num, tdc1, tdc2,
    shrown_tot_pct, becameceo, joined_co, reason) %>%
  collect()

Here, I clean up the data and have a quick look after which I save it in my data folder to share it with you.

library(tidyverse)
compensation <- as_tibble(wrds_data) %>%
  rename_all(tolower)
glimpse(compensation)
Rows: 24,717
Columns: 22
$ year                    <dbl> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018…
$ gvkey                   <chr> "001004", "001004", "001004", "001004", "00100…
$ cusip                   <chr> "00036110", "00036110", "00036110", "00036110"…
$ exec_fullname           <chr> "David P. Storch", "David P. Storch", "David P…
$ coname                  <chr> "AAR CORP", "AAR CORP", "AAR CORP", "AAR CORP"…
$ ceoann                  <chr> "CEO", "CEO", "CEO", "CEO", "CEO", "CEO", "CEO…
$ execid                  <chr> "09249", "09249", "09249", "09249", "09249", "…
$ bonus                   <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0…
$ salary                  <dbl> 867.000, 877.838, 906.449, 906.449, 755.250, 8…
$ stock_awards_fv         <dbl> 2664.745, 619.200, 1342.704, 1695.200, 1150.50…
$ stock_unvest_val        <dbl> 4227.273, 6018.000, 4244.165, 4103.283, 1334.0…
$ eip_unearn_num          <dbl> 32.681, 56.681, 66.929, 83.415, 91.969, 157.96…
$ eip_unearn_val          <dbl> 393.806, 1137.021, 1626.375, 2464.079, 2244.96…
$ option_awards           <dbl> 578.460, 695.520, 1622.016, 0.000, 1150.500, 1…
$ option_awards_blk_value <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ option_awards_num       <dbl> 49.022, 144.000, 158.400, 0.000, 153.810, 225.…
$ tdc1                    <dbl> 5786.400, 4182.832, 5247.779, 5234.648, 4674.4…
$ tdc2                    <dbl> 6105.117, 3487.312, 3809.626, 10428.375, 3523.…
$ shrown_tot_pct          <dbl> 2.964, 2.893, 3.444, 3.877, 4.597, 5.417, 3.71…
$ becameceo               <date> 1996-10-09, 1996-10-09, 1996-10-09, 1996-10-0…
$ joined_co               <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ reason                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
saveRDS(compensation, here("data", "us-compensation-new.RDS"))

Now, we do the same for the market value of the company. Again, we can do this completely in R. I don’t want to download all market values. Only for the firms that are already in our compensation data. What I do is to extract all the identifiers of the firms in our original data.

gvkeys <- unique(compensation$gvkey)
length(gvkeys)
[1] 2579

The remainder of the code follows a fairly similar pattern as before. I extract three variables from the WRDS databases fyear and gvkey so that we can match it with the compensation data and the mkvalt at the end of the fiscal year.

value <-  tbl(wrds, in_schema("comp", "funda")) %>%
  filter(gvkey %in% gvkeys, fyear > 2010) %>%
  select(fyear, gvkey, mkvalt) %>%
  collect()
glimpse(value)
Rows: 55,377
Columns: 3
$ fyear  <int> 2011, 2011, 2012, 2013, 2013, 2014, 2014, 2015, 2015, 2016, 201…
$ gvkey  <chr> "001004", "001004", "001004", "001004", "001004", "001004", "00…
$ mkvalt <dbl> 485.2897, 485.2897, 790.0029, 961.3080, 961.3080, 1046.3954, 10…
value <- distinct(value)
glimpse(value)
Rows: 26,684
Columns: 3
$ fyear  <int> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 202…
$ gvkey  <chr> "001004", "001004", "001004", "001004", "001004", "001004", "00…
$ mkvalt <dbl> 485.2897, 790.0029, 961.3080, 1046.3954, 842.5112, 1200.3288, 1…
saveRDS(value, here("data", "us-value-new.RDS"))

Explore the data

With the tidyverse

us_comp <- readRDS(here("data", "us-compensation-new.RDS"))
glimpse(us_comp)
Rows: 23,096
Columns: 22
$ year                    <dbl> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018…
$ gvkey                   <chr> "001004", "001004", "001004", "001004", "00100…
$ cusip                   <chr> "00036110", "00036110", "00036110", "00036110"…
$ exec_fullname           <chr> "David P. Storch", "David P. Storch", "David P…
$ coname                  <chr> "AAR CORP", "AAR CORP", "AAR CORP", "AAR CORP"…
$ ceoann                  <chr> "CEO", "CEO", "CEO", "CEO", "CEO", "CEO", "CEO…
$ execid                  <chr> "09249", "09249", "09249", "09249", "09249", "…
$ bonus                   <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.00…
$ salary                  <dbl> 867.000, 877.838, 906.449, 906.449, 755.250, 8…
$ stock_awards_fv         <dbl> 2664.745, 619.200, 1342.704, 1695.200, 1150.50…
$ stock_unvest_val        <dbl> 4227.273, 6018.000, 4244.165, 4103.283, 1334.0…
$ eip_unearn_num          <dbl> 32.681, 56.681, 66.929, 83.415, 91.969, 157.96…
$ eip_unearn_val          <dbl> 393.806, 1137.021, 1626.375, 2464.079, 2244.96…
$ option_awards           <dbl> 578.460, 695.520, 1622.016, 0.000, 1150.500, 1…
$ option_awards_blk_value <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ option_awards_num       <dbl> 49.022, 144.000, 158.400, 0.000, 153.810, 225.…
$ tdc1                    <dbl> 5786.400, 4182.832, 5247.779, 5234.648, 4674.4…
$ tdc2                    <dbl> 6105.117, 3487.312, 3809.626, 10428.375, 3523.…
$ shrown_tot_pct          <dbl> 2.964, 2.893, 3.444, 3.877, 4.597, 5.417, 3.71…
$ becameceo               <date> 1996-10-09, 1996-10-09, 1996-10-09, 1996-10-0…
$ joined_co               <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ reason                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
filter(us_comp, gvkey == "001004")
# A tibble: 11 × 22
    year gvkey  cusip    exec_fullname         coname ceoann execid bonus salary
   <dbl> <chr>  <chr>    <chr>                 <chr>  <chr>  <chr>  <dbl>  <dbl>
 1  2011 001004 00036110 David P. Storch       AAR C… CEO    09249      0   867 
 2  2012 001004 00036110 David P. Storch       AAR C… CEO    09249      0   878.
 3  2013 001004 00036110 David P. Storch       AAR C… CEO    09249      0   906.
 4  2014 001004 00036110 David P. Storch       AAR C… CEO    09249      0   906.
 5  2015 001004 00036110 David P. Storch       AAR C… CEO    09249      0   755.
 6  2016 001004 00036110 David P. Storch       AAR C… CEO    09249      0   835 
 7  2017 001004 00036110 David P. Storch       AAR C… CEO    09249      0   941 
 8  2018 001004 00036110 John McClain Holmes,… AAR C… CEO    48195      0   750 
 9  2019 001004 00036110 John McClain Holmes,… AAR C… CEO    48195      0   801.
10  2020 001004 00036110 John McClain Holmes,… AAR C… CEO    48195      0   781.
11  2021 001004 00036110 John McClain Holmes,… AAR C… CEO    48195      0   925 
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
#   eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
#   option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
#   tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
#   reason <chr>
filter(us_comp, bonus == 0, year == 2012)
# A tibble: 1,678 × 22
    year gvkey  cusip    exec_fullname         coname ceoann execid bonus salary
   <dbl> <chr>  <chr>    <chr>                 <chr>  <chr>  <chr>  <dbl>  <dbl>
 1  2012 001004 00036110 David P. Storch       AAR C… CEO    09249      0   878.
 2  2012 001045 02376R10 Thomas W. Horton      AMERI… CEO    26059      0   618.
 3  2012 001075 72348410 Donald E. Brandt, CPA PINNA… CEO    05835      0  1146 
 4  2012 001076 74319R10 Ronald W. Allen       PROG … CEO    00283      0   850 
 5  2012 001078 00282410 Miles D. White, M.B.… ABBOT… CEO    14300      0  1900 
 6  2012 001094 00444610 Albert L. Eilender    ACETO… CEO    46204      0   626.
 7  2012 001161 00790310 Rory P. Read          ADVAN… CEO    42390      0  1000.
 8  2012 001177 00817Y10 Mark Thomas Bertolini AETNA… CEO    31029      0   977.
 9  2012 001209 00915810 John E. McGlade       AIR P… CEO    27315      0  1200 
10  2012 001230 01165910 Bradley D. Tilden     ALASK… CEO    21308      0   420.
# ℹ 1,668 more rows
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
#   eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
#   option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
#   tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
#   reason <chr>

A Gamestop to introduce the pipe

  1. Look for any company where “gamestop” is in the name of the company.
  2. Look for the gvkey of Gamestop. Then use filter with exactly that key.
filter(us_comp, str_detect(tolower(coname),
                           "gamestop"))
# A tibble: 11 × 22
    year gvkey  cusip    exec_fullname     coname     ceoann execid bonus salary
   <dbl> <chr>  <chr>    <chr>             <chr>      <chr>  <chr>  <dbl>  <dbl>
 1  2011 145049 36467W10 J. Paul Raines    GAMESTOP … CEO    35719  2254.  1028.
 2  2012 145049 36467W10 J. Paul Raines    GAMESTOP … CEO    35719  1515   1050.
 3  2013 145049 36467W10 J. Paul Raines    GAMESTOP … CEO    35719   975   1059.
 4  2014 145049 36467W10 J. Paul Raines    GAMESTOP … CEO    35719     0   1201.
 5  2015 145049 36467W10 J. Paul Raines    GAMESTOP … CEO    35719     0   1247.
 6  2016 145049 36467W10 J. Paul Raines    GAMESTOP … CEO    35719     0   1285.
 7  2017 145049 36467W10 J. Paul Raines    GAMESTOP … CEO    35719     0   1293.
 8  2018 145049 36467W10 Shane S. Kim      GAMESTOP … CEO    56940    25    963.
 9  2019 145049 36467W10 George E. Sherman GAMESTOP … CEO    46080   150    846.
10  2020 145049 36467W10 George E. Sherman GAMESTOP … CEO    46080     0   1005.
11  2021 145049 36467W10 Matthew Furlong   GAMESTOP … CEO    61979  1595.   115.
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
#   eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
#   option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
#   tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
#   reason <chr>
filter(us_comp, gvkey == "145049")
# A tibble: 11 × 22
    year gvkey  cusip    exec_fullname     coname     ceoann execid bonus salary
   <dbl> <chr>  <chr>    <chr>             <chr>      <chr>  <chr>  <dbl>  <dbl>
 1  2011 145049 36467W10 J. Paul Raines    GAMESTOP … CEO    35719  2254.  1028.
 2  2012 145049 36467W10 J. Paul Raines    GAMESTOP … CEO    35719  1515   1050.
 3  2013 145049 36467W10 J. Paul Raines    GAMESTOP … CEO    35719   975   1059.
 4  2014 145049 36467W10 J. Paul Raines    GAMESTOP … CEO    35719     0   1201.
 5  2015 145049 36467W10 J. Paul Raines    GAMESTOP … CEO    35719     0   1247.
 6  2016 145049 36467W10 J. Paul Raines    GAMESTOP … CEO    35719     0   1285.
 7  2017 145049 36467W10 J. Paul Raines    GAMESTOP … CEO    35719     0   1293.
 8  2018 145049 36467W10 Shane S. Kim      GAMESTOP … CEO    56940    25    963.
 9  2019 145049 36467W10 George E. Sherman GAMESTOP … CEO    46080   150    846.
10  2020 145049 36467W10 George E. Sherman GAMESTOP … CEO    46080     0   1005.
11  2021 145049 36467W10 Matthew Furlong   GAMESTOP … CEO    61979  1595.   115.
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
#   eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
#   option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
#   tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
#   reason <chr>

Moving on

select(us_comp, year, coname, bonus, salary, total = tdc1)
# A tibble: 23,096 × 5
    year coname   bonus salary total
   <dbl> <chr>    <dbl>  <dbl> <dbl>
 1  2011 AAR CORP     0   867  5786.
 2  2012 AAR CORP     0   878. 4183.
 3  2013 AAR CORP     0   906. 5248.
 4  2014 AAR CORP     0   906. 5235.
 5  2015 AAR CORP     0   755. 4674.
 6  2016 AAR CORP     0   835  6073.
 7  2017 AAR CORP     0   941  6284.
 8  2018 AAR CORP     0   750  3344.
 9  2019 AAR CORP     0   801. 4736.
10  2020 AAR CORP     0   781. 4123.
# ℹ 23,086 more rows

We first build up the dataset with select and mutate and the pipe %>%. When we are satisfied with the result, we can save the dataset as an R object with the name us_comp_small.

us_comp_small <-
  select(us_comp, year, coname, bonus,
       salary, total = tdc1) %>%
    mutate(salary_percentage = salary/total)
print(us_comp_small)
# A tibble: 23,096 × 6
    year coname   bonus salary total salary_percentage
   <dbl> <chr>    <dbl>  <dbl> <dbl>             <dbl>
 1  2011 AAR CORP     0   867  5786.             0.150
 2  2012 AAR CORP     0   878. 4183.             0.210
 3  2013 AAR CORP     0   906. 5248.             0.173
 4  2014 AAR CORP     0   906. 5235.             0.173
 5  2015 AAR CORP     0   755. 4674.             0.162
 6  2016 AAR CORP     0   835  6073.             0.138
 7  2017 AAR CORP     0   941  6284.             0.150
 8  2018 AAR CORP     0   750  3344.             0.224
 9  2019 AAR CORP     0   801. 4736.             0.169
10  2020 AAR CORP     0   781. 4123.             0.189
# ℹ 23,086 more rows

Quick descriptive statistics

group_by(us_comp, gvkey) %>%
    summarise(N = n(), N_CEO = n_distinct(execid),
              average = mean(salary),
              sd = sd(salary),
              med = median(salary),
              min = min(salary),
              max = max(salary)) %>%
    ungroup() %>%
  filter(med > 1000)
# A tibble: 599 × 8
   gvkey      N N_CEO average    sd   med   min   max
   <chr>  <int> <int>   <dbl> <dbl> <dbl> <dbl> <dbl>
 1 001075    12     2   1219. 111.  1222. 1091  1395 
 2 001078    12     2   1788. 224.  1900  1298. 1973.
 3 001161    12     3    961. 151.  1000.  566. 1149.
 4 001209    12     2   1239. 131.  1200   905. 1402.
 5 001274    11     1   1082. 101.  1030  1000  1250 
 6 001300    12     2   1714. 152.  1750  1415. 1890 
 7 001380    12     1   1500    0   1500  1500  1500 
 8 001440    12     2   1339. 176.  1350.  903. 1522.
 9 001447    12     2   1794. 262.  2000  1488. 2038.
10 001449    12     1   1438.  12.1 1441. 1399. 1441.
# ℹ 589 more rows
summarise(us_comp, N = n(), N_CEO = n_distinct(execid),
              average = mean(salary),
              sd = sd(salary),
              med = median(salary),
              min = min(salary),
              max = max(salary),
          .by = gvkey) %>%
  filter(med > 1000)
# A tibble: 599 × 8
   gvkey      N N_CEO average    sd   med   min   max
   <chr>  <int> <int>   <dbl> <dbl> <dbl> <dbl> <dbl>
 1 001075    12     2   1219. 111.  1222. 1091  1395 
 2 001078    12     2   1788. 224.  1900  1298. 1973.
 3 001161    12     3    961. 151.  1000.  566. 1149.
 4 001209    12     2   1239. 131.  1200   905. 1402.
 5 001274    11     1   1082. 101.  1030  1000  1250 
 6 001300    12     2   1714. 152.  1750  1415. 1890 
 7 001380    12     1   1500    0   1500  1500  1500 
 8 001440    12     2   1339. 176.  1350.  903. 1522.
 9 001447    12     2   1794. 262.  2000  1488. 2038.
10 001449    12     1   1438.  12.1 1441. 1399. 1441.
# ℹ 589 more rows