Introduction to Rstudio for Accounting and Finance

Author

Stijn Masschelein

This setup code loads a package, here, that helps to navigate the folder structure in which I will create files. The tidyverse package is the main package, we will use to manipulate datasets. There are other ways to program in R. I think that to start of the tidyverse way of looking at data is quite intuitive and it is very well supported. The intuition and how quickly you can do meaningful things will hopefully be clear by the end of this document.

library(here) 
here() starts at /Users/stijnmasschelein/Dropbox/github/site
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dbplyr)

Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql

Get the data

We establish a connection with the WRDS databases with my account. WRDS is the data provider the university subscribes to and it has a lot of good tutorials on how to connect to their databases from R. Important, you will first have to ask for an account from the person in charge of the databases in the department. I will give a more extensive example later in the unit to really go over some of the details. This is meant as an illustration of what is possible with these type of files and code.

The code below establishes a connection with the WRDS servers under my username. The WRDS website has instructions on how to do set up the connection and how you can safely store your WRDS password so you do not have to type it. The settings I used here are all given by the WRDS instructions.

library(RPostgres)
wrds <- dbConnect(Postgres(),
              host='wrds-pgdata.wharton.upenn.edu',
              port=9737,
              dbname='wrds',
              user='stimas',
              sslmode='require')

Next, I download the data that we need from the wrds connection that I established before. I use the tbl function from the dbplyr package. This basically allows me to use the tidyverse verbs to to interact with almost any database. I give the downloaded data a name wrds_data so that I can use it further down the line. The only thing to remember with the tbl function is that you need to run the collect function at the end of the query so that the data is downloaded. Obviously, you need to know where the data is on the servers and how the variables are named. Sometimes, you can ask colleagues but other times, you will have to read the documentation on the WRDS website.

wrds_data <- tbl(wrds, in_schema("execcomp", "anncomp")) %>%
  filter(year > 2010 & ceoann == "CEO") %>%
  select(year, gvkey, cusip, exec_fullname, coname, ceoann,
    execid, bonus, salary, stock_awards_fv, stock_unvest_val,
    eip_unearn_num, eip_unearn_val, option_awards,
    option_awards_blk_value, option_awards_num, tdc1, tdc2,
    shrown_tot_pct, becameceo, joined_co, reason) %>%
  collect()

Here, I clean up the data and have a quick look after which I save it in my data folder to share it with you.

library(tidyverse)
compensation <- as_tibble(wrds_data) %>%
  rename_all(tolower)
glimpse(compensation)
Rows: 28,607
Columns: 22
$ year                    <dbl> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018…
$ gvkey                   <chr> "001004", "001004", "001004", "001004", "00100…
$ cusip                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ exec_fullname           <chr> "David P. Storch", "David P. Storch", "David P…
$ coname                  <chr> "AAR CORP", "AAR CORP", "AAR CORP", "AAR CORP"…
$ ceoann                  <chr> "CEO", "CEO", "CEO", "CEO", "CEO", "CEO", "CEO…
$ execid                  <chr> "09249", "09249", "09249", "09249", "09249", "…
$ bonus                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ salary                  <dbl> 867.000, 877.838, 906.449, 906.449, 755.250, 8…
$ stock_awards_fv         <dbl> 2664.745, 619.200, 1342.704, 1695.200, 1150.50…
$ stock_unvest_val        <dbl> 4227.273, 6018.000, 4244.165, 4103.283, 1334.0…
$ eip_unearn_num          <dbl> 32.681, 56.681, 66.929, 83.415, 91.969, 157.96…
$ eip_unearn_val          <dbl> 393.806, 1137.021, 1626.375, 2464.079, 2244.96…
$ option_awards           <dbl> 578.460, 695.520, 1622.016, 0.000, 1150.500, 1…
$ option_awards_blk_value <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ option_awards_num       <dbl> 49.022, 144.000, 158.400, 0.000, 153.810, 225.…
$ tdc1                    <dbl> 5786.400, 4182.832, 5247.779, 5234.648, 4674.4…
$ tdc2                    <dbl> 6105.117, 3487.312, 3809.626, 10428.375, 3523.…
$ shrown_tot_pct          <dbl> 2.977, 3.034, 3.530, 3.885, 4.543, 5.325, 3.70…
$ becameceo               <date> 1996-10-09, 1996-10-09, 1996-10-09, 1996-10-0…
$ joined_co               <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ reason                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
saveRDS(compensation, here("data", "us-compensation-new.RDS"))

Now, we do the same for the market value of the company. Again, we can do this completely in R. I don’t want to download all market values. Only for the firms that are already in our compensation data. What I do is to extract all the identifiers of the firms in our original data.

gvkeys <- unique(compensation$gvkey)
length(gvkeys)
[1] 2682

The remainder of the code follows a fairly similar pattern as before. I extract three variables from the WRDS databases fyear and gvkey so that we can match it with the compensation data and the mkvalt at the end of the fiscal year.

value <-  tbl(wrds, in_schema("comp", "funda")) %>%
  filter(gvkey %in% gvkeys, fyear > 2010) %>%
  select(fyear, gvkey, mkvalt) %>%
  collect()
glimpse(value)
Rows: 61,225
Columns: 3
$ fyear  <int> 2011, 2011, 2012, 2013, 2013, 2014, 2014, 2015, 2015, 2016, 201…
$ gvkey  <chr> "001004", "001004", "001004", "001004", "001004", "001004", "00…
$ mkvalt <dbl> 485.2897, 485.2897, 790.0029, 961.3080, 961.3080, 1046.3954, 10…
value <- distinct(value)
glimpse(value)
Rows: 29,633
Columns: 3
$ fyear  <int> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 202…
$ gvkey  <chr> "001004", "001004", "001004", "001004", "001004", "001004", "00…
$ mkvalt <dbl> 485.2897, 790.0029, 961.3080, 1046.3954, 842.5112, 1200.3288, 1…
saveRDS(value, here("data", "us-value-new.RDS"))

Explore the data

With the tidyverse

us_comp <- readRDS(here("data", "us-compensation-new.RDS"))
glimpse(us_comp)
Rows: 28,607
Columns: 22
$ year                    <dbl> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018…
$ gvkey                   <chr> "001004", "001004", "001004", "001004", "00100…
$ cusip                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ exec_fullname           <chr> "David P. Storch", "David P. Storch", "David P…
$ coname                  <chr> "AAR CORP", "AAR CORP", "AAR CORP", "AAR CORP"…
$ ceoann                  <chr> "CEO", "CEO", "CEO", "CEO", "CEO", "CEO", "CEO…
$ execid                  <chr> "09249", "09249", "09249", "09249", "09249", "…
$ bonus                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ salary                  <dbl> 867.000, 877.838, 906.449, 906.449, 755.250, 8…
$ stock_awards_fv         <dbl> 2664.745, 619.200, 1342.704, 1695.200, 1150.50…
$ stock_unvest_val        <dbl> 4227.273, 6018.000, 4244.165, 4103.283, 1334.0…
$ eip_unearn_num          <dbl> 32.681, 56.681, 66.929, 83.415, 91.969, 157.96…
$ eip_unearn_val          <dbl> 393.806, 1137.021, 1626.375, 2464.079, 2244.96…
$ option_awards           <dbl> 578.460, 695.520, 1622.016, 0.000, 1150.500, 1…
$ option_awards_blk_value <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ option_awards_num       <dbl> 49.022, 144.000, 158.400, 0.000, 153.810, 225.…
$ tdc1                    <dbl> 5786.400, 4182.832, 5247.779, 5234.648, 4674.4…
$ tdc2                    <dbl> 6105.117, 3487.312, 3809.626, 10428.375, 3523.…
$ shrown_tot_pct          <dbl> 2.977, 3.034, 3.530, 3.885, 4.543, 5.325, 3.70…
$ becameceo               <date> 1996-10-09, 1996-10-09, 1996-10-09, 1996-10-0…
$ joined_co               <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ reason                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
filter(us_comp, gvkey == "001004")
# A tibble: 14 × 22
    year gvkey  cusip exec_fullname            coname ceoann execid bonus salary
   <dbl> <chr>  <chr> <chr>                    <chr>  <chr>  <chr>  <dbl>  <dbl>
 1  2011 001004 <NA>  David P. Storch          AAR C… CEO    09249      0   867 
 2  2012 001004 <NA>  David P. Storch          AAR C… CEO    09249      0   878.
 3  2013 001004 <NA>  David P. Storch          AAR C… CEO    09249      0   906.
 4  2014 001004 <NA>  David P. Storch          AAR C… CEO    09249      0   906.
 5  2015 001004 <NA>  David P. Storch          AAR C… CEO    09249      0   755.
 6  2016 001004 <NA>  David P. Storch          AAR C… CEO    09249      0   835 
 7  2017 001004 <NA>  David P. Storch          AAR C… CEO    09249      0   941 
 8  2018 001004 <NA>  John McClain Holmes, III AAR C… CEO    48195      0   750 
 9  2019 001004 <NA>  John McClain Holmes, III AAR C… CEO    48195      0   801.
10  2020 001004 <NA>  John McClain Holmes, III AAR C… CEO    48195      0   781.
11  2021 001004 <NA>  John McClain Holmes, III AAR C… CEO    48195      0   925 
12  2022 001004 <NA>  John McClain Holmes, III AAR C… CEO    48195      0  1000 
13  2023 001004 <NA>  John McClain Holmes, III AAR C… CEO    48195      0  1050 
14  2024 001004 <NA>  John McClain Holmes, III AAR C… CEO    48195      0  1082 
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
#   eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
#   option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
#   tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
#   reason <chr>
filter(us_comp, bonus == 0, year == 2012)
# A tibble: 1,744 × 22
    year gvkey  cusip exec_fullname          coname   ceoann execid bonus salary
   <dbl> <chr>  <chr> <chr>                  <chr>    <chr>  <chr>  <dbl>  <dbl>
 1  2012 001004 <NA>  David P. Storch        AAR CORP CEO    09249      0   878.
 2  2012 001045 <NA>  Thomas W. Horton       AMERICA… CEO    26059      0   618.
 3  2012 001075 <NA>  Donald E. Brandt, CPA  PINNACL… CEO    05835      0  1146 
 4  2012 001076 <NA>  Ronald W. Allen        PROG HO… CEO    00283      0   850 
 5  2012 001078 <NA>  Miles D. White, M.B.A. ABBOTT … CEO    14300      0  1900 
 6  2012 001094 <NA>  Albert L. Eilender     ACETO C… CEO    46204      0   626.
 7  2012 001161 <NA>  Rory P. Read           ADVANCE… CEO    42390      0  1000.
 8  2012 001177 <NA>  Mark Thomas Bertolini  AETNA I… CEO    31029      0   977.
 9  2012 001209 <NA>  John E. McGlade        AIR PRO… CEO    27315      0  1200 
10  2012 001230 <NA>  Bradley D. Tilden      ALASKA … CEO    21308      0   420.
# ℹ 1,734 more rows
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
#   eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
#   option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
#   tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
#   reason <chr>

A Gamestop to introduce the pipe

  1. Look for any company where “gamestop” is in the name of the company.
  2. Look for the gvkey of Gamestop. Then use filter with exactly that key.
filter(us_comp, str_detect(tolower(coname),
                           "gamestop"))
# A tibble: 14 × 22
    year gvkey  cusip exec_fullname     coname        ceoann execid bonus salary
   <dbl> <chr>  <chr> <chr>             <chr>         <chr>  <chr>  <dbl>  <dbl>
 1  2011 145049 <NA>  J. Paul Raines    GAMESTOP CORP CEO    35719  2254. 1028. 
 2  2012 145049 <NA>  J. Paul Raines    GAMESTOP CORP CEO    35719  1515  1050. 
 3  2013 145049 <NA>  J. Paul Raines    GAMESTOP CORP CEO    35719   975  1059. 
 4  2014 145049 <NA>  J. Paul Raines    GAMESTOP CORP CEO    35719     0  1201. 
 5  2015 145049 <NA>  J. Paul Raines    GAMESTOP CORP CEO    35719     0  1247. 
 6  2016 145049 <NA>  J. Paul Raines    GAMESTOP CORP CEO    35719     0  1285. 
 7  2017 145049 <NA>  J. Paul Raines    GAMESTOP CORP CEO    35719     0  1293. 
 8  2018 145049 <NA>  Shane S. Kim      GAMESTOP CORP CEO    56940    25   963. 
 9  2019 145049 <NA>  George E. Sherman GAMESTOP CORP CEO    46080   150   846. 
10  2020 145049 <NA>  George E. Sherman GAMESTOP CORP CEO    46080     0  1005. 
11  2021 145049 <NA>  Matthew Furlong   GAMESTOP CORP CEO    61979  1595.  115. 
12  2022 145049 <NA>  Matthew Furlong   GAMESTOP CORP CEO    61979  2280.  200  
13  2023 145049 <NA>  Matthew Furlong   GAMESTOP CORP CEO    61979   825    78.5
14  2024 145049 <NA>  Ryan Cohen        GAMESTOP CORP CEO    65635     0     0  
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
#   eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
#   option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
#   tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
#   reason <chr>
filter(us_comp, gvkey == "145049")
# A tibble: 14 × 22
    year gvkey  cusip exec_fullname     coname        ceoann execid bonus salary
   <dbl> <chr>  <chr> <chr>             <chr>         <chr>  <chr>  <dbl>  <dbl>
 1  2011 145049 <NA>  J. Paul Raines    GAMESTOP CORP CEO    35719  2254. 1028. 
 2  2012 145049 <NA>  J. Paul Raines    GAMESTOP CORP CEO    35719  1515  1050. 
 3  2013 145049 <NA>  J. Paul Raines    GAMESTOP CORP CEO    35719   975  1059. 
 4  2014 145049 <NA>  J. Paul Raines    GAMESTOP CORP CEO    35719     0  1201. 
 5  2015 145049 <NA>  J. Paul Raines    GAMESTOP CORP CEO    35719     0  1247. 
 6  2016 145049 <NA>  J. Paul Raines    GAMESTOP CORP CEO    35719     0  1285. 
 7  2017 145049 <NA>  J. Paul Raines    GAMESTOP CORP CEO    35719     0  1293. 
 8  2018 145049 <NA>  Shane S. Kim      GAMESTOP CORP CEO    56940    25   963. 
 9  2019 145049 <NA>  George E. Sherman GAMESTOP CORP CEO    46080   150   846. 
10  2020 145049 <NA>  George E. Sherman GAMESTOP CORP CEO    46080     0  1005. 
11  2021 145049 <NA>  Matthew Furlong   GAMESTOP CORP CEO    61979  1595.  115. 
12  2022 145049 <NA>  Matthew Furlong   GAMESTOP CORP CEO    61979  2280.  200  
13  2023 145049 <NA>  Matthew Furlong   GAMESTOP CORP CEO    61979   825    78.5
14  2024 145049 <NA>  Ryan Cohen        GAMESTOP CORP CEO    65635     0     0  
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
#   eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
#   option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
#   tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
#   reason <chr>

Moving on

select(us_comp, year, coname, bonus, salary, total = tdc1)
# A tibble: 28,607 × 5
    year coname   bonus salary total
   <dbl> <chr>    <dbl>  <dbl> <dbl>
 1  2011 AAR CORP     0   867  5786.
 2  2012 AAR CORP     0   878. 4183.
 3  2013 AAR CORP     0   906. 5248.
 4  2014 AAR CORP     0   906. 5235.
 5  2015 AAR CORP     0   755. 4674.
 6  2016 AAR CORP     0   835  6073.
 7  2017 AAR CORP     0   941  6284.
 8  2018 AAR CORP     0   750  3344.
 9  2019 AAR CORP     0   801. 4736.
10  2020 AAR CORP     0   781. 4123.
# ℹ 28,597 more rows

We first build up the dataset with select and mutate and the pipe %>%. When we are satisfied with the result, we can save the dataset as an R object with the name us_comp_small.

us_comp_small <-
  select(us_comp, year, coname, bonus,
       salary, total = tdc1) %>%
    mutate(salary_percentage = salary/total)
print(us_comp_small)
# A tibble: 28,607 × 6
    year coname   bonus salary total salary_percentage
   <dbl> <chr>    <dbl>  <dbl> <dbl>             <dbl>
 1  2011 AAR CORP     0   867  5786.             0.150
 2  2012 AAR CORP     0   878. 4183.             0.210
 3  2013 AAR CORP     0   906. 5248.             0.173
 4  2014 AAR CORP     0   906. 5235.             0.173
 5  2015 AAR CORP     0   755. 4674.             0.162
 6  2016 AAR CORP     0   835  6073.             0.138
 7  2017 AAR CORP     0   941  6284.             0.150
 8  2018 AAR CORP     0   750  3344.             0.224
 9  2019 AAR CORP     0   801. 4736.             0.169
10  2020 AAR CORP     0   781. 4123.             0.189
# ℹ 28,597 more rows

Quick descriptive statistics

group_by(us_comp, gvkey) %>%
    summarise(N = n(), N_CEO = n_distinct(execid),
              average = mean(salary),
              sd = sd(salary),
              med = median(salary),
              min = min(salary),
              max = max(salary)) %>%
    ungroup() %>%
  filter(med > 1000)
# A tibble: 669 × 8
   gvkey      N N_CEO average    sd   med   min   max
   <chr>  <int> <int>   <dbl> <dbl> <dbl> <dbl> <dbl>
 1 001075    14     2   1208. 106.  1177. 1091  1395 
 2 001078    14     2   1747. 231.  1900  1298. 1973.
 3 001161    14     3    998. 167.  1000.  566. 1230 
 4 001209    14     2   1255. 127.  1255.  905. 1402.
 5 001274    11     1   1082. 101.  1030  1000  1250 
 6 001300    14     3   1668. 194.  1688. 1225  1890 
 7 001380    14     1   1505.  20.0 1500  1500  1575 
 8 001440    14     3   1285. 216.  1315.  862. 1522.
 9 001447    14     2   1752. 263.  1750  1488. 2038.
10 001449    14     1   1438.  11.2 1441. 1399. 1441.
# ℹ 659 more rows
summarise(us_comp, N = n(), N_CEO = n_distinct(execid),
              average = mean(salary),
              sd = sd(salary),
              med = median(salary),
              min = min(salary),
              max = max(salary),
          .by = gvkey) %>%
  filter(med > 1000)
# A tibble: 669 × 8
   gvkey      N N_CEO average    sd   med   min   max
   <chr>  <int> <int>   <dbl> <dbl> <dbl> <dbl> <dbl>
 1 001075    14     2   1208. 106.  1177. 1091  1395 
 2 001078    14     2   1747. 231.  1900  1298. 1973.
 3 001161    14     3    998. 167.  1000.  566. 1230 
 4 001209    14     2   1255. 127.  1255.  905. 1402.
 5 001274    11     1   1082. 101.  1030  1000  1250 
 6 001300    14     3   1668. 194.  1688. 1225  1890 
 7 001380    14     1   1505.  20.0 1500  1500  1575 
 8 001440    14     3   1285. 216.  1315.  862. 1522.
 9 001447    14     2   1752. 263.  1750  1488. 2038.
10 001449    14     1   1438.  11.2 1441. 1399. 1441.
# ℹ 659 more rows