Chapter 4 Research tools - Data Analysis

4.1 Data analysis tools

Our main tool for data analysis is the R statistical software (R Core Team 2021). R is free software specifically aimed at statistical analysis and graphical representation. It is a fully developed programming language that can be extended with packages that add new capabilities. We are going to rely heavily on some of those packages. You can download R from a CRAN server.

We are going to interact with the R software through RStudio, which you can download from the RStudio website. RStudio is an integrated development environment for R. It allows you to easily write, test, and run R code, and integrate the results with explanations. One advantage of R and RStudio is that you can install them on as many computers as you want. You do not need a special license. This means that you do not have to be affiliated with the university to use the software.

In RStudio, you write code and ask the R software to execute that code. You can save the different steps in your analysis in an R script, which is just a plain text file with the .R extension. The advantage of plain text files is that they are easy to share across different operating systems. The code in your script allows you and anyone else to run the same analysis at a later point in time, which is very important for checking for mistakes and for teaching.

We will make use of a special type of plain text file: RMarkdown (.Rmd) files. These notes are written entirely in RMarkdown files. RMarkdown is an R extension of the markdown format. The idea of markdown is to write plain text documents which can later be exported to other formats such as html, pdf, or word. The beauty of RMarkdown is that it lets you combine both R scripts and markdown into one document, so you keep the advantages of plain text scripts and of markdown. This allows you to integrate your analysis and your description of your analysis in one document. When you go back to your analysis after a couple of weeks, you will be happy that you have more than the raw code to look at.

Finally, I am going to introduce you to a specific dialect of the R language which is informally known as the tidyverse. If you want to be a good R programmer, this might not be the best entry point into R. However, I assume that you want to quickly pick up some tools to facilitate your research project, and for that purpose I believe the tidyverse will be excellent. I will introduce the most important bits and pieces throughout my lectures; however, I don’t have the time to go into detail. Now and then, you will have to experiment on your own to get your code working. Some excellent resources are the free online books R for Data Science and Data Visualization: A Practical Introduction. You can also buy reasonably priced physical copies if you are interested in developing your R skills further.

For larger projects, you should use a project in its own separate folder. See the Project chapter of Hadley Wickham’s book. Once you need multiple scripts, you want to make sure that you can reliably point to the results or functions in different scripts. Projects help you manage all the parts of a larger piece of work.

You start a new project by clicking File > New Project ... You can start a new file by clicking File > New File > R Script.

4.2 The R console

In RStudio, you can use the R console directly. I mainly use the R console to test the code I have written in my scripts, and I would advise you to do the same. If you want to understand how R works, you can experiment by typing the following lines in the console. It is tempting to copy and paste the code, but retyping the code is a surprisingly powerful way of learning to code. I would advise you to type as much of the code as possible when you are trying to figure out how R works.

x <- 1
x
## [1] 1
x <- x + 1
x
## [1] 2

The little code above already shows you two things.

  1. You can assign (and overwrite) a numerical value to an object x
  2. The right-hand side is assigned to the left-hand side. You can’t just switch them around.

We will see further on that we can assign almost anything to an object x. When x is a vector or a data set with multiple elements, this will allow us to apply the same function to each element of x. That is the basic advantage of programming. You can tell your computer how to do a thing and then the computer can do it over and over for you.
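As a small illustration of that idea, the sketch below assigns a vector to x and applies the same operation to every element at once. The values are made up for the example and the name x_plus_one is only illustrative.

```r
# A vector with three made-up values (not from the course data).
x <- c(1, 2, 3)

# Arithmetic is applied to every element of x in one go.
x_plus_one <- x + 1
x_plus_one

# Many functions, such as sqrt(), also work element by element.
sqrt(x)
```

You could type these lines in the console in the same way as the earlier example.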

4.3 R packages

install.packages("tidyverse")
library("tidyverse")

R packages are additions to R that give R extra functionality. One of the selling points of R is that it has a good way of integrating this extra functionality, and a lot of people are highly motivated to add it. We will rely heavily on one meta-package: the tidyverse, a collection of packages that helps with data transformations and working with tabular data in a tidy fashion. One of the packages in the tidyverse is ggplot2, which provides tools to make pretty plots. The code above shows you how to install packages and use them. Normally, you only have to install a package once. However, whenever you want to use a package in your script or code, you have to load it with the library() function. You only have to load a package once, at the start of your R session or at the start of your script. Installing packages is another reason to use the R console.

dplyr is another part of the tidyverse we will use extensively. Its five verbs (i.e. functions that do something with a dataset) and one pipe operator to glue them together help you to explore the data. filter() lets you keep a subset of the observations. select() selects a subset of the variables. mutate() changes variables and creates new ones. group_by() and summarise() group subsets of observations and summarise them. %>% is the pipe operator; it joins the verbs together to create compound statements where you, for instance, filter a subset of the observations, create a new variable, create groups, and summarise the groups based on the average of the new variable.
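To make the compound statement described above concrete, here is a minimal sketch on a tiny, invented dataset. The object toy, its columns, and the names salary_share and mean_share are hypothetical and only serve the illustration; they are not the course data. Loading dplyr on its own is enough here, although library(tidyverse) loads it as well.

```r
library(dplyr)

# Hypothetical toy data: two firms, amounts in thousands of dollars.
toy <- tibble(
  firm   = c("A", "A", "B", "B"),
  year   = c(2018, 2019, 2018, 2019),
  salary = c(500, 550, 900, 1000),
  total  = c(1000, 1100, 3000, 4000)
)

toy_summary <- toy %>%
  filter(year >= 2018) %>%                     # keep a subset of the observations
  select(firm, salary, total) %>%              # keep a subset of the variables
  mutate(salary_share = salary / total) %>%    # create a new variable
  group_by(firm) %>%                           # one group per firm
  summarise(mean_share = mean(salary_share)) %>%  # one row per group
  ungroup()

toy_summary
```

Each line does one small thing, which makes it easy to run the statement step by step when something goes wrong.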

4.4 Finally some useful code!

Let’s run some code. First, we have to tell R to use the packages in the tidyverse.

library(tidyverse)

Next, we have a look at the data I used to draw the plot of the compensation of CEOs of S&P500 companies. First, we have to read the file with the data and assign it to an object, us_comp in this case. The data is available on LMS or through the code in Appendix 2.5. The glimpse function gives an overview of the variables in the data, the type of each variable, and a couple of example values for each variable.

us_comp <- readRDS("data/us-compensation.RDS")
glimpse(us_comp)
## Rows: 17,622
## Columns: 22
## $ year                    <dbl> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018…
## $ gvkey                   <chr> "001004", "001004", "001004", "001004", "00100…
## $ cusip                   <chr> "00036110", "00036110", "00036110", "00036110"…
## $ exec_fullname           <chr> "David P. Storch", "David P. Storch", "David P…
## $ coname                  <chr> "AAR CORP", "AAR CORP", "AAR CORP", "AAR CORP"…
## $ ceoann                  <chr> "CEO", "CEO", "CEO", "CEO", "CEO", "CEO", "CEO…
## $ execid                  <chr> "09249", "09249", "09249", "09249", "09249", "…
## $ bonus                   <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.00…
## $ salary                  <dbl> 867.000, 877.838, 906.449, 906.449, 755.250, 8…
## $ stock_awards_fv         <dbl> 2664.745, 619.200, 1342.704, 1695.200, 1150.50…
## $ stock_unvest_val        <dbl> 4227.273, 6018.000, 4244.165, 4103.283, 0.000,…
## $ eip_unearn_num          <dbl> 32.681, 56.681, 66.929, 83.415, 0.000, 157.969…
## $ eip_unearn_val          <dbl> 393.806, 1137.021, 1626.375, 2464.079, 2244.96…
## $ option_awards           <dbl> 578.460, 695.520, 1622.016, 0.000, 1150.500, 1…
## $ option_awards_blk_value <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ option_awards_num       <dbl> 49.022, 144.000, 158.400, 0.000, 0.000, 225.00…
## $ tdc1                    <dbl> 5786.400, 4182.832, 5247.779, 5234.648, 3523.9…
## $ tdc2                    <dbl> 6105.117, 3487.312, 3809.626, 10428.375, 3523.…
## $ shrown_tot_pct          <dbl> 2.964, 2.893, 3.444, 3.877, 4.597, 5.417, 3.71…
## $ becameceo               <date> 1996-10-09, 1996-10-09, 1996-10-09, 1996-10-0…
## $ joined_co               <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ reason                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

We can also look at a subset of the observations, i.e. we filter the observations from the company that has the key 001045.

filter(us_comp, gvkey == "001045")
## # A tibble: 9 × 22
##    year gvkey  cusip    exec_fullname          coname ceoann execid bonus salary
##   <dbl> <chr>  <chr>    <chr>                  <chr>  <chr>  <chr>  <dbl>  <dbl>
## 1  2011 001045 02376R10 Gerard J. Arpey        AMERI… CEO    14591      0   614.
## 2  2012 001045 02376R10 Thomas W. Horton       AMERI… CEO    26059      0   618.
## 3  2013 001045 02376R10 Thomas W. Horton       AMERI… CEO    26059      0   592.
## 4  2014 001045 02376R10 William Douglas Parker AMERI… CEO    46191      0   688.
## 5  2015 001045 02376R10 William Douglas Parker AMERI… CEO    46191      0   232.
## 6  2016 001045 02376R10 William Douglas Parker AMERI… CEO    46191      0     0 
## 7  2017 001045 02376R10 William Douglas Parker AMERI… CEO    46191      0     0 
## 8  2018 001045 02376R10 William Douglas Parker AMERI… CEO    46191      0     0 
## 9  2019 001045 02376R10 William Douglas Parker AMERI… CEO    46191      0     0 
## # … with 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
## #   eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
## #   option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
## #   tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
## #   reason <chr>

You can see that the company, American Airlines, has had three CEOs since 2011. I used the gvkey to select a company and not its name. You will often use an id or a database key especially if you are working with multiple datasets and you want to link observations from one dataset to observations in the other dataset.

We can also filter data based on other variables. The compensation variables are recorded in thousands of dollars, so the code below keeps the observations from 2014 where the CEO had a salary over $1,000,000.

filter(us_comp, salary > 1000, year == 2014)
## # A tibble: 476 × 22
##     year gvkey  cusip    exec_fullname    coname      ceoann execid bonus salary
##    <dbl> <chr>  <chr>    <chr>            <chr>       <chr>  <chr>  <dbl>  <dbl>
##  1  2014 001075 72348410 Donald E. Brand… PINNACLE W… CEO    05835      0  1240 
##  2  2014 001078 00282410 Miles D. White,… ABBOTT LAB… CEO    14300      0  1973.
##  3  2014 001161 00790310 Rory P. Read     ADVANCED M… CEO    42390      0  1000.
##  4  2014 001300 43851610 David M. Cote    HONEYWELL … CEO    20931   5500  1866.
##  5  2014 001380 42809H10 John B. Hess     HESS CORP   CEO    02132      0  1500 
##  6  2014 001440 02553710 Nicholas K. Aki… AMERICAN E… CEO    40778      0  1241.
##  7  2014 001447 02581610 Kenneth I. Chen… AMERICAN E… CEO    02157   4500  2000 
##  8  2014 001449 00105510 Daniel Paul Amos AFLAC INC   CEO    00013      0  1441.
##  9  2014 001468 02637510 Jeffrey M. Weiss AMERICAN G… CEO    11710   3640  1012.
## 10  2014 001487 02687478 Robert Herman B… AMERICAN I… CEO    20972      0  1385.
## # … with 466 more rows, and 13 more variables: stock_awards_fv <dbl>,
## #   stock_unvest_val <dbl>, eip_unearn_num <dbl>, eip_unearn_val <dbl>,
## #   option_awards <dbl>, option_awards_blk_value <dbl>,
## #   option_awards_num <dbl>, tdc1 <dbl>, tdc2 <dbl>, shrown_tot_pct <dbl>,
## #   becameceo <date>, joined_co <date>, reason <chr>
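When you give filter() several conditions separated by commas, an observation has to satisfy all of them. The sketch below checks this on a small invented dataset; toy and the object names are hypothetical, and the amounts are again in thousands.

```r
library(dplyr)

# Hypothetical toy data, not the course data.
toy <- tibble(
  year   = c(2013, 2014, 2014, 2014),
  salary = c(1200, 800, 1500, 2000)
)

# Conditions separated by a comma are combined with "and" ...
with_comma <- filter(toy, salary > 1000, year == 2014)

# ... so this statement keeps exactly the same rows.
with_and <- filter(toy, salary > 1000 & year == 2014)

# Use | when either condition is enough.
either <- filter(toy, salary > 1000 | year == 2013)

nrow(with_comma)
nrow(either)
```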

You can also select some variables if you are only interested in those. If a variable has an uninformative name, you can rename it inside select. For instance, tdc1 is the total compensation of the CEO. A more descriptive name will help you to remember what the variable actually means.

select(us_comp, year, coname, bonus, salary, total = tdc1)
## # A tibble: 17,622 × 5
##     year coname                      bonus salary total
##    <dbl> <chr>                       <dbl>  <dbl> <dbl>
##  1  2011 AAR CORP                        0   867  5786.
##  2  2012 AAR CORP                        0   878. 4183.
##  3  2013 AAR CORP                        0   906. 5248.
##  4  2014 AAR CORP                        0   906. 5235.
##  5  2015 AAR CORP                        0   755. 3524.
##  6  2016 AAR CORP                        0   835  6073.
##  7  2017 AAR CORP                        0   941  6284.
##  8  2018 AAR CORP                        0   750  4234.
##  9  2019 AAR CORP                        0   801. 3436.
## 10  2011 AMERICAN AIRLINES GROUP INC     0   614. 5527.
## # … with 17,612 more rows

You can also create new variables with the mutate function. I created a variable that calculates what share of total compensation is the CEO’s salary. Despite the variable’s name, the result is expressed as a fraction rather than as a percentage.

select(us_comp, year, coname, bonus, salary, total = tdc1) %>%
  mutate(salary_percentage = salary / total)  
## # A tibble: 17,622 × 6
##     year coname                      bonus salary total salary_percentage
##    <dbl> <chr>                       <dbl>  <dbl> <dbl>             <dbl>
##  1  2011 AAR CORP                        0   867  5786.             0.150
##  2  2012 AAR CORP                        0   878. 4183.             0.210
##  3  2013 AAR CORP                        0   906. 5248.             0.173
##  4  2014 AAR CORP                        0   906. 5235.             0.173
##  5  2015 AAR CORP                        0   755. 3524.             0.214
##  6  2016 AAR CORP                        0   835  6073.             0.138
##  7  2017 AAR CORP                        0   941  6284.             0.150
##  8  2018 AAR CORP                        0   750  4234.             0.177
##  9  2019 AAR CORP                        0   801. 3436.             0.233
## 10  2011 AMERICAN AIRLINES GROUP INC     0   614. 5527.             0.111
## # … with 17,612 more rows

Notice how the select statement from before is chained together with the mutate statement through the pipe operator (%>%). This operator pipes the result of the first statement to the second statement. The second statement implicitly uses the result of the first statement as the data it is going to work with.
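Put differently, x %>% f(y) is another way of writing f(x, y). The sketch below, which uses a small invented dataset (toy and the other object names are hypothetical), writes the same analysis once as a nested call and once with the pipe.

```r
library(dplyr)

# Hypothetical toy data, amounts in thousands of dollars.
toy <- tibble(salary = c(500, 900), total = c(1000, 3000))

# Nested call: you have to read it from the inside out.
nested <- mutate(select(toy, salary, total), share = salary / total)

# Piped call: you can read it from top to bottom.
piped <- toy %>%
  select(salary, total) %>%
  mutate(share = salary / total)

# Both versions give the same result.
all.equal(nested, piped)
```

The piped version is usually easier to read once a statement has more than two steps.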

You can use piped statements to make your own table of descriptive statistics. I created a table with some salary statistics for each firm. I first group the observations based on the gvkey and then summarise each group with the same key by defining a number of new variables.

group_by(us_comp, gvkey) %>%
  summarise(N = n(), N_CEO = n_distinct(execid), 
            average = mean(salary), sd = sd(salary),
            med = median(salary), minimum = min(salary), 
            maximum = max(salary)) %>%
  ungroup()
## # A tibble: 2,347 × 8
##    gvkey      N N_CEO average    sd   med minimum maximum
##    <chr>  <int> <int>   <dbl> <dbl> <dbl>   <dbl>   <dbl>
##  1 001004     9     2    849.  68.3  867     750     941 
##  2 001045     9     3    305. 316.   232.      0     688.
##  3 001072     8     2    677. 219.   652.    450     967.
##  4 001075     9     1   1259.  98.8 1277    1091    1395 
##  5 001076     9     3    727. 103.   700     567.    850 
##  6 001078     9     1   1908.  24.4 1900    1900    1973.
##  7 001094     8     3    563.  66.2  581.    437.    626.
##  8 001161     9     3    913. 143.   961.    566.   1026.
##  9 001177     7     1   1049.  86.4 1000     977.   1200 
## 10 001209     9     2   1196. 123.  1200     905.   1350 
## # … with 2,337 more rows

4.5 Looking for help

Programming is hard work. You will make mistakes. You will get error messages. An important programming skill is to efficiently debug your code and find out what is going wrong.

  • The most important technique is trial and error. Change one thing in your code and see what the output is. Do you get an error message? What does the error message tell you? If you are careful and do not change everything at once, this should at least help you to find out which part of the code does not work. This is one reason why scripts are so important. Because a script contains your entire analysis, you can always go back and (let your computer) redo the analysis from the start.

  • If you are not sure how you should use a certain function in R, you can read its help files in the RStudio Help window. You can also search there for more information about what a function does.

  • R is a very popular language. If you have a problem it is likely that someone else had the same problem before you. If you Google the error message, there is a decent chance you will find a solution to your problem.

  • A specific website where a lot of these questions are asked is StackOverflow. You can often directly search for your error on the website and find multiple solutions. You can improve your chances of finding an answer related to your problem by adding the [r], [tidyverse], [dplyr], or [ggplot2] tags to your search.

  • You can also ask me directly for help on LMS. I set up a forum for questions and I promise to help you with any R questions you might have for the unit and for your thesis.
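The help files mentioned in the list above can also be opened from the R console. The lines below are a short sketch using mean() as an arbitrary example function.

```r
# Open the help page of a function (both lines do the same thing).
?mean
help("mean")

# Search the installed help pages when you do not know the function name.
help.search("standard deviation")

# Run the examples that are listed at the bottom of a help page.
example("mean")
```

The examples at the end of a help page are often the quickest way to see how a function is meant to be used.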

4.6 RMarkdown = Markdown + R

One of the big advantages of the R world is that you can easily combine explanations with your analysis in .Rmd files. The assignments can all be completed in RMarkdown. You can find a lot of good resources on RMarkdown on the RStudio website.
In short, markdown is a simple markup language: you add little bits of code to your text that indicate how the text should look. RMarkdown additionally lets you include R code.

4.6.1 Markdown

You can have titles in markdown.

# Title
## Subtitle
### Lower level titles.

You can emphasise some words.

*Italic*, **bold**

Add links to pages and include pictures.

[weblink](https://www.google.com)
![pictures](https://rstudio.com/wp-content/uploads/2015/10/r-packages.png)

You can write enumerations and lists.

1. Item 1
2. Item 2 
3. Item 3 

- one
- two 
- three 

You can also write tables. You will rarely have to use tables. Typically, you can directly create the tables from R without the need to type in the results of your analysis.

First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
Content Cell  | Content Cell

4.6.2 R-code in markdown: R chunks

The largest advantage of RMarkdown files is that we can include pieces of R code. Again, all these notes are written in RMarkdown. Code chunks go between three backticks and we tell RMarkdown that the language we are using is {r}. The example below creates a code chunk where we create a random vector x with ten elements, where the elements are drawn from a normal distribution with mean 4 and standard deviation 3.

```{r}
x <- rnorm(n = 10, mean = 4, sd = 3)
```
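Because the elements of x are drawn at random, the chunk gives different numbers every time the document is built. If you want the same numbers on every run, one option is to fix the seed of the random number generator first. The sketch below shows this; the seed value 123 is arbitrary.

```r
# Fix the seed so that the same "random" numbers are drawn on every run.
# The value 123 is arbitrary; any number works.
set.seed(123)
x <- rnorm(n = 10, mean = 4, sd = 3)

# With only ten draws, the sample mean and standard deviation will be
# close to, but not exactly, 4 and 3.
mean(x)
sd(x)
```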

References

R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Page built: 2022-02-01 using R version 4.1.2 (2021-11-01)