Our main tool for data analysis is the R statistical software (R Core Team 2021). R is free software specifically aimed at statistical analysis and graphical representation. It is a fully developed programming language that you can extend with packages that add new capabilities. We are going to rely heavily on some of those packages. You can download R from a CRAN server.
We are going to interact with the R software through Rstudio, which you can download from the Rstudio website. Rstudio is an integrated development environment for R. It allows you to easily write, test, and run R code, and integrate the results with explanations. One advantage of R and Rstudio is that you can install them on as many computers as you want. You do not need a special license. This means that you do not have to be affiliated with the university to use the software.
In Rstudio, you write code and ask the R software to execute that
code. You can save the different steps in your analysis in an R
script, which is just a plain text file with the .R extension.
The advantage of plain text files is that they are easy to share
across different operating systems. The code in your script
allows you and anyone else to run the same analysis at a later
point in time, which is very important to check for mistakes
and for teaching.
We will make use of a special type of plain text file: Rmarkdown
(.Rmd) files. These notes are fully written in Rmarkdown files.
Rmarkdown is an R extension of the markdown format. The idea of
markdown is to write plain text documents which can later be
exported to other formats such as html, pdf, or word. The beauty
of Rmarkdown is that it combines the advantages of plain text
scripts and markdown: it lets you put both R code and markdown
into one document. This allows you to integrate your analysis and
your description of your analysis in one document. When you go
back to your analysis after a couple of weeks, you will be happy
that you have more than the raw code to look at.
Finally, I am going to introduce you to a specific dialect of the R language which is informally known as the tidyverse. If you want to be a good R programmer, this might not be the best entry point into R. However, I assume that you want to quickly pick up some tools to facilitate your research project, and for that purpose I believe the tidyverse will be excellent. I will introduce the most important bits and pieces throughout my lectures, but I don’t have the time to go into detail. Now and then, you will have to experiment on your own to get your code working. Some excellent resources are the free online books R for Data Science and Data Visualization: A practical introduction. You can also buy reasonably priced physical copies if you are interested in developing your R skills further.
For larger projects, you should use a project in its own separate folder (see the Project Chapter of Hadley Wickham’s book). Once you need multiple scripts, you want to make sure that you can reliably point to the results or functions in different scripts. Projects help you manage all the parts of a larger piece of work.
You can start a new project by clicking File > New Project ... You can start a new file by clicking File > New File > R Script.
In Rstudio, you can use the R console directly. I mainly use the R console to test the code I have written in my scripts and I would advise you to do the same thing. If you want to understand how R works, you can experiment by typing the following lines in the console. It is tempting to copy and paste the code, but retyping the code is a surprisingly powerful way of learning to code. I would advise you to type as much of the code as possible when you are trying to figure out how R works.
x <- 1
x
## [1] 1
x <- x + 1
x
## [1] 2
The little code above already shows you two things. You can
assign a value to an object x with <-, and you can use the
current value of x in a calculation and assign the result to x
again. We will see further that we can assign almost anything to
an object x. When x is a vector or a data set with multiple
elements, this will allow us to perform the same function on each
element of x. That is the basic advantage of programming. You
can tell your computer how to do a thing and then the computer
can do it over and over for you.
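To make the idea of one instruction working on every element more concrete, here is a minimal sketch with made-up numbers. The object names salaries, salaries_dollars, and average_salary are only illustrative and do not refer to the data used later on.

```r
# A vector with three made-up salaries (in thousands of dollars)
salaries <- c(867, 906, 755)

# The multiplication is applied to every element of the vector at once
salaries_dollars <- salaries * 1000
salaries_dollars

# Functions such as mean() take the whole vector as their input
average_salary <- mean(salaries)
average_salary
```

You write the instruction once and R repeats it for each element, whether the vector has three elements or three million.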
install.packages("tidyverse")
library("tidyverse")
R packages are additions to R that give R extra functionality.
One of the selling points of R is that it has a good way of
integrating this extra functionality and a lot of people are
highly motivated to add this extra functionality. We will heavily
rely on one meta-package. tidyverse is a package that helps
with data transformations and working with tabular data in a tidy
fashion. One of the packages in the tidyverse is ggplot2,
which provides tools to make pretty plots. The code above shows
you how to install packages and use them. Normally, you will only
have to install a package once. However, when you want to use a
package in your script or code, you will have to load the package
with the library() function. You only have to load a package
once at the start of your R session or at the start of your
script. Installing packages is another reason to use the R console.
dplyr is another part of the tidyverse we will use
extensively. With five verbs (i.e. functions to do something with
a dataset) and one pipe operator to glue them together, dplyr
helps you to explore the data. filter() lets you filter out a
subset of the observations. select() selects a subset of the
variables. mutate() changes variables and creates new ones.
group_by() and summarise() group subsets of observations and
summarise them. %>% is the pipe operator and joins together the
verbs to create compound statements where you, for instance,
filter a subset of the observations, create a new variable,
create groups, and summarise the groups based on the average for
the new variable.
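As a rough sketch of such a compound statement, the code below chains the verbs on a small made-up data frame. The object toy and its columns are hypothetical and only serve to show the mechanics; they are not the data used in these notes.

```r
library(dplyr)

# Made-up data: two firms, salary and total pay in thousands of dollars
toy <- data.frame(
  firm   = c("A", "A", "A", "B", "B", "B"),
  year   = c(2013, 2014, 2015, 2013, 2014, 2015),
  salary = c(400, 500, 600, 900, 1000, 1200),
  total  = c(2000, 1000, 1500, 3000, 4000, 4000)
)

toy_summary <- toy %>%
  filter(year >= 2014) %>%                        # keep a subset of the observations
  mutate(salary_share = salary / total) %>%       # create a new variable
  group_by(firm) %>%                              # create groups
  summarise(avg_share = mean(salary_share)) %>%   # summarise each group
  ungroup()
toy_summary
```

Each line does one small thing, which makes it easy to read the statement from top to bottom as a recipe.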
Let’s run some code. First, we have to tell R to use the packages
in the tidyverse.
library(tidyverse)
Next, we have a look at the data I used to draw the plot of the
compensation of CEOs of S&P500 companies. First, we have to assign
the file with the data to an object, us_comp in this case. (The
data is available on LMS or through the code in Appendix 2.5.)
The glimpse function gives an overview of the variables in the
data, the type of the variables, and a couple of examples of the
values for each variable.
us_comp <- readRDS("data/us-compensation.RDS")
glimpse(us_comp)
## Rows: 17,622
## Columns: 22
## $ year <dbl> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018…
## $ gvkey <chr> "001004", "001004", "001004", "001004", "00100…
## $ cusip <chr> "00036110", "00036110", "00036110", "00036110"…
## $ exec_fullname <chr> "David P. Storch", "David P. Storch", "David P…
## $ coname <chr> "AAR CORP", "AAR CORP", "AAR CORP", "AAR CORP"…
## $ ceoann <chr> "CEO", "CEO", "CEO", "CEO", "CEO", "CEO", "CEO…
## $ execid <chr> "09249", "09249", "09249", "09249", "09249", "…
## $ bonus <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.00…
## $ salary <dbl> 867.000, 877.838, 906.449, 906.449, 755.250, 8…
## $ stock_awards_fv <dbl> 2664.745, 619.200, 1342.704, 1695.200, 1150.50…
## $ stock_unvest_val <dbl> 4227.273, 6018.000, 4244.165, 4103.283, 0.000,…
## $ eip_unearn_num <dbl> 32.681, 56.681, 66.929, 83.415, 0.000, 157.969…
## $ eip_unearn_val <dbl> 393.806, 1137.021, 1626.375, 2464.079, 2244.96…
## $ option_awards <dbl> 578.460, 695.520, 1622.016, 0.000, 1150.500, 1…
## $ option_awards_blk_value <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ option_awards_num <dbl> 49.022, 144.000, 158.400, 0.000, 0.000, 225.00…
## $ tdc1 <dbl> 5786.400, 4182.832, 5247.779, 5234.648, 3523.9…
## $ tdc2 <dbl> 6105.117, 3487.312, 3809.626, 10428.375, 3523.…
## $ shrown_tot_pct <dbl> 2.964, 2.893, 3.444, 3.877, 4.597, 5.417, 3.71…
## $ becameceo <date> 1996-10-09, 1996-10-09, 1996-10-09, 1996-10-0…
## $ joined_co <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ reason <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
We can also look at a subset of the observations, i.e. we filter
the observations from the company that has the key 001045.
filter(us_comp, gvkey == "001045")
## # A tibble: 9 × 22
## year gvkey cusip exec_fullname coname ceoann execid bonus salary
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2011 001045 02376R10 Gerard J. Arpey AMERI… CEO 14591 0 614.
## 2 2012 001045 02376R10 Thomas W. Horton AMERI… CEO 26059 0 618.
## 3 2013 001045 02376R10 Thomas W. Horton AMERI… CEO 26059 0 592.
## 4 2014 001045 02376R10 William Douglas Parker AMERI… CEO 46191 0 688.
## 5 2015 001045 02376R10 William Douglas Parker AMERI… CEO 46191 0 232.
## 6 2016 001045 02376R10 William Douglas Parker AMERI… CEO 46191 0 0
## 7 2017 001045 02376R10 William Douglas Parker AMERI… CEO 46191 0 0
## 8 2018 001045 02376R10 William Douglas Parker AMERI… CEO 46191 0 0
## 9 2019 001045 02376R10 William Douglas Parker AMERI… CEO 46191 0 0
## # … with 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
## # eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
## # option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
## # tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
## # reason <chr>
You can see that the company, American Airlines, has had three
CEOs since 2011. I used the gvkey to select a company and not
its name. You will often use an id or a database key, especially
if you are working with multiple datasets and you want to link
observations from one dataset to observations in the other
dataset.
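To give an idea of what linking two datasets through a key can look like, here is a minimal sketch with two made-up data frames. It uses left_join(), a dplyr function that is not otherwise covered in this section. The objects pay and firms and the column key are hypothetical.

```r
library(dplyr)

# Made-up compensation data, identified by a firm key
pay <- data.frame(
  key    = c("001", "001", "002"),
  year   = c(2014, 2015, 2014),
  salary = c(500, 600, 1000)
)

# Made-up firm-level data that uses the same key
firms <- data.frame(
  key      = c("001", "002"),
  industry = c("Airlines", "Retail")
)

# left_join() keeps every row of pay and adds the matching columns of firms
pay_with_industry <- left_join(pay, firms, by = "key")
pay_with_industry
```

A key that is stored as text, like the gvkey, is safer for this purpose than a company name, which can be spelled differently in different datasets.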
We can also filter data based on other variables. For instance, the code below filters the observations from 2014 where the CEO had a salary over $1,000,000. Salary is recorded in thousands of dollars, hence the condition salary > 1000.
filter(us_comp, salary > 1000, year == 2014)
## # A tibble: 476 × 22
## year gvkey cusip exec_fullname coname ceoann execid bonus salary
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2014 001075 72348410 Donald E. Brand… PINNACLE W… CEO 05835 0 1240
## 2 2014 001078 00282410 Miles D. White,… ABBOTT LAB… CEO 14300 0 1973.
## 3 2014 001161 00790310 Rory P. Read ADVANCED M… CEO 42390 0 1000.
## 4 2014 001300 43851610 David M. Cote HONEYWELL … CEO 20931 5500 1866.
## 5 2014 001380 42809H10 John B. Hess HESS CORP CEO 02132 0 1500
## 6 2014 001440 02553710 Nicholas K. Aki… AMERICAN E… CEO 40778 0 1241.
## 7 2014 001447 02581610 Kenneth I. Chen… AMERICAN E… CEO 02157 4500 2000
## 8 2014 001449 00105510 Daniel Paul Amos AFLAC INC CEO 00013 0 1441.
## 9 2014 001468 02637510 Jeffrey M. Weiss AMERICAN G… CEO 11710 3640 1012.
## 10 2014 001487 02687478 Robert Herman B… AMERICAN I… CEO 20972 0 1385.
## # … with 466 more rows, and 13 more variables: stock_awards_fv <dbl>,
## # stock_unvest_val <dbl>, eip_unearn_num <dbl>, eip_unearn_val <dbl>,
## # option_awards <dbl>, option_awards_blk_value <dbl>,
## # option_awards_num <dbl>, tdc1 <dbl>, tdc2 <dbl>, shrown_tot_pct <dbl>,
## # becameceo <date>, joined_co <date>, reason <chr>
You can also select some variables if you are only interested in
those. If a variable has an undescriptive name, you can use
select to rename the variable. For instance, tdc1 is the total
compensation of the CEO. A more descriptive name will help you
to remember what the variable actually means.
select(us_comp, year, coname, bonus, salary, total = tdc1)
## # A tibble: 17,622 × 5
## year coname bonus salary total
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 2011 AAR CORP 0 867 5786.
## 2 2012 AAR CORP 0 878. 4183.
## 3 2013 AAR CORP 0 906. 5248.
## 4 2014 AAR CORP 0 906. 5235.
## 5 2015 AAR CORP 0 755. 3524.
## 6 2016 AAR CORP 0 835 6073.
## 7 2017 AAR CORP 0 941 6284.
## 8 2018 AAR CORP 0 750 4234.
## 9 2019 AAR CORP 0 801. 3436.
## 10 2011 AMERICAN AIRLINES GROUP INC 0 614. 5527.
## # … with 17,612 more rows
You can also create new variables with the mutate function. I
created a variable that calculates what percentage of total
compensation is the CEO’s salary.
select(us_comp, year, coname, bonus, salary, total = tdc1) %>%
  mutate(salary_percentage = salary / total)
## # A tibble: 17,622 × 6
## year coname bonus salary total salary_percentage
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2011 AAR CORP 0 867 5786. 0.150
## 2 2012 AAR CORP 0 878. 4183. 0.210
## 3 2013 AAR CORP 0 906. 5248. 0.173
## 4 2014 AAR CORP 0 906. 5235. 0.173
## 5 2015 AAR CORP 0 755. 3524. 0.214
## 6 2016 AAR CORP 0 835 6073. 0.138
## 7 2017 AAR CORP 0 941 6284. 0.150
## 8 2018 AAR CORP 0 750 4234. 0.177
## 9 2019 AAR CORP 0 801. 3436. 0.233
## 10 2011 AMERICAN AIRLINES GROUP INC 0 614. 5527. 0.111
## # … with 17,612 more rows
Note how the select statement from before is chained together
with the mutate statement through the pipe operator (%>%).
This operator pipes the result from the first statement to the
second statement. The second statement implicitly uses the result
from the first statement as the data it is going to work with.
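A small sketch may help to see what "implicitly" means here. The two statements below do the same thing on a made-up data frame (toy is a hypothetical object): the pipe simply passes its left-hand side as the first argument of the function on its right-hand side.

```r
library(dplyr)

# Made-up data, in thousands of dollars
toy <- data.frame(salary = c(500, 1000), total = c(1000, 4000))

# Without the pipe: the data is the first argument of mutate()
nested <- mutate(toy, salary_percentage = salary / total)

# With the pipe: the left-hand side is passed in as that first argument
piped <- toy %>% mutate(salary_percentage = salary / total)

# Both versions give the same result
identical(nested, piped)
```

With only one step, the pipe does not add much. Its value shows once you chain several steps and would otherwise have to nest function calls inside each other.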
You can use piped statements to make your own descriptive
statistics table. I created a table with some statistics for each
firm. I first group the observations based on the gvkey and
then summarise each group with the same key by defining a number
of new variables.
group_by(us_comp, gvkey) %>%
  summarise(N = n(), N_CEO = n_distinct(execid),
            average = mean(salary), sd = sd(salary),
            med = median(salary), minimum = min(salary),
            maximum = max(salary)) %>%
  ungroup()
## # A tibble: 2,347 × 8
## gvkey N N_CEO average sd med minimum maximum
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 001004 9 2 849. 68.3 867 750 941
## 2 001045 9 3 305. 316. 232. 0 688.
## 3 001072 8 2 677. 219. 652. 450 967.
## 4 001075 9 1 1259. 98.8 1277 1091 1395
## 5 001076 9 3 727. 103. 700 567. 850
## 6 001078 9 1 1908. 24.4 1900 1900 1973.
## 7 001094 8 3 563. 66.2 581. 437. 626.
## 8 001161 9 3 913. 143. 961. 566. 1026.
## 9 001177 7 1 1049. 86.4 1000 977. 1200
## 10 001209 9 2 1196. 123. 1200 905. 1350
## # … with 2,337 more rows
Programming is hard work. You will make mistakes. You will get error messages. An important programming skill is to efficiently debug your code and find out what is going wrong.
The most important technique is trial and error. Change one thing in your code and see what the output is. Do you get an error message? What does the error message tell you? If you are careful and do not change everything at once, this should at least help you to find out which part of the code does not work. This is one reason why scripts are so important. Because a script contains your entire analysis, you can always go back and (let your computer) redo the analysis from the start.
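One way to apply trial and error is to run a compound statement one step at a time and look at each intermediate result. The sketch below does this on a made-up data frame with a missing value. The object names toy, step1, step2, and step3 are hypothetical, and the na.rm argument is just one example of a single change you could try.

```r
library(dplyr)

# Made-up data with one missing salary
toy <- data.frame(
  firm   = c("A", "A", "B"),
  salary = c(500, NA, 1000)
)

# Step 1: run only the first part and inspect the result
step1 <- group_by(toy, firm)
step1

# Step 2: add one more step and check the output again
step2 <- summarise(step1, average = mean(salary))
step2  # the average for firm A is NA because of the missing salary

# Step 3: change one thing (ignore missing values) and compare
step3 <- summarise(step1, average = mean(salary, na.rm = TRUE))
step3
```

Because only one thing changes between step 2 and step 3, you know exactly which change is responsible for the different output.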
If you are not sure how you should use a certain function in R, you can read its help files in the Rstudio Help window. There you can search for more information on what a function does.
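You can also reach the help files from the console. The short sketch below uses rnorm() as an arbitrary example of a function you might want to look up. The object name rnorm_arguments is only illustrative.

```r
# In the console, either of these opens the help page of rnorm():
#   ?rnorm
#   help("rnorm")

# Show the arguments of a function and their default values
args(rnorm)

# Store just the names of the arguments
rnorm_arguments <- names(formals(rnorm))
rnorm_arguments
```

Checking the argument names is often enough to spot a misspelled or missing argument in your own code.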
R is a very popular language. If you have a problem, it is likely that someone else had the same problem before you. If you Google the error message, there is a decent chance you will find a solution to your problem.
A specific website where a lot of these questions are asked is
StackOverflow. You can often directly search for your error on
the website and find multiple solutions. You can improve your
chances of finding a related answer to your problem by adding the
[r], [tidyverse], [dplyr], or [ggplot] tags to the error message
you search for.
You can also ask me directly for help on LMS. I set up a forum for questions and I promise to help you with any R questions you might have for the unit and for your thesis.
One of the big advantages of the R world is that you can
easily combine explanations with your analysis in .Rmd files.
The assignments can all be completed in Rmarkdown. You can find
a lot of good resources on Rmarkdown on the Rstudio website.
In short, markdown is a simple markup language. This means that
you can add little bits of code to your text that indicate how
the text should look. Rmarkdown also lets you include R code.
You can have titles in markdown.
# Title
## Subtitle
### Lower level titles.
You can emphasise some words.
*Italic*, **bold**
Add links to pages and include pictures.
[weblink](https://www.google.com)
![pictures](https://rstudio.com/wp-content/uploads/2015/10/r-packages.png)
You can write enumerations and lists.
1. Item 1
2. Item 2
3. Item 3
- one
- two
- three
You can also write tables. You will rarely have to use tables. Typically, you can directly create the tables from R without the need to type in the results of your analysis.
First Header | Second Header
------------- | -------------
Content Cell | Content Cell
Content Cell | Content Cell
The largest advantage of Rmarkdown files is that we can include
pieces of R code. (Again, all these notes are written in
Rmarkdown.) Code chunks go between three backticks and we tell
Rmarkdown that the language we are using is {r}. The example
below creates a code chunk where we create a random vector x
with ten elements, where the elements are drawn from a normal
distribution with mean 4 and standard deviation 3.
```{r}
x <- rnorm(n = 10, mean = 4, sd = 3)
```
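Because rnorm() draws random numbers, the chunk above gives a different x every time the document is built. If you want the same numbers each time, one option is to fix the seed of the random number generator first. The sketch below shows the idea. The seed value 123 is arbitrary and the object y only exists to make the comparison.

```r
set.seed(123)  # any fixed number works; 123 is an arbitrary choice
x <- rnorm(n = 10, mean = 4, sd = 3)

set.seed(123)  # resetting the same seed reproduces the same draws
y <- rnorm(n = 10, mean = 4, sd = 3)

length(x)
identical(x, y)
```

This fits the idea that anyone should be able to rerun your script and get the same analysis.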
Page built: 2022-02-01 using R version 4.1.2 (2021-11-01)