Our main tool for data analysis is the R statistical software. R is free software specifically aimed at statistical analysis and graphical representation. It is a fully developed programming language that you can extend with packages that add new capabilities. We are going to rely heavily on some of those packages.¹
We are going to interact with the R software through RStudio.² RStudio is an integrated development environment for R. It allows you to easily write, test, and run R code, and to integrate the results with explanations. One advantage of R and RStudio is that you can install them on as many computers as you want. You do not need a special license. This means that you do not have to be affiliated with the university to use the software.
In RStudio, you write code and ask the R software to execute that code. You can save the different steps in your analysis in an R script, which is just a plain text file with the .R extension. The advantage of plain text files is that they are easy to share across different operating systems. The code in your script allows you and anyone else to run the same analysis at a later point in time, which is very important for checking for mistakes and for teaching. We will make use of a special type of plain text file: Quarto (.qmd) files.³
Quarto is an extension of the markdown format. The idea of markdown is to write plain text documents which can later be exported to other formats such as html, pdf, or Word. The beauty of Quarto is that it lets you combine both R code and markdown in one document. This allows you to integrate your analysis and your description of your analysis in one document. When you go back to your analysis after a couple of weeks, you will be happy that you have more than the raw code to look at.
Finally, I am going to introduce you to a specific dialect of the R language which is informally known as the tidyverse. If you want to be a good R programmer, this might not be the best entry point into R. However, I assume that you want to quickly pick up some tools to facilitate your research project, and for that purpose I believe the tidyverse will be excellent. I will introduce the most important bits and pieces throughout my lectures; however, I don’t have the time to go into detail. Now and then, you will have to experiment on your own to get your code working. Some excellent resources are the free online books R for Data Science and Data Visualization: A Practical Introduction. You can also buy reasonably priced physical copies if you are interested in developing your R skills further.
For larger projects, you should use an RStudio project in its own separate folder.⁴ Once you need multiple scripts, you want to make sure that you can reliably point to the results or functions in different scripts. Projects help you manage all the parts of a larger piece of work. You can start a new project by clicking File > New Project ...
You can start a new file by clicking File > New File > R Script.
In RStudio, you can use the R console directly. I mainly use the R console to test the code I have written in my scripts, and I would advise you to do the same. If you want to understand how R works, you can experiment by typing the following lines in the console.⁵
x <- 1
x
[1] 1
x <- x + 1
x
[1] 2
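The little code above already shows you two things: you can assign a value to an object x with <-, and you can use x in a calculation and assign the result back to x.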
We will see further on that we can assign almost anything to an object x. When x is a vector or a data set with multiple elements, this will allow us to perform the same function on each element of x. That is the basic advantage of programming: you can tell your computer how to do a thing and then the computer can do it over and over for you.
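As a minimal sketch of that idea, the lines below apply the same operation to every element of a vector at once. The values in x are made up for illustration.

```r
# A small numeric vector (made-up values for illustration)
x <- c(1, 2, 3, 4)

# Adding 1 is applied to every element of x in one go
y <- x + 1
y
# [1] 2 3 4 5

# Functions such as sqrt() also work element by element
roots <- sqrt(x)
roots
```

You write the instruction once, and R repeats it for each element without you having to spell out a loop.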
install.packages("tidyverse")
library("tidyverse")
R packages are additions to R that give R extra functionality. One of the selling points of R is that it has a good way of integrating this extra functionality, and a lot of people are highly motivated to add it. We will rely heavily on one meta-package: tidyverse is a collection of packages that helps with data transformations and working with tabular data in a tidy fashion. One of the packages in the tidyverse is ggplot2, which provides tools to make pretty plots. The code above shows you how to install packages and use them. Normally, you will only have to install a package once. However, when you want to use a package in your script or code, you will have to load the package with the library() function. You only have to load a package once at the start of your R session or at the start of your script.⁶
dplyr is another part of the tidyverse that we will use extensively. With five verbs⁷ and one pipe operator to glue them together, it helps you explore the data. filter() lets you keep a subset of the observations. select() selects a subset of the variables. mutate() changes variables and creates new ones. group_by() and summarise() group subsets of observations and summarise them. %>% is the pipe operator; it joins the verbs together to create compound statements where you, for instance, filter a subset of the observations, create a new variable, create groups, and summarise the groups based on the average of the new variable.
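The compound statement described above can be sketched on a tiny made-up data set. The data frame toy_comp and its values are hypothetical and only serve to show the order of the verbs. The sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data: two firms, amounts in thousands of dollars
toy_comp <- data.frame(
  gvkey  = c("A", "A", "B", "B", "B"),
  year   = c(2013, 2014, 2013, 2014, 2015),
  salary = c(500, 600, 1000, 1200, 900),
  total  = c(1000, 1500, 4000, 6000, 3000)
)

toy_summary <- toy_comp %>%
  filter(year >= 2014) %>%                       # keep a subset of the observations
  mutate(salary_share = salary / total) %>%      # create a new variable
  group_by(gvkey) %>%                            # create groups
  summarise(mean_share = mean(salary_share)) %>% # average of the new variable per group
  ungroup()

toy_summary
```

Each line does one small thing, so you can read the statement from top to bottom as a recipe.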
Let’s run some code. First, we have to tell R to use the packages in the tidyverse.
library(tidyverse)
Next, we have a look at the data on the compensation of CEOs of S&P500 companies. First, we have to read the file with the data and assign it to an object, us_comp in this case.⁸ The glimpse function gives an overview of the variables in the data, the type of each variable, and a couple of example values for each variable.
library(here) # provides here() to build file paths relative to the project folder
us_comp <- readRDS(here("data", "us-compensation-new.RDS"))
glimpse(us_comp)
Rows: 24,294
Columns: 22
$ year <dbl> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018…
$ gvkey <chr> "001004", "001004", "001004", "001004", "00100…
$ cusip <chr> "00036110", "00036110", "00036110", "00036110"…
$ exec_fullname <chr> "David P. Storch", "David P. Storch", "David P…
$ coname <chr> "AAR CORP", "AAR CORP", "AAR CORP", "AAR CORP"…
$ ceoann <chr> "CEO", "CEO", "CEO", "CEO", "CEO", "CEO", "CEO…
$ execid <chr> "09249", "09249", "09249", "09249", "09249", "…
$ bonus <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0…
$ salary <dbl> 867.000, 877.838, 906.449, 906.449, 755.250, 8…
$ stock_awards_fv <dbl> 2664.745, 619.200, 1342.704, 1695.200, 1150.50…
$ stock_unvest_val <dbl> 4227.273, 6018.000, 4244.165, 4103.283, 1334.0…
$ eip_unearn_num <dbl> 32.681, 56.681, 66.929, 83.415, 91.969, 157.96…
$ eip_unearn_val <dbl> 393.806, 1137.021, 1626.375, 2464.079, 2244.96…
$ option_awards <dbl> 578.460, 695.520, 1622.016, 0.000, 1150.500, 1…
$ option_awards_blk_value <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ option_awards_num <dbl> 49.022, 144.000, 158.400, 0.000, 153.810, 225.…
$ tdc1 <dbl> 5786.400, 4182.832, 5247.779, 5234.648, 4674.4…
$ tdc2 <dbl> 6105.117, 3487.312, 3809.626, 10428.375, 3523.…
$ shrown_tot_pct <dbl> 2.964, 2.893, 3.444, 3.877, 4.597, 5.417, 3.71…
$ becameceo <date> 1996-10-09, 1996-10-09, 1996-10-09, 1996-10-0…
$ joined_co <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ reason <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
We can also look at a subset of the observations, i.e. we filter the observations from the company that has the key 001045.
filter(us_comp, gvkey == "001045")
# A tibble: 12 × 22
year gvkey cusip exec_fullname coname ceoann execid bonus salary
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 2011 001045 02376R10 Gerard J. Arpey AMERI… CEO 14591 0 614.
2 2012 001045 02376R10 Thomas W. Horton AMERI… CEO 26059 0 618.
3 2013 001045 02376R10 Thomas W. Horton AMERI… CEO 26059 0 592.
4 2014 001045 02376R10 William Douglas Park… AMERI… CEO 46191 0 688.
5 2015 001045 02376R10 William Douglas Park… AMERI… CEO 46191 0 232.
6 2016 001045 02376R10 William Douglas Park… AMERI… CEO 46191 0 0
7 2017 001045 02376R10 William Douglas Park… AMERI… CEO 46191 0 0
8 2018 001045 02376R10 William Douglas Park… AMERI… CEO 46191 0 0
9 2019 001045 02376R10 William Douglas Park… AMERI… CEO 46191 0 0
10 2020 001045 02376R10 William Douglas Park… AMERI… CEO 46191 0 0
11 2021 001045 02376R10 William Douglas Park… AMERI… CEO 46191 0 0
12 2022 001045 02376R10 Robert D. Isom, Jr. AMERI… CEO 46193 0 1162.
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
# eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
# option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
# tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
# reason <chr>
You can see that the company, American Airlines, has had four CEOs since 2011. I used the gvkey to select a company and not its name. You will often use an id or a database key, especially if you are working with multiple datasets and you want to link observations from one dataset to observations in the other dataset.
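We can also filter data based on other variables. For instance, the code below filters the observations from 2014 where the CEO had a salary over $1,000,000. The salary variable is recorded in thousands of dollars, which is why the condition is salary > 1000.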
filter(us_comp, salary > 1000, year == 2014)
# A tibble: 494 × 22
year gvkey cusip exec_fullname coname ceoann execid bonus salary
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 2014 001075 72348410 Donald E. Brandt, CPA PINNA… CEO 05835 0 1240
2 2014 001078 00282410 Miles D. White, M.B.… ABBOT… CEO 14300 0 1973.
3 2014 001161 00790310 Rory P. Read ADVAN… CEO 42390 0 1000.
4 2014 001300 43851610 David M. Cote HONEY… CEO 20931 5500 1866.
5 2014 001380 42809H10 John B. Hess HESS … CEO 02132 0 1500
6 2014 001440 02553710 Nicholas K. Akins AMERI… CEO 40778 0 1241.
7 2014 001447 02581610 Kenneth I. Chenault AMERI… CEO 02157 4500 2000
8 2014 001449 00105510 Daniel Paul Amos AFLAC… CEO 00013 0 1441.
9 2014 001468 02637510 Jeffrey M. Weiss AMERI… CEO 11710 3640 1012.
10 2014 001487 02687478 Robert Herman Benmos… AMERI… CEO 20972 0 1385.
# ℹ 484 more rows
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
# eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
# option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
# tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
# reason <chr>
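The filter call above lists two conditions separated by a comma. In dplyr, conditions separated by commas are combined with a logical "and", so only rows that satisfy both are kept. A minimal sketch on hypothetical toy data (the firm names and values are made up):

```r
library(dplyr)

# Hypothetical toy data, salary in thousands of dollars
toy_comp <- data.frame(
  coname = c("Firm A", "Firm B", "Firm C", "Firm D"),
  year   = c(2014, 2014, 2015, 2014),
  salary = c(1500, 800, 2000, 1000)
)

# Two conditions separated by a comma ...
with_comma <- filter(toy_comp, salary > 1000, year == 2014)

# ... give the same rows as one condition joined with &
with_and <- filter(toy_comp, salary > 1000 & year == 2014)

with_comma
```

Note that Firm D is dropped because a salary of exactly 1000 is not strictly greater than 1000.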
You can also select some variables if you are only interested in those. If a variable has an undescriptive name, you can use select to rename the variable. For instance, tdc1 is the total compensation of the CEO. A more descriptive name will help you to remember what the variable actually means.
select(us_comp, year, coname, bonus, salary, total = tdc1)
# A tibble: 24,294 × 5
year coname bonus salary total
<dbl> <chr> <dbl> <dbl> <dbl>
1 2011 AAR CORP 0 867 5786.
2 2012 AAR CORP 0 878. 4183.
3 2013 AAR CORP 0 906. 5248.
4 2014 AAR CORP 0 906. 5235.
5 2015 AAR CORP 0 755. 4674.
6 2016 AAR CORP 0 835 6073.
7 2017 AAR CORP 0 941 6284.
8 2018 AAR CORP 0 750 3344.
9 2019 AAR CORP 0 801. 4736.
10 2020 AAR CORP 0 781. 4123.
# ℹ 24,284 more rows
You can also create new variables with the mutate function. I created a variable that calculates which share of total compensation is the CEO’s salary. The result is expressed as a proportion, so 0.150 means 15%.
select(us_comp, year, coname, bonus, salary, total = tdc1) %>%
mutate(salary_percentage = salary / total)
# A tibble: 24,294 × 6
year coname bonus salary total salary_percentage
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 2011 AAR CORP 0 867 5786. 0.150
2 2012 AAR CORP 0 878. 4183. 0.210
3 2013 AAR CORP 0 906. 5248. 0.173
4 2014 AAR CORP 0 906. 5235. 0.173
5 2015 AAR CORP 0 755. 4674. 0.162
6 2016 AAR CORP 0 835 6073. 0.138
7 2017 AAR CORP 0 941 6284. 0.150
8 2018 AAR CORP 0 750 3344. 0.224
9 2019 AAR CORP 0 801. 4736. 0.169
10 2020 AAR CORP 0 781. 4123. 0.189
# ℹ 24,284 more rows
Notice how the select statement from before is chained together with the mutate statement through the pipe operator (%>%). This operator pipes the result from the first statement to the second statement. The second statement implicitly uses the result from the first statement as the data it is going to work with.
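To make the implicit first argument concrete, the sketch below writes the same two steps once as nested function calls and once with the pipe. The data frame toy_comp is hypothetical, and the example assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data, amounts in thousands of dollars
toy_comp <- data.frame(
  salary = c(500, 1000),
  total  = c(1000, 4000),
  bonus  = c(0, 50)
)

# Without the pipe: the result of select() is passed as the first argument of mutate()
nested <- mutate(select(toy_comp, salary, total), share = salary / total)

# With the pipe: the same steps, read from top to bottom
piped <- toy_comp %>%
  select(salary, total) %>%
  mutate(share = salary / total)

identical(nested, piped)
```

Both versions produce the same data. The piped version is usually easier to read once you chain more than two steps.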
You can use piped statements to make your own table of descriptive statistics. I created a table with some statistics for each firm. I first group the observations based on the gvkey and then summarise each group with the same key by defining a number of new variables.
group_by(us_comp, gvkey) %>%
summarise(N = n(), N_CEO = n_distinct(execid),
average = mean(salary), sd = sd(salary),
med = median(salary), minimum = min(salary),
maximum = max(salary)) %>%
ungroup()
# A tibble: 2,571 × 8
gvkey N N_CEO average sd med minimum maximum
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001004 12 2 862. 78.9 872. 750 1000
2 001045 12 4 325. 395. 116. 0 1162.
3 001072 8 2 677. 219. 652. 450 967.
4 001075 12 2 1219. 111. 1222. 1091 1395
5 001076 12 4 758. 119. 743. 567. 950
6 001078 12 2 1788. 224. 1900 1298. 1973.
7 001094 8 3 563. 66.2 581. 437. 626.
8 001161 12 3 961. 151. 1000. 566. 1149.
9 001177 7 1 1049. 86.4 1000 977. 1200
10 001209 13 2 1247. 129. 1200 905. 1402.
# ℹ 2,561 more rows
Recent versions of dplyr (1.1.0 and later) allow you to write the same code more compactly with the .by argument.
summarise(us_comp, N = n(), N_CEO = n_distinct(execid),
average = mean(salary), sd = sd(salary),
med = median(salary), minimum = min(salary),
maximum = max(salary),
.by = gvkey)
# A tibble: 2,571 × 8
gvkey N N_CEO average sd med minimum maximum
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001004 12 2 862. 78.9 872. 750 1000
2 001045 12 4 325. 395. 116. 0 1162.
3 001072 8 2 677. 219. 652. 450 967.
4 001075 12 2 1219. 111. 1222. 1091 1395
5 001076 12 4 758. 119. 743. 567. 950
6 001078 12 2 1788. 224. 1900 1298. 1973.
7 001094 8 3 563. 66.2 581. 437. 626.
8 001161 12 3 961. 151. 1000. 566. 1149.
9 001177 7 1 1049. 86.4 1000 977. 1200
10 001209 13 2 1247. 129. 1200 905. 1402.
# ℹ 2,561 more rows
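One practical difference between the two versions is the order of the rows in the result: group_by() sorts the groups by the grouping variable, while .by keeps the groups in the order in which they first appear in the data. In the output above the two tables look identical, presumably because the data is already sorted by gvkey. A minimal sketch on hypothetical toy data where the order differs (assuming dplyr 1.1.0 or later):

```r
library(dplyr)

# Hypothetical toy data; firm "B" deliberately appears before firm "A"
toy_comp <- data.frame(
  gvkey  = c("B", "A", "B", "A"),
  salary = c(1000, 500, 1200, 700)
)

# Classic approach: group, summarise, ungroup
grouped <- toy_comp %>%
  group_by(gvkey) %>%
  summarise(average = mean(salary)) %>%
  ungroup()

# Shorter approach with the .by argument
by_arg <- summarise(toy_comp, average = mean(salary), .by = gvkey)

grouped$gvkey # groups are sorted: "A" "B"
by_arg$gvkey  # groups in order of first appearance: "B" "A"
```

The summary values per firm are the same in both versions; only the row order can differ.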
Programming is hard work. You will make mistakes. You will get error messages. An important programming skill is to efficiently debug your code and find out what is going wrong.
The most important technique is trial and error. Change one thing in your code and see what the output is. Do you get an error message? What does the error message tell you? If you are careful and do not change everything at once, this should at least help you to find out which part of the code does not work. This is one reason why scripts are so important. Because a script contains your entire analysis, you can always go back and (let your computer) redo the analysis from the start.
If you are not sure how you should use a certain function in R, you can read its help file in the RStudio Help window. You can also search there for more information on what a function does.
R is a very popular language. If you have a problem it is likely that someone else had the same problem before you. If you Google the error message, there is a decent chance you will find a solution to your problem.
A specific website where a lot of these questions are asked is StackOverflow. You can often directly search for your error on the website and find multiple solutions. You can improve your chances of finding an answer related to your problem by adding the [r], [tidyverse], [dplyr], or [ggplot2] tags to your search.
One of the big advantages of the R world is that you can easily combine explanations with your analysis in .qmd files. The assignments can all be completed in Quarto. You can find a lot of good resources on the Quarto website. In short, markdown is a simple markup language⁹ that lets you include R code.
You can have titles in markdown.
# Title
## Subtitle
### Lower level titles.
You can emphasise some words.
*Italic*, **bold**
Add links to pages and include pictures.
[weblink](https://www.google.com)
![pictures](https://rstudio.com/wp-content/uploads/2015/10/r-packages.png)
You can write enumerations and lists.
1. Item 1
2. Item 2
3. Item 3
- one
- two
- three
You can also write tables. You will rarely have to use tables. Typically, you can directly create the tables from R without the need to type in the results of your analysis.
First Header | Second Header
------------- | -------------
Content Cell | Content Cell
Content Cell | Content Cell
The largest advantage of Quarto files is that we can include pieces of R code.¹⁰ Code chunks go between two lines of three backticks, and we tell Quarto that the language we are using is {r}. The example below creates a code chunk where we create a random vector x with ten elements, where the elements are drawn from a normal distribution with mean 4 and standard deviation 3.
```{r}
x <- rnorm(n = 10, mean = 4, sd = 3)
```
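If you run the content of such a chunk in the console, you can check that the vector looks as expected. The sketch below adds set.seed() so that the random draws are the same every time you run it. The seed value 42 is an arbitrary choice for this illustration.

```r
set.seed(42) # arbitrary seed, only to make the draws reproducible

# Ten draws from a normal distribution with mean 4 and standard deviation 3
x <- rnorm(n = 10, mean = 4, sd = 3)

length(x) # number of elements in x
mean(x)   # sample mean, close to 4 but not exactly 4
sd(x)     # sample standard deviation, close to 3 but not exactly 3
```

With only ten draws, the sample mean and standard deviation will differ somewhat from the values you asked for.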
1. You can download R from this CRAN server.
2. You can download RStudio from the RStudio website.
3. These notes and slides are entirely written in Quarto files.
4. See the Project Chapter of Hadley Wickham’s book.
5. It is tempting to copy and paste the code, but retyping the code is a surprisingly powerful way of learning to code. I would advise you to type as much of the code as possible when you are trying to figure out how R works.
6. Installing packages is another reason to use the R console.
7. i.e. functions to do something with a dataset.
8. The data is available on LMS or through the code in the detailed introduction file.
9. This means that you can add little bits of code to your text that indicate how the text should look.
10. Again, all these notes are written in Quarto.