RStudio installation and first code

Author

Stijn Masschelein

Research tools - Data Analysis

Data analysis tools

Our main tool for data analysis is the R statistical software. R is free software specifically aimed at statistical analysis and graphical representation. It is a fully developed programming language that you can extend with packages that add new capabilities. We are going to rely heavily on some of those packages.¹

We are going to interact with the R software through RStudio.² RStudio is an integrated development environment for R. It allows you to easily write, test, and run R code, and integrate the results with explanations. One advantage of R and RStudio is that you can install them on as many computers as you want. You do not need a special license. This means that you do not have to be affiliated with the university to use the software.

In RStudio, you write code and ask the R software to execute that code. You can save the different steps in your analysis in an R script, which is just a plain text file with the .R extension. The advantage of plain text files is that they are easy to share across different operating systems. The code in your script allows you and anyone else to run the same analysis at a later point in time, which is very important for checking for mistakes and for teaching. We will make use of a special type of plain text file: Quarto (.qmd) files.³

Quarto is an extension of the markdown format. The idea of markdown is to write plain text documents which can later be exported to other formats such as html, pdf, or Word. The beauty of Quarto is that it lets you combine both R code and markdown in one document. This allows you to integrate your analysis and your description of your analysis in one document. When you go back to your analysis after a couple of weeks, you will be happy that you have more than the raw code to look at.

Finally, I am going to introduce you to a specific dialect of the R language which is informally known as the tidyverse. If you want to be a good R programmer this might not be the best entry point into R. However, I assume that you want to quickly pick up some tools to facilitate your research project and for that purpose I believe the tidyverse will be excellent. I will introduce the most important bits and pieces throughout my lectures. However, I don’t have the time to go into detail. Now and then, you will have to experiment on your own to get your code working. Some excellent resources are the free online books R for Data Science and Data Visualization: A practical introduction. You can also buy reasonably priced physical copies if you are interested in developing your R skills further.

For larger projects, you should use a project in its own separate folder.⁴ Once you need multiple scripts, you want to make sure that you can reliably point to the results or functions in different scripts. Projects help you manage all the parts of a larger piece of work. You can start a new project by clicking File > New Project ... You can start a new file by clicking File > New File > R Script.

The R console

In RStudio, you can use the R console directly. I mainly use the R console to test the code I have written in my scripts and I would advise you to do the same. If you want to understand how R works, you can experiment by typing the following lines in the console.⁵

x <- 1
x

[1] 1

x <- x + 1
x

[1] 2

The little code above already shows you two things.

You can assign (and overwrite) a numerical value to an object x
The right-hand side is assigned to the left-hand side. You can’t just switch them around.

We will see further on that we can assign almost anything to an object x. When x is a vector or a data set with multiple elements, this will allow us to perform the same function on each element of x. That is the basic advantage of programming. You can tell your computer how to do a thing and then the computer can do it over and over for you.
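
As a small illustration of that last point, the sketch below assigns a vector to x and applies the same operation to every element at once. The values are made up for this example.

```r
# Assign a vector with three made-up values to x
x <- c(1, 2, 3)

# Adding 1 is applied to every element of x, and the result overwrites x
x <- x + 1
x

# Functions such as sqrt() also work element by element
y <- sqrt(x)
y
```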

R packages

install.packages("tidyverse")
library("tidyverse")

R packages are additions to R that give R extra functionality. One of the selling points of R is that it has a good way of integrating this extra functionality and a lot of people are highly motivated to add it. We will rely heavily on one meta-package. tidyverse is a collection of packages that helps with data transformations and working with tabular data in a tidy fashion. One of the packages in the tidyverse is ggplot2, which provides tools to make pretty plots. The code above shows you how to install packages and use them. Normally, you will only have to install a package once. However, when you want to use a package in your script or code, you will have to load the package with the library() function. You only have to load a package once at the start of your R session or at the start of your script.⁶

dplyr is another part of the tidyverse we will use extensively. With five verbs⁷ and one pipe operator to glue them together, dplyr helps you to explore the data. filter() lets you keep a subset of the observations. select() selects a subset of the variables. mutate() changes variables and creates new ones. group_by() and summarise() group subsets of observations and summarise them. %>% is the pipe operator and joins together the verbs to create compound statements where you, for instance, filter a subset of the observations, create a new variable, create groups, and summarise the groups based on the average for the new variable.
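
To make such a compound statement concrete, here is a minimal sketch on a small made-up data set. The object toy_comp, its column names, and its values are hypothetical and only serve as an illustration. They are not the course data.

```r
library(dplyr)

# Hypothetical toy data, made up for this example (amounts in thousands)
toy_comp <- tibble(
  firm   = c("A", "A", "B", "B", "C"),
  year   = c(2013, 2014, 2013, 2014, 2014),
  salary = c(500, 600, 900, 1000, 300),
  total  = c(1000, 1500, 3000, 4000, 600)
)

toy_summary <- toy_comp %>%
  filter(year == 2014) %>%                     # keep a subset of the observations
  select(firm, salary, total) %>%              # keep a subset of the variables
  mutate(salary_share = salary / total) %>%    # create a new variable
  group_by(firm) %>%                           # one group per firm
  summarise(avg_share = mean(salary_share))    # one row per group

toy_summary
```

Each verb takes a data set as its first argument and returns a data set, which is what makes it possible to glue them together with the pipe.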

Finally some useful code!

Let’s run some code. First, we have to tell R to use the packages in the tidyverse.

library(tidyverse)

Next, we have a look at data on the compensation of CEOs of S&P 500 companies. First, we have to read the file with the data and assign it to an object, us_comp in this case.⁸ The glimpse function gives an overview of the variables in the data, the type of the variables, and a couple of examples of the values for each variable.

library(here)  # here() builds a file path relative to the project folder
us_comp <- readRDS(here("data", "us-compensation-new.RDS"))
glimpse(us_comp)

Rows: 24,294
Columns: 22
$ year                    <dbl> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018…
$ gvkey                   <chr> "001004", "001004", "001004", "001004", "00100…
$ cusip                   <chr> "00036110", "00036110", "00036110", "00036110"…
$ exec_fullname           <chr> "David P. Storch", "David P. Storch", "David P…
$ coname                  <chr> "AAR CORP", "AAR CORP", "AAR CORP", "AAR CORP"…
$ ceoann                  <chr> "CEO", "CEO", "CEO", "CEO", "CEO", "CEO", "CEO…
$ execid                  <chr> "09249", "09249", "09249", "09249", "09249", "…
$ bonus                   <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0…
$ salary                  <dbl> 867.000, 877.838, 906.449, 906.449, 755.250, 8…
$ stock_awards_fv         <dbl> 2664.745, 619.200, 1342.704, 1695.200, 1150.50…
$ stock_unvest_val        <dbl> 4227.273, 6018.000, 4244.165, 4103.283, 1334.0…
$ eip_unearn_num          <dbl> 32.681, 56.681, 66.929, 83.415, 91.969, 157.96…
$ eip_unearn_val          <dbl> 393.806, 1137.021, 1626.375, 2464.079, 2244.96…
$ option_awards           <dbl> 578.460, 695.520, 1622.016, 0.000, 1150.500, 1…
$ option_awards_blk_value <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ option_awards_num       <dbl> 49.022, 144.000, 158.400, 0.000, 153.810, 225.…
$ tdc1                    <dbl> 5786.400, 4182.832, 5247.779, 5234.648, 4674.4…
$ tdc2                    <dbl> 6105.117, 3487.312, 3809.626, 10428.375, 3523.…
$ shrown_tot_pct          <dbl> 2.964, 2.893, 3.444, 3.877, 4.597, 5.417, 3.71…
$ becameceo               <date> 1996-10-09, 1996-10-09, 1996-10-09, 1996-10-0…
$ joined_co               <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ reason                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

We can also look at a subset of the observations, i.e. we filter the observations from the company that has the key 001045.

filter(us_comp, gvkey == "001045")

# A tibble: 12 × 22
    year gvkey  cusip    exec_fullname         coname ceoann execid bonus salary
   <dbl> <chr>  <chr>    <chr>                 <chr>  <chr>  <chr>  <dbl>  <dbl>
 1  2011 001045 02376R10 Gerard J. Arpey       AMERI… CEO    14591      0   614.
 2  2012 001045 02376R10 Thomas W. Horton      AMERI… CEO    26059      0   618.
 3  2013 001045 02376R10 Thomas W. Horton      AMERI… CEO    26059      0   592.
 4  2014 001045 02376R10 William Douglas Park… AMERI… CEO    46191      0   688.
 5  2015 001045 02376R10 William Douglas Park… AMERI… CEO    46191      0   232.
 6  2016 001045 02376R10 William Douglas Park… AMERI… CEO    46191      0     0 
 7  2017 001045 02376R10 William Douglas Park… AMERI… CEO    46191      0     0 
 8  2018 001045 02376R10 William Douglas Park… AMERI… CEO    46191      0     0 
 9  2019 001045 02376R10 William Douglas Park… AMERI… CEO    46191      0     0 
10  2020 001045 02376R10 William Douglas Park… AMERI… CEO    46191      0     0 
11  2021 001045 02376R10 William Douglas Park… AMERI… CEO    46191      0     0 
12  2022 001045 02376R10 Robert D. Isom, Jr.   AMERI… CEO    46193      0  1162.
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
#   eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
#   option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
#   tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
#   reason <chr>

You can see that the company, American Airlines, has had four CEOs since 2011. I used the gvkey to select a company and not its name. You will often use an id or a database key especially if you are working with multiple datasets and you want to link observations from one dataset to observations in the other dataset.

We can also filter data based on other variables. For instance, the code below keeps the observations from 2014 where the CEO had a salary over $1,000,000 (salary is recorded in thousands of dollars).

filter(us_comp, salary > 1000, year == 2014)

# A tibble: 494 × 22
    year gvkey  cusip    exec_fullname         coname ceoann execid bonus salary
   <dbl> <chr>  <chr>    <chr>                 <chr>  <chr>  <chr>  <dbl>  <dbl>
 1  2014 001075 72348410 Donald E. Brandt, CPA PINNA… CEO    05835      0  1240 
 2  2014 001078 00282410 Miles D. White, M.B.… ABBOT… CEO    14300      0  1973.
 3  2014 001161 00790310 Rory P. Read          ADVAN… CEO    42390      0  1000.
 4  2014 001300 43851610 David M. Cote         HONEY… CEO    20931   5500  1866.
 5  2014 001380 42809H10 John B. Hess          HESS … CEO    02132      0  1500 
 6  2014 001440 02553710 Nicholas K. Akins     AMERI… CEO    40778      0  1241.
 7  2014 001447 02581610 Kenneth I. Chenault   AMERI… CEO    02157   4500  2000 
 8  2014 001449 00105510 Daniel Paul Amos      AFLAC… CEO    00013      0  1441.
 9  2014 001468 02637510 Jeffrey M. Weiss      AMERI… CEO    11710   3640  1012.
10  2014 001487 02687478 Robert Herman Benmos… AMERI… CEO    20972      0  1385.
# ℹ 484 more rows
# ℹ 13 more variables: stock_awards_fv <dbl>, stock_unvest_val <dbl>,
#   eip_unearn_num <dbl>, eip_unearn_val <dbl>, option_awards <dbl>,
#   option_awards_blk_value <dbl>, option_awards_num <dbl>, tdc1 <dbl>,
#   tdc2 <dbl>, shrown_tot_pct <dbl>, becameceo <date>, joined_co <date>,
#   reason <chr>
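
The filter above combines two conditions with a comma, which means that both have to hold. The sketch below, on a made-up data set (toy_comp and its values are hypothetical), shows that a comma behaves like & and how | keeps rows where at least one condition holds.

```r
library(dplyr)

# Hypothetical toy data, made up for this example (salary in thousands)
toy_comp <- tibble(
  firm   = c("A", "B", "C", "D"),
  year   = c(2014, 2014, 2015, 2015),
  salary = c(1200, 800, 1500, 700)
)

# Conditions separated by a comma must all hold (logical AND)
both <- filter(toy_comp, salary > 1000, year == 2014)

# Writing the same conditions with & gives the same rows
both_amp <- filter(toy_comp, salary > 1000 & year == 2014)

# With | a row is kept when at least one of the conditions holds
either <- filter(toy_comp, salary > 1000 | year == 2014)

both
either
```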

You can also select some variables if you are only interested in those. If a variable has an undescriptive name, you can use select to rename the variable. For instance, tdc1 is the total compensation of the CEO. A more descriptive name will help you to remember what the variable actually means.

select(us_comp, year, coname, bonus, salary, total = tdc1)

# A tibble: 24,294 × 5
    year coname   bonus salary total
   <dbl> <chr>    <dbl>  <dbl> <dbl>
 1  2011 AAR CORP     0   867  5786.
 2  2012 AAR CORP     0   878. 4183.
 3  2013 AAR CORP     0   906. 5248.
 4  2014 AAR CORP     0   906. 5235.
 5  2015 AAR CORP     0   755. 4674.
 6  2016 AAR CORP     0   835  6073.
 7  2017 AAR CORP     0   941  6284.
 8  2018 AAR CORP     0   750  3344.
 9  2019 AAR CORP     0   801. 4736.
10  2020 AAR CORP     0   781. 4123.
# ℹ 24,284 more rows

You can also create new variables with the mutate function. I created a variable that calculates what share of total compensation is the CEO’s salary. The result is a fraction, so 0.150 means 15%.

select(us_comp, year, coname, bonus, salary, total = tdc1) %>%
  mutate(salary_percentage = salary / total)

# A tibble: 24,294 × 6
    year coname   bonus salary total salary_percentage
   <dbl> <chr>    <dbl>  <dbl> <dbl>             <dbl>
 1  2011 AAR CORP     0   867  5786.             0.150
 2  2012 AAR CORP     0   878. 4183.             0.210
 3  2013 AAR CORP     0   906. 5248.             0.173
 4  2014 AAR CORP     0   906. 5235.             0.173
 5  2015 AAR CORP     0   755. 4674.             0.162
 6  2016 AAR CORP     0   835  6073.             0.138
 7  2017 AAR CORP     0   941  6284.             0.150
 8  2018 AAR CORP     0   750  3344.             0.224
 9  2019 AAR CORP     0   801. 4736.             0.169
10  2020 AAR CORP     0   781. 4123.             0.189
# ℹ 24,284 more rows

Note how the select statement from before is chained together with the mutate statement through the pipe operator (%>%). This operator pipes the result from the first statement to the second statement. The second statement implicitly uses the result from the first statement as the data it is going to work with.
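
To see what the pipe does, the sketch below writes the same two steps in three ways on a made-up data set (toy_comp and its values are hypothetical). The last version uses the native pipe |>, which assumes R 4.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data, made up for this example
toy_comp <- tibble(
  salary = c(500, 900),
  total  = c(1000, 3000)
)

# With the pipe: the result of select() becomes the first argument of mutate()
piped <- select(toy_comp, salary, total) %>%
  mutate(salary_percentage = salary / total)

# Without the pipe: the same steps written as a nested call
nested <- mutate(select(toy_comp, salary, total),
                 salary_percentage = salary / total)

# With the native pipe (assumes R >= 4.1.0)
native <- select(toy_comp, salary, total) |>
  mutate(salary_percentage = salary / total)

piped
```

The piped versions read from top to bottom in the order in which the steps are executed, which is the main reason to prefer them over the nested call.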

You can use the pipe to make your own table of descriptive statistics. I created a table with some statistics for each firm. I first group the observations based on the gvkey and then summarise each group with the same key by defining a number of new variables.

group_by(us_comp, gvkey) %>%
  summarise(N = n(), N_CEO = n_distinct(execid), 
            average = mean(salary), sd = sd(salary),
            med = median(salary), minimum = min(salary), 
            maximum = max(salary)) %>%
  ungroup()

# A tibble: 2,571 × 8
   gvkey      N N_CEO average    sd   med minimum maximum
   <chr>  <int> <int>   <dbl> <dbl> <dbl>   <dbl>   <dbl>
 1 001004    12     2    862.  78.9  872.    750    1000 
 2 001045    12     4    325. 395.   116.      0    1162.
 3 001072     8     2    677. 219.   652.    450     967.
 4 001075    12     2   1219. 111.  1222.   1091    1395 
 5 001076    12     4    758. 119.   743.    567.    950 
 6 001078    12     2   1788. 224.  1900    1298.   1973.
 7 001094     8     3    563.  66.2  581.    437.    626.
 8 001161    12     3    961. 151.  1000.    566.   1149.
 9 001177     7     1   1049.  86.4 1000     977.   1200 
10 001209    13     2   1247. 129.  1200     905.   1402.
# ℹ 2,561 more rows

Recent versions of dplyr (1.1.0 and later) allow you to write the same code more concisely with the .by argument.

summarise(us_comp, N = n(), N_CEO = n_distinct(execid), 
          average = mean(salary), sd = sd(salary),
          med = median(salary), minimum = min(salary), 
          maximum = max(salary),
          .by = gvkey)

# A tibble: 2,571 × 8
   gvkey      N N_CEO average    sd   med minimum maximum
   <chr>  <int> <int>   <dbl> <dbl> <dbl>   <dbl>   <dbl>
 1 001004    12     2    862.  78.9  872.    750    1000 
 2 001045    12     4    325. 395.   116.      0    1162.
 3 001072     8     2    677. 219.   652.    450     967.
 4 001075    12     2   1219. 111.  1222.   1091    1395 
 5 001076    12     4    758. 119.   743.    567.    950 
 6 001078    12     2   1788. 224.  1900    1298.   1973.
 7 001094     8     3    563.  66.2  581.    437.    626.
 8 001161    12     3    961. 151.  1000.    566.   1149.
 9 001177     7     1   1049.  86.4 1000     977.   1200 
10 001209    13     2   1247. 129.  1200     905.   1402.
# ℹ 2,561 more rows
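
The sketch below checks on a made-up data set (toy_comp and its values are hypothetical) that the two ways of writing the summary give the same table. It assumes dplyr 1.1.0 or later for the .by argument.

```r
library(dplyr)

# Hypothetical toy data, made up for this example (salary in thousands)
toy_comp <- tibble(
  firm   = c("A", "A", "B", "B", "B"),
  execid = c("01", "02", "03", "03", "03"),
  salary = c(500, 700, 900, 1000, 1100)
)

# Version 1: group, summarise, and remove the grouping again
with_group_by <- group_by(toy_comp, firm) %>%
  summarise(N = n(), N_CEO = n_distinct(execid), average = mean(salary)) %>%
  ungroup()

# Version 2: grouping only for this one summarise() call (dplyr >= 1.1.0)
with_by <- summarise(toy_comp,
                     N = n(), N_CEO = n_distinct(execid), average = mean(salary),
                     .by = firm)

with_by
```

One difference to keep in mind is the order of the rows: group_by() sorts the result by the grouping variable, while .by keeps the groups in the order in which they first appear in the data. The toy data are already sorted by firm, so the two results are identical here.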

Looking for help

Programming is hard work. You will make mistakes. You will get error messages. An important programming skill is to efficiently debug your code and find out what is going wrong.

The most important technique is trial and error. Change one thing in your code and see what the output is. Do you get an error message? What does the error message tell you? If you are careful and do not change everything at once, this should at least help you to find out which part of the code does not work. This is one reason why scripts are so important. Because a script contains your entire analysis, you can always go back and (let your computer) redo the analysis from the start.
If you are not sure how you should use a certain function in R, you can read its help files in the RStudio Help window. There you can search for more information about what a function does.
R is a very popular language. If you have a problem, it is likely that someone else had the same problem before you. If you Google the error message, there is a decent chance you will find a solution to your problem.
A specific website where a lot of these questions are asked is StackOverflow. You can often directly search for your error on the website and find multiple solutions. You can improve your chances of finding a related answer to your problem by adding the [r], [tidyverse], [dplyr], [ggplot2] tags to your search.
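
The same steps can also be started from the console. The lines below are a minimal sketch: args() shows the arguments of a function, and tryCatch() stores an error message as text so that you can copy it into a search. The call log("a") is only there to produce an error on purpose.

```r
# Show the arguments of a function without leaving the console
args(rnorm)

# In the console, ?rnorm or help("rnorm") opens the full help page

# Store the text of an error message instead of stopping the script
msg <- tryCatch(
  log("a"),                                # taking the log of text is an error
  error = function(e) conditionMessage(e)  # return the message as a string
)
msg
```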

Quarto = Markdown + R + (a bunch of other programming languages)

One of the big advantages of the R world is that you can easily combine explanations with your analysis in .qmd files. The assignments can all be completed in Quarto. You can find a lot of good resources on the Quarto website. In short, markdown is a simple markup language⁹ and Quarto lets you include R code in it.

Markdown

You can have titles in markdown.

# Title
## Subtitle
### Lower level titles.

You can emphasise some words.

*Italic*, **bold**

Add links to pages and include pictures.

[weblink](https://www.google.com)
![pictures](https://rstudio.com/wp-content/uploads/2015/10/r-packages.png)

You can write enumerations and lists.

1. Item 1
2. Item 2 
3. Item 3 

- one
- two 
- three

You can also write tables. You will rarely have to use tables. Typically, you can directly create the tables from R without the need to type in the results of your analysis.

First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
Content Cell  | Content Cell

R-code in markdown: R chunks

The largest advantage of Quarto files is that we can include pieces of R code.¹⁰ Code chunks go between two lines of three backticks and we tell Quarto that the language we are using is {r}. The example below creates a code chunk where we create a random vector x with ten elements, where the elements are drawn from a normal distribution with mean 4 and standard deviation 3.

```{r}
x <- rnorm(n = 10, mean = 4, sd = 3)
```
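
A slightly longer version of such a chunk could look like the sketch below. In a .qmd file the opening fence would be written as {r}, as in the example above. The lines starting with #| are Quarto chunk options. The label simulate-x and the seed 123 are arbitrary choices for this illustration.

```r
#| label: simulate-x
#| echo: true

set.seed(123)                         # arbitrary seed so the draws can be reproduced
x <- rnorm(n = 10, mean = 4, sd = 3)  # ten draws from a normal distribution
length(x)                             # number of elements in x
mean(x)                               # sample mean, generally not exactly 4
sd(x)                                 # sample standard deviation, generally not exactly 3
```

Setting a seed is optional, but without it the numbers in the rendered document change every time the file is rendered.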

Footnotes

¹ You can download R from this CRAN server.
² You can download RStudio from the RStudio website.
³ These notes and slides are entirely written in Quarto files.
⁴ See the Project Chapter of Hadley Wickham’s book.
⁵ It is tempting to copy and paste the code but retyping the code is a surprisingly powerful way of learning to code. I would advise you to type as much of the code as possible when you are trying to figure out how R works.
⁶ Installing packages is another reason to use the R console.
⁷ i.e. functions to do something with a dataset.
⁸ The data is available on LMS or through the code in the detailed introduction file.
⁹ This means that you can add little bits of code to your text that indicate how the text should look.
¹⁰ Again, all these notes are written in Quarto.