WRDS linking data

Author

Stijn Masschelein

I use four packages on this page and three of them require some more explanation. The here package helps with managing the different files in this larger project. I can refer to different files relative to the root folder all the files are in. The only thing that I need to do is to say where this file is compared to the root folder with the i_am function. You can see the simplified folder structure below for this website. By telling R which file this is, I can easily refer to other files and save the data we are going to download and clean in the data folder. If it is specific to the replication it will go into the data/freaky_friday folder as an .RDS file which is an efficient way of storing R objects.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(here)
here() starts at /Users/stijnmasschelein/Library/CloudStorage/Dropbox/Teaching/lecturenotes/method_package
i_am("freaky_friday/download_linking.qmd")
here() starts at /Users/stijnmasschelein/Library/CloudStorage/Dropbox/Teaching/lecturenotes/method_package
library(RPostgres)
library(dbplyr)

Attaching package: 'dbplyr'

The following objects are masked from 'package:dplyr':

    ident, sql
|-- auxilary
|-- data
    |-- freaky_friday
        |-- ... .RDS
    |-- machine_learning
    |-- wrds
|-- freaky_friday
    |-- abnormal_returns.qmd
    |-- descriptive.qmd
    |-- download_data.qmd
    |-- download_linking.qmd
    |-- download_stocks.qmd
    |-- index.qmd
    |-- linking.qmd
    |-- regressions.qmd
|-- generalised
|-- generated
|-- machine_learning
|-- slides

The second package is the RPostgres package that helps make a connection with the WRDS data sources. The final package is dbplyr which allows us to interact with almost any database with the tidyverse verbs. I’ll demonstrate how below.

Connect to WRDS

This code sets up the connection to WRDS. WRDS has a thorough tutorial on how to set up Rstudio to connect to WRDS.

In this tutorial, I will show you how you can do the whole data analysis from no data at all to the main results table. You do not have to do it this way. It is possible to just download the data files in for instance excel format from WRDS. For individual researchers this is sometimes easier. The disadvantage is that it is hard to check what researchers actually did when you do not have the full analytical pipeline. For future researchers having the full code available gives a way to check the work and improve upon it even if the data itself is not shared by the researchers.

The practical benefit for you is that you can pick and choose which parts of the code you think are useful because I attempt to have everything in this tutorial.

wrds <- dbConnect(Postgres(),
                  host='wrds-pgdata.wharton.upenn.edu',
                  port=9737,
                  dbname='wrds',
                  user='stimas',
                  sslmode='require')

The code sets up a connection to the WRDS database under my account name. Do not fear, you also need my password to make that code work so you cannot hack my WRDS account with this code. The WRDS tutorial explains all of this in more detail.

Linking Databases in WRDS

WRDS provides really good documentation on how to best combine I/B/E/S with other data sources. In doubt, you should check their tutorials. This would also be my general advice for working with any database. It is always a good idea to read through the documentation of the database and try to understand how the data is constructed. There is a similar tutorial on how to merge Compustat with other sources. Most of the decisions I made below follow directly from those tutorials. The long and short of it is that both I/B/E/S and Compustat have a dataset with identifiers for other databases and we are going to download those datasets 1.

I use the tbl function from the dbplyr package. I give the downloaded data a name ibes_query so that I can use it further to clean it and save it. The only thing to remember with the tbl function is that you need to run the collect function at the end of the query so that the data is downloaded. Obviously, you need to know where the data is on the servers and how the variables are named. Sometimes, you can ask colleagues but other times, you will have to read the documentation on the WRDS website.

I/B/E/S

The code instructs the WRDS database to get the variables ticker, cusip (another identifier), cname (company name), and sdates (the start date for this ticker) from the ibes.idsum (IBES ID summary) database of WRDS. In this paper, we only want U.S. firms which we do by filtering by usfirm == 1.

ibes_query <- tbl(wrds, in_schema("ibes", "idsum")) %>%
  filter(usfirm == 1) %>%
  select(ticker, cusip, cname, sdates) %>%
  collect()

With some R code, we clean the data and save it as a file in the data > wrds folder in our main folder.

ibes_id <- as_tibble(ibes_query) %>%
  rename_all(tolower)
saveRDS(ibes_id, here("data", "wrds", "ibes_id.RDS"))

CRSP

From the CRSP data, we get the permno and ncusip identifier where ncusip stands for the same cusip identifier as mentioned above. We also have the company name, start date, and end date.

crsp_query <- tbl(wrds, in_schema("crsp", "stocknames")) %>%
  select(permno, ncusip, comnam, st_date, end_date) %>%
  collect()
crsp_id <- as_tibble(crsp_query) %>%
  rename_all(tolower)
saveRDS(crsp_id, here("data", "wrds", "crsp_id.RDS"))

Compustat with I/B/E/S

From Compustat we use the security file which has all the financial securities (and their identifiers) that are linked to the firms in Compustat. We select all the variables from that dataset. We only select the ones where the ibes ticker is available so that we can match via the ticker in the I/B/E/S files.

compu_security_query <- tbl(wrds, in_schema("comp", "security")) %>%
  filter(!is.null(ibtic)) %>%
  collect()
compu_security <- as_tibble(compu_security_query) %>%
  rename_all(tolower)
saveRDS(compu_security, here("data", "wrds", "compu_security.RDS"))

Compustat with CRSP

Finally, we get the linking file in compustat. According to the documentation, not all the links are reliable and they advice to use the linktype variable and the usedflag variable to filter only the links that are most reliable. I have implemented the rules that follow best practice according to this tutorial (https://wrds-www.wharton.upenn.edu/pages/wrds-research/applications/linking-databases/linking-crsp-and-compustat/)

# compu_query
compu_query <- tbl(wrds, in_schema("crsp", "ccmxpf_linktable")) %>%
  select(gvkey, linktype, usedflag, iid = liid, permno = lpermno, stdt = linkdt, enddt = linkenddt) %>%
  filter(!is.na(permno), linktype %in% c("LU", "LC"), usedflag == 1) %>%
  select(gvkey, permno, stdt, enddt) %>%
  distinct()
saveRDS(crsp_query, here("data", "wrds", "crsp_compu.RDS"))

Summary

It is important to remember that we did not download the actual data that we need, we only downloaded the meta data about all the merges we can make. However, to have this data in place makes it much easier to update the data if we need to do so. Also, notice that I made this a separate file. It is good practice to break down larger projects in smaller tasks and files. If this part does not need updating there is no need to rerun this code!

Footnotes

  1. There are some other overviews available for identifiers that you can use to merge datasets. I am only using the best practice advice according to WRDS.↩︎