Combining databases

Author

Stijn Masschelein

The packages are the same as before.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(here)

here() starts at /Users/stijnmasschelein/Library/CloudStorage/Dropbox/Teaching/lecturenotes/method_package

i_am("freaky_friday/linking.qmd")

here() starts at /Users/stijnmasschelein/Library/CloudStorage/Dropbox/Teaching/lecturenotes/method_package

Read the data

We first read in the linking databases that we have downloaded. Here, you can again see the that I am separating the code in different files and I am using the saved versions of the databases. The advantage is that I do not have to wait on the WRDS download to do the data cleaning in this file.

ibes_id <- readRDS(here("data", "wrds", "ibes_id.RDS"))
crsp_id <- readRDS(here("data", "wrds", "crsp_id.RDS"))
crsp_compu <- readRDS(here("data", "wrds", "crsp_compu.RDS"))
ibes_compu <- readRDS(here("data", "wrds", "compu_security.RDS"))

Merging between IBES and Compustat

We start with the ibes_id dataset where we can use the ticker and the cusip as an identifier.

glimpse(ibes_id)

Rows: 92,131
Columns: 4
$ ticker <chr> "0000", "0000", "0001", "0001", "0001", "0001", "0001", "0004",…
$ cusip  <chr> "87482X10", "87482X10", "26878510", "26878510", "26878510", "26…
$ cname  <chr> "TALMER BANCORP INC", "TALMER BANCORP INC", "EP ENERGY CORP", "…
$ sdates <date> 2014-02-20, 2014-03-20, 2014-02-20, 2014-03-20, 2019-06-20, 20…

First, I simplify the dataset a little bit. For every observation, I calculate the first and the last date for every combination of ticker and cusip. We are going to use these variables to match observations by ticker in one dataset and by cusip in another dataset. So, we really care about the combinations of (ticker, cusip). Then, I only keep the observations where the sdates equals last_data_ibes. This restricts the sample just to the identifiers and the first and last date they appeared in the data.

ibes_clean <- ibes_id %>%
  mutate(first_date_ibes = min(sdates),
         last_date_ibes = max(sdates),
         .by = c(ticker, cusip)) %>%
  filter(sdates == last_date_ibes) %>%
  select(-sdates)
glimpse(ibes_clean)

Rows: 34,728
Columns: 5
$ ticker          <chr> "0000", "0001", "0004", "000R", "000V", "000V", "000V"…
$ cusip           <chr> "87482X10", "26878510", "02504D10", "14163310", "15117…
$ cname           <chr> "TALMER BANCORP INC", "EP ENERGY CORP", "AMERICAN CAPI…
$ first_date_ibes <date> 2014-02-20, 2014-02-20, 2014-02-20, 2014-02-20, 2014-…
$ last_date_ibes  <date> 2014-03-20, 2020-10-15, 2014-06-19, 2014-02-20, 2014-…

Unfortunately, there are some problems with this data. The number of unique tickers is relatively low compared to the number of cusips and compared to the ones in the ibes_compu linking data. We also have to be careful because the cusip in the ibes_clean data is 8 characters long and the one in the ibes_compu is 9 characters long. That is one of the problems with the cusip identifier that you always have to check. See the code below for the details.

ibes_clean %>%
  summarise(ticker = n_distinct(ticker), cusip = n_distinct(cusip))

# A tibble: 1 × 2
  ticker cusip
   <int> <int>
1  22421 34647

ibes_clean %>%
  transmute(lcusip = str_length(cusip)) %>%
  summarise(N = n(), .by = lcusip)

# A tibble: 1 × 2
  lcusip     N
   <int> <int>
1      8 34728

ibes_compu %>%
  summarise(ibtic = n_distinct(ibtic), cusip = n_distinct(cusip))

# A tibble: 1 × 2
  ibtic cusip
  <int> <int>
1 27176 26702

ibes_compu %>%
  transmute(lcusip = str_length(cusip)) %>%
  summarise(N = n(), .by = lcusip)

# A tibble: 1 × 2
  lcusip     N
   <int> <int>
1      9 28717

At this point, I decided that the best approach is to stick with the ibes_compu dataset to match I/B/E/S and Compustat data. I also checked the overlap between the two linking databases and it’s decent. About 19000 of the 22000 unique identifiers in ibes_id are in the ibes_compu dataset.

glimpse(ibes_compu)

Rows: 28,717
Columns: 16
$ tic             <chr> "AMFD.", "ANTQ", "AIR", "ABSI.1", "ACSE", "ADCT.1", "A…
$ gvkey           <chr> "001001", "001003", "001004", "001009", "001011", "001…
$ iid             <chr> "01", "01", "01", "01", "01", "01", "01", "02", "01", …
$ cusip           <chr> "000165100", "000354100", "000361105", "000781104", "0…
$ dlrsni          <chr> "01", "10", NA, "10", "01", "01", "01", "10", "10", "0…
$ dsci            <chr> "COM USD.01", "COM USD.10", "COM USD1", "COM NPV", "CO…
$ epf             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ exchg           <int> 14, 19, 11, 19, 14, 14, 14, 3, 19, 11, 13, 1, 11, 11, …
$ excntry         <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA"…
$ ibtic           <chr> "AMFD", "ANTQ", "AIR", "ABSI", "ACSE", "ADCT", "AELN",…
$ isin            <chr> "US0001651001", "US0003541002", "US0003611052", "US000…
$ secstat         <chr> "I", "I", "A", "I", "I", "I", "I", "I", "I", "I", "I",…
$ sedol           <chr> NA, "2791283", "2001119", "2001153", NA, "B08G9Z7", NA…
$ tpci            <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",…
$ dldtei          <date> 1986-07-31, 2017-10-12, NA, 2007-02-15, 1995-10-31, 2…
$ curr_sp500_flag <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

sum(unique(ibes_clean$ticker) %in% unique(ibes_compu$ibtic))

[1] 19111

The reason why I leave this false trail in is that is typically what happens when you are working on a bigger data project. You need to interrogate the data to check how reliable the data is and to be aware of its shortcomings.

Merging between Compustat and CRSP

This goes via crsp_compu and the permno and gvkey variables. There is no perfect match between the two variables but the number of distinct identifiers is close.

glimpse(crsp_compu)

Rows: 33,385
Columns: 4
$ gvkey  <chr> "001000", "001001", "001002", "001003", "001004", "001005", "00…
$ permno <dbl> 25881, 10015, 10023, 10031, 54594, 61903, 10058, 10058, 10066, …
$ stdt   <date> 1970-11-13, 1983-09-20, 1972-12-14, 1983-12-07, 1972-04-24, 19…
$ enddt  <date> 1978-06-30, 1986-07-31, 1973-06-05, 1989-08-16, NA, 1983-01-31…

crsp_compu %>%
  summarise(permno = n_distinct(permno), gvkey = n_distinct(gvkey))

# A tibble: 1 × 2
  permno gvkey
   <int> <int>
1  29362 28809

Unfortunately, we also have the problem that sometimes the permno identifier for CRSP is associated with multiple gvkeys. This is more problematic, I think we will need to make sure we don’t run into trouble with this.

crsp_compu %>%
  summarise(stdt = min(stdt), enddt = max(enddt), .by = c(gvkey, permno)) %>%
  mutate(N = n(), .by = permno) %>%
  filter(N > 1) %>%
  arrange(desc(N), permno, stdt)

# A tibble: 1,511 × 5
   gvkey  permno stdt       enddt          N
   <chr>   <dbl> <date>     <date>     <int>
 1 185620  12350 2010-10-22 2012-10-02     3
 2 163806  12350 2012-10-03 2021-11-04     3
 3 040116  12350 2021-11-05 NA             3
 4 015215  13883 2013-04-30 2018-12-09     3
 5 034528  13883 2018-12-10 2021-06-27     3
 6 039104  13883 2021-06-28 NA             3
 7 007280  19289 1950-01-01 1984-02-29     3
 8 012127  19289 1984-03-01 1987-06-29     3
 9 014064  19289 1987-06-30 2006-08-29     3
10 003923  29787 1977-10-18 1979-12-31     3
# ℹ 1,501 more rows

It looks like sometimes the identifier for the company in compustat (gvkey) switches but the financial instrument is still the same (permno). That probably means that we will have to take into account the dates. That is, we will have to match on the identifiers and the date when matching observations form Compustat and CRSP.

Full Linking Table

For this study, we need the data from all three datasets, so we can make one table that links everything together. The two datasets have gvkey in common because of their link to Compustat. I use an inner_join because we need observations of all three dataset for each observation. I also rename some of the variables to make them a little bit easier to work with for me. I specifically retained the isin identifier. It is not necessary for the replication but it can be useful for potential extensions.

linking_table <- ibes_compu %>%
  select(gvkey, ibtic, isin, cusip) %>%
  inner_join(crsp_compu, by = join_by(gvkey == gvkey),
             multiple = "all") %>%
  rename(ticker = ibtic, start_date = stdt, end_date = enddt) %>%
  distinct() %>%
  print()

Warning in inner_join(., crsp_compu, by = join_by(gvkey == gvkey), multiple = "all"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 17 of `x` matches multiple rows in `y`.
ℹ Row 98 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

# A tibble: 24,375 × 7
   gvkey  ticker isin         cusip     permno start_date end_date  
   <chr>  <chr>  <chr>        <chr>      <dbl> <date>     <date>    
 1 001001 AMFD   US0001651001 000165100  10015 1983-09-20 1986-07-31
 2 001003 ANTQ   US0003541002 000354100  10031 1983-12-07 1989-08-16
 3 001004 AIR    US0003611052 000361105  54594 1972-04-24 NA        
 4 001009 ABSI   US0007811047 000781104  10074 1982-01-18 1996-03-13
 5 001011 ACSE   US0008723092 000872309  10082 1983-03-21 1995-09-28
 6 001013 ADCT   US0008863096 000886309  50906 1979-03-16 2010-12-31
 7 001017 AELN   US0010301051 001030105  10154 1972-12-14 1996-02-29
 8 001020 AFGN   <NA>         001054931  68806 1972-12-14 1988-04-05
 9 001021 AFPC   US0010582056 001058205  10218 1980-10-14 1999-03-19
10 001023 AGS    US0012401008 001240100  66755 1982-06-09 1988-10-05
# ℹ 24,365 more rows

glimpse(linking_table)

Rows: 24,375
Columns: 7
$ gvkey      <chr> "001001", "001003", "001004", "001009", "001011", "001013",…
$ ticker     <chr> "AMFD", "ANTQ", "AIR", "ABSI", "ACSE", "ADCT", "AELN", "AFG…
$ isin       <chr> "US0001651001", "US0003541002", "US0003611052", "US00078110…
$ cusip      <chr> "000165100", "000354100", "000361105", "000781104", "000872…
$ permno     <dbl> 10015, 10031, 54594, 10074, 10082, 50906, 10154, 68806, 102…
$ start_date <date> 1983-09-20, 1983-12-07, 1972-04-24, 1982-01-18, 1983-03-21…
$ end_date   <date> 1986-07-31, 1989-08-16, NA, 1996-03-13, 1995-09-28, 2010-1…

saveRDS(linking_table, here("data", "freaky_friday", "linking_table.RDS"))

This linking table is a full linking table for all U.S. companies that are in I/B/E/S, Compustat, and CRSP. It’s always a good idea to make your own documentation for important datasets. So, I made an overview table.

variable name	data source	description
`gvkey`	Compustat	A company identifier
`ticker`	I/B/E/S	A security identifier
`isin`	Datastream	A company identifier
`permno`	CRSP	A security identifier
`start_date`	Compustat/CRSP	The start date for a (`gvkey`, `permno`) combination
`end_date`	Compustat/CRSP	The start date for a (`gvkey`, `permno`) combination

Going forward, we will only need the linking table. This makes it a good place to stop this file ¹. If for future code, we need the linking_table. We can just read it from our RDS file. If we need to make changes to the table, we make it in this file and rerun the code from here.

Linking I/B/E/S

linking_table %>%
  select(gvkey, ticker) %>%
  distinct() %>%
  summarise(N = n(), .by = ticker) %>%
  filter(N > 1)

# A tibble: 0 × 2
# ℹ 2 variables: ticker <chr>, N <int>

There are no tickers linked with multiple gvkeys. This means that left_join from I/B/E/S is the way to start the joining process. That way, there will be no duplicate matches from Compustat.

Footnotes

Except for some checks on the linking table.↩︎