Scope
For every US-based facility served by ICS, GTL, and Securus, we aim to collect the following:
- The price of calls from a phone number that originates inside a facility's state for every phone service (e.g., collect calling, prepaid calling, etc.) that a facility offers.
- The price of calls from a phone number that originates outside of a facility's state for every phone service that a facility offers.
For in-state calls, we use the governor's number as the origin number, and for out-of-state calls, we use a randomly selected governor's number from a different state. We hold the randomly selected number constant over time. For example, we will always use the Maryland governor's number to check out-of-state rates in Alabama.
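The constant pairing described above can be sketched as follows. This is an illustrative reconstruction, not the project's actual code: the phone numbers are placeholders, and seeding a random generator with the facility's state is one assumed way to keep the out-of-state choice stable over time.

```python
import random

# Placeholder numbers for illustration only; not real governors' lines.
STATE_GOVERNOR_NUMBERS = {
    "AL": "+13342427100",
    "MD": "+14109743901",
    "TX": "+15124632000",
}

def out_of_state_origin(facility_state: str) -> str:
    """Pick a governor's number from a different state, deterministically.

    Seeding the RNG with the facility's state means the same out-of-state
    number is chosen on every run, matching the "held constant over time"
    behavior described above.
    """
    candidates = sorted(s for s in STATE_GOVERNOR_NUMBERS if s != facility_state)
    rng = random.Random(facility_state)
    return STATE_GOVERNOR_NUMBERS[rng.choice(candidates)]
```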
Note: Because our data is scraped from private entities, it is updated and amended on a schedule we do not control. Further, our data represents telecom rates as those entities report them publicly, rather than the rates charged in practice.
Scraping
The first step in pulling rate data from telecom providers is to run automated web scrapers against the publicly available rate calculators provided by each of the telecom companies we include in our data set.
You can view the URLs of these rate calculators in the CSV containing our latest data, and the source code of our web scrapers on GitHub.
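As a rough sketch of what one scrape step produces: each calculator response gets shaped into a raw record. Everything here is hypothetical, since the real calculators' endpoints and response formats are vendor-specific; `EXAMPLE_RESPONSE` and the field names are invented for illustration.

```python
import json

# Hypothetical payload mimicking what a vendor's rate calculator might return.
EXAMPLE_RESPONSE = json.dumps({
    "facility": "La Tuna FSL",
    "state": "TX",
    "firstMinute": "0.25",
    "addlMinute": "0.21",
})

def to_raw_record(payload: str, vendor: str) -> dict:
    """Shape one calculator response into a raw scrape record."""
    data = json.loads(payload)
    return {
        "vendor": vendor,
        "facility_name": data["facility"],
        "state": data["state"],
        "first_minute_rate": float(data["firstMinute"]),
        "additional_minute_rate": float(data["addlMinute"]),
    }
```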
Automated Cleaning
The raw scraped data that we save is full of incomplete records. For example, we often encounter facilities with NULL rates reported. We discard these rates at this point.
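The discard step amounts to a filter over the raw records. A minimal sketch, assuming illustrative field names (`first_minute_rate`, `additional_minute_rate`) that may differ from the real schema:

```python
def drop_incomplete(records: list[dict]) -> list[dict]:
    """Keep only records whose rate fields are present (non-NULL)."""
    return [
        r for r in records
        if r.get("first_minute_rate") is not None
        and r.get("additional_minute_rate") is not None
    ]
```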
Normalizing
In order to build a unified schema from disparate origin sources, we need to transform the names of fields we collect, and, in some cases, do some math to normalize rates. In our database, a rate is calculated by adding the price of the first minute, the price of an additional minute, and the price of any associated taxes. You can find the full transformation logic here.
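The rate calculation described above reduces to a simple sum. A sketch under assumed field names (the real transformation logic lives in the linked source):

```python
def normalized_rate(first_minute: float, additional_minute: float, taxes: float) -> float:
    """Normalized rate = first minute + one additional minute + taxes."""
    return round(first_minute + additional_minute + taxes, 2)
```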
Geotagging
The telecom companies' rate calculators report each facility's name and state, but not detailed location information. We use the Google Places API, the programmatic equivalent of a Google Maps search, to guess the exact location of every facility reported by the rate calculators given its name and state.
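Concretely, a name-plus-state lookup can be issued against the Places Text Search endpoint. This is a sketch only: the project's actual request code, parameters, and error handling are not shown in this document, and `places_query_url` is a hypothetical helper.

```python
import urllib.parse

# Google Places Text Search endpoint (real API; key and usage terms required).
PLACES_ENDPOINT = "https://maps.googleapis.com/maps/api/place/textsearch/json"

def places_query_url(facility_name: str, state: str, api_key: str) -> str:
    """Build a Text Search URL for '<facility name>, <state>'."""
    params = {"query": f"{facility_name}, {state}", "key": api_key}
    return f"{PLACES_ENDPOINT}?{urllib.parse.urlencode(params)}"
```

Fetching that URL returns candidate places with latitude/longitude, from which the best match is taken as the facility's location.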
Joining with external data
A core goal of PTDP is to make it possible for researchers to contextualize the rates reported by telecom providers, for example by providing information about the population of facilities paying a particular rate. The challenge here is that the facility names reported by telecom providers may vary in subtle and significant ways from the facility names reported by external data providers (like the Department of Homeland Security).
Compare, for example, "FSL LA TUNA (EL PASO)", Homeland Security's canonical name for a Federal facility in El Paso, with GTL's name for the same facility, "Federal Bureau of Prisons TX-La Tuna FSL (El Paso)".
In order to perform the data join, we use a combination of spatial joins (checking whether the Google Places reported latitude and longitude fall within 0.5 miles of the external data source's reported latitude and longitude) and fuzzy string matching (checking whether the token set ratio is > 75).
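The two checks above can be sketched as follows. Assumptions: the distance check uses the haversine formula; `token_set_ratio` below is a simplified stand-in for the measure commonly provided by libraries like thefuzz; and combining the two checks with AND is an assumption, since the document says only that a "combination" is used.

```python
import math
from difflib import SequenceMatcher

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    r = 3958.8  # Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def token_set_ratio(a: str, b: str) -> float:
    """Simplified token-set similarity on a 0-100 scale.

    Compares the shared tokens against each string's full token set, so
    reordered or partially overlapping names still score highly.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(ta & tb))
    s1 = (common + " " + " ".join(sorted(ta - tb))).strip()
    s2 = (common + " " + " ".join(sorted(tb - ta))).strip()
    return 100 * max(
        SequenceMatcher(None, common, s1).ratio(),
        SequenceMatcher(None, common, s2).ratio(),
        SequenceMatcher(None, s1, s2).ratio(),
    )

def likely_same_facility(scraped: dict, canonical: dict) -> bool:
    """Candidate match: within 0.5 miles and token set ratio > 75 (assumed AND)."""
    close = haversine_miles(scraped["lat"], scraped["lon"],
                            canonical["lat"], canonical["lon"]) <= 0.5
    similar = token_set_ratio(scraped["name"], canonical["name"]) > 75
    return close and similar
```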
By default, we hide any rate data that we are unable to confidently tie to a canonical facility in the Department of Homeland Security's Prison Boundary Dataset. We only reintroduce such rates after manual review. You can view the details of our approach to reconciling facilities in this Jupyter notebook.
Manual Correction
The above approach to joining vendor-reported data to external data is by its nature imprecise. We rely on volunteers to audit our final results and provide manual corrections to fields to compensate for the imprecision of the automated approach.