Data Sourcing and Wrangling
A social science and spatial perspective
Affordable Housing, San Diego County, 2021
Point layer of affordable housing developments in San Diego County, as updated by Homelessness Hub at UC San Diego. Current to 2021.
VIEW DATASET
At the core of any research project is the challenge of sourcing good-quality data and preserving its quality and completeness through the process of cleaning and "wrangling" it into formats that can be analyzed statistically or spatially. Analyses and findings are only as good as the data that inform them; poor-quality or poorly formatted data cannot be analyzed or modeled into robust inferences or summaries.
Therefore, it is important to understand how to address data challenges such as incompatible formats, missing or incorrect values, and disparate categorizations of data across sources. Understanding these challenges allows us both to better reclaim data from different sources and to produce and publish data that are understandable, usable, and transparent. Homelessness Hub at UC San Diego helps to fulfill both of these roles: we gather and standardize disparate data sources, and we publish and maintain datasets to these standards so that researchers have both datasets and dataset standards to work with.
Below we present a case study in wrangling several different data sources to create an inventory of affordable housing in San Diego County in 2020.
The Data
Below are representations of five separate data sources that eventually made up the affordable housing inventory. Each entry describes the raw data source's quality, temporality, update cycle, available format, and tabular structure (if any).
1) County of San Diego HHSA
- Countywide list of affordable housing properties, available in PDF form
- Updated yearly
2) San Diego Housing Commission
- List of affordable housing properties in City of San Diego
- Has categories for affordable housing properties (AHPs) and single room occupancy properties (SROs)
- Available in PDF form
- Updated yearly
3) County of San Diego HHSA / Tablecloth
- Internal spreadsheet compiled from several publicly available sources, including various city housing commissions
- Compiled only once, in early 2020
- Includes additional area median income (AMI) fields not shown here
4) Regional Task Force on the Homeless
- Housing Inventory Count (HIC)
- Publicly available spreadsheet of providers of shelter for the unhoused
- Updated yearly
- Includes many other fields not shown here
5) 211 San Diego
- Public website of affordable housing properties, linked to an interactive map
- Not easily machine-readable / web-scrapable
- Update cycle not clear
Data Objectives
Ideally, we want to have a dataset that is reconciled on many levels:
- Format (address, phone number, etc.)
- Categories (clientele, housing type, municipality, etc.), reconciled in scope and meaning where possible
- Granularity of each observation (should an observation be a single property or a project?)
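As an illustration of the first objective, the sketch below normalizes phone numbers and street addresses in Python. The function names, the abbreviation table, and the target formats are assumptions made for this example; they are not the standards used for the inventory itself.

```python
import re

# Hypothetical suffix table for illustration; a real project would likely need
# a fuller list and a rule for ambiguous cases (for example, "St" as "Saint").
STREET_ABBREVIATIONS = {
    "street": "St", "st": "St",
    "avenue": "Ave", "ave": "Ave",
    "boulevard": "Blvd", "blvd": "Blvd",
    "road": "Rd", "rd": "Rd",
    "drive": "Dr", "dr": "Dr",
}


def normalize_phone(raw):
    """Return a US phone number as '(XXX) XXX-XXXX', or None if it cannot be parsed."""
    digits = re.sub(r"\D", "", raw or "")      # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                    # drop a leading country code
    if len(digits) != 10:
        return None                            # leave for manual review
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


def normalize_address(raw):
    """Collapse whitespace, fix capitalization, and standardize street suffixes."""
    words = re.sub(r"\s+", " ", raw.strip()).split(" ")
    cleaned = []
    for word in words:
        key = word.rstrip(".,").lower()
        if key in STREET_ABBREVIATIONS:
            cleaned.append(STREET_ABBREVIATIONS[key])
        else:
            # capitalize() keeps ordinals such as "4th" intact, unlike title()
            cleaned.append(word.rstrip(",").capitalize())
    return " ".join(cleaned)
```

Returning `None` for phone numbers that cannot be parsed, rather than guessing, keeps questionable values visible so they can be checked by hand.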
Data Challenges
- Converting
- Converting formats can be arduous (for example, PDF or HTML to tabular format)
- Reliability depends on the structure of the HTML code or PDF
- Python and regular expressions (regex) are powerful tools but have a steep learning curve
- More complex websites and document structures are harder to parse
- Compiling
- Different categorizations of data from each source
- Must find ways to reconcile / recategorize data without misrepresentation or too much reduction in detail
- A manual process that is time-intensive and prone to human error
- Verifying
- Must enlist the help of other service providers, developers, and subject matter experts to verify the validity of the data and of the data wrangling process
- Keeping Data Current
- Each entity / source has a different update cycle (which may not be detailed clearly)
- Checking for updates is time- and energy-intensive
- Updating is even more time- and energy-intensive, so it is important to establish plans and protocols for doing it
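The converting and compiling challenges can be sketched in Python. This is a minimal illustration under stated assumptions, not the workflow used for the inventory: the line layout matched by `LINE_PATTERN`, the crosswalk labels other than AHP and SRO, and the function names are all hypothetical.

```python
import re

# Hypothetical layout of one line of text extracted from a PDF listing:
#   "<property name>  <street address>, <city> <zip>  <unit count>"
# Real PDFs differ, so this pattern would need to be adapted per source.
LINE_PATTERN = re.compile(
    r"^(?P<name>.+?)\s{2,}"
    r"(?P<address>.+?),\s*"
    r"(?P<city>[A-Za-z .]+?)\s+"
    r"(?P<zip>\d{5})\s+"
    r"(?P<units>\d+)$"
)

# Hypothetical crosswalk from source-specific labels to one shared vocabulary.
CATEGORY_CROSSWALK = {
    "ahp": "Affordable housing property",
    "affordable housing": "Affordable housing property",
    "sro": "Single room occupancy",
    "single room occupancy": "Single room occupancy",
}


def parse_lines(lines):
    """Split extracted text lines into parsed records and lines needing manual review."""
    records, unparsed = [], []
    for line in lines:
        text = line.strip()
        match = LINE_PATTERN.match(text)
        if match:
            record = match.groupdict()
            record["units"] = int(record["units"])
            records.append(record)
        elif text:
            # Keep non-matching, non-blank lines instead of silently dropping them
            unparsed.append(text)
    return records, unparsed


def recategorize(label):
    """Map a source label to the shared category, flagging unknown labels."""
    return CATEGORY_CROSSWALK.get(label.strip().lower(), "NEEDS REVIEW")
```

Collecting the lines that do not match, and flagging labels that are missing from the crosswalk, leaves a short list for a person to review rather than letting records disappear or be miscategorized during conversion.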