Using data to inform and reform policies isn’t new. However, to analyze data, you first need access to it, ideally already processed and formatted. There’s no unified database for parking minimums in the United States, so we were tasked with building one.
What are parking minimums?
Parking minimums are a set of parking requirements for new developments, ranging from residential to commercial buildings. Developers have to, at a minimum, build a certain number of parking spaces, which can be based on characteristics from square footage to number of employees.
Parking minimums are also known as parking requirements, parking ratios, or parking schedules. These tables are scattered across municipalities’ Codes of Ordinances or Unified Development Ordinances (UDOs).
Our ten-week project’s main goal was to build and populate a database to hold all parking minimum requirements across the United States, using automation to speed up the data parsing.
We mainly web-scraped the requirements from Municode, a digital library of Codes of Ordinances from across the United States. Our program takes the URL of a parking section as input, scrapes the page, and processes the data. The resulting entries are then inserted into the database.
The data lives in a PostgreSQL database hosted on Supabase. Each entry records the state, region, and use case, which together map to the raw requirement text.
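To make the shape of an entry concrete, here is a minimal sketch of one possible schema. The table name, column names, and sample row are assumptions made for illustration, and SQLite stands in for PostgreSQL only so the example runs without a server.

```python
import sqlite3

# Hypothetical schema: the table and column names are assumptions, and
# SQLite stands in for PostgreSQL so the sketch runs anywhere.
SCHEMA = """
CREATE TABLE parking_minimums (
    state           TEXT NOT NULL,
    region          TEXT NOT NULL,
    use_case        TEXT NOT NULL,
    raw_requirement TEXT NOT NULL,
    PRIMARY KEY (state, region, use_case)
)
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database containing the empty table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def insert_entry(conn, state, region, use_case, raw_requirement) -> None:
    """Insert one entry; parameters are bound, never string-formatted."""
    conn.execute(
        "INSERT INTO parking_minimums VALUES (?, ?, ?, ?)",
        (state, region, use_case, raw_requirement),
    )


def lookup(conn, state, region, use_case):
    """Return the raw requirement for a (state, region, use case), or None."""
    row = conn.execute(
        "SELECT raw_requirement FROM parking_minimums "
        "WHERE state = ? AND region = ? AND use_case = ?",
        (state, region, use_case),
    ).fetchone()
    return row[0] if row else None


conn = make_db()
# Made-up sample row, not real data from the project.
insert_entry(conn, "NC", "Exampleville", "restaurant", "1 space per 100 square feet")
print(lookup(conn, "NC", "Exampleville", "restaurant"))
```

Keying on state, region, and use case together means the same use can carry a different requirement in every municipality.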
Web-scraping parking requirements
To extract data from the parking tables efficiently, we developed a web scraper tailored to our purpose and to a website like Municode.
We used Selenium WebDriver to obtain the HTML of a webpage, then parsed that HTML according to how the data is laid out. In our case, most of the information is presented in tables and bullet points, so we have functions that extract data from both structures.
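As an illustration of the table-extraction step, here is a minimal sketch that pulls every table out of an HTML string using only Python’s standard library. It is not the project’s actual parser: the class and function names are hypothetical, and it assumes well-formed, non-nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # rows of the table being built, or None
        self._row = None    # cells of the row being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                # Join fragments and collapse runs of whitespace.
                self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr":
            if self._row is not None and self._table is not None:
                self._table.append(self._row)
            self._row = self._cell = None
        elif tag == "table":
            if self._table is not None:
                self.tables.append(self._table)
            self._table = self._row = self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_tables(html: str):
    """Return all tables found in `html`, each as a list of rows."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

With Selenium, the input string would typically be `driver.page_source`, read after the page has finished loading.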
We ran into one difficulty with our table-parsing function. When scraping tabular data, it collects every table on a webpage, including the ones we don’t need. We tried to write a function that would automatically select the relevant tables, using a natural language processing (NLP) machine learning approach: each table is assessed by its column names and categorized as “useful” or “useless.”
Ideally, the model would reach above 95% accuracy before we put it to use. However, our trained model reached only 80%, so we still had to double-check its predictions and select some tables by hand. Because the model didn’t make adding data to the database any faster, we ended up abandoning it.
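A simpler alternative to a trained model is to rank tables by how parking-related their column names look and leave the final choice to a person. The sketch below is a plain keyword heuristic, not the NLP model described above; the keyword list and function names are assumptions.

```python
import re

# Hypothetical keyword list; it would need tuning against real headers.
PARKING_TERMS = {"parking", "space", "spaces", "stalls", "use", "required", "minimum"}


def header_score(headers):
    """Fraction of header cells containing at least one parking-related term."""
    if not headers:
        return 0.0
    hits = 0
    for cell in headers:
        words = set(re.findall(r"[a-z]+", cell.lower()))
        if words & PARKING_TERMS:
            hits += 1
    return hits / len(headers)


def rank_tables(tables):
    """Sort tables (lists of rows, header row first) from most to least likely."""
    return sorted(
        tables,
        key=lambda table: header_score(table[0]) if table else 0.0,
        reverse=True,
    )
```

Ranking rather than filtering keeps every table available, so a wrong score costs a reviewer a few seconds instead of silently dropping data.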
For most of our project, we found the parking code URLs manually, which is tedious. There are almost 3,500 municipalities in the United States. At a generous estimate of five minutes per parking code on Municode, finding them all would take about 290 hours. We don’t expect a crawler to find every single parking code, but if it could find even one out of three, we’d save roughly 100 hours of manual work.
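The arithmetic behind that estimate, using only the figures quoted above:

```python
MUNICIPALITIES = 3_500    # count quoted in the text
MINUTES_PER_SEARCH = 5    # estimate quoted in the text
CRAWLER_HIT_RATE = 1 / 3  # "one out of three"

total_hours = MUNICIPALITIES * MINUTES_PER_SEARCH / 60
saved_hours = total_hours * CRAWLER_HIT_RATE

print(f"manual search: {total_hours:.0f} hours")    # manual search: 292 hours
print(f"saved by crawler: {saved_hours:.0f} hours")  # saved by crawler: 97 hours
```

The text rounds these to about 290 and 100 hours.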
What if we could write a script that searches for keywords and finds the parking code for us? We experimented with Python’s Scrapy library (which also has scraping functions) and focused on its “crawling” ability to find the pages we need.
However, Municode has dynamic elements, like a search bar and drop-down tabs, that only appear once a browser has executed the page’s JavaScript. We settled on scrapy-playwright, a plugin that lets Scrapy drive a real browser through Playwright.
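For reference, enabling scrapy-playwright in a Scrapy project usually comes down to a few settings. The fragment below follows the plugin’s documented setup; the project’s actual configuration isn’t shown in this post, and the `PLAYWRIGHT_META` name is just an illustrative constant.

```python
# settings.py (sketch): route HTTP(S) downloads through Playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires Twisted's asyncio-based reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Per-request opt-in: pass this dict as `meta` on a scrapy.Request so that
# the page is rendered in a browser before the response reaches the spider.
PLAYWRIGHT_META = {"playwright": True}
```

Requests without the `playwright` meta key still go through Scrapy’s regular downloader, so only pages that need JavaScript pay the cost of a browser.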
Our crawler is able to search for keywords in the search bar, wait for its results, and extract the first link.
This feature isn’t fully implemented: the spider can only search for one keyword and only extracts the first link, which isn’t always the page that contains the parking code.
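One possible way around the first-link limitation is to score every search result against several keywords and keep the best match. This is a sketch of that idea rather than something our spider does today; the function names, keywords, and sample links are all hypothetical.

```python
def score_link(text, keywords):
    """Count how many keywords appear in a link's visible text."""
    lowered = text.lower()
    return sum(1 for keyword in keywords if keyword.lower() in lowered)


def best_link(results, keywords):
    """Pick the href whose link text matches the most keywords.

    `results` is a list of (link_text, href) pairs. Returns None when the
    list is empty or when no link matches any keyword.
    """
    best = max(results, key=lambda r: score_link(r[0], keywords), default=None)
    if best is None or score_link(best[0], keywords) == 0:
        return None
    return best[1]


# Made-up search results for illustration.
results = [
    ("Sec. 4.1 Sign regulations", "/example/signs"),
    ("Sec. 7.2 Off-street parking requirements", "/example/parking"),
]
print(best_link(results, ["parking", "off-street"]))  # /example/parking
```

Returning None when nothing matches lets the crawler hand that municipality back to a person instead of guessing.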
What we would have done differently
We wanted to automate as much of the process as possible, from trying to parse every single format of parking minimums to using machine learning to find the right table. But it’s difficult to get computers to work 100% correctly, while it’s comparatively easy to speed up a human’s workflow. Rather than approaching this project as an end-to-end system with no humans involved, we realized our time would have been better spent building tools that make the manual work easier.
We were also torn between inserting as many entries as possible and building a viable system to onboard other volunteers. As a result, we developed our personal scraping and parsing pipeline without fully considering how other people can use it.
What comes next
There are a few more core and auxiliary features that can be implemented:
- Build an interface, like a website, for the general public to view the database and contribute to some of the manual steps of the workflow
- Parse the raw requirements (text) into numbers, in order for the database to be useful in data analysis (like aggregations and comparisons)
- Standardize data entries (e.g. “laundromat” and “laundry and cleaning service” can be processed as one category, like “laundry service”)
- Combine our crawler and scraper into one unit
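To give a sense of what the parsing and standardization items could look like, here is a minimal sketch. It handles only phrasing like “1 space per 300 square feet”; the regular expression, the synonym map, and the function names are assumptions, and real ordinance text would need many more patterns.

```python
import re

# Hypothetical synonym map; a real one would be built from the database.
USE_SYNONYMS = {
    "laundromat": "laundry service",
    "laundry and cleaning service": "laundry service",
}


def normalize_use(use_case):
    """Lowercase, collapse whitespace, and map known synonyms to one label."""
    key = " ".join(use_case.lower().split())
    return USE_SYNONYMS.get(key, key)


# Matches e.g. "1 space per 300 square feet" or "2.5 spaces per 1,000 sq. ft."
REQUIREMENT_RE = re.compile(
    r"(?P<spaces>\d+(?:\.\d+)?)\s+spaces?\s+(?:per|for\s+each|for\s+every)\s+"
    r"(?:(?P<quantity>\d[\d,]*(?:\.\d+)?)\s+)?(?P<unit>[a-z][a-z .\-]*)",
    re.IGNORECASE,
)


def parse_requirement(raw):
    """Turn raw requirement text into numbers, or None if it doesn't match."""
    match = REQUIREMENT_RE.search(raw)
    if match is None:
        return None
    quantity = match.group("quantity")
    return {
        "spaces": float(match.group("spaces")),
        # A missing quantity ("per dwelling unit") means one unit.
        "per_quantity": float(quantity.replace(",", "")) if quantity else 1.0,
        "unit": match.group("unit").strip(" ."),
    }


print(normalize_use("Laundromat"))
print(parse_requirement("1 space per 300 square feet"))
```

Returning None for unmatched text keeps unparsed requirements visible for manual review rather than storing a wrong number.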
Our codebase can be publicly accessed on GitHub. Contributors are always welcome! If you’re interested in contributing in a non-coding fashion, please open a GitHub issue at https://github.com/ParkingReformNetwork/parking-requirement-database.