Posted by Matt on Tuesday, February 08, 2011.

A closer look at public data cleanup

Our CM1 carbon models depend on lots of public energy and emissions data relating to things like buildings and transportation—mostly from government sources like the EIA, BTS, and EPA. Whenever you use or mash up a large dataset you’ll likely run into issues like inconsistent file layouts, awkward formats, and human error. So file cleanup is a large part of the process we go through when plugging data into our system and making them available in our free data clearinghouse. Here’s a quick look at three of the main things we do; we’ll delve into more specifics in subsequent posts:

  • Error correction. Most datasets are pretty good, but with some containing millions of data points, human error is bound to work its way in. We use search algorithms to identify typos and outliers, and then manually correct or delete flawed entries.

    For example: The EPA Fuel Economy Guide files from 1985 through 2010 contain various typos and inconsistencies. We list corrections to each of these in an errata file that we apply during the import process.

  • Restructuring. Some databases have attribute names that change over time. Others have attributes that contain both numerical data and numerical codes and so require special analysis techniques. In both of these cases we parse the data into a single format that is convenient for calculations.

    For example: The EIA’s Commercial Buildings Energy Consumption Survey includes codes like 99999 for natural gas use when a building never uses natural gas. The EPA Fuel Economy Guide uses four different sets of attribute names from 1985 through 2010.

  • Data extraction. Some data come from report tables that are only available as a PDF file or in print. We manually extract these data and make them available in formats like CSV, JSON, XML, and SQL that can plug into the modern information ecosystem.

    For example: The APTA Public Transportation Fact Book Appendix A is only available in PDF format.
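To make the first two steps a bit more concrete, here is a minimal Python sketch of what a cleanup pass could look like: flag suspicious values for manual review, map year-specific attribute names onto one canonical set, turn coded values into missing values, and apply an errata list. Only the 99999 code comes from the example above. The column names, the errata entry, the threshold, and the function names are all hypothetical and made up for illustration. The outlier check uses the median absolute deviation, which is one simple choice among many and not necessarily the search we actually run.

```python
from statistics import median

# Hypothetical attribute names: each year's layout maps onto one canonical name.
COLUMN_ALIASES = {
    "mfr": "make",
    "Manufacturer": "make",
    "cty": "city_mpg",
    "City MPG": "city_mpg",
}

# Codes that mean "not applicable" rather than a real quantity.
# 99999 is the code mentioned above; the field name is an assumption.
SENTINELS = {"natural_gas_use": {99999}}

# Hypothetical errata entry: (field, wrong value) -> corrected value.
ERRATA = {("make", "CHEVORLET"): "Chevrolet"}


def flag_outliers(values, threshold=5.0):
    """Return values further than threshold * MAD from the median."""
    mid = median(values)
    mad = median(abs(v - mid) for v in values)
    if mad == 0:
        return []  # no spread to measure against, so flag nothing
    return [v for v in values if abs(v - mid) > threshold * mad]


def normalize_names(row):
    """Rename year-specific attribute names to the canonical ones."""
    return {COLUMN_ALIASES.get(key, key): value for key, value in row.items()}


def replace_sentinels(row):
    """Turn coded values such as 99999 into None so they never enter a sum."""
    return {
        key: None if value in SENTINELS.get(key, ()) else value
        for key, value in row.items()
    }


def apply_errata(row):
    """Swap known-bad values for their listed corrections."""
    return {key: ERRATA.get((key, value), value) for key, value in row.items()}


def clean(row):
    """Run one record through renaming, sentinel handling, and errata."""
    return apply_errata(replace_sentinels(normalize_names(row)))
```

Calling `clean({"mfr": "CHEVORLET", "cty": 22})` gives back a record keyed by `make` and `city_mpg` with the spelling corrected. Flagged outliers are only returned, not changed, since deciding whether to correct or delete an entry stays a manual step.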

We take all the public data we clean up for our clients’ use and post them on our data site so others can benefit as well. They’re available for download in multiple formats, and they’re kept up to date by automated import programs that crawl government agency websites and pull in the latest data updates as soon as they’re released.
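As a rough illustration of publishing the same cleaned records in more than one format, here is a small Python sketch that serializes a list of records to CSV and to JSON using only the standard library. The function names and the sample record are hypothetical, and this is not the code behind our data site.

```python
import csv
import io
import json


def to_csv(records):
    """Serialize a list of dicts as CSV text; the first record's keys form the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # None values are written as empty cells
    return buffer.getvalue()


def to_json(records):
    """Serialize the same records as indented JSON text."""
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    sample = [{"make": "Honda", "city_mpg": 30}]  # made-up record
    print(to_csv(sample))
    print(to_json(sample))
```

Keeping every format generated from one cleaned set of records means a correction only has to be made once.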

What blog is this?

Safety in Numbers is Brighter Planet's blog about climate science, Ruby, Rails, data, transparency, and, well, us.
