An Open Letter to Data Providers: Clean Formats, Please

Subject: An Open Letter to Data Providers: Clean Formats, Please
From: Drew Skau
Date: 12 Mar 2015

Dear Data Collectors and Providers,

Over the past few weeks, I have worked on a few projects that involve collecting data from various sources. The experience has ranged from immensely time-consuming and frustrating to mildly obnoxious. The open data movement seems to have improved things somewhat over the past few years, but there’s still a lot of room for improvement. This may seem obvious to people who deal with data on a daily basis, but so many data providers still don’t have it right that we should spell it out.

By far, the worst offender I’ve encountered has been the NC Department of Health and Human Services. While I was working on A Closer Look at NC’s Amendment One with a colleague at UNCC, a viewer suggested that HIV/AIDS data might be relevant to include in the visualization. While HIV and AIDS have long been proven not to be a “gay disease,” the stigma is still there for many people, so the suggestion actually made a lot of sense. Better yet, the NCDHHS has that data, organized by county.

The problem? It is in tables in PDF files.

PDF is a widely accepted document format with free readers and a well-documented set of specifications for others to use for implementation. PDFs are great for documents where graphic style and layout are important.

But neither of these is important for data. The goal of releasing data like this is to allow people to actually use it to learn and analyze. Looking at a nicely formatted table of numbers laid out on a page that you cannot format or rearrange is NOT a good way to do either.

By far, the biggest downside to using PDF files for data is that the data can’t readily be used by a computer. PDF files are meant to be read by humans, with a focus on layout and styling. All that extra graphic information gets in the way for a computer and obscures the structure of the data. In some cases, even copy-pasting the tables doesn’t preserve the data columns, as everything gets merged into a single column.

So, what are some alternatives to PDFs?

Comma Separated Values/Tab Separated Values will always be great for tabular data. These are very simple formats, and they compress well. You would be hard-pressed to find a tool for working with data that doesn’t support CSV/TSV.
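To show how little a CSV file asks of the person on the receiving end, here is a minimal Python sketch using only the standard library. The column names and values are made up for illustration and are not real NCDHHS figures; the helper name `total_by_county` is likewise hypothetical.

```python
import csv
import io

# Hypothetical sample: made-up county names and counts, not real data.
SAMPLE_CSV = """county,year,cases
Alpha,2010,12
Beta,2010,7
Alpha,2011,15
"""

def total_by_county(csv_text):
    """Sum the 'cases' column per county from CSV text."""
    totals = {}
    # DictReader uses the header row to name each field.
    for row in csv.DictReader(io.StringIO(csv_text)):
        county = row["county"]
        totals[county] = totals.get(county, 0) + int(row["cases"])
    return totals

totals = total_by_county(SAMPLE_CSV)
print(totals)
```

The same few lines would work on a file opened from disk; nothing has to be retyped or untangled from a page layout first.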

JavaScript Object Notation is growing in popularity due to so many web-based tools for dealing with data. JSON is great because it can support relationships between data items that are much more complex than what can be easily represented in a table format. It also works instantly with any JavaScript-based web visualization toolkit. JSON does have some downsides because it can be very verbose, which inflates file sizes, but compression easily remedies this for data repositories.
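As a rough illustration of both points, nested structure and compression, here is a small Python sketch. The record layout and numbers are invented for the example, and because the sample simply repeats one record it compresses far better than real data would; treat the size difference as directional only.

```python
import gzip
import json

# Hypothetical record: one county with nested yearly values (made-up numbers).
record = {
    "county": "Alpha",
    "years": [
        {"year": 2010, "cases": 12},
        {"year": 2011, "cases": 15},
    ],
}

# Repeat the record to imitate a larger, verbose dataset.
dataset = [record] * 500

raw = json.dumps(dataset).encode("utf-8")
packed = gzip.compress(raw)

# Round trip: decompressing and parsing restores the nested structure.
restored = json.loads(gzip.decompress(packed).decode("utf-8"))

print(len(raw), len(packed))
```

The nested list under "years" is the kind of relationship that would need a second table, or repeated columns, in a flat format.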

Excel is not the best option, but it is still better than PDF. Excel files have limits on the number of rows and columns, and they aren’t always compatible even with other versions of Excel. In terms of accessibility, this can be a nightmare. But if you can open a file, converting it to better formats is fairly simple.

Another problem with many data sources is the interface for retrieving the data. The US Census data website, for example, offers a wealth of data. Unfortunately, the data is organized into a huge list of categories that are each smaller than they probably should be. This means that to gather a wide range of census data, one must tediously go through the drop-down lists to download each category.

The RunKeeper website is another example of a poor interface for gathering data. Over the last year or so, I have accumulated 132 activities, each with GPS data. Kudos to RunKeeper for allowing me to download that data, but the website only allows you to download that data one activity at a time. This is a really frustrating way to go about gathering up this data, and it would be great to have an export area where one could download all data, or all running data, or just cycling data (or whichever other activity you have been tracking with the app). [Edit: Runkeeper does have a download everything button hidden in the settings page of your account.]

Google Latitude is yet another offender. They offer location data for download in 30-day increments in KML file format. Being able to grab only a month at a time is frustrating. But even more so is the lack of file options. KML files are all right because they are intended for that type of data and for computers to read. But there are no other options available to us, and being able to grab a TSV/CSV would be great.
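Until providers offer that option, the conversion falls to the user. Here is a hedged Python sketch of one way to flatten KML into CSV with the standard library. It assumes simple Point placemarks in the KML 2.2 namespace; an actual Latitude export may be structured differently (for example as tracks), so the element paths would need adjusting. The sample coordinates and the function name `kml_points_to_csv` are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace used by KML 2.2 documents.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Hypothetical sample with made-up placemarks.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>stop-1</name>
      <Point><coordinates>-80.8431,35.2271,0</coordinates></Point>
    </Placemark>
    <Placemark>
      <name>stop-2</name>
      <Point><coordinates>-80.7,35.3</coordinates></Point>
    </Placemark>
  </Document>
</kml>
"""

def kml_points_to_csv(kml_text):
    """Return CSV text with one row per Point placemark."""
    root = ET.fromstring(kml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "longitude", "latitude"])
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS)
        coords = placemark.findtext(
            "kml:Point/kml:coordinates", default="", namespaces=KML_NS
        ).strip()
        if not coords:
            continue  # skip placemarks without a Point geometry
        # KML stores coordinates as longitude,latitude[,altitude].
        lon, lat = coords.split(",")[:2]
        writer.writerow([name, lon, lat])
    return out.getvalue()

csv_text = kml_points_to_csv(SAMPLE_KML)
print(csv_text)
```

It is not much code, but every user of the data has to write something like it, which is the argument for offering CSV/TSV at the source.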

By no means are these the only offenders. There are countless other data providers that just don’t get it yet.

We all benefit from people gathering knowledge and insight from data, and with the quantities that are collected today, computers are a necessary part of that process. Unfortunately, getting that data into computer-usable formats is still a struggle. Building the infrastructure for collecting it was the hard part; now we just need everyone to do their part in ensuring that data is as accessible as possible.

Sincerely,
Drew Skau
