Posted: September 6, 2010 Filed under: api, finance, open data, opencharities, openlylocal, semantic web | Tags: charities, data, linked data, Open Government, opencharities, opendata, opsi, register
A couple of weeks ago I needed a list of all the charities in the UK and their registration numbers so that I could try to match them up to the local council spending data OpenlyLocal is aggregating and trying to make sense of. A fairly simple request, you’d think, especially in this new world of transparency and open data, and for a dataset that’s uncontentious.
Well, you’d be wrong. There’s nothing at data.gov.uk, nothing at CKAN and nothing on the Charity Commission website, and in fact you can’t even see the whole register on the website, just the first 500 results of any search/category. Here’s what the Charities Commission says on their website (NB: extract below is truncated):
The Commission can provide an electronic copy in discharge of its duty to provide a legible copy of publicly available information if the person requesting the copy is happy to receive it in that form. There is no obligation on the Commission to provide a copy in this form…
The Commission will not provide an electronic copy of any material subject to Crown copyright or to Crown database right unless it is satisfied… that the Requestor intends to re-use the information in an appropriate manner.
Hmmm. Time for Twitter to come to the rescue to check that some other independently minded person hasn’t already solved the problem. Nothing, but I did get pointed to this request for the data to be unlocked, with the very recent response by the Charity Commission, essentially saying, “Nope, we ain’t going to release it”:
For resource reasons we are not able to display the entire Register of Charities. Searches are therefore limited to 500 results… We cannot allow full access to all the data, held on the register, as there are limitations on the use of data extracted from the Register… However, we are happy to consider granting access to our records on receipt of a written request to the Departmental Record Officer
Try turning off both in your browser, and see how far you get, and then you’ll also get an idea of how difficult it is to use if you have accessibility issues – and check out their poor excuse for accessibility statement, i.e. tough luck.
Still, there’s usually a way, even if it does mean some pretty tortuous routes, and like the similarly inaccessible Birmingham City Council website, this is just the sort of challenge that stubborn so-and-so’s like me won’t give up on.
And because this sort of thing is so easy, once you’ve got it in a database (Charity Commission take note), there are a couple of bonuses.
First, it was relatively easy to knock up a quick and very simple Sinatra application, OpenCharities:
If there’s any interest, I’ll add more features to it, but for now, it’s just a the simplest of things, a web application with a unique URL for every charity based on its charity number, and with the basic information for each charity is available as data (XML, JSON and RDF). It’s also searchable, and sortable by most recent income and spending, and for linked data people there are dereferenceable Resource URIs.
This is very much an alpha application: the design is very basic and it’s possible that there are a few charities missing – for two reasons. One: the Charity Commission kept timing out (think I managed to pick up all of those, and they should get picked up when I periodically run the scraper); and two: there appears to be a bug in the Charity Commission website, so that when there’s between 10 and 13 entries, only 10 are shown, but there is no way of seeing the additional ones. As a benchmark, there are currently 150,422 charities in the OpenCharities database.
It’s also worth mentioning that due to inconsistencies with the page structure, the income/spending data for some of the biggest charities is not yet in the system. I’ve worked out a fix, and the entries will be gradually updated, but only as they are re-scraped.
The second bonus is that the entire database is available to download and reuse (under an open, share-alike attribution licence). It’s a compressed CSV file, weighing in at just under 20MB for the compressed version, and should probably only attempted by those familiar with manipulating large datasets (don’t try opening it up in your spreadsheet, for example). I’m also in the process of importing it into Google Fusion Tables (it’s still churning away in the background) and will post a link when it’s done.
Now, back to that spending data.