Introducing OpenCharities: Opening up the Charities Register

A couple of weeks ago I needed a list of all the charities in the UK and their registration numbers so that I could try to match them up to the local council spending data OpenlyLocal is aggregating and trying to make sense of. A fairly simple request, you’d think, especially in this new world of transparency and open data, and for a dataset that’s uncontentious.

Well, you’d be wrong. There’s nothing at data.gov.uk, nothing at CKAN and nothing on the Charity Commission website; in fact you can’t even see the whole register on the website, just the first 500 results of any search or category. Here’s what the Charity Commission says on its website (NB: the extract below is truncated):

The Commission can provide an electronic copy in discharge of its duty to provide a legible copy of publicly available information if the person requesting the copy is happy to receive it in that form. There is no obligation on the Commission to provide a copy in this form…

The Commission will not provide an electronic copy of any material subject to Crown copyright or to Crown database right unless it is satisfied… that the Requestor intends to re-use the information in an appropriate manner.

Hmmm. Time for Twitter to come to the rescue to check that some other independently minded person hasn’t already solved the problem. Nothing, but I did get pointed to this request for the data to be unlocked, with the very recent response by the Charity Commission, essentially saying, “Nope, we ain’t going to release it”:

For resource reasons we are not able to display the entire Register of Charities. Searches are therefore limited to 500 results… We cannot allow full access to all the data, held on the register, as there are limitations on the use of data extracted from the Register… However, we are happy to consider granting access to our records on receipt of a written request to the Departmental Record Officer

OK, so it seems as though they have no intention of making this data available anytime soon (I don’t actually buy that there are Intellectual Property or Data Privacy issues with making basic information about charities available, and if there really are, this needs to be changed, pronto), so it was time for some screen-scraping. It turns out to be a pretty difficult website to scrape, because it requires both cookies and JavaScript to work properly.

Try turning off both in your browser and see how far you get; you’ll also get an idea of how difficult it is to use if you have accessibility issues – and check out their poor excuse for an accessibility statement, i.e. tough luck.

Still, there’s usually a way, even if it does mean some pretty tortuous routes, and like the similarly inaccessible Birmingham City Council website, this is just the sort of challenge that stubborn so-and-sos like me won’t give up on.

And the way to get the info turned out to be the geographical search (the other routes relied on JavaScript), and although it was still problematic, it was doable. So now we have an open data register of charities, incorporated into OpenlyLocal and tied in to the spending data being published by councils.
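
For anyone tempted to try something similar, the shape of the approach looks roughly like the sketch below. It uses the Mechanize gem (which takes care of the cookies the site insists on); the URL, form field and selectors are purely illustrative guesses, not the real Charity Commission page structure:

    require 'mechanize'  # handles the cookie jar the site insists on

    agent = Mechanize.new
    agent.user_agent_alias = 'Mac Safari'

    # Hypothetical entry point and markup – the real URL, form fields and
    # results table have to be discovered by inspecting the site itself.
    page = agent.get('http://example.com/registeredcharities/geographical-search')

    form = page.forms.first
    form['region'] = 'London'                          # illustrative field name
    results = agent.submit(form)

    results.search('table.results tr').each do |row|   # illustrative selector
      cells = row.search('td').map { |cell| cell.text.strip }
      puts cells.join(' | ') unless cells.empty?
    end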

[Screenshot: Charity supplier to local authority]

And because this sort of thing is so easy, once you’ve got it in a database (Charity Commission take note), there are a couple of bonuses.

First, it was relatively easy to knock up a quick and very simple Sinatra application, OpenCharities:

[Screenshot: Open Charities :: Opening up the UK Charities Register]

If there’s any interest, I’ll add more features to it, but for now it’s just the simplest of things: a web application with a unique URL for every charity based on its charity number, with the basic information for each charity available as data (XML, JSON and RDF). It’s also searchable, and sortable by most recent income and spending, and for linked data people there are dereferenceable Resource URIs.
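
For anyone curious how little code something like this needs, here’s a minimal sketch of the pattern (it is not the actual OpenCharities code, and it skips the RDF representation): one URL per charity number, with the representation chosen by the extension. The in-memory CHARITIES hash and its fields are stand-ins for the real database:

    require 'sinatra'
    require 'json'

    # Stand-in for the real database of scraped register entries
    CHARITIES = {
      '1234567' => { title: 'An Example Charity', income: 100_000, spending: 95_000 }
    }

    # e.g. /charities/1234567, /charities/1234567.json, /charities/1234567.xml
    get '/charities/:id' do
      number, format = params[:id].split('.')
      charity = CHARITIES[number] or halt 404

      case format
      when 'json'
        content_type :json
        charity.merge(charity_number: number).to_json
      when 'xml'
        content_type :xml
        "<charity number=\"#{number}\"><title>#{charity[:title]}</title></charity>"
      else
        "<h1>#{charity[:title]} (no. #{number})</h1>"
      end
    end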

This is very much an alpha application: the design is very basic and it’s possible that there are a few charities missing, for two reasons. One: the Charity Commission website kept timing out (I think I managed to pick up all of those, and any stragglers should get picked up when I periodically re-run the scraper); and two: there appears to be a bug in the Charity Commission website whereby, when there are between 10 and 13 entries, only 10 are shown and there is no way of seeing the additional ones. As a benchmark, there are currently 150,422 charities in the OpenCharities database.

It’s also worth mentioning that due to inconsistencies with the page structure, the income/spending data for some of the biggest charities is not yet in the system. I’ve worked out a fix, and the entries will be gradually updated, but only as they are re-scraped.

The second bonus is that the entire database is available to download and reuse (under an open, share-alike attribution licence). It’s a compressed CSV file, weighing in at just under 20MB in its compressed form, and should probably only be attempted by those familiar with manipulating large datasets (don’t try opening it in your spreadsheet, for example). I’m also in the process of importing it into Google Fusion Tables (it’s still churning away in the background) and will post a link when it’s done.
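
If you do download it, treat it as a stream rather than something to load whole. A minimal sketch in Ruby, assuming a gzipped file called open_charities.csv.gz and a ‘title’ column – the real filename, compression format and column headings may well differ:

    require 'zlib'
    require 'csv'

    # Stream the register row by row rather than pulling ~150,000 records
    # into memory (or a spreadsheet) in one go.
    Zlib::GzipReader.open('open_charities.csv.gz') do |gz|
      CSV.new(gz, headers: true).each do |row|
        # 'title' is an assumed column name – check the header row of the real file
        puts row['title'] if row['title'].to_s =~ /education/i
      end
    end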

Now, back to that spending data.


A Local Spending Data wish… granted

The very wonderful Stuart Harrison (aka pezholio), webmaster at Lichfield District Council, blogged yesterday with some thoughts about the publication of spending data following a local spending data workshop in Birmingham. Sadly I wasn’t able to attend this, but Stuart gives a very comprehensive account, and like all his posts it’s well worth reading.

In it he made an important observation about those at the workshop who were pushing for linked data from the beginning, and wished there was a solution. First the observation:

There did seem to be a bit of resistance to the linked data approach, mainly because agreeing standards seems to be a long, drawn out process, which is counter to the JFDI approach of publishing local data… I also recognise that there are difficulties in both publishing the data and also working with it… As we learned from the local elections project, often local authorities don’t even have people who are competent in HTML, let alone RDF, SPARQL etc.

He’s not wrong there. As someone who’s been publishing linked data for some time, who conceived and ran the Open Election Data project Stuart refers to, and who has worked with numerous councils to help them publish linked data, I’m probably as aware of the issues as anyone (ironically, and I think significantly, none of the councils involved in the local government e-standards body, and now pushing so hard for linked data, has actually published any linked data themselves).

That’s not to knock linked data – just to be realistic about the issues and hurdles that need to be overcome (see the report for a full breakdown). Expecting all the councils to solve all these problems at the same time as extracting the data from their systems, removing data relating to non-suppliers (e.g. foster parents), and including information from other systems (e.g. supplier data, which may be on procurement systems) – and all by January – is unrealistic at best, and could undermine the whole process.

So what’s to be done? I think the sensible thing, particularly in these straitened times, is to concentrate on getting the raw data out, and as much of it as possible, and come down hard on those councils who publish it badly (e.g. by locking it up in PDFs or giving it a closed licence), or who willfully ignore the guidance (it’s worrying how many councils publishing data at the moment don’t even include the transaction ID or date of the transaction, never mind supplier details).

Beyond that we should take the approach the web has always done, and which is the reason for its success: a decentralised, messy variety of implementations and solutions that allows a rich eco-system to develop, with government helping solve bottlenecks and structural problems rather than trying to impose highly centralised solutions that are already being solved elsewhere.

Yes, I’d love it if the councils were able to publish the data fully marked up, in a variety of forms (not just linked data, but also XML and JSON), but the ugly truth is that not a single council has so far even published their list of categories, never mind matched it up to a recognised standard (CIPFA BVACOP, COFOG or that used in their submissions to the CLG), still less done anything like linked data. So there’s a long way to go, and in the meantime we’re going to need some tools and cheap commodity services to bridge the gap.

[In a perfect world, maybe councils would develop some open-source tools to help them publish the data, perhaps using something like Adrian Short's Armchair Auditor code as the basis (this is a project that took a single council, Windsor & Maidenhead, and added a web interface to the figures). However, when many councils don't even have competent HTML skills (having outsourced much of it), this is only going to happen at a handful of councils at best, unless considerable investment is made.]

Stuart had been thinking along similar lines, and made a suggestion, almost a wish in fact:

I think the way forward is a centralised approach, with authorities publishing CSVs in a standard format on their website and some kind of system picking up these CSVs (say, on a monthly basis) and converting this data to a linked data format (as well as publishing in vanilla XML, JSON and CSV format).

He then expanded on the idea, talking about a single URL for each transaction, standard identifiers, “a human-readable summary of the data, together with links to the actual data in RDF, XML, CSV and JSON”. I’m a bit iffy about that ‘centralised approach’ phrase (the web is all about decentralisation), but I do think there’s an opportunity to help both the community and councils by solving some of these problems.
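
The conversion step at the heart of an idea like that needn’t be complicated. Here’s a rough sketch, assuming a council CSV with illustrative ‘Transaction ID’, ‘Date’, ‘Supplier’ and ‘Amount’ headings, which re-emits each row as a JSON record keyed by its own URL (the base URL and column names are assumptions, not a proposed standard):

    require 'csv'
    require 'json'

    BASE_URL = 'http://example.org/spending'  # illustrative: one URL per transaction

    # Read a council's published CSV and re-emit each row as a JSON record.
    # A real converter would need to map each council's headings onto common fields.
    def convert(csv_path, council_slug)
      CSV.foreach(csv_path, headers: true).map do |row|
        id = row['Transaction ID']
        {
          url:      "#{BASE_URL}/#{council_slug}/#{id}",
          date:     row['Date'],
          supplier: row['Supplier'],
          amount:   row['Amount'].to_f
        }
      end
    end

    puts JSON.pretty_generate(convert('lichfield_spending.csv', 'lichfield'))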

And that’s exactly what we’ve done at OpenlyLocal: adding the data from all the councils who’ve published their spending data, acting as a central repository, generating the URLs, and connecting the data to other datasets and identifiers (councils with SNAC IDs, companies with Companies House numbers). We’ve even extracted data from those councils who unhelpfully try to lock up their data in PDFs.

There are, at the time of writing, 52,443 financial transactions from 9 councils in the OpenlyLocal database. And that’s not all; there are also the following features:

  • Each transaction is tied to a supplier record for the council, and increasingly these are linked to company info (including their company number), or other councils (there’s a lot of money being transferred between councils), and users can add information about the supplier if we haven’t matched it up.
  • Every transaction, supplier and company has a permanent unique URL and is available as XML and JSON (see the sketch after this list for one way of pulling a record back programmatically)
  • We’ve sorted out some of the date issues (adding a date fuzziness field for those councils who don’t specify when in the month or quarter a transaction relates to).
  • Transactions are linked to the URL from which the file was downloaded (and usually the line number too, though obviously this is not possible if we’ve had to extract it from a PDF), meaning anyone else can recreate the dataset should they want to.
  • There’s an increasing amount of analysis, showing ordinary users the spending by month and the biggest suppliers and transactions, for example.
  • The whole spending dataset is available as a single, zipped CSV file to download for anyone else to use.
  • It’s all open data.
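
As a flavour of how the data can be reused, something like the following pulls a single record back as JSON. The path and field names here are assumptions – check the site for the actual URL structure of transactions, suppliers and companies:

    require 'open-uri'
    require 'json'

    # Illustrative path only – every transaction, supplier and company has a
    # JSON representation at its own permanent URL.
    url  = 'http://openlylocal.com/financial_transactions/123456.json'
    data = JSON.parse(URI.parse(url).read)  # open-uri lets URI::HTTP objects be read directly

    puts data.keys  # the fields available will depend on the record type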

There are a couple of features Stuart mentions that we haven’t yet implemented, for good reason.

First, we’re not yet publishing it as linked data, for the simple reason that the vocabulary hasn’t yet been defined, nor even the standards on which it will be based. When this is done, we’ll add this as a representation.

And although we use standard identifiers such as SNAC ids for councils (and wards) on OpenlyLocal, the URL structure Stuart mentions is not yet practical, partly because SNAC ids don’t cover all authorities (they don’t include the GLA or other public bodies, for example), and partly because only a tiny fraction of councils are publishing their internal transaction ids.

Also, we haven’t yet implemented comments on the transactions, for the simple reason that distributed comment systems such as Disqus are JavaScript-based and thus problematic for those with accessibility issues, while site-specific ones don’t allow the conversation to be carried on elsewhere (we think we might have a solution to this, but it’s at an early stage, and we’d be interested to hear other ideas).

But all in all, we reckon we’re pretty much there with Stuart’s wish list, and would hope that councils can get on with extracting the raw data, publishing it in an open, machine-readable format (such as CSV), and then move to linked data as their resources allow.

