This post is by Andrew Speakman, who’s coordinating OpenlyLocal’s planning application work.
We can now report good progress on our plan to develop community scrapers to underpin the new incarnation of PlanningAlerts.com. The plan applies to those councils that use non-standard planning systems and our latest estimate is that there are around 100 of these sites.
There are now 13 working scrapers, with more on the way, and some of the £75 bounties have already been paid out. The data from these scrapers is being regularly uploaded into OpenlyLocal using the ScraperWiki API, and you can see the results here with Crawley Council:
The list of the authorities being scraped by this method is as follows:
- East Sussex
- Isle of Wight
- Nuneaton and Bedworth
- Telford and Wrekin
This is still very much a work in progress. Those in the above list that we’ve linked to are all running well and collecting up-to-date planning applications with locations, and they will start sending out email alerts as soon as we turn on the alerts system (currently being tested).
There are also some councils whose data we’re importing but for which we won’t yet be able to send alerts (Wokingham, from the above list, is one example). This is because they do not include postcodes in the planning application details, and our location coding is based on postcodes to ensure our data is fully open. If anyone from authorities such as Wokingham wants to rectify this situation, we’re more than happy to work with them.
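To illustrate why a missing postcode blocks alerts, here is a minimal Python sketch of pulling a postcode out of an application’s address. The helper name `extract_postcode` and the pattern are assumptions for illustration, not OpenlyLocal’s actual code, and the pattern is a simplified approximation of the UK postcode format rather than a full validator.

```python
import re

# Simplified approximation of the UK postcode shape (outward code + inward code).
# Assumption: good enough for a sketch; it does not validate real postcode areas.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b", re.IGNORECASE)


def extract_postcode(address):
    """Return the first postcode-like string in the address, or None if there is none."""
    match = POSTCODE_RE.search(address)
    if match is None:
        return None
    # Normalise to upper case with a single space between the two halves.
    return f"{match.group(1).upper()} {match.group(2).upper()}"
```

An address with no postcode yields `None`, which is the case where an application can be imported but not yet located for alerts.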
If you want to get involved in helping scrape the UK’s planning data and building an open database of planning applications for the whole of the UK, contact me at firstname.lastname@example.org. Further details about data fields and the available planning authorities are defined in this shared Google spreadsheet.
This post is by Andrew Speakman, who’s coordinating OpenlyLocal’s planning application work.
As Chris wrote in his last post announcing OpenlyLocal’s progress in building an open database of planning applications, while we can do the importing from the main planning systems, if we’re really going to cover the whole country, we’re going to need the community’s help. I’m going to be coordinating this effort and so I thought it would be useful to explain how we’re going to do this (you can contact me at email@example.com).
First, we’re going to use the excellent ScraperWiki as the main platform for writing external scrapers. It supports Python, Ruby and PHP, and has worked well for similar schemes. It also means the scraper is openly available and we can see it in action. We will then use the Scraperwiki API to upload the data regularly into OpenlyLocal.
Second, we’re going to break the job into manageable chunks by focusing on target groups of councils, and just to sweeten things – as if building a national open database of planning applications wasn’t enough – we’re going to offer small bounties (£75) for successful scrapers for these councils.
We have some particular requirements designed to make the system maintainable and to do things the right way, but not many are set in stone, so feel free to respond with suggestions if you want to do it in a different way.
For example, the scraper should keep itself current (running on a daily basis), but also behave nicely (not putting an excessive load on ScraperWiki or the target website by trying to get too much data in one go). In addition, we propose that the scrapers should update current applications on a daily basis and also make inroads into the backlog by gathering a batch of previous applications. Each run should:
- Create new database records for any new applications that have appeared on the site since the last run and store the identifiers (uid and url).
- Create new database records for a batch of missing older applications and store the identifiers (uid and url). Currently the scrapers are set up to work backwards from the earliest stored application towards a target date in the past.
- Update the most current applications by collecting and saving the full application details. At the moment the scrapers update the details of all applications from the past 60 days.
- Update the full application details of a batch of older applications where the uid and url have been collected (as above) but the application details are missing. At the moment the scrapers work backwards from the earliest “empty” application towards a target date in the past.
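The four steps above can be sketched in Python as a single daily run. This is an illustration under stated assumptions, not the code of any existing scraper: `FakeSite`, `run_once`, the table layout, the 14-day backlog step and the batch size of 1 are all hypothetical, and a real scraper would fetch pages from the council site and save through ScraperWiki rather than a local SQLite database. The sketch also stores a start date alongside uid and url, and tracks the backlog position in a small `progress` table rather than re-deriving it from the earliest stored application, so that a period with no applications does not stall the backward walk.

```python
import sqlite3
from datetime import date, timedelta

UPDATE_WINDOW = timedelta(days=60)  # "past 60 days", as described above
BACKLOG_STEP = timedelta(days=14)   # assumed size of one backward step per run
DETAIL_BATCH = 1                    # assumed cap on backlog detail fetches per run


class FakeSite:
    """Hypothetical stand-in for a council site; a real scraper would fetch pages."""

    def __init__(self, applications):
        self.applications = applications  # {uid: (url, start_date, description)}

    def list_between(self, start, end):
        return [(uid, url, day.isoformat())
                for uid, (url, day, _) in self.applications.items()
                if start <= day <= end]

    def fetch_details(self, uid):
        return {"description": self.applications[uid][2]}


def run_once(conn, site, today, target_date):
    conn.execute("CREATE TABLE IF NOT EXISTS applications ("
                 "uid TEXT PRIMARY KEY, url TEXT, start_date TEXT, description TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS progress (name TEXT PRIMARY KEY, day TEXT)")

    def store(found):
        # INSERT OR IGNORE leaves applications we already know about untouched.
        conn.executemany("INSERT OR IGNORE INTO applications (uid, url, start_date) "
                         "VALUES (?, ?, ?)", found)

    def update(uids):
        for uid in uids:
            details = site.fetch_details(uid)
            conn.execute("UPDATE applications SET description = ? WHERE uid = ?",
                         (details["description"], uid))

    # 1. Store identifiers of applications that appeared since the last run.
    latest = conn.execute("SELECT MAX(start_date) FROM applications").fetchone()[0]
    since = date.fromisoformat(latest) if latest else today - UPDATE_WINDOW
    store(site.list_between(since, today))

    # 2. Store identifiers for one batch of older applications, working backwards.
    row = conn.execute("SELECT day FROM progress WHERE name = 'backlog'").fetchone()
    cursor_day = date.fromisoformat(row[0]) if row else today - UPDATE_WINDOW
    if cursor_day > target_date:
        start = max(target_date, cursor_day - BACKLOG_STEP)
        store(site.list_between(start, cursor_day))
        conn.execute("INSERT OR REPLACE INTO progress VALUES ('backlog', ?)",
                     (start.isoformat(),))

    # 3. Refresh full details of every application from the past 60 days.
    cutoff = (today - UPDATE_WINDOW).isoformat()
    recent = conn.execute("SELECT uid FROM applications WHERE start_date >= ?",
                          (cutoff,)).fetchall()
    update([r[0] for r in recent])

    # 4. Fill in details for a small batch of older "empty" applications.
    empty = conn.execute("SELECT uid FROM applications WHERE description IS NULL "
                         "ORDER BY start_date DESC LIMIT ?", (DETAIL_BATCH,)).fetchall()
    update([r[0] for r in empty])
    conn.commit()
```

Capping steps 2 and 4 at a small batch is what keeps each run light on both ScraperWiki and the target website, while repeated daily runs gradually work through the backlog.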
The data fields to be gathered for each planning application are defined in this shared Google spreadsheet. Not all the fields will be available on every site, but we want all those that are there.
Note the following:
- The minimal valid set of fields for an application is: ‘uid’, ‘description’, ‘address’, ‘start_date’ and ‘date_scraped’
- The ‘uid’ is the database primary key field
- All dates (except date_scraped) should be stored in ISO 8601 format
- The ‘start_date’ field is set to the earlier of the ‘date_received’ and ‘date_validated’ fields, or to whichever one is available
- The ‘date_scraped’ field is a date/time (RFC3339) set to the current time when the full application details are updated. It should be indexed.
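As a rough illustration of these notes, the following Python sketch builds a record with the minimal field set and checks that it is valid. The function names `start_date`, `build_record` and `is_valid` are hypothetical helpers, not part of ScraperWiki or OpenlyLocal, and using UTC for ‘date_scraped’ is an assumption.

```python
from datetime import date, datetime, timezone

# Minimal valid set of fields, as listed in the notes above.
REQUIRED_FIELDS = ("uid", "description", "address", "start_date", "date_scraped")


def start_date(date_received=None, date_validated=None):
    """Return the earlier of the available dates as an ISO 8601 string, or None."""
    available = [d for d in (date_received, date_validated) if d is not None]
    return min(available).isoformat() if available else None


def build_record(uid, description, address,
                 date_received=None, date_validated=None, now=None):
    """Assemble one application record; `now` is injectable to make testing easy."""
    now = now or datetime.now(timezone.utc)  # assumption: timestamps kept in UTC
    record = {
        "uid": uid,  # primary key
        "description": description,
        "address": address,
        "start_date": start_date(date_received, date_validated),
        # A date/time with an explicit offset, which RFC 3339 requires.
        "date_scraped": now.isoformat(timespec="seconds"),
    }
    if date_received is not None:
        record["date_received"] = date_received.isoformat()
    if date_validated is not None:
        record["date_validated"] = date_validated.isoformat()
    return record


def is_valid(record):
    """True if every field in the minimal set is present and non-empty."""
    return all(record.get(field) for field in REQUIRED_FIELDS)
```

A record with neither ‘date_received’ nor ‘date_validated’ has no ‘start_date’ and so fails the check, which is a reasonable signal that the full details still need to be scraped.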
So how do you get started? Here’s a list of 10 non-standard authorities that you can choose from: Aberdeen, Aberdeenshire, Ashfield, Bath, Calderdale, Carmarthenshire, Consett, Crawley, Elmbridge and Flintshire. Have a look at the sites and then let me know if you want to reserve one and how long you think it will take to write your scraper.