This repository contains scripts to acquire, clean and process the spending information released by the UK central government.
The scripts have several stages that need to be run in order:
- build_index will find all related metadata (tagged: spend-transactions) on data.gov.uk
- retrieve will then try to fetch all the files
- extract will attempt to parse CSV/XLS/... and load it into a DB
- scan_columns will do some initial processing for later stages
- map_columns will outsource column name comprehension to the user
- condense will try to establish a common column schema
- format will try to munge numbers and dates
- suppliers will query opencorporates.org for supplier name resolution
- export will write a CSV
Some of the scripts are written as tests and are run with nosetests:
build_index, retrieve, extract, condense and format.
Adding -v prints the names of the individual stages, -x stops on the
first error, and --with-xunit generates an XML log file.
The other scripts can simply be run directly.
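
As a rough illustration of the split between nosetests stages and directly run scripts, the following sketch builds the command line for each stage in order. It is not part of the repository: the helper name `command_for` is hypothetical, and it assumes each stage lives in a file named `<stage>.py`, which may not match the actual layout.

```python
import sys

# Stage order as listed in this README.
STAGES = [
    "build_index", "retrieve", "extract", "scan_columns", "map_columns",
    "condense", "format", "suppliers", "export",
]

# Stages that this README says are run through nosetests.
NOSE_STAGES = {"build_index", "retrieve", "extract", "condense", "format"}


def command_for(stage, verbose=False, stop_on_error=False, xunit=False):
    """Return the argument list that would run a single stage."""
    if stage not in STAGES:
        raise ValueError("unknown stage: %s" % stage)
    script = stage + ".py"  # assumption: one file per stage, named after it
    if stage not in NOSE_STAGES:
        # The other scripts are run directly with the Python interpreter.
        return [sys.executable, script]
    cmd = ["nosetests"]
    if verbose:
        cmd.append("-v")            # print the names of the individual stages
    if stop_on_error:
        cmd.append("-x")            # stop on the first error
    if xunit:
        cmd.append("--with-xunit")  # generate an XML log file
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Print the commands in pipeline order without executing them.
    for stage in STAGES:
        print(" ".join(command_for(stage, verbose=True, stop_on_error=True)))
```

The sketch only prints the commands; passing each list to `subprocess.check_call` would be one way to actually run the stages in order.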
?
- PDFs
- Zip files containing a bunch of CSVs