`epc-ppd-address-matching`

Match addresses in the Energy Performance Certificates (EPC) dataset and HM Land Registry Price Paid Data (PPD). A simple, but effective, rule-based approach.

Installation

You need Python 3.9 and pipenv. If you don't have them, see our instructions for macOS.

Clone this repo.
pipenv sync --dev to install dependencies.
If you're a CNZ colleague and downloading the data from our data warehouse, cp .env.template .env and fill in the blanks.
pipenv run pytest to run tests.

Alternatively, if you are familiar with Docker you can use the included Dockerfile to build an image and match addresses in a container.

Data Requirements

You need the EPC and PPD addresses in Parquet format. At CNZ we manage EPC and PPD using dbt in BigQuery. You can use that code to clean up EPC and PPD and transform it into the schema required for matching. Alternatively you can produce Parquet files with the following schemas however you want (all columns are strings).

PPD schema

ppd_id
primary_addressable_object_name
secondary_addressable_object_name
street
town_city
postcode

EPC schema

epc_id
address_line_1
address_line_2
address_line_3
post_town
postcode

Record IDs

ppd_id and epc_id must uniquely identify each address record across both datasets. You can use whatever identifier you want. We use the MD5 hash of each record in JSON format so that we can join the matches back to our data warehouse. In BigQuery, this is md5(to_json_string(addresses)).

Matching Addresses

python -m matching epc_addresses/ ppd_addresses/ matches/

On multicore machines you can get much faster performance by matching addresses in parallel. You need to partition the data using python -m matching.partition and have GNU Parallel installed.

We use ./scripts/container_run.sh to extract and partition data, match records and finally load the matches back into BigQuery. A Github Action runs this in a container on GCP.

Acknowledgements

Tange, O. (2021, November 22). GNU Parallel 20211122 ('Peng Shuai'). Zenodo. https://doi.org/10.5281/zenodo.5719513

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.github/workflows		.github/workflows
matching		matching
scripts		scripts
tests		tests
.dockerignore		.dockerignore
.env.template		.env.template
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
Pipfile		Pipfile
Pipfile.lock		Pipfile.lock
README.md		README.md
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

`epc-ppd-address-matching`

Installation

Data Requirements

PPD schema

EPC schema

Record IDs

Matching Addresses

Acknowledgements

About

Releases

Packages

Contributors 2

Languages

License

centrefornetzero/epc-ppd-address-matching

Folders and files

Latest commit

History

Repository files navigation

epc-ppd-address-matching

Installation

Data Requirements

PPD schema

EPC schema

Record IDs

Matching Addresses

Acknowledgements

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

`epc-ppd-address-matching`

Packages