nextprocurement/sproc

sproc

This is Python code for downloading and parsing metadata from the Spanish government’s Plataforma de contratación del sector público. It produces parquet files that can be easily read in many programming languages.

This project was developed with nbdev, so each module stems from a Jupyter notebook that contains the code along with tests and documentation. If you are interested in the inner workings of a module, check its corresponding notebook in the appropriate section of the project's GitHub Pages site.

Install

pip install git+https://github.com/nextprocurement/sproc@main

should do.

How to use

The software can be used as a library or as standalone scripts.

Scripts

Downloading data

The sproc_dl command is the workhorse of the library. It downloads all the data of a given kind into a parquet file, which can later be updated by invoking the same command. Running, e.g.,

sproc_dl outsiders

will download all the aggregated procurement data (excluding minor contracts) and write an outsiders.parquet file. The -o argument can be used to specify a directory other than the current one. Instead of outsiders, one can pass insiders or minors.

This is the highest-level command, and most likely the only one you need. The remaining ones (briefly explained below) provide access to finer granularity functionality.

Processing a single zip file

For testing purposes, one can download the Outsiders contracts for 2018, either directly from the URL below or, if wget is available, by running

wget https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/PlataformasAgregadasSinMenores_2018.zip

Running

sproc_read_single_zip.py PlataformasAgregadasSinMenores_2018.zip 2018.parquet

outputs the file 2018.parquet (the name is given by the second argument), which contains a pd.DataFrame with all the 2018 metadata. It can be readily loaded (in Python, through Pandas' pd.read_parquet). The columns of the stored pd.DataFrame are multiindexed, meaning one gets columns such as ('ContractFolderStatus', 'ContractFolderID') and ('ContractFolderStatus', 'ContractFolderStatusCode'). This is very convenient when visualizing the data (see the documentation for the hier module).
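To make the multiindexed layout concrete, here is a minimal sketch in plain pandas. It builds a toy frame in memory instead of reading the real file, and the cell values are made up for illustration; only the two column tuples come from the text above.

```python
import pandas as pd

# Toy stand-in for the frame stored in 2018.parquet (values are invented).
# With the real file one would start from: df = pd.read_parquet("2018.parquet")
columns = pd.MultiIndex.from_tuples(
    [
        ("ContractFolderStatus", "ContractFolderID"),
        ("ContractFolderStatus", "ContractFolderStatusCode"),
    ]
)
df = pd.DataFrame([["id-001", "code-a"], ["id-002", "code-b"]], columns=columns)

# Select a single column by its full (top level, sub level) tuple.
folder_ids = df[("ContractFolderStatus", "ContractFolderID")]

# Select every column under one top-level key; the result keeps only the sub level.
status = df["ContractFolderStatus"]
print(list(status.columns))
```

Selecting by the top-level key alone is what makes the hierarchical layout handy for exploring related fields together.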

From hierarchical (multiindexed) columns to plain ones

The columns of the above pd.DataFrame can be flattened to get, in the example above, ContractFolderStatus - ContractFolderID and ContractFolderStatus - ContractFolderStatusCode, respectively. Additionally, columns can be renamed following the mapping in a YAML file:

sproc_rename_cols.py 2018.parquet -l samples/PLACE.yaml

This would yield a pd.DataFrame with plain columns in file 2018_renamed.parquet. Renaming is carried out using the mapping in PLACE.yaml, which can be found in the samples directory of this repository. If neither a local file (-l) nor a remote file (-r) is provided, a default naming scheme is used, as long as the input file is named outsiders.parquet, insiders.parquet, or minors.parquet.
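The flattening and renaming steps can be sketched in plain pandas as follows. This is not the code behind sproc_rename_cols.py, only an illustration of the idea: the toy values, the `mapping` dictionary, and the short name `id` are hypothetical, and the actual structure of PLACE.yaml is not assumed here.

```python
import pandas as pd

# Toy frame with two-level columns (values are invented for illustration).
columns = pd.MultiIndex.from_tuples(
    [
        ("ContractFolderStatus", "ContractFolderID"),
        ("ContractFolderStatus", "ContractFolderStatusCode"),
    ]
)
df = pd.DataFrame([["id-001", "code-a"]], columns=columns)

# Flatten: join the levels of every column tuple with " - ".
flat = df.copy()
flat.columns = [" - ".join(levels) for levels in df.columns]

# Hypothetical renaming; a real mapping would be loaded from a YAML file.
mapping = {"ContractFolderStatus - ContractFolderID": "id"}
renamed = flat.rename(columns=mapping)
print(list(renamed.columns))
```

Columns missing from the mapping keep their flattened name, which is the default behaviour of DataFrame.rename.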

Processing a list of zip files

The sproc_read_zips.py command can be used to batch-process a sequence of files, e.g.,

sproc_read_zips.py contratosMenoresPerfilesContratantes_2018.zip contratosMenoresPerfilesContratantes_2019.zip

If no output file is specified (through the -o option), an out.parquet file (in which all the entries of all the zip files are stitched together) is produced.
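Conceptually, stitching the entries together amounts to concatenating one frame per zip file row-wise. The sketch below shows that with pandas on two toy frames; it assumes both frames share the same multiindexed columns and is not necessarily how sproc_read_zips.py does it internally. The frame names and values are made up.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [
        ("ContractFolderStatus", "ContractFolderID"),
        ("ContractFolderStatus", "ContractFolderStatusCode"),
    ]
)

# One toy frame per (hypothetical) parsed zip file.
df_2018 = pd.DataFrame([["id-2018-1", "code-a"]], columns=columns)
df_2019 = pd.DataFrame(
    [["id-2019-1", "code-b"], ["id-2019-2", "code-a"]], columns=columns
)

# Stack the rows; ignore_index=True renumbers them 0..n-1.
combined = pd.concat([df_2018, df_2019], ignore_index=True)
print(len(combined))
```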

Appending new data to an existing (column-multiindexed) parquet file

We can append new data to an existing pd.DataFrame. Let us, for instance, download data from January 2022,

wget https://contrataciondelsectorpublico.gob.es/sindicacion/sindicacion_1044/PlataformasAgregadasSinMenores_202201.zip

and extend the previous parquet file with data extracted from the newly downloaded zip,

sproc_extend_parquet_with_zip.py 2018.parquet PlataformasAgregadasSinMenores_202201.zip 2018_202201.parquet

The combined data is saved in 2018_202201.parquet.