ReadEuraxess

Generation of a dataset of the offers published on Euraxess

Setting up the ingestion system

The process is composed of two phases:

Daily download of offers

This is a simple step that should be set up as a cron job. The following command should be run every day:

> wget "ANONYMIZED_EURAXESS_JOBPOSTS_URL" --output-document=[your_path]/jobs_`date +%Y-%m-%d_%H:%M:%S`.xml

Warning: Permission to crawl this dataset must be granted by the website owners. Contact them to obtain the URL to use in the command above.
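Where scheduling a Python script is more convenient than a shell one-liner, the download step could be sketched roughly as follows. This is only an illustration and not part of the repository: `snapshot_path` and `download_snapshot` are hypothetical names, and the URL passed in must still be the one granted by the website owners.

```python
from datetime import datetime
from pathlib import Path
from urllib.request import urlopen


def snapshot_path(base_dir, now=None):
    """Build a timestamped file name following the pattern of the wget command."""
    now = now or datetime.now()
    return Path(base_dir) / f"jobs_{now.strftime('%Y-%m-%d_%H:%M:%S')}.xml"


def download_snapshot(url, base_dir):
    """Fetch the feed once and store it under a timestamped name (hypothetical helper)."""
    target = snapshot_path(base_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response:
        target.write_bytes(response.read())
    return target
```

If the wget command is instead placed directly in a crontab, keep in mind that cron treats an unescaped `%` as a line break, so each `%` in the date format needs to be written as `\%`.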

To consolidate all downloaded data (since the same offers appear repeatedly across the retrieved files), we need to run the following Python script:

> python main.py -c config_file

This script processes the downloaded XML files, extracts the necessary information, and consolidates the offers into a final CSV file.
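The consolidation idea can be illustrated with a short sketch: read every snapshot, keep one record per offer identifier, and write the result as CSV. This is not the code in `main.py`. The function name `consolidate` and the element names `job` and `id` are placeholders chosen for illustration, since the real structure of the feed is not described here.

```python
import csv
import xml.etree.ElementTree as ET


def consolidate(xml_documents, csv_path, record_tag="job", id_tag="id"):
    """Merge offers from several XML snapshots, keeping one row per offer id.

    record_tag and id_tag are placeholders; the real feed will use
    different element names.
    """
    offers = {}
    for document in xml_documents:
        root = ET.fromstring(document)
        for record in root.iter(record_tag):
            offer_id = record.findtext(id_tag)
            if offer_id is None:
                continue  # skip records without an identifier
            # A later snapshot overwrites an earlier copy of the same offer.
            offers[offer_id] = {
                child.tag: (child.text or "").strip() for child in record
            }

    # Use the union of all fields seen; missing values are written as "".
    fieldnames = sorted({key for row in offers.values() for key in row})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(offers.values())
    return len(offers)
```

Keying the dictionary on the offer identifier is what removes the repetitions: an offer that shows up in ten daily files still produces a single CSV row, holding the values from the last file in which it appeared.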

The script keeps track of the downloaded files that have already been processed. To regenerate the whole dataset from scratch, run the consolidation step (Step 2) with the --resetCSV flag enabled.
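One simple way to picture this bookkeeping is a plain text log of file names that have already been consumed. The sketch below is an assumption about how such tracking could work, not a description of what `main.py` actually does: `pending_files`, `mark_processed`, and the log file are all hypothetical, and the `reset` argument only mimics the spirit of the --resetCSV flag.

```python
from pathlib import Path


def pending_files(download_dir, log_path, reset=False):
    """Return downloaded XML files that are not yet listed in the log."""
    log = Path(log_path)
    if reset and log.exists():
        log.unlink()  # forget the history so every file is processed again
    done = set()
    if log.exists():
        done = set(log.read_text(encoding="utf-8").splitlines())
    candidates = Path(download_dir).glob("jobs_*.xml")
    return sorted(p for p in candidates if p.name not in done)


def mark_processed(files, log_path):
    """Append the names of newly processed files to the log."""
    with open(log_path, "a", encoding="utf-8") as handle:
        for path in files:
            handle.write(f"{path.name}\n")
```

A full regeneration would also need to discard the existing CSV before reprocessing; the sketch covers only the file log.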
