This machine learning pipeline trains a model that aims to predict temperature and precipitation for 10 major cities in the UK. The pipeline pulls data from an API, processes and tests it, trains and tests a model, and makes a batch prediction for the next week's weather.
This script pulls the latest weather data from the Open-Meteo Historical Weather API (https://open-meteo.com/).
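A minimal sketch of such a pull for a single city is shown here, using only the standard library. The endpoint URL, the daily variable names, and the function names `build_archive_url` and `fetch_daily_weather` are assumptions made for illustration and should be checked against the Open-Meteo documentation; the project's own ingestion script may work differently.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Open-Meteo historical API; verify against its docs.
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"


def build_archive_url(latitude, longitude, start_date, end_date,
                      daily=("temperature_2m_mean", "precipitation_sum")):
    """Build a request URL for daily historical weather at one location."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,  # ISO date, e.g. "2023-01-01"
        "end_date": end_date,
        "daily": ",".join(daily),  # assumed variable names
        "timezone": "Europe/London",
    }
    return f"{ARCHIVE_URL}?{urlencode(params)}"


def fetch_daily_weather(url, timeout=30):
    """Download and decode the JSON response (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example: one week of data for London (coordinates are approximate).
url = build_archive_url(51.5072, -0.1276, "2023-01-01", "2023-01-07")
```

Calling `fetch_daily_weather(url)` would then return the decoded JSON payload; the request is kept separate from URL construction so that the latter can be checked without network access.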
This script cleans and processes the data and engineers features for training.
The script runs deterministic and statistical tests on the data to ensure its integrity.
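To illustrate the difference between the two kinds of test, the sketch below pairs a deterministic range check with a simple statistical check on the shift of the mean. The column names, limits, sample values, and the functions `check_ranges` and `mean_shift` are hypothetical; the actual tests in the pipeline are not described here and may use other methods.

```python
from statistics import mean, stdev


def check_ranges(rows, limits):
    """Deterministic check: every value must lie inside its allowed range.

    Returns a list of (row_index, column, value) tuples, one per violation.
    """
    violations = []
    for i, row in enumerate(rows):
        for column, (low, high) in limits.items():
            value = row[column]
            if value is None or not (low <= value <= high):
                violations.append((i, column, value))
    return violations


def mean_shift(reference, sample):
    """Statistical check: shift of the sample mean from the reference mean,
    expressed in reference standard deviations."""
    return abs(mean(sample) - mean(reference)) / stdev(reference)


# Hypothetical limits and data, for illustration only.
LIMITS = {"temperature": (-30.0, 45.0), "precipitation": (0.0, 300.0)}
rows = [
    {"temperature": 12.5, "precipitation": 0.0},
    {"temperature": 14.1, "precipitation": 3.2},
    {"temperature": 99.0, "precipitation": -1.0},  # both values out of range
]
violations = check_ranges(rows, LIMITS)

reference_temps = [10.0, 12.0, 11.0, 13.0, 9.0]
new_temps = [11.0, 12.5, 10.5]
shift = mean_shift(reference_temps, new_temps)
```

A run could, for example, fail when `violations` is non-empty or when `shift` exceeds a chosen threshold; the threshold itself is a design decision that depends on how much seasonal variation is expected.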
This script splits the provided dataset into a test set and a remaining set.
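One way such a split could look is sketched below as a seeded random split. The function name `split_test_remainder`, the 20% test fraction, and the seed are assumptions for illustration; the project may instead split chronologically, which is often preferable for time-ordered weather data.

```python
import random


def split_test_remainder(rows, test_fraction=0.2, seed=42):
    """Return (remainder, test) after a seeded shuffle of the rows."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Example with ten placeholder rows.
rows = list(range(10))
remainder, test = split_test_remainder(rows)
```

Fixing the seed keeps the split reproducible between runs, so that models are compared on the same test rows.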
This script trains and validates the model.
The trained model is tested against the test dataset. If the model demonstrates better performance in terms of R-squared score and Mean Absolute Error (MAE) compared to previous models, it is promoted for production use. Data slice and model drift tests are also conducted to validate the model's performance.
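The promotion rule can be sketched as follows, with MAE and R-squared computed by hand. It assumes that a candidate must improve on both metrics to be promoted; the description above does not say how ties or mixed results are handled, so that choice, along with the function names and sample values, is hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2_score(actual, predicted):
    """Proportion of variance in the actual values explained by the predictions.

    Assumes the actual values are not all identical (non-zero variance).
    """
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def should_promote(candidate, best_so_far):
    """Promote only if the candidate has both a lower MAE and a higher R2."""
    if best_so_far is None:  # no earlier model to compare against
        return True
    return (candidate["mae"] < best_so_far["mae"]
            and candidate["r2"] > best_so_far["r2"])


# Hypothetical test-set values and previous best metrics.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 17.0]
candidate = {
    "mae": mean_absolute_error(actual, predicted),
    "r2": r2_score(actual, predicted),
}
previous = {"mae": 1.2, "r2": 0.70}
promote = should_promote(candidate, previous)
```

Requiring improvement on both metrics is the stricter of the plausible readings; a looser rule would promote on either metric alone.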
This script predicts the upcoming week's weather and generates a visualisation of that prediction.
The pipeline generates metrics to track model performance and logs each pipeline step so that a run can be monitored and verified.
- Ingestion records: This report records every API pull request and the date range of the data pulled.
- Logs: This report is named after the date of the run. It lists each step of the pipeline run in detail, so that an error can be traced back to its source.
- Model performance: The metrics of each new model (MAE and R2) are appended to this report to track how performance changes over time. MAE measures the average absolute difference between the predicted and actual outcomes, while R2 indicates the proportion of the variance in the target variable that the model explains. A lower MAE and a higher R2 signify better model performance. These metrics can also be visualised in the W&B dashboard.
The project requires Python 3.11.5 running on Ubuntu 22.04.3 LTS. It uses the latest version of Miniconda for environment management. Other dependencies are listed in the requirements.txt file.
- GitHub: Version control, code review, bug tracking and documentation.
- Weights & Biases: Track, visualise and optimise ML experiments. Log metrics, parameters, artifacts and models.
- MLflow + Hydra: ML pipelines and orchestration.
- conda: Environment isolation and management.
- Scikit-learn and XGBoost: Machine learning algorithms.
To install the project, follow these steps:
> wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
> chmod u+x Miniconda3-latest-Linux-x86_64.sh
> ./Miniconda3-latest-Linux-x86_64.sh -b
Fork the project repository at https://github.com/Orestas41/weather-prediction-ml-pipeline.git by clicking on the 'Fork' button in the upper right corner. This creates a copy of the repository under your GitHub account. Then clone the fork to your local machine:
> git clone https://github.com/[Your Github Username]/weather-prediction-ml-pipeline.git
> cd weather-prediction-ml-pipeline
> conda env create -f environment.yml
> conda activate weather-prediction
To run the pipeline successfully, you need to set up authorization for Weights & Biases (W&B), a machine learning development platform that enables real-time tracking and visualisation of the model training process. Obtain your API key by visiting https://wandb.ai/authorize and clicking on the '+' icon to copy the key to the clipboard. Then authenticate with the following command:
> wandb login [your API key]
To train or retrain the model, navigate to the root directory and run the following command:
> mlflow run .
To run individual pipeline steps, pass the step name through the steps parameter:
> mlflow run . -P steps=[step name, e.g. `data_ingestion`]
The pipeline will pull the latest weather data from the Open-Meteo Historical Weather API (https://open-meteo.com/), clean it, merge it with the existing training data, and retrain the model from scratch. If the new model outperforms the previous versions on R-squared score and Mean Absolute Error (MAE), it will be promoted for production use.
If you have any questions or problems, please contact [email protected]