diff --git a/EVALUATION.md b/EVALUATION.md
index f258452..af89165 100644
--- a/EVALUATION.md
+++ b/EVALUATION.md
@@ -21,25 +21,14 @@ Our objective is to monitor and improve the RAG pipeline for **AI-OPS**, that re
 
 The evaluation workflow is split in two steps:
 1. **Dataset Generation** ([dataset_generation.ipynb](./test/benchmarks/rag/dataset_generation.ipynb)):
-uses Ollama and the data that is ingested into Qdrant (RAG Vector Database) to generate *question* and *ground truth*
+uses the Gemini free API and the data that is ingested into Qdrant (RAG Vector Database) to generate *question* and *ground truth*
 (Q&A dataset).
 
 2. **Evaluation** ([evaluation.py](./test/benchmarks/rag/evaluation.py)): builds the RAG pipeline with the same
 used to generate the synthetic Q&A dataset, leverages the pipeline to provide an *answer* to the questions (given
 *contex*), then performs evaluation of the full evaluation dataset using LLM as a
-judge; for performance reasons the evaluation is performed using HuggingFace Inference API.
+judge. Here all generation is done via Ollama, using the same models integrated in **AI-OPS**.
 
 ## Results
 
-### Context Precision
-
-**TODO:** *describe the metric and the prompts used*
-
-![Context Precision Plot](data/rag_eval/results/plots/context_precision.png)
-
-### Context Recall
-
-**TODO:** *describe the metric and the prompts used*
-
-
-![Context Precision Plot](data/rag_eval/results/plots/context_recall.png)
+![RAG Evaluation Results](data/rag_eval/results/plots/plot.png)
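
For readers of this change, a minimal sketch of what the revised two-step workflow could look like in code is shown below. It is illustrative only: the Qdrant collection name (`ai_ops_docs`), the model names (`gemini-1.5-flash`, `mistral`), the prompts, and the helper functions `generate_qa`/`judge` are assumptions made for the example, not the actual contents of `dataset_generation.ipynb` or `evaluation.py`.

```python
"""Illustrative sketch of the two-step RAG evaluation workflow described above.

All names below (collection, models, prompts) are placeholders, not the
actual code in dataset_generation.ipynb or evaluation.py.
"""
import os

import google.generativeai as genai   # Gemini free API, used for dataset generation
import ollama                         # local models, same ones integrated in AI-OPS
from qdrant_client import QdrantClient

# Step 1: Q&A dataset generation (Gemini + data already ingested into Qdrant)
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
gemini = genai.GenerativeModel("gemini-1.5-flash")    # assumed model name
qdrant = QdrantClient(url="http://localhost:6333")    # assumed local instance


def generate_qa(collection: str = "ai_ops_docs", n_chunks: int = 10) -> list[dict]:
    """Sample ingested chunks from Qdrant and ask Gemini for question/ground-truth pairs."""
    points, _ = qdrant.scroll(collection_name=collection, limit=n_chunks, with_payload=True)
    dataset = []
    for point in points:
        chunk = point.payload.get("text", "")
        prompt = (
            "Given the following document chunk, write one question it answers "
            f"and the corresponding ground-truth answer.\n\nCHUNK:\n{chunk}"
        )
        dataset.append({"context": chunk, "qa": gemini.generate_content(prompt).text})
    return dataset


# Step 2: evaluation with LLM as a judge, generation done via Ollama
def judge(question: str, answer: str, ground_truth: str, model: str = "mistral") -> str:
    """Ask a local Ollama model to score the RAG answer against the ground truth."""
    prompt = (
        "You are an impartial judge. Score the ANSWER against the GROUND TRUTH "
        "on a 1-5 scale and briefly justify the score.\n\n"
        f"QUESTION: {question}\nANSWER: {answer}\nGROUND TRUTH: {ground_truth}"
    )
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]
```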