Product Catalog Extraction Tool

A GenAI-driven tool that uses Google Vertex AI to extract text and images from product catalogs.

Objectives

Extract product information (text, images) from product catalogs in PDF format.
Enrich extracted data with AI-generated captions and metadata.
Provide structured output for downstream processing.

Solution Architecture

Image/Text Extraction:
- Employs PyMuPDF to extract raw images and text from PDF files, including citations.
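
The extraction step above could look roughly like the following sketch. It relies on PyMuPDF's `Page.get_text`, `Page.get_images`, and `Document.extract_image`; the names `PageContent`, `image_filename`, and `extract_pdf_content` are illustrative assumptions and are not taken from this repository's modules.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PageContent:
    """Text and image files recovered from one PDF page (hypothetical container)."""
    page_number: int
    text: str
    image_paths: list[str] = field(default_factory=list)


def image_filename(page_number: int, xref: int, ext: str) -> str:
    """Build a deterministic file name so each image can be traced to its page."""
    return f"page{page_number:03d}_xref{xref}.{ext}"


def extract_pdf_content(pdf_path: str, out_dir: str) -> list[PageContent]:
    """Extract plain text and embedded images from every page of a local PDF."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages: list[PageContent] = []
    with fitz.open(pdf_path) as doc:
        for number, page in enumerate(doc, start=1):
            content = PageContent(page_number=number, text=page.get_text())
            for img in page.get_images(full=True):
                xref = img[0]  # first tuple item is the image's cross-reference number
                info = doc.extract_image(xref)  # dict holding raw bytes and extension
                target = out / image_filename(number, xref, info["ext"])
                target.write_bytes(info["image"])
                content.image_paths.append(str(target))
            pages.append(content)
    return pages
```

Keeping the page number and cross-reference number in each file name is one simple way to preserve traceability from an image back to its source page.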
Text/Image Cleaning & Enrichment:
- Sanitizes extracted text and images.
- Uses a generative AI model (such as Imagen) to generate image captions ("specific captions").
- Stores enriched data in an intermediate bucket for traceability.
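
As an illustration of the text-cleaning part of this step, the sketch below normalizes text extracted from a PDF using only the standard library. The specific rules (Unicode normalization, control-character removal, re-joining words hyphenated across line breaks, whitespace collapsing) are assumptions about what "sanitizing" might involve, not a description of the tool's actual rules; image cleaning is not covered.

```python
import re
import unicodedata

_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")
_INLINE_SPACE = re.compile(r"[ \t]+")
_BLANK_LINES = re.compile(r"\n{3,}")


def sanitize_text(raw: str) -> str:
    """Return a cleaned copy of text extracted from a PDF page."""
    # Fold ligatures and other compatibility characters into plain equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # Use "\n" as the only line separator.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control characters, keeping tabs and newlines.
    text = _CONTROL_CHARS.sub("", text)
    # Re-join words that were hyphenated across a line break.
    text = _HYPHEN_BREAK.sub(r"\1\2", text)
    # Collapse runs of spaces and tabs, then trim each line.
    text = _INLINE_SPACE.sub(" ", text)
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Allow at most one blank line between paragraphs.
    text = _BLANK_LINES.sub("\n\n", text)
    return text.strip()
```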
Dynamic Prompt Generation & LLM Interaction:
- Generates prompts for tasks: Product ISQ, FAQ generation, Image Labeling, Image Captions.
- Submits prompts to a Large Language Model (LLM).
- Implements auto-reflection for refining output if needed.
- Aggregates results into a final JSON.
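
The prompt and reflection flow above can be sketched as follows. The templates, the `llm` callable, and the choice to trigger reflection when a reply is not valid JSON are all assumptions made for illustration; this README does not show the tool's real prompts or its reflection criteria.

```python
import json
from typing import Callable

# Hypothetical task templates; the tool's actual prompts are not shown here.
TASK_TEMPLATES = {
    "isq": "List the product specifications in the catalog text as a JSON object.",
    "faq": "Write frequently asked questions about the product as a JSON array.",
}


def build_prompt(task: str, catalog_text: str) -> str:
    """Combine a task template with the cleaned catalog text."""
    return f"{TASK_TEMPLATES[task]}\n\nCatalog text:\n{catalog_text}"


def run_with_reflection(prompt: str, llm: Callable[[str], str], max_retries: int = 2):
    """Call the LLM and, if the reply is not valid JSON, ask it to correct itself."""
    reply = llm(prompt)
    for _ in range(max_retries):
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            # Feed the faulty answer back so the model can repair it.
            reply = llm(
                f"{prompt}\n\nYour previous answer was not valid JSON ({err.msg}).\n"
                f"Previous answer:\n{reply}\n\nReturn only the corrected JSON."
            )
    return json.loads(reply)  # raises if the final attempt is still invalid


def run_tasks(catalog_text: str, llm: Callable[[str], str], tasks=("isq", "faq")) -> dict:
    """Run every task and aggregate the parsed replies into one dictionary."""
    return {t: run_with_reflection(build_prompt(t, catalog_text), llm) for t in tasks}
```

Passing the model in as a plain callable keeps the loop independent of any particular client library and makes it easy to exercise with a stub.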
Storage:
- Stores the final JSON in Google Cloud Storage (GCS) for downstream use.
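
A minimal sketch of the storage step, assuming the `google-cloud-storage` client library is used. The function name `upload_result` and the optional `client` parameter are hypothetical; the calls `Client.bucket`, `Bucket.blob`, and `Blob.upload_from_string` are part of that library.

```python
import json


def upload_result(result: dict, bucket_name: str, blob_name: str, client=None) -> str:
    """Serialize the result to JSON, upload it to GCS, and return its gs:// URI."""
    if client is None:
        # Imported lazily; requires the google-cloud-storage package and
        # application default credentials.
        from google.cloud import storage

        client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(
        json.dumps(result, indent=2),
        content_type="application/json",
    )
    return f"gs://{bucket_name}/{blob_name}"
```

Accepting an existing client as an argument lets callers reuse one connection and substitute a fake in tests.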

Tool Evaluation

Product catalogs come in various formats. The graph below shows how the tool performs across them.

Project Structure

src/: Contains the core Python modules for the extraction, cleaning, and LLM interaction components.
tests/: Test suites ensuring correctness and robustness (TDD principles).

Setup

Prerequisites:
- Python 3.11+
- Google Cloud Platform account with Vertex AI configured
- API keys for the GenAI services (if applicable)
Installation:
```
pip install -r requirements.txt
```
Running Instructions:

Given the GCS URI of a product catalog PDF, this tool extracts the product's text and image details.

Text details that are extracted:
- Company Details - Details of the company that owns the product: name, address, email, contact details.
- Product Name - Name and description of the product.
- FAQ - Frequently asked questions about the product.
- ISQ - Product specifications.
Image details that are extracted:
- Main image of the product
- Captions for the image
- Label for the image
- Tags for the image
Once these details are extracted, the resulting JSON file is placed in the output GCS bucket, which is also passed as an argument. For a sample JSON, see the file output_prod_details.json in this repo.
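
To make the shape of the output concrete, here is one possible way to model the extracted details before serializing them. Every class and field name below is a hypothetical choice derived from the lists above; the authoritative structure is whatever output_prod_details.json contains.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CompanyDetails:
    name: str = ""
    address: str = ""
    email: str = ""
    contact: str = ""


@dataclass
class ImageDetails:
    uri: str = ""
    caption: str = ""
    label: str = ""
    tags: list[str] = field(default_factory=list)


@dataclass
class ProductDetails:
    product_name: str = ""
    description: str = ""
    company: CompanyDetails = field(default_factory=CompanyDetails)
    faq: list[dict[str, str]] = field(default_factory=list)  # question/answer pairs
    isq: dict[str, str] = field(default_factory=dict)  # specification name -> value
    main_image: ImageDetails = field(default_factory=ImageDetails)

    def to_json(self) -> str:
        """Serialize the nested dataclasses into a JSON document."""
        return json.dumps(asdict(self), indent=2)
```

For example, `ProductDetails(product_name="Example product").to_json()` yields a JSON object in which the unset fields appear as empty strings, lists, or objects.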

Sample Run Command
```
python3 ./runner.py "gs://test-sl/hepasky-herbal-liver-tablets.pdf" "test-sl" "sl-test-project-353312"
```
Argument 1 - GCS URI of the PDF file
Argument 2 - Name of the bucket where the output file will be placed
Argument 3 - GCP Project ID
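
The three positional arguments could be parsed and validated along these lines. This is only a sketch using the standard library; how runner.py actually reads its arguments is not shown in this README, and the names `parse_args` and `split_gcs_uri` are assumptions.

```python
import argparse
from urllib.parse import urlparse


def split_gcs_uri(uri: str) -> tuple[str, str]:
    """Split gs://bucket/path/to/object into (bucket, object name)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"expected gs://bucket/object, got {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the three positional arguments in the documented order."""
    parser = argparse.ArgumentParser(description="Extract product details from a catalog PDF.")
    parser.add_argument("pdf_uri", help="GCS URI of the PDF file")
    parser.add_argument("output_bucket", help="bucket where the output file is placed")
    parser.add_argument("project_id", help="GCP project ID")
    return parser.parse_args(argv)
```

With the sample command above, `split_gcs_uri(args.pdf_uri)` returns `("test-sl", "hepasky-herbal-liver-tablets.pdf")`.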

Release Notes

Jan 2024

Tool launched on the open-source Google Cloud repository.

