forked from AI_team/Philosophy-RAG-demo
Merge pull request 'setting-parser' (#32) from setting-parser into main
Reviewed-on: AI_team/generic-RAG-demo#32
Commit ad60c9d52f
README.md
@@ -1,14 +1,58 @@
 # generic-RAG-demo

-A Sogeti Nederland generic RAG demo
+A generic Retrieval Augmented Generation (RAG) demo from Sogeti Netherlands, built in Python. This project demonstrates how to integrate and run different backends, from cloud providers to local models, to parse and process your PDFs, web data, or other text sources.
+
+## Table of Contents
+
+- [generic-RAG-demo](#generic-rag-demo)
+  - [Table of Contents](#table-of-contents)
+  - [Features](#features)
+  - [Getting started](#getting-started)
+    - [Project Environment Setup](#project-environment-setup)
+    - [Installation of system dependencies](#installation-of-system-dependencies)
+      - [Unstructured PDF loader (optional)](#unstructured-pdf-loader-optional)
+      - [Local LLM (optional)](#local-llm-optional)
+  - [Running generic RAG demo](#running-generic-rag-demo)
+    - [config.yaml file](#configyaml-file)
+    - [.env file](#env-file)
+    - [Chainlit starters](#chainlit-starters)
+  - [Dev details](#dev-details)
+    - [Linting](#linting)
+
+## Features
+
+- **Multi-backend Support:** Easily switch between cloud-based and local LLMs.
+- **Flexible Data Input:** Supports both PDF and web data ingestion.
+- **Configurable Workflows:** Customize settings via a central `config.yaml` file.
+
 ## Getting started

+### Project Environment Setup
+
+This project leverages a modern packaging method defined in `pyproject.toml`. After cloning the repository, you can install the project along with its dependencies. You have two options:
+
+1. Using uv
+
+If you're using uv, simply run:
+
+```bash
+uv sync
+```
+
+2. Using a Python Virtual Environment
+
+Alternatively, set up a virtual environment and install the project:
+
+```bash
+python -m venv .venv       # Create a new virtual environment named ".venv"
+source .venv/bin/activate  # Activate the virtual environment (use ".venv\Scripts\activate" on Windows)
+pip install .              # Install the project and its dependencies
+```
+
 ### Installation of system dependencies

+Some optional features require additional system applications to be installed.
+
 #### Unstructured PDF loader (optional)

-If you would like to run the application using the unstructered PDF loader (`--unstructured-pdf` flag) you need to install two system dependencies.
+If you would like to run the application using the unstructured PDF loader (`pdf.unstructured` setting), you need to install two system dependencies.

 - [poppler-utils](https://launchpad.net/ubuntu/jammy/amd64/poppler-utils)
 - [tesseract-ocr](https://github.com/tesseract-ocr/tesseract?tab=readme-ov-file#installing-tesseract)
@@ -21,18 +65,19 @@ sudo apt install poppler-utils tesseract-ocr

 #### Local LLM (optional)

-If you would like to run the application using a local LLM backend (`-b local` flag), you need to install Ollama.
+If you would like to run the application using a local LLM backend (the `local` setting), you need to install Ollama.

 ```bash
 curl -fsSL https://ollama.com/install.sh | sh  # install Ollama
 ollama pull llama3.1:8b                        # fetch and download the model
 ```

-Include the downloaded model in the `.env` file:
+Include the downloaded model in the `config.yaml` file:

-```text
-LOCAL_CHAT_MODEL="llama3.1:8b"
-LOCAL_EMB_MODEL="llama3.1:8b"
+```yaml
+local:
+  chat_model: "llama3.1:8b"
+  emb_model: "llama3.1:8b"
 ```

 > For more information on installing Ollama, please refer to the Langchain Local LLM documentation, specifically the [Quickstart section](https://python.langchain.com/docs/how_to/local_llms/#quickstart).
@@ -52,14 +97,19 @@ python generic_rag/app.py -p data # will work and parses all pdf files in ./da
 python generic_rag/app.py --help # will work and prints command line options
 ```

-Please configure your `.env` file with your cloud provider (backend) of choice and set the `--backend` flag accordingly.
+Please configure your `config.yaml` and `.env` files with your cloud provider (backend) of choice. See the sections below for more details.

-### .env file
+### config.yaml file

-A .env file needs to be populated to configure API end-points or local back-ends using environment variables.
-Currently all required environment variables are defined in code at [backend/models.py](generic_rag/backend/models.py)
-with the exception of the API key variables itself.
-More information about configuring API endpoints for langchain can be found at the following locations.
+A config.yaml file is required to specify your API endpoints and local backends. Use the provided `config.yaml.example` as a starting point. Update the file according to your backend settings and project requirements.
+
+Key configuration points include:
+
+- Chat Backend: Choose among azure, openai, google_vertex, aws, or local.
+- Embedding Backend: Configure the embedding models similarly.
+- Data Processing Settings: Define PDF and web data sources, chunk sizes, and overlap.
+- Vector Database: Customize the path and reset behavior.
+
+For more information on configuring Langchain endpoints and models, please see:

 - [langchain cloud chat model doc](https://python.langchain.com/docs/integrations/chat/)
 - [langchain local chat model doc](https://python.langchain.com/docs/how_to/local_llms/)
@@ -67,25 +117,13 @@ More information about configuring API endpoints for langchain can be found at t

 > for local models we currently use Ollama

-An `.env` example is as followed.
+### .env file

+Set the API keys for your chosen cloud provider (backend). This ensures that your application can authenticate and interact with the services.
+
 ```text
-# only one backend (azure, google, local, etc) is required. Please addjust the --backend flag accordingly
-AZURE_OPENAI_API_KEY="<secret_key>"
-AZURE_LLM_ENDPOINT="https://<project_hub>.openai.azure.com"
-AZURE_LLM_DEPLOYMENT_NAME="gpt-4"
-AZURE_LLM_API_VERSION="2025-01-01-preview"
-AZURE_EMB_ENDPOINT="https://<project_hub>.openai.azure.com"
-AZURE_EMB_DEPLOYMENT_NAME="text-embedding-3-large"
-AZURE_EMB_API_VERSION="2023-05-15"
-
-LOCAL_CHAT_MODEL="llama3.1:8b"
-LOCAL_EMB_MODEL="llama3.1:8b"
-
-# google vertex AI does not use API keys but a seperate authentication method
-GOOGLE_GENAI_CHAT_MODEL="gemini-2.0-flash"
-GOOGLE_GENAI_EMB_MODEL="models/text-embedding-004"
+AZURE_OPENAI_API_KEY=your_azure_api_key
+OPENAI_API_KEY=your_openai_api_key
 ```

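Note: this diff does not show how the `.env` values reach `os.environ`; the sketch below assumes python-dotenv purely for illustration (many app frameworks also load `.env` automatically).

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv, used here only for illustration

load_dotenv()  # copies KEY=value pairs from ./.env into os.environ

# backend/models.py fails fast when the key for the selected backend is missing:
print("AZURE_OPENAI_API_KEY" in os.environ)
```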
 ### Chainlit starters
@@ -102,4 +140,4 @@ CHAINLIT_STARTERS=[{"label":"Label 1","message":"Message one."},{"label":"Label

 ### Linting

-Currently [Ruff](https://github.com/astral-sh/ruff) is used as Python linter. It is included in the [pyproject.toml](pyproject.toml) as `dev` dependency if your IDE needs that. However, for VS Code a [Ruff extension](https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff) excists.
+Currently [Ruff](https://github.com/astral-sh/ruff) is used as the Python linter. It is included in the [pyproject.toml](pyproject.toml) as a `dev` dependency if your IDE needs that. However, for VS Code a [Ruff extension](https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff) exists.
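Note: `CHAINLIT_STARTERS` (shown in the hunk context above) holds a JSON list in the environment. The snippet below is only an illustrative sketch of decoding such a value; the actual handling in the app is not part of this diff.

```python
import json
import os

# Illustrative only: decode a CHAINLIT_STARTERS value like the one shown above.
starters = json.loads(os.environ.get("CHAINLIT_STARTERS", "[]"))
for starter in starters:
    print(starter["label"], "->", starter["message"])
```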
config.yaml (new file, 61 lines)

@@ -0,0 +1,61 @@
+# Define your application settings here.
+
+chat_backend: local # Select the primary chat backend (azure, openai, google_vertex, aws, local)
+emb_backend: local # Select the primary embedding backend (azure, openai, google_vertex, aws, local, huggingface)
+
+use_conditional_graph: false # Use a conditional RAG model with historical chat context, or a non-conditional model without access to the current conversation
+
+# --- Provider Specific Settings ---
+
+azure:
+  llm_endpoint: "https://example.openai.azure.com"
+  llm_deployment_name: "gpt-4o-mini"
+  llm_api_version: "2025-01-01-preview"
+  emb_endpoint: "https://example.openai.azure.com"
+  emb_deployment_name: "text-embedding-3-large"
+  emb_api_version: "2023-05-15"
+
+openai:
+  chat_model: "gpt-4o-mini"
+  emb_model: "text-embedding-3-large"
+
+google_vertex:
+  project_id: "your_gcp_project_id"
+  location: "europe-west4"
+  chat_model: "gemini-pro"
+  emb_model: "textembedding-gecko@001"
+
+aws:
+  chat_model: "amazon.titan-llm-v1"
+  emb_model: "amazon.titan-embed-text-v1"
+  region: "us-east-1"
+
+local: # Settings for local models (e.g., Ollama)
+  chat_model: "llama3.1:8b"
+  emb_model: "llama3.1:8b"
+
+huggingface:
+  chat_model: "meta-llama/Llama-2-7b-chat-hf"
+  emb_model: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
+
+# --- Data Processing Settings ---
+
+pdf:
+  # List of paths to PDF files or folders containing PDFs.
+  # Pydantic converts these strings to pathlib.Path objects.
+  data:
+    - "C:/path/folder"
+  unstructured: false # Use the unstructured PDF loader?
+  chunk_size: 1000
+  chunk_overlap: 200
+  add_start_index: false
+
+web:
+  # List of URLs to scrape for data.
+  data:
+    - "https://www.example.nl/subdomain"
+  chunk_size: 200
+
+chroma_db:
+  location: "/app/data/vector_database" # Override default DB path (default: '.chroma_db')
+  reset: False # Reset the database on startup? (default: false)
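Note: a minimal sketch (not part of the commit) of how this file is consumed through the `load_settings` helper added in `generic_rag/parsers/config.py`; the path below is just an example.

```python
from pathlib import Path

from generic_rag.parsers.config import load_settings

# Validate config.yaml into a typed AppSettings object (Pydantic models).
settings = load_settings(Path("config.yaml"))

print(settings.chat_backend)        # e.g. ChatBackend.local
print(settings.pdf.chunk_size)      # 1000 unless overridden
print(settings.chroma_db.location)  # pathlib.Path to the Chroma DB directory
```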
@@ -3,18 +3,14 @@ import json
 import logging
 import os
 from pathlib import Path
+import sys

 import chainlit as cl
 from chainlit.cli import run_chainlit
 from langchain_chroma import Chroma

-from generic_rag.backend.models import (
-    ChatBackend,
-    EmbeddingBackend,
-    get_chat_model,
-    get_embedding_model,
-    get_compression_model,
-)
+from generic_rag.parsers.config import AppSettings, load_settings
+from generic_rag.backend.models import get_chat_model, get_embedding_model, get_compression_model
 from generic_rag.graphs.cond_ret_gen import CondRetGenLangGraph
 from generic_rag.graphs.ret_gen import RetGenLangGraph
 from generic_rag.parsers.parser import add_pdf_files, add_urls
@@ -22,6 +18,8 @@ from generic_rag.parsers.parser import add_pdf_files, add_urls
 logger = logging.getLogger("sogeti-rag")
 logger.setLevel(logging.DEBUG)

+PROJECT_ROOT = Path(__file__).resolve().parent.parent
+
 system_prompt = (
     "You are an assistant for question-answering tasks. "
     "If the question is in Dutch, answer in Dutch. If the question is in English, answer in English."
@@ -29,85 +27,45 @@ system_prompt = (
     "If you don't know the answer, say that you don't know."
 )

-parser = argparse.ArgumentParser(description="A Sogeti Nederland Generic RAG demo.")
+parser = argparse.ArgumentParser(description="A Sogeti Netherlands Generic RAG demo.")
 parser.add_argument(
     "-c",
-    "--chat-backend",
-    type=ChatBackend,
-    choices=list(ChatBackend),
-    default=ChatBackend.local,
-    help="Cloud provider or local LLM to use as backend. In the case of 'local', Ollama needs to be installed.",
-)
-parser.add_argument(
-    "-e",
-    "--emb-backend",
-    type=EmbeddingBackend,
-    choices=list(EmbeddingBackend),
-    default=EmbeddingBackend.huggingface,
-    help="Cloud provider or local embedding to use as backend. In the case of 'local', Ollama needs to be installed. ",
-)
-parser.add_argument(
-    "-p",
-    "--pdf-data",
+    "--config",
     type=Path,
-    nargs="+",
-    default=[],
-    help="One or multiple paths to folders or files to use for retrieval. "
-    "If a path is a folder, all files in the folder will be used. "
-    "If a path is a file, only that file will be used. "
-    "If the path is relative it will be relative to the current working directory.",
-)
-parser.add_argument(
-    "-u",
-    "--unstructured-pdf",
-    action="store_true",
-    help="Use an unstructered PDF loader. "
-    "An unstructured PDF loader might be usefull for PDF files "
-    "that contain a lot of images with text, tables or (scanned) text as images. "
-    "Please use '-r' when switching parsers on already indexed data.",
-)
-parser.add_argument("--pdf-chunk_size", type=int, default=1000, help="The size of the chunks to split the text into.")
-parser.add_argument("--pdf-chunk_overlap", type=int, default=200, help="The overlap between the chunks.")
-parser.add_argument(
-    "--pdf-add-start-index", action="store_true", help="Add the start index to the metadata of the chunks."
-)
-parser.add_argument(
-    "-w", "--web-data", type=str, nargs="*", default=[], help="One or multiple URLs to use for retrieval."
-)
-parser.add_argument("--web-chunk-size", type=int, default=200, help="The size of the chunks to split the text into.")
-parser.add_argument(
-    "-d",
-    "--chroma-db-location",
-    type=Path,
-    default=Path(".chroma_db"),
-    help="File path to store or load a Chroma DB from/to.",
-)
-parser.add_argument("-r", "--reset-chrome-db", action="store_true", help="Reset the Chroma DB.")
-parser.add_argument(
-    "--use-conditional-graph",
-    action="store_true",
-    help="Use the conditial retrieve generate graph over the regular retrieve generate graph.",
+    default=PROJECT_ROOT / "config.yaml",
+    help="Path to configuration file (YAML format). Defaults to 'config.yaml' in project root.",
 )

 args = parser.parse_args()

+try:
+    settings: AppSettings = load_settings(args.config)
+except (FileNotFoundError, Exception) as e:
+    logger.error(f"Failed to load configuration from {args.config}. Exiting.")
+    sys.exit(1)
+
+embedding_function = get_embedding_model(settings)
+
+chat_function = get_chat_model(settings)
+
 vector_store = Chroma(
     collection_name="generic_rag",
-    embedding_function=get_embedding_model(args.emb_backend),
-    persist_directory=str(args.chroma_db_location),
+    embedding_function=embedding_function,
+    persist_directory=str(settings.chroma_db.location),
 )

-if args.use_conditional_graph:
+if settings.use_conditional_graph:
     graph = CondRetGenLangGraph(
         vector_store=vector_store,
-        chat_model=get_chat_model(args.chat_backend),
-        embedding_model=get_embedding_model(args.emb_backend),
+        chat_model=chat_function,
+        embedding_model=embedding_function,
         system_prompt=system_prompt,
     )
 else:
     graph = RetGenLangGraph(
         vector_store=vector_store,
-        chat_model=get_chat_model(args.chat_backend),
-        embedding_model=get_embedding_model(args.emb_backend),
+        chat_model=chat_function,
+        embedding_model=embedding_function,
         system_prompt=system_prompt,
         compression_model=get_compression_model(
             "BAAI/bge-reranker-base", vector_store
@@ -170,17 +128,21 @@ async def set_starters():


 if __name__ == "__main__":
-    if args.reset_chrome_db:
+    if settings.chroma_db.reset:
         vector_store.reset_collection()

     add_pdf_files(
         vector_store,
-        args.pdf_data,
-        args.pdf_chunk_size,
-        args.pdf_chunk_overlap,
-        args.pdf_add_start_index,
-        args.unstructured_pdf,
+        settings.pdf.data,
+        settings.pdf.chunk_size,
+        settings.pdf.chunk_overlap,
+        settings.pdf.add_start_index,
+        settings.pdf.unstructured,
+    )
+    add_urls(
+        vector_store,
+        settings.web.data,
+        settings.web.chunk_size,
     )
-    add_urls(vector_store, args.web_data, args.web_chunk_size)

     run_chainlit(__file__)
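Note: for readers migrating from the old CLI, a rough reference (illustrative only, not part of the commit) mapping the removed flags to the new config.yaml keys:

```python
# Rough mapping of the removed command-line flags to the new config.yaml keys.
FLAG_TO_SETTING = {
    "--chat-backend": "chat_backend",
    "--emb-backend": "emb_backend",
    "--pdf-data": "pdf.data",
    "--unstructured-pdf": "pdf.unstructured",
    "--pdf-chunk_size": "pdf.chunk_size",
    "--pdf-chunk_overlap": "pdf.chunk_overlap",
    "--pdf-add-start-index": "pdf.add_start_index",
    "--web-data": "web.data",
    "--web-chunk-size": "web.chunk_size",
    "--chroma-db-location": "chroma_db.location",
    "--reset-chrome-db": "chroma_db.reset",
    "--use-conditional-graph": "use_conditional_graph",
}
```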
@@ -1,93 +1,213 @@
+import logging
 import os
-from enum import Enum

 from langchain_chroma import Chroma
-from langchain.chat_models import init_chat_model
 from langchain_aws import BedrockEmbeddings
 from langchain_core.embeddings import Embeddings
 from langchain_core.language_models.chat_models import BaseChatModel
 from langchain_core.retrievers import BaseRetriever
-from langchain_google_vertexai import VertexAIEmbeddings
-from langchain_huggingface import HuggingFaceEmbeddings
+from langchain_aws import BedrockEmbeddings, ChatBedrock
+from langchain_google_vertexai import VertexAIEmbeddings, ChatVertexAI
+from langchain_huggingface import HuggingFaceEmbeddings, ChatHuggingFace, HuggingFacePipeline
 from langchain.retrievers import ContextualCompressionRetriever
 from langchain.retrievers.document_compressors import CrossEncoderReranker
 from langchain_community.cross_encoders import HuggingFaceCrossEncoder
 from langchain_ollama import ChatOllama, OllamaEmbeddings
-from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings, OpenAIEmbeddings
+from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings, ChatOpenAI, OpenAIEmbeddings

+from generic_rag.parsers.config import AppSettings, ChatBackend, EmbeddingBackend
+
+logger = logging.getLogger(__name__)
+

-class ChatBackend(Enum):
-    azure = "azure"
-    openai = "openai"
-    google_vertex = "google_vertex"
-    aws = "aws"
-    local = "local"
-
-    # make the enum pretty printable for argparse
-    def __str__(self):
-        return self.value
-
-
-class EmbeddingBackend(Enum):
-    azure = "azure"
-    openai = "openai"
-    google_vertex = "google_vertex"
-    aws = "aws"
-    local = "local"
-    huggingface = "huggingface"
-
-    # make the enum pretty printable for argparse
-    def __str__(self):
-        return self.value
-
-
-def get_chat_model(backend_type: ChatBackend) -> BaseChatModel:
-    if backend_type == ChatBackend.azure:
+def get_chat_model(settings: AppSettings) -> BaseChatModel:
+    """
+    Initializes and returns a chat model based on the backend type and configuration.
+
+    Args:
+        settings: The loaded AppSettings object containing configurations.
+
+    Returns:
+        An instance of BaseChatModel.
+
+    Raises:
+        ValueError: If the backend type is unknown or required configuration is missing.
+    """
+    logger.info(f"Initializing chat model for backend: {settings.chat_backend.value}")
+
+    if settings.chat_backend == ChatBackend.azure:
+        if not settings.azure:
+            raise ValueError("Azure chat backend selected, but 'azure' configuration section is missing in config.")
+        if (
+            not settings.azure.llm_endpoint
+            or not settings.azure.llm_deployment_name
+            or not settings.azure.llm_api_version
+        ):
+            raise ValueError(
+                "Azure configuration requires 'llm_endpoint', 'llm_deployment_name', and 'llm_api_version'."
+            )
+        if "AZURE_OPENAI_API_KEY" not in os.environ:
+            raise ValueError(
+                "The environment variable 'AZURE_OPENAI_API_KEY' is missing. Please set the variable in your '.env' file before running the script."
+            )
         return AzureChatOpenAI(
-            azure_endpoint=os.environ["AZURE_LLM_ENDPOINT"],
-            azure_deployment=os.environ["AZURE_LLM_DEPLOYMENT_NAME"],
-            openai_api_version=os.environ["AZURE_LLM_API_VERSION"],
+            azure_endpoint=settings.azure.llm_endpoint,
+            azure_deployment=settings.azure.llm_deployment_name,
+            openai_api_version=settings.azure.llm_api_version,
         )

-    if backend_type == ChatBackend.openai:
-        return init_chat_model(os.environ["OPENAI_CHAT_MODEL"], model_provider="openai")
+    if settings.chat_backend == ChatBackend.openai:
+        if not settings.openai:
+            raise ValueError("OpenAI chat backend selected, but 'openai' configuration section is missing.")
+        if not settings.openai.chat_model:
+            raise ValueError("OpenAI configuration requires 'chat_model'.")
+        if "OPENAI_API_KEY" not in os.environ:
+            raise ValueError(
+                "The environment variable 'OPENAI_API_KEY' is missing. Please set the variable in your '.env' file before running the script."
+            )
+        return ChatOpenAI(model=settings.openai.chat_model)

-    if backend_type == ChatBackend.google_vertex:
-        return init_chat_model(os.environ["GOOGLE_CHAT_MODEL"], model_provider="google_vertexai")
+    if settings.chat_backend == ChatBackend.google_vertex:
+        if not settings.google_vertex:
+            raise ValueError(
+                "Google Vertex chat backend selected, but 'google_vertex' configuration section is missing."
+            )
+        if (
+            not settings.google_vertex.chat_model
+            or not settings.google_vertex.project_id
+            or not settings.google_vertex.location
+        ):
+            raise ValueError("Google Vertex configuration requires 'chat_model', 'project_id' and 'location'.")
+        return ChatVertexAI(
+            model_name=settings.google_vertex.chat_model,
+            project=settings.google_vertex.project_id,
+            location=settings.google_vertex.location,
+        )

-    if backend_type == ChatBackend.aws:
-        return init_chat_model(model=os.environ["AWS_CHAT_MODEL"], model_provider="bedrock_converse")
+    if settings.chat_backend == ChatBackend.aws:
+        if not settings.aws:
+            raise ValueError("AWS Bedrock chat backend selected, but 'aws' configuration section is missing.")
+        if not settings.aws.chat_model or not settings.aws.region:
+            raise ValueError("AWS Bedrock configuration requires 'chat_model' and 'region'")
+        return ChatBedrock(
+            model_id=settings.aws.chat_model,
+            region_name=settings.aws.region,
+        )

-    if backend_type == ChatBackend.local:
-        return ChatOllama(model=os.environ["LOCAL_CHAT_MODEL"])
+    if settings.chat_backend == ChatBackend.local:
+        if not settings.local:
+            raise ValueError("Local chat backend selected, but 'local' configuration section is missing.")
+        if not settings.local.chat_model:
+            raise ValueError("Local configuration requires 'chat_model'")
+        return ChatOllama(model=settings.local.chat_model)

-    raise ValueError(f"Unknown backend type: {backend_type}")
+    if settings.chat_backend == ChatBackend.huggingface:
+        if not settings.huggingface:
+            raise ValueError("Huggingface chat backend selected, but 'huggingface' configuration section is missing.")
+        if not settings.huggingface.chat_model:
+            raise ValueError("Huggingface configuration requires 'chat_model'")
+        llm = HuggingFacePipeline.from_model_id(
+            model_id=settings.huggingface.chat_model,
+            task="text-generation",
+            pipeline_kwargs=dict(
+                max_new_tokens=512,
+                do_sample=False,
+                repetition_penalty=1.03,
+            ),
+        )
+        return ChatHuggingFace(llm=llm)
+
+    # This should not be reached if all Enum members are handled
+    raise ValueError(f"Unknown or unhandled chat backend type: {settings.chat_backend}")


-def get_embedding_model(backend_type: EmbeddingBackend) -> Embeddings:
-    if backend_type == EmbeddingBackend.azure:
+def get_embedding_model(settings: AppSettings) -> Embeddings:
+    """
+    Initializes and returns an embedding model based on the backend type and configuration.
+
+    Args:
+        settings: The loaded AppSettings object containing configurations.
+
+    Returns:
+        An instance of Embeddings.
+
+    Raises:
+        ValueError: If the backend type is unknown or required configuration is missing.
+    """
+    logger.info(f"Initializing embedding model for backend: {settings.emb_backend.value}")
+
+    if settings.emb_backend == EmbeddingBackend.azure:
+        if not settings.azure:
+            raise ValueError("Azure embedding backend selected, but 'azure' configuration section is missing.")
+        if (
+            not settings.azure.emb_endpoint
+            or not settings.azure.emb_deployment_name
+            or not settings.azure.emb_api_version
+        ):
+            raise ValueError(
+                "Azure configuration requires 'emb_endpoint', 'emb_deployment_name', and 'emb_api_version'."
+            )
+        if "AZURE_OPENAI_API_KEY" not in os.environ:
+            raise ValueError(
+                "The environment variable 'AZURE_OPENAI_API_KEY' is missing. Please set the variable in your '.env' file before running the script."
+            )
         return AzureOpenAIEmbeddings(
-            azure_endpoint=os.environ["AZURE_EMB_ENDPOINT"],
-            azure_deployment=os.environ["AZURE_EMB_DEPLOYMENT_NAME"],
-            openai_api_version=os.environ["AZURE_EMB_API_VERSION"],
+            azure_endpoint=settings.azure.emb_endpoint,
+            azure_deployment=settings.azure.emb_deployment_name,
+            openai_api_version=settings.azure.emb_api_version,
         )

-    if backend_type == EmbeddingBackend.openai:
-        return OpenAIEmbeddings(model=os.environ["OPENAI_EMB_MODEL"])
+    if settings.emb_backend == EmbeddingBackend.openai:
+        if not settings.openai:
+            raise ValueError("OpenAI embedding backend selected, but 'openai' configuration section is missing.")
+        if not settings.openai.emb_model:
+            raise ValueError("OpenAI configuration requires 'emb_model'.")
+        if "OPENAI_API_KEY" not in os.environ:
+            raise ValueError(
+                "The environment variable 'OPENAI_API_KEY' is missing. Please set the variable in your '.env' file before running the script."
+            )
+        return OpenAIEmbeddings(model=settings.openai.emb_model)

-    if backend_type == EmbeddingBackend.google_vertex:
-        return VertexAIEmbeddings(model=os.environ["GOOGLE_EMB_MODEL"])
+    if settings.emb_backend == EmbeddingBackend.google_vertex:
+        if not settings.google_vertex:
+            raise ValueError(
+                "Google Vertex embedding backend selected, but 'google_vertex' configuration section is missing."
+            )
+        if (
+            not settings.google_vertex.emb_model
+            or not settings.google_vertex.project_id
+            or not settings.google_vertex.location
+        ):
+            raise ValueError("Google Vertex configuration requires 'emb_model', 'project_id', and 'location'.")
+        return VertexAIEmbeddings(
+            model_name=settings.google_vertex.emb_model,
+            project=settings.google_vertex.project_id,
+            location=settings.google_vertex.location,
+        )

-    if backend_type == EmbeddingBackend.aws:
-        return BedrockEmbeddings(model_id=os.environ["AWS_EMB_MODEL"])
+    if settings.emb_backend == EmbeddingBackend.aws:
+        if not settings.aws:
+            raise ValueError("AWS Bedrock embedding backend selected, but 'aws' configuration section is missing.")
+        if not settings.aws.emb_model or not settings.aws.region:
+            raise ValueError("AWS Bedrock configuration requires 'emb_model' and 'region'")
+        return BedrockEmbeddings(model_id=settings.aws.emb_model, region_name=settings.aws.region)

-    if backend_type == EmbeddingBackend.local:
-        return OllamaEmbeddings(model=os.environ["LOCAL_EMB_MODEL"])
+    if settings.emb_backend == EmbeddingBackend.local:
+        if not settings.local:
+            raise ValueError("Local embedding backend selected, but 'local' configuration section is missing.")
+        if not settings.local.emb_model:
+            raise ValueError("Local configuration requires 'emb_model'")
+        return OllamaEmbeddings(model=settings.local.emb_model)

-    if backend_type == EmbeddingBackend.huggingface:
-        return HuggingFaceEmbeddings(model_name=os.environ["HUGGINGFACE_EMB_MODEL"])
+    if settings.emb_backend == EmbeddingBackend.huggingface:
+        if not settings.huggingface:
+            raise ValueError(
+                "HuggingFace embedding backend selected, but 'huggingface' configuration section is missing."
+            )
+        if not settings.huggingface.emb_model:
+            raise ValueError("HuggingFace configuration requires 'emb_model'.")
+        return HuggingFaceEmbeddings(model_name=settings.huggingface.emb_model)

-    raise ValueError(f"Unknown backend type: {backend_type}")
+    raise ValueError(f"Unknown backend type: {settings.emb_backend}")


 def get_compression_model(model_name: str, vector_store: Chroma) -> BaseRetriever:
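Note: a minimal usage sketch (not part of the commit) of the two factory functions above, assuming the `local` backend from the sample config.yaml; with a cloud backend, the corresponding API key must also be present in the environment.

```python
from pathlib import Path

from generic_rag.backend.models import get_chat_model, get_embedding_model
from generic_rag.parsers.config import load_settings

# With chat_backend/emb_backend set to "local" in config.yaml, no API key is needed,
# but Ollama must be installed and the configured model pulled.
settings = load_settings(Path("config.yaml"))
chat_model = get_chat_model(settings)       # e.g. ChatOllama(model="llama3.1:8b")
embeddings = get_embedding_model(settings)  # e.g. OllamaEmbeddings(model="llama3.1:8b")
```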
generic_rag/parsers/config.py (new file, 176 lines)

@@ -0,0 +1,176 @@
+import yaml
+from pathlib import Path
+from typing import List, Optional
+from enum import Enum
+from pydantic import (
+    BaseModel,
+    Field,
+    ValidationError,
+)
+import sys
+
+
+class ChatBackend(str, Enum):
+    azure = "azure"
+    openai = "openai"
+    google_vertex = "google_vertex"
+    aws = "aws"
+    local = "local"
+    huggingface = "huggingface"
+
+    def __str__(self):
+        return self.value
+
+
+class EmbeddingBackend(str, Enum):
+    azure = "azure"
+    openai = "openai"
+    google_vertex = "google_vertex"
+    aws = "aws"
+    local = "local"
+    huggingface = "huggingface"
+
+    def __str__(self):
+        return self.value
+
+
+class AzureSettings(BaseModel):
+    """Azure specific settings."""
+
+    llm_endpoint: Optional[str] = None
+    llm_deployment_name: Optional[str] = None
+    llm_api_version: Optional[str] = None
+    emb_endpoint: Optional[str] = None
+    emb_deployment_name: Optional[str] = None
+    emb_api_version: Optional[str] = None
+
+
+class OpenAISettings(BaseModel):
+    """OpenAI specific settings."""
+
+    chat_model: Optional[str] = None
+    emb_model: Optional[str] = None
+
+
+class GoogleVertexSettings(BaseModel):
+    """Google Vertex specific settings."""
+
+    project_id: Optional[str] = None
+    location: Optional[str] = None
+    chat_model: Optional[str] = None
+    emb_model: Optional[str] = None
+
+
+class AwsSettings(BaseModel):
+    """AWS specific settings (e.g., for Bedrock)."""
+
+    chat_model: Optional[str] = None
+    emb_model: Optional[str] = None
+    region: Optional[str] = None
+
+
+class LocalSettings(BaseModel):
+    """Local backend specific settings (e.g., Ollama models)."""
+
+    chat_model: Optional[str] = None
+    emb_model: Optional[str] = None
+
+
+class HuggingFaceSettings(BaseModel):
+    """HuggingFace specific settings (if different from local embeddings)."""
+
+    chat_model: Optional[str] = None
+    emb_model: Optional[str] = None
+
+
+class PdfSettings(BaseModel):
+    """PDF processing settings."""
+
+    data: List[Path] = Field(default_factory=list)
+    unstructured: bool = Field(default=False)
+    chunk_size: int = Field(default=1000)
+    chunk_overlap: int = Field(default=200)
+    add_start_index: bool = Field(default=False)
+
+
+class WebSettings(BaseModel):
+    """Web data processing settings."""
+
+    data: List[str] = Field(default_factory=list)
+    chunk_size: int = Field(default=200)
+
+
+class ChromaDbSettings(BaseModel):
+    """Chroma DB settings."""
+
+    location: Path = Field(default=Path(".chroma_db"))
+    reset: bool = Field(default=False)
+
+
+class AppSettings(BaseModel):
+    """
+    Main application settings model.
+
+    Loads configuration from a YAML file using the structure defined
+    by the nested models.
+    """
+
+    # --- Top-level settings ---
+    chat_backend: ChatBackend = Field(default=ChatBackend.local)
+    emb_backend: EmbeddingBackend = Field(default=EmbeddingBackend.huggingface)
+    use_conditional_graph: bool = Field(default=False)
+
+    # --- Provider-specific settings ---
+    azure: Optional[AzureSettings] = None
+    openai: Optional[OpenAISettings] = None
+    google_vertex: Optional[GoogleVertexSettings] = None
+    aws: Optional[AwsSettings] = None
+    local: Optional[LocalSettings] = None
+    huggingface: Optional[HuggingFaceSettings] = None  # Separate HF config if needed
+
+    # --- Data processing settings ---
+    pdf: PdfSettings = Field(default_factory=PdfSettings)
+    web: WebSettings = Field(default_factory=WebSettings)
+    chroma_db: ChromaDbSettings = Field(default_factory=ChromaDbSettings)
+
+
+# --- Configuration Loading Function ---
+def load_settings(config_path: Path = Path("config.yaml")) -> AppSettings:
+    """
+    Loads settings from a YAML file and validates them using Pydantic models.
+
+    Args:
+        config_path: The path to the configuration YAML file.
+
+    Returns:
+        An instance of AppSettings containing the loaded configuration.
+
+    Raises:
+        FileNotFoundError: If the config file does not exist.
+        yaml.YAMLError: If the file is not valid YAML.
+        ValidationError: If the data in the file doesn't match the AppSettings model.
+    """
+    if not config_path.is_file():
+        print(f"Error: Configuration file not found at '{config_path}'", file=sys.stderr)
+        raise FileNotFoundError(f"Configuration file not found: {config_path}")
+
+    print(f"--- Loading settings from '{config_path}' ---")
+    try:
+        with open(config_path, "r", encoding="utf-8") as f:
+            config_data = yaml.safe_load(f)
+        if config_data is None:
+            config_data = {}
+
+        settings = AppSettings(**config_data)
+        print("--- Settings loaded and validated successfully ---")
+        return settings
+
+    except yaml.YAMLError as e:
+        print(f"Error parsing YAML file '{config_path}':\n {e}", file=sys.stderr)
+        raise
+    except ValidationError as e:
+        print(f"Error validating configuration from '{config_path}':\n{e}", file=sys.stderr)
+        raise
+    except Exception as e:
+        print(f"An unexpected error occurred while loading settings from '{config_path}': {e}", file=sys.stderr)
+        raise
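Note: a small illustrative sketch (not part of the commit): `AppSettings` can also be constructed directly from a dict, as `load_settings` does internally, which is handy in tests; omitted sections fall back to their defaults.

```python
from generic_rag.parsers.config import AppSettings, ChatBackend

# Build settings without a YAML file; nested dicts are coerced into the sub-models.
settings = AppSettings(chat_backend="local", local={"chat_model": "llama3.1:8b"})

assert settings.chat_backend is ChatBackend.local
assert settings.local.chat_model == "llama3.1:8b"
assert settings.pdf.chunk_size == 1000  # PdfSettings default
```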
@@ -87,7 +87,6 @@ def add_pdf_files(
     The PDF file will be parsed per page and split into chunks of text with the provided chunk size and overlap.
     """
     logger.info("Adding PDF files to the vector store.")
-
     pdf_files = get_all_local_pdf_files(file_paths)
     logger.info(f"Found {len(pdf_files)} PDF files to add to the vector store.")

@@ -101,7 +100,7 @@ def add_pdf_files(
     if len(new_pdfs) == 0:
         return

-    logger.info(f"{len(new_pdfs)} PDF's to add to the vector store.")
+    logger.info(f"{len(new_pdfs)} PDF(s) to add to the vector store.")

     loaded_document = []
     for file in new_pdfs: