Merge pull request 'setting-parser' (#32) from setting-parser into main

Reviewed-on: AI_team/generic-RAG-demo#32
This commit is contained in:
rubenl 2025-04-18 12:02:28 +02:00
commit ad60c9d52f
7 changed files with 524 additions and 168 deletions

README.md
View File

@ -1,14 +1,58 @@
# generic-RAG-demo
A Sogeti Nederland generic RAG demo
A generic Retrieval-Augmented Generation (RAG) demo from Sogeti Netherlands, built in Python. This project demonstrates how to integrate and run different backends, from cloud providers to local models, to parse and process your PDFs, web data, or other text sources.
## Table of Contents
- [generic-RAG-demo](#generic-rag-demo)
- [Table of Contents](#table-of-contents)
- [Features](#features)
- [Getting started](#getting-started)
- [Project Environment Setup](#project-environment-setup)
- [Installation of system dependencies](#installation-of-system-dependencies)
- [Unstructured PDF loader (optional)](#unstructured-pdf-loader-optional)
- [Local LLM (optional)](#local-llm-optional)
- [Running generic RAG demo](#running-generic-rag-demo)
- [config.yaml file](#configyaml-file)
- [.env file](#env-file)
- [Chainlit starters](#chainlit-starters)
- [Dev details](#dev-details)
- [Linting](#linting)
## Features
- **Multi-backend Support:** Easily switch between cloud-based and local LLMs.
- **Flexible Data Input:** Supports both PDFs and web data ingestion.
- **Configurable Workflows:** Customize settings via a central `config.yaml` file.
## Getting started
### Project Environment Setup
This project leverages a modern packaging method defined in `pyproject.toml`. After cloning the repository, you can install the project along with its dependencies. You have two options:
1. Using uv
If you're using uv, simply run:
```bash
uv sync  # creates the virtual environment and installs the project with its dependencies
```
2. Using a Python Virtual Environment
Alternatively, set up a virtual environment and install the project:
```bash
python -m venv .venv # Create a new virtual environment named ".venv"
source .venv/bin/activate # Activate the virtual environment (use ".venv\Scripts\activate" on Windows)
pip install . # Install the project and its dependencies
```
### Installation of system dependencies
Some optional features require additional system applications to be installed.
#### Unstructured PDF loader (optional)
If you would like to run the application using the unstructured PDF loader (`--unstructured-pdf` flag), you need to install two system dependencies.
If you would like to run the application using the unstructured PDF loader (`pdf.unstructured` setting), you need to install two system dependencies.
- [poppler-utils](https://launchpad.net/ubuntu/jammy/amd64/poppler-utils)
- [tesseract-ocr](https://github.com/tesseract-ocr/tesseract?tab=readme-ov-file#installing-tesseract)
@ -21,18 +65,19 @@ sudo apt install poppler-utils tesseract-ocr
#### Local LLM (optional)
If you would like to run the application using a local LLM backend (`-b local` flag), you need to install Ollama.
If you would like to run the application using a local LLM backend (`local` settings), you need to install Ollama.
```bash
curl -fsSL https://ollama.com/install.sh | sh # install Ollama
ollama pull llama3.1:8b # fetch and download the model
```
Include the downloaded model in the `.env` file:
Include the downloaded model in the `config.yaml` file:
```text
LOCAL_CHAT_MODEL="llama3.1:8b"
LOCAL_EMB_MODEL="llama3.1:8b"
```yaml
local:
chat_model: "llama3.1:8b"
emb_model: "llama3.1:8b"
```
>For more information on installing Ollama, please refer to the Langchain Local LLM documentation, specifically the [Quickstart section](https://python.langchain.com/docs/how_to/local_llms/#quickstart).
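As a quick sanity check that the pulled model is reachable through LangChain (the same `ChatOllama` wrapper the backend factory uses), a minimal sketch, assuming Ollama is running and the model above has been pulled:
```python
# Minimal sanity check for the local Ollama backend.
from langchain_ollama import ChatOllama

chat = ChatOllama(model="llama3.1:8b")  # should match local.chat_model in config.yaml
print(chat.invoke("Say hello in one short sentence.").content)
```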
@ -52,14 +97,19 @@ python generic_rag/app.py -p data # will work and parsers all pdf files in ./da
python generic_rag/app.py --help # will work and prints command line options
```
Please configure your `.env` file with your cloud provider (backend) of choice and set the `--backend` flag accordingly.
Please configure your `config.yaml` and `.env` files with your cloud provider (backend) of choice. See the sections below for more details.
### .env file
### config.yaml file
A .env file needs to be populated to configure API endpoints or local backends using environment variables.
Currently all required environment variables are defined in code at [backend/models.py](generic_rag/backend/models.py),
with the exception of the API key variables themselves.
More information about configuring API endpoints for langchain can be found at the following locations.
A `config.yaml` file is required to specify your API endpoints and local backends. Use the provided `config.yaml.example` as a starting point and update it to match your backend settings and project requirements.
Key configuration points include:
- **Chat backend:** choose among `azure`, `openai`, `google_vertex`, `aws`, or `local`.
- **Embedding backend:** configure the embedding models similarly (`huggingface` is also available for embeddings).
- **Data processing settings:** define PDF and web data sources, chunk sizes, and overlap.
- **Vector database:** customize the Chroma DB path and reset behavior.
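A minimal `config.yaml` covering these points might look like the sketch below; all values are placeholders, and the full annotated example added in this pull request remains the reference:
```yaml
chat_backend: local        # azure, openai, google_vertex, aws or local
emb_backend: local         # embedding backend, may differ from the chat backend
pdf:
  data:
    - "./data"             # folders or individual PDF files
  chunk_size: 1000
  chunk_overlap: 200
web:
  data:
    - "https://www.example.nl/subdomain"
  chunk_size: 200
chroma_db:
  location: ".chroma_db"
  reset: false
```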
For more information on configuring Langchain endpoints and models, please see:
- [langchain cloud chat model doc](https://python.langchain.com/docs/integrations/chat/)
- [langchain local chat model doc](https://python.langchain.com/docs/how_to/local_llms/)
@ -67,25 +117,13 @@ More information about configuring API endpoints for langchain can be found at t
> for local models we currently use Ollama
An `.env` example is as follows.
### .env file
Set the API keys for your chosen cloud provider (backend). This ensures that your application can authenticate and interact with the services.
```text
# only one backend (azure, google, local, etc.) is required. Please adjust the --backend flag accordingly
AZURE_OPENAI_API_KEY="<secret_key>"
AZURE_LLM_ENDPOINT="https://<project_hub>.openai.azure.com"
AZURE_LLM_DEPLOYMENT_NAME="gpt-4"
AZURE_LLM_API_VERSION="2025-01-01-preview"
AZURE_EMB_ENDPOINT="https://<project_hub>.openai.azure.com"
AZURE_EMB_DEPLOYMENT_NAME="text-embedding-3-large"
AZURE_EMB_API_VERSION="2023-05-15"
LOCAL_CHAT_MODEL="llama3.1:8b"
LOCAL_EMB_MODEL="llama3.1:8b"
# Google Vertex AI does not use API keys but a separate authentication method
GOOGLE_GENAI_CHAT_MODEL="gemini-2.0-flash"
GOOGLE_GENAI_EMB_MODEL="models/text-embedding-004"
AZURE_OPENAI_API_KEY=your_azure_api_key
OPENAI_API_KEY=your_openai_api_key
```
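The backend factories in [backend/models.py](generic_rag/backend/models.py) only check `os.environ`, so these keys must end up in the process environment. A minimal sketch of loading them from `.env` before the app starts, assuming the `python-dotenv` package is available (it may already be pulled in by your setup):
```python
# Hypothetical startup snippet: load .env into os.environ before the backends are built.
import os
from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # reads .env from the current working directory
if "AZURE_OPENAI_API_KEY" not in os.environ and "OPENAI_API_KEY" not in os.environ:
    print("No cloud API key found; only the local backend will work.")
```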
### Chainlit starters
@ -102,4 +140,4 @@ CHAINLIT_STARTERS=[{"label":"Label 1","message":"Message one."},{"label":"Label
### Linting
Currently [Ruff](https://github.com/astral-sh/ruff) is used as Python linter. It is included in the [pyproject.toml](pyproject.toml) as `dev` dependency if your IDE needs that. However, for VS Code a [Ruff extension](https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff) excists.
Currently [Ruff](https://github.com/astral-sh/ruff) is used as the Python linter. It is included in [pyproject.toml](pyproject.toml) as a `dev` dependency in case your IDE needs it. For VS Code, a [Ruff extension](https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff) exists.
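Ruff can also be run directly from the command line; a typical invocation (generic Ruff flags, no project-specific configuration assumed):
```bash
ruff check .        # lint the whole repository
ruff check --fix .  # apply safe autofixes
```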

config.yaml Normal file
View File

@ -0,0 +1,61 @@
# Define your application settings here.
chat_backend: local # Select the primary chat backend (azure, openai, google_vertex, aws, local)
emb_backend: local # Select the primary embedding backend (azure, openai, google_vertex, aws, local, huggingface)
use_conditional_graph: false # Use a conditional RAG model with historical chat context, or a non-conditional model without access to the current conversation
# --- Provider Specific Settings ---
azure:
llm_endpoint: "https://example.openai.azure.com"
llm_deployment_name: "gpt-4o-mini"
llm_api_version: "2025-01-01-preview"
emb_endpoint: "https://example.openai.azure.com"
emb_deployment_name: "text-embedding-3-large"
emb_api_version: "2023-05-15"
openai:
chat_model: "gpt-4o-mini"
emb_model: "text-embedding-3-large"
google_vertex:
project_id: "your_gcp_project_id"
location: "europe-west4"
chat_model: "gemini-pro"
emb_model: "textembedding-gecko@001"
aws:
chat_model: "amazon.titan-llm-v1"
emb_model: "amazon.titan-embed-text-v1"
region: "us-east-1"
local: # Settings for local models (e.g., Ollama)
chat_model: "llama3.1:8b"
emb_model: "llama3.1:8b"
huggingface:
chat_model: "meta-llama/Llama-2-7b-chat-hf"
emb_model: "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
# --- Data Processing Settings ---
pdf:
# List of paths to PDF files or folders containing PDFs.
# Pydantic converts these strings to pathlib.Path objects.
data:
- "C:/path/folder"
unstructured: false # Use the unstructured PDF loader?
chunk_size: 1000
chunk_overlap: 200
add_start_index: false
web:
# List of URLs to scrape for data.
data:
- "https://www.example.nl/subdomain"
chunk_size: 200
chroma_db:
location: "/app/data/vector_database" # Override default DB path (default: '.chroma_db')
reset: false # Reset the database on startup? (default: false)
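These values are parsed and validated by the Pydantic models in `generic_rag/parsers/config.py`, added further down in this pull request. A minimal sketch of loading the file outside the app, assuming it sits in the repository root:
```python
# Sketch: load and inspect the validated settings.
from pathlib import Path

from generic_rag.parsers.config import load_settings

settings = load_settings(Path("config.yaml"))
print(settings.chat_backend, settings.emb_backend)
print(settings.chroma_db.location, settings.pdf.chunk_size)
```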

generic_rag/app.py
View File

@ -3,18 +3,14 @@ import json
import logging
import os
from pathlib import Path
import sys
import chainlit as cl
from chainlit.cli import run_chainlit
from langchain_chroma import Chroma
from generic_rag.backend.models import (
ChatBackend,
EmbeddingBackend,
get_chat_model,
get_embedding_model,
get_compression_model,
)
from generic_rag.parsers.config import AppSettings, load_settings
from generic_rag.backend.models import get_chat_model, get_embedding_model, get_compression_model
from generic_rag.graphs.cond_ret_gen import CondRetGenLangGraph
from generic_rag.graphs.ret_gen import RetGenLangGraph
from generic_rag.parsers.parser import add_pdf_files, add_urls
@ -22,6 +18,8 @@ from generic_rag.parsers.parser import add_pdf_files, add_urls
logger = logging.getLogger("sogeti-rag")
logger.setLevel(logging.DEBUG)
PROJECT_ROOT = Path(__file__).resolve().parent.parent
system_prompt = (
"You are an assistant for question-answering tasks. "
"If the question is in Dutch, answer in Dutch. If the question is in English, answer in English."
@ -29,85 +27,45 @@ system_prompt = (
"If you don't know the answer, say that you don't know."
)
parser = argparse.ArgumentParser(description="A Sogeti Nederland Generic RAG demo.")
parser = argparse.ArgumentParser(description="A Sogeti Netherlands Generic RAG demo.")
parser.add_argument(
"-c",
"--chat-backend",
type=ChatBackend,
choices=list(ChatBackend),
default=ChatBackend.local,
help="Cloud provider or local LLM to use as backend. In the case of 'local', Ollama needs to be installed.",
)
parser.add_argument(
"-e",
"--emb-backend",
type=EmbeddingBackend,
choices=list(EmbeddingBackend),
default=EmbeddingBackend.huggingface,
help="Cloud provider or local embedding to use as backend. In the case of 'local', Ollama needs to be installed. ",
)
parser.add_argument(
"-p",
"--pdf-data",
"--config",
type=Path,
nargs="+",
default=[],
help="One or multiple paths to folders or files to use for retrieval. "
"If a path is a folder, all files in the folder will be used. "
"If a path is a file, only that file will be used. "
"If the path is relative it will be relative to the current working directory.",
)
parser.add_argument(
"-u",
"--unstructured-pdf",
action="store_true",
help="Use an unstructered PDF loader. "
"An unstructured PDF loader might be usefull for PDF files "
"that contain a lot of images with text, tables or (scanned) text as images. "
"Please use '-r' when switching parsers on already indexed data.",
)
parser.add_argument("--pdf-chunk_size", type=int, default=1000, help="The size of the chunks to split the text into.")
parser.add_argument("--pdf-chunk_overlap", type=int, default=200, help="The overlap between the chunks.")
parser.add_argument(
"--pdf-add-start-index", action="store_true", help="Add the start index to the metadata of the chunks."
)
parser.add_argument(
"-w", "--web-data", type=str, nargs="*", default=[], help="One or multiple URLs to use for retrieval."
)
parser.add_argument("--web-chunk-size", type=int, default=200, help="The size of the chunks to split the text into.")
parser.add_argument(
"-d",
"--chroma-db-location",
type=Path,
default=Path(".chroma_db"),
help="File path to store or load a Chroma DB from/to.",
)
parser.add_argument("-r", "--reset-chrome-db", action="store_true", help="Reset the Chroma DB.")
parser.add_argument(
"--use-conditional-graph",
action="store_true",
help="Use the conditial retrieve generate graph over the regular retrieve generate graph.",
default=PROJECT_ROOT / "config.yaml",
help="Path to configuration file (YAML format). Defaults to 'config.yaml' in project root.",
)
args = parser.parse_args()
try:
settings: AppSettings = load_settings(args.config)
except Exception as e:
logger.error(f"Failed to load configuration from {args.config}: {e}. Exiting.")
sys.exit(1)
embedding_function = get_embedding_model(settings)
chat_function = get_chat_model(settings)
vector_store = Chroma(
collection_name="generic_rag",
embedding_function=get_embedding_model(args.emb_backend),
persist_directory=str(args.chroma_db_location),
embedding_function=embedding_function,
persist_directory=str(settings.chroma_db.location),
)
if args.use_conditional_graph:
if settings.use_conditional_graph:
graph = CondRetGenLangGraph(
vector_store=vector_store,
chat_model=get_chat_model(args.chat_backend),
embedding_model=get_embedding_model(args.emb_backend),
chat_model=chat_function,
embedding_model=embedding_function,
system_prompt=system_prompt,
)
else:
graph = RetGenLangGraph(
vector_store=vector_store,
chat_model=get_chat_model(args.chat_backend),
embedding_model=get_embedding_model(args.emb_backend),
chat_model=chat_function,
embedding_model=embedding_function,
system_prompt=system_prompt,
compression_model=get_compression_model(
"BAAI/bge-reranker-base", vector_store
@ -170,17 +128,21 @@ async def set_starters():
if __name__ == "__main__":
if args.reset_chrome_db:
if settings.chroma_db.reset:
vector_store.reset_collection()
add_pdf_files(
vector_store,
args.pdf_data,
args.pdf_chunk_size,
args.pdf_chunk_overlap,
args.pdf_add_start_index,
args.unstructured_pdf,
settings.pdf.data,
settings.pdf.chunk_size,
settings.pdf.chunk_overlap,
settings.pdf.add_start_index,
settings.pdf.unstructured,
)
add_urls(
vector_store,
settings.web.data,
settings.web.chunk_size,
)
add_urls(vector_store, args.web_data, args.web_chunk_size)
run_chainlit(__file__)
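With the command line reduced to a single `--config` option, a run with a non-default configuration would look like the following sketch (the path is a placeholder; omitting the flag falls back to `config.yaml` in the project root):
```bash
python generic_rag/app.py --config ./configs/azure.yaml
```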

generic_rag/backend/models.py
View File

@ -1,93 +1,213 @@
import logging
import os
from enum import Enum
from langchain_chroma import Chroma
from langchain.chat_models import init_chat_model
from langchain_aws import BedrockEmbeddings
from langchain_core.embeddings import Embeddings
from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.retrievers import BaseRetriever
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_aws import BedrockEmbeddings, ChatBedrock
from langchain_google_vertexai import VertexAIEmbeddings, ChatVertexAI
from langchain_huggingface import HuggingFaceEmbeddings, ChatHuggingFace, HuggingFacePipeline
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings, OpenAIEmbeddings
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings, ChatOpenAI, OpenAIEmbeddings
from generic_rag.parsers.config import AppSettings, ChatBackend, EmbeddingBackend
logger = logging.getLogger(__name__)
class ChatBackend(Enum):
azure = "azure"
openai = "openai"
google_vertex = "google_vertex"
aws = "aws"
local = "local"
def get_chat_model(settings: AppSettings) -> BaseChatModel:
"""
Initializes and returns a chat model based on the backend type and configuration.
# make the enum pretty printable for argparse
def __str__(self):
return self.value
Args:
settings: The loaded AppSettings object containing configurations.
Returns:
An instance of BaseChatModel.
class EmbeddingBackend(Enum):
azure = "azure"
openai = "openai"
google_vertex = "google_vertex"
aws = "aws"
local = "local"
huggingface = "huggingface"
Raises:
ValueError: If the backend type is unknown or required configuration is missing.
"""
logger.info(f"Initializing chat model for backend: {settings.chat_backend.value}")
# make the enum pretty printable for argparse
def __str__(self):
return self.value
def get_chat_model(backend_type: ChatBackend) -> BaseChatModel:
if backend_type == ChatBackend.azure:
if settings.chat_backend == ChatBackend.azure:
if not settings.azure:
raise ValueError("Azure chat backend selected, but 'azure' configuration section is missing in config.")
if (
not settings.azure.llm_endpoint
or not settings.azure.llm_deployment_name
or not settings.azure.llm_api_version
):
raise ValueError(
"Azure configuration requires 'llm_endpoint', 'llm_deployment_name', and 'llm_api_version'."
)
if "AZURE_OPENAI_API_KEY" not in os.environ:
raise ValueError(
"The environment variable 'AZURE_OPENAI_API_KEY' is missing. Please set the variable in your '.env' file before running the script."
)
return AzureChatOpenAI(
azure_endpoint=os.environ["AZURE_LLM_ENDPOINT"],
azure_deployment=os.environ["AZURE_LLM_DEPLOYMENT_NAME"],
openai_api_version=os.environ["AZURE_LLM_API_VERSION"],
azure_endpoint=settings.azure.llm_endpoint,
azure_deployment=settings.azure.llm_deployment_name,
openai_api_version=settings.azure.llm_api_version,
)
if backend_type == ChatBackend.openai:
return init_chat_model(os.environ["OPENAI_CHAT_MODEL"], model_provider="openai")
if settings.chat_backend == ChatBackend.openai:
if not settings.openai:
raise ValueError("OpenAI chat backend selected, but 'openai' configuration section is missing.")
if not settings.openai.chat_model:
raise ValueError("OpenAI configuration requires 'chat_model'.")
if "OPENAI_API_KEY" not in os.environ:
raise ValueError(
"The environment variable 'OPENAI_API_KEY' is missing. Please set the variable in your '.env' file before running the script."
)
return ChatOpenAI(model=settings.openai.chat_model)
if backend_type == ChatBackend.google_vertex:
return init_chat_model(os.environ["GOOGLE_CHAT_MODEL"], model_provider="google_vertexai")
if settings.chat_backend == ChatBackend.google_vertex:
if not settings.google_vertex:
raise ValueError(
"Google Vertex chat backend selected, but 'google_vertex' configuration section is missing."
)
if (
not settings.google_vertex.chat_model
or not settings.google_vertex.project_id
or not settings.google_vertex.location
):
raise ValueError("Google Vertex configuration requires 'chat_model', 'project_id' and 'location'.")
return ChatVertexAI(
model_name=settings.google_vertex.chat_model,
project=settings.google_vertex.project_id,
location=settings.google_vertex.location,
)
if backend_type == ChatBackend.aws:
return init_chat_model(model=os.environ["AWS_CHAT_MODEL"], model_provider="bedrock_converse")
if settings.chat_backend == ChatBackend.aws:
if not settings.aws:
raise ValueError("AWS Bedrock chat backend selected, but 'aws' configuration section is missing.")
if not settings.aws.chat_model or not settings.aws.region:
raise ValueError("AWS Bedrock configuration requires 'chat_model' and 'region'.")
return ChatBedrock(
model_id=settings.aws.chat_model,
region_name=settings.aws.region,
)
if backend_type == ChatBackend.local:
return ChatOllama(model=os.environ["LOCAL_CHAT_MODEL"])
if settings.chat_backend == ChatBackend.local:
if not settings.local:
raise ValueError("Local chat backend selected, but 'local' configuration section is missing.")
if not settings.local.chat_model:
raise ValueError("Local configuration requires 'chat_model'")
return ChatOllama(model=settings.local.chat_model)
raise ValueError(f"Unknown backend type: {backend_type}")
if settings.chat_backend == ChatBackend.huggingface:
if not settings.huggingface:
raise ValueError("Huggingface chat backend selected, but 'huggingface' configuration section is missing.")
if not settings.huggingface.chat_model:
raise ValueError("Huggingface configuration requires 'chat_model'")
llm = HuggingFacePipeline.from_model_id(
model_id=settings.huggingface.chat_model,
task="text-generation",
pipeline_kwargs=dict(
max_new_tokens=512,
do_sample=False,
repetition_penalty=1.03,
),
)
return ChatHuggingFace(llm=llm)
# This should not be reached if all Enum members are handled
raise ValueError(f"Unknown or unhandled chat backend type: {settings.chat_backend}")
def get_embedding_model(backend_type: EmbeddingBackend) -> Embeddings:
if backend_type == EmbeddingBackend.azure:
def get_embedding_model(settings: AppSettings) -> Embeddings:
"""
Initializes and returns an embedding model based on the backend type and configuration.
Args:
settings: The loaded AppSettings object containing configurations.
Returns:
An instance of Embeddings.
Raises:
ValueError: If the backend type is unknown or required configuration is missing.
"""
logger.info(f"Initializing embedding model for backend: {settings.emb_backend.value}")
if settings.emb_backend == EmbeddingBackend.azure:
if not settings.azure:
raise ValueError("Azure embedding backend selected, but 'azure' configuration section is missing.")
if (
not settings.azure.emb_endpoint
or not settings.azure.emb_deployment_name
or not settings.azure.emb_api_version
):
raise ValueError(
"Azure configuration requires 'emb_endpoint', 'emb_deployment_name', and 'emb_api_version'."
)
if "AZURE_OPENAI_API_KEY" not in os.environ:
raise ValueError(
"The environment variable 'AZURE_OPENAI_API_KEY' is missing. Please set the variable in your '.env' file before running the script."
)
return AzureOpenAIEmbeddings(
azure_endpoint=os.environ["AZURE_EMB_ENDPOINT"],
azure_deployment=os.environ["AZURE_EMB_DEPLOYMENT_NAME"],
openai_api_version=os.environ["AZURE_EMB_API_VERSION"],
azure_endpoint=settings.azure.emb_endpoint,
azure_deployment=settings.azure.emb_deployment_name,
openai_api_version=settings.azure.emb_api_version,
)
if backend_type == EmbeddingBackend.openai:
return OpenAIEmbeddings(model=os.environ["OPENAI_EMB_MODEL"])
if settings.emb_backend == EmbeddingBackend.openai:
if not settings.openai:
raise ValueError("OpenAI embedding backend selected, but 'openai' configuration section is missing.")
if not settings.openai.emb_model:
raise ValueError("OpenAI configuration requires 'emb_model'.")
if "OPENAI_API_KEY" not in os.environ:
raise ValueError(
"The environment variable 'OPENAI_API_KEY' is missing. Please set the variable in your '.env' file before running the script."
)
return OpenAIEmbeddings(model=settings.openai.emb_model)
if backend_type == EmbeddingBackend.google_vertex:
return VertexAIEmbeddings(model=os.environ["GOOGLE_EMB_MODEL"])
if settings.emb_backend == EmbeddingBackend.google_vertex:
if not settings.google_vertex:
raise ValueError(
"Google Vertex embedding backend selected, but 'google_vertex' configuration section is missing."
)
if (
not settings.google_vertex.emb_model
or not settings.google_vertex.project_id
or not settings.google_vertex.location
):
raise ValueError("Google Vertex configuration requires 'emb_model', 'project_id', and 'location'.")
return VertexAIEmbeddings(
model_name=settings.google_vertex.emb_model,
project=settings.google_vertex.project_id,
location=settings.google_vertex.location,
)
if backend_type == EmbeddingBackend.aws:
return BedrockEmbeddings(model_id=os.environ["AWS_EMB_MODEL"])
if settings.emb_backend == EmbeddingBackend.aws:
if not settings.aws:
raise ValueError("AWS Bedrock embedding backend selected, but 'aws' configuration section is missing.")
if not settings.aws.emb_model or not settings.aws.region:
raise ValueError("AWS Bedrock configuration requires 'emb_model' and 'region'")
return BedrockEmbeddings(model_id=settings.aws.emb_model, region_name=settings.aws.region)
if backend_type == EmbeddingBackend.local:
return OllamaEmbeddings(model=os.environ["LOCAL_EMB_MODEL"])
if settings.emb_backend == EmbeddingBackend.local:
if not settings.local:
raise ValueError("Local embedding backend selected, but 'local' configuration section is missing.")
if not settings.local.emb_model:
raise ValueError("Local configuration requires 'emb_model'")
return OllamaEmbeddings(model=settings.local.emb_model)
if backend_type == EmbeddingBackend.huggingface:
return HuggingFaceEmbeddings(model_name=os.environ["HUGGINGFACE_EMB_MODEL"])
if settings.emb_backend == EmbeddingBackend.huggingface:
if not settings.huggingface:
raise ValueError(
"HuggingFace embedding backend selected, but 'huggingface' configuration section is missing."
)
if not settings.huggingface.emb_model:
raise ValueError("HuggingFace configuration requires 'emb_model'.")
return HuggingFaceEmbeddings(model_name=settings.huggingface.emb_model)
raise ValueError(f"Unknown backend type: {backend_type}")
raise ValueError(f"Unknown backend type: {settings.backend_type}")
def get_compression_model(model_name: str, vector_store: Chroma) -> BaseRetriever:

generic_rag/parsers/config.py Normal file
View File

@ -0,0 +1,176 @@
import sys
from enum import Enum
from pathlib import Path
from typing import List, Optional

import yaml
from pydantic import BaseModel, Field, ValidationError
class ChatBackend(str, Enum):
azure = "azure"
openai = "openai"
google_vertex = "google_vertex"
aws = "aws"
local = "local"
huggingface = "huggingface"
def __str__(self):
return self.value
class EmbeddingBackend(str, Enum):
azure = "azure"
openai = "openai"
google_vertex = "google_vertex"
aws = "aws"
local = "local"
huggingface = "huggingface"
def __str__(self):
return self.value
class AzureSettings(BaseModel):
"""Azure specific settings."""
llm_endpoint: Optional[str] = None
llm_deployment_name: Optional[str] = None
llm_api_version: Optional[str] = None
emb_endpoint: Optional[str] = None
emb_deployment_name: Optional[str] = None
emb_api_version: Optional[str] = None
class OpenAISettings(BaseModel):
"""OpenAI specific settings."""
chat_model: Optional[str] = None
emb_model: Optional[str] = None
class GoogleVertexSettings(BaseModel):
"""Google Vertex specific settings."""
project_id: Optional[str] = None
location: Optional[str] = None
chat_model: Optional[str] = None
emb_model: Optional[str] = None
class AwsSettings(BaseModel):
"""AWS specific settings (e.g., for Bedrock)."""
chat_model: Optional[str] = None
emb_model: Optional[str] = None
region: Optional[str] = None
class LocalSettings(BaseModel):
"""Local backend specific settings (e.g., Ollama models)."""
chat_model: Optional[str] = None
emb_model: Optional[str] = None
class HuggingFaceSettings(BaseModel):
"""HuggingFace specific settings (if different from local embeddings)."""
chat_model: Optional[str] = None
emb_model: Optional[str] = None
class PdfSettings(BaseModel):
"""PDF processing settings."""
data: List[Path] = Field(default_factory=list)
unstructured: bool = Field(default=False)
chunk_size: int = Field(default=1000)
chunk_overlap: int = Field(default=200)
add_start_index: bool = Field(default=False)
class WebSettings(BaseModel):
"""Web data processing settings."""
data: List[str] = Field(default_factory=list)
chunk_size: int = Field(default=200)
class ChromaDbSettings(BaseModel):
"""Chroma DB settings."""
location: Path = Field(default=Path(".chroma_db"))
reset: bool = Field(default=False)
class AppSettings(BaseModel):
"""
Main application settings model.
Loads configuration from a YAML file using the structure defined
by the nested models.
"""
# --- Top-level settings ---
chat_backend: ChatBackend = Field(default=ChatBackend.local)
emb_backend: EmbeddingBackend = Field(default=EmbeddingBackend.huggingface)
use_conditional_graph: bool = Field(default=False)
# --- Provider-specific settings ---
azure: Optional[AzureSettings] = None
openai: Optional[OpenAISettings] = None
google_vertex: Optional[GoogleVertexSettings] = None
aws: Optional[AwsSettings] = None
local: Optional[LocalSettings] = None
huggingface: Optional[HuggingFaceSettings] = None # Separate HF config if needed
# --- Data processing settings ---
pdf: PdfSettings = Field(default_factory=PdfSettings)
web: WebSettings = Field(default_factory=WebSettings)
chroma_db: ChromaDbSettings = Field(default_factory=ChromaDbSettings)
# --- Configuration Loading Function ---
def load_settings(config_path: Path = Path("config.yaml")) -> AppSettings:
"""
Loads settings from a YAML file and validates them using Pydantic models.
Args:
config_path: The path to the configuration YAML file.
Returns:
An instance of AppSettings containing the loaded configuration.
Raises:
FileNotFoundError: If the config file does not exist.
yaml.YAMLError: If the file is not valid YAML.
ValidationError: If the data in the file doesn't match the AppSettings model.
"""
if not config_path.is_file():
print(f"Error: Configuration file not found at '{config_path}'", file=sys.stderr)
raise FileNotFoundError(f"Configuration file not found: {config_path}")
print(f"--- Loading settings from '{config_path}' ---")
try:
with open(config_path, "r", encoding="utf-8") as f:
config_data = yaml.safe_load(f)
if config_data is None:
config_data = {}
settings = AppSettings(**config_data)
print("--- Settings loaded and validated successfully ---")
return settings
except yaml.YAMLError as e:
print(f"Error parsing YAML file '{config_path}':\n {e}", file=sys.stderr)
raise
except ValidationError as e:
print(f"Error validating configuration from '{config_path}':\n{e}", file=sys.stderr)
raise
except Exception as e:
print(f"An unexpected error occurred while loading settings from '{config_path}': {e}", file=sys.stderr)
raise
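Because every section either has defaults or is optional, an empty configuration also validates; a small sketch illustrating the fallback behaviour of the models above:
```python
# Sketch: constructing AppSettings without any sections falls back to the defaults.
from generic_rag.parsers.config import AppSettings, ChatBackend

settings = AppSettings()
assert settings.chat_backend == ChatBackend.local
assert settings.pdf.chunk_size == 1000
assert settings.chroma_db.reset is False
```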

generic_rag/parsers/parser.py
View File

@ -87,7 +87,6 @@ def add_pdf_files(
The PDF file will be parsed per page and split into chunks of text with the provided chunk size and overlap.
"""
logger.info("Adding PDF files to the vector store.")
pdf_files = get_all_local_pdf_files(file_paths)
logger.info(f"Found {len(pdf_files)} PDF files to add to the vector store.")
@ -101,7 +100,7 @@ def add_pdf_files(
if len(new_pdfs) == 0:
return
logger.info(f"{len(new_pdfs)} PDF's to add to the vector store.")
logger.info(f"{len(new_pdfs)} PDF(s) to add to the vector store.")
loaded_document = []
for file in new_pdfs: