Hosting LLMs with vLLM on Odin
How to deploy models
Before you start, check odin.capgemini.com/portainer/ (the trailing / is important!) to make sure that no other large LLM is already running. Otherwise, you will run out of VRAM.
- Pick a model from Hugging Face (easiest)
- Fill in the docker-compose template in this repo (Hugging Face model id, service name, port, route, etc.); a sketch of what such a file can look like follows below
- Run docker-compose -f <docker-file> up -d to start the service
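For orientation only, a stripped-down compose file might look like the following sketch. The service name, host port, and model id are placeholder values, and the reverse-proxy routing labels are omitted; always start from the actual template in this repo.

# Illustrative sketch only -- start from the template in this repo.
# Service name, host port, and model id are placeholders; the
# reverse-proxy routing labels are left out here.
services:
  qwen:
    image: vllm/vllm-openai:latest                # vLLM's OpenAI-compatible server
    command: --model Qwen/Qwen2.5-1.5B-Instruct   # Hugging Face model id
    ports:
      - "8001:8000"                               # host port : vLLM's default port
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}        # only needed for gated models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]                 # give the container GPU access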
Setup might take some time. Check the docker logs (via the terminal or Portainer) to make sure your application is up and running. Check odin.capgemini.com/dashboard to make sure there are no issues with the reverse proxy.
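Once the logs report that the server is ready, you can also sanity-check the endpoint from Python. The /qwen/ route below is an assumption; substitute the route you configured.

from openai import OpenAI

# Listing the served models is a quick end-to-end health check.
# The "qwen" route is an example; use the route from your compose file.
client = OpenAI(api_key="EMPTY", base_url="http://odin.capgemini.com/qwen/v1/")
print([m.id for m in client.models.list()])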
How to use models
You will have to be on the VPN or in the office to use Odin.
- Check odin.capgemini.com/portainer/ (the trailing / is important!) to see whether your model is running. Otherwise, start the container.
- Once the container is running, you can access the models with the OpenAI library. For the time being, you will have to use the http:// link.
Usage Examples
Chat Completion
from openai import OpenAI

# Point the client at the vLLM server behind the reverse proxy.
client = OpenAI(
    api_key="EMPTY",  # placeholder; the server does not require a real key
    base_url="http://odin.capgemini.com/qwen/v1/",  # replace with your URL
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are Qwen. You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke that involves llamas."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
)
print("Chat response:", chat_response.choices[0].message.content)
Embeddings
# The same OpenAI client works for embedding models; only the route changes.
client = OpenAI(
    api_key="EMPTY",
    base_url="http://odin.capgemini.com/mixed-bread/v1/",
)

response = client.embeddings.create(
    model="mixedbread-ai/mxbai-embed-large-v1",
    input="This is a test example.",
)
print(response.data[0].embedding)
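As a quick usage sketch (not part of the example above), the returned vectors can be compared with cosine similarity. This assumes numpy is installed and reuses the embeddings client from above.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms.
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The endpoint also accepts a list of inputs and returns one vector per input.
response = client.embeddings.create(
    model="mixedbread-ai/mxbai-embed-large-v1",
    input=["Llamas live in the Andes.", "GPUs accelerate matrix multiplication."],
)
vectors = [d.embedding for d in response.data]
print("similarity:", cosine_similarity(vectors[0], vectors[1]))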