Hosting LLMs with vLLM on Odin
How to deploy models
Before you start, check odin.capgemini.com/portainer/ (the trailing / is important!) to make sure that no other large LLM is already running. Otherwise, you will run out of VRAM.
- Pick a model from Hugging Face (easiest)
- Fill in the docker-compose template in this repo (Hugging Face model id, service name, port, route, etc.); a sketch of what such a file can look like follows below
- Run docker-compose -f <docker-file> up -d to start the service
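For orientation only, a stripped-down compose file might look like the following sketch. The service name, host port, and model id are placeholder values, and the reverse-proxy routing labels are omitted; always start from the actual template in this repo.

# Illustrative sketch only -- start from the template in this repo.
# Service name, host port, and model id are placeholders; the
# reverse-proxy routing labels are left out here.
services:
  qwen:
    image: vllm/vllm-openai:latest                # vLLM's OpenAI-compatible server
    command: --model Qwen/Qwen2.5-1.5B-Instruct   # Hugging Face model id
    ports:
      - "8001:8000"                               # host port : vLLM's default port
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}        # only needed for gated models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]                 # give the container GPU access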
Setup might take some time. Check the docker logs (via the terminal or Portainer) to make sure your application is up and running. Check odin.capgemini.com/dashboard to make sure there are no issues with the reverse proxy.
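Once the logs report that the server is ready, you can also sanity-check the endpoint from Python. The /qwen/ route below is an assumption; substitute the route you configured.

from openai import OpenAI

# Listing the served models is a quick end-to-end health check.
# The "qwen" route is an example; use the route from your compose file.
client = OpenAI(api_key="EMPTY", base_url="http://odin.capgemini.com/qwen/v1/")
print([m.id for m in client.models.list()])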
How to use models
You will have to be on the VPN or in the office to use Odin.
- Check odin.capgemini.com/portainer/ (the trailing / is important!) to see whether your model is running. Otherwise, start the container.
- Once the container is running, you can access the models with the OpenAI library. For the time being, you will have to use the http:// link.
Usage Examples
Chat Completion
from openai import OpenAI

# Point the client at the vLLM server behind the reverse proxy.
client = OpenAI(
    api_key="EMPTY",  # placeholder; the server does not require a real key
    base_url="http://odin.capgemini.com/qwen/v1/",  # replace with your URL
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are Qwen. You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke that involves llamas."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
)
print("Chat response:", chat_response.choices[0].message.content)
Embeddings
# The same OpenAI client works for embedding models; only the route changes.
client = OpenAI(
    api_key="EMPTY",
    base_url="http://odin.capgemini.com/mixed-bread/v1/",
)

response = client.embeddings.create(
    model="mixedbread-ai/mxbai-embed-large-v1",
    input="This is a test example.",
)
print(response.data[0].embedding)
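As a quick usage sketch (not part of the example above), the returned vectors can be compared with cosine similarity. This assumes numpy is installed and reuses the embeddings client from above.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms.
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The endpoint also accepts a list of inputs and returns one vector per input.
response = client.embeddings.create(
    model="mixedbread-ai/mxbai-embed-large-v1",
    input=["Llamas live in the Andes.", "GPUs accelerate matrix multiplication."],
)
vectors = [d.embedding for d in response.data]
print("similarity:", cosine_similarity(vectors[0], vectors[1]))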