# Hosting LLMs with vLLM on Odin
## How-to deploy models
Before you start, check `odin.capgemini.com/portainer/` (the trailing `/` is important!) to make sure that no other large LLM is already running. Otherwise, you will run out of VRAM.
1. Pick a model from Hugging Face (easiest).
2. Fill in the docker-compose template in this repo (Hugging Face model ID, service name, port, route, etc.); see the sketch after this list.
3. Run `docker-compose -f <compose-file> up -d`.
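For orientation, here is a minimal sketch of what a filled-in compose file could look like. It assumes the official `vllm/vllm-openai` image; the service name, port, volume path, and environment variable are placeholders, so treat the template in this repo as the source of truth.
```yaml
# Hypothetical example -- the template in this repo is authoritative.
services:
  qwen:                                      # placeholder service name
    image: vllm/vllm-openai:latest           # official vLLM OpenAI-compatible server
    command: --model Qwen/Qwen2.5-1.5B-Instruct
    ports:
      - "8000:8000"                          # placeholder port
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}   # only needed for gated models
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface  # cache weights between restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]            # give the container one GPU
    restart: unless-stopped
    # ...plus whatever labels the reverse proxy expects for the route (see the template)
```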
Setup might take some time. Check the docker logs (via the terminal or Portainer) to make sure your application is up and running, and check `odin.capgemini.com/dashboard` to make sure there are no issues with the reverse proxy.
## How to use models
You will have to be on the VPN or in the office to use Odin.
1. Check `odin.capgemini.com/portainer/` (the trailing `/` is important!) to see whether your model is running; if it is not, start the container.
2. Once the container is running, you can access the model with the OpenAI library. For the time being, you will have to use the plain `http://` URL (no HTTPS).
### Usage Example
#### Chat Completion
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://odin.capgemini.com/qwen/v1/",  # replace with your model's URL
)
chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are Qwen. You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke that involves llamas."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
)
print("Chat response:", chat_response.choices[0].message.content)
```
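Since the server is OpenAI-compatible, streaming works the same way as with the hosted OpenAI API; a minimal sketch, reusing the client and hypothetical base URL from above:
```python
# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Tell me a joke that involves llamas."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```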
#### Embeddings
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://odin.capgemini.com/mixed-bread/v1/",  # replace with your model's URL
)
response = client.embeddings.create(
    model="mixedbread-ai/mxbai-embed-large-v1",
    input="This is a test example.",
)
print(response.data[0].embedding)
```
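Embeddings are typically compared with cosine similarity; a small follow-on sketch, assuming `numpy` is installed and reusing the client above:
```python
import numpy as np

# Embed two sentences in one request and compare them.
resp = client.embeddings.create(
    model="mixedbread-ai/mxbai-embed-large-v1",
    input=["Llamas live in the Andes.", "Alpacas are related to llamas."],
)
a = np.array(resp.data[0].embedding)
b = np.array(resp.data[1].embedding)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity: {cosine:.3f}")
```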