Private LLM API with Qwen3 and vLLM

This guide covers deploying a private, OpenAI-compatible LLM endpoint using vLLM and the Qwen3 model family on Laboratory OS.

Prerequisites

Laboratory OS running with GPU access
Minimum 16 GB VRAM for the 30B-A3B MoE variant (8 GB for the 8B dense model)

Step 1: Install vLLM

Open the Apps panel from the Laboratory OS desktop and find vLLM. Click Install.

vLLM starts an OpenAI-compatible HTTP server. Once running, it is accessible at:

https://{your-slug}--vllm.tunnels.laboratory.computer

Step 2: Download a Qwen3 Model

Open the Model Library from your desktop and search for Qwen3.

Choose a variant based on your available VRAM:

Model	VRAM	Best for
Qwen3-1.7B	4 GB	Fast experiments
Qwen3-8B	8 GB	Everyday tasks
Qwen3-30B-A3B	16 GB	High-quality reasoning (MoE)
Qwen3-32B	40 GB+	Maximum quality
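
As a rough cross-check when choosing a variant, weight memory is approximately the parameter count times the bytes stored per parameter. The sketch below uses nominal parameter counts read off the model names and ignores KV cache and runtime overhead, so its results are lower bounds. The precisions shown are illustrative assumptions, not a statement of what the Model Library ships.

```python
# Back-of-envelope weight memory: parameters x bytes per parameter.
# Assumptions: nominal parameter counts taken from the model names;
# KV cache, activations, and runtime overhead are NOT included.

NOMINAL_PARAMS_BILLION = {
    "Qwen3-1.7B": 1.7,
    "Qwen3-8B": 8.0,
    "Qwen3-30B-A3B": 30.0,  # MoE: all experts must still fit in VRAM
}

def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8

def print_estimates() -> None:
    """Print weight-only estimates at 16-, 8-, and 4-bit precision."""
    for name, params in NOMINAL_PARAMS_BILLION.items():
        row = ", ".join(
            f"{bits}-bit: {weight_gb(params, bits):.1f} GB" for bits in (16, 8, 4)
        )
        print(f"{name}: {row}")

print_estimates()
```

For example, 30 billion parameters at 4 bits per weight come to about 15 GB, while the same weights at 16 bits would need about 60 GB.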

Click Download. The model is saved to the shared model directory, and vLLM detects it automatically, so it appears in the vLLM Model dropdown without further setup.

Step 3: Select the Model in vLLM

Open the vLLM app from your desktop. In the settings panel, select your downloaded model from the Model dropdown and click Apply. vLLM then loads the model, which takes 1–3 minutes depending on its size.

Once loading finishes, the API is live at:

https://{your-slug}--vllm.tunnels.laboratory.computer/v1
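
Because loading takes a minute or more, it can help to poll the endpoint until it reports a model. The sketch below uses only the Python standard library and the standard OpenAI-compatible GET /v1/models route. The slug value, helper names, and timeout values are placeholders chosen for illustration. The returned ID is the exact string to pass as "model" in later requests.

```python
import json
import time
import urllib.request

def models_url(slug: str) -> str:
    """Build the /v1/models URL from a Laboratory OS slug (placeholder value)."""
    return f"https://{slug}--vllm.tunnels.laboratory.computer/v1/models"

def parse_model_ids(payload: str) -> list[str]:
    """Extract model IDs from an OpenAI-style model list response."""
    return [entry["id"] for entry in json.loads(payload).get("data", [])]

def wait_for_models(slug: str, timeout_s: float = 300, interval_s: float = 5) -> list[str]:
    """Poll until the server lists at least one model or the timeout expires."""
    request = urllib.request.Request(
        models_url(slug), headers={"Authorization": "Bearer dummy"}
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                ids = parse_model_ids(response.read().decode("utf-8"))
            if ids:
                return ids
        except (OSError, ValueError):
            pass  # Server not reachable yet, or it returned non-JSON; retry.
        time.sleep(interval_s)
    raise TimeoutError(f"No model listed within {timeout_s} seconds")

# Example (replace "your-slug" with your own):
# print(wait_for_models("your-slug"))
```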

Step 4: Test with curl

curl https://{your-slug}--vllm.tunnels.laboratory.computer/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
    "model": "Qwen3-30B-A3B",
    "messages": [{"role": "user", "content": "Explain backpropagation in one paragraph."}],
    "max_tokens": 512
  }'

Note: Laboratory OS secures all app endpoints through its tunnel. The Authorization header is accepted but not validated, so any non-empty string works as the token.

Step 5: Use with the OpenAI SDK

The endpoint works with the OpenAI Python SDK. Point base_url at your tunnel URL:

from openai import OpenAI

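client = OpenAI(
    # Replace {your-slug} with your own Laboratory OS slug.
    base_url="https://{your-slug}--vllm.tunnels.laboratory.computer/v1",
    # Not validated by the server; any non-empty string works.
    api_key="dummy",
)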

response = client.chat.completions.create(
    model="Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Write a haiku about distributed systems."}],
)

print(response.choices[0].message.content)

This same pattern works with LangChain, LlamaIndex, Continue, and any other OpenAI-compatible client.
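
For longer generations, streaming lets you display text as it arrives instead of waiting for the full reply. The following is a minimal sketch using the SDK's stream=True option. stream_text is a hypothetical helper name, and the model ID is assumed to match the one used in the example above.

```python
from typing import Iterator

def stream_text(client, model: str, prompt: str) -> Iterator[str]:
    """Yield text fragments from a streamed chat completion.

    `client` is an openai.OpenAI instance configured with the tunnel base_url.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or no text (for example the role-only
        # first chunk or the final chunk), so skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# Usage, with the client created in the example above:
# for piece in stream_text(client, "Qwen3-30B-A3B", "Summarize the CAP theorem."):
#     print(piece, end="", flush=True)
```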

Step 6: Connect Open WebUI (optional)

If you want a chat interface:

Install Open WebUI from the Apps panel.
In Open WebUI settings, go to Admin → Connections → OpenAI API.
Set the base URL to your vLLM endpoint (the /v1 URL from Step 3) and enter any non-empty string as the API key.
Your Qwen3 model appears in the model selector.

Monitoring Throughput

The vLLM app dashboard shows real-time metrics:

Tokens/sec — generation throughput
KV Cache Usage — memory pressure indicator
Queue Depth — pending requests

Typical throughput on a single A100 80 GB: ~180 tok/s for Qwen3-30B-A3B.
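
To sanity-check the dashboard figure from the client side, you can divide the completion token count reported in the response's usage field by the wall-clock time of the request. This sketch measures a single non-streamed request, so it includes network and prompt-processing time and will usually read lower than the server-side number. measure_throughput is a hypothetical helper.

```python
import time

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generation throughput for one request."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s

def measure_throughput(client, model: str, prompt: str, max_tokens: int = 256) -> float:
    """Time one chat completion and return the observed tokens per second.

    `client` is an openai.OpenAI instance pointed at the vLLM endpoint.
    """
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    return tokens_per_second(response.usage.completion_tokens, elapsed)

# Usage, with an OpenAI client configured for your endpoint:
# print(measure_throughput(client, "Qwen3-30B-A3B", "Explain backpropagation."))
```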

Tips

Add the --enable-chunked-prefill flag in the vLLM settings to improve latency when several users send requests at once.
If you are close to the VRAM limit, use a quantized variant (AWQ or GPTQ). These reduce memory use by ~40% with minimal quality loss.
vLLM persists its configuration across container restarts as long as you use a volume mount.
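
To see whether a setting such as chunked prefill changes multi-user latency on your hardware, you can send several requests at once and compare per-request latencies before and after the change. The sketch below is a generic harness, not part of vLLM. send is any callable you supply that issues one request (for instance, a small wrapper around client.chat.completions.create), and the helper names are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median
from typing import Callable, Sequence

def timed_call(send: Callable[[str], object], prompt: str) -> float:
    """Return the wall-clock seconds taken by one request."""
    start = time.perf_counter()
    send(prompt)
    return time.perf_counter() - start

def concurrent_latencies(
    send: Callable[[str], object], prompts: Sequence[str], workers: int = 4
) -> list[float]:
    """Issue the prompts concurrently; return per-request latencies in prompt order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: timed_call(send, p), prompts))

def summarize(latencies: Sequence[float]) -> dict[str, float]:
    """Median and worst-case latency for a batch."""
    return {"median_s": median(latencies), "max_s": max(latencies)}

# Usage, with a send function of your own:
# print(summarize(concurrent_latencies(send, ["Hello"] * 8, workers=8)))
```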

What’s Next

Autonomous Coding Agent with OpenCode — point an agent at your new private API
Open WebUI + Ollama guide — a simpler LLM setup without an API server