Private LLM API with Qwen3 and vLLM
This guide covers deploying a private, OpenAI-compatible LLM endpoint using vLLM and the Qwen3 model family on Laboratory OS.
Prerequisites
- Laboratory OS running with GPU access
- Minimum 16 GB VRAM for the 30B-A3B MoE variant (8 GB for the 8B dense model)
Step 1: Install vLLM
Open the Apps panel from the Laboratory OS desktop and find vLLM. Click Install.
vLLM starts an OpenAI-compatible HTTP server. Once running, it is accessible at:
https://{your-slug}--vllm.tunnels.laboratory.computer
Step 2: Download a Qwen3 Model
Open the Model Library from your desktop and search for Qwen3.
Choose a variant based on your available VRAM:
| Model | VRAM | Best for |
|---|---|---|
| Qwen3-1.7B | 4 GB | Fast experiments |
| Qwen3-8B | 8 GB | Everyday tasks |
| Qwen3-30B-A3B | 16 GB | High-quality reasoning (MoE) |
| Qwen3-72B | 40 GB+ | Maximum quality |
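
The selection rule in the table can be sketched in a few lines. This is only an illustration: the names `VARIANTS` and `pick_variant` are hypothetical, the VRAM figures are copied from the table above, and "40 GB+" is treated as a 40 GB lower bound.

```python
from typing import Optional

# (model name, minimum VRAM in GB), copied from the table above.
# Ordered from smallest to largest requirement.
VARIANTS = [
    ("Qwen3-1.7B", 4),
    ("Qwen3-8B", 8),
    ("Qwen3-30B-A3B", 16),
    ("Qwen3-72B", 40),  # listed as "40 GB+", so 40 is a lower bound
]


def pick_variant(available_vram_gb: float) -> Optional[str]:
    """Return the largest listed variant that fits, or None if none fits."""
    fitting = [name for name, need in VARIANTS if need <= available_vram_gb]
    return fitting[-1] if fitting else None
```

For example, `pick_variant(24)` selects the 30B-A3B row, since 24 GB clears the 16 GB requirement but not the 40 GB one.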
Click Download. The model is saved to the shared model directory and picked up by vLLM automatically.
Step 3: Select the Model in vLLM
Open the vLLM app from your desktop. In the settings panel, select your downloaded model from the Model dropdown and click Apply. vLLM loads the model — this takes 1–3 minutes depending on size.
The API is now live at:
https://{your-slug}--vllm.tunnels.laboratory.computer/v1
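
Before sending prompts, it can help to confirm which model ID the server reports, since that string is what goes in the "model" field of later requests. The following is a minimal sketch using only the standard library. It assumes the server exposes the OpenAI-style `GET /v1/models` route; the helper names `base_url`, `model_ids`, and `list_models` are illustrative and not part of any library.

```python
import json
import urllib.request


def base_url(slug: str) -> str:
    """Build the /v1 base URL from the tunnel slug (URL pattern from this guide)."""
    return f"https://{slug}--vllm.tunnels.laboratory.computer/v1"


def model_ids(models_payload: dict) -> list:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def list_models(slug: str, timeout: float = 10.0) -> list:
    """Fetch /v1/models and return the IDs the server reports."""
    request = urllib.request.Request(
        f"{base_url(slug)}/models",
        # Any non-empty token; the value is a placeholder.
        headers={"Authorization": "Bearer dummy"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return model_ids(json.load(response))
```

Calling `list_models("your-slug")` with your own slug should return a list containing the model you applied in the settings panel. If the ID differs from the display name in the Model Library, use the ID the server returns.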
Step 4: Test with curl
curl https://{your-slug}--vllm.tunnels.laboratory.computer/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
    "model": "Qwen3-30B-A3B",
    "messages": [{"role": "user", "content": "Explain backpropagation in one paragraph."}],
    "max_tokens": 512
  }'
Note: Laboratory OS secures all app endpoints via its tunnel. The Authorization header is accepted but not validated, so any non-empty string works.
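
The same request can be built without curl or any third-party package. This is a rough standard-library sketch, not an official client: `build_chat_request` and `extract_reply` are hypothetical helper names, and the response shape is assumed to follow the OpenAI chat-completions format.

```python
import json
import urllib.request


def build_chat_request(
    slug: str, model: str, prompt: str, max_tokens: int = 512
) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"https://{slug}--vllm.tunnels.laboratory.computer/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Accepted but not validated, per the note above.
            "Authorization": "Bearer dummy",
        },
        method="POST",
    )


def extract_reply(completion: dict) -> str:
    """Pull the assistant text out of a chat-completions response body."""
    return completion["choices"][0]["message"]["content"]
```

To send it, pass the request to `urllib.request.urlopen`, decode the body with `json.load`, and hand the resulting dict to `extract_reply`.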
Step 5: Use with the OpenAI SDK
The endpoint works with the OpenAI Python SDK. Point base_url at your vLLM endpoint and pass any non-empty api_key:
from openai import OpenAI

# Replace {your-slug} with your own tunnel slug.
client = OpenAI(
    base_url="https://{your-slug}--vllm.tunnels.laboratory.computer/v1",
    api_key="dummy",  # not validated, but must be non-empty
)

response = client.chat.completions.create(
    model="Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "Write a haiku about distributed systems."}],
)

# Print the text of the first (and only) returned choice.
print(response.choices[0].message.content)
This same pattern works with LangChain, LlamaIndex, Continue, and any other OpenAI-compatible client.
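
For interactive use, the SDK can also return the reply incrementally by passing `stream=True` to the same call. The sketch below assumes the endpoint supports streamed chat completions in the OpenAI format; `stream_reply` is an illustrative helper name, and it takes the `client` object from Step 5 as an argument rather than creating one.

```python
from typing import Any, Iterator


def stream_reply(client: Any, model: str, prompt: str) -> Iterator[str]:
    """Yield text fragments as the server generates them."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # request incremental chunks instead of one response
    )
    for chunk in stream:
        # Some chunks carry no choices or no text (for example role-only
        # or final chunks), so skip anything without content.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

A typical loop would be `for piece in stream_reply(client, "Qwen3-30B-A3B", "Hello"): print(piece, end="", flush=True)`.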
Step 6: Connect Open WebUI (optional)
If you want a chat interface:
- Install Open WebUI from the Apps panel.
- In Open WebUI settings, go to Admin → Connections → OpenAI API.
- Set the base URL to your vLLM endpoint and any non-empty API key.
- Your Qwen3 model appears in the model selector.
Monitoring Throughput
The vLLM app dashboard shows real-time metrics:
- Tokens/sec — generation throughput
- KV Cache Usage — memory pressure indicator
- Queue Depth — pending requests
Typical throughput on a single A100 80 GB: ~180 tok/s for Qwen3-30B-A3B.
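
To compare your own setup against a figure like this, one rough approach is to time a single request and divide the generated token count by the elapsed time. This is a sketch under assumptions: `tokens_per_second` and `measure_throughput` are hypothetical helpers, the `client` is the OpenAI SDK object from Step 5, and the response is assumed to include the standard `usage.completion_tokens` field.

```python
import time
from typing import Any


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure_throughput(
    client: Any, model: str, prompt: str, max_tokens: int = 256
) -> float:
    """Time one non-streamed request and return its tokens/sec."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    return tokens_per_second(response.usage.completion_tokens, elapsed)
```

The elapsed time here includes network latency and prompt processing, so a single-request number measured this way will usually be lower than the generation throughput shown on the dashboard.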
Tips
- Enable --enable-chunked-prefill in vLLM settings for better multi-user latency.
- Use quantized variants (AWQ or GPTQ) if you are close to the VRAM limit; they reduce memory by ~40% with minimal quality loss.
- vLLM persists its configuration across container restarts as long as you use a volume mount.
What’s Next
- Autonomous Coding Agent with OpenCode — point an agent at your new private API
- Open WebUI + Ollama guide — a simpler LLM setup without an API server