distributed local
inference router
live simulation — requests routed across local AI workers
priority queuing
HIGH
8
NORM
5
LOW
3
Three priority tiers with FIFO within each lane. High-priority jobs are always dispatched first, with owner-affinity scheduling.
model aliases
# one name, many workers "qwen" → qwen3-4b-instruct → qwen3-14b-instruct # owner affinity: prefers # the requester's own GPU
Map one name across multiple models or machines. The scheduler picks the best available worker by affinity and load.
openai compatible
POST /v1/chat/completions Authorization: Bearer sk-… { "model": "llama-3.2", "stream": true }
Drop-in for the OpenAI API. Works with Claude Code, Open WebUI, and any client that speaks the OpenAI protocol.