Building Yantra: A Voice Brain For My Homeserver

June 13, 2026 · 6 min read · homelab, self-hosted, llm, voice, apple-watch

The Grafana dashboard on my homeserver is pretty. Sitting there is the problem. I wanted to talk to the box — ask my Apple Watch "what's my CPU temp" and get a real spoken answer in three seconds. On a TUF Gaming laptop that pretends to be a server.

This post is how I built that, and what it can do now.

Yantra answering a status query on the Apple Watch

Why Not Local

The first instinct on any homelab is "run it local". I have Ollama already, a 1050 Ti, and 4 GB of VRAM.

That GPU caps you at ~3B Q4 models — fine for embeddings and classification, useless for reasoning. Modern small-but-good models (Qwen 2.5 14B, Llama 3.3 70B) need 12–48 GB of VRAM. Even a 4090 wouldn't beat Groq's free tier on speed.

So I stopped fighting and split the workload:

Brain → Groq (Llama 3.3 70B). Free tier is generous, ~500 tokens/sec, sub-second latency.
GPU → goes back to its real job: NVENC for Jellyfin and headroom for future Frigate object detection.
Embeddings (later) → Ollama with nomic-embed-text, 270 MB. That's a job a 1050 Ti is actually good at.

Groq is not Grok. It's a hardware company that runs open-weight models on custom LPU chips and explicitly doesn't train on API inputs. That last part was the deciding factor.

Architecture

Everything sits between an Apple Watch on my wrist and a FastAPI service in Docker called yantra:

Apple Watch
  └─ Siri Shortcut
      ├─ Dictate Text         (Apple on-device STT)
      └─ POST /ask
                                       │
                              ┌────────▼────────┐
                              │ Yantra (FastAPI) │
                              └────────┬────────┘
                                       │
                              ┌────────▼────────┐
                              │       Groq       │ Llama 3.3 70B + tools
                              └────────┬────────┘
                                       │
                              composed 30-word reply
                                       │
                              "Speak Text" on Watch

Yantra itself is around 250 lines of Python — FastAPI on top, httpx talking to Groq, and a small registry of tools that Groq can call via the OpenAI function-calling protocol.

Tools

There are five live tools right now. Each one is ~30 lines:

system_stats — async fan-out to Prometheus. Returns CPU %, RAM %, hwmon temps, RAPL package and DRAM power draw, disk usage on /mnt/hdd, system uptime.
container_status — talks to the Docker socket via the Python SDK. Returns running and stopped containers by name. Answers "is jellyfin up" without me opening Portainer.
portfolio_status — reads my real Groww holdings JSON and the daily portfolio history file from my Sutra project. Returns positions, available cash, last snapshot value, P&L percent.
recent_downloads — qBittorrent Web API. I enabled the subnet whitelist so containers inside 172.16.0.0/12 can hit the API without credentials. Returns torrents finished in the last 24 hours and what's currently downloading with ETA.
uptime_check — pulls Uptime Kuma's Prometheus metrics endpoint. Parses the monitor_status{} lines and reports anything not green.

Groq decides which tools to call based on what I say. "Give me a full status report" fires four of them sequentially and composes a single spoken sentence:

Server CPU is at 0.7 percent, RAM is at 38 percent, CPU temp is fifty seven degrees, uptime is seventy hours. Everything is up. Portfolio value is forty eight thousand three hundred fifty eight rupees, three percent down. One download finished today, nothing currently downloading.

That's one HTTP request from the Watch. The Llama 3.3 model picks the right tools, gets real data, and writes spoken-style English. There is no guessing.

Power Watchdog On The Side

A side effect of building Yantra was finally putting real power monitoring on the dashboard. The node_rapl_* collectors in node-exporter were silently failing because Intel's RAPL energy file is root-only since the Platypus CVE mitigation. I let node-exporter run as root, added an nvidia_gpu_exporter container for GPU temps, and Grafana picked up CPU package, DRAM, and uncore power instantly.

At idle, the homeserver burns roughly 4–5 W of CPU + DRAM. Extrapolated to the wall (RAPL doesn't see fans, display backlight, NVMe, GPU idle, or PSU loss) it sits around 40–50 W on average. At ₹8/kWh, that's about ₹260/month to run Jellyfin, Immich, the full *arr stack, Prometheus, Grafana, Ollama, Sutra, and Yantra. Cheap.

The Apple Watch Flow

No custom app, no SDKs, no developer account. iOS Shortcuts handles all of it.

The Shortcut has three actions:

Dictate Text — uses Apple's on-device speech recognition.
Get Contents of URL — POSTs the dictated text to Yantra as JSON.
Speak Text — speaks the response through the Watch speaker.

Triggered by "Hey Siri, [shortcut name]" or by tapping a complication on the watch face. Apple does STT, Yantra does reasoning, Groq composes the reply, Apple speaks it back. End-to-end latency for a single-tool query is about three seconds. Multi-tool queries land in five to seven.

Naming It

The assistant was called Jarvis for about an hour before I caught myself. Renamed to Yantra — Sanskrit for "instrument". It slots into the same cultural namespace as Sutra, my trading terminal. Less cliché, harder to confuse with any other product in the wild.

What's Next

The fun part hasn't started. The next round, in roughly this order:

iOS Reminders bridge — when a voice request creates a todo with a due time, the Shortcut also writes a native iOS Reminder. Watch handles the notification natively, no third-party push service required.
Action tools — restart(container), pause_qbit(), add_to_radarr(title). Once the read-only tools are stable, voice gets to trigger state changes too.
committee_review — voice-asking my 5-agent Sutra committee about a stock symbol. "Hey Siri, ask Yantra what the committee says about HDFC."
Home Assistant + Mosquitto — the foundation for any physical-world control. One MQTT broker, one HA, every future device plugs into the same socket.
AC control — DIY ESP32 with an IR LED. Learn my AC remote's codes via a TSOP1838 receiver, expose climate to HA. "AC to twenty two" from bed. About ₹500 in parts.
Water tank level — same ESP32 stack, an AJ-SR04M waterproof ultrasonic sensor in the tank lid. Daily consumption graphed on Grafana next to power draw.
One IP cam via Frigate — an old phone running IP Webcam streaming H.264 to mediamtx. Frigate does YOLO on the 1050 Ti. Package alerts.
A custom voice — XTTS-v2 locally on the GPU, or ElevenLabs over API. Generic Siri voice has to go.

Each of these is its own post. The architecture above is built to grow into them — Yantra is the brain, every new physical device becomes a tool it can call.

The Watch was already on my wrist. The homeserver was already running. The Groq API is free. The whole bridge between them is one small FastAPI service and a Shortcut with three actions. JARVIS shouldn't be this cheap, but it is.