Building a Voice Assistant with Claude Code

Over a series of evenings I turned a forty-dollar pocket computer into a hands-free voice assistant for Claude Code. Hold a button on the little keyboard, talk, and a Mac mini sitting on a shelf transcribes what I said, runs it through a Claude agent that can do things on the machine, and speaks the answer back in a “warm British male” voice. It works on my home Wi-Fi, and it works from a phone hotspot on the other side of the country.

In this post I’ll walk through how it was built, why I bothered, and the more interesting failures along the way. I built the whole thing end-to-end in live sessions with Claude Code, directing it entirely from my phone, and I’ll include the technical detail at the end for anyone wanting to follow along.

For anyone following along at home, I’ll put the same caveat I put on the websites post: this is more involved than building a website. It involves hardware, networking, and an AI agent running with its guardrails deliberately loosened. I’d call it an advanced project. There are some spicy security concerns, and you certainly wouldn’t be putting one of these together in an institutional environment. Experiment with caution!

The hardware: M5Stack Cardputer ADV

The device is an M5Stack Cardputer ADV, which is an ESP32-S3 with a teeny colour screen, a QWERTY keyboard, a microphone, and a speaker, all in a case roughly the size of a chunky business-card holder. It costs about the price of a couple of (Melbourne) coffees and a sandwich.

In case you’re not a massive geek, the ESP32 is the same family of microcontroller that turns up in smart plugs and hobbyist electronics; it is not, by any stretch, a computer that can run AI. It has around 128KB of on-device memory to play with. For context, a few seconds of audio is bigger than that, and despite adding a MicroSD card for extra memory, this constraint ended up shaping a lot of the project.
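To put that memory budget in numbers, here is a rough back-of-envelope calculation. The audio format is an assumption (16 kHz, 16-bit mono is a common choice for speech recognition); the post does not say what the device actually records.

```python
# Back-of-envelope: how much uncompressed audio fits in ~128 KB?
# ASSUMPTION: 16 kHz, 16-bit, mono PCM. Not confirmed for this build.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
CHANNELS = 1
MEMORY_BUDGET_BYTES = 128 * 1024


def bytes_per_second() -> int:
    """Raw PCM data rate in bytes per second."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


def seconds_that_fit(budget_bytes: int) -> float:
    """How many seconds of raw audio fit in the given budget."""
    return budget_bytes / bytes_per_second()


if __name__ == "__main__":
    print(f"{bytes_per_second()} bytes per second")
    print(f"{seconds_that_fit(MEMORY_BUDGET_BYTES):.1f} seconds fit in 128 KB")
```

Under those assumptions the entire budget holds roughly four seconds of speech, with nothing left over for the program itself.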

The hardware used in this project was about this powerful… Image source: Wikimedia Commons

So, the design splits in two. The Cardputer is basically a mouth and ears with no brain: it captures audio, plays audio, and draws text on the screen. Everything that requires intelligence, the speech recognition, the language model, the text-to-speech, runs on a Mac mini I already use as a home server. This is the same logic behind a normal smart speaker; the puck on your kitchen bench is a microphone and a speaker, and the thinking happens elsewhere (usually in someone else’s cloud).

Cuteputer

Claude Code on hardware

For the websites project, Claude Code stayed in a tidy little world of files and folders. Pointing it at a piece of physical hardware is a different experience, because neither Claude nor I could see what was happening inside the device. The breakthrough early on was building what is basically a workbench: a tiny program on the device that accepts a snippet of code, runs it on the board, and sends back whatever happened. From then on, instead of guessing, Claude Code could bring a component to life by sending instructions to it and reading the result, the hardware equivalent of poking something to see if it moves.
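The workbench idea can be sketched in a few lines. This is an illustration of the pattern only, written in desktop Python; the function name and the result format are hypothetical, and the real on-device program lives in the repository.

```python
import contextlib
import io
import traceback


def run_snippet(source: str) -> dict:
    """Run a code snippet and report what happened instead of crashing."""
    captured = io.StringIO()
    namespace: dict = {}
    try:
        with contextlib.redirect_stdout(captured):
            # Runs arbitrary code: only ever feed this trusted input.
            exec(source, namespace)
    except Exception:
        return {
            "ok": False,
            "output": captured.getvalue(),
            "error": traceback.format_exc(),
        }
    return {"ok": True, "output": captured.getvalue(), "error": None}


if __name__ == "__main__":
    print(run_snippet("print('speaker ok')"))
    print(run_snippet("1 / 0")["error"])
```

The point of the design is that a failing snippet comes back as data (output plus a traceback) rather than a silent hang, which is what lets an agent iterate on hardware it cannot see.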

There’s a lot of technical info packed into a document called JOURNEY.md, which you can find in the GitHub repo linked at the end of this article if you’re so inclined.

Here’s a diagram of the setup:

Using the device

Once the pieces worked (mostly), the interaction is simple, and deliberately so. The entire interface is one gesture.

  1. You hold the OPT key on the Cardputer keyboard. The device powers up the microphone, opens a single connection to the Mac, and sends a shared secret so the Mac knows it’s a device it trusts.
  2. While you hold the key, it streams your voice to the Mac in small pieces.
  3. You release. The device sends an end-marker and waits.
  4. The Mac mini transcribes the audio using an open source transcription model (Whisper), hands the text to a persistent Claude agent, and gets back a short, spoken-style reply generated by another model (Claude Haiku, passing to Piper TTS for the audio).
  5. The reply comes back as both text and audio on the same connection.
  6. The screen shows what it heard (YOU) and what Claude said (CC), then plays the answer aloud.

V1 had the Australian “Siri female” voice because it used the Mac mini’s native “say” function for the text to speech. V2 switched that out for the faster and slightly less Siri-ish Piper TTS.
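The device-to-Mac half of the six steps above (secret first, then audio in small pieces, then an end-marker) can be sketched as a simple framing scheme. Everything here is an assumption made for illustration: the frame types, the 5-byte header, and the chunk size are invented, and the project's actual wire format may look nothing like this.

```python
import struct
from typing import Iterator

# ASSUMPTION: frame types and layout are invented for this sketch.
AUTH, AUDIO, END = 1, 2, 3
HEADER = struct.Struct(">BI")  # 1-byte frame type, 4-byte big-endian payload length


def encode_frame(kind: int, payload: bytes = b"") -> bytes:
    """Prefix a payload with its type and length."""
    return HEADER.pack(kind, len(payload)) + payload


def decode_frames(data: bytes) -> Iterator[tuple[int, bytes]]:
    """Walk a byte string and yield (type, payload) pairs."""
    offset = 0
    while offset < len(data):
        kind, length = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        yield kind, data[offset:offset + length]
        offset += length


def build_utterance(secret: bytes, audio: bytes, chunk_size: int = 1024) -> bytes:
    """Secret first, then the audio in small pieces, then the end-marker."""
    frames = [encode_frame(AUTH, secret)]
    for start in range(0, len(audio), chunk_size):
        frames.append(encode_frame(AUDIO, audio[start:start + chunk_size]))
    frames.append(encode_frame(END))
    return b"".join(frames)
```

Sending small frames as they are recorded means the device never has to hold a whole utterance in memory, and the explicit end-marker tells the Mac when to start transcribing.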

There are a few modes you can pick from a launcher at boot: an assistant that can act on the Mac, a translator that detects English or French and speaks the other back, a notes mode that dictates timestamped thoughts into a synced markdown file, and a full terminal when you’re at home. Each of those is basically a set of prompts wrapped around the input-output: Assistant connects directly to Claude Code and can be used to operate the Mac mini; Translator is set to EN-FR translation (and vice versa); Notes creates a transcribed voice note in a preset folder; and Terminal opens a (very small) terminal window that can be typed or spoken into. The core gesture, hold OPT to talk, is the same everywhere.
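As a rough sketch of "a set of prompts wrapped around the input-output", the modes could be modelled as a small lookup table. The prompt wording, the `Mode` class, and `build_request` are all hypothetical, and Terminal is left out because it is a shell rather than a prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    """One launcher mode: a name plus the instructions wrapped around the transcript."""
    name: str
    system_prompt: str  # hypothetical wording, not the project's real prompts


MODES = {
    "assistant": Mode(
        "assistant",
        "Act on the Mac when asked. Keep replies short and spoken-style.",
    ),
    "translator": Mode(
        "translator",
        "Detect English or French and reply with the translation into the other language.",
    ),
    "notes": Mode(
        "notes",
        "Clean up the dictated text and return it as a timestamped note.",
    ),
}


def build_request(mode_name: str, transcript: str) -> dict:
    """Wrap a transcript in the selected mode's instructions."""
    mode = MODES[mode_name]  # raises KeyError for an unknown mode
    return {"system": mode.system_prompt, "user": transcript}
```

The capture-and-playback path stays identical across modes; only the instructions change.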

Ultra fancy menu

The voice assistant that actually works

The assistant mode is the most interesting one, because it isn’t just answering questions, it’s running a Claude Code agent on the Mac that has access to all the tools. Ask it to read/write a file, open an application, check something, or create a note, and it does. This is also where the project stops being a toy and starts needing some serious due diligence.

Basically, it’s Siri, except it actually works.

Two design decisions influenced most of this work. The first was keeping the agent persistent. Early versions started Claude Code fresh on every single request, which meant a slow, cold start each time you spoke, and an AI that responded more like Leonard Shelby than Alexa. The fix was to hold one Claude agent open permanently using the Claude Agent SDK, so the startup only happens when the Mac first boots and never again.
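The difference between the two approaches is easy to show with a stand-in. The sketch below deliberately does not use the Claude Agent SDK: `FakeAgent` is a hypothetical placeholder that simply counts how often it is started, so the pattern can be seen without guessing at the real API.

```python
from typing import Callable, Optional


class FakeAgent:
    """Stand-in for a real agent session; counts how often one is started."""

    starts = 0

    def __init__(self) -> None:
        FakeAgent.starts += 1  # in real life this is the slow part
        self.history: list[str] = []

    def ask(self, text: str) -> str:
        self.history.append(text)
        return f"reply {len(self.history)} to: {text}"


def cold_request(text: str) -> str:
    """Early approach: start a fresh agent for every request."""
    return FakeAgent().ask(text)


class WarmAgentHolder:
    """Later approach: start the agent once, then reuse it."""

    def __init__(self, factory: Callable[[], FakeAgent] = FakeAgent) -> None:
        self._factory = factory
        self._agent: Optional[FakeAgent] = None

    def ask(self, text: str) -> str:
        if self._agent is None:
            self._agent = self._factory()  # startup cost paid once
        return self._agent.ask(text)
```

The warm holder also keeps the conversation history in one place, which is why the second reply knows it is the second reply.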

The second decision was a spoken confirmation gate. Because this thing acts unattended, by voice, with no screen to click “approve” on, I didn’t want it deleting or sending anything on a misheard word. So the agent is instructed to say what it’s about to do and wait for a spoken “yes” before anything that changes state: creating, editing, deleting, sending. Read-only questions skip the gate and just answer. It’s “soft” in that it’s prompt discipline rather than a hard technical lock, but in practice it works, and it’s the persistent (warm) agent, still holding the conversation, that makes the “yes” land in the right context.
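The gate in this project lives in the prompt, not in code. Purely to make the logic concrete, here is what a hard-coded version of the same rule might look like; the class, the verb list, and the accepted "yes" phrases are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: these sets are illustrative, not taken from the project.
STATE_CHANGING = {"create", "edit", "delete", "send"}
AFFIRMATIVE = {"yes", "yes please", "yep", "confirm"}


@dataclass
class Action:
    verb: str
    description: str


class ConfirmationGate:
    """Hold state-changing actions until a spoken 'yes' arrives."""

    def __init__(self) -> None:
        self.pending: Optional[Action] = None

    def propose(self, action: Action) -> str:
        if action.verb not in STATE_CHANGING:
            return f"RUN: {action.description}"  # read-only: skip the gate
        self.pending = action
        return f"CONFIRM? I am about to {action.description}. Say yes to continue."

    def hear(self, utterance: str) -> str:
        # Whatever is said next consumes the pending action, so a later
        # stray "yes" cannot trigger something stale.
        action, self.pending = self.pending, None
        if action is None:
            return "NOTHING PENDING"
        if utterance.strip().lower().rstrip(".!") in AFFIRMATIVE:
            return f"RUN: {action.description}"
        return f"CANCELLED: {action.description}"
```

Anything other than a clear affirmative cancels the action, which is the safer default for a device that can mishear.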

Terrible quality on this video, sorry. Quite hard to hold a Cardputer in one hand and film with a phone in the other apparently… The edits cut out the 5-12ish seconds of lag between send and response.

You absolutely wouldn’t want to build this thing and leave it lying around. It’s about as secure as leaving a computer half-open in a cafe running Claude Code while you nip off to the toilet. The device essentially has always-on, unrestricted access to the entire system, and is directed via speech to text: anyone’s speech.

Two measures keep it marginally safe: the device has to send a shared secret before the Mac will listen to it at all, and the soft confirmation gate means nothing that changes state happens without a spoken “yes”. The remote link is also encrypted. This is a personal-device risk model, one user, one device, full control of both ends, and I wouldn’t try to use it any other way without rethinking the whole permissions thing from top to bottom, and maybe not even then.
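For the shared-secret check, one detail worth getting right on the Mac side is comparing the secret in constant time. A minimal sketch, assuming the secret arrives as raw bytes (the function name is hypothetical):

```python
import hmac


def is_trusted(presented: bytes, expected: bytes) -> bool:
    """Compare secrets without leaking, via timing, how many bytes matched."""
    return hmac.compare_digest(presented, expected)


if __name__ == "__main__":
    expected = b"example-secret"  # placeholder; load the real one from config
    print(is_trusted(b"example-secret", expected))
    print(is_trusted(b"wrong-secret", expected))
```

A plain `==` would work functionally, but `hmac.compare_digest` is the standard-library tool intended for this job.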

Making it work from anywhere

Just because it wasn’t risky enough already, I also added the capability to make it work via a phone hotspot, anywhere with a signal. This was harder than it sounds for two reasons. My home internet is Starlink, which doesn’t give me a public address you can dial into, so anything reaching my Mac has to be dialled outward from the Mac itself. And the little ESP32 device is too limited to join a private network the way a laptop would; it’s just a basic client on whatever Wi-Fi it lands on.

The solution was a Tailscale Funnel, which lets the Mac expose the one voice service at a public, encrypted address without any of the usual router fiddling, and works fine behind Starlink. The open question was whether the tiny chip could even do the encryption a public connection requires. A quick probe on the device confirmed it could, and that was the green light. The device now scans for known networks (either Starlink at home, or my iPhone hotspot otherwise) and joins whichever is in range, routing over the funnel when it can’t find home.
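The network selection logic amounts to a priority list. In this sketch the SSIDs and addresses are placeholders, and the idea that each network maps to exactly one endpoint is an assumption made to keep the example small.

```python
from typing import Optional

# ASSUMPTION: SSIDs and endpoints are placeholders, listed in priority order.
KNOWN_NETWORKS = [
    {"ssid": "home-wifi", "endpoint": "http://mac-mini.local:8765"},
    {"ssid": "phone-hotspot", "endpoint": "https://example-funnel.invalid"},
]


def choose_network(visible_ssids: set[str]) -> Optional[dict]:
    """Return the highest-priority known network that is in range, if any."""
    for network in KNOWN_NETWORKS:
        if network["ssid"] in visible_ssids:
            return network
    return None


if __name__ == "__main__":
    print(choose_network({"cafe-guest", "phone-hotspot"}))
```

Listing home first means the device takes the direct local route whenever it can and only falls back to the public, encrypted address when it is away.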

One bonus gotcha: iPhone hotspots default to the faster 5GHz band, and this chip only speaks the older 2.4GHz, so you have to switch on “Maximize Compatibility” in the hotspot settings or the device can’t even see the network. I lost a bit of cafe time figuring that one out.

I lost about a small latte’s worth of time in the cafe figuring out that I needed to switch on the “Maximize Compatibility” toggle to allow it to connect via phone hotspot.

What I learned along the way…

As with the earlier Building a Website with Claude Code post, it’s easy to look at all of this stuff and dismiss it as either too technical, or some kind of strange magic. It’s neither. Every “breakthrough” in this project was Claude Code methodically testing real hardware and reading back real results, and every fix came from a failure I had to understand (or, more accurately, direct Claude to examine) to get past. The device is cheap and the code is now on GitHub, but the value wasn’t in the speed; it was in being forced to consider the implications of this approach and the future of AI technologies.

Testing the note taking and translation features, with some background voice cameos from my children in the other room.

Think of it like this: an ESP32, the main technology of this device, wholesales for a couple of dollars. With some gentle prodding, even a simple bit of technology like the Cardputer can act as a multimodal interface for one of the world’s most sophisticated consumer AI platforms, Claude Code. The fact that Claude Code is running at home on a Mac mini is basically moot: the same process could connect the Cardputer to a local AI model running on a laptop, or a commercial model running in the cloud.

Last year, I wrote a post on the Near Future of GenAI where I discussed things like AI Agents, local AI models, and the concept of “AI everywhere”. This experiment is a continuation of that logic. I want you to picture this process, but hardened against security vulnerabilities and scaled for easy, lag-free, consumer use. It’s Siri, but instead of being bound to a single device it ranges across all of your existing platforms, applications, and hardware. You could shove a device like this into anything and have it connect to a central, capable AI model. Even with this little experiment I could probably knock up a smart fridge that talks via Claude Code to my MacBook and my microwave… if I really wanted to.

AI wearables are part of that same future. Some will ship with lightweight, on-device AI using small local language models. Many will adopt an approach like the one I’ve demonstrated, offloading the AI part onto a powerful model most likely based in the cloud. AI everywhere indeed.

If you want the technical detail, the build, the hardware info, and a setup guide, the repository is linked below. Just mind the security note before you point it at anything important. Now that I’ve written this post, I’ve re-flashed the Cardputer back to its factory settings ready for its next adventure – a lifetime sitting in a drawer alongside half a dozen Raspberry Pis and too many cables to count.


The code, including a full technical account of every iteration and dead end, is available on GitHub: https://github.com/lfurze/cardputer-voice-assistant

Want to learn more about GenAI professional development and advisory services, or just have questions or comments? Get in touch:
