Hacker News Clone

Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B(github.com)

244 pointsby karimfApr 5, 2026

12 Comments

dvt•Apr 6, 2026

Solid work and great showcase, I've done a bunch of stuff with Kokoro and the latency is incredible. So crazy how badly Apple dropped the ball... feels like your demo should be a Siri demo (I mean that in the most complimentary way possible).

karimf•Apr 6, 2026

Thank you. This reminds me of a paragraph from the LatentSpace newsletter [0]

> The excellent on device capabilities makes one wonder if these are the basis for the models that will be deployed in New Siri under the deal with Apple….

https://www.latent.space/p/ainews-gemma-4-the-best-small-mul...

k-almuraee•Apr 6, 2026

Amazing, love your work ,

zerop•Apr 6, 2026

I have been looking forward to build something like this using open models. A voice assisstant I can talk while I am driving, as I do have long commute. I do use chatGPT voice mode and it works great for querying any information or discussions. But I want to do tasks like browsing web, act like a social media manager for my business etc.

jwr•Apr 6, 2026

That is very, very interesting. I've been hoping to have an assistant in the workshop (hands-free!) that I could talk to and have it help me with simple tasks: timers, calculating, digging up notes, etc. — basically, what the phone assistants were supposed to be, but aren't.

"You will have to unlock your iphone first" is kind of a deal-breaker when you are in the middle of mixing polyurethane resin and have gloves and a mask on.

More and more I find that we have the technology, but the supposedly "tech" companies are the gatekeepers, preventing us from using the technological advances and holding us back years behind the state of the art.

I'll be trying this out on my Macbook, looks very promising!

mentalgear•Apr 6, 2026

You might be interested in the open-source https://www.home-assistant.io/voice-pe/ .

QuercusMax•Apr 6, 2026

I've been replacing my Google Homes and Chromecasts with Snapcast streamers, and this is the next thing I've been planning to look into.

It's truly absurd how the Google voice assistant USED to work properly for setting timers, playing music, etc, and then they had to break it 15 times and finally replace it with much slower AI that only kinda does what you want. I'm done.

Selfhosted is the way to go if you want to keep your sanity. My wife has basically given up on any Google/Apple voice assistants being able to do anything useful above "set a 10 minute timer".

huijzer•Apr 6, 2026

> More and more I find that we have the technology, but the supposedly "tech" companies are the gatekeepers

Yes same with RSS readers being dropped by large companies. Worked too good I guess!

gtowey•Apr 6, 2026

The computing power we all have in our pockets is staggering. It could be tool that truly makes our lives easier, but instead it's mostly a device that is frustrating to use. Companies have decided to make it simply another conduit for advertising. It's a tool for them to sell us more stuff. Basic usability be damned.

jamilton•Apr 6, 2026

Siri does have a setting that'll activate it if you say "hey siri" while the phone is locked. Obvious privacy and battery usage concerns though, and it's still Siri, so it's a little clunky.

jwr•Apr 6, 2026

Mhm. I think I use that. But then I say "call my wife" and it says "you'll need to unlock your iphone first".

It's clear Tim Cook doesn't ever try to use Siri wearing gloves. Or ever, for that matter :-)

mft_•Apr 6, 2026

Siri (on iOS 18, at least) will call people for me without unlocking, in response to a voice command only - I just double-checked...

divan•Apr 6, 2026

Can someone quickly vibe code MacOS native app for that so it doesn't require running terminal commands and searching for that browser tab? (: (also for iOS, pls)

duartefdias•Apr 6, 2026

Would you pay 2$ for that MacOS native desktop app?

est•Apr 6, 2026

I am making something similar. Also been using Kokoro for TTS. Very cool project!

Gemma 4 is kinda too heavyweight even with E2B. I am sticking with qwen 0.8B at the moment.

logicallee•Apr 6, 2026

It might interest people to know you can also easily fine-tune the text portion of this specific model (E2B) to behave however you want! I fine-tuned it to talk like a pirate but you can get it to do anything you have (or can generate) training data for. (This wouldn't make it to the text to speech portion though.) So you can easily train it to act a certain way or give certain types of responses.

Video: https://www.youtube.com/live/WuCxWJhrkIM

Generated writeup: https://taonexus.com/publicfiles/apr2026/pirate-gemma-journa...

magzter•Apr 6, 2026

This is so cool, I'm always speaking to people about how the advancement in the SOTA hosted AI's is also happening in the local model space, i.e. the SOTA hosted AI models 6-12 months ago are what we're seeing now being able to run locally on average hardware - this is such an amazing way to actually demo it.

an0n-elem•Apr 6, 2026

Cool work buddy:)

myultidevhq•Apr 6, 2026

This is really impressive for running locally on an M3 Pro. The latency looks surprisingly good for real-time audio and video input.

Curious about one thing though, how does it handle switching between languages? I work with both Greek and English daily and local models usually struggle with that.

Great work, bookmarking this.

karimf•Apr 6, 2026

During my limited testing, it works better than I expected at handling multiple languages in a single session. Perhaps I just had a low expectation since I've mostly worked with English-only STT models.

crsAbtEvrthng•Apr 6, 2026

If I run this without internet connection it says "loading..." at the bottom of the localhost site and won't work.

If I run this with internet connected it works flawlessly. Even if I disconnect my internet afterwards it still goes on working fine.

Why there has to be an internet connection established at the time I open the localhost site when all of this should be working purely on device?

Despite of this, I am really impressed that this actually works so fast with video input on my M4 Pro 48 GB.

karimf•Apr 6, 2026

Huh that's weird. I just tried it and it works on my machine. Could you perhaps create a GitHub issue and share the reproduction steps and any relevant logs?

crsAbtEvrthng•Apr 6, 2026

Don't have the time right now but will play around with it next weekend for sure and will give you more feedback with logs when I see that I can reproduce it.

For now what I did was:

- Tested in Chrome/Safari/Firefox on Tahoe.

- Followed the quick start install instructions from github repo

- Everything worked

- Closed terminal

- Disconnected internet (Wifi off)

- Opened terminal

- Started server again (uv run server.py)

- Opened localhost in browser, it asked for camera/mic normally, granted access, saw camera live feed but "loading..." at bottom center of the site and AI did not listen/respond

- Reproduced this about 3 times with switching between wifi on/off before starting the server, always the same (working with internet; not working without)

- Figured it also works fine if I start the server with internet connected and disconnect it afterwards

rubicon33•Apr 6, 2026

Is there anything unique here happening for the video aspect or is it just taking snapshots over and over?

I’ve been looking for a good video summarizing / understanding model!

karimf•Apr 6, 2026

Nothing unique, it's just taking a snapshot when it's processing the input. Even processing a single image will increase the TTFT by ~0.5s on my machine, so for now, it seems to be impossible for feeding a live video and expecting a real-time response.

In regards to the video capability, I haven't tested it myself, but here's a benchmark/comparison from Google [0]

[0] https://huggingface.co/blog/gemma4#video-understanding

rubicon33•Apr 6, 2026

I totally get these are very hard problems so solve and that we're on the bleeding edge of what's possible but I can't help and wonder when someone is going to crack real video understanding.

sure, maybe it's still frame-by-frame but so fast and so often that the model retains a rolling context of what's going on and can answer cleanly temporal questions.

"how packages were delivered over the last hour", etc.