Solid work and a great showcase. I've done a bunch of stuff with Kokoro and the latency is incredible. So crazy how badly Apple dropped the ball... feels like your demo should be a Siri demo (I mean that in the most complimentary way possible).
karimf•Apr 6, 2026
Thank you. This reminds me of a paragraph from the LatentSpace newsletter [0]
> The excellent on device capabilities makes one wonder if these are the basis for the models that will be deployed in New Siri under the deal with Apple….
I've been looking forward to building something like this with open models: a voice assistant I can talk to while I'm driving, since I have a long commute. I do use ChatGPT voice mode and it works great for querying information or having discussions, but I want to do tasks like browsing the web, acting as a social media manager for my business, etc.
jwr•Apr 6, 2026
That is very, very interesting. I've been hoping to have an assistant in the workshop (hands-free!) that I could talk to and have it help me with simple tasks: timers, calculating, digging up notes, etc. — basically, what the phone assistants were supposed to be, but aren't.
"You will have to unlock your iPhone first" is kind of a deal-breaker when you are in the middle of mixing polyurethane resin and have gloves and a mask on.
More and more I find that we have the technology, but the supposedly "tech" companies are the gatekeepers, preventing us from using these advances and keeping us years behind the state of the art.
I'll be trying this out on my MacBook, looks very promising!
I've been replacing my Google Homes and Chromecasts with Snapcast streamers, and this is the next thing I've been planning to look into.
It's truly absurd how the Google voice assistant USED to work properly for setting timers, playing music, etc, and then they had to break it 15 times and finally replace it with much slower AI that only kinda does what you want. I'm done.
Self-hosted is the way to go if you want to keep your sanity. My wife has basically given up on any Google/Apple voice assistant being able to do anything useful beyond "set a 10 minute timer".
huijzer•Apr 6, 2026
> More and more I find that we have the technology, but the supposedly "tech" companies are the gatekeepers
Yes, same with RSS readers being dropped by large companies. Worked too well, I guess!
gtowey•Apr 6, 2026
The computing power we all have in our pockets is staggering. It could be a tool that truly makes our lives easier, but instead it's mostly a device that is frustrating to use. Companies have decided to make it simply another conduit for advertising. It's a tool for them to sell us more stuff. Basic usability be damned.
jamilton•Apr 6, 2026
Siri does have a setting that'll activate it if you say "hey siri" while the phone is locked. Obvious privacy and battery usage concerns though, and it's still Siri, so it's a little clunky.
jwr•Apr 6, 2026
Mhm. I think I use that. But then I say "call my wife" and it says "you'll need to unlock your iPhone first".
It's clear Tim Cook doesn't ever try to use Siri wearing gloves. Or ever, for that matter :-)
mft_•Apr 6, 2026
Siri (on iOS 18, at least) will call people for me without unlocking, in response to a voice command only - I just double-checked...
divan•Apr 6, 2026
Can someone quickly vibe-code a macOS native app for this so it doesn't require running terminal commands and searching for that browser tab? (: (also for iOS, pls)
duartefdias•Apr 6, 2026
Would you pay $2 for that macOS native desktop app?
est•Apr 6, 2026
I am making something similar. Also been using Kokoro for TTS. Very cool project!
Gemma 4 is kinda too heavyweight even with E2B. I'm sticking with Qwen 0.8B at the moment.
logicallee•Apr 6, 2026
It might interest people to know you can also easily fine-tune the text portion of this specific model (E2B) to behave however you want! I fine-tuned it to talk like a pirate, but you can get it to do anything you have (or can generate) training data for. (This wouldn't carry over to the text-to-speech portion, though.) So you can easily train it to act a certain way or give certain types of responses.
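For anyone curious, the training data is just chat-formatted instruction/response pairs. A rough sketch of how you might prepare a JSONL dataset in the `messages` format that common fine-tuning tools (e.g. Hugging Face TRL's SFTTrainer) accept — the `make_pirate` rewrite here is a toy stand-in for however you actually generate the target style:

```python
import json

def make_pirate(text: str) -> str:
    # Toy stand-in: real data would come from a stronger model
    # or hand-written examples in the style you want.
    return "Arrr! " + text.replace("my", "me") + ", matey!"

def to_chat_example(question: str, answer: str) -> dict:
    # One training example in the conversational "messages" format.
    return {"messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": make_pirate(answer)},
    ]}

pairs = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("How do I reset my router?", "Unplug my router for ten seconds."),
]
with open("pirate_sft.jsonl", "w") as f:
    for q, a in pairs:
        f.write(json.dumps(to_chat_example(q, a)) + "\n")
```

From there it's a standard supervised fine-tune over that file; the key point is that only the text stack is touched, not the speech side.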
This is so cool. I'm always telling people how the advancement in SOTA hosted AIs is also happening in the local model space, i.e. the SOTA hosted models of 6-12 months ago are what we can now run locally on average hardware. This is such an amazing way to actually demo it.
an0n-elem•Apr 6, 2026
Cool work buddy:)
myultidevhq•Apr 6, 2026
This is really impressive for running locally on an M3 Pro. The latency looks surprisingly good for real-time audio and video input.
Curious about one thing though, how does it handle switching between languages? I work with both Greek and English daily and local models usually struggle with that.
Great work, bookmarking this.
karimf•Apr 6, 2026
During my limited testing, it handles multiple languages in a single session better than I expected. Perhaps I just had low expectations, since I've mostly worked with English-only STT models.
crsAbtEvrthng•Apr 6, 2026
If I run this without internet connection it says "loading..." at the bottom of the localhost site and won't work.
If I run this with internet connected it works flawlessly. Even if I disconnect my internet afterwards it still goes on working fine.
Why does there have to be an internet connection at the moment I open the localhost site, when all of this should work purely on device?
That aside, I'm really impressed that this actually works so fast with video input on my M4 Pro 48 GB.
karimf•Apr 6, 2026
Huh that's weird. I just tried it and it works on my machine. Could you perhaps create a GitHub issue and share the reproduction steps and any relevant logs?
crsAbtEvrthng•Apr 6, 2026
Don't have the time right now, but I'll play around with it next weekend for sure and give you more feedback with logs once I can reproduce it.
For now, what I did was:
- Tested in Chrome/Safari/Firefox on Tahoe.
- Followed the quick-start install instructions from the GitHub repo
- Everything worked
- Closed the terminal
- Disconnected the internet (Wi-Fi off)
- Opened the terminal
- Started the server again (uv run server.py)
- Opened localhost in the browser; it asked for camera/mic access as normal, I granted it and saw the live camera feed, but "loading..." stayed at the bottom center of the page and the AI did not listen/respond
- Reproduced this about 3 times, switching Wi-Fi on/off before starting the server; always the same (works with internet, doesn't without)
- Found it also works fine if I start the server with internet connected and disconnect it afterwards
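My guess is the page pulls something (a JS bundle, font, or model shard) from a CDN on first load, and it gets cached after that. A quick way to check, assuming the frontend files live in a `static/` folder next to `server.py` (the folder name is a guess, adjust to the repo layout), is to grep them for non-localhost URLs:

```python
import re
from pathlib import Path

# Matches absolute http(s) URLs that don't point at localhost.
EXTERNAL_URL = re.compile(r'https?://(?!localhost|127\.0\.0\.1)[^\s"\'<>)]+')

def external_urls(text: str) -> list[str]:
    """Return every external URL referenced in the given source text."""
    return EXTERNAL_URL.findall(text)

if __name__ == "__main__":
    static = Path("static")  # guessed frontend folder
    if static.is_dir():
        for path in static.rglob("*"):
            if path.suffix in {".html", ".js", ".css"}:
                for url in external_urls(path.read_text(errors="ignore")):
                    print(f"{path}: {url}")
```

Any hit here (a CDN-hosted script tag, for instance) would explain exactly this "works once online, then fine offline" behavior.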
rubicon33•Apr 6, 2026
Is there anything unique happening here for the video aspect, or is it just taking snapshots over and over?
I’ve been looking for a good video summarizing / understanding model!
karimf•Apr 6, 2026
Nothing unique, it's just taking a snapshot when it processes the input. Even a single image increases the TTFT by ~0.5s on my machine, so for now, feeding it live video and expecting a real-time response isn't feasible.
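Concretely, the loop just gates how often a frame gets attached, so that the ~0.5s image-encoding cost is only paid occasionally rather than on every turn. A rough sketch of the idea (names and the interval are illustrative, not the actual code):

```python
import time

class SnapshotGate:
    """Allow attaching at most one camera frame per `min_interval` seconds."""

    def __init__(self, min_interval: float = 2.0, now=time.monotonic):
        self.min_interval = min_interval
        self.now = now  # injectable clock, handy for testing
        self.last = float("-inf")

    def should_attach(self) -> bool:
        # True only if enough time has passed since the last attached frame.
        t = self.now()
        if t - self.last >= self.min_interval:
            self.last = t
            return True
        return False

# Usage inside the turn loop (illustrative):
# if gate.should_attach():
#     messages.append(image_message(capture_frame()))
```

Anything fancier (motion detection, attaching a frame only when the user asks a visual question) would slot into the same gate.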
As for the video capability, I haven't tested it myself, but here's a benchmark/comparison from Google [0]
I totally get that these are very hard problems to solve and that we're on the bleeding edge of what's possible, but I can't help but wonder when someone is going to crack real video understanding.
Sure, maybe it's still frame-by-frame, but so fast and so often that the model retains a rolling context of what's going on and can cleanly answer temporal questions.
"How many packages were delivered over the last hour", etc.
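The rolling-context idea sketched as a data structure: timestamped per-frame captions kept for a sliding window, which you would then hand to the model as context for temporal questions. (The per-frame captioner is the expensive part and is hand-waved here; this only shows the bookkeeping.)

```python
from collections import deque
import time

class RollingFrameLog:
    """Keep (timestamp, caption) pairs for the last `window` seconds."""

    def __init__(self, window: float = 3600.0, now=time.time):
        self.window = window
        self.now = now
        self.events = deque()  # (timestamp, caption), oldest first

    def add(self, caption, t=None):
        t = self.now() if t is None else t
        self.events.append((t, caption))
        self._evict(t)

    def _evict(self, t):
        # Drop anything older than the window.
        while self.events and t - self.events[0][0] > self.window:
            self.events.popleft()

    def context(self) -> str:
        # A question like "how many packages were delivered over the last
        # hour" would be answered by the model against this text.
        return "\n".join(f"[{t:.0f}s] {c}" for t, c in self.events)
```

The open question is whether a small local model can caption frames fast enough to keep a log like this fresh.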
[0] https://www.latent.space/p/ainews-gemma-4-the-best-small-mul...

Video: https://www.youtube.com/live/WuCxWJhrkIM
Generated writeup: https://taonexus.com/publicfiles/apr2026/pirate-gemma-journa...

[0] https://huggingface.co/blog/gemma4#video-understanding