Guide · Feb 25, 2026

Moonshine Just Beat Whisper — Here's What Developers Think

A six-person startup just shipped speech-to-text models that outperform OpenAI's Whisper Large v3 on standard benchmarks — at a fraction of the size. Moonshine runs 245 million parameters. Whisper Large v3 runs 1.5 billion. When the Show HN post went up, the thread lit up fast. Developers who'd been wrestling with Whisper's size, latency, and hallucination problems saw something worth paying attention to. The reactions tell you a lot about what the speech-to-text space actually looks like right now — what works, what doesn't, and where things are heading.

Here's what stood out.

The benchmarks

Moonshine claims lower word-error rates than Whisper Large v3 across several standard evaluation sets. The numbers are competitive on the HuggingFace OpenASR leaderboard, which is the closest thing the speech-to-text community has to a neutral playing field. That alone would be noteworthy. The fact that it does this with roughly one-sixth the parameters is what turned heads.
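Word error rate, the metric behind the OpenASR leaderboard, is simple to compute yourself: it's the word-level edit distance between the reference transcript and the model's output, divided by the number of reference words. A minimal sketch in plain Python:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat",
                      "the cat sat on a mat"))  # 1 substitution / 6 words ≈ 0.167
```

One caveat: published leaderboard numbers normalize text first (casing, punctuation, number formatting), so a naive comparison like this will usually report a higher WER than an official evaluation.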

The models come in multiple sizes, so you can trade accuracy for speed depending on your deployment constraints. The English models are MIT licensed — fully open, commercial use permitted. Language support extends beyond English to Arabic, Japanese, Korean, Mandarin, Spanish, Ukrainian, and Vietnamese.

One important caveat that came up repeatedly in the thread: the non-English models carry a non-commercial license. For a project positioning itself as open-weight, that's a meaningful distinction. We'll get to the developer reaction on that below.

What developers are saying

The Hacker News thread generated the kind of detailed, technical feedback that only happens when a project hits a nerve. Several themes emerged.

Whisper's hallucination problem

One developer raised a point that anyone who's deployed Whisper in production knows well: Whisper is “notorious for hallucination loops during silences.” The question was whether Moonshine's benchmarks account for this. It's a legitimate concern. Whisper will sometimes generate phantom text when no one is speaking — inventing words, repeating phrases, or producing coherent-sounding sentences that nobody actually said. For batch transcription of clean recordings, this is annoying. For production applications — live captioning, medical dictation, voice agents — it's a dealbreaker.

The hallucination issue is one of those problems that doesn't show up clearly in aggregate word-error-rate numbers. A model can have excellent overall accuracy while still producing catastrophically wrong output in specific edge cases. Silence handling is one of those edges. If Moonshine genuinely avoids this failure mode, that alone would make it worth switching for many production deployments — regardless of what the benchmarks say.
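A common mitigation, independent of which model you run, is to gate transcription behind voice-activity detection so the model never sees pure silence. The energy-threshold detector below is an illustrative sketch, not a production VAD (real deployments typically use a trained VAD model):

```python
import numpy as np

def speech_chunks(audio: np.ndarray, sample_rate: int = 16000,
                  frame_ms: int = 30, threshold: float = 1e-4):
    """Yield (start, end) sample ranges whose RMS energy exceeds a
    threshold. Feeding only these spans to an STT model avoids asking
    it to transcribe silence, where hallucination loops tend to occur."""
    frame = int(sample_rate * frame_ms / 1000)
    active_start = None
    for i in range(0, len(audio), frame):
        rms = np.sqrt(np.mean(audio[i:i + frame] ** 2))
        if rms > threshold and active_start is None:
            active_start = i
        elif rms <= threshold and active_start is not None:
            yield (active_start, i)
            active_start = None
    if active_start is not None:
        yield (active_start, len(audio))

# One second of silence followed by one second of a 440 Hz tone:
sr = 16000
audio = np.concatenate([np.zeros(sr),
                        0.1 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)])
print(list(speech_chunks(audio, sr)))  # [(15840, 32000)]: only the span around the tone
```

The detected span starts at the frame containing speech onset, which is why pre-filtering like this also trims the silent lead-in that models would otherwise have to explain away.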

The Parakeet comparison

NVIDIA's Parakeet V3 came up repeatedly. At 600 million parameters versus Moonshine's 245 million, some argued it isn't a fair comparison — Parakeet is a larger model with access to NVIDIA's training infrastructure. But one commenter pointed out that Parakeet actually has a “better real-time factor” despite the larger size, suggesting that raw parameter count doesn't map linearly to inference speed.

The debate is instructive. Parameter count is a proxy for efficiency, not a measure of it. Architecture choices, quantization, hardware optimization, and inference framework all affect how fast a model actually runs. A 600M parameter model that's been heavily optimized for GPU inference might outrun a 245M model on the same hardware. What matters for on-device deployment is not how many parameters a model has, but how fast it runs on the hardware you actually ship.
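Real-time factor, the number the commenter cited, is processing time divided by audio duration: below 1.0 means the model keeps up with live audio. A sketch of how you'd measure it for any transcription function (`transcribe` here is a stand-in for whatever model you're benchmarking, not a real API):

```python
import time

def real_time_factor(transcribe, audio_seconds: float) -> float:
    """RTF = wall-clock processing time / audio duration.
    RTF < 1.0 means the model runs faster than real time."""
    start = time.perf_counter()
    transcribe()  # run the model on the clip
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# Example: a fake "model" that takes 0.2 s to process a 10 s clip
rtf = real_time_factor(lambda: time.sleep(0.2), audio_seconds=10.0)
print(f"RTF ≈ {rtf:.2f}")  # ≈ 0.02, i.e. ~50x faster than real time
```

Measured this way, a heavily optimized 600M model can indeed post a better RTF than a lightly optimized 245M one on the same hardware, which is the commenter's point.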

The Morse code fine-tuner

One of the most compelling stories in the thread: a developer reported fine-tuning moonshine-tiny on amateur radio Morse code, achieving roughly a 2% character error rate after training on a single 4090 GPU. Not a cluster. Not a cloud training job. One consumer graphics card.

This is the kind of thing that only happens when models are small, open, and accessible. A 1.5 billion parameter model like Whisper Large v3 isn't something most individuals can fine-tune. The hardware requirements and training time put it out of reach for hobbyists and small teams. A 245M parameter model is different. It fits in memory. Training runs are short enough to iterate on. One person with one GPU can adapt it to a completely novel domain — in this case, converting audio patterns from 1830s-era telegraph technology into text. The long tail of speech-to-text applications gets much longer when the barrier to entry drops this far.
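Rough arithmetic shows why the single-GPU story is plausible. Full fine-tuning with Adam needs about 16 bytes per parameter in fp32 (4 for weights, 4 for gradients, 8 for the optimizer's two moment estimates), before counting activations. That 16-bytes figure is the standard Adam estimate, not a Moonshine-specific number:

```python
BYTES_PER_PARAM = 4 + 4 + 8  # fp32 weights + gradients + Adam moment estimates

def training_footprint_gb(params: float) -> float:
    """Rough fp32 Adam training memory, excluding activations."""
    return params * BYTES_PER_PARAM / 1e9

for name, params in [("245M-parameter model", 245e6),
                     ("1.5B-parameter model", 1.5e9)]:
    print(f"{name}: ~{training_footprint_gb(params):.1f} GB")

# 245M-parameter model: ~3.9 GB  -> fits comfortably on a 24 GB RTX 4090
# 1.5B-parameter model: ~24.0 GB -> already at the 4090's limit before
#                                   a single activation is stored
```

Mixed precision and parameter-efficient methods shrink both numbers, but the ratio between them stays the same: the small model leaves room to iterate, the large one doesn't.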

Firefighter tablets

A developer building tablet software for firefighters described needing on-premise deployment with Norwegian language support. The use case makes the stakes concrete: emergency workers operating in buildings, tunnels, and disaster zones where cellular connectivity is unreliable or nonexistent. They can't send audio to a cloud API. There is no cloud. There is a tablet, a voice, and a need for text — right now, in a burning building.

On-device models aren't a nice-to-have for this kind of deployment. They're a hard requirement. The question isn't whether local processing is preferable — it's whether the application exists at all without it. Moonshine's size makes it practical for exactly these kinds of edge deployments where Whisper Large would be too heavy to run on constrained hardware.

Streaming stability concerns

A developer building voice agents asked for streaming stability metrics — specifically “% partial tokens revised after 1s / 3s.” This is a technical question that reveals a real production pain point. When you're doing real-time transcription, the model outputs partial results as it processes audio. Those partial results get revised as more context arrives. If a word appears on screen and then changes a second later, it's jarring. If it happens constantly, the transcription feels unreliable even if the final output is accurate.

Another commenter reported negative experiences with streaming accuracy from other models. Real-time transcription is hard. Batch processing lets a model see the entire audio clip before committing to text. Streaming forces it to make decisions with incomplete information, then correct itself later. The tension between benchmark performance (measured on complete recordings) and production reliability (experienced as live, partial output) is one of the biggest gaps in how speech-to-text models are evaluated. Benchmarks test the final answer. Developers ship the intermediate ones.
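The metric the commenter asked for can be made concrete: log each partial hypothesis with a timestamp, then count what fraction of the tokens emitted at time t were changed by the hypothesis in view at t plus the delay. The definition below is my reading of the comment, not a published spec:

```python
def revision_rate(partials, delay: float) -> float:
    """partials: chronological list of (timestamp_seconds, token_list).
    Returns the fraction of tokens in each partial that differ from the
    same positions in the first hypothesis at least `delay` seconds later."""
    revised = total = 0
    for i, (t, tokens) in enumerate(partials):
        # first later hypothesis at least `delay` seconds on
        later = next((toks for t2, toks in partials[i + 1:]
                      if t2 >= t + delay), None)
        if later is None:
            continue  # stream ended before the window closed
        total += len(tokens)
        revised += sum(1 for k, tok in enumerate(tokens)
                       if k >= len(later) or later[k] != tok)
    return revised / total if total else 0.0

stream = [
    (0.0, ["the"]),
    (0.5, ["the", "fast"]),
    (1.2, ["the", "vast"]),           # "fast" revised to "vast"
    (2.0, ["the", "vast", "forest"]),
]
print(revision_rate(stream, delay=1.0))  # 1 of 3 checkable tokens revised ≈ 0.33
```

A metric like this would let vendors report streaming stability alongside WER, which is exactly the gap the commenter was pointing at.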

The licensing friction

Several commenters flagged the split licensing model. English models are MIT — do whatever you want. Non-English models carry a non-commercial license. For developers in English-speaking markets building personal projects or internal tools, this doesn't matter. For international developers who want to build products in Arabic, Japanese, Korean, Mandarin, Spanish, Ukrainian, or Vietnamese, it's a wall.

The frustration in the thread was palpable. Open-weight positioning creates an expectation of openness. When some languages are open and others aren't, it feels like a two-tier system. Several commenters said they'd wait for the non-commercial restriction to be lifted before investing development effort. Others pointed out that training data licensing for non-English languages is genuinely harder — the restriction may reflect legal reality rather than business strategy.

Why small models matter

The broader significance of Moonshine isn't just that it's accurate. It's that it's accurate and small. That combination unlocks a category of deployment that large models structurally cannot reach.

At 245 million parameters, Moonshine runs on consumer hardware. Laptops. Tablets. Phones. Not server-grade GPUs in a data center — the actual devices people carry around. That means:

  • No API keys, no per-minute billing, no metered usage. The model runs on hardware you already own. You pay once for the device. Inference is free after that. For applications that process hours of audio daily, the cost difference between local and cloud is enormous.
  • No data leaving the device. Your audio stays on your machine. Not because of a privacy policy. Not because of a terms-of-service promise. Because the architecture makes it physically impossible for your voice data to go anywhere else. Privacy by architecture, not by policy.
  • Fine-tunable by individuals. The morse code story proves it. One person, one GPU, a novel domain, 2% error rate. Try that with a 1.5 billion parameter model. Small models democratize customization.
  • Deployable anywhere. Hospitals where patient audio can't leave the building. Firetrucks where there's no cell signal. Airplanes at 35,000 feet. Submarines. Research stations. Anywhere a network connection is unreliable, unavailable, or unacceptable, local models work. Cloud models don't.
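The cost point in the first bullet is easy to make concrete. Cloud STT is typically billed per minute of audio; the $0.006/minute rate below is an illustrative assumption (roughly what hosted Whisper-class APIs have charged), not a quote for any specific service:

```python
RATE_PER_MINUTE = 0.006  # illustrative cloud STT price, an assumption

def annual_cloud_cost(hours_per_day: float,
                      rate: float = RATE_PER_MINUTE) -> float:
    """Per-minute metered billing, 365 days a year."""
    return hours_per_day * 60 * rate * 365

for hours in (1, 4, 8):
    print(f"{hours} h/day -> ${annual_cloud_cost(hours):,.0f}/year "
          f"vs $0 marginal cost on-device")
```

At one hour a day the gap is modest; at eight hours a day (live captioning, always-on dictation) it's over a thousand dollars a year per user, every year, with no hardware to show for it.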

The trend in machine learning for the past several years has been bigger models, more data, more compute. Moonshine represents the counter-movement: what's the smallest model that can hit production-grade accuracy? That question matters more for speech-to-text than for almost any other domain, because speech is inherently personal, often sensitive, and used in contexts where connectivity cannot be guaranteed.

Moonshine in Resonant

Resonant ships Moonshine models for on-device transcription on Mac. Here's what that means in practice.

The models run entirely on your hardware, optimized for Apple Silicon. No audio leaves your machine — ever. There is no server to send it to. Resonant doesn't operate cloud infrastructure for audio processing because it doesn't need to. The model is on your Mac. Your voice goes in. Text comes out. Everything happens locally.

You get low-latency live transcription as you speak. Not batch processing where you record, wait, and then get text back. Real-time output that appears as words leave your mouth. No account needed. No API key. No cloud dependency. No internet connection required. Download, install, start speaking.

Resonant supports multiple model families — Moonshine isn't the only option. But its combination of accuracy and efficiency makes it a strong default for everyday dictation. When you want to draft an email, write a document, send a message, or capture a thought, Moonshine's speed-to-accuracy ratio hits the sweet spot. Large enough to be accurate. Small enough to be fast. That's what you want from a model that runs every time you press a key to start talking.

Frequently asked questions

What are Moonshine STT models?

Moonshine is a family of open-weight speech-to-text models built by a small startup. They achieve lower word-error rates than OpenAI's Whisper Large v3 while running at a fraction of the parameter count (245M vs 1.5B), making them practical for on-device deployment. The English models are MIT licensed, and additional language support covers Arabic, Japanese, Korean, Mandarin, Spanish, Ukrainian, and Vietnamese.

How does Moonshine compare to Whisper?

Moonshine matches or beats Whisper Large v3 on standard benchmarks with roughly 6x fewer parameters. It also avoids some of Whisper's known issues, like hallucination loops during silence — a problem that multiple developers flagged in the Hacker News discussion. However, Whisper has broader language coverage and a larger ecosystem of tools, wrappers, and community resources built around it.

Can I run Moonshine on my Mac?

Yes. Moonshine models are small enough to run on consumer hardware. Resonant ships Moonshine models optimized for Apple Silicon, giving you fast on-device transcription without any cloud dependency. No account needed, no internet required. The models run entirely on your machine.

Is Moonshine open source?

The English models are MIT licensed — fully open for commercial and personal use. Non-English models (Arabic, Japanese, Korean, Mandarin, Spanish, Ukrainian, Vietnamese) currently carry a non-commercial license. This was a point of discussion in the Hacker News thread, with several developers noting it creates friction for international product development.


Try Resonant free

Private voice dictation for Mac and Windows. 100% on-device, no account required. Download and start speaking in under a minute.