Guide · Feb 25, 2026

The Local STT Moment: Small Models Are Replacing Cloud Transcription

For years, the best speech-to-text required a server. You'd record audio, send it somewhere, and wait. In 2026, that's no longer true. A wave of small, efficient, open-weight models — Moonshine, Parakeet, and others — now match or beat the accuracy of cloud transcription services. And they run on your laptop.

A recent Hacker News thread about Moonshine's new STT models captured the shift in real time. Developers aren't just interested in local models. They're building with them — and demanding more.

This piece looks at where the local STT landscape actually stands: which models are worth paying attention to, why developers are moving away from cloud APIs, and what people are building with on-device speech recognition right now.

The new landscape

Three models define the local STT conversation in 2026. Each takes a different approach to the same problem: accurate speech-to-text without sending audio to someone else's server.

| | Moonshine | Parakeet V3 | Whisper Large v3 |
|---|---|---|---|
| Developer | Moonshine AI (6-person startup) | NVIDIA | OpenAI |
| Parameters | 245M | 600M | 1.5B |
| License | MIT (English) / Non-commercial (other) | Apache 2.0 | MIT |
| Languages | 8 | English-focused | 100+ |
| On-device | Yes | Yes | Yes (with effort) |
| Streaming | Yes | Limited | No (batch only) |
| Hallucination risk | Low | Low | Known issue |

Moonshine

The smallest and most efficient of the three. Moonshine was built for streaming — words appear as you speak them, with minimal revision of earlier tokens. At 245M parameters, it's roughly 6x smaller than Whisper Large v3, yet matches or beats it on standard English benchmarks. The English model is MIT licensed. The team behind it is six people. They're competing with OpenAI and NVIDIA on accuracy, and winning on efficiency.

Parakeet V3

NVIDIA's entry into the local STT space. Larger at 600M parameters, but with strong real-time performance thanks to NVIDIA's optimization work. Apache 2.0 licensed, which matters for commercial use. Some developers in the HN thread reported Parakeet as their preferred model for local transcription, particularly for accuracy on longer passages. It runs through NVIDIA's NeMo toolkit, which is well documented but adds some setup overhead compared to pip-installable alternatives.

Whisper Large v3

Still the most widely known. Whisper has the largest ecosystem of any open speech model — wrappers, fine-tunes, optimizations, and frontends in every language. It supports over 100 languages and remains the default recommendation for multilingual transcription. But the hallucination problem is real. Whisper generates phantom text during silences: words, phrases, sometimes entire sentences that were never spoken. For batch transcription of clean audio, this is manageable. For live dictation or transcription of audio with pauses, it's a production-breaking issue that doesn't show up in word-error-rate benchmarks.
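A common mitigation, used by many Whisper frontends, is to gate audio on a simple energy threshold before it ever reaches the model, so silence is never transcribed at all. Here's a minimal sketch of that idea — the chunk size and threshold are illustrative assumptions, not part of Whisper's API, and production systems typically use a proper VAD model instead:

```python
import math

def rms(samples):
    """Root-mean-square energy of a chunk of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def voiced_chunks(samples, chunk_size=1600, threshold=0.01):
    """Yield only chunks whose energy exceeds the threshold.

    Silent chunks are dropped before transcription, so the model
    never sees audio it could hallucinate over.
    """
    for i in range(0, len(samples) - chunk_size + 1, chunk_size):
        chunk = samples[i:i + chunk_size]
        if rms(chunk) >= threshold:
            yield chunk

# Toy signal: one chunk of near-silence followed by one chunk of speech-level audio.
silence = [0.001] * 1600
speech = [0.1] * 1600
kept = list(voiced_chunks(silence + speech))
# Only the speech chunk survives the gate; the silent chunk never
# reaches the transcriber.
```

The trade-off is that an energy gate can clip quiet speech onsets, which is why dedicated VAD models exist — but even this crude filter eliminates the worst class of silence hallucinations.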

Why cloud STT is losing ground

Three forces are driving developers and organizations toward local speech recognition. None of them are new, but all three hit a tipping point in 2026.

Privacy by necessity, not preference

Not everyone choosing local models is a privacy advocate. Some have no choice. The firefighter in the Hacker News thread building tablet software for emergency responders can't rely on cloud connectivity in the field. Doctors bound by HIPAA can't send patient dictation to third-party servers. Lawyers handling privileged communications need guarantees, not promises.

For these people, “we encrypt your data in transit” isn't enough. The only architecture that satisfies their constraints is one where audio never leaves the device. Local models make that possible without sacrificing accuracy.

Medical dictation on Mac: why local processing matters →

How Resonant handles your voice data →

Latency that cloud can't match

On-device transcription eliminates the round trip. When Moonshine runs locally, words appear as you speak them. There is no network hop, no queuing on a remote server, no variable latency depending on load. One developer in the HN thread specifically noted Resonant's “incredible latency for live transcription streaming.”

For voice agents, dictation apps, and real-time captioning, even 200ms of network latency breaks the experience. You feel the delay between speaking and seeing text. It interrupts your train of thought. Local inference removes that friction entirely. The model runs on your hardware, and the result appears immediately.

Cost at scale

Cloud STT APIs charge per minute of audio. Google Cloud Speech-to-Text, AWS Transcribe, and Azure Speech Services all bill on usage. At small volumes, the cost is negligible. At scale — thousands of hours of transcription per month — the bill adds up fast.

A local model costs nothing per inference after the initial download. The compute is yours. For companies processing large volumes of audio, the economics are straightforward: a one-time investment in hardware versus an ever-growing line item for API calls. For individuals, the equation is even simpler. Download a model, run it, pay nothing.
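The arithmetic is easy to check. With an illustrative per-minute rate (actual cloud pricing varies by provider, tier, and volume discounts):

```python
def monthly_cloud_cost(hours_per_month, rate_per_minute):
    """Cloud STT bills per minute of audio processed."""
    return hours_per_month * 60 * rate_per_minute

# Illustrative rate, not any specific provider's current pricing.
RATE = 0.024  # dollars per audio minute

hobby = monthly_cloud_cost(10, RATE)        # ~$14/month: negligible
production = monthly_cloud_cost(5000, RATE)  # $7,200/month: a real line item
```

At hobby volumes the cost rounds to lunch money; at thousands of hours per month it rivals the price of the GPU that would replace it — every month.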

What developers are actually building

The HN thread revealed use cases that go well beyond simple dictation. Local STT is enabling a class of applications that cloud APIs made impractical — either too expensive, too slow, or too constrained by privacy requirements.

  • Amateur radio morse code. One developer fine-tuned moonshine-tiny to decode morse code transmissions, achieving roughly 2% character error rate on a single 4090 GPU. This is the kind of specialized fine-tuning that was previously only accessible to well-funded research labs. Small open models make it possible for a single developer with a consumer GPU.
  • Emergency services. A developer building field tablets for firefighters needs on-premise deployment with Norwegian language support. No cloud fallback. No internet assumption. The deployment environment is a burning building or a roadside accident — connectivity is not guaranteed.
  • Live streaming. A commenter requested OBS plugin support for real-time translation during streams — code-switching between languages without interruption. This requires streaming STT with low enough latency to feel live, plus translation, all running locally to avoid the cost of processing hours of continuous audio through a cloud API.
  • Voice agents. A developer asked about streaming stability metrics: “what percentage of partial tokens get revised after 1 second? After 3 seconds?” These are production-grade questions. They show local STT is being evaluated for serious applications — voice interfaces that need to act on partial transcriptions in real time, not just produce a final transcript after the fact.

The common thread: these are applications that either can't use cloud STT (connectivity, privacy, cost) or shouldn't (latency, control). Local models don't just replicate cloud functionality on-device. They enable new categories of applications.
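The revision-rate question from the voice-agent developer can be made concrete. One simple definition — a partial token counts as revised if it differs from the token at the same position in the final transcript — can be computed in a few lines. This is a sketch of the metric, not any model's built-in API, and real evaluations typically use sequence alignment rather than position matching:

```python
def revision_rate(partials, final):
    """Fraction of tokens across partial transcripts later revised.

    A partial token is 'revised' if it differs from the token at the
    same position in the final transcript.
    """
    final_tokens = final.split()
    emitted = revised = 0
    for partial in partials:
        for i, tok in enumerate(partial.split()):
            emitted += 1
            if i >= len(final_tokens) or final_tokens[i] != tok:
                revised += 1
    return revised / emitted if emitted else 0.0

# Four partial transcripts from a streaming session; "sad" was later
# corrected to "sat", so 1 of the 10 emitted tokens was revised.
partials = ["the", "the cat", "the cat sad", "the cat sat on"]
final = "the cat sat on the mat"
rate = revision_rate(partials, final)  # 0.1
```

A voice agent that acts on partials needs this number to be low and to decay fast: a token that is still unrevised after a couple of seconds is safe to act on.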

The accuracy question

Accuracy is the first thing people ask about when evaluating local models. The honest answer is: it's good enough, and for some models, it's better than cloud alternatives.

  • Moonshine beats Whisper Large v3 on standard benchmarks with roughly one-sixth the parameters. That is remarkable. A 245M parameter model outperforming a 1.5B parameter model on English transcription tasks is not what anyone predicted two years ago.
  • Benchmarks don't capture everything. Whisper's hallucination loops during silence are a production problem that doesn't appear in word-error-rate scores. A model that invents text when nobody is speaking will score well on WER benchmarks (which measure against spoken audio) while being unreliable in real-world use.
  • Parakeet V3 is competitive or better on some benchmarks despite being larger. Parameter count alone isn't a proxy for quality. Architecture, training data, and optimization all matter. NVIDIA's investment in NeMo and training infrastructure pays off in model quality.
  • For most dictation and transcription work, all three models are good enough. The differences that matter now are practical: latency, streaming support, licensing, and deployment flexibility. If you need 100+ languages, Whisper is the only option. If you need streaming with low hallucination risk, Moonshine is the clear choice. If you need the strongest raw accuracy with Apache licensing, Parakeet is worth evaluating.

Running local STT today

The models exist. The accuracy is there. The practical question is how to actually use them. Here are the three main paths, depending on what you need.

Command line and Python

Moonshine and Whisper are pip-installable. Parakeet runs through NVIDIA's NeMo toolkit. If you're comfortable with a terminal, you can have any of these models running in minutes. This is the right path for prototyping, batch processing, fine-tuning on custom data, and integrating STT into custom pipelines. The ecosystem of wrappers, optimizations, and examples is deep — especially for Whisper, which has had years of community development.
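For Whisper, the path looks like this (openai-whisper is the official PyPI release and needs ffmpeg on your PATH; Moonshine's install command has changed between releases, so check the usefulsensors/moonshine repository for the current one):

```shell
# Whisper: official OpenAI release; requires ffmpeg installed.
pip install openai-whisper
whisper recording.wav --model large-v3

# Moonshine: install per the instructions in the usefulsensors/moonshine
# repo — the package name and source have varied across releases.
```

From there, batch transcription is a one-line loop over files, and fine-tuning follows each project's own training scripts.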

Desktop apps

Resonant ships Moonshine models optimized for Apple Silicon on Mac. Models run entirely on-device — no setup, no API keys, no cloud dependency. You download the app and start speaking. Text appears in whatever application has focus.

For developers who want accurate local dictation without building their own inference pipeline, this is the fastest path. No Python environment to configure, no model weights to download manually, no inference code to write. The models are bundled with the app and optimized for the hardware they're running on. Your audio never leaves your machine.

Download Resonant for Mac →

Self-hosted servers

For teams that need centralized transcription without third-party cloud services, these models can run on internal GPU servers. A single consumer GPU handles real-time transcription easily with any of the three models. Moonshine is the most efficient — its 245M parameter model requires minimal VRAM and can serve multiple concurrent streams on modest hardware. This approach gives you the accuracy of modern STT models with the data governance of on-premise infrastructure.
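Capacity planning for a shared server reduces to the model's real-time factor (RTF): seconds of processing per second of audio. A back-of-the-envelope sketch — the RTF values below are illustrative assumptions, not measured benchmarks, so profile your own hardware before committing:

```python
import math

def max_concurrent_streams(rtf, headroom=0.8):
    """Real-time streams one GPU can serve.

    rtf: processing seconds per second of audio (lower is better).
    headroom: fraction of GPU time budgeted for inference, leaving
    the rest for audio decoding, batching overhead, and load spikes.
    """
    return math.floor(headroom / rtf)

# Illustrative RTFs for a small vs. large model on one consumer GPU.
small_model_streams = max_concurrent_streams(0.05)  # 16 streams
large_model_streams = max_concurrent_streams(0.20)  # 4 streams
```

The gap compounds: a model with a quarter of the RTF serves four times the streams on the same hardware, which is where Moonshine's efficiency translates directly into server cost.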

What comes next

The trajectory is clear. Models are getting smaller and more accurate. Hardware is getting faster. The gap between local and cloud STT that existed even two years ago has collapsed for English and is narrowing for other languages.

The remaining frontiers are multilingual support (Moonshine covers 8 languages; Whisper covers 100+), specialized vocabularies (medical, legal, technical), and speaker diarization (who said what). These are hard problems, but they're being worked on by both the open-source community and commercial teams.

The bigger shift is cultural. Developers are no longer asking “is local STT good enough?” They're asking “which local model should I use?” That's a fundamentally different conversation. The cloud-first era of speech recognition is ending. What replaces it is a model that runs on your hardware, processes your voice on your terms, and keeps your words yours.

Frequently asked questions

What is local speech-to-text?

Local speech-to-text processes your audio entirely on your device — no internet connection, no cloud servers, no data leaving your machine. Models like Moonshine and Parakeet run directly on your CPU or GPU. The audio is converted to text right where it's recorded, and nothing is transmitted anywhere.

Which local STT model is the most accurate in 2026?

It depends on the use case. Moonshine leads on efficiency — 245M parameters with benchmark scores that match or beat models 6x its size. Parakeet V3 scores well on accuracy leaderboards at 600M parameters and is preferred by some developers for longer transcription tasks. Whisper Large v3 has the broadest language support but is known for hallucination issues during silence, which limits its reliability for live use.

Can local STT models run on a laptop?

Yes. Moonshine's smallest model runs comfortably on modern laptops without a dedicated GPU. Even the larger variants work well on Apple Silicon Macs and laptops with dedicated GPUs. You don't need server-class hardware for real-time transcription anymore. A MacBook Air with an M1 chip handles on-device dictation without breaking a sweat.

Is local STT as accurate as cloud transcription?

For English, yes. Moonshine and Parakeet match or beat major cloud APIs on standard benchmarks. The accuracy gap that justified cloud-only approaches no longer exists for most English transcription and dictation work. For less common languages, cloud services still have an edge due to larger training datasets, though the gap is narrowing as open-weight multilingual models improve.

Does Resonant use Moonshine models?

Yes. Resonant ships Moonshine models optimized for Apple Silicon. All transcription happens on your Mac — no audio is sent to any server, no account is required, and the models run with low latency for real-time dictation. You download the app and start speaking. Everything stays on your machine.


Try Resonant free

Private voice dictation for Mac and Windows. 100% on-device, no account required. Download and start speaking in under a minute.