Local Transcription Models in 2026: Parakeet, Whisper, and More
Resonant uses NVIDIA's Parakeet on Apple Silicon — a state-of-the-art on-device STT model tuned for low-latency live dictation. Audio never leaves your Mac.
Below is a field guide to the broader landscape of open local speech-to-text models in 2026 — Parakeet, Whisper, Moonshine, SenseVoice, and others — for context on why Parakeet is a good fit for live dictation on Apple Silicon today. The sections below describe each model on its own merits; Resonant itself ships only Parakeet.
Start here: Parakeet TDT 0.6B v3
Best for: English and European languages. Recommended for most people.
Size: ~640 MB — Languages: English + 24 European languages
Parakeet is the model Resonant ships. NVIDIA built it on the NeMo FastConformer architecture and trained it on over 660,000 hours of audio, and it shows. Word error rates on English, German, Spanish, Italian, and French are among the lowest of any locally-runnable model.
It auto-detects language across 25 European languages — you don't need to tell it what you're speaking. German comes in at 5.04% WER on FLEURS benchmarks. Spanish at 3.45%. Italian at 3.00%. For English dictation specifically, it's hard to beat at any price.
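If WER is new to you: it counts the word substitutions, deletions, and insertions needed to turn a model's transcript into the reference transcript, then divides by the number of reference words. Here's a minimal Python sketch of that calculation. The function name is ours, for illustration only, and real benchmark harnesses normalise punctuation and casing more carefully than this does.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match, or one word swapped for another
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown fox jumps over a lazy dog"
print(f"{word_error_rate(reference, hypothesis):.0%}")  # one swap + one dropped word
```

For a sense of scale, one wrong word in a 20-word sentence is a 5% WER, which is roughly the German figure quoted above.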
Parakeet also supports hotwords: you can bias it toward proper nouns, technical terms, or product names that matter to your work. If you dictate anything involving names, jargon, or specialized vocabulary, that feature alone makes it the right starting point.
If you speak English or another European language that Parakeet supports and you're not sure which model to use, use Parakeet. Switch only if you have a specific reason.
For lightweight English use: Moonshine v2 Medium
Best for: English-only workflows where you want a smaller footprint.
Size: ~200 MB — Languages: English only
Moonshine was built by Useful Sensors specifically for edge devices. The v2 Medium model has 245M parameters, roughly 40% of Parakeet's 600M, and its ~200 MB download is about a third the size of Parakeet's. Accuracy holds up well for everyday dictation: it scores a 6.65% word error rate on standard benchmarks.
The key difference from Whisper-based models is architecture. Moonshine doesn't pad audio to fixed 30-second chunks like Whisper does, which means shorter utterances process without unnecessary overhead. It was designed to be efficient, and on Apple Silicon that efficiency is noticeable.
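To put a rough number on that overhead, here's a small Python sketch of how much of a fixed 30-second window is padding for a short utterance. The 16 kHz sample rate and 30-second window are Whisper's standard input settings; the function name is hypothetical, and the sketch only models a single window. It says nothing about how either model schedules its work internally.

```python
SAMPLE_RATE_HZ = 16_000  # Whisper's expected input sample rate
WINDOW_SECONDS = 30      # fixed window length that shorter audio is padded to
WINDOW_SAMPLES = SAMPLE_RATE_HZ * WINDOW_SECONDS


def padded_fraction(utterance_seconds: float) -> float:
    """Fraction of one fixed-length window that is padding rather than speech."""
    if utterance_seconds <= 0:
        raise ValueError("utterance length must be positive")
    # Audio longer than the window is treated as filling it completely.
    speech_samples = min(round(utterance_seconds * SAMPLE_RATE_HZ), WINDOW_SAMPLES)
    return (WINDOW_SAMPLES - speech_samples) / WINDOW_SAMPLES


for seconds in (3, 10, 30):
    print(f"{seconds:>2} s utterance: {padded_fraction(seconds):.0%} of the window is padding")
```

A three-second dictation fills only a tenth of the window, so a fixed-window model spends most of its encoder pass on padding. Variable-length input avoids that.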
Choose Moonshine v2 Medium if you dictate in English, you want to keep your model download small, and you don't need hotwords or multilingual support.
For 99 languages: Whisper Large V3 Turbo
Best for: Any language not served by a dedicated model in this guide.
Size: ~1 GB — Languages: 99 languages
OpenAI's Whisper is the model that proved high-accuracy offline transcription was possible at scale. Whisper Large V3 Turbo is a distilled version — 809M parameters with only 4 decoder layers instead of the full 32 — which makes it significantly faster while keeping the broad language support intact.
If you dictate in Arabic, Hindi, Vietnamese, Thai, Hebrew, Turkish, Indonesian, or any of the other 80+ languages that Parakeet and SenseVoice don't cover, Whisper Turbo is worth knowing about. It defaults to English but supports 99 languages, and runs well via whisper.cpp on Apple Silicon.
It's the most widely trusted offline transcription model in the world. Resonant doesn't bundle Whisper today, but if you're comparing Resonant against another tool that uses Whisper in the cloud, Whisper Turbo is what that tool is effectively running — and you can run it locally yourself via whisper.cpp.
For faster English Whisper: Whisper Distil Large v3.5
Best for: English-only Whisper users who want more speed.
Size: ~1 GB — Languages: English only
Distil-Whisper is a knowledge-distilled version of Whisper Large V3, trained specifically on English. The accuracy on English short-form audio is within about 1% of the full Turbo model, but it runs 1.5x faster.
If your work is English-only and you find yourself transcribing frequently throughout the day, that speed difference adds up. The download size is comparable to Whisper Turbo, so the only tradeoff is losing multilingual support you may not need.
For East Asian languages: SenseVoice Small
Best for: Chinese, Japanese, Korean, and Cantonese.
Size: ~226 MB — Languages: Mandarin, English, Japanese, Korean, Cantonese
SenseVoice was built by Alibaba Research for exactly this use case. It uses a non-autoregressive CTC architecture with a real-time factor (RTF) of about 0.10: processing takes roughly one-tenth of the audio's duration, so a ten-second clip takes about one second to transcribe. That's very fast.
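RTF is simply processing time divided by audio duration, so lower is faster. A short Python sketch of the arithmetic, using the RTF figures quoted in this guide (the helper names are ours, and actual speed depends on your hardware):

```python
def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Time needed to transcribe a clip at a given real-time factor."""
    return audio_seconds * rtf


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


# RTF figures as quoted in this guide.
quoted_rtf = {
    "SenseVoice Small": 0.10,
    "Zipformer Japanese": 0.08,
    "Zipformer Korean": 0.034,
}

for model, rtf in quoted_rtf.items():
    print(f"{model}: {speedup(rtf):.1f}x real time, "
          f"{processing_seconds(60, rtf):.1f} s per minute of speech")
```

Anything below an RTF of 1.0 keeps up with live speech. The figures here leave a wide margin.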
It auto-detects across its five supported languages, so mixed-language dictation between, say, Mandarin and English works without switching modes. At 226 MB, the download is light for what it covers.
If you regularly switch between English and any East Asian language, SenseVoice is the right choice. For Mandarin-primary users who need the highest possible accuracy on Chinese, consider FireRedASR Large instead (below).
For Mandarin accuracy: FireRedASR Large
Best for: Mandarin-primary speakers who prioritize accuracy above all else.
Size: ~1.7 GB — Languages: Mandarin Chinese + English
FireRedASR Large is the best locally-runnable Mandarin model available. Built by the FireRed team on an attention encoder-decoder architecture, it achieves 3.18% character error rate on Mandarin benchmarks — state of the art for an offline model. It also handles Chinese dialects and code-switching between Mandarin and English.
The download is larger at 1.7 GB, and it runs somewhat slower than SenseVoice. But if your primary use is professional Mandarin dictation — documents, correspondence, clinical notes — the accuracy difference is worth it.
For Japanese: Zipformer Japanese
Best for: Japanese-only speakers who want the highest accuracy.
Size: ~148 MB — Languages: Japanese only
This model was trained on 35,000 hours of ReazonSpeech v2.0 data, one of the largest Japanese speech corpora publicly available. The Icefall Zipformer architecture runs at RTF 0.08, or roughly 12x faster than real time, so it keeps up with live dictation comfortably on any modern Mac.
SenseVoice covers Japanese, but if Japanese is your primary or only language, this dedicated model will generally outperform it. 148 MB is a compact download for what it delivers.
For Korean: Zipformer Korean
Best for: Korean speakers.
Size: ~68 MB — Languages: Korean only
At 68 MB, Zipformer Korean is one of the smallest local STT models around. It runs about 29x faster than real time (RTF 0.034), which puts it among the fastest in this lineup. A minute of speech processes in about two seconds on Apple Silicon.
If you dictate in Korean, this is the right choice. The download is negligible and the speed is unmatched.
For Russian: GigaAM v2 Russian
Best for: Russian-primary speakers.
Size: ~231 MB — Languages: Russian only
GigaAM v2 comes from SaluteSpeech, Sber's speech AI research team. It's commercially licensed and uses a NeMo transducer architecture trained specifically on Russian speech. It's the best locally-runnable Russian ASR model available.
If Russian is your primary dictation language, this model will significantly outperform Whisper Turbo on Russian content. 231 MB is compact for the coverage it provides.
For everything else: omniASR 300M
Best for: Any language not covered above.
Size: ~348 MB — Languages: 1,600+ languages
Meta's omniASR is a CTC model trained across over 1,600 languages. If you speak a language that isn't served by any of the dedicated models above and isn't in Whisper's 99-language set, omniASR is your option.
It covers many low-resource languages that no other local model does. Accuracy on well-represented languages is good; on lower-resource ones it varies, as it does with any model trained on limited data. But for languages with no other local option, it's a meaningful baseline.
Quick reference
| Model | Size | Best for |
|---|---|---|
| Parakeet TDT 0.6B v3 | 640 MB | English + 24 European languages. Start here. |
| Moonshine v2 Medium | 200 MB | English only. Lightweight alternative to Parakeet. |
| Whisper Large V3 Turbo | 1 GB | 99 languages. Use when no dedicated model covers your language. |
| Whisper Distil Large v3.5 | 1 GB | English only. Faster than Turbo, accuracy within about 1%. |
| SenseVoice Small | 226 MB | Mandarin, Cantonese, Japanese, Korean, English. |
| FireRedASR Large | 1.7 GB | Mandarin-primary. Best Chinese accuracy. |
| Zipformer Japanese | 148 MB | Japanese-only. Trained on 35k hours. |
| Zipformer Korean | 68 MB | Korean-only. Smallest and fastest model. |
| GigaAM v2 Russian | 231 MB | Russian-only. Best Russian accuracy. |
| omniASR 300M | 348 MB | 1,600+ languages. Universal fallback. |
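
If you're wiring up your own pipeline, the table boils down to a short lookup. The Python sketch below is one way to express it, not anything Resonant ships. The function name, the language codes, and the `whisper_supported` flag are all assumptions for illustration. The language sets hold only the languages this guide names, so extend them from each model's own documentation before relying on the result.

```python
# Dedicated single-purpose picks from the table. Codes are illustrative.
DEDICATED = {
    "ja": "Zipformer Japanese",
    "ko": "Zipformer Korean",
    "ru": "GigaAM v2 Russian",
    "zh": "FireRedASR Large",   # Mandarin-primary
    "yue": "SenseVoice Small",  # Cantonese
}

# Only the Parakeet languages this guide names; the full list has 25.
PARAKEET_NAMED = {"en", "de", "es", "it", "fr"}


def recommend_model(language_code: str, whisper_supported: bool = False) -> str:
    """Pick a model following the quick-reference table, most specific first.

    whisper_supported: the caller states whether the language is in
    Whisper's 99-language set, since that list is not reproduced here.
    """
    code = language_code.strip().lower()
    if code in DEDICATED:
        return DEDICATED[code]
    if code in PARAKEET_NAMED:
        return "Parakeet TDT 0.6B v3"
    if whisper_supported:
        return "Whisper Large V3 Turbo"
    return "omniASR 300M"


print(recommend_model("de"))
print(recommend_model("ko"))
print(recommend_model("th", whisper_supported=True))
```

The sketch ignores the English-only alternatives (Moonshine v2 Medium and Whisper Distil Large v3.5) and mixed-language dictation, where SenseVoice Small may be the better pick.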
What Resonant uses
Resonant ships with Parakeet on Apple Silicon — chosen for its accuracy on English and 24 European languages, low latency, and Neural Engine fit. The other models in this guide are useful context if you're evaluating local STT broadly, or running your own pipeline.
All transcription in Resonant runs on your Mac. No account required. No audio sent anywhere.