How Transcription Works

Whisper, on-device processing, model selection, WebGPU, and language selection — explained in depth.

What is Whisper?

Whisper is an automatic speech recognition (ASR) model released by OpenAI in 2022. The original Whisper was trained on 680,000 hours of multilingual audio. Whisper large-v3 (released 2023) expanded this to roughly 5 million hours: 1 million hours of weakly labeled audio plus 4 million hours of audio pseudo-labeled by Whisper large-v2. It acts as a "general-purpose speech understanding model," handling recognition, translation, and timestamping for 99 languages from a single set of weights. The large-v3-turbo model used by AudioBuff inherits these v3-series weights.

The architecture is a Transformer encoder-decoder. Input audio is converted to a log-Mel spectrogram (80 mel bins in the original models; 128 in the v3 series) and fed to the encoder; the decoder then generates text tokens autoregressively. Whisper is published in five sizes — tiny (39M params), base (74M), small (244M), medium (769M), large (1550M) — plus large-v3-turbo (809M), a faster variant whose decoder layer count is reduced from 32 to 4.
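
To make the front-end concrete, here is a minimal sketch of how a 30-second chunk maps to the spectrogram frames the encoder sees. The constants (16 kHz sample rate, 25 ms window, 10 ms hop, 30 s chunks) are Whisper's published preprocessing parameters:

```typescript
// Whisper's audio front-end constants.
const SAMPLE_RATE = 16_000;   // Hz; all input is resampled to 16 kHz
const N_FFT = 400;            // 25 ms analysis window
const HOP_LENGTH = 160;       // 10 ms hop between frames
const CHUNK_SECONDS = 30;     // audio is padded or trimmed to 30 s chunks

// Frames per 30 s chunk: one frame every HOP_LENGTH samples.
const samples = SAMPLE_RATE * CHUNK_SECONDS;       // 480,000 samples
const frames = Math.floor(samples / HOP_LENGTH);   // 3,000 frames

console.log(frames); // 3000
```

So regardless of model size, the encoder always consumes a fixed-shape spectrogram per chunk; longer files are processed as a sequence of 30-second windows.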

AudioBuff uses whisper-small (244M) for the standard model and whisper-large-v3-turbo (809M) for the high-quality model. We avoid tiny/base because their character error rate (CER) on Japanese (kanji, katakana) is too high to be practical.

Why It Works in the Browser

Whisper has historically been run via Python + PyTorch. AudioBuff combines @huggingface/transformers (Transformers.js, formerly published as @xenova/transformers) with ONNX Runtime Web to perform all inference inside a Web Worker in your browser. Audio is never sent to a server.
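
A minimal sketch of the worker-side wiring follows. The loader is injected so the lazy-singleton logic can be exercised without downloading a model; the model id and message shape are illustrative assumptions, not AudioBuff's actual code:

```typescript
// A transcription function: 16 kHz mono Float32Array in, text out.
type Asr = (audio: Float32Array) => Promise<{ text: string }>;

// Lazy singleton: the first message triggers the (slow) pipeline load;
// later messages reuse the same instance.
export function createWorkerHandler(loadAsr: () => Promise<Asr>) {
  let asr: Promise<Asr> | null = null;
  return async (audio: Float32Array) => {
    asr ??= loadAsr();
    return (await asr)(audio);
  };
}

// In the real worker, wire it to Transformers.js and postMessage, e.g.:
//   import { pipeline } from "@huggingface/transformers";
//   const handle = createWorkerHandler(() =>
//     pipeline("automatic-speech-recognition", "onnx-community/whisper-small"));
//   self.onmessage = async (e) => self.postMessage(await handle(e.data.audio));
```

Keeping inference in a worker means the multi-second decode never blocks the UI thread; the main thread only decodes the file to a Float32Array and posts it across.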

Model files are downloaded from the Hugging Face Hub on first use and cached in the browser’s Cache Storage. After the first run, transcription works offline. The standard model needs about 600MB of cache; the high-quality model needs about 1GB.
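
Since the cache lives in the browser's storage quota, it can be useful to check headroom before the first download. This browser-only sketch uses the standard `navigator.storage.estimate()` API; `formatMB` and `cacheHeadroom` are hypothetical helper names, not AudioBuff's API:

```typescript
// Format a byte count as whole megabytes (1 MB = 1,048,576 bytes here).
export function formatMB(bytes: number): string {
  return `${Math.round(bytes / 1_048_576)}MB`;
}

// Browser-only: report how much origin storage is used vs. available,
// to gauge whether the ~600MB / ~1GB model cache will fit.
export async function cacheHeadroom(): Promise<string> {
  const { usage = 0, quota = 0 } = await navigator.storage.estimate();
  return `${formatMB(usage)} used of ${formatMB(quota)} available`;
}
```

If the quota is exhausted mid-download, the browser may evict the cache and force a re-download on the next run, so checking first avoids a wasted transfer.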

The privacy implications are significant: medical, legal, and internal meeting audio can be transcribed without sending data to a third-party API. For GDPR and HIPAA scenarios, keeping data on-device offers clear audit advantages.

tip

Initial download time depends on model size and connection speed. The standard model (600MB) takes around a minute on a 100Mbps connection but can exceed 10 minutes on mobile networks. We recommend running the first transcription on Wi-Fi.

Model Selection: Standard (small) vs High Quality (large-v3-turbo)

AudioBuff offers two model choices. The differences come down to accuracy, speed, download size, and memory usage.

| Property | Standard (whisper-small) | High quality (large-v3-turbo) |
| --- | --- | --- |
| Parameters | 244M | 809M |
| Download size (quantized) | ~600MB | ~1GB |
| Encoder quantization | fp32 | fp16 |
| Decoder quantization | q4 (4-bit) | q4 (4-bit) |
| Japanese accuracy | Practical | Stronger on names & jargon |
| Inference speed (WebGPU; processing time ÷ audio length) | 0.3–0.5× realtime | 0.5–0.8× realtime |
| Memory usage | Relatively light | ~1.5GB peak |
| Best for | Podcasts, casual speech | Meetings, technical content, names |
tip

Loading the high-quality encoder as fp32 would require a single buffer larger than browsers allow — many engines cap one ArrayBuffer at roughly 2GB — causing out-of-memory errors. AudioBuff sidesteps this by loading the encoder as fp16, with effectively zero accuracy loss; faster-whisper likewise defaults to fp16 for large models.
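
The fp16/q4 split can be expressed with the per-module `dtype` option that Transformers.js v3 supports when loading a pipeline. This is a loading-configuration sketch, not AudioBuff's actual code; the model id shown is the ONNX community conversion, which may differ from what AudioBuff ships:

```typescript
import { pipeline } from "@huggingface/transformers";

// High-quality model: fp16 encoder + 4-bit decoder, as in the table above.
const asr = await pipeline(
  "automatic-speech-recognition",
  "onnx-community/whisper-large-v3-turbo",
  {
    device: "webgpu",
    dtype: {
      encoder_model: "fp16",        // fp32 would need a single >2GB ArrayBuffer
      decoder_model_merged: "q4",   // 4-bit weights keep the download near 1GB
    },
  },
);

// Then: const { text } = await asr(audioFloat32Array);
```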

WebGPU Acceleration & WASM Fallback

AudioBuff detects WebGPU at inference time and runs on the GPU when available, falling back to WebAssembly (with SIMD) for CPU execution otherwise. The check is just `navigator.gpu` presence — no configuration required.
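
The presence check can be sketched as a tiny pure function; the `nav` parameter is injectable for testing, and the name `pickDevice` is illustrative. A stricter variant could additionally `await navigator.gpu.requestAdapter()` and fall back when it resolves to null, since `navigator.gpu` existing does not guarantee a usable adapter:

```typescript
type Device = "webgpu" | "wasm";

// GPU when the WebGPU entry point exists, otherwise WASM (CPU with SIMD).
export function pickDevice(nav: { gpu?: unknown }): Device {
  return nav.gpu ? "webgpu" : "wasm";
}

// Usage in the browser: const device = pickDevice(navigator);
```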

As of January 2026, WebGPU is supported across all major browsers: Chromium-based browsers (Chrome, Edge, Brave, Opera) since 2023, Safari (enabled by default on macOS 26 / iOS 26 / iPadOS 26 / visionOS 26), and Firefox (141+ on Windows, 145+ on macOS ARM64; Linux support is planned for 2026).

Speedup varies by environment, but expect roughly 2–5× over CPU execution with WebGPU. The high-quality model is impractically slow without WebGPU, so we strongly recommend a WebGPU-capable browser when using it.

Where WebGPU works
Chrome 113+, Edge 113+, Opera 99+, Brave 1.50+ (macOS/Windows/Linux); Safari 26+ (macOS Tahoe / iOS 26 / iPadOS 26 / visionOS 26); Firefox 141+ (Windows) and 145+ (macOS ARM64). Android Chrome and Firefox on Linux remain version- and device-dependent.
WASM fallback caveat
CPU inference is slow: 1–2× realtime for the standard model and 3–5× or more for the high-quality model. We strongly recommend the standard model on WASM-only environments.

Language Selection & Accuracy

Whisper supports automatic language detection, but AudioBuff requires you to choose Japanese or English explicitly. The reason is accuracy, not UX.

Auto-detection inspects the first 30 seconds of audio. Silence, noise, short utterances, or code-switching can cause misdetection — and a wrong language token corrupts the entire transcript. Forcing the language token at the start of decoding stabilizes inference.
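
With the Transformers.js ASR pipeline, the forced language is passed as a per-call option and becomes the initial `<|ja|>` / `<|en|>` language token during decoding. The UI-choice mapping below is an illustrative assumption about how such a selector could feed the call:

```typescript
// Map an explicit UI choice to per-call decoding options.
export function asrOptions(uiChoice: "ja" | "en") {
  return {
    language: uiChoice === "ja" ? "japanese" : "english", // forces <|ja|> or <|en|>
    task: "transcribe" as const,                          // not "translate"
  };
}

// Usage: const { text } = await asr(audio, asrOptions("ja"));
```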

Across languages, English is the most accurate; Japanese trails slightly on kanji conversion and proper nouns. Fast speech, overlapping speakers, low volume, and reverberant audio all increase error rates regardless of language. Note that transcription receives the trimmed source audio only — EQ, silence removal, and loudness normalization are not applied. To maximize accuracy, first export the audio with Buff it and re-upload the result before running transcription.

Recommended Buff it settings for transcription
EQ: enable the high-pass filter (cuts low rumble and HVAC noise); preset "Clarity" or "Podcast."
Finishing: enable silence removal (suppresses hallucination), enable the compressor (evens out volume), set loudness to -16 LUFS (makes quiet speech easier to recognize).
Output: WAV recommended (avoids re-encoding artifacts). Drop the exported file back in and run transcription on it.
Code-switched audio
Whisper handles mixed-language audio but is biased toward the forced language token. For meetings that switch frequently, pick the dominant language and correct mistakes in post.
Common error patterns
Frequent misrecognitions include people’s names, place names, technical terms, acronyms, and recent proper nouns coined after the training cutoff. The high-quality model handles these markedly better than the standard model.
Hallucination on silence
Given silence or very low audio, Whisper sometimes outputs text from common YouTube subtitle phrases ("Thanks for watching" etc.) that appeared in its training data. Re-uploading audio that was processed with silence removal in Buff it suppresses this.