Voice Generation Guide
How browser-based voice cloning with Chatterbox works in AudioBuff. Includes the privacy model and the 2026 regulatory landscape.
Overview
AudioBuff Voice (Beta) clones a speaker's voice from a 10–20 second reference clip and reads out the text you provide using the same voice. All processing runs in your browser — neither the reference audio nor the generated audio is sent to any server.
The model is Resemble AI's Chatterbox, an open-weight 0.5B-parameter TTS model trained on 500,000+ hours of clean speech and distributed under the MIT license for both code and weights. In 2025, listeners in Podonos blind A/B tests reportedly preferred Chatterbox over ElevenLabs 63.75% of the time ([source](https://www.genmedialab.com/news/chatterbox-open-source-tts-elevenlabs-alternative/)).
AudioBuff loads the HuggingFace ONNX Community quantized export (~1.5 GB total) and runs inference in a Web Worker on WebGPU (or WebAssembly fallback). The full fp32 bundle is ~3.2 GB; quantizing the language model to Q4f16 cuts that roughly in half while preserving perceived quality.
Architecture and inference pipeline
Chatterbox is composed of four ONNX sessions. AudioBuff loads them in parallel inside a Web Worker and runs inference using the browser WebGPU API (or WebAssembly fallback).
- embed_tokens
- ~60 MB. Converts input text + position IDs + the exaggeration scalar into embeddings before they enter the language model.
- speech_encoder
- ~591 MB (fp32). Turns the reference clip into a 192-dim speaker x-vector and conditioning representation. Determines clone quality.
- language_model
- 0.5B Llama-derived (30 hidden layers, 16 KV heads). Q4f16 quantized to ~350 MB (vs ~2 GB fp32). Autoregressively generates audio tokens from text + speaker conditioning.
- conditional_decoder
- ~530 MB (fp32). Reconstructs a 24 kHz waveform from audio tokens via a mel decoder + learned vocoder.
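The per-file figures above can be cross-checked against the bundle totals quoted in this guide. The following is a small arithmetic sketch, assuming 1 GB = 1000 MB and assuming that embed_tokens ships at the same ~60 MB in both bundles (the guide does not say whether that figure is fp32); the interface and variable names are illustrative only.

```typescript
// Approximate per-file sizes in MB, taken from the component list above.
interface ComponentSize {
  name: string;
  fp32MB: number;
  // Size actually shipped; equals fp32MB when the file is not quantized.
  shippedMB: number;
}

const components: ComponentSize[] = [
  { name: "embed_tokens", fp32MB: 60, shippedMB: 60 }, // assumed identical in both bundles
  { name: "speech_encoder", fp32MB: 591, shippedMB: 591 },
  { name: "language_model", fp32MB: 2000, shippedMB: 350 }, // Q4f16 quantized
  { name: "conditional_decoder", fp32MB: 530, shippedMB: 530 },
];

// Sum one of the two size columns across all components.
function totalMB(sizes: ComponentSize[], key: "fp32MB" | "shippedMB"): number {
  return sizes.reduce((sum, c) => sum + c[key], 0);
}

const fp32Total = totalMB(components, "fp32MB"); // 3181 MB
const shippedTotal = totalMB(components, "shippedMB"); // 1531 MB

console.log(`fp32 bundle:    ${(fp32Total / 1000).toFixed(1)} GB`); // 3.2 GB
console.log(`shipped bundle: ${(shippedTotal / 1000).toFixed(1)} GB`); // 1.5 GB
console.log(`shipped / fp32: ${(shippedTotal / fp32Total).toFixed(2)}`); // 0.48
```

Quantizing only the language model accounts for the whole saving: the other three files are unchanged, and the total still drops to a little under half.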
On first use, all four files are downloaded from HuggingFace into the browser Cache API (~1.5 GB total). Subsequent sessions read from cache and only need to compile the ONNX sessions (30–60 seconds). On WebGPU, shader compilation runs on the first inference — AudioBuff runs an automatic warmup right after model load to absorb that one-time cost.
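The download-once, read-from-cache behaviour described above follows a cache-first pattern. The sketch below is a minimal illustration of that pattern, not AudioBuff's actual loader: `CacheLike`, `loadCached`, and `loadAll` are hypothetical names, and the cache and downloader are passed in as parameters so the logic can be exercised outside a browser.

```typescript
// Minimal shape of the Cache API methods this sketch relies on.
// In a browser, the object returned by `await caches.open(name)`
// has methods of this shape for T = Response.
interface CacheLike<T> {
  match(url: string): Promise<T | undefined>;
  put(url: string, value: T): Promise<void>;
}

interface LoadResult<T> {
  value: T;
  fromCache: boolean;
}

// Cache-first lookup: return the cached copy if present,
// otherwise download the file and store it for next time.
async function loadCached<T>(
  url: string,
  cache: CacheLike<T>,
  download: (url: string) => Promise<T>,
): Promise<LoadResult<T>> {
  const hit = await cache.match(url);
  if (hit !== undefined) {
    return { value: hit, fromCache: true };
  }
  const fresh = await download(url);
  await cache.put(url, fresh);
  return { value: fresh, fromCache: false };
}

// Load several model files in parallel, each through the cache-first path.
function loadAll<T>(
  urls: string[],
  cache: CacheLike<T>,
  download: (url: string) => Promise<T>,
): Promise<LoadResult<T>[]> {
  return Promise.all(urls.map((url) => loadCached(url, cache, download)));
}
```

In a browser, `download` would typically be a thin wrapper around `fetch`. Because storing a `Response` in the Cache API consumes its body, real code would usually store `response.clone()` and keep the original for immediate use.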
Why browser-based
2026 is the year the AI inference cost crisis became visible. AnalyticsWeek's 2026 FinOps report finds 85% of enterprise AI spending now goes to inference, with major API prices rising 30–50% over 18 months. Server-side TTS as a SaaS becomes a pricing problem.
Browser-based execution sidesteps that entirely. Inference runs on the user's own GPU/CPU — no per-character billing for the provider, no upload of reference audio for the user.
By late 2025, WebGPU had landed in all major browsers: Chrome (since v113, 2023), Edge, Firefox 141 (Windows, July 2025), and Safari 26 (macOS Tahoe / iOS 26, September 2025). Coverage is around 70% of users in 2026, with WebAssembly as a graceful fallback for the rest. AudioBuff supports both paths automatically.
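Choosing between the two paths usually comes down to feature detection. The sketch below shows one common approach, assuming that a missing `navigator.gpu`, a `null` adapter, or a thrown error should all fall back to WebAssembly; `pickBackend` and `NavigatorLike` are hypothetical names, and this is not necessarily how AudioBuff decides.

```typescript
type Backend = "webgpu" | "wasm";

// Minimal view of `navigator` for this sketch. In browsers that expose
// WebGPU, `navigator.gpu.requestAdapter()` resolves to an adapter or null.
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<unknown> };
}

// Prefer WebGPU when an adapter is actually available; otherwise use WASM.
async function pickBackend(nav: NavigatorLike | undefined): Promise<Backend> {
  if (!nav || !nav.gpu) {
    return "wasm"; // API not exposed at all
  }
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // null means no usable GPU
  } catch {
    return "wasm"; // treat any failure as "no WebGPU"
  }
}
```

Checking for an adapter rather than only for the presence of `navigator.gpu` matters because a browser can expose the API while no suitable GPU is available.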
For a 0.5B-parameter model like Chatterbox, both Apple Silicon and discrete Nvidia GPUs are comfortably overpowered. The energy efficiency picture favors Apple though: M3/M4 Max draws 40–80 W, while an RTX 4090 sustains ~450 W under load — a real difference if you do this often.
Engineering a great reference clip
Clone quality depends almost entirely on the reference clip. "Garbage in, garbage out" is the rule, and this is the highest-leverage place to invest your time.
- Length: 10–20 seconds (under 10 is unstable)
- Practitioner consensus across Chatterbox-class TTS: under 3 seconds is too thin; from 3 to 15 seconds, quality improves roughly in proportion to clip length; beyond that it plateaus, and very long clips can cause the model to fail to emit an end-of-sequence (EOS) token, leading to runaway generation. AudioBuff recommends the 10–20 s band.
- Microphones (USB condensers under $200)
- Solid options: Rode NT-USB+ (32-bit float, on-board DSP), Audio-Technica AT2020USB-X, Elgato Wave:3 (24-bit/96 kHz), Maono PM500 (34 mm gold capsule). All are cardioid — they reject side and rear reflections naturally. Avoid laptop built-in microphones.
- Acoustic treatment without a studio
- Practical tricks: "blanket fort" with overlapping moving blankets on walls and floor; the closet-trick — don't record *inside* a closet (boxy sound), record *facing into* the open closet so clothes act as broadband absorbers. Furniture, rugs, and bookshelves are all natural absorbers.
- Natural pace and tone
- Avoid whispering, rushing, or exaggerated emotion. The closer you stay to your normal speaking voice, the better the clone. Don't try to do a "narrator voice"; record yourself speaking naturally for 10–20 seconds.
- No music, sound effects, or other voices
- If music or a second speaker bleeds in, the model treats it as part of the voice you want cloned and dutifully reproduces it. Check the recording environment before you hit Record.
- Match the sample style to your use case
- For a casual clone, read a casual script. For narration, read a structured passage. AudioBuff ships four built-in samples covering pangram, conversational, narration, and audiobook styles — pick whichever matches your target.
- Levels are handled automatically
- AudioBuff peak-normalizes to -3 dBFS and trims leading/trailing silence client-side. You only need to avoid clipping during recording.
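The automatic level handling and the length guidance in the list above can be sketched as follows. This is an illustration under assumptions, not AudioBuff's implementation: the function names are hypothetical, the silence threshold of 0.01 (about -40 dBFS) is an assumed value, and the guide does not specify whether trimming happens before or after normalization.

```typescript
// Convert a dBFS level to a linear amplitude (0 dBFS = 1.0 full scale).
function dbfsToLinear(db: number): number {
  return Math.pow(10, db / 20);
}

// Scale samples so the largest absolute value sits at `targetDb` dBFS.
function peakNormalize(samples: Float32Array, targetDb = -3): Float32Array {
  let peak = 0;
  for (let i = 0; i < samples.length; i++) {
    peak = Math.max(peak, Math.abs(samples[i]));
  }
  if (peak === 0) {
    return samples.slice(); // pure silence: nothing to scale
  }
  const gain = dbfsToLinear(targetDb) / peak;
  return samples.map((s) => s * gain);
}

// Drop leading and trailing samples whose magnitude is below `threshold`.
// The default threshold is an assumption made for this sketch.
function trimSilence(samples: Float32Array, threshold = 0.01): Float32Array {
  let start = 0;
  let end = samples.length;
  while (start < end && Math.abs(samples[start]) < threshold) start++;
  while (end > start && Math.abs(samples[end - 1]) < threshold) end--;
  return samples.slice(start, end);
}

// True when the clip falls inside the recommended 10–20 s band.
function isRecommendedLength(sampleCount: number, sampleRate: number): boolean {
  const seconds = sampleCount / sampleRate;
  return seconds >= 10 && seconds <= 20;
}
```

A target of -3 dBFS corresponds to a linear peak of about 0.708, which leaves headroom below full scale. Peak normalization cannot repair a recording that already clipped, which is why avoiding clipping at record time still matters.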
Using the Expressiveness slider
The slider goes from 0.0 to 1.5 and controls how much of the reference voice's emotional content to amplify. It's a Chatterbox-specific control closer to Classifier-Free Guidance than to ElevenLabs-style style prompts.
Resemble defaults to 0.5 — a good starting point for most use cases.
- 0.0–0.3
- Calm, restrained delivery. Good for technical explainers, news-style narration, and document readouts.
- 0.4–0.6
- Natural balance. Default recommended range. Suits dialogue and general narration.
- 0.7–1.0
- Expressive. Good for audiobooks and dramatic content.
- 1.0–1.5
- Exaggerated. Use intentionally; values above 0.8 tend to cross into uncanny-valley territory.
Higher exaggeration does not invent emotions that are not in the reference clip — it amplifies what is already there. Feeding a calm voice with exaggeration=1.5 produces slightly more expressive calm, not anger.
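A UI that exposes this slider typically clamps the value to the supported range and labels the band the user is in. The sketch below is one way to do that. All names are hypothetical, and because the documented bands leave small gaps (for example between 0.3 and 0.4), the cut-off points 0.35 and 0.65 are assumptions chosen as midpoints.

```typescript
const EXAGGERATION_MIN = 0.0;
const EXAGGERATION_MAX = 1.5;
const EXAGGERATION_DEFAULT = 0.5; // Resemble's default, per this guide

type Band = "calm" | "natural" | "expressive" | "exaggerated";

// Keep the slider value inside the supported range; fall back to the
// default when the input is not a number.
function clampExaggeration(value: number): number {
  if (Number.isNaN(value)) {
    return EXAGGERATION_DEFAULT;
  }
  return Math.min(EXAGGERATION_MAX, Math.max(EXAGGERATION_MIN, value));
}

// Map a value to the band names used in the list above.
// The 0.35 and 0.65 boundaries are assumed midpoints, not documented values.
function bandFor(value: number): Band {
  const v = clampExaggeration(value);
  if (v < 0.35) return "calm";
  if (v < 0.65) return "natural";
  if (v <= 1.0) return "expressive";
  return "exaggerated";
}

// Flag values in the range where output tends to sound artificial.
function mayFeelUncanny(value: number): boolean {
  return clampExaggeration(value) > 0.8;
}
```

For example, `bandFor(0.5)` yields `"natural"`, and an out-of-range input such as `2.0` is clamped to `1.5` before being classified as `"exaggerated"`.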
Use cases
Voice cloning is most powerful when it removes a constraint that previously required studio time:
- Podcast intros and outros
- Clone your voice once, then generate consistent intros and outros for every episode without re-tracking. Same energy, same level — no studio session needed.
- Video narration fixes
- When editing reveals a misspoken line, replace it by editing text instead of re-recording. AudioBuff's "Continue in audio editor" button then hands the clone over to the EQ + LUFS pipeline so you can match it to the rest of the soundtrack.
- Accessibility
- Customize a screen reader to use your own voice. ALS and post-laryngectomy patients have used voice banking to keep their voice after losing speech. Language learners can shadow in their own voice for self-evaluation.
- E-learning narration
- Record a six-hour course, then fix individual sentences later by editing text. No studio rebooking and no paid re-recording session.
- Indie game character voices
- One reference clip plus exaggeration variations gives you several distinct character takes. Indies without a casting budget can prototype and ship with the same workflow.
Watermarking and responsible use
Generated audio carries an inaudible Resemble Perth watermark by default, and there's no way to turn it off in AudioBuff. It's a neural watermark that survives MP3 compression and resampling, so any clip can later be technically confirmed as AI-generated.
Disclosure of synthetic voice is becoming a legal requirement worldwide (EU AI Act Article 50, Tennessee's ELVIS Act in the US, platform policies). When distributing, remember two things: (1) you must have permission to use the voice you cloned, and (2) you must disclose that the audio is AI-synthesized. A platform flag such as YouTube's "Altered or synthetic content" toggle is usually enough.
The watermark is a detection mechanism, not a prevention mechanism. The ethical judgement remains yours.
Privacy model
AudioBuff Voice's defining property is end-to-end browser execution. Unlike commercial SaaS (ElevenLabs, Resemble's paid API), neither reference audio nor generated audio is uploaded to AudioBuff or to Resemble.
Model files are downloaded once from the Hugging Face CDN. After that, the feature works offline — including in airplane mode.
- What gets stored locally
- Cache API: model files (~1.5 GB) / localStorage: ethics-consent flag ("1" or absent).
- What never leaves your device
- Reference audio, input text, generated audio, expressiveness value, generation parameters.
- How to clear it
- Use the in-app "Clear cache" button to wipe Cache API in one click. localStorage can be cleared through normal browser settings.
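The local state described above is small enough to sketch in a few functions. The key name `voice-ethics-consent` and the cache name `voice-models` are hypothetical, since the guide only states that the consent flag is "1" or absent and that model files live in the Cache API. The storage objects are passed in as parameters so the logic does not depend on browser globals.

```typescript
// Subset of the Web Storage API used here; `window.localStorage` satisfies it.
interface StorageLike {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

// Subset of CacheStorage used here; the browser's global `caches` satisfies it.
interface CacheStorageLike {
  delete(name: string): Promise<boolean>;
}

const CONSENT_KEY = "voice-ethics-consent"; // hypothetical key name
const MODEL_CACHE = "voice-models"; // hypothetical cache name

// The flag is either the string "1" or absent.
function hasConsent(storage: StorageLike): boolean {
  return storage.getItem(CONSENT_KEY) === "1";
}

function grantConsent(storage: StorageLike): void {
  storage.setItem(CONSENT_KEY, "1");
}

// Removing the key returns the flag to its "absent" state.
function revokeConsent(storage: StorageLike): void {
  storage.removeItem(CONSENT_KEY);
}

// Delete the whole model cache; resolves to true if a cache was removed.
function clearModelCache(cacheStorage: CacheStorageLike): Promise<boolean> {
  return cacheStorage.delete(MODEL_CACHE);
}
```

In a browser these would be called as `hasConsent(localStorage)` and `clearModelCache(caches)`. Nothing in this state contains reference audio, input text, or generated audio.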
Honest limitations
Marketing pages skip these, but they matter for setting expectations:
- Singing, screaming, age extremes are not supported
- Chatterbox is trained on "expressive speech"; singing, shouting, infant voices, and 80-year-old voices are out of distribution. Community reports suggest that paralinguistic tags like [laugh] and [cough] work, but inconsistently.
- Uncanny valley above ~0.8 expressiveness
- At higher CFG/exaggeration values, the output can over-emote into something that feels artificial. The sweet spot is 0.4–0.6.
- Older hardware can be slow
- Laptops with integrated GPUs older than ~5 years may fall back to WebAssembly and take minutes to generate 20 seconds of audio. M-series Mac, Snapdragon X, or any RTX/Radeon-class GPU is recommended.
- English only for now
- AudioBuff currently uses the English Chatterbox model. Resemble shipped Chatterbox Multilingual (23 languages) in December 2025; AudioBuff will adopt it once the Transformers.js port completes.
Runs entirely in your browser. Choose a reference clip and generate speech from text: nothing is sent to a server, and no account is needed.