Subtitle Formats & Editing Workflow
SRT vs VTT vs TXT, how timestamps are generated, transcript-to-cut workflow, and the limits of automatic transcription.
Subtitle Formats: SRT vs VTT vs TXT
AudioBuff exports transcripts as SRT, VTT, or TXT. Each format has different specs and ideal uses — pick based on your downstream workflow.
| Format | Full name | Timestamps | Primary use |
|---|---|---|---|
| SRT | SubRip Subtitle | HH:MM:SS,mmm (comma) | YouTube captions, Premiere Pro / DaVinci Resolve, most video editors |
| VTT | Web Video Text Tracks | HH:MM:SS.mmm (period) | HTML5 `<track>` element, web streaming, when subtitle styling is needed |
| TXT | Plain Text | None (text only) | Meeting notes, blog drafts, input to AI summarizers, searchable archives |
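To make the timestamp columns concrete, here is a minimal formatter that turns a millisecond offset into either style. This is an illustrative helper, not AudioBuff's actual code; the function name is hypothetical.

```python
def format_timestamp(ms: int, fmt: str = "srt") -> str:
    """Format a millisecond offset as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    sep = "," if fmt == "srt" else "."  # the separator is the only difference
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{sep}{millis:03d}"
```

The comma-vs-period separator really is the entire difference at the timestamp level; everything else about choosing a format comes down to what your player or editor accepts.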
SRT vs VTT: How to Choose
SRT and VTT look very similar but are distinct formats. SRT originated as the output format of SubRip, a tool for extracting subtitles from DVDs; its simplicity made it the de facto standard. VTT (WebVTT) is a W3C web standard and the only subtitle format the HTML5 `<track>` element can read directly.
Practical guidance: use SRT for video editors (Premiere Pro, Final Cut Pro, DaVinci Resolve, iMovie); use VTT for embedding video on a website with captions; YouTube accepts both. For pasting into Notion or Google Docs as readable text, use TXT.
SRT example:

```
1
00:00:00,000 --> 00:00:03,500
Hello, today’s topic is…
```

VTT example:

```
WEBVTT

00:00:00.000 --> 00:00:03.500
Hello, today’s topic is…
```

- Timestamp precision
- Both formats use millisecond precision (three decimal places). Whisper’s output has roughly 20 ms granularity, more than enough for subtitle display.
VTT requires a WEBVTT header line. A VTT file without `WEBVTT` at the top fails to load in most players. AudioBuff adds this header automatically.
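Because the timestamp separator and the header are the only structural differences shown above, converting SRT cue text to VTT can be sketched in a few lines. This is an illustrative helper (not AudioBuff's implementation); it covers basic cues only, not VTT styling or cue settings.

```python
import re

def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT cue text to VTT: swap the timestamp comma for a period
    and prepend the required WEBVTT header."""
    # 00:00:00,000 --> 00:00:03,500  becomes  00:00:00.000 --> 00:00:03.500
    vtt_body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text)
    return "WEBVTT\n\n" + vtt_body
```

The regex only touches commas that sit inside a timestamp, so commas in the subtitle text itself are left alone.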
How Timestamps Are Generated
Whisper splits audio into 30-second chunks (its encoder’s maximum input length). To prevent words from being cut at chunk boundaries, AudioBuff applies a 5-second stride overlap.
Tokens and timestamps from each chunk are merged into a single timeline, with overlapping regions deduplicated. This merging is the "finalize" phase — and it’s why the progress bar pauses near 90% before completing.
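The chunking scheme can be sketched as follows. This is a hypothetical illustration, and the exact overlap semantics are an assumption: here each new 30-second window starts 5 seconds before the previous one ended, so boundary words land in both windows.

```python
def chunk_bounds(total_s: float, chunk_s: float = 30.0, stride_s: float = 5.0):
    """Yield (start, end) windows over the audio. Each chunk overlaps the
    previous one by stride_s seconds so no word is cut at a boundary."""
    step = chunk_s - stride_s  # 25 s of new audio per chunk
    start = 0.0
    while start < total_s:
        yield (start, min(start + chunk_s, total_s))
        start += step
```

For a 60-second file this yields (0, 30), (25, 55), and (50, 60); the 5-second overlaps are what the finalize phase later deduplicates.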
The progress bar design: 0–90% reflects token generation (estimated at ~200 tokens/minute), 90–95% creeps over the time-based finalize phase, then 100% on completion. This keeps the UI feeling responsive even after token generation ends.
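The mapping just described can be sketched as a small function. This is an illustrative sketch of the stated design, not AudioBuff's actual code; `finalize_done` (0.0–1.0) is a hypothetical parameter for how far the time-based finalize estimate has advanced.

```python
def progress_percent(tokens_done: int, tokens_est: int,
                     finalize_done: float = 0.0) -> float:
    """Map transcription state to the 0-100% bar: 0-90% tracks token
    generation, 90-95% creeps through finalize, 100% on completion."""
    if tokens_done < tokens_est:
        return 90.0 * tokens_done / max(tokens_est, 1)
    if finalize_done < 1.0:
        return 90.0 + 5.0 * finalize_done
    return 100.0
```

Reserving the last 5–10% for work that has no token stream to count is a common trick for keeping a progress bar honest-looking during a phase whose duration can only be estimated.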
- Long audio works in one job
- Thanks to chunking + stride, Whisper handles arbitrarily long audio despite its 30-second design limit. AudioBuff transcribes hour-long recordings as a single job.
- What happens during finalize
- After token generation completes, timestamp recovery and chunk merging take seconds to tens of seconds. Longer audio means longer finalize time.
Transcript-to-Cut Workflow
A defining feature of AudioBuff: each transcribed segment has a scissors button that registers its start/end as a "cut" — and the audio export removes those spans automatically.
This collapses the typical podcast editing flow ("transcribe → decide what to cut → cut in a video editor") into a single project. Because the transcript and audio share the same timeline, no tool round-tripping is needed.
Cuts are recorded against the trimmed timeline. The efficient flow is: trim broadly first, transcribe, then add fine-grained cuts from the transcript.
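Conceptually, exporting with cuts means merging any overlapping cut spans and splicing the remaining audio together. The sketch below is a hypothetical illustration (not AudioBuff's implementation), assuming cut spans are given in seconds on the trimmed timeline:

```python
def apply_cuts(samples: list, rate: int, cuts: list) -> list:
    """Remove (start_s, end_s) cut spans from an audio sample buffer.
    Overlapping cuts are merged first so no span is removed twice."""
    merged = []
    for start, end in sorted(cuts):
        if merged and start <= merged[-1][1]:  # overlaps the previous cut
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    kept, pos = [], 0
    for start, end in merged:
        kept.extend(samples[pos:int(start * rate)])  # keep audio up to the cut
        pos = int(end * rate)                        # skip the cut span
    kept.extend(samples[pos:])
    return kept
```

Merging first is what makes it safe to click scissors on adjacent or overlapping segments: the export sees one clean set of disjoint spans.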
- Filler removal
- Filler words ("um," "uh," "like") often land in their own segments. Click the scissors on those segments to tighten the pacing.
- Undo a cut
- Cut segments can be restored with the undo icon; change your mind as many times as you like before exporting.
- On export
- Pressing "Buff It" exports audio (MP3 / WAV) with cut spans removed from the trimmed range. EQ, compressor, and loudness normalization apply to the post-cut audio.
Limits & Tips
Whisper is a strong general ASR model, but it has real limits. Knowing these and pre-processing accordingly improves accuracy noticeably.
- No speaker diarization
- Whisper alone cannot identify "who is speaking." Multi-speaker audio is transcribed as a flat timeline without speaker labels. For diarization, pair with a model like pyannote.audio.
- Hallucination on silence
- Given long silence, Whisper may generate phrases from common YouTube subtitle patterns ("Thanks for watching" etc.) that appeared in its training data. Use AudioBuff’s silence removal beforehand to suppress this.
- Names & jargon
- People’s names, company names, place names, technical terms, and recent slang are common error sources. The high-quality model handles them noticeably better than the standard model, but never perfectly. A find-and-replace pass in your text editor is realistic.
- Input quality matters
- Reverberation, noise, extreme volume swings, and clipping all hurt accuracy. Outdoor recordings or built-in phone microphones suffer the most. Apply EQ, high-pass filter, compression, and loudness normalization before transcribing for measurable gains.
TL;DR: ① trim away unnecessary head/tail and silence, ② high-pass at 80–120Hz to cut rumble, ③ use the high-quality model for complex audio (mixed languages, jargon, multiple speakers), ④ always plan for human review.
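The high-pass step in ② can be approximated with a first-order RC high-pass filter. This is an educational sketch, not AudioBuff's EQ: a real low-cut filter uses a steeper slope (e.g. 12–24 dB/octave), but the one-pole version shows the principle of attenuating rumble below the cutoff while passing voice.

```python
import math

def high_pass(samples, sample_rate: int, cutoff_hz: float = 100.0):
    """First-order high-pass (RC) filter: attenuates energy below
    cutoff_hz (rumble, handling noise) while passing voice frequencies."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out, prev_x, prev_y = [], samples[0], samples[0]
    for x in samples:
        y = alpha * (prev_y + x - prev_x)  # difference of input, leaked decay
        out.append(y)
        prev_x, prev_y = x, y
    return out
```

Feeding it a constant (DC) signal illustrates the point: the output decays toward zero, because a pure offset is exactly the sub-cutoff content a low-cut is meant to remove.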