Subtitle Formats & Editing Workflow
SRT vs VTT vs TXT, how timestamps are generated, transcript-to-cut workflow, and the limits of automatic transcription.
Subtitle Formats: SRT vs VTT vs TXT
AudioBuff exports transcripts as SRT, VTT, or TXT. Each format has different specs and ideal uses — pick based on your downstream workflow.
| Format | Full name | Timestamps | Primary use |
|---|---|---|---|
| SRT | SubRip Subtitle | HH:MM:SS,mmm (comma) | YouTube captions, Premiere Pro / DaVinci Resolve, most video editors |
| VTT | Web Video Text Tracks | HH:MM:SS.mmm (period) | HTML5 `<track>` element, web streaming, when subtitle styling is needed |
| TXT | Plain Text | None (text only) | Meeting notes, blog drafts, input to AI summarizers, searchable archives |
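To make the timestamp columns concrete, here is a minimal formatter that turns a millisecond offset into either style. This is an illustrative helper, not AudioBuff's actual code; the function name is hypothetical.

```python
def format_timestamp(ms: int, fmt: str = "srt") -> str:
    """Format a millisecond offset as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (VTT)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    sep = "," if fmt == "srt" else "."  # the separator is the only difference
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{sep}{millis:03d}"
```

The comma-vs-period separator really is the entire difference at the timestamp level; everything else about choosing a format comes down to what your player or editor accepts.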
SRT vs VTT: How to Choose
SRT and VTT look very similar but are distinct formats. SRT originated as the output format of SubRip, a tool for extracting subtitles from DVDs; its simplicity made it the de facto standard. VTT (WebVTT) is a W3C web standard and the only subtitle format the HTML5 `<track>` element can read directly.
Practical guidance: use SRT for video editors (Premiere Pro, Final Cut Pro, DaVinci Resolve, iMovie); use VTT for embedding video on a website with captions; YouTube accepts both. For pasting into Notion or Google Docs as readable text, use TXT.
SRT example:

```
1
00:00:00,000 --> 00:00:03,500
Hello, today’s topic is…
```

VTT example:

```
WEBVTT

00:00:00.000 --> 00:00:03.500
Hello, today’s topic is…
```

- Timestamp precision
- Both formats use millisecond precision (three decimal places). Whisper’s output has roughly 20 ms granularity, more than enough for subtitle display.
VTT requires a WEBVTT header line. A VTT file without `WEBVTT` at the top fails to load in most players. AudioBuff adds this header automatically.
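Because the timestamp separator and the header are the only structural differences shown above, converting SRT cue text to VTT can be sketched in a few lines. This is an illustrative helper (not AudioBuff's implementation); it covers basic cues only, not VTT styling or cue settings.

```python
import re

def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT cue text to VTT: swap the timestamp comma for a period
    and prepend the required WEBVTT header."""
    # 00:00:00,000 --> 00:00:03,500  becomes  00:00:00.000 --> 00:00:03.500
    vtt_body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text)
    return "WEBVTT\n\n" + vtt_body
```

The regex only touches commas that sit inside a timestamp, so commas in the subtitle text itself are left alone.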
How Timestamps Are Generated
Whisper splits audio into 30-second chunks (its encoder’s maximum input length). To prevent words from being cut at chunk boundaries, AudioBuff applies a 5-second stride overlap.
Tokens and timestamps from each chunk are merged into a single timeline, with overlapping regions deduplicated. This merging is the "finalize" phase — and it’s why the progress bar pauses near 90% before completing.
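The chunking scheme can be sketched as follows. This is a hypothetical illustration, and the exact overlap semantics are an assumption: here each new 30-second window starts 5 seconds before the previous one ended, so boundary words land in both windows.

```python
def chunk_bounds(total_s: float, chunk_s: float = 30.0, stride_s: float = 5.0):
    """Yield (start, end) windows over the audio. Each chunk overlaps the
    previous one by stride_s seconds so no word is cut at a boundary."""
    step = chunk_s - stride_s  # 25 s of new audio per chunk
    start = 0.0
    while start < total_s:
        yield (start, min(start + chunk_s, total_s))
        start += step
```

For a 60-second file this yields (0, 30), (25, 55), and (50, 60); the 5-second overlaps are what the finalize phase later deduplicates.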
The progress bar design: 0–90% reflects token generation (estimated at ~200 tokens/minute), 90–95% creeps over the time-based finalize phase, then 100% on completion. This keeps the UI feeling responsive even after token generation ends.
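The mapping just described can be sketched as a small function. This is an illustrative sketch of the stated design, not AudioBuff's actual code; `finalize_done` (0.0–1.0) is a hypothetical parameter for how far the time-based finalize estimate has advanced.

```python
def progress_percent(tokens_done: int, tokens_est: int,
                     finalize_done: float = 0.0) -> float:
    """Map transcription state to the 0-100% bar: 0-90% tracks token
    generation, 90-95% creeps through finalize, 100% on completion."""
    if tokens_done < tokens_est:
        return 90.0 * tokens_done / max(tokens_est, 1)
    if finalize_done < 1.0:
        return 90.0 + 5.0 * finalize_done
    return 100.0
```

Reserving the last 5–10% for work that has no token stream to count is a common trick for keeping a progress bar honest-looking during a phase whose duration can only be estimated.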
- Long audio works in one job
- Thanks to chunking + stride, Whisper handles arbitrarily long audio despite its 30-second design limit. AudioBuff transcribes hour-long recordings as a single job.
- What happens during finalize
- After token generation completes, timestamp recovery and chunk merging take seconds to tens of seconds. Longer audio means longer finalize time.
Transcript-to-Cut Workflow
A defining feature of AudioBuff: each transcribed segment has a scissors button that registers its start/end as a "cut" — and the audio export removes those spans automatically.
This collapses the typical podcast editing flow ("transcribe → decide what to cut → cut in a video editor") into a single project. Because the transcript and audio share the same timeline, no tool round-tripping is needed.
Cuts are recorded against the trimmed timeline. The efficient flow is: trim broadly first, transcribe, then add fine-grained cuts from the transcript.
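Conceptually, exporting with cuts means merging any overlapping cut spans and splicing the remaining audio together. The sketch below is a hypothetical illustration (not AudioBuff's implementation), assuming cut spans are given in seconds on the trimmed timeline:

```python
def apply_cuts(samples: list, rate: int, cuts: list) -> list:
    """Remove (start_s, end_s) cut spans from an audio sample buffer.
    Overlapping cuts are merged first so no span is removed twice."""
    merged = []
    for start, end in sorted(cuts):
        if merged and start <= merged[-1][1]:  # overlaps the previous cut
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    kept, pos = [], 0
    for start, end in merged:
        kept.extend(samples[pos:int(start * rate)])  # keep audio up to the cut
        pos = int(end * rate)                        # skip the cut span
    kept.extend(samples[pos:])
    return kept
```

Merging first is what makes it safe to click scissors on adjacent or overlapping segments: the export sees one clean set of disjoint spans.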
- Filler removal
- Filler words ("um," "uh," "like") often land in their own segments. Click the scissors on those segments to tighten the pacing.
- Undo a cut
- Cut segments can be restored with the undo icon; change your mind as many times as you like before exporting.
- On export
- Pressing "Buff It" exports audio (MP3 / WAV) with cut spans removed from the trimmed range. EQ, compressor, and loudness normalization apply to the post-cut audio.
Limits & Tips
Whisper is a strong general ASR model, but it has real limits. Knowing these and pre-processing accordingly improves accuracy noticeably.
- No speaker diarization
- Whisper alone cannot identify "who is speaking." Multi-speaker audio is transcribed as a flat timeline without speaker labels. For diarization, pair with a model like pyannote.audio.
- Hallucination on silence
- Given long silence, Whisper may generate phrases from common YouTube subtitle patterns ("Thanks for watching" etc.) that appeared in its training data. Use AudioBuff’s silence removal beforehand to suppress this.
- Names & jargon
- People’s names, company names, place names, technical terms, and recent slang are common error sources. The high-quality model handles them noticeably better than the standard model, but never perfectly. A find-and-replace pass in your text editor is realistic.
- Input quality matters
- Reverberation, noise, extreme volume swings, and clipping all hurt accuracy. Outdoor recordings or built-in phone microphones suffer the most. Apply EQ, high-pass filter, compression, and loudness normalization before transcribing for measurable gains.
TL;DR: ① trim away unnecessary head/tail and silence, ② high-pass at 80–120Hz to cut rumble, ③ use the high-quality model for complex audio (mixed languages, jargon, multiple speakers), ④ always plan for human review.
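The high-pass step in ② can be approximated with a first-order RC high-pass filter. This is an educational sketch, not AudioBuff's EQ: a real low-cut filter uses a steeper slope (e.g. 12–24 dB/octave), but the one-pole version shows the principle of attenuating rumble below the cutoff while passing voice.

```python
import math

def high_pass(samples, sample_rate: int, cutoff_hz: float = 100.0):
    """First-order high-pass (RC) filter: attenuates energy below
    cutoff_hz (rumble, handling noise) while passing voice frequencies."""
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out, prev_x, prev_y = [], samples[0], samples[0]
    for x in samples:
        y = alpha * (prev_y + x - prev_x)  # difference of input, leaked decay
        out.append(y)
        prev_x, prev_y = x, y
    return out
```

Feeding it a constant (DC) signal illustrates the point: the output decays toward zero, because a pure offset is exactly the sub-cutoff content a low-cut is meant to remove.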