Why Do This in Your Browser?
Cloud-based subtitle generators are everywhere — Rev's auto-captions, Veed, Kapwing, Descript, Submagic — and they all share two costs: usage limits or per-minute fees, and the upload step. The actual transcription engine that most of these tools wrap is OpenAI's Whisper, which is MIT-licensed and runs anywhere with enough memory to load it.
Whisper's tiny and base models (39M and 74M parameters, respectively) are small enough to run in a browser. The pipeline: extract the audio track from your video with FFmpeg.wasm, run Whisper via Transformers.js to produce timestamped segments, format them as SRT, then burn the SRT back into the video with FFmpeg's `subtitles=` filter. End-to-end, entirely in your browser.
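The "format them as SRT" step is plain string work. A minimal sketch, assuming Whisper output in the shape Transformers.js returns (an array of chunks, each with a `timestamp: [start, end]` pair in seconds and a `text` field):

```javascript
// Convert seconds to an SRT timestamp: HH:MM:SS,mmm
function toSrtTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  const pad = (n, w) => String(n).padStart(w, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}

// Format transcription chunks ({ timestamp: [start, end], text }) as an SRT file:
// a 1-based index, a "start --> end" line, the caption text, then a blank line.
function segmentsToSrt(segments) {
  return segments
    .map((seg, i) => {
      const [start, end] = seg.timestamp;
      return `${i + 1}\n${toSrtTime(start)} --> ${toSrtTime(end)}\n${seg.text.trim()}\n`;
    })
    .join("\n");
}
```

The resulting string is written into FFmpeg.wasm's virtual filesystem so the burn-in stage can read it as `subs.srt`.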
How It Works
The tool runs three stages. First, FFmpeg.wasm extracts the audio track as 16 kHz mono WAV (Whisper's expected input). Second, Transformers.js loads the chosen Whisper model (tiny / base / small) and runs ASR on the audio, producing word-level timestamps in any of Whisper's 99 supported languages. Third, the timestamped segments are formatted as an SRT file and passed back to FFmpeg with `subtitles=subs.srt` to burn the captions into the video with the chosen style.
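The two FFmpeg invocations (stages one and three) can be sketched as the argument lists you would hand to `ffmpeg.exec()`. The filenames here (`input.mp4`, `audio.wav`, `subs.srt`, `output.mp4`) are illustrative, not the tool's actual internals:

```javascript
// Stage 1: extract the audio track as 16 kHz mono WAV, Whisper's expected input.
const extractArgs = [
  "-i", "input.mp4",
  "-vn",           // drop the video stream
  "-ar", "16000",  // resample to 16 kHz
  "-ac", "1",      // downmix to mono
  "audio.wav",
];

// Stage 3: burn the generated SRT into the video with the subtitles filter.
const burnArgs = [
  "-i", "input.mp4",
  "-vf", "subtitles=subs.srt",
  "output.mp4",
];
```

Both runs read and write FFmpeg.wasm's in-memory filesystem, so nothing touches disk or network.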
Subtitle styling supports font size, color, outline, background opacity, vertical position, and the choice between burned-in and soft-subtitle export. Soft subtitles (an SRT stream muxed into MKV) keep the captions toggleable; burned-in subtitles guarantee they appear on every platform regardless of player support.
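Those style options map onto the `subtitles` filter's `force_style` parameter, which takes ASS style-field overrides. A sketch, with hypothetical option names; ASS colours are stored as `&HAABBGGRR&` (blue-green-red with an alpha byte, `00` = opaque):

```javascript
// "#RRGGBB" -> "&HAABBGGRR&" in ASS byte order.
function toAssColor(hexRgb, alpha = 0) {
  const r = hexRgb.slice(1, 3), g = hexRgb.slice(3, 5), b = hexRgb.slice(5, 7);
  const aa = alpha.toString(16).padStart(2, "0");
  return `&H${aa}${b}${g}${r}&`.toUpperCase();
}

// Build the full filter string from UI style options (names are illustrative).
function buildSubtitlesFilter(opts) {
  const style = [
    `FontSize=${opts.fontSize}`,
    `PrimaryColour=${toAssColor(opts.color)}`,
    `Outline=${opts.outline}`,
    `BackColour=${toAssColor("#000000", opts.backgroundAlpha)}`,
    `MarginV=${opts.marginV}`,  // vertical position: distance from the bottom edge
  ].join(",");
  return `subtitles=subs.srt:force_style='${style}'`;
}
```

For soft subtitles there is no filter at all: the SRT is muxed as an extra stream (`-i subs.srt -c copy -c:s srt output.mkv`-style arguments), which is why the captions stay toggleable.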
Tip: If you only need a transcript without burning subtitles into the video, the standalone Audio Transcription tool exports plain TXT / SRT / VTT / JSON. If you need to chop the clip first, use the Video Trimmer before running auto-subtitle to save processing time.
Common Use Cases
How We Compare
Honest read on free, paid, and self-hosted options for this kind of job: