Captioning in the Editor Starter
The Editor Starter comes with a method to generate captions for videos and audio assets.
It uses the OpenAI Whisper API by default.
For implementation details, refer to the source code in src/editor/captioning.
In the Editor Starter, captions are treated as a first-class item type, similar to videos, images, or audio. This allows them to be manipulated like any other layer in the timeline and canvas.
Setup with OpenAI Whisper (recommended)
To generate captions using OpenAI's Whisper model, add your OpenAI API key to the .env file:
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxx
This enables server-side transcription, provided the /api/captions backend route is present.
When you click "Generate Captions" on a video or audio layer, the following happens:
- The audio is extracted client-side.
- The audio is uploaded to /api/captions and transcribed via OpenAI. Note that a 25 MB limit applies.
- OpenAI's response is converted into Remotion's Caption type and added to the timeline as a CaptionsItem.
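The conversion step can be sketched as follows. This is a simplified, self-contained stand-in and not the Editor Starter's actual code: wordsToCaptions is a hypothetical helper, the Caption type is redeclared locally to mirror the one from @remotion/captions, and the input is assumed to be the word-level timestamps (in seconds) that OpenAI returns in its verbose JSON format. Using the midpoint of each word for timestampMs is an assumption made for illustration. Remotion also provides openAiWhisperApiToCaptions() in @remotion/openai-whisper for this purpose.

```typescript
// Mirrors the Caption type from @remotion/captions, redeclared so the sketch is self-contained.
type Caption = {
  text: string;
  startMs: number;
  endMs: number;
  timestampMs: number | null;
  confidence: number | null;
};

// One word entry from OpenAI's verbose JSON response (times are in seconds).
type OpenAiWord = { word: string; start: number; end: number };

// Hypothetical helper: converts word-level timestamps into Caption objects.
function wordsToCaptions(words: OpenAiWord[]): Caption[] {
  return words.map((w, i) => ({
    // Every word except the first gets a leading space, so that
    // concatenating the texts yields a readable sentence.
    text: i === 0 ? w.word.trim() : ` ${w.word.trim()}`,
    startMs: Math.round(w.start * 1000),
    endMs: Math.round(w.end * 1000),
    // Assumption: use the midpoint of the word as its timestamp.
    timestampMs: Math.round(((w.start + w.end) / 2) * 1000),
    // Word-level entries carry no confidence score in this sketch.
    confidence: null,
  }));
}

const captions = wordsToCaptions([
  { word: "Hello", start: 0, end: 0.5 },
  { word: "world", start: 0.5, end: 1.2 },
]);
console.log(captions.map((c) => c.text).join(""));
```

Keeping the leading space inside each token means that the full text can be rebuilt by plain concatenation, without having to guess where the word boundaries were.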
Editing
The inspector allows users to edit the following properties of captions by default:
- Individual tokens
- Typography: Font, text color, highlighted word color, opacity
- Page duration
- Timings of individual words
Automated creation of pages
Captions are automatically split into "pages" for easier management. Pages are timed groups of words or sentences that fit nicely on screen. This is done with createTikTokStyleCaptions() from the @remotion/captions package.
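To illustrate the idea of paging, here is a minimal sketch that groups captions into pages by a maximum page duration. It is a simplified stand-in and not the algorithm used by createTikTokStyleCaptions(): groupIntoPages, the Page type and maxPageDurationMs are hypothetical names chosen for this example. The real function takes the captions together with a combineTokensWithinMilliseconds option and returns the pages.

```typescript
// Mirrors the Caption type from @remotion/captions, redeclared so the sketch is self-contained.
type Caption = {
  text: string;
  startMs: number;
  endMs: number;
  timestampMs: number | null;
  confidence: number | null;
};

// Hypothetical page shape for this sketch.
type Page = { text: string; startMs: number; endMs: number; captions: Caption[] };

// Hypothetical helper: a caption joins the current page as long as the page
// would not span more than maxPageDurationMs; otherwise a new page starts.
function groupIntoPages(captions: Caption[], maxPageDurationMs: number): Page[] {
  const pages: Page[] = [];
  for (const caption of captions) {
    const current = pages.length > 0 ? pages[pages.length - 1] : undefined;
    if (current && caption.endMs - current.startMs <= maxPageDurationMs) {
      current.captions.push(caption);
      current.text += caption.text;
      current.endMs = caption.endMs;
    } else {
      pages.push({
        // Drop the leading whitespace of the first token on a page.
        text: caption.text.replace(/^\s+/, ""),
        startMs: caption.startMs,
        endMs: caption.endMs,
        captions: [caption],
      });
    }
  }
  return pages;
}

const word = (text: string, startMs: number, endMs: number): Caption => ({
  text,
  startMs,
  endMs,
  timestampMs: null,
  confidence: null,
});

const pages = groupIntoPages(
  [word(" Hello", 0, 400), word(" world", 400, 900), word(" again", 1500, 2000)],
  1000,
);
console.log(pages.map((p) => `${p.startMs}: ${p.text}`));
```

A smaller duration yields shorter pages with fewer words on screen at once, which is the trade-off the "Page duration" setting in the inspector controls.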
Alternatives
@remotion/whisper-web
You can replace the OpenAI Whisper API with @remotion/whisper-web for local, in-browser transcription.
This eliminates the need for an OpenAI key and S3 fetches for transcription, but you'll still need to handle audio loading locally.
Caveats:
- Performance: Transcription runs on the CPU in the browser, which can be significantly slower than GPU-accelerated options like OpenAI's cloud service.
- Model size: Smaller models (e.g., 'tiny') are faster but less accurate; larger ones require more memory and space.
- You need to enable cross-origin isolation for your app.
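Cross-origin isolation is enabled by serving the app with two HTTP response headers. The sketch below shows the header values and a runtime check. The helper names withCrossOriginIsolation and isCrossOriginIsolated are hypothetical, and how the headers are attached depends on your server or framework, which this example does not assume.

```typescript
// The two response headers that make a page cross-origin isolated.
const CROSS_ORIGIN_ISOLATION_HEADERS: Record<string, string> = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

// Hypothetical helper: merges the isolation headers into an existing header map.
function withCrossOriginIsolation(
  headers: Record<string, string> = {},
): Record<string, string> {
  return { ...headers, ...CROSS_ORIGIN_ISOLATION_HEADERS };
}

// Hypothetical helper: browsers expose `crossOriginIsolated` as a global boolean.
// Outside of a browser the property is undefined, so this returns false.
function isCrossOriginIsolated(): boolean {
  const scope = globalThis as unknown as { crossOriginIsolated?: boolean };
  return scope.crossOriginIsolated === true;
}

console.log(withCrossOriginIsolation({ "Cache-Control": "no-store" }));
console.log("isolated:", isCrossOriginIsolated());
```

Be aware that with "require-corp", cross-origin resources such as images, videos, and fonts must opt in via CORS or a Cross-Origin-Resource-Policy header, otherwise the browser blocks them.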
@remotion/install-whisper-cpp
You can use @remotion/install-whisper-cpp to transcribe audio on a Node.js server.
Caveats:
- You are responsible for hosting and scaling the server, which can be costly and complex.
More alternatives
Any transcription method can be used.
We recommend that you convert the captions to the Caption shape so that rendering and editing the captions does not need to be refactored.
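When plugging in another transcription source, it can help to verify that its output matches the Caption shape before adding it to the timeline. The following is a minimal sketch of such a check. isCaption is a hypothetical helper, the Caption type is redeclared locally to mirror the one from @remotion/captions, and the rule that endMs must not precede startMs is an assumption added for this example.

```typescript
// Mirrors the Caption type from @remotion/captions, redeclared so the sketch is self-contained.
type Caption = {
  text: string;
  startMs: number;
  endMs: number;
  timestampMs: number | null;
  confidence: number | null;
};

// Hypothetical type guard: checks that an unknown value has the Caption shape.
function isCaption(value: unknown): value is Caption {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const { text, startMs, endMs, timestampMs, confidence } = value as Record<
    string,
    unknown
  >;
  const isNullableNumber = (x: unknown): boolean =>
    x === null || typeof x === "number";
  return (
    typeof text === "string" &&
    typeof startMs === "number" &&
    typeof endMs === "number" &&
    // Assumption for this sketch: a caption must not end before it starts.
    endMs >= startMs &&
    isNullableNumber(timestampMs) &&
    isNullableNumber(confidence)
  );
}

// Example: filter the output of a custom transcription backend.
const raw: unknown[] = [
  { text: " Hello", startMs: 0, endMs: 400, timestampMs: null, confidence: 0.9 },
  { text: " broken", startMs: 500 },
];
const valid = raw.filter(isCaption);
console.log(valid.length);
```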