Captioning in the Editor Starter

The Editor Starter ships with built-in caption generation for video and audio assets.
By default, it uses the OpenAI Whisper API.

For implementation details, refer to the source code in src/editor/captioning.

In the Editor Starter, captions are treated as a first-class item type, similar to videos, images, or audio. This allows them to be manipulated like any other layer in the timeline and canvas.

To generate captions using OpenAI's Whisper model, add your OpenAI key to the .env file:

OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxx

This enables server-side transcription if the /api/captions backend route is present.

When you click "Generate Captions" on a video or audio layer, the following happens:

  1. The audio is extracted client-side.
  2. The audio is uploaded to /api/captions and transcribed via OpenAI. Note: a 25MB limit applies to the uploaded file.
  3. OpenAI's response is converted into Remotion's Caption type and added to the timeline as a CaptionsItem.
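
The conversion in the last step can be sketched roughly as follows. This is a minimal illustration and not the code in src/editor/captioning: the names openAiWordsToCaptions, exceedsUploadLimit and MAX_UPLOAD_BYTES are hypothetical, the Caption type is copied locally instead of imported from @remotion/captions, and it assumes the transcription was requested with word-level timestamps so that each word carries start and end times in seconds.

```typescript
// Local copy of the Caption shape from @remotion/captions,
// so that this sketch needs no imports.
type Caption = {
  text: string;
  startMs: number;
  endMs: number;
  timestampMs: number | null;
  confidence: number | null;
};

// One word entry of an OpenAI "verbose_json" transcription with
// word-level timestamps. Times are in seconds.
type OpenAiWord = {word: string; start: number; end: number};

// Assumption: "25MB" is interpreted as 25 * 1024 * 1024 bytes.
const MAX_UPLOAD_BYTES = 25 * 1024 * 1024;

// Hypothetical guard to run before uploading the extracted audio.
const exceedsUploadLimit = (sizeInBytes: number): boolean =>
  sizeInBytes > MAX_UPLOAD_BYTES;

// Hypothetical helper: maps OpenAI words to Caption objects.
const openAiWordsToCaptions = (words: OpenAiWord[]): Caption[] =>
  words.map((w, i) => {
    const startMs = Math.round(w.start * 1000);
    const endMs = Math.round(w.end * 1000);
    return {
      // Assumption: every token except the first gets a leading space,
      // so that joining all tokens reproduces the sentence.
      text: i === 0 ? w.word.trim() : ` ${w.word.trim()}`,
      startMs,
      endMs,
      // Illustrative choice: use the midpoint of the word as its timestamp.
      timestampMs: Math.round((startMs + endMs) / 2),
      // Word entries in this shape carry no confidence score.
      confidence: null,
    };
  });
```

Remotion also provides openAiWhisperApiToCaptions() in the @remotion/openai-whisper package for this purpose. Check the source code to see which approach the starter actually uses.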

Editing

By default, the inspector lets users edit the following properties of captions:

  • Individual tokens
  • Typography: Font, text color, highlighted word color, opacity
  • Page duration
  • Timings of individual words

Automated creation of pages

Captions are automatically split into "pages" for easier management. Pages are timed groups of words or sentences that fit nicely on screen. This is done with createTikTokStyleCaptions() from the @remotion/captions package.
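
To make the idea of a page concrete, here is a simplified stand-in for the grouping step. It is not the algorithm used by createTikTokStyleCaptions(), and its output shape is not the library's: groupIntoPages, SketchPage and maxPageDurationMs are hypothetical names, and the only rule applied is that a new page starts once a token begins at least maxPageDurationMs after the first token of the current page.

```typescript
// Local copy of the Caption shape from @remotion/captions.
type Caption = {
  text: string;
  startMs: number;
  endMs: number;
  timestampMs: number | null;
  confidence: number | null;
};

// Hypothetical page shape used only in this sketch.
type SketchPage = {
  text: string;
  startMs: number;
  durationMs: number;
  tokens: Caption[];
};

// Groups captions into pages by time. Assumes captions are sorted by startMs.
const groupIntoPages = (
  captions: Caption[],
  maxPageDurationMs: number,
): SketchPage[] => {
  const pages: SketchPage[] = [];
  let current: Caption[] = [];

  // Turns the tokens collected so far into a page and starts a new one.
  const flush = () => {
    const first = current[0];
    const last = current[current.length - 1];
    if (!first || !last) {
      return;
    }
    pages.push({
      text: current.map((c) => c.text).join('').trim(),
      startMs: first.startMs,
      durationMs: last.endMs - first.startMs,
      tokens: current,
    });
    current = [];
  };

  for (const caption of captions) {
    const first = current[0];
    // Close the page if this token starts too long after the page began.
    if (first && caption.startMs - first.startMs >= maxPageDurationMs) {
      flush();
    }
    current.push(caption);
  }
  flush();

  return pages;
};
```

In this sketch, a smaller maxPageDurationMs yields more pages with fewer words each, which mirrors what the "Page duration" setting in the inspector controls conceptually.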

Alternatives

@remotion/whisper-web

You can replace the OpenAI Whisper API with @remotion/whisper-web for local, in-browser transcription.
This eliminates the need for an OpenAI key and S3 fetches for transcription, but you'll still need to handle audio loading locally.

Caveats:

  • Performance: Transcription runs on the CPU in the browser, which can be significantly slower than GPU-accelerated options like OpenAI's cloud service.
  • Model size: Smaller models (e.g., 'tiny') are faster but less accurate; larger ones require more memory and storage space.
  • You need to enable cross-origin isolation for your app.
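
Cross-origin isolation is enabled by serving the app with two HTTP response headers. The sketch below only defines them and merges them into an existing header map. The helper name withCrossOriginIsolation is hypothetical, and how the headers get attached to responses depends on your framework or hosting setup.

```typescript
// The two response headers that put a page into a cross-origin isolated state.
const CROSS_ORIGIN_ISOLATION_HEADERS: Record<string, string> = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
};

// Hypothetical helper: adds the isolation headers to an existing header map
// without mutating it. The isolation headers win on conflict.
const withCrossOriginIsolation = (
  existing: Record<string, string> = {},
): Record<string, string> => ({
  ...existing,
  ...CROSS_ORIGIN_ISOLATION_HEADERS,
});
```

Once deployed, you can check `self.crossOriginIsolated` in the browser console to confirm that isolation is active. Note that with `require-corp`, cross-origin assets must opt in via CORS or a Cross-Origin-Resource-Policy header before they will load.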

@remotion/install-whisper-cpp

You can use @remotion/install-whisper-cpp to transcribe audio on a Node.js server.

Caveats:

  • You are responsible for hosting and scaling the server, which can be costly and complex.

More alternatives

Any transcription method can be used.
We recommend converting the result to the Caption shape so that the rendering and editing code does not need to be refactored.
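
As an illustration of such a conversion, the sketch below maps a cue with SRT-style timestamps to the Caption shape. The SrtCue type and the helpers srtTimestampToMs and cueToCaption are hypothetical names for this example. For real SRT files, @remotion/captions offers parseSrt(), so this only shows the kind of mapping a custom transcription source would need.

```typescript
// Local copy of the Caption shape from @remotion/captions.
type Caption = {
  text: string;
  startMs: number;
  endMs: number;
  timestampMs: number | null;
  confidence: number | null;
};

// Hypothetical input: one cue with "HH:MM:SS,mmm" timestamps.
type SrtCue = {start: string; end: string; text: string};

// Converts "HH:MM:SS,mmm" (or "HH:MM:SS.mmm") to milliseconds.
const srtTimestampToMs = (timestamp: string): number => {
  const match = /^(\d{2,}):(\d{2}):(\d{2})[,.](\d{3})$/.exec(timestamp.trim());
  if (!match) {
    throw new Error(`Invalid timestamp: ${timestamp}`);
  }
  const [, h, m, s, ms] = match;
  return (
    Number(h) * 3_600_000 + Number(m) * 60_000 + Number(s) * 1000 + Number(ms)
  );
};

// Maps one cue to a Caption. Fields the source does not provide are set to null.
const cueToCaption = (cue: SrtCue): Caption => ({
  text: cue.text,
  startMs: srtTimestampToMs(cue.start),
  endMs: srtTimestampToMs(cue.end),
  timestampMs: null,
  confidence: null,
});
```

Because the output matches the Caption shape, it can be passed on unchanged to whatever consumes captions elsewhere in the editor.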

See also