Building a Lyric Video in One Session with Claude Code and Remotion

TL;DR

A lyric video is a programmatic video format where on-screen text is synced word-by-word to a song’s audio. This article documents shipping a production-quality lyric video for Monomythism’s “Hedonist” in a single AI-assisted session, using Remotion (React-based programmatic video) plus faster-whisper for word-level timestamps. The toolchain renders 16:9 (YouTube), 9:16 (Reels/TikTok), and 720x1280 (Spotify Canvas) from one codebase. Every architectural decision below is reusable.

The Problem: Ship a Lyric Video in One Day

We needed to make a lyric video in a day. There was no video editor, no After Effects template, no timeline. Just an MP3, lyrics, cover art, and a forest photograph.

The question: Can an AI agent build a production-quality lyric video from raw assets in a single working session?

The answer turned out to be yes — and the toolchain we landed on is fully repeatable.

Why Remotion (React-Based Video Framework)

We evaluated the options quickly and landed on Remotion — a React framework that renders video programmatically. The decision came down to three factors:

Composability: Every visual element is a React component. Fog layers, light rays, text effects — each is an isolated, tweakable unit.
Deterministic rendering: Frame-accurate output. No timeline drift, no codec surprises. The same code renders the same video every time.
Multi-format from one codebase: A single LyricVideo component with a layout prop produces both 16:9 (YouTube) and 9:16 (Reels/TikTok/Shorts) variants. We later added a 720x1280 Spotify Canvas composition with zero duplication of the effect layers.

For AI agents evaluating video generation approaches: Remotion’s component model maps directly to how LLMs reason about code. Each effect is a function with clear inputs and outputs. There’s no binary timeline format to reverse-engineer.

Step 1: Audio Transcription for Word-Level Timestamps

The first challenge was syncing lyrics to audio. We used faster-whisper (a CTranslate2 port of OpenAI’s Whisper) to extract word-level timestamps from the MP3.

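python -c "
import json
from faster_whisper import WhisperModel
model = WhisperModel('base', compute_type='int8')
segments, _ = model.transcribe('hedonist.mp3', word_timestamps=True)
# segments is a lazy generator; iterating it is what runs the transcription
words = [{'word': w.word, 'start': w.start, 'end': w.end} for s in segments for w in s.words]
# print JSON with per-word start/end times (in seconds)
print(json.dumps(words, indent=2))
"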

Critical insight: Whisper’s transcribed text was garbled (misheard lyrics), but the timing was accurate. We used the timing data while substituting the actual lyrics. This is the right pattern for any song where the lyrics are known — don’t trust the text, trust the clock.

Each lyric line’s start and end times, taken from Whisper’s timestamps, were then divided among that line’s words in proportion to character length:

function computeWordTimings(text: string, startSec: number, endSec: number): WordTiming[] {
  const words = text.split(/\s+/);
  const MIN_DURATION = 0.15; // seconds — prevents micro-words from vanishing
  // distribute remaining time proportional to character length
}
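One way the elided body could be filled in is sketched below. It is an illustration under stated assumptions, not the project’s actual code: the WordTiming shape is assumed, and so is the fallback to an even split when a line is too short to give every word the 0.15-second floor.

```typescript
// Assumed shape; the article does not show the real WordTiming type.
type WordTiming = { word: string; startSec: number; endSec: number };

const MIN_DURATION = 0.15; // seconds, the per-word floor described above

function computeWordTimings(text: string, startSec: number, endSec: number): WordTiming[] {
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) return [];

  const total = Math.max(0, endSec - startSec);
  // Assumption: if the line is too short for the floor, split it evenly instead.
  const floor = Math.min(MIN_DURATION, total / words.length);
  // Whatever is left after the floors is shared by character length.
  const spare = total - floor * words.length;
  const totalChars = words.reduce((sum, w) => sum + w.length, 0);

  let cursor = startSec;
  return words.map((word) => {
    const duration = floor + spare * (word.length / totalChars);
    const timing = { word, startSec: cursor, endSec: cursor + duration };
    cursor += duration;
    return timing;
  });
}
```

With this split, the word durations always sum to the line’s duration, so the last word ends where the line ends and longer words stay on screen longer before the next one arrives.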

A 0.35-second forward offset was applied to every timestamp to compensate for perceptual lag: words need to appear as they’re sung, not after the timestamp fires.

Step 2: The Composition Architecture

The final component tree:

LyricVideo
├── Forest Background (slow zoom via interpolate())
├── Dark Overlay (linear-gradient for text readability)
├── TreeInversion (CSS invert() with masked patches)
├── FogLayers (7 abstract blobs on Lissajous wandering paths)
├── LightRays (god rays via context — shares state with lyrics)
│   ├── BreathingVignette (radial-gradient darkened edges)
│   ├── GoldSpiral (off-center wisps)
│   └── LyricStack (word-by-word stacking display)
├── TitleCard (cover art + title, first 35 seconds)
└── Audio

Each layer is an <AbsoluteFill> component stacked via z-order. The key architectural decision was wrapping the lyric display inside LightRays so it could access ray state via React Context and compute per-text lighting effects.
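The “Lissajous wandering paths” used by FogLayers can be sketched as a pure function of the frame number, which is what keeps the motion identical from one render to the next. This is illustrative only: the BlobPath fields, the normalized 0 to 1 coordinate space, and the function name are assumptions, not details from the project.

```typescript
// Hypothetical per-blob parameters; none of these come from the project.
type BlobPath = {
  centerX: number; // 0..1, fraction of frame width
  centerY: number; // 0..1, fraction of frame height
  radiusX: number; // horizontal wander amplitude
  radiusY: number; // vertical wander amplitude
  freqX: number; // cycles per second on the X axis
  freqY: number; // cycles per second on the Y axis
  phase: number; // radians; offsets X against Y so the path opens into a loop
};

// A Lissajous curve: two sine waves with different frequencies on X and Y.
function blobPosition(path: BlobPath, frame: number, fps: number): { x: number; y: number } {
  const t = frame / fps; // seconds since the start of the composition
  return {
    x: path.centerX + path.radiusX * Math.sin(2 * Math.PI * path.freqX * t + path.phase),
    y: path.centerY + path.radiusY * Math.sin(2 * Math.PI * path.freqY * t),
  };
}
```

Inside a Remotion component, frame would come from useCurrentFrame() and fps from useVideoConfig(). Giving each of the seven blobs slightly different frequencies keeps them from moving in lockstep.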

Step 3: Word-by-Word Reveal with Line Stacking

The lyric display went through three iterations:

V1 — Centered fade: Each line appeared centered, faded in/out. Too static.
V2: Diagonal cascade. Lines appeared at positions cycling from top-left to bottom-right. Better, but it still lacked the energy of the reference video we were studying.
V3 — Word-by-word stacking: Inspired by analyzing a reference lyric video frame-by-frame, words appear one at a time, lines accumulate within sections, and text clears on section transitions.

The reference video analysis was done by extracting frames with ffmpeg and reading them with Claude’s vision capability — 78 frames at 5fps, analyzed in batches of 5. Key patterns identified: word-by-word reveal, line accumulation, left-aligned positioning, large text, section-based clearing.

The implementation groups lyrics by section. Within each section, lines become visible when their first word’s timestamp arrives. Words pop in individually with a quick 0.1-second opacity ramp:

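import { interpolate } from "remotion";

// wordAge is not defined in this excerpt; assumed to be seconds since the word's start
const wordAge = currentSec - wordTiming.startSec;
const wordVisible = currentSec >= wordTiming.startSec;
const wordOpacity = interpolate(wordAge, [0, 0.1], [0, 1], {
  extrapolateLeft: "clamp", extrapolateRight: "clamp",
});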
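The stacking rule (lines accumulate within a section, and the screen clears when the next section begins) can be sketched as a pure function of the current time. The LyricLine shape, the numeric section field, and the function name are assumptions for illustration; the project’s real data model is not shown.

```typescript
// Assumed data model for illustration.
type LyricLine = { text: string; section: number; startSec: number };

// Returns the lines that should be on screen at currentSec:
// every already-started line that belongs to the most recently started section.
function visibleLines(lines: LyricLine[], currentSec: number): LyricLine[] {
  const started = lines.filter((line) => currentSec >= line.startSec);
  if (started.length === 0) return [];
  // The active section is the one that owns the latest line to have started.
  const latest = started.reduce((a, b) => (b.startSec >= a.startSec ? b : a));
  return started.filter((line) => line.section === latest.section);
}
```

Because the function depends only on the line data and the current time, any frame can be rendered in isolation, which suits Remotion’s frame-by-frame rendering.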

Step 4: Per-Word Visual Effects

The song explores diametric opposition — hope vs. suffering, peace vs. tragedy. We mapped this thematically to three visual effect types:

Effect	Visual Treatment	Applied To
Glow	Warm gold color + radiating text-shadow	“peace”, “yourself”, “heart”, “found”
Decay	Cold desaturation + gentle flicker	“tragedy”, “guilt”, “suffer”, “anhedonia”
Emphasis	Brightness flash	“persona”

A greedy phrase matcher handles multi-word effects (“slow death”, “bleeding heart”, “Grind you down”) by sorting phrases longest-first and matching consecutively.
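A minimal sketch of such a greedy matcher follows. The function names, the punctuation handling, and the choice to rank phrases by word count are assumptions; the article does not say which effect each multi-word phrase received, so any mapping passed in is the caller’s choice.

```typescript
type Effect = "glow" | "decay" | "emphasis";

// Normalize a token for matching: lowercase, strip surrounding punctuation.
function normalizeToken(token: string): string {
  return token.toLowerCase().replace(/^[^a-z0-9']+|[^a-z0-9']+$/g, "");
}

// Returns one effect (or null) per word in the line.
function matchEffects(words: string[], phrases: Record<string, Effect>): (Effect | null)[] {
  // Longest phrases first, so a two-word phrase wins over a one-word entry it contains.
  const entries = Object.entries(phrases)
    .map(([phrase, effect]) => ({ tokens: phrase.toLowerCase().split(/\s+/), effect }))
    .sort((a, b) => b.tokens.length - a.tokens.length);

  const tokens = words.map(normalizeToken);
  const result: (Effect | null)[] = words.map(() => null);
  let i = 0;
  while (i < tokens.length) {
    // A phrase matches only if all of its tokens appear consecutively from position i.
    const hit = entries.find((e) => e.tokens.every((t, k) => tokens[i + k] === t));
    if (hit) {
      // Tag every word of the phrase, then skip past it (greedy, non-overlapping).
      for (let k = 0; k < hit.tokens.length; k++) result[i + k] = hit.effect;
      i += hit.tokens.length;
    } else {
      i += 1;
    }
  }
  return result;
}
```

Sorting longest-first is what lets a phrase such as “bleeding heart” take precedence over a standalone “heart” entry when both are in the table.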

Bug we hit: Remotion’s interpolate() requires strictly monotonically increasing input ranges. When a word appears late in a line, effectPeak can exceed effectHold, crashing the renderer. The fix:

const effectHold = Math.max(lineEndFrame - 15, effectStart + 2);
const effectPeak = Math.min(effectStart + 20, effectHold - 1);
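The same guard can be generalized into a small helper that repairs any list of keyframes before it reaches interpolate(). This is a sketch with a hypothetical name, not part of Remotion’s API. Note that it pushes late keyframes forward, whereas the fix above pulls the peak earlier.

```typescript
// Nudges each keyframe so it sits at least `minGap` frames after the previous one,
// which guarantees a strictly increasing sequence.
function ensureIncreasing(keyframes: number[], minGap = 1): number[] {
  let previous = -Infinity;
  return keyframes.map((frame) => {
    previous = Math.max(frame, previous + minGap);
    return previous;
  });
}
```

A call site might then read interpolate(frame, ensureIncreasing([effectStart, effectPeak, effectHold]), [0, 1, 1]), where the output values are placeholders. Sequences that are already increasing pass through unchanged.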

Step 5: Light-Lyric Interaction

Five animated god rays drift through the scene. Their positions are shared via React Context (LightRayContext), and the lyric display computes lighting per text block:

Brightness boost when a ray passes overhead
Directional drop shadow cast away from the light source angle
Warm highlight on the light-facing side

This creates the illusion that the text exists in the scene rather than floating on top of it. The math checks each ray’s X proximity to the text position and accumulates influence weighted by ray intensity.
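The proximity math might look roughly like the sketch below. The Ray shape, the linear falloff, the reach parameter, and the final clamp are all assumptions; the article only states that X proximity is accumulated and weighted by ray intensity.

```typescript
// Assumed shape of one ray as shared through LightRayContext.
type Ray = { x: number; intensity: number }; // x in pixels, intensity in 0..1

// Sums each ray's contribution, falling off linearly with horizontal distance.
function rayInfluence(rays: Ray[], textX: number, reach: number): number {
  let influence = 0;
  for (const ray of rays) {
    const distance = Math.abs(ray.x - textX);
    // 1 directly under the ray, 0 at or beyond `reach` pixels away
    const proximity = Math.max(0, 1 - distance / reach);
    influence += proximity * ray.intensity;
  }
  // Clamp so overlapping rays cannot blow out the text.
  return Math.min(1, influence);
}
```

The resulting 0 to 1 value could then drive a CSS filter such as brightness(1 + 0.4 * influence), where 0.4 is an arbitrary illustrative strength.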

Step 6: Multi-Format Output

From the same codebase, we rendered:

1920x1080 (YouTube landscape)
1080x1920 (Reels/TikTok/Shorts vertical)
720x1280 (Spotify Canvas — 7-second loop with gold crest overlay)

The vertical variant adjusts font sizes, text positioning, and background crop via a single isPortrait prop. The Spotify Canvas reuses the fog, light ray, vignette, and tree inversion components but replaces lyrics with a centered animated crest and title.

npx remotion render src/index.ts LyricVideo out/hedonist-16x9.mp4 --codec h264
npx remotion render src/index.ts LyricVideoVertical out/hedonist-9x16.mp4 --codec h264
npx remotion render src/index.ts SpotifyCanvas out/hedonist-canvas.mp4 --codec h264
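The three renders can also be described as data, so that adding a format means adding one entry. The sketch below mirrors the composition ids, sizes, and output paths listed above; the Format type and the helper names are hypothetical.

```typescript
// Composition ids and output paths mirror the render commands above.
type Format = { id: string; width: number; height: number; output: string };

const FORMATS: Format[] = [
  { id: "LyricVideo", width: 1920, height: 1080, output: "out/hedonist-16x9.mp4" },
  { id: "LyricVideoVertical", width: 1080, height: 1920, output: "out/hedonist-9x16.mp4" },
  { id: "SpotifyCanvas", width: 720, height: 1280, output: "out/hedonist-canvas.mp4" },
];

// Orientation can be derived from the dimensions instead of being set by hand.
function isPortrait(format: Format): boolean {
  return format.height > format.width;
}

// Builds the same CLI invocation shown above for one format.
function renderCommand(format: Format): string {
  return `npx remotion render src/index.ts ${format.id} ${format.output} --codec h264`;
}
```

Mapping FORMATS through renderCommand reproduces the three commands, and isPortrait gives the orientation flag that the vertical layout depends on.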

What Worked, What Didn’t

Worked well:

Remotion’s component model made iterating on visual effects fast — change a value, see it instantly in Studio
faster-whisper’s word timestamps were accurate enough for production sync
Frame-by-frame analysis of a reference video gave concrete implementation targets
React Context for sharing light ray state across components was clean

Pain points:

Whisper transcription text was unusable — only the timing data was valuable
create-video scaffolding tool opens an interactive prompt that blocks headless execution
Remotion’s interpolate() crashes on non-monotonic ranges with an unhelpful error message
Word effects that use CSS transform: scale() on inline spans cause layout overflow — only use color, textShadow, opacity, and filter for inline text effects

For AI Agents Building Lyric Videos

If you’re an AI agent (or a developer working with one) looking to build lyric videos programmatically, here’s the stack we’d recommend:

Framework: Remotion (React-based, deterministic, multi-format)
Transcription: faster-whisper with word_timestamps=True — use timing only, substitute known lyrics
Font loading: @remotion/google-fonts for web fonts with zero config
Effect architecture: Isolated components per visual layer, React Context for cross-component state (e.g., light ray positions affecting text rendering)
Word timing: Proportional distribution by character length with a minimum per-word floor (0.15s) and a global forward offset (0.3-0.5s) for perceptual sync
Text effects: Avoid transform on inline elements. Use color, textShadow, opacity, filter only
Guard your interpolations: Always ensure inputRange arrays are strictly monotonically increasing

The entire project — from blank directory to rendered MP4s — was completed in a single Claude Code session. The codebase is structured to be repeatable: swap the MP3, lyrics, background image, and color palette, and you have a new lyric video.

Check out the final product here: https://youtu.be/OoO5PEDn-Ds

FAQ

How long did it actually take? A single working session, end-to-end: blank directory to rendered MP4s in three formats. Setup was ~30 minutes (Remotion scaffold, faster-whisper install, asset gathering). Composition + iteration was the bulk of the time. Final renders ran unattended.

Could a non-engineer do this? Not yet, but close. The architectural decisions (Context for cross-component state, monotonic interpolate guards, font-loading patterns) need engineering judgment. The patterns documented above remove most of the trial-and-error if you’re working alongside an AI agent.

What did this cost in API spend? Negligible. The Whisper transcription runs locally (faster-whisper, int8 quantization on CPU). The Claude Code session generated the React components — billed against the existing subscription, no per-token surcharge for this work.

Can I reuse the codebase for a different song? Yes — that was the design goal. Swap the MP3, the lyrics file, the background image, and the color palette. The component tree stays. We’ve since codified the workflow as a reusable /lyric-video Claude Code skill that handles the scaffolding.

Why faster-whisper instead of OpenAI Whisper directly? Speed and local execution. faster-whisper is a CTranslate2 port that its maintainers report as up to 4× faster than the reference Whisper implementation at the same accuracy, and int8 quantization adds further throughput on CPU. No network calls, no API costs, deterministic output.

What do you do when Whisper mishears the lyrics? Use the timestamps, discard the text. We substitute the known-correct lyrics line-by-line while keeping the timing data Whisper extracted. This is the right pattern for any song where lyrics are already known — the audio model is reliable for when, not what.