TL;DR
A lyric video displays a song’s lyrics on screen in sync with the audio; this one is generated programmatically and synced word by word. This article documents shipping a production-quality lyric video for Monomythism’s “Hedonist” in a single AI-assisted session, using Remotion (React-based programmatic video) plus faster-whisper for word-level timestamps. The toolchain renders 1920x1080 (YouTube), 1080x1920 (Reels/TikTok/Shorts), and 720x1280 (Spotify Canvas) from one codebase. The architectural decisions below are written to be reusable for other songs.
The Problem: Ship a Lyric Video in One Day
We needed to make a lyric video in a day. There was no video editor, no After Effects template, no timeline. Just an MP3, lyrics, cover art, and a forest photograph.
The question: Can an AI agent build a production-quality lyric video from raw assets in a single working session?
The answer turned out to be yes — and the toolchain we landed on is fully repeatable.
Why Remotion (React-Based Video Framework)
We evaluated the options quickly and landed on Remotion — a React framework that renders video programmatically. The decision came down to three factors:
- Composability: Every visual element is a React component. Fog layers, light rays, text effects — each is an isolated, tweakable unit.
- Deterministic rendering: Frame-accurate output. No timeline drift, no codec surprises. The same code renders the same video every time.
- Multi-format from one codebase: A single LyricVideo component with a layout prop produces both 16:9 (YouTube) and 9:16 (Reels/TikTok/Shorts) variants. We later added a 720x1280 Spotify Canvas composition with zero duplication of the effect layers.
For AI agents evaluating video generation approaches: Remotion’s component model maps directly to how LLMs reason about code. Each effect is a function with clear inputs and outputs. There’s no binary timeline format to reverse-engineer.
Step 1: Audio Transcription for Word-Level Timestamps
The first challenge was syncing lyrics to audio. We used faster-whisper (a CTranslate2 port of OpenAI’s Whisper) to extract word-level timestamps from the MP3.
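python -c "
import json
from faster_whisper import WhisperModel
model = WhisperModel('base', compute_type='int8')
segments, _ = model.transcribe('hedonist.mp3', word_timestamps=True)
# segments is a lazy generator; flatten it into per-word start/end times
words = [{'word': w.word, 'start': w.start, 'end': w.end} for s in segments for w in s.words]
print(json.dumps(words, indent=2))
"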
Critical insight: Whisper’s transcribed text was garbled (misheard lyrics), but the timing was accurate. We used the timing data while substituting the actual lyrics. This is the right pattern for any song where the lyrics are known — don’t trust the text, trust the clock.
The word-level timestamps were then distributed proportionally across each lyric line:
function computeWordTimings(text: string, startSec: number, endSec: number): WordTiming[] {
const words = text.split(/\s+/);
const MIN_DURATION = 0.15; // seconds — prevents micro-words from vanishing
// distribute remaining time proportional to character length
}
A 0.35-second forward offset was applied to all timing to compensate for perceptual lag — words need to appear as they’re sung, not after the timestamp fires.
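The proportional distribution described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the project’s actual code: the fields of WordTiming, the signed shiftSec parameter, and the fallback for lines too short to honor the floor are all assumptions.

```typescript
// Assumed shape; the article names WordTiming but does not show its fields.
interface WordTiming {
  word: string;
  startSec: number;
  endSec: number;
}

const MIN_DURATION = 0.15; // seconds; per-word floor from the article

// Split a lyric line into words and divide [startSec, endSec] among them.
// Every word receives the floor first; the remaining time is shared in
// proportion to character length. shiftSec (hypothetical) is a signed
// global offset added to every timestamp.
function computeWordTimings(
  text: string,
  startSec: number,
  endSec: number,
  shiftSec = 0,
): WordTiming[] {
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) return [];

  const lineDuration = Math.max(0, endSec - startSec);
  // Assumption: if the line is too short for the floor, fall back to an even split.
  const floor = Math.min(MIN_DURATION, lineDuration / words.length);
  const spare = lineDuration - floor * words.length;
  const totalChars = words.reduce((sum, w) => sum + w.length, 0);

  let cursor = startSec + shiftSec;
  return words.map((word) => {
    const duration = floor + spare * (word.length / totalChars);
    const timing = { word, startSec: cursor, endSec: cursor + duration };
    cursor += duration;
    return timing;
  });
}
```

Because the word durations always sum to the line duration, adjacent words never overlap and the last word ends exactly where the line does (plus the shift).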
Step 2: The Composition Architecture
The final component tree:
LyricVideo
├── Forest Background (slow zoom via interpolate())
├── Dark Overlay (linear-gradient for text readability)
├── TreeInversion (CSS invert() with masked patches)
├── FogLayers (7 abstract blobs on Lissajous wandering paths)
├── LightRays (god rays via context — shares state with lyrics)
│ ├── BreathingVignette (radial-gradient darkened edges)
│ ├── GoldSpiral (off-center wisps)
│ └── LyricStack (word-by-word stacking display)
├── TitleCard (cover art + title, first 35 seconds)
└── Audio
Each layer is an <AbsoluteFill> component stacked via z-order. The key architectural decision was wrapping the lyric display inside LightRays so it could access ray state via React Context and compute per-text lighting effects.
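To illustrate how a layer such as FogLayers can stay deterministic, here is a hedged sketch of a Lissajous path computed purely from the frame number. The parameter names, amplitudes, and frequencies are invented for illustration; the article does not show the real values or the real function.

```typescript
// Hypothetical parameters for one fog blob's wandering path.
interface LissajousParams {
  amplitudeX: number; // horizontal reach in pixels
  amplitudeY: number; // vertical reach in pixels
  freqX: number; // horizontal cycles per second
  freqY: number; // vertical cycles per second
  phase: number; // radians, offsets the horizontal oscillation
}

// Position offset for a given frame. It depends only on its inputs,
// so rendering the same frame twice yields the same position.
function lissajousOffset(
  frame: number,
  fps: number,
  p: LissajousParams,
): { x: number; y: number } {
  const t = frame / fps;
  return {
    x: p.amplitudeX * Math.sin(2 * Math.PI * p.freqX * t + p.phase),
    y: p.amplitudeY * Math.sin(2 * Math.PI * p.freqY * t),
  };
}

// Seven blobs with slightly different frequencies and phases (made-up values),
// so that their paths do not move in lockstep.
const blobs: LissajousParams[] = Array.from({ length: 7 }, (_, i) => ({
  amplitudeX: 120 + 20 * i,
  amplitudeY: 60 + 10 * i,
  freqX: 0.03 + 0.005 * i,
  freqY: 0.02 + 0.007 * i,
  phase: (i * Math.PI) / 7,
}));
```

Using different X and Y frequencies is what makes the path wander instead of tracing a simple ellipse.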
Step 3: Word-by-Word Reveal with Line Stacking
The lyric display went through three iterations:
- V1 — Centered fade: Each line appeared centered, faded in/out. Too static.
- V2 — Diagonal cascade: Lines appeared at positions cycling top-left to bottom-right. Better, but no reference-video energy.
- V3 — Word-by-word stacking: Inspired by analyzing a reference lyric video frame-by-frame, words appear one at a time, lines accumulate within sections, and text clears on section transitions.
The reference video analysis was done by extracting frames with ffmpeg and reading them with Claude’s vision capability — 78 frames at 5fps, analyzed in batches of 5. Key patterns identified: word-by-word reveal, line accumulation, left-aligned positioning, large text, section-based clearing.
The implementation groups lyrics by section. Within each section, lines become visible when their first word’s timestamp arrives. Words pop in individually with a quick 0.1-second opacity ramp:
const wordAge = currentSec - wordTiming.startSec; // seconds since the word's timestamp
const wordVisible = currentSec >= wordTiming.startSec;
const wordOpacity = interpolate(wordAge, [0, 0.1], [0, 1], {
  extrapolateLeft: "clamp",
  extrapolateRight: "clamp",
});
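The section and line rules above can be sketched without Remotion as a pure function of the current time. The data shapes and function names here are hypothetical, rampOpacity is a plain stand-in for the clamped interpolate() call, and the sketch assumes sections are sorted by time and contain no empty lines.

```typescript
interface TimedWord {
  text: string;
  startSec: number;
}
interface LyricLine {
  words: TimedWord[];
}
interface Section {
  lines: LyricLine[];
}

// Linear 0-to-1 ramp over rampSec, clamped at both ends.
function rampOpacity(ageSec: number, rampSec = 0.1): number {
  return Math.min(1, Math.max(0, ageSec / rampSec));
}

// A section starts when the first word of its first line starts.
function sectionStart(section: Section): number {
  return section.lines[0].words[0].startSec;
}

// Returns the lines to draw at currentSec, each as a list of revealed words.
// Only the most recently started section is shown, which is what clears
// the stack on a section transition.
function visibleWords(
  sections: Section[],
  currentSec: number,
): { text: string; opacity: number }[][] {
  let active: Section | undefined;
  for (const section of sections) {
    if (sectionStart(section) <= currentSec) active = section;
  }
  if (!active) return [];

  return active.lines
    .filter((line) => line.words[0].startSec <= currentSec)
    .map((line) =>
      line.words
        .filter((w) => w.startSec <= currentSec)
        .map((w) => ({
          text: w.text,
          opacity: rampOpacity(currentSec - w.startSec),
        })),
    );
}
```

In a Remotion component, currentSec would typically be the current frame divided by the composition’s fps.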
Step 4: Per-Word Visual Effects
The song explores diametric opposition — hope vs. suffering, peace vs. tragedy. We mapped this thematically to three visual effect types:
| Effect | Visual Treatment | Applied To |
|---|---|---|
| Glow | Warm gold color + radiating text-shadow | “peace”, “yourself”, “heart”, “found” |
| Decay | Cold desaturation + gentle flicker | “tragedy”, “guilt”, “suffer”, “anhedonia” |
| Emphasis | Brightness flash | “persona” |
A greedy phrase matcher handles multi-word effects (“slow death”, “bleeding heart”, “Grind you down”) by sorting phrases longest-first and matching consecutively.
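A greedy, longest-first matcher of this kind might look like the sketch below. The rule shape, the normalize helper, and the function name are assumptions. The article does not say which effect each multi-word phrase receives, so any such pairing used with this function is hypothetical.

```typescript
type Effect = "glow" | "decay" | "emphasis";

// Hypothetical rule shape: a phrase of one or more words and its effect.
interface PhraseRule {
  phrase: string;
  effect: Effect;
}

// Lowercase and strip punctuation so "Heart," matches "heart".
const normalize = (w: string): string =>
  w.toLowerCase().replace(/[^a-z0-9']/g, "");

// Returns one entry per input word: the matched effect, or null.
function matchEffects(words: string[], rules: PhraseRule[]): (Effect | null)[] {
  // Longest phrases first, so "bleeding heart" wins over "heart".
  const sorted = rules
    .map((r) => ({
      tokens: r.phrase.trim().split(/\s+/).map(normalize),
      effect: r.effect,
    }))
    .sort((a, b) => b.tokens.length - a.tokens.length);

  const normalized = words.map(normalize);
  const result = new Array<Effect | null>(words.length).fill(null);

  let i = 0;
  while (i < normalized.length) {
    // Greedy: take the first (longest) rule whose tokens match consecutively here.
    const hit = sorted.find((r) =>
      r.tokens.every((tok, k) => normalized[i + k] === tok),
    );
    if (hit) {
      for (let k = 0; k < hit.tokens.length; k++) result[i + k] = hit.effect;
      i += hit.tokens.length; // consume the whole phrase
    } else {
      i++;
    }
  }
  return result;
}
```

Skipping past the consumed phrase is what prevents a shorter rule from re-tagging a word that already belongs to a longer match.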
Bug we hit: Remotion’s interpolate() requires strictly monotonically increasing input ranges. When a word appears late in a line, effectPeak can exceed effectHold, crashing the renderer. The fix:
const effectHold = Math.max(lineEndFrame - 15, effectStart + 2);
const effectPeak = Math.min(effectStart + 20, effectHold - 1);
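To see why the two-line fix is sufficient, the sketch below compares the clamped keyframes with an unclamped version and checks both with a small monotonicity helper. The unclamped formulas are inferred from the fix, and the assumption that the input range is [effectStart, effectPeak, effectHold] is mine; the helper names are hypothetical.

```typescript
// True when every value is strictly greater than the one before it.
function isStrictlyIncreasing(values: number[]): boolean {
  return values.every((v, i) => i === 0 || v > values[i - 1]);
}

// Unclamped version (inferred): breaks when a word starts close to the line end.
function naiveKeyframes(effectStart: number, lineEndFrame: number): number[] {
  const effectHold = lineEndFrame - 15;
  const effectPeak = effectStart + 20;
  return [effectStart, effectPeak, effectHold];
}

// Clamped version from the article: hold is at least start + 2 and
// peak is at most hold - 1, so start < peak < hold always holds.
function guardedKeyframes(effectStart: number, lineEndFrame: number): number[] {
  const effectHold = Math.max(lineEndFrame - 15, effectStart + 2);
  const effectPeak = Math.min(effectStart + 20, effectHold - 1);
  return [effectStart, effectPeak, effectHold];
}

// A word that starts 10 frames before its line ends:
const lateNaive = naiveKeyframes(100, 110); // [100, 120, 95]
const lateGuarded = guardedKeyframes(100, 110); // [100, 101, 102]
```

A check like isStrictlyIncreasing can also run before every interpolate() call during development, which turns an opaque render crash into an error that names the offending range.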
Step 5: Light-Lyric Interaction
Five animated god rays drift through the scene. Their positions are shared via React Context (LightRayContext), and the lyric display computes lighting per text block:
- Brightness boost when a ray passes overhead
- Directional drop shadow cast away from the light source angle
- Warm highlight on the light-facing side
This creates the illusion that the text exists in the scene rather than floating on top of it. The math checks each ray’s X proximity to the text position and accumulates influence weighted by ray intensity.
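The proximity math can be sketched as below. This is an illustrative simplification, not the project’s code: the Ray shape, the linear falloff, the reach of 300 pixels, the brightness gain, and the horizontal-only shadow are all assumptions.

```typescript
// Hypothetical snapshot of one ray, as it might be shared via context.
interface Ray {
  x: number; // horizontal position in pixels
  intensity: number; // 0..1
}

interface TextLighting {
  brightness: number; // multiplier, 1 = unlit
  shadowOffsetX: number; // pixels; positive pushes the shadow to the right
}

function textLighting(
  textX: number,
  rays: Ray[],
  reach = 300, // assumed distance at which a ray stops affecting the text
  maxShadowPx = 8, // assumed maximum shadow displacement
): TextLighting {
  let influence = 0;
  let push = 0;
  for (const ray of rays) {
    const distance = Math.abs(ray.x - textX);
    const proximity = Math.max(0, 1 - distance / reach); // linear falloff
    const weight = proximity * ray.intensity;
    influence += weight;
    // The shadow falls on the side of the text facing away from the ray.
    push += Math.sign(textX - ray.x) * weight;
  }
  const clampedInfluence = Math.min(1, influence);
  const clampedPush = Math.max(-1, Math.min(1, push));
  return {
    brightness: 1 + 0.5 * clampedInfluence,
    shadowOffsetX: maxShadowPx * clampedPush,
  };
}
```

The resulting values could feed a CSS filter: brightness() and a textShadow offset, both of which are among the inline-safe properties the article recommends.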
Step 6: Multi-Format Output
From the same codebase, we rendered:
- 1920x1080 (YouTube landscape)
- 1080x1920 (Reels/TikTok/Shorts vertical)
- 720x1280 (Spotify Canvas — 7-second loop with gold crest overlay)
The vertical variant adjusts font sizes, text positioning, and background crop via a single isPortrait prop. The Spotify Canvas reuses the fog, light ray, vignette, and tree inversion components but replaces lyrics with a centered animated crest and title.
npx remotion render src/index.ts LyricVideo out/hedonist-16x9.mp4 --codec h264
npx remotion render src/index.ts LyricVideoVertical out/hedonist-9x16.mp4 --codec h264
npx remotion render src/index.ts SpotifyCanvas out/hedonist-canvas.mp4 --codec h264
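One way to keep the three outputs in sync is a small format table from which layout settings are derived. The composition ids and dimensions below come from the article; the layoutFor helper, the base font size, and the scaling rule are hypothetical.

```typescript
interface OutputFormat {
  id: string; // composition id, as used in the render commands
  width: number;
  height: number;
}

const FORMATS: OutputFormat[] = [
  { id: "LyricVideo", width: 1920, height: 1080 },
  { id: "LyricVideoVertical", width: 1080, height: 1920 },
  { id: "SpotifyCanvas", width: 720, height: 1280 },
];

interface LayoutSettings {
  isPortrait: boolean;
  fontSizePx: number;
}

const BASE_FONT_PX = 96; // hypothetical size for a 1080-pixel short edge

// Derive orientation from the dimensions and scale text with the shorter
// edge, so the 720x1280 canvas gets proportionally smaller type.
function layoutFor(format: OutputFormat): LayoutSettings {
  const isPortrait = format.height > format.width;
  const scale = Math.min(format.width, format.height) / 1080;
  return { isPortrait, fontSizePx: Math.round(BASE_FONT_PX * scale) };
}
```

Deriving isPortrait from the dimensions removes one way for a composition’s size and its layout flag to disagree.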
What Worked, What Didn’t
Worked well:
- Remotion’s component model made iterating on visual effects fast — change a value, see it instantly in Studio
- faster-whisper’s word timestamps were accurate enough for production sync
- Frame-by-frame analysis of a reference video gave concrete implementation targets
- React Context for sharing light ray state across components was clean
Pain points:
- Whisper transcription text was unusable — only the timing data was valuable
- The create-video scaffolding tool opens an interactive prompt that blocks headless execution
- Remotion’s interpolate() crashes on non-monotonic ranges with an unhelpful error message
- Word effects that use CSS transform: scale() on inline spans cause layout overflow; only use color, textShadow, opacity, and filter for inline text effects
For AI Agents Building Lyric Videos
If you’re an AI agent (or a developer working with one) looking to build lyric videos programmatically, here’s the stack we’d recommend:
- Framework: Remotion (React-based, deterministic, multi-format)
- Transcription: faster-whisper with word_timestamps=True; use timing only, substitute known lyrics
- Font loading: @remotion/google-fonts for web fonts with zero config
- Effect architecture: Isolated components per visual layer, React Context for cross-component state (e.g., light ray positions affecting text rendering)
- Word timing: Proportional distribution by character length with a minimum per-word floor (0.15s) and a global forward offset (0.3-0.5s) for perceptual sync
- Text effects: Avoid transform on inline elements. Use color, textShadow, opacity, filter only
- Guard your interpolations: Always ensure inputRange arrays are strictly monotonically increasing
The entire project — from blank directory to rendered MP4s — was completed in a single Claude Code session. The codebase is structured to be repeatable: swap the MP3, lyrics, background image, and color palette, and you have a new lyric video.
Check out the final product here: https://youtu.be/OoO5PEDn-Ds
FAQ
How long did it actually take? A single working session, end-to-end: blank directory to rendered MP4s in three formats. Setup was ~30 minutes (Remotion scaffold, faster-whisper install, asset gathering). Composition + iteration was the bulk of the time. Final renders ran unattended.
Could a non-engineer do this? Not yet, but close. The architectural decisions (Context for cross-component state, monotonic interpolate guards, font-loading patterns) need engineering judgment. The patterns documented above remove most of the trial-and-error if you’re working alongside an AI agent.
What did this cost in API spend? Negligible. The Whisper transcription runs locally (faster-whisper, int8 quantization on CPU). The Claude Code session generated the React components — billed against the existing subscription, no per-token surcharge for this work.
Can I reuse the codebase for a different song? Yes, that was the design goal. Swap the MP3, the lyrics file, the background image, and the color palette. The component tree stays. We’ve since codified the workflow as a reusable /lyric-video Claude Code skill that handles the scaffolding.
Why faster-whisper instead of OpenAI Whisper directly? Speed and local execution. faster-whisper is a CTranslate2 port that runs ~4× faster on CPU than the reference Whisper implementation, with int8 quantization for additional throughput. No network calls, no API costs, deterministic output.
What do you do when Whisper mishears the lyrics? Use the timestamps, discard the text. We substitute the known-correct lyrics line-by-line while keeping the timing data Whisper extracted. This is the right pattern for any song where lyrics are already known — the audio model is reliable for when, not what.
See also
- Eight Grep Patterns That Catch Bugs in AI-Assisted Code — the audit catalog used to ship AI-assisted projects without context-handoff bugs (companion piece on shipping rigor)
- Why “NEVER do X” Fails With LLMs — the architectural principle behind structuring AI-assisted code for predictability
- The Number — another single-session-to-production project, this one a personal budget app
- Hedonist on YouTube — the final rendered video
This post was authored by Claude (Opus 4.6) based on the development session with Watson Mulkey at FOIL Engineering. The “Hedonist” lyric video was built for Monomythism’s March 2026 release. Updated 2026-04-28 with FAQ + cross-links.