PortraitCaptions

long-form source → ranked moments → captioned verticals → Drive, link-stable.

Overview

End-to-end: take a long-form recording (Twitch VOD, podcast session, livestream) and ship five ready-to-post 9:16 clips into a Google shared drive with caption quality that survives iteration. Every file in Drive has a stable webViewLink that never changes, no matter how many times you re-transcribe or re-render.

The pipeline is two project skills working in sequence: portrait-foreignfilm-clips (first pass, end-to-end) and caption-quality-boost (re-transcribe + replace in place). Both live in .claude/skills/.

Pipeline

┌───────────────────────┐
│ 1 · SOURCE            │   long-form recording acquired
└───────────┬───────────┘   yt-dlp / local screen rec
            ▼
┌───────────────────────┐
│ 2 · SELECT            │   rank moments, produce {start, end, title}
└───────────┬───────────┘   best-clips / TwelveLabs / manual list
            ▼
┌───────────────────────┐
│ 3 · EXTRACT           │   cut landscape 1662×1080 clips + face PiP burned in
└───────────┬───────────┘   ffmpeg -ss -to
            ▼
┌───────────────────────┐
│ 4 · TRANSCRIBE        │   whisper word-timestamps per clip
└───────────┬───────────┘   openai-whisper, small → medium
            ▼
┌───────────────────────┐
│ 5 · BURN-IN v1        │   yellow all-caps captions on landscape master
└───────────┬───────────┘   existing render → *-captioned/
            ▼
┌───────────────────────┐
│ 6 · PORTRAIT          │   face top · screen below · black cover bar
└───────────┬───────────┘   moviepy + PIL, 1080×1920
            ▼
┌───────────────────────┐
│ 7 · CAPTIONS (FF)     │   foreign-film yellow italic serif, pre-rasterized PNGs
└───────────┬───────────┘   Georgia Bold Italic, #F2D21B, 4px stroke
            ▼
┌───────────────────────┐
│ 8 · DRIVE             │   upload or update-in-place to Clips shared drive
└───────────────────────┘   gws drive files, {session}-portrait-ff

This guide is written in the order of the pipeline. If you already have landscape *-captioned clips on disk, skip to #portrait. If you're starting from a raw recording, begin at #source.

stage 1

Source — acquire the long-form

Two common sources: a Twitch VOD / YouTube upload, or a locally-recorded podcast session. Both end in the same place: one large .mp4 on disk.

# youtube / twitch
yt-dlp -f "bv*[height<=1080]+ba/b[height<=1080]" \
       -o "source/%(title)s.%(ext)s" \
       "https://youtu.be/<id>"

# or just drop a local recording into source/
cp ~/Movies/Session-2026-04-16.mp4 source/

iCloud trap. Don't stage sources on iCloud Drive: the FUSE layer makes Chromium and Remotion time out. Stage to ~/local/source/ or /tmp. Moviepy itself is fine with iCloud.
stage 2

Select — pick the moments

Three ways, pick one.

a · transcript + LLM (fastest)

Transcribe the whole long-form once, then ask Claude to rank the top N moments against a rubric: hook strength, standalone coherence, quotability, visible screen activity. Output: a JSON list of {start, end, slug, title}.
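A sanity-check pass over that JSON catches most LLM slips before anything is cut. A minimal sketch (the 20–90 s shorts window and the helper names are assumptions, not project conventions):

```python
import json
import re

def parse_ts(ts):
    """'HH:MM:SS' or 'MM:SS' -> seconds."""
    sec = 0.0
    for part in ts.split(":"):
        sec = sec * 60 + float(part)
    return sec

def validate_moments(raw, min_len=20, max_len=90):
    """Reject malformed slugs and clips outside a plausible shorts length."""
    moments = json.loads(raw)
    for m in moments:
        assert re.fullmatch(r"[a-z0-9-]+", m["slug"]), m["slug"]
        dur = parse_ts(m["end"]) - parse_ts(m["start"])
        assert min_len <= dur <= max_len, f"{m['slug']}: {dur:.0f}s"
    return moments

raw = json.dumps([{"slug": "experiment-until-you-beat-the-record",
                   "start": "00:03:14", "end": "00:04:19",
                   "title": "Experiment until you beat the record"}])
clips = validate_moments(raw)
# the single clip spans 65 s, inside the 20-90 s window
```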

b · best-clips skill

Existing skill at .claude/skills/best-clips/. Scores long-form windows on visible coding activity + transcript energy. Good for stream recordings where the right-hand screen region shows movement.

c · manual list

For a curated show, write the timestamps by hand into a CSV.

# clip-list.csv
slug,start,end,title
experiment-until-you-beat-the-record,00:03:14,00:04:19,"Experiment until you beat the record"
i-wasnt-crazy-this-works,00:07:02,00:07:50,"I wasn't crazy — this works"
mcp-reliability-is-a-gamble,00:12:45,00:13:17,"MCP reliability is a gamble"
put-your-expertise-in-the-skills,00:21:30,00:22:19,"Put your expertise in the skills"
same-team-more-clients,00:33:12,00:34:00,"Same team, more clients"
stage 3

Extract — cut the landscape masters

Stream-copy for speed where cuts land on keyframes; re-encode with -c:v libx264 for frame-accurate cuts.

mkdir -p landscape-masters
while IFS=, read -r slug start end title; do
  [[ "$slug" == "slug" ]] && continue
  ffmpeg -y -ss "$start" -to "$end" -i "source/session.mp4" \
         -c:v libx264 -preset medium -crf 20 \
         -c:a aac -b:a 128k \
         "landscape-masters/${slug}.mp4"
done < clip-list.csv

Output: five 1662×1080 clips. Face PiP is already baked into the long-form recording (mac PiP camera dock), so every extracted clip carries it at roughly (1370, 790) with size 260×260.

stage 4

Transcribe — word-level timing

openai-whisper on CPU. small is ~real-time; medium is ~3× slower but meaningfully better on proper nouns and jargon — start with small on the first pass, upgrade in iteration.

python3 .claude/skills/portrait-foreignfilm-clips/scripts/transcribe.py \
        landscape-masters/

# writes landscape-masters/_transcripts/{slug}.json — whisper raw, word_timestamps=true
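Downstream stages want a flat word list rather than whisper's nested segments. A small flattening sketch (field names match openai-whisper's word_timestamps=True schema; the sample transcript is illustrative, and SimpleNamespace gives the attribute access the stage-7 grouping code expects):

```python
from types import SimpleNamespace

def load_words(transcript):
    """Flatten whisper's per-segment word lists into one ordered list."""
    return [SimpleNamespace(word=w["word"].strip(), start=w["start"], end=w["end"])
            for seg in transcript["segments"]
            for w in seg.get("words", [])]

transcript = {"segments": [{"words": [
    {"word": " I",      "start": 0.0, "end": 0.2},
    {"word": " wasn't", "start": 0.2, "end": 0.6},
    {"word": " crazy",  "start": 0.6, "end": 1.1},
]}]}
words = load_words(transcript)
# words[1].word == "wasn't"; w.start / w.end line up with the cue-chunking rules
```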
stage 5

Burn-in v1 — the landscape master caption

Optional but conventional in this project. Many upstream clips already ship with a yellow all-caps caption burned into the landscape master. If yours do, call that folder *-captioned/ and continue. If not, you can skip straight to the portrait stage — the foreignfilm caption layer in stage 7 stands on its own.

The cover bar in stage 6 is sized to fully obscure any existing burned-in captions at y ≈ 860–890 of the 1080-tall source. If your source has no prior captions, you can drop or shrink the bar.

stage 6

Portrait — face top, screen below

The layout transform. Two independent crops from the same 1662×1080 landscape master, vertically stacked into a 1080×1920 canvas.

          1662 ──────────────────────►                    ◄─── 1080 ───►
┌─────────────────────────────┐                           ┌────────────┐ ─┐
│                             │                           │            │  │
│       (screen region)       │                           │    face    │  │ 1080
│                             │  crop 260² @ (1370, 790), │            │  │
│                   ┌─ PiP ──┐│  upscale ────────────────►│            │  │
│                   │  260²  ││                           ├────────────┤ ─┘
└───────────────────┴────────┴┘                           │   screen   │ ─┐
        │                                                 │  1080×840  │  │ 840
        └─ crop 1340×1080, scale to fit width,            │            │  │
           center-crop ──────────────────────────────────►├────────────┤ ─┘
                                                          │████████████│ ─┐
                                                          │ cover bar  │  │ 220 (1700–1920)
                                                          │ solid blk  │ ─┘
                                                          └────────────┘
layer         size        y range     source
face          1080×1080   0–1080      crop 260² @ (1370, 790), upscaled 4×
screen        1080×840    1080–1920   crop 1340×1080, scaled to fit width, center-cropped
cover bar     1080×220    1700–1920   solid black, opacity 1.0 — hides any v1 captions
caption PNG   ≤1080×180   ~1780       pre-rasterized per cue (stage 7)

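The layer numbers above can be cross-checked with a few lines of arithmetic. A sketch using this guide's session defaults (function and constant names are illustrative):

```python
SRC_W, SRC_H = 1662, 1080           # landscape master
CANVAS_W, CANVAS_H = 1080, 1920     # portrait canvas
PIP_X, PIP_Y, PIP = 1370, 790, 260  # session-specific PiP bounds

def layout():
    """(x, y, w, h) of each layer on the canvas, plus the screen crop offset."""
    face = (0, 0, CANVAS_W, CANVAS_W)     # 260² PiP upscaled 1080/260 ≈ 4.15×
    scale = CANVAS_W / 1340               # screen crop scaled to fit width
    scaled_h = round(SRC_H * scale)       # ≈ 870 px tall after scaling
    trim = (scaled_h - 840) // 2          # center-crop down to 840
    screen = (0, 1080, CANVAS_W, 840)
    bar = (0, 1700, CANVAS_W, 220)        # composited over the screen layer
    return face, screen, bar, trim

face, screen, bar, trim = layout()
# face ends at y=1080 where screen starts; screen and bar both end at y=1920; trim == 15
```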
face box is session-specific. Before batching, sample a midpoint frame and eyeball the PiP bounds. Podcast rigs drift the PiP position by ±40px between sessions.
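One quick way to do that eyeballing: render a single midpoint frame with the assumed PiP bounds outlined, then adjust. A sketch that only builds the ffmpeg command (the clip path and pip-check.png name are illustrative):

```python
def pip_check_cmd(src, t, x=1370, y=790, size=260):
    """ffmpeg command: grab one frame at t seconds with the assumed PiP box drawn in red."""
    vf = f"drawbox=x={x}:y={y}:w={size}:h={size}:color=red:t=4"
    return ["ffmpeg", "-y", "-ss", str(t), "-i", src,
            "-frames:v", "1", "-vf", vf, "pip-check.png"]

cmd = pip_check_cmd("landscape-masters/same-team-more-clients.mp4", 30)
# run with subprocess.run(cmd, check=True), open pip-check.png, nudge x/y until the box hugs the PiP
```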
stage 7

Captions — the foreign-film look

Homebrew's ffmpeg 8 ships without libass, the subtitles filter, or even drawtext. So SRT/ASS burn-in is off the table. Instead: pre-rasterize each cue to a transparent PNG with PIL, then composite with moviepy.

# group whisper words into screen-cue chunks
groups, cur = [], []
for w in words:
    cur.append(w)
    # flush at 5 words, 2.5 s, or 34 chars, whichever trips first
    if len(cur) >= 5 or cur[-1].end - cur[0].start >= 2.5 \
       or len(" ".join(x.word for x in cur)) >= 34:
        groups.append(cur); cur = []
if cur:                 # keep the trailing partial group
    groups.append(cur)

Each group becomes a PNG drawn with Georgia Bold Italic 60pt, fill #F2D21B, 4px black stroke, centered, wrapped at ~22 chars. Positioned y = 1920 - img_h - 30.

field    value                          why
font     Georgia Bold Italic            foreign-film default; reads warm and serious
size     60pt                           legible at thumb-scroll scale
colour   #F2D21B                        warm yellow, pops on dark screens
stroke   4px black                      survives bright backgrounds without a box
chunk    ≤5 words / ≤2.5s / ≤34 chars   TikTok-pace, readable before it moves
stage 8

Drive — upload & link-stable replace

gws drive files update --upload replaces the media content of an existing file. The fileId and every webViewLink you've already sent keep working. Never create during iteration.

# list the destination folder, build a name → id inventory
gws drive files list \
  --params '{"q":"\"<folder-id>\" in parents and trashed=false",
             "driveId":"0AI4JyAzsoqRJUk9PVA",
             "corpora":"drive",
             "includeItemsFromAllDrives":true,
             "supportsAllDrives":true,
             "pageSize":200,
             "fields":"files(id,name)"}'

# for each local mp4 — replace if match, create if new
cd "$OUT_DIR"
gws drive files update \
  --params "{\"fileId\":\"$ID\",\"supportsAllDrives\":true}" \
  --upload "$f" --upload-content-type video/mp4

gws path scoping. The CLI refuses --upload paths outside the current working directory. Always cd into the output folder and pass a basename.
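The loop above needs an update-or-create decision per file. A minimal sketch of that matching logic (the inventory shape follows the fields=files(id,name) response; the ID and names are illustrative, and the folder-id placeholder mirrors the listing command):

```python
import json

def plan_uploads(list_json, local_names, folder_id="<folder-id>"):
    """Decide per file: update in place if the name already exists on Drive, else create."""
    inventory = {f["name"]: f["id"] for f in json.loads(list_json)["files"]}
    for name in local_names:
        if name in inventory:
            yield "update", name, inventory[name]   # keeps fileId and webViewLink stable
        else:
            yield "create", name, folder_id         # first upload only

drive = '{"files": [{"id": "1AbC", "name": "same-team-more-clients-ff.mp4"}]}'
plan = list(plan_uploads(drive, ["same-team-more-clients-ff.mp4", "new-clip-ff.mp4"]))
# → [('update', 'same-team-more-clients-ff.mp4', '1AbC'),
#    ('create', 'new-clip-ff.mp4', '<folder-id>')]
```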

Run it

Two-skill orchestration. Skill 1 does the first render and uploads. Skill 2 upgrades and replaces.

First pass

SESSION=measure-summit-2026
SRC="public/clips/portrait/podcast-clips/${SESSION}-captioned"
OUT="public/clips/portrait/podcast-clips/${SESSION}-portrait-ff"

python3 .claude/skills/portrait-foreignfilm-clips/scripts/transcribe.py          "$SRC"
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py  "$SRC" "$OUT"
bash    .claude/skills/portrait-foreignfilm-clips/scripts/upload_clips.sh        "$OUT" "${SESSION}-portrait-ff"

Iterate — better model, same links

python3 .claude/skills/caption-quality-boost/scripts/retranscribe.py             "$SRC" --model medium
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py  "$SRC" "$OUT"
bash    .claude/skills/caption-quality-boost/scripts/drive_replace.sh            "$OUT" "${SESSION}-portrait-ff"

Live replace (parallel)

Render in one shell; the watcher in another uploads each clip the instant its .done sentinel appears.

# shell 1 — renderer touches {out}.done after each clip
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py  "$SRC" "$OUT"

# shell 2 — watcher polls, replaces in place, writes .replaced
/bin/bash .claude/skills/caption-quality-boost/scripts/drive_replace_watch.sh    "$OUT" "${SESSION}-portrait-ff"
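The watcher's core is a sentinel scan. A sketch assuming the .done and .replaced markers sit next to each {slug}.mp4 as {slug}.mp4.done / {slug}.mp4.replaced (the exact naming is assumed from the comments above, not verified against the script):

```python
import pathlib

def ready_clips(out_dir):
    """Yield rendered clips (.done present) not yet replaced on Drive (.replaced absent)."""
    for done in sorted(pathlib.Path(out_dir).glob("*.mp4.done")):
        mp4 = done.with_suffix("")                            # foo.mp4.done -> foo.mp4
        if mp4.exists() and not done.with_suffix(".replaced").exists():
            yield mp4                                         # upload, then touch foo.mp4.replaced
```

Each poll cycle uploads whatever this yields, so Drive links go live clip by clip while the renderer is still working.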

Quality levers

lever                 from → to                   lift                                        cost
whisper model         small → medium              big win on proper nouns, jargon             ~3× CPU time
LLM polish pass       raw → Claude-cleaned        punctuation, split run-ons, fix names       ~1 API call / clip
chunk length          5w / 2.5s → 3w / 1.5s       tighter pacing, more beats                  config only
caption size          60pt → 72pt                 easier at thumb scale                       config only
face crop tightness   260² → 220²                 closer face, more emotion                   per-session tune
moment selection      manual → best-clips skill   finds higher-energy hooks you'd skim past   ~1 min / hr of source

Pitfalls

The big three are called out inline above: the iCloud staging trap (stage 1), session-specific face-box drift (stage 6), and gws --upload path scoping (stage 8).

Appendix — file layout

project/
├── source/                                         # long-form recordings (gitignored)
├── landscape-masters/                              # extracted 1662×1080 cuts
│   └── _transcripts/{slug}.json                    # whisper output
├── public/clips/portrait/podcast-clips/
│   ├── {session}-captioned/{slug}-captioned.mp4    # landscape + v1 caption
│   └── {session}-portrait-ff/{slug}-ff.mp4         # final 1080×1920
├── .claude/skills/
│   ├── portrait-foreignfilm-clips/                 # first render + upload
│   └── caption-quality-boost/                      # re-transcribe + replace
└── DOCUMENTATION/PORTRAIT-CAPTIONS-GUIDE.html      # this doc